Pairwise Comparison for Photo Evaluation
Although the pairwise comparison method is unfamiliar to most of us, because everyday life mostly offers us scale-based assessment, it is widely used in practice wherever simpler methods are too imprecise. For example, it is the favoured method for tournament sports such as chess, tennis, fencing and badminton. In market research, it is used to determine consumer preferences. In clinical studies, it is used to assess the effectiveness of treatments or medication. The method is also used in psychology to investigate preferences and decision-making behaviour, in product development to evaluate design prototypes and in web design to optimise the user experience.

Personally, I used the method during my time as a research chemist to analyse the effect of influencing factors on complex issues. It is a high-precision statistical method and is recognised as state of the art. I would like to explain how this method works using a simple example from photography.

Suppose we are presented with five images that we have to rank:

To make the task challenging, the examples come from different genres: flowers, people, landscape, sport, animals. They are all high-quality photographs, so technical quality is irrelevant. The only decisive factors are the special nature of the shot and the image's effect on the viewer, and different standards must be applied to each image.

The task is complex. A scale-based evaluation (for example, 1 to 5 points) would most likely result in almost every image receiving 5 points from every juror. This may flatter the respective photographers, but it leaves us without a usable ranking.

This is where the pairwise comparison comes in handy. Every image competes against every other image, so n images require n×(n-1)/2 comparisons. For our example, these are the following 5×(5-1)/2=10 combinations:
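As a minimal sketch, the ten combinations can also be listed in a few lines of Python. The image labels are only placeholders derived from the five genres named above; they are not the titles of the actual photographs.

```python
from itertools import combinations

# Placeholder labels, one per genre mentioned in the text (assumption).
images = ["flowers", "people", "landscape", "sport", "animals"]

# Every unordered pair of distinct images is one comparison.
pairs = list(combinations(images, 2))

n = len(images)
# The count must match the formula n*(n-1)/2.
assert len(pairs) == n * (n - 1) // 2

for left, right in pairs:
    print(f"{left} vs {right}")
```

The same two lines with `range(48)` instead of the label list would produce all pairings for a larger set.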

As an example, you could make the following 10 individual decisions (grey = discard, coloured = keep)

These decisions are entered into a matrix (0 = discard, 1 = keep) in the blue fields. The white fields are automatically calculated as the opposite. In the column on the far right, the sum of the evaluations appears, automatically resulting in a ranking:
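The matrix logic can be sketched in Python as follows. The ten decisions in `kept` are invented purely for illustration and do not reproduce the decisions shown in the picture; the names `kept`, `matrix`, `scores` and `ranking` are likewise my own assumptions.

```python
images = ["flowers", "people", "landscape", "sport", "animals"]

# Hypothetical decisions (invented for illustration): pair -> image that is kept.
kept = {
    ("flowers", "people"): "people",
    ("flowers", "landscape"): "landscape",
    ("flowers", "sport"): "flowers",
    ("flowers", "animals"): "animals",
    ("people", "landscape"): "people",
    ("people", "sport"): "people",
    ("people", "animals"): "animals",
    ("landscape", "sport"): "landscape",
    ("landscape", "animals"): "animals",
    ("sport", "animals"): "animals",
}

index = {name: i for i, name in enumerate(images)}
n = len(images)
matrix = [[None] * n for _ in range(n)]  # the diagonal stays empty

for (a, b), winner in kept.items():
    i, j = index[a], index[b]
    matrix[i][j] = 1 if winner == a else 0  # the entered decision
    matrix[j][i] = 1 - matrix[i][j]         # the opposite, filled in automatically

# Row sums correspond to the column on the far right of the matrix.
scores = {
    name: sum(v for v in matrix[index[name]] if v is not None)
    for name in images
}
ranking = sorted(images, key=lambda name: scores[name], reverse=True)

print(scores)
print(ranking)
```

With these made-up decisions every image ends up with a different row sum, so the ranking is unambiguous. With real decisions, ties between row sums are possible and would need a tie-breaking rule.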

You immediately recognise the great power of this method, but also the big problem: With many images, the matrix becomes huge and manually unmanageable. This is because n×(n-1)/2 comparisons are required for n elements. With 48 images, this means 48×47/2=1128 comparisons! Without suitable software, the analysis would be a tedious task. This is the reason why the pairwise comparison is not yet widely used in photo evaluation, although it is actually the more objective method. But fortunately, we now have suitable software tools at our disposal so that we can easily use this more effective method for our needs.

And now to the confusion that Laura noticed because the display of the interim results was inadvertently activated during the ongoing voting process. (Sorry, that shouldn't have happened. Nowhere in the world are interim results shown during an ongoing electoral process to avoid influencing subsequent voters.)

As Laura noted, she was confused about not having seen all the images. This is understandable, but it is not at all necessary to present all pairs of images to all jurors. It would overwhelm them all to have to judge such a large number of image pairs.

An African proverb says: ‘How do you eat an elephant? You cut it up and eat it piece by piece.’ In this sense, the total number of comparisons required can easily be divided among several judges in order to reduce the workload and achieve a solid and objective assessment. For example, the comparisons can be divided evenly between the judges using the round robin method. Alternatively, block allocation can be used. The ‘random assignment’ method I have chosen is the best, but also the most labour-intensive: the pairwise comparisons are randomly assigned to the jurors so that each juror evaluates a random subset of the total comparisons. This can also be done multiple times to ensure that each comparison is evaluated by more than one judge, which increases reliability.
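A rough sketch of one way to do a random assignment is to shuffle all pairs and then deal them out to the jurors like playing cards. This is only an illustration under my own assumptions: the function name `assign_pairs` and its parameters are hypothetical, it is not the algorithm of any particular voting tool, and it covers a single pass only (each pair goes to exactly one juror).

```python
import random
from itertools import combinations


def assign_pairs(n_images, n_jurors, seed=None):
    """Shuffle all image pairs and deal them out evenly to the jurors."""
    rng = random.Random(seed)
    pairs = list(combinations(range(n_images), 2))
    rng.shuffle(pairs)  # random order, so each juror gets a random subset

    workload = [[] for _ in range(n_jurors)]
    for k, pair in enumerate(pairs):
        workload[k % n_jurors].append(pair)  # deal pairs out in turn
    return workload


# Example: 48 images split among 24 jurors.
workload = assign_pairs(48, 24, seed=1)
print(len(workload), "jurors,", len(workload[0]), "pairs for the first juror")
```

With 48 images and 24 jurors, the 1128 pairs divide evenly, so each juror receives 47 pairs. To have each comparison judged more than once, the dealing could be repeated with a fresh shuffle, taking care that a pair does not land with the same juror twice.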

Without software support, all this used to be very laborious. You spent hours working on tables in which you were not allowed to make any mistakes. But thanks to suitable software, the method is now also very easy to use.

To summarise, pairwise comparison is more reliable and meaningful than the widespread scale-based voting. Nevertheless, due to the amount of work involved, it was previously not very popular and therefore not widely used in the non-professional sector. Thanks to the availability of suitable software tools, however, these obstacles no longer exist. As a result, non-professionals can now also use the method to their advantage.

St. Johann in Tyrol
June 10, 2024

Addendum: This method is also ideal if, for example, we want to select just 12 images from a large number for an annual photo calendar, or if we want to pick the one portrait to include with an application from 50 outstanding shots. Try it!


Don Sutherland said:

Very informative.
5 weeks ago

Bergfex replied to Don Sutherland:

Thank you!
5 weeks ago

raingirl said:

That's very well stated. Thank you.

I'm curious. When you don't know how many people will actually do the evaluation, doesn't that skew the results? If the image choices people see are truly random (and not knowing how many people will be involved), wouldn't it be possible that an image wouldn't get evaluated at all and thus be left out of the result?
5 weeks ago

Bergfex replied to raingirl:

It is true that the admin has to think about the expected number of jurors when preparing the voting. The number of images that are shown to the jurors depends on this. In the past, between 9 and 30 jurors have taken part. The average was 20 jurors.

In this case, I decided on a package size of 48 images. In order to get all 1128 pairings evaluated, which is necessary for a fully differentiated result, 24 jurors would be needed. If the number of jurors is lower, the result is not yet fully differentiated. If more jurors participate, the differentiation does not improve any further; the result only becomes more solid. However, we can already see now, with only 12 jurors, that the differentiation is much better than in all of the previous votes using the scale-based method. Above all, the common tendency to judge too positively was neutralised:

Click on the thumbnail for a detailed view.

On the question of whether a picture can be forgotten: this would be a software error that would need to be reported to PollUnit. However, as you can see in the evaluation, all 48 images are present and also rated. In this respect, everything is as it should be.

(The fact that there are only 42 categories so far is due to the fact that the voting with 12 participants is not completely differentiated. Some pictures are still in the same category. However, the differentiation will improve with each additional juror.)
4 weeks ago

Boarischa Krautmo said:

I'd like to say (if I may):

The mentioned sports do not follow pairwise comparison but a tournament model, i.e. there is not a one-vs-everyone comparison but a random comparison.

Scale-based evaluation can be done with an ascending row of "points" from one to "n": if you have 20 pics to judge, you have one "1 point", one "2 points" ... up to one "20 points". So you'll get a clear result as well.

In my opinion pairwise comparison is well suited for evolution-like decisions (is the successor better than the ancestor or not). Pairwise comparison does very effectively sort out "better" mutations and thus help to optimise procedures or components.

In my very humble opinion it is not very useful for producing a ranking.
5 weeks ago

Bergfex replied to Boarischa Krautmo:

Of course, a scale-based assessment could be designed in such a way that there are as many levels as there are pictures. However, practical experience shows that most people are already unable to rate objectively on a scale of 1 to 10.

In addition, there are cultural differences. Germans often rate very sceptically. English people, who strive for harmony, tend towards the centre, while Americans are often unwilling to award low scores. With hundreds of participants from all over the world, this would not be a problem. However, with just a few participants (9 to 30), there are major systematic distortions, as can be seen from the scale-based votes to date, which are shown in this evaluation:

Click on the thumbnail for a detailed view.

On the far right, you can see the current vote highlighted in yellow. On the left, you can see all previous votes on homepage pictures. You can immediately recognise the general reluctance to use the entire available scale range. All of the jurors rate too positively. As a result, the shape of the curve is relatively flat, which reduces the differentiation.

These irregularities can't occur in a pairwise comparison. That's what makes it so advantageous.
4 weeks ago

Bergfex said:

I have now optimised the introductory text at PollUnit:
Voting On Summer Pictures 2023/24
4 weeks ago

Bergfex said:

As of today, 15:30 CEST, the link is directly accessible (without the detour via YouTube):

Please vote on homepage images

As a result, the number of participants doubled within 3 hours and has now reached the desired number of 24.
4 weeks ago

uwschu said:

For me, a selection of the pictures would have been enough, and I would simply have picked some, quite soberly, just by looking at them.
I think I stopped after 10 pairs, when pictures I had already seen kept being compared with others. It reminded me of my wife and me in the paint shop, where we wanted to pick one shade of blue out of 30 different ones. Shall we take this one, or this one, or maybe that one after all :-).
But fine, a different approach for once, completely without emotion
4 weeks ago

Bergfex replied to uwschu:

Thank you for your feedback. This is also a kind of test run to gain experience, so every piece of feedback is important and valuable.

Apart from that, it is a very helpful procedure when, for example, you have to pick the best out of 30 similar shots from a photo series. I faced exactly this task 14 days ago with my portrait photography. That is the only reason I remembered the method again.
4 weeks ago

Bergfex said:

The first practical test has now been completed and analysed:

First Study on Pairwise Comparison
(Click on the thumbnail for detailed information.)

As a result of this first study, it can be concluded that the pairwise comparison provides a better differentiated result. Further tests should be carried out to verify this finding.
4 weeks ago