In the previous post, I diagnosed one data issue with the IMDB dataset found on Kaggle. On average, the third-party face-recognition software undercounted the number of people on movie posters by 50%.
It turns out that counting the number of people on movie posters is a subjective activity. Reasonable people can disagree about the number of heads on some of those posters.
For example, here is a larger view of the Better Luck Tomorrow poster I showed yesterday:
By my count, there are six people on this poster. But notice the row of photos below the red title: someone could argue that there are more than six people on this poster. (Regardless, the algorithm is still completely wrong in this case, as it counted only one head.)
So one of the "rules" that I followed when counting heads is to count only those people to whom the designer of the poster is drawing attention. Using this rule, I ignore the row of photos below the red title. Also by this rule, if a poster contains a main character and his or her shadow, I count the person only once. If the poster contains a number of people in the background, such as generic soldiers on a battlefield, I do not count them.
Another rule I used is to count the back or side of a person even if I could not see his or her face provided that this person is a main character of the movie. For example, the following Rocky Balboa poster has one person on it.
(By comparison, the algorithm counted zero heads.)
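To make the comparison between manual and algorithmic counts concrete, here is a minimal Python sketch that uses only the two posters discussed so far. The field names and the `undercount_rate` helper are hypothetical, chosen for illustration; the figure it prints describes these two posters only, not the larger sample behind the 50% average.

```python
# Manual vs. algorithm head counts for the two posters discussed above.
# Field names and the helper are hypothetical, for illustration only.
posters = [
    {"title": "Better Luck Tomorrow", "manual": 6, "algorithm": 1},
    {"title": "Rocky Balboa", "manual": 1, "algorithm": 0},
]


def undercount_rate(rows):
    """Share of manually counted heads that the algorithm missed."""
    manual_total = sum(r["manual"] for r in rows)
    algo_total = sum(r["algorithm"] for r in rows)
    if manual_total == 0:
        raise ValueError("no manually counted heads to compare against")
    return 1 - algo_total / manual_total


rate = undercount_rate(posters)
print(f"{rate:.0%} of manually counted heads were missed")
```

Note that the "manual" column already bakes in the counting rules described above, so a different counter applying different rules would get a different rate from the same posters.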
The distribution of the number of heads predicted by the algorithm suggested that some posters have dozens of people on them. So I pulled out these outliers and looked at them.
This poster of The Master (2012) is said to contain 31 people.
On a closer look, this is a tessellation of a triangle of faces. Should that count as three people or lots of people? As the color fades off on the sides of the poster, should we count those barely visible faces?
Counting is harder than it seems.
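Pulling out outliers for a closer look, as described above, can be sketched in a few lines of Python. This is only one way to do it: the post does not say which rule was used, so the sketch assumes Tukey's upper fence (third quartile plus 1.5 times the interquartile range). All counts below are made up for illustration except the 31 reported for The Master, and the `flag_outliers` helper is hypothetical.

```python
import statistics

# Hypothetical algorithm-predicted face counts.
# Only The Master's count of 31 comes from the post.
predicted = {
    "Poster A": 0, "Poster B": 1, "Poster C": 1, "Poster D": 2,
    "Poster E": 2, "Poster F": 3, "Poster G": 4, "The Master": 31,
}


def flag_outliers(counts, k=1.5):
    """Return titles whose count exceeds Q3 + k * IQR (Tukey's upper fence)."""
    q1, _, q3 = statistics.quantiles(counts.values(), n=4)
    upper_fence = q3 + k * (q3 - q1)
    return sorted(title for title, c in counts.items() if c > upper_fence)


to_review = flag_outliers(predicted)
print(to_review)
```

The flagged list is a prompt for looking at the posters by eye, not a verdict: whether 31 is wrong depends on how one decides to count a tessellation of faces.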
The discussion above leads to an important issue in building models. The analyst must have some working theory about how X is related to Y. If it is believed that the number of faces on the movie poster affects movie-goers' enthusiasm, then that belief guides us to count certain people but not others.
If one were to keep pushing on the rationale of using this face count data, one inevitably arrives at a dead end. Here are the top results from a Google Image Search on "The Master 2012 poster":
Well, every movie is supported by a variety of posters. The bigger the movie, the bigger the marketing budget, and the more numerous the posters. There are two key observations from the above:
The blue tessellation is one of the key designs used for this movie. Within this design framework, some posters contain only three heads, some maybe a dozen heads, and some (like the one shown on IMDB) many dozens of heads.
Further, there are at least three other design concepts, completely different from the IMDB poster, and showing different numbers of people!
Going back to the theory that movie-goers respond to the poster design (in particular, the number of people on the poster), the analyst now realizes that he or she has a huge hole in the dataset. Which of these posters did the movie-goer see? Did IMDB know which poster was seen most often?
Thus, not only are the counts subjective and imprecise, it is not even clear we are analyzing the right posters.
Once I led the students down this path, almost everyone decided to drop this variable from the dataset.