Pre-processing data is not just about correcting errors

In the last few posts about the IMDB dataset (1, 2, 3), I looked at two variables, the count of people in the movies' posters, and the plot keywords.

The people count variable is a modeled variable, which means that the output of one predictive algorithm is used as an input to another predictive model. We investigated the accuracy of the face recognition algorithm, and also learned that even for humans, counting faces is not as simple as you might think. Besides, we don't know who saw which poster.

The plot keywords variable looks rich on the surface, but the person who wrote the scraper didn't realize that each movie is associated with hundreds of keywords, of which only five or so are shown on the IMDB movie front page. This makes the scraped variable completely useless.

In this post, I investigate the title_year variable, presumed to be the year in which each movie arrived at theaters. (Even this definition lacks clarity but I shall not tackle this aspect here.)

A key question related to any time-series data is stability over time. With IMDB, we can investigate the trend of the IMDB score over time. The IMDB score is the average rating (maximum 10) given by visitors to the IMDB website.
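For readers who want to reproduce this kind of trend, here is a minimal sketch in Python. It assumes the data has been loaded as a list of dicts, one per movie, with a title_year key and a score key that I call imdb_score here; that key name, the function name, and the sample rows are assumptions for illustration, not the actual data.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_year(rows):
    """Return {year: average score}, skipping rows with a missing year or score."""
    scores = defaultdict(list)
    for row in rows:
        year, score = row.get("title_year"), row.get("imdb_score")
        if year is None or score is None:
            continue  # a missing year cannot be placed on the time axis
        scores[int(year)].append(float(score))
    # Sort by year so the result reads as a time series
    return {year: mean(vals) for year, vals in sorted(scores.items())}

# Made-up rows, for illustration only
sample = [
    {"title_year": 1975, "imdb_score": 8.0},
    {"title_year": 2010, "imdb_score": 7.0},
    {"title_year": 2010, "imdb_score": 5.0},
    {"title_year": None, "imdb_score": 6.1},
]
yearly = mean_score_by_year(sample)
```

Note that an average per year says nothing about how many movies stand behind each point, which matters a great deal for this dataset.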


The average film scored between 6.5 and 7.5 at all times, with a clear downward trend. One could naively conclude that the quality of movies, as reflected by IMDB rating, has declined over time.

But what separates the good from the not-so-good analysts is the quest for other explanations. Which is the most plausible explanation?

The trendline is only about the average film. What the chart above fails to show is information about the distribution around the average. The following chart contains some added details:


The first thing one notices is the extreme skew in the density of samples toward the years since the 1990s. There are few movies in the database from earlier times. This is confirmed by the histogram at the bottom of the chart, in which I annotated three key dates.

The IMDB website launched in 1990. It should be obvious that most users review current or recent movies, so it is not surprising that movies from before the 1990s collect reviews at a much lower clip. Secondly, IMDB got a boost in 1998 when it was purchased by the behemoth Amazon. Finally, the last year in the dataset shows a steep drop, most likely because of the lag between watching and reviewing movies.

The other feature of the scatter plot is the shape of the lower envelope (minimum). Clearly, older movies are rated only if they are good movies, while IMDB users are much more likely to review recent movies that they did not enjoy. This is not surprising if you think about the "causal structure" behind the data: the typical user likely reviews movies that he or she recently watched, the user is more likely to have watched a recently opened movie than an oldie, and if the user happens to watch an oldie, he or she is likely to pull out a well-reviewed title.
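Both the sample density and the lower envelope can be checked with a simple summary by decade. The sketch below is one way to do it, not the code behind the chart. It again assumes rows stored as dicts with title_year and a hypothetical imdb_score key, and the sample rows are made up.

```python
from collections import defaultdict

def envelope_by_decade(rows):
    """Return {decade: (number of titles, lowest score, highest score)}."""
    buckets = defaultdict(list)
    for row in rows:
        year, score = row.get("title_year"), row.get("imdb_score")
        if year is None or score is None:
            continue
        decade = int(year) // 10 * 10  # e.g. 1957 -> 1950
        buckets[decade].append(float(score))
    return {d: (len(v), min(v), max(v)) for d, v in sorted(buckets.items())}

# Made-up rows, for illustration only
sample = [
    {"title_year": 1954, "imdb_score": 8.1},
    {"title_year": 1957, "imdb_score": 7.6},
    {"title_year": 2004, "imdb_score": 2.9},
    {"title_year": 2006, "imdb_score": 7.4},
    {"title_year": 2009, "imdb_score": 8.3},
]
summary = envelope_by_decade(sample)
```

If the minimum climbs as you go back in time while the count shrinks, that is the signature of the selection effect described above.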

We have uncovered a bias in the IMDB average rating data: bad oldies are not in the database. (A smaller but still visible bias is the dip in the count of the most recent titles due to the reviewing lag.)


For most problems, it is safest to restrict your analysis to a period of time when the IMDB website is "mature". The analyst needs to figure out how this bias affects his or her analysis, if at all. For those keeping track, strike #3 is for those analysts who make statements about trends in this data without recognizing and dealing with the bias.
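As a rough illustration of restricting the analysis window, the sketch below filters rows to a range of years and compares the least-squares slope of score against year before and after filtering. The window bounds are placeholders that the analyst has to choose. The imdb_score key, the function names, and the sample rows are assumptions made for this example.

```python
def restrict_years(rows, first_year, last_year):
    """Keep rows whose title_year falls in [first_year, last_year]."""
    return [
        r for r in rows
        if r.get("title_year") is not None
        and first_year <= r["title_year"] <= last_year
    ]

def score_slope(rows):
    """Ordinary least-squares slope of imdb_score on title_year."""
    xs = [r["title_year"] for r in rows]
    ys = [r["imdb_score"] for r in rows]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct years to fit a slope")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up rows: only well-rated oldies, flat ratings in the recent window
sample = [
    {"title_year": 1950, "imdb_score": 8.0},
    {"title_year": 1960, "imdb_score": 8.0},
    {"title_year": 2000, "imdb_score": 6.0},
    {"title_year": 2005, "imdb_score": 6.0},
    {"title_year": 2010, "imdb_score": 6.0},
]
slope_all = score_slope(sample)
slope_recent = score_slope(restrict_years(sample, 2000, 2010))  # placeholder bounds
```

In this toy example the downward slope comes entirely from the handful of surviving oldies and disappears inside the restricted window. Whether the same happens in the real data is something to check, not assume.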

I should explain the title of the post. Each year, many students claim that they have a dataset with no missing values and no obvious typos, and therefore, they have nothing to do for the data pre-processing assignment. In this post, I did not discover any errors in the data – I uncovered bias in the data by deeply understanding how the data was collected.

Big Data, Plainly Spoken (aka Numbers Rule Your World)