Analysis of Films on Streaming Platforms

CIS545: Big Data Analytics — Spring 2021

Python, PySpark SQL 

Overview

The goal of this project is to study trends and relationships among different streaming services and the movies available on them. I worked on this project with two other classmates, and we produced regression models meant to help improve streaming services, specifically focusing on models that predict the user rating of a film from its overview or from numeric factors such as budget, revenue, and runtime. We used information about films available on Netflix, Hulu, Prime Video, and Disney+. Streaming platforms and movies interested us because streaming services have become the main way most people consume video content, especially during the pandemic. The popularity of these services is only going to keep increasing, so it's interesting to look at what films are available on them and the many relationships between those films.

Process

First, we cleaned the datasets by dropping unnecessary columns, exploding relevant columns, and getting dates into the correct format. One example of cleaning we performed was normalizing the rating scales across movie rating websites: Rotten Tomatoes gives ratings as a percentage, whereas IMDb gives them as a decimal score from 1-10. We rescaled Rotten Tomatoes' scores to follow IMDb's scale, so that they too fall on a 1-10 decimal scale. Additionally, we cleaned up the Age Restrictions column, which originally gave the age rating associated with each movie as a value like '16+'. We decided to convert this column into an 'MPA Rating' column, following the Motion Picture Association film rating system, by creating a dictionary that maps the values in the data frame column to actual MPA ratings. The mappings looked like this: 16+ --> R, all --> G, 13+ --> PG-13, 7+ --> PG, 18+ --> NC-17, 0 --> NR.
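A sketch of these two cleaning steps, using pandas with hypothetical column names and toy values (the actual dataset's column names may differ):

```python
import pandas as pd

# Toy stand-in for the raw data; "Rotten Tomatoes" and "Age" are
# hypothetical column names used for illustration.
df = pd.DataFrame({
    "Rotten Tomatoes": ["87%", "42%", None],
    "Age": ["16+", "all", "13+"],
})

# Normalize Rotten Tomatoes percentages (e.g. "87%") to IMDb's 1-10 decimal scale.
df["rt_score"] = df["Rotten Tomatoes"].str.rstrip("%").astype(float) / 10.0

# Map the numeric age restrictions to MPA ratings.
mpa_map = {"16+": "R", "all": "G", "13+": "PG-13",
           "7+": "PG", "18+": "NC-17", "0": "NR"}
df["MPA Rating"] = df["Age"].map(mpa_map)
```

Missing ratings (`None`) pass through as `NaN`, so no rows are dropped by the rescaling itself.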

Exploratory Data Analysis 

We then displayed visualizations of the trends and relationships among the movies regarding their released year, revenue, budget, and more. 


We explored the relationship between movies and their revenue, along with each movie's average vote score:

Most Paid Off Movies

We also explored the correlation between movie budget and revenue: 

Correlation on Budget and Revenue

This correlation graph shows that revenue usually increases when budget increases, and that most movies have a budget between 0 and 100 million USD and a revenue between 0 and 0.5 billion USD.
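The correlation itself is a one-liner in pandas; a minimal sketch with made-up budget and revenue figures:

```python
import pandas as pd

# Made-up budget/revenue figures (USD) for illustration only.
movies = pd.DataFrame({
    "budget":  [10e6, 50e6, 100e6, 200e6],
    "revenue": [30e6, 120e6, 400e6, 900e6],
})

# Pearson correlation between budget and revenue.
corr = movies["budget"].corr(movies["revenue"])
```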

 

We explored the most expensive and top grossing movies from 1916 - 2016. 

Top 20 most expensive movies from 1916 - 2016

The graph shows that the top 20 most expensive movies were all made after the year 2000, implying that companies have become willing to spend more money producing movies as the years go by. It also shows that a higher budget doesn't always lead to a high vote average.

 

On the other hand, we can also investigate the top grossing movies rather than the movies with the highest budget.

Top 20 grossing movies from 1916 - 2016

Highest grossing movies by year

The two graphs above show that the revenue of the most successful movie each year increases at a fairly steady rate, apart from a few outliers such as Titanic and Avatar. The highest grossing films usually have a high vote average, but there are some exceptions, such as Mission: Impossible II (5.9 vote average).

 

We had data on movie taglines, short lines that hint at a movie's plot or give the viewer an idea of what they will experience. From this data, we discovered the most common words used in movie taglines throughout the years.

 

The most common words in the tagline of movies made earlier than 1950: 

Common words in tagline of movies before 1950

 

The most common words in the tagline of movies made later than 2010: 

Common words in tagline of movies after 2010

 

The most common words in the tagline of movies made between 1916 - 2016: 

Common words in tagline of movies between 1916 - 2016

The word clouds show that the most common words of all time are "one", "love", "story", "never", and "world". While words like "woman" and "heart" appeared often in earlier movies, they do not appear much in recent ones. Instead, words like "family" and "life" appear more often in recent movies' taglines.
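Behind a word cloud is just a word-frequency count; a sketch with Python's `collections.Counter`, where the taglines and stopword list are made up for illustration:

```python
from collections import Counter
import re

# Made-up taglines standing in for the real dataset.
taglines = [
    "One man. One mission. One world.",
    "A love story for the ages.",
    "Never give up on family.",
]

# A tiny stopword list; a real pipeline would use a fuller one.
stopwords = {"a", "the", "for", "on", "of"}

# Lowercase, tokenize, drop stopwords, then count.
words = [
    w for t in taglines
    for w in re.findall(r"[a-z]+", t.lower())
    if w not in stopwords
]
common = Counter(words).most_common(3)
```

The resulting counts are exactly what a word-cloud library scales the font sizes by.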

Models: Linear Regression, Logistic Regression, k-Nearest Neighbors 

We used various models to predict the vote average of a movie from its features, such as the overview and other columns. The overview of a movie gives a short summary of it, and from these words we tried to extract how highly people rate the movie. This approach could be applied to the overviews of upcoming movies to predict how popular they may be. We compared how accurate the models are at finding this relationship and identified the best one. The models we tested were linear regression, logistic regression, k-Nearest Neighbors (kNN), and random forest. We used linear regression, logistic regression, and kNN to predict the vote average from the movie's overview, and we used linear regression again alongside a random forest model to predict the vote average from a combination of other features that are not text based.

First, we used a linear regression and a logistic regression model to predict vote_average from clean_overview. We evaluated these models using the root mean squared error (RMSE).

The standard error function minimized by linear regression is the mean squared error: the average of the squared differences between the estimated values and the expected values. Because we want these values to be very similar, we generally prefer a smaller mean squared error. However, a mean squared error too close to 0 could indicate possible overfitting.
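RMSE is simply the square root of the mean squared error; a minimal sketch:

```python
import math

def rmse(y_true, y_pred):
    # Square each residual, average them, then take the square root.
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Toy vote averages vs. predictions.
error = rmse([7.0, 6.0, 8.0], [6.5, 6.5, 7.0])
```

Because the residuals are squared before averaging, RMSE penalizes large misses more heavily than small ones.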

Linear regression is the process of fitting a linear function that "optimally" maps input features to continuous output values; our output values in this case are the vote_average, and linear regression minimizes the squared error. Logistic regression, on the other hand, provides a probabilistic prediction. It minimizes the log cost, which measures the error between our probabilistic estimate and the training label, and training involves finding weights that minimize this cost function through a process called gradient descent.
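A sketch of this setup using scikit-learn, with TF-IDF features over toy overviews. The feature extraction and, in particular, the way vote averages are bucketed into classes for logistic regression are assumptions here, not necessarily what the project did:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy stand-ins for clean_overview and vote_average.
overviews = [
    "a young wizard attends a school of magic",
    "a detective hunts a serial killer in the rain",
    "two friends take a road trip across the country",
    "a family struggles to survive an alien invasion",
]
votes = [7.8, 8.1, 6.5, 6.9]

# Turn each overview into a sparse TF-IDF feature vector.
X = TfidfVectorizer().fit_transform(overviews)

# Linear regression predicts the score directly...
lin = LinearRegression().fit(X, votes)

# ...while logistic regression needs discrete classes, so here we
# round the vote average to the nearest integer bucket (an assumption).
log = LogisticRegression(max_iter=1000).fit(X, [round(v) for v in votes])
```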

Linear Regression RMSE: 1.2085508910435525

Logistic Regression RMSE: 1.0789538517199955

The next model we implemented was k-Nearest Neighbors. kNN is a supervised learning algorithm that makes predictions from the labels of the k training examples closest to each input point. We used a kNN model to predict the vote average of movies from their overviews, again evaluating it with the root mean squared error.

Initially, we ran kNN with the default value of n_neighbors = 5, but the RMSE score was high. Since our dataset is so large, we expected a higher number of neighbors to give better performance. The difference in performance between 5 and 50 neighbors is numerically small; however, looking at the plotted data, the largest drop in RMSE is between 5 and 10 neighbors, so even a modest increase to 10 neighbors has a significant impact on performance. The graph also makes it easy to see that the best RMSE score is obtained at 30 neighbors, and that there is no real performance benefit to using 100 or more neighbors.
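The neighbor sweep can be sketched as follows, using scikit-learn on synthetic stand-in data (the real project predicted vote averages from overview text, so the features and RMSE values here are illustrative only):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic features and a noisy target standing in for the real data.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X.sum(axis=1) + rng.normal(0, 0.1, 200)

# Fit a kNN regressor for each candidate neighbor count and record RMSE.
rmse_by_k = {}
for k in [5, 10, 30, 50, 100]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    pred = knn.predict(X)
    rmse_by_k[k] = float(np.sqrt(np.mean((y - pred) ** 2)))
```

Plotting `rmse_by_k` then gives the RMSE-vs-neighbors curve described above.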

RMSE Value vs Number of Neighbors

Looking at the first three models we implemented, the RMSE decreases as model complexity increases, which is expected with complex text data such as ours. Linear regression does not often perform well with text classification, so we were not surprised to get an RMSE score as high as 1.241. Logistic regression and kNN are both more complex models, though still simple compared to neural networks or random forests. We see a significant decrease in RMSE from linear regression, but there is certainly still room for improvement: logistic regression gave an RMSE of 1.162, and kNN with 30 neighbors gave an RMSE of 1.094.

 

Text data is complex to work with and was not giving us results as good as we originally hoped, so we next explored predicting the user rating of films from multiple columns of numeric data. We created both a linear regression and a random forest model that use budget, revenue, runtime, vote_count, and release_year to predict the user rating of films on IMDb.

Models: Random Forest, Linear Regression (PySpark) 

Finally, we used a random forest model and a linear regression model (again) to predict a movie's vote average, this time from multiple numerical features rather than just the overview. These features are: budget, revenue, runtime, vote_count, and release_year. We evaluated these models using the root mean squared error and the r2 score.

Many different classifiers and regressors can be used when building predictive models. One strategy is an ensemble approach, in which multiple models are trained over different subsets of the input data; an ensemble of decision trees is known as a random forest. Combining models in this way can reduce bias and variance and can be more effective than a single model.
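The project used PySpark's ML library for this step; as a self-contained illustration of the same idea, here is an equivalent sketch in scikit-learn, with synthetic stand-ins for the five numeric features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-ins for budget, revenue, runtime, vote_count, release_year.
rng = np.random.default_rng(42)
X = rng.random((300, 5))
y = X @ np.array([0.5, 1.0, 0.2, 0.8, 0.1]) + rng.normal(0, 0.05, 300)

# A random forest is an ensemble of decision trees, each trained on a
# bootstrap sample of the data; predictions are averaged across trees.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rf.predict(X)

rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
r2 = r2_score(y, pred)
```

In the actual PySpark pipeline, the equivalent pieces would be `VectorAssembler` to collect the feature columns, `pyspark.ml.regression.RandomForestRegressor`, and `RegressionEvaluator` for RMSE and r2.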

The r2 score, read "r squared", is known as the coefficient of determination. It measures how closely a set of data fits a regression line, and represents the proportion of the variance in the labels that is predictable from the features. In this case, the higher the r2 score, the better the model, because the model fits the data more closely.

Linear Regression RMSE: 0.946049

Linear Regression r2: 0.127155

 

Random Forest Regression RMSE: 0.537397

Random Forest Regression r2: 0.537397

Comparing linear regression using the film overviews with linear regression using numerical data, we can see a vast improvement, from RMSE = 1.241 to RMSE = 0.946. Linear regression performs much better with numerical data than with text data, which we expected from our simplest model. We further wanted to explore how accurately we could predict user ratings using the other numerical data we had, so we also implemented a random forest regression model, which gave us the best performance of all our models so far: 0.5384.

Future Work 

The next step we would like to take with this project is further exploring MPA-rating predictions. We currently lose a lot of our data when joining the two data frames needed to predict the MPA rating; with more data on the MPA ratings of films, we would be able to build more accurate models and expand our work to predict not only user ratings but also the Motion Picture Association's ratings for age appropriateness. This information would allow us not only to compare streaming services based on the number of highest-rated movies they have, but also to compare them based on which age groups they are most geared towards. Moreover, given more time, we would like to implement more models using numeric data and create an implementation of random forest regression using text data for comparison.

©2022 by saranya.
