Creating Recommendation Systems Doesn’t Have To Be Complex

5 Simple ways to create movie recommendations using Python

Patrick Kalkman

Jan 21, 2021 — 6 min read

Creating (Movie) Recommendations Doesn’t Have To Be Complex

Big companies such as Netflix, Amazon, and Google use complex recommendation systems. This helps them to sell more products and to keep customers engaged.

In this article, I will show you that recommendation systems don’t have to be complicated. I will show this using Python, pandas, and functions from the scikit-learn and SciPy libraries.

I will use Python 3.8 with data from the MovieLens database. The source code is available on GitHub.

MovieLens database

To show recommendations, I will use the MovieLens data set from GroupLens Research. Instead of the full data set, I will use a subset. This subset contains 100,000 ratings and 3,600 tags applied to 9,000 movies by 600 users. GroupLens Research recommends using this subset for education and development.

GroupLens Research stores the data in four CSV files: ratings.csv, movies.csv, links.csv, and tags.csv.

If we use the pandas read_csv function to read ratings.csv and print the first 5 records, we get the following output.

I want to simplify the recommendation examples. So, I need to add the movie title to the data frame. We do this by using the map method, as you see in the source code below on row seven.
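As a minimal sketch of this step, the snippet below loads two tiny in-memory stand-ins for ratings.csv and movies.csv and maps the title onto each rating. The column names follow the MovieLens files; the rows are made up for illustration, and the variable names ratings_df, movies_df, and title_by_id are my own assumptions rather than the names in the original source.

```python
import io

import pandas as pd

# Tiny stand-ins for ratings.csv and movies.csv (illustrative rows only).
ratings_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,4.0,964982703\n"
    "1,3,4.0,964981247\n"
    "2,1,5.0,964982224\n"
)
movies_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)

ratings_df = pd.read_csv(ratings_csv)
movies_df = pd.read_csv(movies_csv)

# Build a movieId -> title lookup and map it onto every rating row.
title_by_id = movies_df.set_index("movieId")["title"]
ratings_df["title"] = ratings_df["movieId"].map(title_by_id)

print(ratings_df.head())
```

With the real files, you would pass the file paths to read_csv instead of the StringIO objects.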

If we print the data frame, we see that it now includes the title. All the data is in place. We are ready to create recommendations.

Non-personalized recommendations

The first three types of recommendations are non-personalized. Non-personalized means that we don’t use any of the user’s preferences or history.

1. Recommendations based on the number of times a movie is watched

The first way we can recommend a movie is to count the number of times a movie is watched. Here, I treat every rating in the data set as one viewing. You can implement this using the value_counts() method on the title column of the data frame.

A movie that is watched more is probably better. From this list, we can recommend movies to a user.

print(ratings_df['title'].value_counts().head(10))

Executing the code above on the MovieLens subset gives the following result.

2. Recommendations based on the ratings of viewers

For the next recommendations, we are going to use the ratings that are in the data frame. So, we group the records by movie title and calculate the average rating using the mean function.

We print the first 5 records.
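A small sketch of the grouping step, using a made-up data frame with placeholder titles instead of the MovieLens data. I sort the averages in descending order so the best-rated titles come first, which is an assumption about how the original list was ordered.

```python
import pandas as pd

# Made-up ratings with placeholder titles.
ratings_df = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie C"],
    "rating": [4.0, 3.0, 5.0, 5.0, 2.0],
})

# Average rating per title, highest first.
average_rating_df = (
    ratings_df.groupby("title")[["rating"]]
    .mean()
    .sort_values(by="rating", ascending=False)
)

print(average_rating_df.head())
```

Note that Movie B ends up on top with a single rating, which is exactly the effect discussed next.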

This gives the following result.

Now, something interesting is going on. You would expect that the most-watched movies from the first recommendations would be at the top of the list, but other movies appear there instead.

I suspect that these movies have one single five-star rating. Let’s see if that is true by executing the following commands.
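One way to check this, sketched on made-up data with placeholder titles: take the titles at the top of the average-rating list and look up how many ratings each one has.

```python
import pandas as pd

# Made-up ratings: Movie B and Movie C each have one five-star rating.
ratings_df = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie C"],
    "rating": [4.0, 3.0, 5.0, 5.0, 5.0],
})

average_rating = (
    ratings_df.groupby("title")["rating"].mean().sort_values(ascending=False)
)
top_titles = average_rating.head(2).index

# Number of ratings for each of the top-rated titles.
rating_counts = ratings_df["title"].value_counts()
print(rating_counts.loc[top_titles])
```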

This prints one for each line, so our initial suspicion was correct: these movies have a single five-star rating.

We can fix this by only including movies that users have seen over a hundred times using the following code.

First, we count the number of times users watched a movie using the value_counts() method. Then we remove the movies that were watched one hundred times or fewer. In row three, we create a new data set with the ratings of the movies watched over 100 times.

We can then use the same method as before: group the ratings by movie title, calculate the average rating, and sort the result.

This gives the following result. As you can see, it now includes movies that were also in our top 10.
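The filtering and regrouping could look roughly like the sketch below. It uses made-up data and a threshold of 2 so the effect is visible on a handful of rows; the article uses 100. The names MIN_WATCH_COUNT, popular_titles, and popular_ratings_df are hypothetical.

```python
import pandas as pd

MIN_WATCH_COUNT = 2  # the article uses 100; lowered for this toy data

ratings_df = pd.DataFrame({
    "title": ["Movie A"] * 3 + ["Movie B"] + ["Movie C"] + ["Movie D"] * 3,
    "rating": [4.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0],
})

# Count how often each movie appears and keep the frequently watched ones.
movie_counts = ratings_df["title"].value_counts()
popular_titles = movie_counts[movie_counts > MIN_WATCH_COUNT].index
popular_ratings_df = ratings_df[ratings_df["title"].isin(popular_titles)]

# Same approach as before: group, average, sort.
popular_average_df = (
    popular_ratings_df.groupby("title")[["rating"]]
    .mean()
    .sort_values(by="rating", ascending=False)
)

print(popular_average_df)
```

The single-rating movies B and C drop out, and only movies with enough ratings remain in the ranking.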

3. Recommendations based on movies that are often seen together

The last non-personalized recommendation is to suggest a movie by finding the movies that are most seen together.

To create these kinds of recommendations, we first have to create a function that creates all permutations of any two movies inside our dataset. We create the permutations using the permutations function from itertools.
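Here is a minimal sketch of such a function on made-up data. I assume the pairs are built per user, so that only movies watched by the same person are combined; the function name create_pairs and the column names movie_a and movie_b are assumptions for illustration.

```python
from itertools import permutations

import pandas as pd


def create_pairs(titles):
    """Return every ordered pair of distinct movies from one user's list."""
    return pd.DataFrame(
        list(permutations(titles, 2)), columns=["movie_a", "movie_b"]
    )


ratings_df = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "title": ["Movie A", "Movie B", "Movie C", "Movie A", "Movie B"],
})

# Build the pairs per user and stack them into one data frame.
movie_combinations = pd.concat(
    [create_pairs(group["title"]) for _, group in ratings_df.groupby("userId")],
    ignore_index=True,
)

print(movie_combinations)
```

Permutations keep both (Movie A, Movie B) and (Movie B, Movie A), which makes it possible to look up recommendations starting from either movie.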

This gives the following result.

We then group the rows by each pair of movies and count the number of times that someone watched both movies. We create a new data frame from the result and sort it in descending order based on the count.

Now that we have the data frame combination_counts_df, we can use it to get recommendations by filtering on a movie that the user watched. For example, if someone watched The Godfather, we can get the recommendations like this.

This gives the following result, in which we can recommend “Star Wars” and “Forrest Gump” to people who watched “The Godfather”.
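The counting step might look like the following sketch. The pairs are made up, and the column name count is an assumption.

```python
import pandas as pd

# Made-up movie pairs, as produced by the pairing step.
movie_combinations = pd.DataFrame(
    [
        ("Movie A", "Movie B"), ("Movie B", "Movie A"),
        ("Movie A", "Movie C"), ("Movie C", "Movie A"),
        ("Movie B", "Movie C"), ("Movie C", "Movie B"),
        ("Movie A", "Movie B"), ("Movie B", "Movie A"),
    ],
    columns=["movie_a", "movie_b"],
)

# Count how often each pair occurs, then sort with the largest count first.
combination_counts_df = (
    movie_combinations.groupby(["movie_a", "movie_b"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)

print(combination_counts_df)
```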
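A sketch of that lookup, using placeholder titles and made-up counts rather than the real MovieLens numbers. With the real data you would filter on the title exactly as it is stored in movies.csv.

```python
import pandas as pd

# Made-up pair counts with placeholder titles.
combination_counts_df = pd.DataFrame({
    "movie_a": ["Movie A", "Movie A", "Movie B", "Movie A"],
    "movie_b": ["Movie B", "Movie C", "Movie C", "Movie D"],
    "count": [40, 55, 10, 25],
})

watched = "Movie A"

# Keep only the pairs that start with the watched movie, best match first.
recommendations_df = (
    combination_counts_df[combination_counts_df["movie_a"] == watched]
    .sort_values("count", ascending=False)
)

print(recommendations_df["movie_b"].head(2).tolist())
```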

Content-Based Recommendations

Content-based recommendations are recommendations that you create based on the attributes of the movies. We can use any attribute. For our first recommendations, we use movie genres. We recommend the movies that share the most genres with a given movie. For the second type of recommendation, we will use the synopsis of the movie.

4. Recommendations based on movie genres

If we read the CSV with movies and print the header, we see it has a column that contains all the genres of the movie. A | character separates each genre. To select movies with similar genres, we have to split the genres and create a column for each genre. We convert the genre data to columns so that they become vectors.

We do this by using the pandas crosstab function on row eight.

The print on row four gives the following result.

We convert it to a column for each genre; see below. I printed only three of the 20 genre columns.
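The conversion can be sketched as follows, on a made-up movies data frame. Splitting on the | character and using explode to get one row per movie and genre is my assumption about one way to prepare the data for crosstab; the original source may do this differently.

```python
import pandas as pd

# Made-up movies with pipe-separated genres, as in movies.csv.
movies_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Crime", "Crime|Drama", "Comedy"],
})

# One row per (title, genre) pair.
movie_genre_df = movies_df.assign(
    genre=movies_df["genres"].str.split("|")
).explode("genre")

# One column per genre, with 1 where the movie has that genre.
movie_cross_table = pd.crosstab(movie_genre_df["title"], movie_genre_df["genre"])

print(movie_cross_table)
```

Each row of movie_cross_table is now a vector of zeros and ones that describes the genres of a single movie.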

Now that we have converted the genres, we need a method to detect the similarity between movies. For calculating the similarity, we use the Jaccard similarity.
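To make the measure concrete: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The small function below is a hand-written illustration with made-up genre sets, not the library routine used in this article.

```python
def jaccard_similarity(genres_a, genres_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(genres_a), set(genres_b)
    if not a and not b:
        return 0.0  # assumption: two empty sets count as not similar
    return len(a & b) / len(a | b)


# Two shared genres out of three distinct genres in total.
similarity = jaccard_similarity(
    {"Crime", "Drama"}, {"Crime", "Drama", "Thriller"}
)
print(round(similarity, 3))
```

A value of 1 means the two movies have exactly the same genres, and 0 means they share none.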

Luckily for us, the SciPy package includes pdist, which can calculate the Jaccard distance between every pair of rows. Subtracting that distance from one gives the Jaccard similarity. We use the movie_cross_table as the base. We then wrap the resulting array into a new data frame on line five.

We use this new data frame to look up recommendations based upon another movie. For example, on line eight, we search for all the movies that are similar to The Godfather and sort the result in descending order.
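A sketch of these steps on a made-up cross table. pdist returns Jaccard distances in condensed form, so I use squareform to expand them into a square matrix and subtract from one to get similarities. The variable names are assumptions.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Made-up genre cross table: one row per movie, one column per genre.
movie_cross_table = pd.DataFrame(
    [[1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 0]],
    index=["Movie A", "Movie B", "Movie C"],
    columns=["Action", "Comedy", "Crime", "Drama"],
)

# Pairwise Jaccard distances on boolean vectors, expanded to a square matrix.
jaccard_distances = pdist(movie_cross_table.values.astype(bool), metric="jaccard")
jaccard_similarity_array = 1 - squareform(jaccard_distances)

jaccard_similarity_df = pd.DataFrame(
    jaccard_similarity_array,
    index=movie_cross_table.index,
    columns=movie_cross_table.index,
)

# Movies most similar to Movie A, best match first (Movie A itself scores 1).
similar_movies = jaccard_similarity_df.loc["Movie A"].sort_values(ascending=False)
print(similar_movies)
```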

This gives the following result. The top movies on this list are the most similar.

5. Recommendations based on the synopsis of the movie

If you have a product without structured data, it is still possible to generate recommendations. For example, by using the description of a product. I will show this by using the synopsis. As we did with genres, we first have to convert the synopsis to a vector.

We use TfidfVectorizer from scikit-learn for this. Term frequency-inverse document frequency (TF-IDF) divides the number of times a word occurs in a document by a measure of what proportion of all the documents a word occurs in. This has the effect of reducing the value of common words while increasing the weight of words that do not occur in many documents.

Below we read the CSV with movies and synopsis and use the TfidfVectorizer. When constructing the vectorizer, we set min_df to 3 and max_df to 0.7. This means that a word has to occur in at least 3 documents to become a feature. The 0.7 excludes words that occur in over 70% of the documents.

This will give the following output. Each word that the tokenizer returns is transformed into a column. The value calculated is the TF-IDF score.

With this dataset, we can again calculate the similarity between items. Before, we used the Jaccard similarity, but this dataset doesn’t have boolean values as the genres did.
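The sketch below shows the vectorizer on four made-up synopses. Because the toy corpus is so small, I set min_df to 2 instead of 3; max_df stays at 0.7. The titles, the texts, and the variable names are all hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up synopses for four placeholder movies.
synopsis_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "synopsis": [
        "story of a toy cowboy who meets a toy spaceman",
        "story of a toy cowboy who rescues his friend",
        "story of a detective who hunts a killer",
        "story of a detective who meets a killer",
    ],
})

# min_df=2: keep words found in at least 2 documents (the article uses 3).
# max_df=0.7: drop words found in more than 70% of the documents.
vectorizer = TfidfVectorizer(min_df=2, max_df=0.7)
tfidf_matrix = vectorizer.fit_transform(synopsis_df["synopsis"])

# Column names in the order of the matrix columns.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(), index=synopsis_df["title"], columns=feature_names
)
print(tfidf_df.round(2))
```

Words such as "story" and "who" occur in every synopsis and are removed by max_df, while words that occur in only one synopsis are removed by min_df.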

So, we use a different, better-suited measure: the cosine similarity. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space.

Thankfully, scikit-learn has a cosine_similarity function that computes the similarity between all rows when we call it on the data frame.

With the result of the cosine_similarity, we create a new data frame and search for the movies nearest to Toy Story.
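A minimal sketch of this lookup, using a small made-up TF-IDF table with placeholder titles instead of the real synopsis data. The variable names are assumptions.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up TF-IDF vectors: one row per movie, one column per word.
tfidf_df = pd.DataFrame(
    [[0.8, 0.6, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]],
    index=["Movie A", "Movie B", "Movie C"],
    columns=["toy", "cowboy", "detective"],
)

# Similarity between every pair of rows, wrapped in a labeled data frame.
cosine_similarity_array = cosine_similarity(tfidf_df.values)
cosine_similarity_df = pd.DataFrame(
    cosine_similarity_array, index=tfidf_df.index, columns=tfidf_df.index
)

# Movies nearest to Movie A, leaving out Movie A itself.
nearest_movies = (
    cosine_similarity_df.loc["Movie A"]
    .drop("Movie A")
    .sort_values(ascending=False)
)
print(nearest_movies)
```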

This gives the following result. You see that Toy Story 2 is the most recommended movie based on the synopsis.

Conclusion

What do you think, a useful article? I showed five different ways to generate movie recommendations. Although I used movies, there is nothing specific about them. You could also use these methods with books, music albums, or any other product.

We covered the following five recommendations in two categories: non-personalized recommendations and content-based recommendations.

Recommendations based on the number of times a movie is watched
Recommendations based on the ratings of viewers
Recommendations based on movies that are often seen together
Recommendations based on movie genres
Recommendations based on the synopsis of the movie

The complete source code for the five recommendations is available on GitHub.

Thank you for reading.