Building a Recommender System with Collaborative Filtering

Are you tired of endlessly scrolling through Netflix’s library trying to find something worthwhile? Do you ever wish someone would suggest the perfect movie or TV show for you based on your viewing history? Well, you’re in luck because today we’re going to talk about building a recommender system with collaborative filtering!

Collaborative filtering is a recommendation technique that uses the recorded behavior of many users, such as their ratings or viewing history, to generate recommendations for each individual user. In this article, we’ll walk through the steps of building a recommender system using collaborative filtering from scratch.

What is Collaborative Filtering?

Collaborative filtering is a family of recommendation techniques that work by finding similarities between users (or between items) based on past behavior. Those similarities are then used to recommend items that a user is likely to be interested in.

Unlike other types of recommendation systems that rely on explicitly specified user preferences or item attributes, collaborative filtering uses information about the past behavior of users to generate recommendations. It is a popular approach for building recommender systems as it is relatively easy to implement and tends to provide good results.

There are two main types of collaborative filtering algorithms: user-based and item-based. User-based algorithms find similarities between users based on their interactions with items, while item-based algorithms find similarities between items based on their interactions with users.

Building a Recommender System

In this section, we’ll walk through the steps of building a recommender system using collaborative filtering. We will use the MovieLens dataset as our data source, which contains ratings for movies by users.

Step 1: Load the Data

The first step is to load the data into our program. We’ll be using the MovieLens dataset, which is published by GroupLens. Its ratings.csv file contains one row per rating: the user ID, the movie ID, the rating, and a timestamp.

import pandas as pd

# Load the data into a pandas dataframe
data = pd.read_csv('ratings.csv')

# Print the first 5 rows
print(data.head())

The output should look something like this:

   userId  movieId  rating   timestamp
0       1        1     4.0   964982703
1       1        3     4.0  1112486027
2       1        6     4.0   964981208
3       1       47     5.0   964983815
4       1       50     5.0   964982931
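The timestamp column stores Unix epoch seconds, which are hard to read in raw form. The following is a minimal sketch of converting them to datetimes; the two rows are copied from the sample output above, and the column name rated_at is an assumption made for illustration.

```python
import pandas as pd

# Two rows copied from the sample output above; the real file is far larger.
sample = pd.DataFrame({
    "userId": [1, 1],
    "movieId": [1, 47],
    "rating": [4.0, 5.0],
    "timestamp": [964982703, 964983815],
})

# Interpret the integers as seconds since 1970-01-01 (UTC).
# "rated_at" is a hypothetical column name chosen for this example.
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s")
print(sample[["movieId", "rating", "rated_at"]])
```

The same call works on the full data frame, which makes it easier to check over which period the ratings were collected.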

Step 2: Exploratory Data Analysis

The next step is to perform exploratory data analysis (EDA) on the data. EDA helps us to understand the structure of the data and identify any data quality issues. It is important to perform EDA before building the recommender system as it can inform our decisions about data preprocessing and model selection.

# Get some basic stats about the data
print(data.describe())

# Check for missing values
print(data.isnull().sum())

This prints summary statistics for each column, followed by the number of missing values per column. The describe() output looks like this:

             userId       movieId        rating     timestamp
count  100836.00000  100836.00000  100836.00000  1.008360e+05
mean      326.12756   19435.29572       3.50156  1.205946e+09
std       182.61849   35530.98720       1.04253  2.162610e+08
min         1.00000       1.00000       0.50000  8.281246e+08
25%       177.00000    1199.00000       3.00000  1.019124e+09
50%       325.00000    2991.00000       3.50000  1.186087e+09
75%       477.00000    8092.00000       4.00000  1.435994e+09
max       610.00000  193609.00000       5.00000  1.537799e+09

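The count row shows 100,836 values in every column, and isnull().sum() reports no missing values, which is good news. The statistics also tell us that ratings range from 0.5 to 5.0 with a mean of about 3.5. Note that describe() does not report the number of unique users: the maximum userId of 610 is only a hint, and data['userId'].nunique() gives the exact count.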
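Two numbers that describe() does not show, but that matter for collaborative filtering, are the number of distinct users and movies and how densely the ratings fill the user-movie grid. Here is a small sketch on a hypothetical toy table with the same column names as ratings.csv; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy ratings with the same columns as ratings.csv.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

n_users = toy["userId"].nunique()
n_movies = toy["movieId"].nunique()

# Density: the share of all possible user-movie pairs that have a rating.
density = len(toy) / (n_users * n_movies)
print(n_users, n_movies, round(density, 3))

# How often each rating value occurs.
print(toy["rating"].value_counts().sort_index())
```

On real rating data the density is usually very low, which is why missing values need careful handling in the later steps.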

Step 3: Data Preprocessing

The next step is to preprocess the data for use in the recommendation system. In particular, we need to transform our data into a format that makes it easy to model user-item interactions.

# Use pivot table to transform the data into a user-item matrix
ratings_matrix = data.pivot_table(index=['userId'], columns=['movieId'], values='rating')

This will pivot our data into a user-item matrix where each row represents a user and each column represents a movie. Each value is the rating the user gave the movie. Because most users rate only a small fraction of the movies, most cells in the matrix are NaN (missing).
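To see what the pivot produces, it helps to run it on a table small enough to read. The sketch below uses a hypothetical five-row toy table (invented values, same column names as ratings.csv) and counts the missing cells. Note that pivot_table averages duplicate user-movie pairs by default, which is harmless when each user rates a movie at most once.

```python
import pandas as pd

# Hypothetical toy ratings with the same columns as ratings.csv.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# Rows become users, columns become movies, cells hold the rating.
toy_matrix = toy.pivot_table(index="userId", columns="movieId", values="rating")
print(toy_matrix)

# Cells with no rating are NaN; count them.
n_missing = int(toy_matrix.isna().sum().sum())
print(n_missing)
```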

Step 4: User-Based Collaborative Filtering

Now that we have our data in the correct format, we can begin building our collaborative filtering model. We’ll start with a user-based collaborative filtering algorithm.

from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity between users.
# cosine_similarity rejects NaN, so unrated movies are filled with 0 first.
user_sim = cosine_similarity(ratings_matrix.fillna(0))

# Convert to pandas dataframe
user_sim_df = pd.DataFrame(user_sim, index=ratings_matrix.index, columns=ratings_matrix.index)

This will compute the cosine similarity between users based on their ratings of movies. We can then convert the similarity matrix to a pandas dataframe for ease of use.
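For reference, cosine similarity is the dot product of two rating vectors divided by the product of their lengths. The following sketch computes it by hand with NumPy for two hypothetical users, assuming unrated movies are encoded as 0; the vectors and the helper name cosine are made up for illustration.

```python
import numpy as np

# Ratings of two hypothetical users over the same four movies (0 = not rated).
a = np.array([5.0, 3.0, 0.0, 1.0])
b = np.array([4.0, 0.0, 0.0, 1.0])

def cosine(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(a, b), 4))
print(round(cosine(a, a), 4))
```

One consequence of the zero encoding is that an unrated movie is treated like a rating of 0, so users who have few movies in common tend to look dissimilar even when they agree on the movies they share.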

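# Get recommendations for user with id 1
user_id = 1
# Rank all other users by their similarity to the target user
similar_users = user_sim_df[user_id].drop(user_id).sort_values(ascending=False)
# Movies the target user has not rated yet
unseen = ratings_matrix.columns[ratings_matrix.loc[user_id].isna()]
# Average the ratings of the 20 most similar users (20 is an arbitrary choice)
neighbor_ratings = ratings_matrix.loc[similar_users.index[:20], unseen]
recommendations = neighbor_ratings.mean().sort_values(ascending=False)
print(recommendations[:10])

This will generate recommendations for the user with ID 1. We take the user’s similarity to every other user, drop the user’s own entry, and sort in descending order. We then average the ratings that the most similar users gave to movies the target user has not rated, and suggest the ten with the highest average. Because a movie rated by a single neighbor can top the list, requiring a minimum number of ratings per movie is a sensible refinement.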
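A common variation is to weight each neighbor’s rating by how similar that neighbor is to the target user, rather than taking a plain average. The sketch below shows one way to do this on a hypothetical four-user matrix. The function name recommend_user_based, the parameters k and n, and the toy ratings are all assumptions made for illustration, and the code assumes every user has rated at least one movie.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-user, 4-movie ratings matrix; NaN means "not rated".
toy = pd.DataFrame(
    [[5.0, 4.0, np.nan, np.nan],
     [5.0, 4.0, 2.0, 5.0],
     [1.0, 2.0, 5.0, 1.0],
     [4.0, np.nan, 1.0, 4.0]],
    index=pd.Index([1, 2, 3, 4], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)

def recommend_user_based(matrix, user_id, k=2, n=10):
    """Score unseen movies by a similarity-weighted mean of the k nearest users' ratings."""
    # Cosine similarity between users, with unrated movies encoded as 0.
    filled = matrix.fillna(0).to_numpy()
    norms = np.linalg.norm(filled, axis=1)
    sim = pd.DataFrame(filled @ filled.T / np.outer(norms, norms),
                       index=matrix.index, columns=matrix.index)

    # The k users most similar to the target, excluding the target itself.
    neighbors = sim[user_id].drop(user_id).nlargest(k)
    unseen = matrix.columns[matrix.loc[user_id].isna()]
    ratings = matrix.loc[neighbors.index, unseen]

    # Weight each rating by similarity; neighbors who did not rate a movie are skipped.
    weighted = ratings.mul(neighbors, axis=0).sum()
    weight_total = ratings.notna().astype(float).mul(neighbors, axis=0).sum()
    return (weighted / weight_total).dropna().sort_values(ascending=False).head(n)

result = recommend_user_based(toy, user_id=1)
print(result)
```

In this toy matrix, user 1 has not rated movies 30 and 40. The two nearest neighbors both rate movie 40 higher than movie 30, so movie 40 comes out on top.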

Step 5: Item-Based Collaborative Filtering

Item-based collaborative filtering is similar to user-based collaborative filtering, except that we compute similarity between items instead of users.

# Compute item-item similarity matrix (unrated entries are filled with 0, since NaN is rejected)
item_sim = cosine_similarity(ratings_matrix.fillna(0).T)

# Convert to pandas dataframe
item_sim_df = pd.DataFrame(item_sim, index=ratings_matrix.columns, columns=ratings_matrix.columns)

This will compute the cosine similarity between items based on their ratings by users. We can then convert the similarity matrix to a pandas dataframe for ease of use.

# Get recommendations for user with id 1
user_id = 1
# Ratings the user has actually given (unrated movies are dropped)
user_ratings = ratings_matrix.loc[user_id].dropna()
# Similarity of every movie to each movie the user has rated
sims = item_sim_df[user_ratings.index]
# Score each movie by the similarity-weighted average of the user's ratings
scores = sims.dot(user_ratings) / sims.sum(axis=1)
# Remove movies the user has already rated and sort the rest
recommendations = scores.drop(user_ratings.index).sort_values(ascending=False)
print(recommendations[:10])

This will generate recommendations for the user with ID 1 using item-based collaborative filtering. We first retrieve the movies the user has rated and look up the similarity between each of them and every movie in the catalog. Each candidate movie is then scored by the similarity-weighted average of the user’s own ratings, so movies that resemble the user’s favorites score highest. Finally, we drop the movies the user has already rated and suggest the top ten.
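The item-based idea can be checked by hand on a small matrix. The sketch below scores each unrated movie by the similarity-weighted average of the target user’s own ratings. The function name recommend_item_based, the parameter n, and the toy ratings are assumptions made for illustration, and the code assumes every movie has at least one rating.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-user, 4-movie ratings matrix; NaN means "not rated".
toy = pd.DataFrame(
    [[5.0, 4.0, np.nan, np.nan],
     [5.0, 4.0, 2.0, 5.0],
     [1.0, 2.0, 5.0, 1.0],
     [4.0, np.nan, 1.0, 4.0]],
    index=pd.Index([1, 2, 3, 4], name="userId"),
    columns=pd.Index([10, 20, 30, 40], name="movieId"),
)

def recommend_item_based(matrix, user_id, n=10):
    """Score unrated movies by a similarity-weighted mean of the user's own ratings."""
    # Cosine similarity between movie columns, with unrated entries encoded as 0.
    filled = matrix.fillna(0).to_numpy()
    norms = np.linalg.norm(filled, axis=0)
    sim = pd.DataFrame(filled.T @ filled / np.outer(norms, norms),
                       index=matrix.columns, columns=matrix.columns)

    # Rows: movies the user has not rated. Columns: movies the user has rated.
    rated = matrix.loc[user_id].dropna()
    candidates = sim.index.difference(rated.index)
    sims_to_rated = sim.loc[candidates, rated.index]

    # Weighted average of the user's ratings, using item similarities as weights.
    scores = sims_to_rated.dot(rated) / sims_to_rated.sum(axis=1)
    return scores.dropna().sort_values(ascending=False).head(n)

result = recommend_item_based(toy, user_id=1)
print(result)
```

Because each score is a weighted average of the user’s own ratings, it always falls between that user’s lowest and highest rating. For a user with only a few ratings the scores therefore cluster closely together, which is a known limitation of this simple scheme.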

Conclusion

In this article, we walked through the steps of building a recommender system using collaborative filtering. We covered both user-based and item-based collaborative filtering and provided code snippets for each step.

There are also many other types of recommendation algorithms that we can explore, such as content-based filtering and hybrid models. Each algorithm has its strengths and weaknesses, and the choice of algorithm ultimately depends on the specific use case.
