
Recommender Systems

  • 2015-09-16 00:00:00


In this article by Suresh K Gorakala and Michele Usuelli, authors of the book Building a Recommendation System with R, we will learn how to prepare relevant data by covering the following topics:

  • Selecting the most relevant data
  • Exploring the most relevant data
  • Normalizing the data
  • Binarizing the data


Data preparation

Here, we show how to prepare the data to be used in recommender models. These are the steps:

  1. Select the relevant data.
  2. Normalize the data.

Selecting the most relevant data

On exploring the data, you will notice that the table contains:

  • Movies that have been viewed only a few times; their rating might be biased because of lack of data
  • Users that rated only a few movies; their rating might be biased

We need to determine the minimum number of users per movie and vice versa. The correct solution comes from an iteration of the entire process of preparing the data, building a recommendation model, and validating it. Since we are implementing the model for the first time, we can use a rule of thumb. After having built the models, we can come back and modify the data preparation.

We define ratings_movies containing the matrix that we will use. It takes the following into account:

  • Users who have rated more than 50 movies
  • Movies that have been rated more than 100 times

The following code shows this:

library(recommenderlab)
data(MovieLense)
ratings_movies <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
ratings_movies
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.

ratings_movies contains about 60 percent of the users and a fifth of the movies in MovieLense (560 of 943 users and 332 of 1,664 movies).
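
The same filtering idea can be sketched in base R on a toy matrix, without recommenderlab. This is only an illustration: the matrix, the thresholds, and the variable names are made up, and rowSums(!is.na(...)) and colSums(!is.na(...)) stand in for rowCounts and colCounts.

```r
# Toy ratings: rows are users, columns are movies, NA means "not rated".
ratings <- matrix(
  c( 5,  3, NA,  1,
     4, NA, NA,  1,
    NA, NA, NA,  2,
     5,  4,  2, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

# Base-R analogues of rowCounts() and colCounts():
# number of ratings per user and per movie.
ratings_per_user  <- rowSums(!is.na(ratings))
ratings_per_movie <- colSums(!is.na(ratings))

# Hypothetical thresholds, scaled down for the toy data.
min_ratings_per_user  <- 1
min_ratings_per_movie <- 1

# Keep users and movies with more ratings than the thresholds.
ratings_filtered <- ratings[ratings_per_user > min_ratings_per_user,
                            ratings_per_movie > min_ratings_per_movie,
                            drop = FALSE]
```

As in the article's code, both counts are computed on the original matrix, so a user who passes the filter may end up with fewer ratings once the sparse movies are dropped.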

Exploring the most relevant data

Let's visualize the top 2 percent of users and movies of the new matrix:

# visualize the top matrix
min_movies <- quantile(rowCounts(ratings_movies), 0.98)
min_users <- quantile(colCounts(ratings_movies), 0.98)
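
To see how a quantile becomes a selection threshold, here is a minimal sketch on a made-up vector of rating counts. The numbers and names are hypothetical, and the 0.90 level is used only because the toy vector is tiny.

```r
# Hypothetical number of ratings given by ten users.
n_ratings <- c(12, 15, 20, 22, 30, 35, 41, 50, 64, 120)

# 90th percentile of the counts (the article uses 0.98 on a much larger matrix).
threshold <- quantile(n_ratings, 0.90)

# A strict ">" keeps only the users above the threshold.
top_users <- n_ratings[n_ratings > threshold]
```

The threshold is interpolated between observed counts, so it does not have to be one of them.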

Let's build the heat-map:

image(ratings_movies[rowCounts(ratings_movies) > min_movies,
                     colCounts(ratings_movies) > min_users],
      main = "Heatmap of the top users and movies")

[Figure: heatmap of the top users and movies]

As you may have noticed, some rows are darker than others. This might mean that some users give higher ratings to all the movies. However, we have visualized only the top users and movies. To get an overview of all the users, let's take a look at the distribution of the average rating by user:

average_ratings_per_user <- rowMeans(ratings_movies)

Let's visualize the distribution:

library(ggplot2)
qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) + ggtitle("Distribution of the average rating per user")

[Figure: distribution of the average rating per user]

As suspected, the average rating varies a lot across different users.

Normalizing the data

Users that give high (or low) ratings to all their movies might bias the results. We can remove this effect by normalizing the data in such a way that the average rating of each user is 0. The prebuilt normalize function does it automatically:

ratings_movies_norm <- normalize(ratings_movies)

Let's take a look at the average rating by user. We count the users whose mean differs from 0 by more than a small tolerance:

sum(abs(rowMeans(ratings_movies_norm)) > 0.00001)
## [1] 0

As expected, the mean rating of each user is 0 (apart from the approximation error).
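
What this kind of normalization does can be sketched in base R on a toy matrix. This is an illustration of row centering under made-up data, not recommenderlab's implementation; the matrix and variable names are hypothetical.

```r
# Toy ratings: rows are users, columns are movies, NA means "not rated".
ratings <- matrix(
  c( 5, 4, NA,  5,
     2, 1,  1, NA,
    NA, 3,  4,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), paste0("m", 1:4))
)

# Mean of each user's observed ratings (missing ratings are ignored).
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtract each user's mean from that user's row; NA cells stay NA.
ratings_centered <- sweep(ratings, 1, user_means, FUN = "-")

# Each user's mean is now 0, up to floating-point error.
centered_means <- rowMeans(ratings_centered, na.rm = TRUE)
```

A generous user (u1) and a strict user (u2) end up on the same scale: a positive value means "better than this user's average", regardless of how high that average was.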

We can visualize the new matrix using an image.

Let's build the heat-map:

# visualize the normalised matrix
image(ratings_movies_norm[rowCounts(ratings_movies_norm) > min_movies,
                          colCounts(ratings_movies_norm) > min_users],
      main = "Heatmap of the top users and movies")

[Figure: heatmap of the top users and movies, normalized ratings]

The first difference we notice is the colors, because the data is now continuous. Previously, each rating was an integer between 1 and 5. After normalization, a rating is the original value minus the user's average, so it can be any number between -4 and 4.

Some rows are still bluer and some are redder. The reason is that we are visualizing only the top movies. We already checked that the average rating is 0 for each user across all the movies they rated.

Binarizing the data

A few recommendation models work on binary data, so we might want to binarize our data, that is, define a table containing only 0s and 1s. The 0s will be treated as either missing values or bad ratings.

In our case, we can do either of the following:

  • Define a matrix that has 1 if the user rated the movie and 0 otherwise. In this case, we are losing the information about the rating.
  • Define a matrix that has 1 if the rating is greater than or equal to a given threshold (for example, 3) and 0 otherwise. In this case, giving a bad rating to a movie is equivalent to not rating it.

Depending on the context, one choice is more appropriate than the other.
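
The two options can be sketched in base R on a toy matrix. The data, the threshold, and the names watched and good are made up for illustration.

```r
# Toy ratings: rows are users, columns are movies, NA means "not rated".
ratings <- matrix(
  c( 5,  2, NA,
    NA,  3,  1,
     4, NA, NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), paste0("m", 1:3))
)

# Option 1: 1 if the user rated the movie at all, 0 otherwise.
watched <- ifelse(is.na(ratings), 0L, 1L)

# Option 2: 1 if the rating is at least the threshold, 0 otherwise.
# Missing ratings and low ratings both become 0.
threshold <- 3
good <- ifelse(!is.na(ratings) & ratings >= threshold, 1L, 0L)
```

Every 1 in good is also a 1 in watched, so the second matrix can only have the same number of 1s or fewer.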

The function to binarize the data is binarize. Let's apply it to our data. First, let's define a matrix equal to 1 if the movie has been watched, that is, if its rating is at least 1.

ratings_movies_watched <- binarize(ratings_movies, minRating = 1)

Let's take a look at the results. In this case, we will have black-and-white charts, so we can visualize a bigger portion of users and movies, for example, 5 percent. Similar to what we did earlier, let's select the 5 percent using quantile. The row and column counts are the same as the original matrix, so we can still apply rowCounts and colCounts on ratings_movies:

min_movies_binary <- quantile(rowCounts(ratings_movies), 0.95)
min_users_binary <- quantile(colCounts(ratings_movies), 0.95)

Let's build the heat-map:

image(ratings_movies_watched[rowCounts(ratings_movies) > min_movies_binary,
                             colCounts(ratings_movies) > min_users_binary],
      main = "Heatmap of the top users and movies")

[Figure: heatmap of the top users and movies, binarized with minRating = 1]

Only a few cells contain non-watched movies. This is just because we selected the top users and movies.

Let's use the same approach to compute and visualize the other binary matrix. Now, each cell is 1 if the rating is at or above a threshold, for example 3, and 0 otherwise.

ratings_movies_good <- binarize(ratings_movies, minRating = 3)

Let's build the heat-map:

image(ratings_movies_good[rowCounts(ratings_movies) > min_movies_binary,
                          colCounts(ratings_movies) > min_users_binary],
      main = "Heatmap of the top users and movies")

[Figure: heatmap of the top users and movies, binarized with minRating = 3]

As expected, we have more white cells now.

Depending on the model, we can leave the ratings matrix as it is or normalize/binarize it.

Summary

In this article, you learned about data preparation and how you should select, explore, normalize, and binarize the data.
