In this article by Suresh K Gorakala and Michele Usuelli, authors of the book Building a Recommendation System with R, we will learn how to prepare data for recommender models by selecting the most relevant data, exploring it, normalizing it, and binarizing it.
Here, we show how to prepare the data to be used in recommender models, starting with the selection of the most relevant data.
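On exploring the data, you will notice that the table contains movies that have been rated by only a few users and users who have rated only a few movies. In both cases, the ratings might be biased because of the lack of data.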
We need to determine the minimum number of users per movie and the minimum number of movies per user. The right values come from iterating over the entire process of preparing the data, building a recommendation model, and validating it. Since we are implementing the model for the first time, we can start from a rule of thumb. Once the models are built, we can come back and adjust the data preparation.
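One way to make the rule of thumb less arbitrary is to tabulate how much data survives for a few candidate thresholds. The sketch below uses base R only, on simulated ratings; the matrix, the candidate thresholds, and the helper count_kept are assumptions made for illustration and are not part of recommenderlab.

```r
set.seed(1)

# Simulated sparse ratings: 30 users x 20 movies, NA means "not rated".
ratings <- matrix(NA_real_, nrow = 30, ncol = 20)
filled <- sample(length(ratings), 200)
ratings[filled] <- sample(1:5, 200, replace = TRUE)

# Hypothetical helper: how many users, movies, and ratings remain
# when we keep only rows and columns with more than the given counts.
count_kept <- function(x, min_per_user, min_per_movie) {
  keep_users  <- rowSums(!is.na(x)) > min_per_user
  keep_movies <- colSums(!is.na(x)) > min_per_movie
  kept <- x[keep_users, keep_movies, drop = FALSE]
  c(users = sum(keep_users),
    movies = sum(keep_movies),
    ratings = sum(!is.na(kept)))
}

# Candidate thresholds (arbitrary values for the toy data).
thresholds <- c(2, 5, 8)
kept <- t(sapply(thresholds, function(k) count_kept(ratings, k, k)))
rownames(kept) <- paste0("min_", thresholds)
kept
```

A table like this shows how quickly the matrix shrinks as the thresholds grow, which helps when revisiting the data preparation after the first models have been validated.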
We define ratings_movies containing the matrix that we will use. It keeps only the users who have rated more than 50 movies and the movies that have been rated more than 100 times.
The following code shows this:
library(recommenderlab)
data(MovieLense)
ratings_movies <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
ratings_movies
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
ratings_movies contains about 60 percent of the users (560 of 943) and a fifth of the movies (332 of 1664) that MovieLense has.
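To see what this kind of filtering does without loading the full dataset, here is a minimal base R sketch on a made-up 4 x 4 matrix. It mimics rowCounts and colCounts with rowSums and colSums over the non-missing cells; the toy values and the thresholds of 1 are assumptions chosen so that the effect is visible.

```r
# Toy ratings: rows are users, columns are movies, NA means "not rated".
# All values are made up for illustration.
ratings <- matrix(
  c( 5,  3, NA, 1,
     4, NA, NA, 1,
    NA, NA, NA, 5,
     1,  1, NA, 4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

# Base R analogues of rowCounts() and colCounts():
# number of ratings per user and per movie.
ratings_per_user  <- rowSums(!is.na(ratings))
ratings_per_movie <- colSums(!is.na(ratings))

# Toy thresholds (the article uses 50 and 100 on MovieLense).
min_ratings_per_user  <- 1
min_ratings_per_movie <- 1

# Keep users and movies whose counts exceed the thresholds.
ratings_filtered <- ratings[
  ratings_per_user  > min_ratings_per_user,
  ratings_per_movie > min_ratings_per_movie,
  drop = FALSE
]
ratings_filtered
```

Both counts are computed on the full matrix before subsetting, as in the article's code, so a user who passes the filter can end up with fewer ratings in the reduced matrix than the threshold suggests.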
Let's visualize the top 2 percent of users and movies of the new matrix:
# visualize the top matrix
min_movies <- quantile(rowCounts(ratings_movies), 0.98)
min_users <- quantile(colCounts(ratings_movies), 0.98)
Let's build the heat-map:
image(ratings_movies[rowCounts(ratings_movies) > min_movies,
                     colCounts(ratings_movies) > min_users],
      main = "Heatmap of the top users and movies")
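The selection above relies on quantile to turn "top 2 percent" into a count threshold. The following base R sketch shows the same idea on ten made-up counts; the 0.90 level is an assumption picked because the toy vector is too short for 0.98 to be meaningful.

```r
# Made-up numbers of ratings for ten users.
n_ratings <- c(12, 55, 8, 97, 31, 64, 20, 150, 43, 76)

# Count below which 90 percent of the users fall.
cutoff <- quantile(n_ratings, 0.90)

# Logical index of the users strictly above the cutoff;
# it can be used directly to subset the rows of a rating matrix.
top_users <- n_ratings > cutoff
which(top_users)
```

Because the comparison is strict, users whose count equals the cutoff are left out, so the selected share can be slightly smaller than the nominal percentage.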
As you have already noticed, some rows are darker than the others. This might mean that some users give higher ratings to all the movies. However, we have visualized the top movies only. In order to have an overview of all the users, let's take a look at the distribution of the average rating by users:
average_ratings_per_user <- rowMeans(ratings_movies)
Let's visualize the distribution:
library(ggplot2)
qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) + ggtitle("Distribution of the average rating per user")
As suspected, the average rating varies a lot across different users.
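As a small illustration of how per-user averages can differ, the base R sketch below computes them on a made-up matrix and bins them, which is a text-only stand-in for the histogram. The data and the bin edges are assumptions; the only point carried over from the article is that missing ratings are ignored when averaging.

```r
# Toy ratings for three users; NA means "not rated". Values are made up.
ratings <- matrix(
  c(5,  5,  4, NA,
    1,  2, NA,  1,
    3, NA,  3,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("generous", "harsh", "moderate"), paste0("m", 1:4))
)

# Average rating per user, skipping the movies the user did not rate.
avg_per_user <- rowMeans(ratings, na.rm = TRUE)
avg_per_user

# Count the users falling in each one-point bin between 1 and 5.
bins <- table(cut(avg_per_user, breaks = 1:5, include.lowest = TRUE))
bins
```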
Users that give high (or low) ratings to all their movies might bias the results. We can remove this effect by normalizing the data in such a way that the average rating of each user is 0. The prebuilt normalize function does it automatically:
ratings_movies_norm <- normalize(ratings_movies)
Let's check the average rating by user by counting the users whose mean differs from 0 by more than a small tolerance:
sum(abs(rowMeans(ratings_movies_norm)) > 0.00001)
## [1] 0
As expected, the mean rating of each user is 0 (up to floating-point rounding error).
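The centering itself is easy to reproduce by hand. The sketch below is a base R illustration of row-wise mean centering on a made-up matrix, assuming (as the text describes) that normalization means subtracting each user's average from that user's ratings; it is not the package's implementation.

```r
# Toy ratings; NA means "not rated". Values are made up.
ratings <- matrix(
  c(5,  5,  4, NA,
    1,  2, NA,  1,
    3, NA,  3,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), paste0("m", 1:4))
)

# Mean of the observed ratings of each user.
user_means <- rowMeans(ratings, na.rm = TRUE)

# Subtract each user's mean from that user's row; NA cells stay NA.
ratings_centered <- sweep(ratings, 1, user_means, FUN = "-")
ratings_centered

# Largest absolute row mean after centering: zero up to rounding error.
max(abs(rowMeans(ratings_centered, na.rm = TRUE)))
```

Missing cells are left untouched, so the sparsity pattern of the matrix does not change; only the observed ratings are shifted.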
We can visualize the new matrix by building a heat-map:
# visualize the normalized matrix
image(ratings_movies_norm[rowCounts(ratings_movies_norm) > min_movies,
                          colCounts(ratings_movies_norm) > min_users],
      main = "Heatmap of the top users and movies")
The first difference that we notice is the colors, and that is because the data is now continuous. Previously, the rating was an integer between 1 and 5. After normalization, each rating is shifted by its user's mean, which itself lies between 1 and 5, so the rating can be any number between -4 and 4.
There are still some rows that are more blue and some that are more red. The reason is that we are visualizing only the top movies, whereas the average of 0 that we checked for each user is taken over all the movies that the user rated.
A few recommendation models work on binary data, so we might want to binarize our data, that is, define a table containing only 0s and 1s. The 0s will be treated as either missing values or bad ratings.
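In our case, we can either define a matrix having 1 if the user rated the movie and 0 otherwise, or define a matrix having 1 if the rating is at or above a threshold (for example, 3) and 0 otherwise.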
Depending on the context, one choice is more appropriate than the other.
The function to binarize the data is binarize. Let's apply it to our data. First, let's define a matrix equal to 1 if the movie has been watched, that is, if its rating is at least 1.
ratings_movies_watched <- binarize(ratings_movies, minRating = 1)
Let's take a look at the results. In this case, we will have black-and-white charts, so we can visualize a bigger portion of users and movies, for example, 5 percent. Similar to what we did earlier, let's select the 5 percent using quantile. The row and column counts are the same as the original matrix, so we can still apply rowCounts and colCounts on ratings_movies:
min_movies_binary <- quantile(rowCounts(ratings_movies), 0.95)
min_users_binary <- quantile(colCounts(ratings_movies), 0.95)
Let's build the heat-map:
image(ratings_movies_watched[rowCounts(ratings_movies) > min_movies_binary,
                             colCounts(ratings_movies) > min_users_binary],
      main = "Heatmap of the top users and movies")
Only a few cells contain non-watched movies. This is just because we selected the top users and movies.
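The effect of selecting only the most active users and the most rated movies can be quantified by comparing the share of 1s in the whole matrix with the share in the selected block. The base R sketch below does this on a made-up binary matrix; using the median as the cutoff is an assumption made because the toy matrix is tiny.

```r
# Toy "watched" matrix: 1 if the user rated the movie, 0 otherwise.
watched <- matrix(
  c(1, 1, 1, 0,
    1, 1, 0, 0,
    1, 0, 0, 0,
    1, 1, 1, 1),
  nrow = 4, byrow = TRUE
)

# Share of watched cells in the whole matrix.
density_all <- mean(watched)

# Users and movies with more 1s than the median user or movie.
top_rows <- rowSums(watched) > median(rowSums(watched))
top_cols <- colSums(watched) > median(colSums(watched))

# Share of watched cells among the top users and movies only.
density_top <- mean(watched[top_rows, top_cols, drop = FALSE])

c(all = density_all, top = density_top)
```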
Let's use the same approach to compute and visualize the other binary matrix. Now, each cell is 1 if the rating is at or above a threshold, for example 3, and 0 otherwise.
ratings_movies_good <- binarize(ratings_movies, minRating = 3)
Let's build the heat-map:
image(ratings_movies_good[rowCounts(ratings_movies) > min_movies_binary,
                          colCounts(ratings_movies) > min_users_binary],
      main = "Heatmap of the top users and movies")
As expected, we have more white cells now.
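The two binary matrices can be compared side by side on a small example. The sketch below is a base R illustration in which binarize_ratings is a hypothetical helper written for this example, not the package's binarize function; it assumes a cell becomes 1 when a rating exists and is at least the given minimum, and 0 otherwise.

```r
# Toy ratings; NA means "not rated". Values are made up.
ratings <- matrix(
  c( 5,  2, NA,
     3, NA,  4,
     1, NA,  3),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:3), paste0("m", 1:3))
)

# Hypothetical helper: 1 if the cell is rated and the rating is at least
# min_rating, 0 otherwise (missing ratings become 0).
binarize_ratings <- function(x, min_rating) {
  out <- !is.na(x) & x >= min_rating
  storage.mode(out) <- "integer"
  out
}

watched <- binarize_ratings(ratings, min_rating = 1)
good    <- binarize_ratings(ratings, min_rating = 3)

watched
good

# Every "good" cell is also a "watched" cell, so raising the
# threshold can only turn 1s into 0s.
all(good <= watched)
```

In the "watched" matrix the value of the rating is lost entirely, while in the "good" matrix a low rating and a missing rating become indistinguishable, which is why the choice depends on the context.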
Depending on the model, we can leave the ratings matrix as it is or normalize/binarize it.
In this article, you learned about data preparation and how you should select, explore, normalize, and binarize the data.