Movie Recommendation


This article is an excerpt from Learning Data Mining with Python - Second Edition by Robert Layton. The second edition improves upon the first with updated examples, more in-depth discussion, and exercises for your future development with data analytics. In this snippet from the book, we look at movie recommendation with a technique known as Affinity Analysis.

Affinity Analysis

Affinity Analysis is the task of determining when objects are used in similar ways, rather than whether the objects themselves are similar. The data for Affinity Analysis is often described in the form of transactions. Intuitively, this comes from transactions at a store: by determining when objects are purchased together, we can recommend to users other products they might buy. Other use cases for Affinity Analysis include:

  • Fraud detection
  • Customer segmentation
  • Software optimization
  • Product recommendations

Affinity Analysis is usually much more exploratory than classification. At the very least, we often simply rank the results and choose the top 5 recommendations (or some other number), rather than expect the algorithm to give us a specific answer.

Algorithms for Affinity Analysis

A brute-force solution, testing all possible combinations, is not efficient enough for real-world use. We could expect even a small store to have hundreds of items for sale, while many online stores have thousands (or millions!). As we add more items, the time it takes to compute all rules increases much faster: the number of possible itemsets alone is 2^n - 1 for n items, so a store with just 100 items already has more than 10^30 possible itemsets to consider. Even the drastic increases in computing power cannot keep up with the growth in the number of items stored online. Therefore, we need algorithms that work smarter, as opposed to computers that work harder.

The Apriori algorithm addresses the exponential problem of creating sets of items that occur frequently within a database, called frequent itemsets. Once these frequent itemsets are discovered, creating association rules is straightforward.

The intuition behind Apriori is both simple and clever. First, we ensure that a rule has sufficient support within the dataset; defining a minimum support level is the key parameter for Apriori. For example, for an itemset (A, B) to have a support of at least 30, both A and B must occur at least 30 times in the database on their own. This property extends to larger sets as well. For an itemset (A, B, C, D) to be considered frequent, the set (A, B, C) must also be frequent (as must D). Apriori discovers larger frequent itemsets by building off smaller frequent itemsets. The picture below outlines the full process:

[Figure: the Apriori process of building larger frequent itemsets from smaller frequent itemsets]
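To make this idea concrete, here is a minimal sketch using made-up transactions (not the MovieLens data) that counts the support of single items and of a candidate pair, illustrating why a pair can only be frequent if both of its members are frequent on their own:

from collections import defaultdict

# Toy transactions: each set is one customer's basket (hypothetical data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Support of each single item.
item_support = defaultdict(int)
for basket in transactions:
    for item in basket:
        item_support[item] += 1

# Support of the candidate pair (bread, milk).
pair = frozenset(("bread", "milk"))
pair_support = sum(1 for basket in transactions if pair.issubset(basket))

print(dict(item_support))  # bread: 3, milk: 3, butter: 2
print(pair_support)        # 2 -- never more than min(support(bread), support(milk))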

The Movie Recommendation Problem

Product recommendation is big business. Online stores use it to up-sell to customers by recommending other products that they could buy. Making better recommendations leads to better sales. With online stores selling to millions of customers every year, there is a lot of potential money to be made by selling more items to those customers.

GroupLens, a research group at the University of Minnesota, has released several datasets that are often used for testing algorithms in this area. They have released several versions of a movie rating dataset of different sizes: one with 100,000 reviews, one with 1 million reviews, and one with 10 million reviews.

The datasets are available from http://grouplens.org/datasets/movielens/ and the dataset we are going to use in this article is the MovieLens 100K dataset (with 100,000 reviews). Download this dataset and unzip it in your data folder. Start a new Jupyter Notebook and type the following code:

import os
import pandas as pd
data_folder = os.path.join(os.path.expanduser("~"), "Data", "ml-100k")
ratings_filename = os.path.join(data_folder, "u.data")

Ensure that ratings_filename points to the u.data file in the unzipped folder.
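A quick sanity check (not part of the original code) can save some confusion later: confirm that the path actually points at the downloaded file before moving on.

# Verify that the u.data file is where we expect it to be.
if not os.path.exists(ratings_filename):
    print("Could not find {}; check where you unzipped the dataset.".format(ratings_filename))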

Loading with pandas

The MovieLens dataset is in good shape; however, there are some changes from the default options of pandas.read_csv that we need to make. When loading the file, we set the delimiter parameter to the tab character, tell pandas not to read the first row as the header (with header=None), and set the column names to the given values. Let's look at the following code:

all_ratings = pd.read_csv(ratings_filename, delimiter="\t", header=None, names=["UserID", "MovieID", "Rating", "Datetime"])

While we won't use it in this article, you can properly parse the date timestamp using the following line. Dates of reviews can be an important feature in recommendation prediction, as movies that are rated together often receive more similar ratings than movies rated separately. Accounting for this can improve models significantly.

all_ratings["Datetime"] = pd.to_datetime(all_ratings['Datetime'], unit='s')

Understanding the Apriori algorithm and its implementation

The goal of this article is to produce rules of the following form: if a person recommends this set of movies, they will also recommend this movie. We will also touch on extensions of this idea, such as how likely a person who recommends one set of movies is to recommend another particular movie.

To do this, we first need to determine if a person recommends a movie. We can do this by creating a new feature Favorable, which is True if the person gave a favorable review to a movie:

all_ratings["Favorable"] = all_ratings["Rating"] > 3
We will sample our dataset to form a training data. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster. We obtain all reviews from the first 200 users:
ratings = all_ratings[all_ratings['UserID'].isin(range(200))]

Next, we can create a dataset of only the favorable reviews in our sample:

favorable_ratings = ratings[ratings["Favorable"]]

We will be searching the user's favorable reviews for our itemsets. So, the next thing we need is the movies which each user has given a favorable rating. We can compute this by grouping the dataset by the UserID and iterating over the movies in each group:

favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby("UserID")["MovieID"])

In the preceding code, we stored the values as a frozenset, allowing us to quickly check if a movie has been rated by a user.

Sets are much faster than lists for this type of operation, and we will use them in later code.
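As a rough illustration (a hypothetical micro-benchmark, not from the original text), membership tests against a frozenset use hashing rather than scanning every element, so they stay fast even for large collections:

import timeit

movie_ids_list = list(range(10000))
movie_ids_set = frozenset(movie_ids_list)

# Time 10,000 membership checks against a list versus a frozenset.
print(timeit.timeit(lambda: 9999 in movie_ids_list, number=10000))
print(timeit.timeit(lambda: 9999 in movie_ids_set, number=10000))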

Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review:

num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()

We can see the top five movies by running the following code:

num_favorable_by_movie.sort_values(by="Favorable", ascending=False).head()

Implementing the Apriori algorithm

On the first iteration of Apriori, the newly discovered itemsets will have a length of 2, as they will be supersets of the initial itemsets created in the first step. On the second iteration (after applying the fourth step and going back to step 2), the newly discovered itemsets will have a length of 3. This allows us to quickly identify the newly discovered itemsets, as needed in the second step. We can store our discovered frequent itemsets in a dictionary, where the key is the length of the itemsets. This allows us to quickly access the itemsets of a given length, and therefore the most recently discovered frequent itemsets, with the help of the following code:

frequent_itemsets = {}
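To illustrate the intended structure (with made-up movie IDs and counts rather than real results), the dictionary will end up looking something like the sketch below once populated:

# A sketch of the structure: keys are itemset lengths, values map itemsets to support counts.
example = {
    1: {frozenset((50,)): 100, frozenset((64,)): 58},
    2: {frozenset((50, 64)): 53},
}
# Because the key is the itemset length, the most recently found itemsets
# are always example[max(example)].
print(example[max(example)])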

We also need to define the minimum support needed for an itemset to be considered frequent. This value is chosen based on the dataset, but try different values to see how they affect the result. I recommend only changing it by about 10 percent at a time, though, as the time the algorithm takes to run will change significantly! Let's set a minimum support value:

min_support = 50

To implement the first step of the Apriori algorithm, we create an itemset with each movie individually and test whether the itemset is frequent. We use frozensets, as they allow us to perform faster set-based operations later on, and they can also be used as keys in our counting dictionary (normal sets cannot).

Let's look at the following example of frozenset code:

frequent_itemsets[1] = dict((frozenset((movie_id,)), row["Favorable"])
                            for movie_id, row in num_favorable_by_movie.iterrows()
                            if row["Favorable"] > min_support)

We implement the second and third steps together for efficiency by creating a function that takes the newly discovered frequent itemsets, creates the supersets, and then tests if they are frequent. First, we set up the function to perform these steps:

from collections import defaultdict

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets, min_support):
    counts = defaultdict(int)
    for user, reviews in favorable_reviews_by_users.items():
        for itemset in k_1_itemsets:
            if itemset.issubset(reviews):
                for other_reviewed_movie in reviews - itemset:
                    current_superset = itemset | frozenset((other_reviewed_movie,))
                    counts[current_superset] += 1
    return dict([(itemset, frequency) for itemset, frequency in counts.items()
                 if frequency >= min_support])

In keeping with our rule of thumb of reading through the data as little as possible, we iterate over the dataset once per call to this function. While this doesn't matter too much in this implementation (our dataset is small enough to fit comfortably in the memory of an average computer), a single pass is a good practice to get into for larger applications.

Let's have a look at the core of this function in detail. We iterate through each user and through each of the previously discovered itemsets, which are stored in k_1_itemsets (note that here, k_1 means k-1). For each pair, we check whether the itemset is a subset of the user's current set of reviews; if it is, the user has reviewed every movie in the itemset. This is done by the itemset.issubset(reviews) line.

We can then go through each individual movie that the user has reviewed (that is not already in the itemset), create a superset by combining the itemset with the new movie and record that we saw this superset in our counting dictionary. These are the candidate frequent itemsets for this value of k.

We end our function by testing which of the candidate itemsets have enough support to be considered frequent, returning only those whose support is at least our min_support value.
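To see these mechanics in isolation, here is a tiny worked example (with made-up movie IDs) of the subset test and the superset construction used inside the function:

reviews = frozenset((1, 5, 10, 20))   # movies one user liked (hypothetical IDs)
itemset = frozenset((1, 5))           # a frequent itemset of length k-1

if itemset.issubset(reviews):         # the user liked every movie in the itemset
    for other in reviews - itemset:   # movies the user liked that are not in the itemset
        candidate = itemset | frozenset((other,))
        print(candidate)              # frozenset({1, 5, 10}) and frozenset({1, 5, 20})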

This function forms the heart of our Apriori implementation. We now create a loop that iterates over the steps of the larger algorithm, storing the new itemsets as we increase k. In this loop, k represents the length of the soon-to-be-discovered frequent itemsets, allowing us to access the most recently discovered ones by looking in our frequent_itemsets dictionary with the key k - 1. We create the frequent itemsets and store them in our dictionary by their length. Let's look at the code:

import sys  # needed for sys.stdout.flush() below

for k in range(2, 20):
    # Generate candidates of length k, using the frequent itemsets of length k-1.
    # Only store the itemsets that are frequent.
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1], min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(len(cur_frequent_itemsets), k))
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets
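One optional cleanup before moving on (an assumption on my part, as it is not shown in this excerpt): itemsets of length one would only produce rules with an empty premise, so you may wish to discard them before generating rules.

# Length-1 itemsets cannot form a meaningful "premise -> conclusion" rule.
del frequent_itemsets[1]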

Extracting association rules

After the Apriori algorithm has completed, we have a list of frequent itemsets. These aren't exactly association rules, but they can easily be converted into rules. For each itemset, we can generate a number of association rules by setting each movie to be the conclusion and the remaining movies to be the premise.

candidate_rules = []
for itemset_length, itemset_counts in frequent_itemsets.items():
    for itemset in itemset_counts.keys():
        for conclusion in itemset:
            premise = itemset - set((conclusion,))
            candidate_rules.append((premise, conclusion))

In these rules, the first part is the set of movies in the premise, while the value after it is the conclusion. For example, one rule might state that if a reviewer recommends movie 79, they are also likely to recommend movie 258.

The process of computing confidence starts by creating dictionaries to store how many times we see the premise leading to the conclusion (a correct example of the rule) and how many times it doesn't (an incorrect example). We then iterate over all reviews and rules, working out whether the premise of the rule applies and, if it does, whether the conclusion is accurate.

correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1

We then compute the confidence for each rule by dividing the correct count by the total number of times the rule was seen:

rule_confidence = {candidate_rule: (correct_counts[candidate_rule] /
                                    float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule]))
                   for candidate_rule in candidate_rules}

Now we can print the top five rules by sorting this confidence dictionary and printing the results:

from operator import itemgetter

sorted_confidence = sorted(rule_confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise, conclusion))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")

The resulting printout shows only the movie IDs, which isn't very helpful without the names of the movies as well. The dataset comes with a file called u.item, which stores the movie names and their corresponding MovieID (as well as other information, such as the genre).

We can load the titles from this file using pandas. Additional information about the file and its categories is available in the README file that came with the dataset. The data in the file is in CSV-like format, but with fields separated by the | symbol; it has no header, and setting the encoding is important. The column names were found in the README file.

movie_name_filename = os.path.join(data_folder, "u.item")
movie_name_data = pd.read_csv(movie_name_filename, delimiter="|", header=None,
                              encoding="mac-roman")
movie_name_data.columns = ["MovieID", "Title", "Release Date", "Video Release", "IMDB", "Unknown",
                           "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
                           "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
                           "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

Let's also create a helper function for finding the name of a movie by its ID:

def get_movie_name(movie_id):
    title_object = movie_name_data[movie_name_data["MovieID"] == movie_id]["Title"]
    title = title_object.values[0]
    return title
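For example (assuming the standard MovieLens 100K numbering, where MovieID 1 corresponds to Toy Story):

print(get_movie_name(1))   # expected to print something like "Toy Story (1995)"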

We can now adjust our previous code for printing out the top rules to also include the titles:

for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    premise_names = ", ".join(get_movie_name(idx) for idx in premise)
    conclusion_name = get_movie_name(conclusion)
    print("Rule: If a person recommends {0} they will also recommend {1}".format(premise_names, conclusion_name))
    print(" - Confidence: {0:.3f}".format(rule_confidence[(premise, conclusion)]))
    print("")

The results give movie recommendations based on the movies a person previously liked. Give it a shot and see if they match your expectations!

Learning Data Mining with Python

In this short section of Learning Data Mining with Python - Second Edition, we performed Affinity Analysis in order to recommend movies based on a large set of reviewers. We did this in two stages. First, we found frequent itemsets in the data using the Apriori algorithm. Then, we created association rules from those itemsets. We performed training on a subset of our data in order to find the association rules, and then tested those rules on the rest of the data (a testing set). We could extend this concept to use cross-fold validation to better evaluate the rules, leading to a more robust evaluation of each rule's quality. The book covers topics such as classification, clustering, text analysis, image recognition, TensorFlow, and big data. Each section comes with a practical real-world example, steps through the code in detail, and provides suggestions for you to continue your (machine) learning.
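As a sketch of that evaluation step (my own illustration; the variable names here are assumptions rather than code quoted from the book), you could recompute each rule's confidence on the users who were not in the training sample:

# Favorable reviews for the held-out users (everyone not in the first 200).
test_dataset = all_ratings[~all_ratings['UserID'].isin(range(200))]
test_favorable = test_dataset[test_dataset["Favorable"]]
test_favorable_by_users = dict((k, frozenset(v.values))
                               for k, v in test_favorable.groupby("UserID")["MovieID"])

# Count correct and incorrect applications of each rule on the test users.
test_correct = defaultdict(int)
test_incorrect = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for premise, conclusion in candidate_rules:
        if premise.issubset(reviews):
            if conclusion in reviews:
                test_correct[(premise, conclusion)] += 1
            else:
                test_incorrect[(premise, conclusion)] += 1

# Confidence on the test set, skipping rules whose premise never appeared.
test_confidence = {rule: test_correct[rule] / float(test_correct[rule] + test_incorrect[rule])
                   for rule in candidate_rules
                   if test_correct[rule] + test_incorrect[rule] > 0}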

Summary

In this article, we looked at movie recommendation using a technique known as Affinity Analysis: we found frequent itemsets with the Apriori algorithm and then converted them into association rules for recommending movies.

