"cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Advanced Recommender Systems with Python\n", "\n", "Welcome to the code notebook for creating Advanced Recommender Systems with Python. This is an optional lecture notebook for you to check out. Currently there is no video for this lecture because of the level of mathematics used and the heavy use of SciPy here.\n", "\n", "Recommendation Systems usually rely on larger data sets and specifically need to be organized in a particular fashion. Because of this, we won't have a project to go along with this topic, instead we will have a more intensive walkthrough process on creating a recommendation system with Python with the same Movie Lens Data Set.\n", "\n", "*Note: The actual mathematics behind recommender systems is pretty heavy in Linear Algebra.*\n", "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Methods Used\n", "\n", "Two most common types of recommender systems are **Content-Based** and **Collaborative Filtering (CF)**. \n", "\n", "* Collaborative filtering produces recommendations based on the knowledge of users’ attitude to items, that is it uses the \"wisdom of the crowd\" to recommend items. \n", "* Content-based recommender systems focus on the attributes of the items and give you recommendations based on the similarity between them.\n", "\n", "## Collaborative Filtering\n", "\n", "In general, Collaborative filtering (CF) is more commonly used than content-based systems because it usually gives better results and is relatively easy to understand (from an overall implementation perspective). The algorithm has the ability to do feature learning on its own, which means that it can start to learn for itself what features to use. \n", "\n", "CF can be divided into **Memory-Based Collaborative Filtering** and **Model-Based Collaborative filtering**. \n", "\n", "In this tutorial, we will implement Model-Based CF by using singular value decomposition (SVD) and Memory-Based CF by computing cosine similarity. \n", "\n", "## The Data\n", "\n", "We will use famous MovieLens dataset, which is one of the most common datasets used when implementing and testing recommender engines. It contains 100k movie ratings from 943 users and a selection of 1682 movies.\n", "\n", "You can download the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k.zip) or just use the u.data file that is already included in this folder.\n", "\n", "____\n", "## Getting Started\n", "\n", "Let's import some libraries we will need:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "We can then read in the **u.data** file, which contains the full dataset. You can read a brief description of the dataset [here](http://files.grouplens.org/datasets/movielens/ml-100k-README.txt).\n", "\n", "Note how we specify the separator argument for a Tab separated file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "column_names = ['user_id', 'item_id', 'rating', 'timestamp']\n", "df = pd.read_csv('u.data', sep='\\t', names=column_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick look at the data." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>user_id</th>\n", " <th>item_id</th>\n", " <th>rating</th>\n", " <th>timestamp</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>50</td>\n", " <td>5</td>\n", " <td>881250949</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " <td>172</td>\n", " <td>5</td>\n", " <td>881250949</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>133</td>\n", " <td>1</td>\n", " <td>881250949</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>196</td>\n", " <td>242</td>\n", " <td>3</td>\n", " <td>881250949</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>186</td>\n", " <td>302</td>\n", " <td>3</td>\n", " <td>891717742</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " user_id item_id rating timestamp\n", "0 0 50 5 881250949\n", "1 0 172 5 881250949\n", "2 0 133 1 881250949\n", "3 196 242 3 881250949\n", "4 186 302 3 891717742" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how we only have the item_id, not the movie name. We can use the Movie_ID_Titles csv file to grab the movie names and merge it with this dataframe:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>item_id</th>\n", " <th>title</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>Toy Story (1995)</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>GoldenEye (1995)</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>Four Rooms (1995)</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>Get Shorty (1995)</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>Copycat (1995)</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " item_id title\n", "0 1 Toy Story (1995)\n", "1 2 GoldenEye (1995)\n", "2 3 Four Rooms (1995)\n", "3 4 Get Shorty (1995)\n", "4 5 Copycat (1995)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_titles = pd.read_csv(\"Movie_Id_Titles\")\n", "movie_titles.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then merge the dataframes:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>user_id</th>\n", " <th>item_id</th>\n", " <th>rating</th>\n", " <th>timestamp</th>\n", " <th>title</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>50</td>\n", " <td>5</td>\n", " <td>881250949</td>\n", " <td>Star Wars (1977)</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>290</td>\n", " <td>50</td>\n", " <td>5</td>\n", " <td>880473582</td>\n", " <td>Star Wars (1977)</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>79</td>\n", " <td>50</td>\n", " <td>4</td>\n", 
" <td>891271545</td>\n", " <td>Star Wars (1977)</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>2</td>\n", " <td>50</td>\n", " <td>5</td>\n", " <td>888552084</td>\n", " <td>Star Wars (1977)</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>8</td>\n", " <td>50</td>\n", " <td>5</td>\n", " <td>879362124</td>\n", " <td>Star Wars (1977)</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " user_id item_id rating timestamp title\n", "0 0 50 5 881250949 Star Wars (1977)\n", "1 290 50 5 880473582 Star Wars (1977)\n", "2 79 50 4 891271545 Star Wars (1977)\n", "3 2 50 5 888552084 Star Wars (1977)\n", "4 8 50 5 879362124 Star Wars (1977)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.merge(df,movie_titles,on='item_id')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a quick look at the number of unique users and movies." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Num. of Users: 944\n", "Num of Movies: 1682\n" ] } ], "source": [ "n_users = df.user_id.nunique()\n", "n_items = df.item_id.nunique()\n", "\n", "print('Num. of Users: '+ str(n_users))\n", "print('Num of Movies: '+str(n_items))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Test Split\n", "\n", "Recommendation Systems by their very nature are very difficult to evaluate, but we will still show you how to evaluate them in this tutorial. In order to do this, we'll split our data into two sets. However, we won't do our classic X_train,X_test,y_train,y_test split. Instead we can actually just segement the data into two sets of data:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "train_data, test_data = train_test_split(df, test_size=0.25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Memory-Based Collaborative Filtering\n", "\n", "Memory-Based Collaborative Filtering approaches can be divided into two main sections: **user-item filtering** and **item-item filtering**. \n", "\n", "A *user-item filtering* will take a particular user, find users that are similar to that user based on similarity of ratings, and recommend items that those similar users liked. \n", "\n", "In contrast, *item-item filtering* will take an item, find users who liked that item, and find other items that those users or similar users also liked. It takes items and outputs other items as recommendations. \n", "\n", "* *Item-Item Collaborative Filtering*: “Users who liked this item also liked …”\n", "* *User-Item Collaborative Filtering*: “Users who are similar to you also liked …”" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both cases, you create a user-item matrix which built from the entire dataset.\n", "\n", "Since we have split the data into testing and training we will need to create two ``[943 x 1682]`` matrices (all users by all movies). \n", "\n", "The training matrix contains 75% of the ratings and the testing matrix contains 25% of the ratings. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example of user-item matrix:\n", "<img class=\"aligncenter size-thumbnail img-responsive\" src=\"http://s33.postimg.org/ay0ty90fj/BLOG_CCA_8.png\" alt=\"blog8\"/>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you have built the user-item matrix you calculate the similarity and create a similarity matrix. \n", "\n", "The similarity values between items in *Item-Item Collaborative Filtering* are measured by observing all the users who have rated both items. \n", "\n", "<img class=\"aligncenter size-thumbnail img-responsive\" style=\"max-width:100%; width: 50%; max-width: none\" src=\"http://s33.postimg.org/i522ma83z/BLOG_CCA_10.png\"/>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For *User-Item Collaborative Filtering* the similarity values between users are measured by observing all the items that are rated by both users.\n", "\n", "<img class=\"aligncenter size-thumbnail img-responsive\" style=\"max-width:100%; width: 50%; max-width: none\" src=\"http://s33.postimg.org/mlh3z3z4f/BLOG_CCA_11.png\"/>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A distance metric commonly used in recommender systems is *cosine similarity*, where the ratings are seen as vectors in ``n``-dimensional space and the similarity is calculated based on the angle between these vectors. \n", "Cosine similiarity for users *a* and *m* can be calculated using the formula below, where you take dot product of the user vector *$u_k$* and the user vector *$u_a$* and divide it by multiplication of the Euclidean lengths of the vectors.\n", "<img class=\"aligncenter size-thumbnail img-responsive\" src=\"https://latex.codecogs.com/gif.latex?s_u^{cos}(u_k,u_a)=\\frac{u_k&space;\\cdot&space;u _a&space;}{&space;\\left&space;\\|&space;u_k&space;\\right&space;\\|&space;\\left&space;\\|&s pace;u_a&space;\\right&space;\\|&space;}&space;=\\frac{\\sum&space;x_{k,m}x_{a,m}}{\\sqrt{\\s um&space;x_{k,m}^2\\sum&space;x_{a,m}^2}}\"/>\n", "\n", "To calculate similarity between items *m* and *b* you use the formula:\n", "\n", "<img class=\"aligncenter size-thumbnail img-responsive\" src=\"https://latex.codecogs.com/gif.latex?s_u^{cos}(i_m,i_b)=\\frac{i_m&space;\\cdot&space;i_ b&space;}{&space;\\left&space;\\|&space;i_m&space;\\right&space;\\|&space;\\left&space;\\|&sp ace;i_b&space;\\right&space;\\|&space;}&space;=\\frac{\\sum&space;x_{a,m}x_{a,b}}{\\sqrt{\\su m&space;x_{a,m}^2\\sum&space;x_{a,b}^2}}\n", "\"/>\n", "\n", "Your first step will be to create the user-item matrix. Since you have both testing and training data you need to create two matrices. " ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [], "source": [ "#Create two user-item matrices, one for training and another for testing\n", "train_data_matrix = np.zeros((n_users, n_items))\n", "for line in train_data.itertuples():\n", " train_data_matrix[line[1]-1, line[2]-1] = line[3] \n", "\n", "test_data_matrix = np.zeros((n_users, n_items))\n", "for line in test_data.itertuples():\n", " test_data_matrix[line[1]-1, line[2]-1] = line[3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the [pairwise_distances](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pa irwise_distances.html) function from sklearn to calculate the cosine similarity. Note, the output will range from 0 to 1 since the ratings are all positive." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics.pairwise import pairwise_distances\n", "user_similarity = pairwise_distances(train_data_matrix, metric='cosine')\n", "item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next step is to make predictions. You have already created similarity matrices: `user_similarity` and `item_similarity` and therefore you can make a prediction by applying following formula for user-based CF:\n", "\n", "<img class=\"aligncenter size-thumbnail img-responsive\" src=\"https://latex.codecogs.com/gif.latex?\\hat{x}_{k,m}&space;=&space;\\bar{x}_{k}&space;&pl us;&space;\\frac{\\sum\\limits_{u_a}&space;sim_u(u_k,&space;u_a)&space;(x_{a,m}&space;-&s pace;\\bar{x_{u_a}})}{\\sum\\limits_{u_a}|sim_u(u_k,&space;u_a)|}\"/>\n", "\n", "You can look at the similarity between users *k* and *a* as weights that are multiplied by the ratings of a similar user *a* (corrected for the average rating of that user). You will need to normalize it so that the ratings stay between 1 and 5 and, as a final step, sum the average ratings for the user that you are trying to predict. \n", "\n", "The idea here is that some users may tend always to give high or low ratings to all movies. The relative difference in the ratings that these users give is more important than the absolute values. To give an example: suppose, user *k* gives 4 stars to his favourite movies and 3 stars to all other good movies. Suppose now that another user *t* rates movies that he/she likes with 5 stars, and the movies he/she fell asleep over with 3 stars. These two users could have a very similar taste but treat the rating system differently. \n", "\n", "When making a prediction for item-based CF you don't need to correct for users average rating since query user itself is used to do predictions.\n", "\n", "<img class=\"aligncenter size-thumbnail img-responsive\" src=\"https://latex.codecogs.com/gif.latex?\\hat{x}_{k,m}&space;=&space;\\frac{\\sum\\limits_{i_b }&space;sim_i(i_m,&space;i_b)&space;(x_{k,b})&space;}{\\sum\\limits_{i_b}|sim_i(i_m,&space;i_ b)|}\"/>" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def predict(ratings, similarity, type='user'):\n", " if type == 'user':\n", " mean_user_rating = ratings.mean(axis=1)\n", " #You use np.newaxis so that mean_user_rating has same format as ratings\n", " ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) \n", " pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T\n", " elif type == 'item':\n", " pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)]) \n", " return pred" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "item_prediction = predict(train_data_matrix, item_similarity, type='item')\n", "user_prediction = predict(train_data_matrix, user_similarity, type='user')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation\n", "There are many evaluation metrics but one of the most popular metric used to evaluate accuracy of predicted ratings is *Root Mean Squared Error (RMSE)*. 
\n", "<img src=\"https://latex.codecogs.com/gif.latex?RMSE&space;=\\sqrt{\\frac{1}{N}&space;\\sum&space ;(x_i&space;-\\hat{x_i})^2}\" title=\"RMSE =\\sqrt{\\frac{1}{N} \\sum (x_i -\\hat{x_i})^2}\" />\n", "\n", "You can use the [mean_square_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squ ared_error.html) (MSE) function from `sklearn`, where the RMSE is just the square root of MSE. To read more about different evaluation metrics you can take a look at [this article](http://research.microsoft.com/pubs/115396/EvaluationMetrics.TR.pdf). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since you only want to consider predicted ratings that are in the test dataset, you filter out all other elements in the prediction matrix with `prediction[ground_truth.nonzero()]`. " ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "from math import sqrt\n", "def rmse(prediction, ground_truth):\n", " prediction = prediction[ground_truth.nonzero()].flatten() \n", " ground_truth = ground_truth[ground_truth.nonzero()].flatten()\n", " return sqrt(mean_squared_error(prediction, ground_truth))" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "User-based CF RMSE: 3.135451660158989\n", "Item-based CF RMSE: 3.4593766647252515\n" ] } ], "source": [ "print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))\n", "print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Memory-based algorithms are easy to implement and produce reasonable prediction quality. \n", "The drawback of memory-based CF is that it doesn't scale to real-world scenarios and doesn't address the well-known cold-start problem, that is when new user or new item enters the system. Model-based CF methods are scalable and can deal with higher sparsity level than memory-based models, but also suffer when new users or items that don't have any ratings enter the system. I would like to thank Ethan Rosenthal for his [post](http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/) about Memory-Based Collaborative Filtering. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model-based Collaborative Filtering\n", "\n", "Model-based Collaborative Filtering is based on **matrix factorization (MF)** which has received greater exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems where it can deal better with scalability and sparsity than Memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items. \n", "When you have a very sparse matrix, with a lot of dimensions, by doing matrix factorization you can restructure the user-item matrix into low-rank structure, and you can represent the matrix by the multiplication of two low-rank matrices, where the rows contain the latent vector. 
Memory-based algorithms are easy to implement and produce reasonable prediction quality. The drawback of memory-based CF is that it doesn't scale to real-world scenarios and doesn't address the well-known cold-start problem, that is, when a new user or a new item enters the system. Model-based CF methods are scalable and can deal with higher sparsity levels than memory-based models, but they also suffer when new users or items without any ratings enter the system. I would like to thank Ethan Rosenthal for his [post](http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/) about Memory-Based Collaborative Filtering.

# Model-based Collaborative Filtering

Model-based Collaborative Filtering is based on **matrix factorization (MF)**, which has received great exposure mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems, where it can deal better with scalability and sparsity than memory-based CF. The goal of MF is to learn the latent preferences of users and the latent attributes of items from the known ratings (learn features that describe the characteristics of the ratings), and then predict the unknown ratings through the dot product of the latent features of users and items. When you have a very sparse matrix with a lot of dimensions, matrix factorization lets you restructure the user-item matrix into a low-rank structure: you represent the matrix as the product of two low-rank matrices whose rows contain the latent vectors. You fit this product to approximate your original matrix as closely as possible, which fills in the entries missing from the original matrix.

Let's calculate the sparsity level of the MovieLens dataset:

```python
sparsity = round(1.0 - len(df)/float(n_users*n_items), 3)
print('The sparsity level of MovieLens100K is ' + str(sparsity*100) + '%')
```

```
The sparsity level of MovieLens100K is 93.7%
```

To give an example of the learned latent preferences of the users and items: let's say that for the MovieLens dataset you have the following information: _(user id, age, location, gender, movie id, director, actor, language, year, rating)_. By applying matrix factorization the model learns that the important user features are _age group (under 10, 10-18, 18-30, 30-90)_, _location_ and _gender_, and for movie features it learns that _decade_, _director_ and _actor_ are the most important. Now if you look into the information you have stored, there is no such feature as _decade_, but the model can learn it on its own. The important aspect is that the CF model only uses data (user_id, movie_id, rating) to learn the latent features. If there is little data available, a model-based CF model will predict poorly, since it will be more difficult to learn the latent features.

Models that use both ratings and content features are called **Hybrid Recommender Systems**, where Collaborative Filtering and Content-based Models are combined. Hybrid recommender systems usually show higher accuracy than Collaborative Filtering or Content-based Models on their own: they are capable of addressing the cold-start problem better, since if you don't have any ratings for a user or an item you can use the metadata from the user or item to make a prediction.

### SVD

A well-known matrix factorization method is **Singular Value Decomposition (SVD)**. Collaborative Filtering can be formulated by approximating a matrix `X` using singular value decomposition. The winning team at the Netflix Prize competition used SVD matrix factorization models to produce product recommendations; for more information I recommend reading the articles [Netflix Recommendations: Beyond the 5 stars](http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html) and [Netflix Prize and SVD](http://buzzard.ups.edu/courses/2014spring/420projects/math420-UPS-spring-2014-gower-netflix-SVD.pdf).

The general equation can be expressed as follows:

$$X = U S V^T$$

Given an `m x n` matrix `X`:
* *`U`* is an *`(m x r)`* orthogonal matrix
* *`S`* is an *`(r x r)`* diagonal matrix with non-negative real numbers on the diagonal
* *`V^T`* is an *`(r x n)`* orthogonal matrix

Elements on the diagonal of `S` are known as the *singular values of `X`*.

Matrix *`X`* can be factorized into *`U`*, *`S`* and *`V`*, as the quick numeric check below illustrates.
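As a sanity check of the equation above, you can decompose a small made-up matrix with NumPy and confirm that multiplying the factors back together reproduces it (a minimal sketch, not part of the original notebook):

```python
import numpy as np

# Small made-up ratings matrix, just to verify X = U S V^T numerically
X = np.array([[5., 4., 0.],
              [4., 0., 1.],
              [1., 1., 5.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, np.dot(np.dot(U, np.diag(s)), Vt)))  # True
```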
The *`U`* matrix represents the feature vectors corresponding to the users in the hidden feature space, and the *`V`* matrix represents the feature vectors corresponding to the items in the hidden feature space.

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/kwgsb5g1b/BLOG_CCA_5.png"/>

Now you can make a prediction by taking the dot product of *`U`*, *`S`* and *`V^T`*.

<img class="aligncenter size-thumbnail img-responsive" style="max-width:100%; width: 50%; max-width: none" src="http://s33.postimg.org/ch9lcm6pb/BLOG_CCA_4.png"/>

```python
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Get SVD components from the train matrix. Choose k (the number of latent factors).
u, s, vt = svds(train_data_matrix, k=20)
s_diag_matrix = np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))
```

```
SVD CF RMSE: 2.727093975231784
```
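The choice of `k = 20` above was arbitrary. A quick way to probe its effect (a sketch reusing `train_data_matrix`, `test_data_matrix` and the `rmse` helper defined earlier; nothing here is from the original notebook) is to sweep a few values:

```python
# Sweep the number of latent factors and watch the test RMSE
for k in [5, 10, 20, 50, 100]:
    u, s, vt = svds(train_data_matrix, k=k)
    X_pred = np.dot(np.dot(u, np.diag(s)), vt)
    print('k = {:>3}: RMSE = {:.4f}'.format(k, rmse(X_pred, test_data_matrix)))
```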
Carelessly addressing only the relatively few known entries is highly prone to overfitting, and SVD can be very slow and computationally expensive. More recent work minimizes the squared error by applying alternating least squares or stochastic gradient descent, and uses regularization terms to prevent overfitting. Alternating least squares and stochastic gradient descent methods for CF will be covered in the next tutorials.

Review:

* We have covered how to implement simple **Collaborative Filtering** methods, both memory-based CF and model-based CF.
* **Memory-based models** are based on similarity between items or users, where we use cosine similarity.
* **Model-based CF** is based on matrix factorization, where we use SVD to factorize the matrix.
* Building recommender systems that perform well in cold-start scenarios (where little data is available on new users and items) remains a challenge. The standard collaborative filtering method performs poorly in such settings.

## Looking for more?

If you want to tackle your own recommendation system analysis, check out these data sets. Note: the files are quite large in most cases, and not all the links may stay up to host the data, but the majority of them still work. Or just Google for your own data set!

**Movies Recommendation:**

MovieLens - Movie Recommendation Data Sets http://www.grouplens.org/node/73

Yahoo! - Movie, Music, and Images Ratings Data Sets http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Jester - Joke Ratings Data Sets (Collaborative Filtering Dataset) http://www.ieor.berkeley.edu/~goldberg/jester-data/

Cornell University - Movie-review data for use in sentiment-analysis experiments http://www.cs.cornell.edu/people/pabo/movie-review-data/

**Music Recommendation:**

Last.fm - Music Recommendation Data Sets http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/index.html

Yahoo! - Movie, Music, and Images Ratings Data Sets http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Audioscrobbler - Music Recommendation Data Sets http://www-etud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html

Amazon - Audio CD recommendations http://131.193.40.52/data/

**Books Recommendation:**

Institut für Informatik, Universität Freiburg - Book Ratings Data Sets http://www.informatik.uni-freiburg.de/~cziegler/BX/

**Food Recommendation:**

Chicago Entree - Food Ratings Data Sets http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data

**Healthcare Recommendation:**

Nursing Home - Provider Ratings Data Set http://data.medicare.gov/dataset/Nursing-Home-Compare-Provider-Ratings/mufm-vy8d

Hospital Ratings - Survey of Patients Hospital Experiences http://data.medicare.gov/dataset/Survey-of-Patients-Hospital-Experiences-HCAHPS-/rj76-22dk

**Dating Recommendation:**

www.libimseti.cz - Dating website recommendation (collaborative filtering) http://www.occamslab.com/petricek/data/

**Scholarly Paper Recommendation:**

National University of Singapore - Scholarly Paper Recommendation http://www.comp.nus.edu.sg/~sugiyama/SchPaperRecData.html

# Great Job!