Notebook - Music Recommendation System Reference
The Context:
Why is this problem important to solve?
The objective:
What is the intended goal?
Data Dictionary
The core data is the Taste Profile Subset released by the Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The
first file contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id,
and the play count of users.
song_data
count_data
Data Source
http://millionsongdataset.com/
https://colab.research.google.com/drive/1LK2H4RboKV_Ld0dqyKq2vktDSo5Kxdf-#scrollTo=n5E24_Ec2T9W&printMode=true 1/22
4/9/24, 4:53 PM Music_Recommendation_System_Reference_Notebook_Low_Code (1).ipynb - Colaboratory
In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract
insights from the outputs.
The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.
All the outputs in the notebook are just for reference and can be different if you follow a different approach.
There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular
technique/step. Interested learners can take alternative approaches if they want to explore different techniques.
Mounted at /content/drive
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')
# See top 10 records of song_df data
top_10_song = song_df.head(10)
print("\nTop 10 records of song_df:")
print(top_10_song.to_string(index=False))
Let us check the data types and missing values of each column
# See the info of the count_df data
print("Data types of count_df:")
print(count_df.dtypes)
Think About It: Since the user_id and song_id are encrypted, can they be encoded to numeric features?
Yes. Even though the user_id and song_id columns are encrypted or hashed, they can still be encoded into numeric features. Because these
IDs are categorical (not ordinal) variables, techniques like Label Encoding or One-Hot Encoding can convert them into numeric form.
With a large number of distinct IDs, One-Hot Encoding creates one column per ID, so Label Encoding is usually the more practical choice;
the resulting integers are identifiers only and carry no order.
pip install category_encoders
Collecting category_encoders
Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.9/81.9 kB 1.3 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.25.2)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.2.2)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.11.4)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.14.1)
Requirement already satisfied: pandas>=1.0.5 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (2.0.3)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.5.6)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2024.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.1->category_encoders) (1.16.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (3.4.0)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels>=0.9.0->category_encoders) (24.0)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.3
import pandas as pd
import category_encoders as ce
Encoded DataFrame:
   user_id  song_id_101.0  song_id_102.0  song_id_103.0  play_count   title artist_name
0        1              1              0              0           5  Song A    Artist 1
1        2              0              1              0           3  Song B    Artist 2
2        3              1              0              0           8  Song A    Artist 1
3        1              0              0              1           2  Song C    Artist 3
4        2              0              1              0           6  Song B    Artist 2
Shape of the DataFrame:
(5, 7)
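from sklearn.preprocessing import LabelEncoder
# Work on a copy (assumes df holds the interaction data with raw IDs)
df_encoded = df.copy()
# Apply label encoding for "user_id" (one encoder per column, so each mapping can be inverted later)
user_encoder = LabelEncoder()
df_encoded['user_id'] = user_encoder.fit_transform(df['user_id'])
# Apply label encoding for "song_id"
song_encoder = LabelEncoder()
df_encoded['song_id'] = song_encoder.fit_transform(df['song_id'])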
Think About It: As the data also contains users who have listened to very few songs and vice versa, is it required to filter the data so that it
contains users who have listened to a good count of songs and vice versa?
Filtering the data to keep only users who have listened to a sufficient number of songs (and songs that have been listened to by a
sufficient number of users) is a common practice in recommendation systems. It focuses the model on users and songs that have enough
data to support meaningful recommendations.
Reasons for Filtering:
Data Sparsity: With many users and songs, the interaction data is very sparse, and one-hot encoding the IDs makes it sparser still. Users
may have listened to only a few songs, and songs may have been listened to by only a few users.
Cold Start Problem: Users or songs with very few interactions make accurate recommendations difficult. The system might
struggle to provide good recommendations for new users or unpopular songs.
Filtering Users: You might filter out users who have listened to a very small number of songs, so that the model learns from more
active users whose listening history says more about their preferences.
Filtering Songs: Similarly, you might filter out songs that have been listened to by very few users, so that the remaining songs have
enough listeners to be compared reliably.
Here's how you might filter the data to include only users who have listened to at least a certain number of songs, and songs that have been
listened to by at least a certain number of users:
# We want a song to have been listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120
remove_songs = []
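The filtering steps described above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's own implementation: the function name `filter_interactions`, its parameters, and the toy data are assumptions, and the tiny cutoffs are chosen only so the effect is visible (the notebook uses a song cutoff of 120).

```python
import pandas as pd

def filter_interactions(df, min_songs_per_user, min_users_per_song):
    """Keep users with enough distinct songs, then songs with enough distinct users.

    Hypothetical helper: column names follow the data dictionary
    (user_id, song_id, play_count); the cutoffs are caller-supplied.
    """
    # Count distinct songs per user and keep sufficiently active users
    songs_per_user = df.groupby("user_id")["song_id"].nunique()
    active_users = songs_per_user[songs_per_user >= min_songs_per_user].index
    df = df[df["user_id"].isin(active_users)]

    # Count distinct users per song (after the user filter) and keep well-covered songs
    users_per_song = df.groupby("song_id")["user_id"].nunique()
    kept_songs = users_per_song[users_per_song >= min_users_per_song].index
    return df[df["song_id"].isin(kept_songs)]

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "song_id": [10, 11, 12, 10, 11, 10],
    "play_count": [5, 1, 2, 3, 1, 4],
})
filtered = filter_interactions(toy, min_songs_per_user=2, min_users_per_song=2)
print(filtered)
```

Note that the song filter runs on the already user-filtered data, so the order of the two steps can change which rows survive.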
# Display the total number of unique user IDs
print("Total Number of Unique User IDs:", total_unique_users)
Let's find out about the most interacted songs and interacted users
Most interacted songs
count_songs = df_final.groupby('year').count()['title']
count = pd.DataFrame(count_songs)
count.tail()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-42-9a68e85aacb9> in <cell line: 1>()
----> 1 count_songs = df_final.groupby('year').count()['title']
2
3 count = pd.DataFrame(count_songs)
4
5 count.drop(count.index[0], inplace = True)
2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
983 in_axis, level, gpr = False, gpr, None
984 else:
--> 985 raise KeyError(gpr)
986 elif isinstance(gpr, Grouper) and gpr.key is not None:
987 # Add key to exclusions
KeyError: 'year'
sns.barplot(x = count.index,
            y = 'title',
            data = count,
            estimator = np.median)
Think About It: What other insights can be drawn using exploratory data analysis?
Now that we have explored the data, let's apply different algorithms to build recommendation systems.
Now, let's create a function to find the top n songs for a recommendation based on the average play count of each song. We can also add a
threshold for the minimum number of play counts a song needs to be considered for recommendation.
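One way such a popularity-based function might look is sketched below. The name `top_n_songs`, the parameter `min_plays`, and the toy data are assumptions for illustration; here the threshold is interpreted as the number of interaction rows recorded for a song, which may differ from the notebook's definition.

```python
import pandas as pd

def top_n_songs(df, n, min_plays):
    """Rank songs by average play count, ignoring songs with too few interactions."""
    # Per-song average play count and number of recorded interactions
    stats = df.groupby("song_id")["play_count"].agg(avg_count="mean", play_freq="count")
    # Apply the minimum-interaction threshold before ranking
    eligible = stats[stats["play_freq"] >= min_plays]
    return eligible.sort_values("avg_count", ascending=False).head(n).index.tolist()

# Toy data (made up): song 12 has the highest average but only one interaction
toy_counts = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 1],
    "song_id": [10, 10, 10, 11, 11, 12],
    "play_count": [5, 3, 4, 1, 1, 9],
})
top_songs = top_n_songs(toy_counts, n=2, min_plays=2)
print(top_songs)
```

The threshold keeps a song with a single enthusiastic listener from outranking songs whose average is backed by more users.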
# Install the surprise package using pip. Uncomment and run the below code to do the same
# !pip install surprise
# Import necessary libraries
# This class is used to parse a file containing play_counts, data should be in structure - user; item; play_count
from surprise.reader import Reader
Think About It: Which metric should be used for this problem to compare different models?
# The function to calculate the RMSE, precision@k, recall@k, and F_1 score
def precision_recall_at_k(model, k = 30, threshold = 1.5):
"""Return precision and recall at k metrics for each user"""
precisions = dict()
recalls = dict()
for uid, user_ratings in user_est_true.items():
accuracy.rmse(predictions)
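Because only part of the function is shown above, the following self-contained sketch illustrates how these four metrics can be computed. It is not the notebook's function: it takes a plain list of `(user_id, song_id, true_play_count, estimated_play_count)` tuples instead of a fitted model, and the name `evaluate_predictions` and the toy predictions are assumptions.

```python
from collections import defaultdict
from math import sqrt

def evaluate_predictions(predictions, k=30, threshold=1.5):
    """Return mean precision@k, mean recall@k, F1, and RMSE for a list of predictions."""
    user_est_true = defaultdict(list)
    squared_errors = []
    for uid, _iid, true_r, est in predictions:
        user_est_true[uid].append((est, true_r))
        squared_errors.append((true_r - est) ** 2)

    precisions, recalls = {}, {}
    for uid, user_ratings in user_est_true.items():
        # Sort each user's songs by estimated play count, highest first
        user_ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = user_ratings[:k]
        # Relevant = true play count at or above the threshold
        n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
        # Recommended = estimated play count at or above the threshold, within the top k
        n_rec_k = sum(est >= threshold for est, _ in top_k)
        n_rel_and_rec_k = sum(true_r >= threshold and est >= threshold for est, true_r in top_k)
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel else 0.0

    precision = sum(precisions.values()) / len(precisions)
    recall = sum(recalls.values()) / len(recalls)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    rmse = sqrt(sum(squared_errors) / len(squared_errors))
    return {"precision": precision, "recall": recall, "f1": f1, "rmse": rmse}

# Toy predictions (made up): (user_id, song_id, true_play_count, estimated_play_count)
toy_predictions = [
    ("a", 1, 2, 2.0), ("a", 2, 1, 1.8), ("a", 3, 3, 1.0),
    ("b", 1, 1, 1.6), ("b", 2, 2, 1.2),
]
metrics = evaluate_predictions(toy_predictions, k=2, threshold=1.5)
print(metrics)
```

Raising the threshold shrinks both the set of relevant songs and the set of recommended songs, so precision and recall can move in either direction depending on the data.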
Think About It: In the function precision_recall_at_k above, the threshold value used is 1.5. How are precision and recall affected by
changing the threshold? What is the intuition behind using the threshold value of 1.5?
Think About It: How would changing the test size change the results and outputs?
# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(_________)
# Predicting play_count for a sample user with a song not-listened by the user
sim_user_user.predict(____, _____, verbose = True) # Use user_id 6958 and song_id 3232
Now, let's try to tune the model and see if we can improve the model performance.
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                              'user_based': [True], "min_support": [2, 4]}
              }
# Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui = 2
# Predict the play count for a song that is not listened to by the user (with user_id 6958)
Think About It: Along with making predictions on listened and unknown songs, can we get the 5 nearest neighbors (most similar) to a certain song?
# Use inner id 0
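To make the idea of nearest neighbors concrete, here is a small NumPy sketch that finds the most similar rows of a matrix by cosine similarity. It is only an illustration of the concept: the function `nearest_items` and the toy matrix are assumptions, and the notebook itself would more likely query the fitted Surprise model (KNN-style algorithms there expose `get_neighbors(iid, k)`, which works on inner ids).

```python
import numpy as np

def nearest_items(item_vectors, item_index, k):
    """Return the indices of the k rows most cosine-similar to row item_index."""
    norms = np.linalg.norm(item_vectors, axis=1)
    norms[norms == 0] = 1.0              # avoid division by zero for all-zero rows
    unit = item_vectors / norms[:, None]
    sims = unit @ unit[item_index]       # cosine similarity to the query row
    sims[item_index] = -np.inf           # exclude the query row itself
    return np.argsort(-sims)[:k].tolist()

# Toy matrix (made up): one row per song, one column per user, entries are play counts
toy_items = np.array([
    [5, 3, 0],
    [4, 2, 0],
    [0, 0, 4],
    [1, 0, 5],
], dtype=float)
neighbors = nearest_items(toy_items, item_index=0, k=2)
print(neighbors)
```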
def get_recommendations(data, user_id, top_n, algo):
    # Extracting those song ids which the user_id has not listened to yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    # Looping through each of the song ids which user_id has not interacted with yet
    for item_id in non_interacted_products:
        # Predicting the play count for those non-listened song ids by this user
        est = ___________________
    return recommendations[:top_n] # Returning top n highest predicted play count songs for this user
# Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine
recommendations =___________________________
# Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(_________________)
return ranked_songs
Think About It: In the above function, a quantity 1/np.sqrt(n) is subtracted to correct the predicted play_count. What is the intuition behind it? Is
it also possible to add this quantity instead of subtracting it?
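Since the ranking function itself is not shown in full, the sketch below illustrates the correction on toy numbers. It assumes n is the number of times a song was played (its play frequency); the function name `rank_with_correction`, the column names, and the data are hypothetical. One common reading of the correction is that predictions for songs with few plays are less reliable, so they are penalized more heavily.

```python
import numpy as np
import pandas as pd

def rank_with_correction(recommendations, play_freq):
    """Rank (song_id, predicted_play_count) pairs after subtracting 1/sqrt(n).

    play_freq: Series mapping song_id -> n, the assumed play frequency of the song.
    """
    ranked = pd.DataFrame(recommendations, columns=["song_id", "predicted_play_count"])
    ranked["play_freq"] = ranked["song_id"].map(play_freq)
    # The penalty shrinks as n grows: 1/sqrt(4) = 0.5, 1/sqrt(400) = 0.05
    ranked["corrected_play_count"] = (
        ranked["predicted_play_count"] - 1 / np.sqrt(ranked["play_freq"])
    )
    return ranked.sort_values("corrected_play_count", ascending=False).reset_index(drop=True)

# Toy inputs (made up): song 11 has the highest raw prediction but very few plays
toy_recs = [(10, 2.0), (11, 2.1), (12, 1.9)]
toy_freq = pd.Series({10: 100, 11: 4, 12: 400})
ranked_toy = rank_with_correction(toy_recs, toy_freq)
print(ranked_toy)
```

Adding the quantity instead would favor rarely played songs, which might be a deliberate choice when the goal is to surface less-known music.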
# Apply the item-item similarity collaborative filtering model with random_state = 1 and evaluate the model performance
# Predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user
# Predict the play count for a user that has not listened to the song (with song_id 1671)
# Extract the combination of parameters that gave the best RMSE score
Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the
list of hyperparameters here.
# Predict the play_count by a user(user_id 6958) for the song (song_id 1671)
# Predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user
# Making top 5 recommendations for user_id 6958 with item_item_similarity-based recommendation engine
recommendations =
# Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(_______________)
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui = 2
# Making a prediction for the user who has not listened to the song (song_id 3232)
Improving matrix factorization based recommendation system by tuning its hyperparameters
# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}
# Fitting data
Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the
available hyperparameters here.
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 1671
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
# Making prediction for user_id 6958 and song_id 1671
# Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user
# Fitting data
Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the
available hyperparameters here.
# Using co_clustering_optimized model to recommend for userId 6958 and song_id 1671
# Use Co_clustering based optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
# Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm
clustering_recommendations = _______
df_small = df_final
# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"
# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data
indices[ : 5]
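The two steps described in the comments above could be carried out roughly as follows. This is a sketch on made-up rows: the toy DataFrame stands in for df_final, and the de-duplication by title and the way `indices` is built are assumptions, since those cells are not shown here.

```python
import pandas as pd

# Toy stand-in for df_final (made up for illustration)
df_small = pd.DataFrame({
    "user_id": [1, 2, 3],
    "song_id": [101, 102, 101],
    "play_count": [5, 3, 8],
    "title": ["Song A", "Song B", "Song A"],
    "release": ["Album X", "Album Y", "Album X"],
    "artist_name": ["Artist 1", "Artist 2", "Artist 1"],
})

# Concatenate the "title", "release", "artist_name" columns into a new "text" column
df_small["text"] = df_small["title"] + " " + df_small["release"] + " " + df_small["artist_name"]

# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text'
df_small = df_small[["user_id", "song_id", "play_count", "title", "text"]]

# Assumption: keep one row per title and index by title so songs can be looked up by name
df_small = df_small.drop_duplicates(subset=["title"]).set_index("title")
indices = pd.Series(df_small.index)
print(indices[:5])
```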
# Importing necessary packages to work with text data
import re
import nltk
from nltk.tokenize import word_tokenize
# Download punkt library
nltk.download("punkt")
# Download stopwords library
nltk.download("stopwords")
# Download wordnet
nltk.download("wordnet")
We will create a function to pre-process the text data:
# Function to tokenize the text
def tokenize(text):
    text = re.sub(r"[^a-zA-Z]"," ", text.lower())
    tokens = word_tokenize(text)
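The pre-processing function above is cut off in this copy, so the following dependency-free sketch shows the general shape such a function might take. It is not the notebook's code: it splits on whitespace instead of calling NLTK's `word_tokenize`, skips lemmatization, and uses a tiny hypothetical `STOP_WORDS` set in place of the NLTK stopword corpus.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}

def tokenize_sketch(text):
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]

tokens = tokenize_sketch("The Sound of Silence (1964)")
print(tokens)
```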