Music Recommendation System - Reference Notebook (Low Code)


Music Recommendation System


Problem Definition

The Context:
Why is this problem important to solve?

The Objective:
What is the intended goal?

The Key Questions:
What are the key questions that need to be answered?

The Problem Formulation:
What is it that we are trying to solve using data science?
Data Dictionary

The core data is the Taste Profile Subset released by The Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The first file contains details about each song: the song id, title, release (album), artist name, and year of release. The second file contains the user id, song id, and the play count for each user-song pair.

song_data

song_id - A unique id given to every song

title - Title of the song

release - Name of the released album

artist_name - Name of the artist

year - Year of release

count_data

user_id - A unique id given to the user

song_id - A unique id given to the song

play_count - Number of times the song was played by the user

Data Source
http://millionsongdataset.com/


Important Notes


This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the rubric shared for the Milestone. Unlike previous courses, it does not follow the pattern of graded questions in different sections. This notebook gives you a direction on the steps that need to be taken to get a feasible solution to the problem. Please note that this is just one way of doing this; there can be other 'creative' ways to solve the problem, and we encourage you to explore them as an 'optional' exercise.

In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract
insights from the outputs.

The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.

All the outputs in the notebook are just for reference and can be different if you follow a different approach.

There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular
technique/step. Interested learners can take alternative approaches if they want to explore different techniques.

Importing Libraries and the Dataset


# Mounting the drive
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive

from google.colab import files


uploaded = files.upload()

song_data.csv (text/csv) - 82246673 bytes, last modified: 3/19/2024 - 100% done
Saving song_data.csv to song_data (1).csv

from google.colab import files


uploaded = files.upload()

count_data.csv (text/csv) - 139003826 bytes, last modified: 3/19/2024 - 100% done
Saving count_data.csv to count_data.csv

# Used to ignore the warnings given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic Python libraries for numeric and dataframe computations
import numpy as np
import pandas as pd

# Basic library for data visualization
import matplotlib.pyplot as plt

# Slightly more advanced library for data visualization
import seaborn as sns

# To compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# A dictionary that does not raise a KeyError for missing keys
from collections import defaultdict

# A performance metric from sklearn
from sklearn.metrics import mean_squared_error

Load the dataset


count_df = pd.read_csv('count_data.csv')
song_df = pd.read_csv('song_data.csv')

Understanding the data by viewing a few observations


# See top 10 records of count_df data
top_10_count = count_df.head(10)

print("\nTop 10 records of count_df:")


print(top_10_count.to_string(index=False))

Top 10 records of count_df:


Unnamed: 0 user_id song_id play_count
0 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1
1 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2
2 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1
3 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1
4 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1
5 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODDNQT12A6D4F5F7E 5
6 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODXRTY12AB0180F3B 1
7 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFGUAY12AB017B0A8 1
8 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFRQTD12A81C233C0 1
9 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOHQWYZ12A6D4FA701 1

# See the top 10 records of song_df
top_10_song = song_df.head(10)
print("\nTop 10 records of song_df:")
print(top_10_song.to_string(index=False))

Top 10 records of song_df:


song_id title release artist_name year
SOQMMHC12AB0180CB8 Silent Night Monster Ballads X-Mas Faster Pussy cat 2003
SOVFVAK12A8C1350D9 Tanssi vaan Karkuteillä Karkkiautomaatti 1995
SOGTUKN12AB017F4F1 No One Could Ever Butter Hudson Mohawke 2006
SOBNYVR12A8C13558C Si Vos Querés De Culo Yerba Brava 2003
SOHSBXH12A8C13B0DF Tangle Of Aspens Rene Ablaze Presents Winter Sessions Der Mystic 0
SOZVAPQ12A8C13B63C Symphony No. 1 G minor "Sinfonie Serieuse"/Allegro con energia Berwald: Symphonies Nos. 1/2/3/4 David Montgomery 0
SOQVRHI12A6D4FB2D7 We Have Got Love Strictly The Best Vol. 34 Sasha / Turbulence 0
SOEYRFT12AB018936C 2 Da Beat Ch'yall Da Bomb Kris Kross 1993
SOPMIYT12A6D4F851E Goodbye Danny Boy Joseph Locke 0
SOJCFMH12A8C13B0C2 Mama_ mama can't you see ? March to cadence with the US marines The Sun Harbor's Chorus-Documentary Recordings 0

Let us check the data types and missing values of each column
# See the info of the count_df data
print("Data types of count_df:")
print(count_df.dtypes)

Data types of count_df:


Unnamed: 0 int64
user_id object
song_id object
play_count int64
dtype: object

# See the info of the song_df data


print("\nData types of song_df:")
print(song_df.dtypes)

Data types of song_df:


song_id object
title object
release object
artist_name object
year int64
dtype: object

Observations and Insights:_____

Think About It: As the user_id and song_id are encrypted, can they be encoded to numeric features?

Yes. Even though the user_id and song_id columns are encrypted (hashed), they can still be encoded into numeric features. However, since these IDs are categorical variables (not ordinal), techniques like Label Encoding or One-Hot Encoding should be used to convert them into numeric form.

pip install category_encoders

Collecting category_encoders
Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.9/81.9 kB 1.3 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.25.2)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.2.2)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.11.4)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.14.1)
Requirement already satisfied: pandas>=1.0.5 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (2.0.3)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.5.6)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2024.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.1->category_encoders) (1.16.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (3.4.0)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels>=0.9.0->category_encoders) (24.0)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.3

import pandas as pd
import category_encoders as ce

# Assuming count_df and song_df are already defined

# Step 1: Merge count_df and song_df
df = pd.merge(count_df, song_df, on='song_id', how='left')

# Display the top records of the merged DataFrame
print("Merged DataFrame:")
print(df.head())

# Step 2: Apply One-Hot Encoding using category_encoders.OneHotEncoder
encoder = ce.OneHotEncoder(cols=['song_id'], use_cat_names=True)
df_encoded = encoder.fit_transform(df)

# Drop the original 'song_id' column if it is still present
if 'song_id' in df_encoded.columns:
    df_encoded = df_encoded.drop(['song_id'], axis=1)

# Display the top records of the encoded DataFrame
print("\nEncoded DataFrame:")
print(df_encoded.head())

print("Shape of the DataFrame:")
print(df_encoded.shape)

Merged DataFrame:


user_id song_id play_count title artist_name
0 1 101 5 Song A Artist 1
1 2 102 3 Song B Artist 2
2 3 101 8 Song A Artist 1
3 1 103 2 Song C Artist 3
4 2 102 6 Song B Artist 2

Encoded DataFrame:
user_id song_id_101.0 song_id_102.0 song_id_103.0 play_count title \
0 1 1 0 0 5 Song A
1 2 0 1 0 3 Song B

2 3 1 0 0 8 Song A
3 1 0 0 1 2 Song C
4 2 0 1 0 6 Song B

artist_name
0 Artist 1
1 Artist 2
2 Artist 1
3 Artist 3
4 Artist 2
Shape of the DataFrame:
(5, 7)

# Apply label encoding for "user_id" and "song_id"


# Label encoding code
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Apply label encoding for "user_id"
df_encoded['user_id'] = le.fit_transform(df['user_id'])
# Apply label encoding for "song_id"
df_encoded['song_id'] = le.fit_transform(df['song_id'])

# Get the column containing the users
users = df_encoded.user_id

# Create a dictionary from users to their number of songs
ratings_count = dict()

for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1

print("Shape of the DataFrame:")
print(df_encoded.shape)

Shape of the DataFrame:


(5, 8)

# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90

# Create a list of users who need to be removed
remove_users = []

for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df_encoded = df_encoded.loc[~df_encoded.user_id.isin(remove_users)]

print("Shape of the DataFrame:")
print(df_encoded.shape)


Shape of the DataFrame:


(0, 8)

Think About It: As the data also contains users who have listened to very few songs, and songs that have been listened to by very few users, is it required to filter the data so that it contains only users who have listened to a good count of songs, and songs that have been listened to by a good count of users?

Filtering the data to include only users who have listened to a good count of songs (and, likewise, only songs that have been listened to by a good count of users) is a common practice in recommendation systems. This filtering helps in focusing on users and songs that have enough data to make meaningful recommendations.

Reasons for filtering:

Data sparsity: The user-item matrix is very sparse when there are many users and songs. Users may have listened to only a few songs, and songs may have been listened to by only a few users.

Cold start problem: Users or songs with very few interactions can pose challenges in making accurate recommendations. The system might struggle to provide good recommendations for new users or unpopular songs.

Filtering users: You might want to filter out users who have listened to a very small number of songs. This way, you focus on users who are more active and have more diverse preferences.

Filtering songs: Similarly, you might filter out songs that have been listened to by very few users. This helps in recommending popular or trending songs to users.

Here's how you might filter the data to include only users who have listened to at least a certain number of songs, and songs that have been listened to by at least a certain number of users:

# Get the column containing the songs
songs = df_encoded.song_id

# Create a dictionary from songs to their number of users
ratings_count = dict()

for song in songs:
    # If we already have the song, just add 1 to its rating count
    if song in ratings_count:
        ratings_count[song] += 1
    # Otherwise, set its rating count to 1
    else:
        ratings_count[song] = 1

print("Shape of the DataFrame:")
print(df_encoded.shape)

Shape of the DataFrame:


(0, 8)


# We want a song to be listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120

remove_songs = []

for song, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_songs.append(song)

df_final = df_encoded.loc[~df_encoded.song_id.isin(remove_songs)]

print("Filtered DataFrame:")
print(df_final)

Filtered DataFrame:


Empty DataFrame
Columns: [user_id, song_id_101.0, song_id_102.0, song_id_103.0, play_count, title, artist_name, song_id]
Index: []

# Drop records with play_count more than(>) 5


# Create a boolean mask where play_count is more than 5
mask_play_count = df_final['play_count'] > 5

# Use the mask to filter out rows where play_count > 5


df_final_filtered = df_final[~mask_play_count]

# Display the filtered DataFrame based on 'play_count' criteria


print("\nFiltered DataFrame based on 'play_count' less than or equal to 5:")
print(df_final_filtered.head())

Filtered DataFrame based on 'play_count' less than or equal to 5:


Empty DataFrame
Columns: [user_id, song_id, play_count, title, artist_name]
Index: []

# Check the shape of the data


print("Shape of the DataFrame:")
print(df_final_filtered.shape)

Shape of the DataFrame:


(0, 5)

Exploratory Data Analysis

Let's check the total number of unique users, songs, and artists in the data

Total number of unique user ids

# Find the total number of unique user IDs
total_unique_users = df_final['user_id'].nunique()

# Display the total number of unique user IDs
print("Total Number of Unique User IDs:", total_unique_users)

Total Number of Unique User IDs: 0

Total number of unique song id

# Find the total number of unique song_id
total_unique_songs = song_df['song_id'].nunique()

# Display the total number of unique song_id
print("Total Number of Unique song_id:", total_unique_songs)

Total Number of Unique song_id: 3

Total number of unique artists

# Find the total number of unique artist_name
total_unique_artists = song_df['artist_name'].nunique()

# Display the total number of unique artist_name
print("Total Number of Unique artists (artist_name):", total_unique_artists)

Total Number of Unique artists (artist_name): 3

Observations and Insights:__

Let's find out about the most interacted songs and most interacted users

Most interacted songs

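One possible way to look at this, as a sketch assuming df_final holds the filtered interactions built above:

# Number of interactions per song; the ten most interacted songs
df_final['song_id'].value_counts().head(10)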

Most interacted users

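And similarly for users:

# Number of interactions per user; the ten most interacted users
df_final['user_id'].value_counts().head(10)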

Observations and Insights:___

Songs played in a year

count_songs = df_final.groupby('year').count()['title']

count = pd.DataFrame(count_songs)

count.drop(count.index[0], inplace = True)

count.tail()

---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-42-9a68e85aacb9> in <cell line: 1>()
----> 1 count_songs = df_final.groupby('year').count()['title']
2
3 count = pd.DataFrame(count_songs)
4
5 count.drop(count.index[0], inplace = True)

2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
983 in_axis, level, gpr = False, gpr, None
984 else:
--> 985 raise KeyError(gpr)
986 elif isinstance(gpr, Grouper) and gpr.key is not None:
987 # Add key to exclusions

KeyError: 'year'


# Create the plot

# Set the figure size


plt.figure(figsize = (30, 10))

sns.barplot(x = count.index,
y = 'title',
data = count,
estimator = np.median)

# Set the y label of the plot


plt.ylabel('number of titles played')

# Show the plot


plt.show()

Observations and Insights:__

Think About It: What other insights can be drawn using exploratory data analysis?

Important Insights from EDA

What are the most important observations and insights from the data based on the EDA performed?

Now that we have explored the data, let's apply different algorithms to build recommendation systems.


Building various models

Popularity-Based Recommendation Systems

Let's take the average and the count of play counts of the songs, and build a popularity-based recommendation system using the average play count.

# Calculating average play_count
average_count = df_final.groupby('song_id')['play_count'].mean()

# Calculating the frequency a song is played
play_freq = df_final.groupby('song_id')['play_count'].count()

# Making a dataframe with the average_count and play_freq
final_play = pd.DataFrame({'avg_count': average_count, 'play_freq': play_freq})

# Let us see the first five records of the final_play dataset
final_play.head()

Now, let's create a function to find the top n songs for recommendation based on the average play count of each song. We can also add a threshold for a minimum number of play counts for a song to be considered for recommendation.

# Build the function to find top n songs

# Recommend top 10 songs using the function defined above
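A minimal sketch of such a function, assuming the final_play dataframe built above (indexed by song_id, with avg_count and play_freq columns); the function name and the threshold of 100 interactions are illustrative:

def top_n_songs(data, n, min_interaction = 100):
    # Consider only songs with at least min_interaction play events
    recommendations = data[data['play_freq'] > min_interaction]

    # Sort the remaining songs by average play count, highest first
    recommendations = recommendations.sort_values(by = 'avg_count', ascending = False)

    # Return the ids of the top n songs
    return recommendations.index[:n]

# Recommend the top 10 songs with at least 100 interactions
list(top_n_songs(final_play, 10, 100))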

User-User Similarity-Based Collaborative Filtering

To build the user-user similarity-based and subsequent models, we will use the "surprise" library.

# Install the surprise package using pip. Uncomment and run the below code to do the same
# !pip install surprise

# Import necessary libraries

# To compute the accuracy of models
from surprise import accuracy

# This class is used to parse a file containing play_counts; data should be in the structure - user; item; play_count
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the data into train and test datasets
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation systems
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization-based recommendation systems
from surprise.prediction_algorithms.matrix_factorization import SVD

# For implementing KFold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation systems
from surprise import CoClustering

Some useful functions

Below is the function to calculate precision@k, recall@k, RMSE, and F_1 score@k to evaluate the model performance.

Think About It: Which metric should be used for this problem to compare different models?

# The function to calculate the RMSE, precision@k, recall@k, and F_1 score
def precision_recall_at_k(model, k = 30, threshold = 1.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user
    user_est_true = defaultdict(list)

    # Making predictions on the test data
    predictions = model.test(testset)

    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()

    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x: x[0], reverse = True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    # Mean of all the predicted precisions
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)

    # Mean of all the predicted recalls
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)

    # Compute and print the RMSE
    accuracy.rmse(predictions)

    # Print the overall precision
    print('Precision: ', precision)

    # Print the overall recall
    print('Recall: ', recall)

    # Compute and print the F_1 score
    print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))


Think About It: In the function precision_recall_at_k above, the threshold value used is 1.5. How are precision and recall affected by changing the threshold? What is the intuition behind using a threshold value of 1.5?

# Instantiating Reader with the expected rating scale
reader = Reader(rating_scale = (0, 5))

# Loading the dataset; take only "user_id", "song_id", and "play_count"
data = Dataset.load_from_df(df_final[['user_id', 'song_id', 'play_count']], reader)

# Splitting the data into train and test datasets
trainset, testset = train_test_split(data, test_size = 0.4, random_state = 42)

Think About It: How changing the test size would change the results and outputs?

# Build the default user-user similarity model
sim_options = {'name': 'cosine',  # cosine similarity is one common choice here
               'user_based': True}

# KNN algorithm is used to find desired similar items; random_state = 1 for reproducibility
sim_user_user = KNNBasic(sim_options = sim_options, random_state = 1)

# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k = 30
precision_recall_at_k(sim_user_user)

Observations and Insights:_____

# Predicting play_count for a sample user with a listened song (user_id 6958, song_id 1671)
sim_user_user.predict(6958, 1671, r_ui = 2, verbose = True)

# Predicting play_count for a sample user with a song not listened to by the user (user_id 6958, song_id 3232)
sim_user_user.predict(6958, 3232, verbose = True)

Observations and Insights:_____

Now, let's try to tune the model and see if we can improve the model performance.

# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                              'user_based': [True], "min_support": [2, 4]}
              }

# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# Fitting the data; use the entire data for GridSearch
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# Train the best model found in the above gridsearch
sim_user_user_optimized = KNNBasic(**gs.best_params['rmse'], random_state = 1)
sim_user_user_optimized.fit(trainset)
precision_recall_at_k(sim_user_user_optimized)

Observations and Insights:_____

# Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui = 2

# Predict the play count for a song that is not listened to by the user (with user_id 6958)
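A possible completion, assuming the tuned model from the grid search above is stored as sim_user_user_optimized:

# Play_count prediction for a song the user has listened to
sim_user_user_optimized.predict(6958, 1671, r_ui = 2, verbose = True)

# Play_count prediction for a song the user has not listened to
sim_user_user_optimized.predict(6958, 3232, verbose = True)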

Observations and Insights:__

Think About It: Along with making predictions on listened and unknown songs can we get 5 nearest neighbors (most similar) to a certain song?

# Use inner id 0
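Yes - surprise exposes get_neighbors for KNN-based models. Note that inner ids are surprise's internal indices (convert raw ids with trainset.to_inner_iid / to_inner_uid), and that with user_based = True the neighbors are similar users, while an item-based model returns similar songs:

# Five nearest neighbors of the entity with inner id 0
sim_user_user_optimized.get_neighbors(0, k = 5)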

Below we will be implementing a function where the input parameters are:

data: A song dataset

user_id: A user id against which we want the recommendations

top_n: The number of songs we want to recommend

algo: The algorithm we want to use for predicting the play_count

The output of the function is a set of top_n items recommended for the given user_id based on the given algorithm

def get_recommendations(data, user_id, top_n, algo):

    # Creating an empty list to store the recommended song ids
    recommendations = []

    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot_table(index = 'user_id', columns = 'song_id', values = 'play_count')

    # Extracting those song ids which the user_id has not interacted with yet
    non_interacted_songs = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()

    # Looping through each of the song ids which user_id has not interacted with yet
    for item_id in non_interacted_songs:

        # Predicting the play_count for this song by this user
        est = algo.predict(user_id, item_id).est

        # Appending the song id and the predicted play_count
        recommendations.append((item_id, est))

    # Sorting the predicted play_counts in descending order
    recommendations.sort(key = lambda x: x[1], reverse = True)

    return recommendations[:top_n]  # Returning the top n songs with the highest predicted play_count for this user

# Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine
recommendations = get_recommendations(df_final, 6958, 5, sim_user_user_optimized)

# Building the dataframe for the above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns = ['song_id', 'predicted_ratings'])

Observations and Insights:__

Correcting the play_counts and Ranking the above songs


def ranking_songs(recommendations, final_rating):
    # Sort the songs based on play counts (play_freq)
    ranked_songs = final_rating.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending = False)[['play_freq']].reset_index()

    # Merge with the recommended songs to get the predicted play_count
    ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns = ['song_id', 'predicted_ratings']), on = 'song_id', how = 'inner')

    # Rank the songs based on corrected play_counts
    ranked_songs['corrected_ratings'] = ranked_songs['predicted_ratings'] - 1 / np.sqrt(ranked_songs['play_freq'])

    # Sort the songs based on corrected play_counts
    ranked_songs = ranked_songs.sort_values('corrected_ratings', ascending = False)

    return ranked_songs

Think About It: In the above function to correct the predicted play_count a quantity 1/np.sqrt(n) is subtracted. What is the intuition behind it? Is
it also possible to add this quantity instead of subtracting?


# Applying the ranking_songs function on the final_play data
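For example, with the recommendations list built above:

# Rank the recommended songs, correcting the predicted play_counts by play frequency
ranking_songs(recommendations, final_play)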

Observations and Insights:__

Item-Item Similarity-Based Collaborative Filtering Recommendation Systems

# Apply the item-item similarity collaborative filtering model with random_state = 1 and evaluate the model performance
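A minimal sketch mirroring the user-user model above, with user_based set to False ('cosine' is again just one choice of similarity):

# Declaring the similarity options for an item-item model
sim_options = {'name': 'cosine',
               'user_based': False}

# Using KNNBasic to find similar items, with random_state = 1 for reproducibility
sim_item_item = KNNBasic(sim_options = sim_options, random_state = 1)

# Training the algorithm on the train set
sim_item_item.fit(trainset)

# Computing RMSE, precision@k, recall@k, and F_1 score on the test set
precision_recall_at_k(sim_item_item)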

Observations and Insights:__

# Predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user

# Predict the play count for a song (song_id 3232) that is not listened to by the user (user_id 6958)
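Assuming the baseline item-item model above:

# Prediction for a song (song_id 1671) the user has listened to
sim_item_item.predict(6958, 1671, r_ui = 2, verbose = True)

# Prediction for a song (song_id 3232) the user has not listened to
sim_item_item.predict(6958, 3232, verbose = True)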

Observations and Insights:__

# Apply grid search for enhancing model performance

# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
                              'user_based': [False], "min_support": [2, 4]}
              }

# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# Fitting the data
gs.fit(data)

# Find the best RMSE score
print(gs.best_score['rmse'])

# Extract the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the
list of hyperparameters here.

# Apply the best model found in the grid search
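One way to do this is to feed the best grid-search parameters straight back into the algorithm (the name sim_item_item_optimized is illustrative):

# Build the item-item model with the best parameters found by the grid search
sim_item_item_optimized = KNNBasic(**gs.best_params['rmse'], random_state = 1)

# Train on the train set and evaluate on the test set
sim_item_item_optimized.fit(trainset)
precision_recall_at_k(sim_item_item_optimized)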


Observations and Insights:__

# Predict the play_count by a user(user_id 6958) for the song (song_id 1671)

# Predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user
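Assuming the tuned model above:

sim_item_item_optimized.predict(6958, 1671, r_ui = 2, verbose = True)

sim_item_item_optimized.predict(6958, 3232, verbose = True)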

Observations and Insights:__

# Find five most similar items to the item with inner id 0
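As before, get_neighbors can be used; since this model is item-based, the neighbors are the most similar songs:

# Five most similar songs to the song with inner id 0
sim_item_item_optimized.get_neighbors(0, k = 5)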

# Making top 5 recommendations for user_id 6958 with the item-item similarity-based recommendation engine
recommendations = get_recommendations(df_final, 6958, 5, sim_item_item_optimized)

# Building the dataframe for the above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(recommendations, columns = ['song_id', 'predicted_play_count'])

# Applying the ranking_songs function
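For instance:

ranking_songs(recommendations, final_play)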

Observations and Insights:_____

Model-Based Collaborative Filtering - Matrix Factorization

Model-based collaborative filtering is a personalized recommendation system in which the recommendations are based on the past behavior of the user and are not dependent on any additional information. We use latent features to find recommendations for each user.

# Build baseline model using svd

# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui = 2

# Making a prediction for the user who has not listened to the song (song_id 3232)
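A minimal sketch of the baseline model (the name svd is illustrative):

# Build the baseline SVD model, with random_state = 1 for reproducibility
svd = SVD(random_state = 1)

# Train on the train set and evaluate on the test set
svd.fit(trainset)
precision_recall_at_k(svd)

# Prediction for a song the user has listened to
svd.predict(6958, 1671, r_ui = 2, verbose = True)

# Prediction for a song the user has not listened to
svd.predict(6958, 3232, verbose = True)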

Improving matrix factorization based recommendation system by tuning its hyperparameters

# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# Perform 3-fold grid-search cross-validation
gs = GridSearchCV(SVD, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)

# Fitting data
gs.fit(data)

# Best RMSE score
print(gs.best_score['rmse'])

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the
available hyperparameters here.

# Building the optimized SVD model using optimal hyperparameters
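One way to do this is to pass the best grid-search parameters straight into SVD (the name svd_optimized is illustrative):

# Build the optimized SVD model using the best hyperparameters found above
svd_optimized = SVD(**gs.best_params['rmse'], random_state = 1)

# Train and evaluate
svd_optimized.fit(trainset)
precision_recall_at_k(svd_optimized)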

Observations and Insights:_____

# Using svd_algo_optimized model to recommend for userId 6958 and song_id 1671

# Using svd_algo_optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
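Assuming svd_optimized from above:

svd_optimized.predict(6958, 1671, r_ui = 2, verbose = True)

svd_optimized.predict(6958, 3232, verbose = True)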

Observations and Insights:_____

# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm

# Ranking songs based on above recommendations
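Reusing the helper functions defined earlier:

# Top 5 recommendations for user_id 6958 with the optimized SVD model
svd_recommendations = get_recommendations(df_final, 6958, 5, svd_optimized)

# Ranking those songs based on corrected play counts
ranking_songs(svd_recommendations, final_play)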

Observations and Insights:_____

Cluster-Based Recommendation System

In clustering-based recommendation systems, we explore the similarities and differences in people's tastes in songs based on how they rate different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.

# Make baseline clustering model

# Making prediction for user_id 6958 and song_id 1671

# Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user
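A minimal sketch of the baseline co-clustering model (the name clust_baseline is illustrative):

# Build the baseline co-clustering model, with random_state = 1 for reproducibility
clust_baseline = CoClustering(random_state = 1)

# Train on the train set and evaluate on the test set
clust_baseline.fit(trainset)
precision_recall_at_k(clust_baseline)

# Prediction for a song the user has listened to
clust_baseline.predict(6958, 1671, r_ui = 2, verbose = True)

# Prediction for a song not heard by the user
clust_baseline.predict(6958, 3232, verbose = True)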

Improving clustering-based recommendation system by tuning its hyperparameters

# Set the parameter space to tune


param_grid = {'n_cltr_u': [5, 6, 7, 8], 'n_cltr_i': [5, 6, 7, 8], 'n_epochs': [10, 20, 30]}

# Performing 3-fold grid search cross-validation

# Fitting data

# Best RMSE score

# Combination of parameters that gave the best RMSE score
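A possible completion, following the same grid-search pattern as before:

# 3-fold grid search cross-validation over the co-clustering parameters
gs = GridSearchCV(CoClustering, param_grid, measures = ['rmse'], cv = 3, n_jobs = -1)
gs.fit(data)

# Best RMSE score and the parameters that produced it
print(gs.best_score['rmse'])
print(gs.best_params['rmse'])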

Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the
available hyperparameters here.

# Train the tuned Coclustering algorithm
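Again, the best parameters can be fed straight back into the algorithm (the name co_clustering_optimized is illustrative and matches the cells below):

# Train the tuned co-clustering model and evaluate it
co_clustering_optimized = CoClustering(**gs.best_params['rmse'], random_state = 1)
co_clustering_optimized.fit(trainset)
precision_recall_at_k(co_clustering_optimized)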

Observations and Insights:_____

# Using co_clustering_optimized model to recommend for userId 6958 and song_id 1671

# Use Co_clustering based optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
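Assuming co_clustering_optimized from above:

co_clustering_optimized.predict(6958, 1671, r_ui = 2, verbose = True)

co_clustering_optimized.predict(6958, 3232, verbose = True)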

Observations and Insights:_____

Implementing the recommendation algorithm based on the optimized CoClustering model

# Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm
clustering_recommendations = _______

Correcting the play_count and Ranking the above songs


# Ranking songs based on the above recommendations
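For instance:

ranking_songs(clustering_recommendations, final_play)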


Observations and Insights:_____

Content-Based Recommendation Systems

Think About It: So far, we have only used the play_count of songs to make recommendations, but we have other information/features on songs as well. Can we take those song features into account?

df_small = df_final

# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"

# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data

# Drop the duplicates from the title column

# Set the title column as the index

# See the first 5 records of the df_small dataset
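A sketch of these preparation steps, assuming df_final kept the release and artist_name columns from the earlier merge:

# Concatenate "title", "release", and "artist_name" into a single "text" column
df_small['text'] = df_small['title'] + ' ' + df_small['release'] + ' ' + df_small['artist_name']

# Keep only the columns we need
df_small = df_small[['user_id', 'song_id', 'play_count', 'title', 'text']]

# Drop duplicate titles, then use the title as the index
df_small = df_small.drop_duplicates(subset = ['title'])
df_small = df_small.set_index('title')

# See the first 5 records
df_small.head()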

# Create the series of indices from the data
indices = pd.Series(df_small.index)

indices[:5]

# Importing necessary packages to work with text data
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download punkt library
nltk.download("punkt")

# Download stopwords library
nltk.download("stopwords")

# Download wordnet
nltk.download("wordnet")

We will create a function to pre-process the text data:

# Function to tokenize the text
def tokenize(text):
    # Keep only letters and lowercase the text
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())

    # Split the cleaned text into word tokens
    tokens = word_tokenize(text)

    # Remove stopwords and lemmatize the remaining words
    words = [word for word in tokens if word not in stopwords.words("english")]
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]