Notebook - Music Recommendation System Reference

4/9/24, 4:53 PM Music_Recommendation_System_Reference_Notebook_Low_Code (1).ipynb - Colaboratory
Music Recommendation System

Problem Definition
The Context:
Why is this problem important to solve?
The objective:
What is the intended goal?
The key questions:

What are the key questions that need to be answered?
The problem formulation:

What is it that we are trying to solve using data science?
Data Dictionary
The core data is the Taste Profile Subset released by the Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The
first file contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id,
and the play count of users.
song_data
song_id - A unique id given to every song
title - Title of the song
release - Name of the released album
artist_name - Name of the artist
year - Year of release
count_data
user_id - A unique id given to the user
song_id - A unique id given to the song
play_count - Number of times the song was played
Data Source
http://millionsongdataset.com/
Important Notes

This notebook can be considered a guide to refer to while solving the problem. The evaluation will be as per the Rubric shared for the
Milestone. Unlike previous courses, it does not follow the pattern of the graded questions in different sections. This notebook would give
you a direction on what steps need to be taken to get a feasible solution to the problem. Please note that this is just one way of doing this.
There can be other 'creative' ways to solve the problem, and we encourage you to feel free and explore them as an 'optional' exercise.
In the notebook, there are markdown cells called Observations and Insights. It is a good practice to provide observations and extract
insights from the outputs.
The naming convention for different variables can vary. Please consider the code provided in this notebook as a sample code.
All the outputs in the notebook are just for reference and can be different if you follow a different approach.
There are sections called Think About It in the notebook that will help you get a better understanding of the reasoning behind a particular
technique/step. Interested learners can take alternative approaches if they want to explore different techniques.
Importing Libraries and the Dataset

# Mounting the drive
from google.colab import drive
drive.mount("/content/drive")
Mounted at /content/drive
from google.colab import files

uploaded = files.upload()
Choose Files song_data.csv

song_data.csv(text/csv) - 82246673 bytes, last modified: 3/19/2024 - 100% done
Saving song_data.csv to song_data (1).csv
from google.colab import files

uploaded = files.upload()
Choose Files count_data.csv

count_data.csv(text/csv) - 139003826 bytes, last modified: 3/19/2024 - 100% done
Saving count_data.csv to count_data.csv
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')
# Basic libraries of python for numeric and dataframe computations

import numpy as np
import pandas as pd
# Basic library for data visualization

import matplotlib.pyplot as plt
# Slightly advanced library for data visualization

import seaborn as sns
# To compute the cosine similarity between two vectors

from sklearn.metrics.pairwise import cosine_similarity
# A dictionary output that does not raise a key error

from collections import defaultdict
# A performance metrics in sklearn

from sklearn.metrics import mean_squared_error
Load the dataset

count_df = pd.read_csv('count_data.csv')
song_df = pd.read_csv('song_data.csv')
Understanding the data by viewing a few observations

# See top 10 records of count_df data
top_10_count = count_df.head(10)
print("\nTop 10 records of count_df:")

print(top_10_count.to_string(index=False))
Top 10 records of count_df:

Unnamed: 0 user_id song_id play_count
0 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1
1 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2
2 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1
3 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1
4 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1
5 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODDNQT12A6D4F5F7E 5
6 b80344d063b5ccb3212f76538f3d9e43d87dca9e SODXRTY12AB0180F3B 1
7 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFGUAY12AB017B0A8 1
8 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOFRQTD12A81C233C0 1
9 b80344d063b5ccb3212f76538f3d9e43d87dca9e SOHQWYZ12A6D4FA701 1
# See top 10 records of song_df data
# Get the top 10 records of song_df
top_10_song = song_df.head(10)
print("\nTop 10 records of song_df:")
print(top_10_song.to_string(index=False))
Top 10 records of song_df:

song_id title release artist_name year
SOQMMHC12AB0180CB8 Silent Night Monster Ballads X-Mas Faster Pussy cat 2003
SOVFVAK12A8C1350D9 Tanssi vaan Karkuteillä Karkkiautomaatti 1995
SOGTUKN12AB017F4F1 No One Could Ever Butter Hudson Mohawke 2006
SOBNYVR12A8C13558C Si Vos Querés De Culo Yerba Brava 2003
SOHSBXH12A8C13B0DF Tangle Of Aspens Rene Ablaze Presents Winter Sessions Der Mystic 0
SOZVAPQ12A8C13B63C Symphony No. 1 G minor "Sinfonie Serieuse"/Allegro con energia Berwald: Symphonies Nos. 1/2/3/4 David Montgomery 0
SOQVRHI12A6D4FB2D7 We Have Got Love Strictly The Best Vol. 34 Sasha / Turbulence 0
SOEYRFT12AB018936C 2 Da Beat Ch'yall Da Bomb Kris Kross 1993
SOPMIYT12A6D4F851E Goodbye Danny Boy Joseph Locke 0
SOJCFMH12A8C13B0C2 Mama_ mama can't you see ? March to cadence with the US marines The Sun Harbor's Chorus-Documentary Recordings 0
Let us check the data types and missing values of each column
# See the info of the count_df data
print("Data types of count_df:")
print(count_df.dtypes)
Data types of count_df:

Unnamed: 0 int64
user_id object
song_id object
play_count int64
dtype: object
# See the info of the song_df data

print("\nData types of song_df:")
print(song_df.dtypes)
Data types of song_df:

song_id object
title object
release object
artist_name object
year int64
dtype: object
Observations and Insights:_____
Think About It: The user_id and song_id values are encrypted. Can they be encoded as numeric features?
Yes. Even though the user_id and song_id columns are encrypted or hashed, they can still be encoded as numeric features. These IDs are
categorical (not ordinal) variables, so techniques such as Label Encoding or One-Hot Encoding apply. With many distinct IDs, One-Hot
Encoding creates one column per ID, so Label Encoding is usually the more practical choice here.
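As a minimal sketch of label encoding, the snippet below maps string IDs to integer codes with `pd.factorize`. The toy DataFrame and its shortened, hash-like IDs are made up for illustration; the column names `user_code` and `song_code` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy interactions; the shortened IDs are invented for illustration only
toy = pd.DataFrame({
    "user_id": ["b80344", "b80344", "a1f9c2", "c77d10"],
    "song_id": ["SOAKIMP", "SOBBMDR", "SOAKIMP", "SOBXHDL"],
    "play_count": [1, 2, 1, 5],
})

# factorize returns (integer codes, unique values); codes follow first appearance
toy["user_code"], user_index = pd.factorize(toy["user_id"])
toy["song_code"], song_index = pd.factorize(toy["song_id"])

print(toy[["user_code", "song_code", "play_count"]])

# The unique-value index maps a code back to the original ID
print(user_index[0], song_index[2])
```

Keeping `user_index` and `song_index` around is what makes the encoding reversible, which matters when recommended song codes have to be turned back into real song IDs.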
pip install category_encoders
Collecting category_encoders
Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.9/81.9 kB 1.3 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.25.2)
Requirement already satisfied: scikit-learn>=0.20.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.2.2)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (1.11.4)
Requirement already satisfied: statsmodels>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.14.1)
Requirement already satisfied: pandas>=1.0.5 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (2.0.3)
Requirement already satisfied: patsy>=0.5.1 in /usr/local/lib/python3.10/dist-packages (from category_encoders) (0.5.6)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2023.4)
Requirement already satisfied: tzdata>=2022.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.5->category_encoders) (2024.1)
Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from patsy>=0.5.1->category_encoders) (1.16.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.20.0->category_encoders) (3.4.0)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from statsmodels>=0.9.0->category_encoders) (24.0)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.3
import pandas as pd
import category_encoders as ce
# Assuming count_df and song_df are already defined
# Step 1: Merge count_df and song_df

df = pd.merge(count_df, song_df, on='song_id', how='left')
# Display the top records of the merged DataFrame

print("Merged DataFrame:")
print(df.head())
# Step 2: Apply One-Hot Encoding using category_encoders.OneHotEncoder

encoder = ce.OneHotEncoder(cols=['song_id'], use_cat_names=True)
df_encoded = encoder.fit_transform(df)
# Drop the original 'song_id' column

if 'song_id' in df_encoded.columns:
    df_encoded = df_encoded.drop(['song_id'], axis=1)
# Display the top records of the encoded DataFrame

print("\nEncoded DataFrame:")
print(df_encoded.head())
print("Shape of the DataFrame:")

print(df_encoded.shape)
output Merged DataFrame:

user_id song_id play_count title artist_name
0 1 101 5 Song A Artist 1
1 2 102 3 Song B Artist 2
2 3 101 8 Song A Artist 1
3 1 103 2 Song C Artist 3
4 2 102 6 Song B Artist 2
Encoded DataFrame:
user_id song_id_101.0 song_id_102.0 song_id_103.0 play_count title \
0 1 1 0 0 5 Song A
1 2 0 1 0 3 Song B
2 3 1 0 0 8 Song A
3 1 0 0 1 2 Song C
4 2 0 1 0 6 Song B
artist_name
0 Artist 1
1 Artist 2
2 Artist 1
3 Artist 3
4 Artist 2
Shape of the DataFrame:
(5, 7)
# Apply label encoding for "user_id" and "song_id"

from sklearn.preprocessing import LabelEncoder
# Use one encoder per column so that each mapping can be inverted later
user_encoder = LabelEncoder()
song_encoder = LabelEncoder()
# Apply label encoding for "user_id"
df_encoded['user_id'] = user_encoder.fit_transform(df['user_id'])
# Apply label encoding for "song_id"
df_encoded['song_id'] = song_encoder.fit_transform(df['song_id'])
# Get the column containing the users

users = df_encoded.user_id
# Create a dictionary from users to their number of songs

ratings_count = dict()
for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1

(5, 8)
# We want our users to have listened to at least 90 songs
RATINGS_CUTOFF = 90
# Create a list of users who need to be removed
remove_users = []
for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)
df_encoded = df_encoded.loc[ ~ df_encoded.user_id.isin(remove_users)]


(0, 8)
Think About It: As the data also contains users who have listened to very few songs and vice versa, is it required to filter the data so that it
contains users who have listened to a good count of songs and vice versa?
Filtering the data to keep only users who have listened to enough songs (and songs that have been listened to by enough users) is
common practice in recommendation systems. It focuses the model on users and songs with enough interactions to support
meaningful recommendations.
Reasons for filtering:
Data sparsity: The user-song interaction data is very sparse when there are many users and songs. Users may have listened to only a
few songs, and songs may have been listened to by only a few users.
Cold start problem: Users or songs with very few interactions make accurate recommendations difficult. The system might
struggle to provide good recommendations for new users or unpopular songs.
Filtering users: Remove users who have listened to a very small number of songs, so that the model focuses on users who are
more active and have more diverse preferences.
Filtering songs: Similarly, remove songs that have been listened to by very few users. The songs that remain are the more popular
ones, with enough listeners to base a recommendation on.
Here's how you might filter the data to include only users who have listened to at least a certain number of songs, and songs that have been
listened to by at least a certain number of users:
# Get the column containing the songs

songs = df_encoded.song_id
# Create a dictionary from songs to their number of users

ratings_count = dict()
for song in songs:
    # If we already have the song, just add 1 to their rating count
    if song in ratings_count:
        ratings_count[song] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[song] = 1

(0, 8)
# We want our song to be listened to by at least 120 users to be considered
RATINGS_CUTOFF = 120
remove_songs = []
for song, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_songs.append(song)
df_final = df_encoded.loc[ ~ df_encoded.song_id.isin(remove_songs)]

print(df_final)

Empty DataFrame
Columns: [user_id, song_id_101.0, song_id_102.0, song_id_103.0, play_count, title, artist_name, song_id]
Index: []
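The two filtering steps above can also be sketched with `value_counts`, which avoids the explicit counting loops. The helper name `filter_by_min_count` and the toy data are assumptions for illustration, and the cutoffs are scaled down to fit the toy data (the notebook uses 90 and 120).

```python
import pandas as pd

def filter_by_min_count(df, column, min_count):
    """Keep rows whose value in `column` appears at least `min_count` times."""
    counts = df[column].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[column].isin(keep)]

# Toy interactions, invented for illustration
toy = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2],
    "song_id": [10, 11, 12, 10, 11, 10],
    "play_count": [1, 2, 1, 3, 1, 5],
})

# Keep users with at least 2 songs, then songs with at least 2 remaining listeners
active_users = filter_by_min_count(toy, "user_id", min_count=2)
popular = filter_by_min_count(active_users, "song_id", min_count=2)
print(popular)

# A cutoff larger than every count removes all rows
print(filter_by_min_count(toy, "user_id", min_count=90).shape)
```

The last line shows why a cutoff of 90 applied to a five-row frame leaves an empty DataFrame, as in the `(0, 8)` shape printed earlier: the cutoffs only make sense on the full interaction data.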
# Drop records with play_count more than(>) 5

# Create a boolean mask where play_count is more than 5
mask_play_count = df_final['play_count'] > 5
# Use the mask to filter out rows where play_count > 5

df_final_filtered = df_final[~mask_play_count]
# Display the filtered DataFrame based on 'play_count' criteria

print("\nFiltered DataFrame based on 'play_count' less than or equal to 5:")
print(df_final_filtered.head())
Filtered DataFrame based on 'play_count' less than or equal to 5:

Empty DataFrame
Columns: [user_id, song_id, play_count, title, artist_name]
Index: []
# Check the shape of the data

print(df_final_filtered.shape)

(0, 5)
Exploratory Data Analysis

Let's check the total number of unique users, songs, artists in the data
Total number of unique user id
# Display total number of unique user_id
# Find the total number of unique user IDs

total_unique_users = df_final['user_id'].nunique()
# Display the total number of unique user IDs
print("Total Number of Unique User IDs:", total_unique_users)
Total Number of Unique User IDs: 0
Total number of unique song id
# Display total number of unique song_id

total_unique_songs = song_df['song_id'].nunique()
# Display the total number of unique song_id

print("Total Number of Unique song_id:", total_unique_songs)
Total Number of Unique song_id: 3
Total number of unique artists
# Display total number of unique artists

# Find the total number of unique artist_name
total_unique_artists = song_df['artist_name'].nunique()
# Display the total number of unique artist_name

print("Total Number of Unique artists (artist_name):", total_unique_artists)
Total Number of Unique artists (artist_name): 3
Observations and Insights:__
Let's find out about the most interacted songs and interacted users
Most interacted songs
Most interacted users
Observations and Insights:___
Songs played in a year
count_songs = df_final.groupby('year').count()['title']
count = pd.DataFrame(count_songs)
count.drop(count.index[0], inplace = True)
count.tail()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-42-9a68e85aacb9> in <cell line: 1>()
----> 1 count_songs = df_final.groupby('year').count()['title']
2
3 count = pd.DataFrame(count_songs)
4
5 count.drop(count.index[0], inplace = True)
2 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
983 in_axis, level, gpr = False, gpr, None
984 else:
--> 985 raise KeyError(gpr)
986 elif isinstance(gpr, Grouper) and gpr.key is not None:
987 # Add key to exclusions
KeyError: 'year'
# Create the plot
# Set the figure size

plt.figure(figsize = (30, 10))
sns.barplot(x = count.index,
y = 'title',
data = count,
estimator = np.median)
# Set the y label of the plot

plt.ylabel('number of titles played')
# Show the plot

plt.show()
Observations and Insights:__
Think About It: What other insights can be drawn using exploratory data analysis?
Important Insights from EDA

What are the most important observations and insights from the data based on the EDA performed?
Now that we have explored the data, let's apply different algorithms to build recommendation systems.
Building various models

Popularity-Based Recommendation Systems
Let's take the average and the count of play counts of the songs and build the popularity recommendation system based on the average play count.
# Calculating average play_count

average_count = df_final._____________ # Hint: Use groupby function on the song_id column
# Calculating the frequency a song is played

play_freq = df_final._________________ # Hint: Use groupby function on the song_id column
# Making a dataframe with the average_count and play_freq

final_play = pd.DataFrame({'avg_count':_________, 'play_freq':______})
# Let us see the first five records of the final_play dataset

final_play.head()
Now, let's create a function to find the top n songs for a recommendation based on the average play count of song. We can also add a threshold
for a minimum number of playcounts for a song to be considered for recommendation.
# Build the function to find top n songs
# Recommend top 10 songs using the function defined above
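One possible shape for such a function is sketched below on toy data. The function name `top_n_songs`, its parameter names, and the toy play counts are assumptions for illustration; this is one way to do it, not the notebook's reference solution.

```python
import pandas as pd

def top_n_songs(final_play, n, min_plays):
    """Return the n song_ids with the highest avg_count among songs
    whose play_freq is greater than min_plays."""
    eligible = final_play[final_play["play_freq"] > min_plays]
    ranked = eligible.sort_values("avg_count", ascending=False)
    return ranked.index[:n].tolist()

# Toy interactions, invented for illustration
toy = pd.DataFrame({
    "song_id": [10, 10, 10, 11, 11, 12],
    "play_count": [1, 2, 3, 5, 4, 5],
})

# Average play_count and number of interactions per song
grouped = toy.groupby("song_id")["play_count"]
final_play = pd.DataFrame({"avg_count": grouped.mean(), "play_freq": grouped.count()})

print(top_n_songs(final_play, n=2, min_plays=1))
```

Song 12 has the highest average in the toy data but only one interaction, so the `min_plays` threshold excludes it. That is the purpose of the threshold: an average over very few plays is not a reliable popularity signal.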
User User Similarity-Based Collaborative Filtering

To build the user-user-similarity-based and subsequent models we will use the "surprise" library.
# Install the surprise package using pip. Uncomment and run the below code to do the same
# !pip install surprise
# Import necessary libraries
# To compute the accuracy of models

from surprise import accuracy
# This class is used to parse a file containing play_counts, data should be in structure - user; item; play_count
from surprise.reader import Reader
# Class for loading datasets

from surprise.dataset import Dataset
# For tuning model hyperparameters

from surprise.model_selection import GridSearchCV
# For splitting the data in train and test dataset

from surprise.model_selection import train_test_split
# For implementing similarity-based recommendation system

from surprise.prediction_algorithms.knns import KNNBasic
# For implementing matrix factorization based recommendation system

from surprise.prediction_algorithms.matrix_factorization import SVD
# For implementing KFold cross-validation

from surprise.model_selection import KFold
# For implementing clustering-based recommendation system

from surprise import CoClustering
Some useful functions

Below is the function to calculate precision@k and recall@k, RMSE and F1_Score@k to evaluate the model performance.
Think About It: Which metric should be used for this problem to compare different models?
# The function to calculate the RMSE, precision@k, recall@k, and F_1 score
def precision_recall_at_k(model, k = 30, threshold = 1.5):
    """Print RMSE, precision@k, recall@k, and F_1 score on the test set"""
    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    # Making predictions on the test data
    predictions = model.test(testset)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))
    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():
        # Sort user ratings by estimated value
        user_ratings.sort(key = lambda x : x[0], reverse = True)
        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[ : k])
        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[ : k])
        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    # Mean of all the predicted precisions is calculated
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)
    # Mean of all the predicted recalls is calculated
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)
    accuracy.rmse(predictions)
    # Command to print the overall precision
    print('Precision: ', precision)
    # Command to print the overall recall
    print('Recall: ', recall)
    # Formula to compute the F-1 score
    print('F_1 score: ', round((2 * precision * recall) / (precision + recall), 3))
Think About It: In the function precision_recall_at_k above the threshold value used is 1.5. How precision and recall are affected by changing
the threshold? What is the intuition behind using the threshold value of 1.5?
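To explore the threshold question, here is a small sketch that applies the same precision@k and recall@k logic to a single user without needing a trained model. The helper name `precision_recall_single_user` and the (estimated, true) pairs are made up for illustration.

```python
def precision_recall_single_user(ratings, k, threshold):
    """ratings: list of (estimated, true) play_count pairs for one user."""
    # Rank the user's items by estimated play_count, highest first
    ranked = sorted(ratings, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    # Relevant: true play_count at or above the threshold
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    # Recommended: estimated play_count at or above the threshold, within the top k
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    n_rel_and_rec_k = sum(
        est >= threshold and true_r >= threshold for est, true_r in top_k
    )
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0
    return precision, recall

# Invented (estimated, true) pairs for one user
toy = [(2.4, 3), (1.8, 1), (1.6, 2), (1.2, 2), (0.9, 1)]
for threshold in (1.5, 2.0):
    print(threshold, precision_recall_single_user(toy, k=3, threshold=threshold))
```

On this toy input, raising the threshold from 1.5 to 2.0 moves precision from 2/3 to 1 and recall from 2/3 to 1/3: fewer items count as recommended, so the ones that remain are more likely to be relevant, but more relevant items are missed.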
# Instantiating Reader scale with expected rating scale

reader = Reader(rating_scale=____) #use rating scale (0, 5)
# Loading the dataset

data = Dataset.load_from_df(df_final[[______, _______, ____]], reader) # Take only "user_id","song_id", and "play_count"
# Splitting the data into train and test dataset

trainset, testset = train_test_split(data, test_size=____, random_state = 42) # Take test_size = 0.4
Think About It: How changing the test size would change the results and outputs?
# Build the default user-user-similarity model

sim_options = {'name': _________,
'user_based':______}
# KNN algorithm is used to find desired similar items

sim_user_user = KNNBasic(_________) # Use random_state = 1
# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(_________)
# Let us compute precision@k, recall@k, and f_1 score with k = 30

precision_recall_at_k(_____________) # Use sim_user_user model
Observations and Insights:_____
# Predicting play_count for a sample user with a listened song

sim_user_user.predict(_____, ______, r_ui = 2, verbose = True) # Use user id 6958 and song_id 1671
# Predicting play_count for a sample user with a song not-listened by the user
sim_user_user.predict(____, _____, verbose = True) # Use user_id 6958 and song_id 3232
Now, let's try to tune the model and see if we can improve the model performance.
# Setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
'user_based': [True], "min_support": [2, 4]}
}
# Performing 3-fold cross-validation to tune the hyperparameters
# Fitting the data

gs.fit(______) # Use entire data for GridSearch
# Best RMSE score
# Combination of parameters that gave the best RMSE score
# Train the best model found in above gridsearch
# Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui = 2
# Predict the play count for a song that is not listened to by the user (with user_id 6958)
Think About It: Along with making predictions on listened and unknown songs can we get 5 nearest neighbors (most similar) to a certain song?
# Use inner id 0
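For the neighbor question above, the idea behind a nearest-neighbor lookup can be sketched with plain NumPy and cosine similarity. The function `nearest_neighbors` and the toy play-count matrix are illustrative assumptions; in Surprise itself, a fitted KNN model exposes `get_neighbors(iid, k)`, which takes an inner id.

```python
import numpy as np

def nearest_neighbors(matrix, row, k):
    """Return indices of the k rows most cosine-similar to `row`, excluding itself.
    Assumes no row is all zeros (the norm would be 0)."""
    norms = np.linalg.norm(matrix, axis=1)
    sims = matrix @ matrix[row] / (norms * norms[row])
    sims[row] = -np.inf  # never return the query row itself
    return np.argsort(-sims)[:k].tolist()

# Rows are users, columns are songs, 0 means "not listened" (toy data)
play_counts = np.array([
    [5, 3, 0, 1],
    [4, 3, 0, 1],
    [0, 0, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

print(nearest_neighbors(play_counts, row=0, k=2))
```

Transposing the matrix (`play_counts.T`) gives song-to-song neighbors instead of user-to-user neighbors, which is the difference between the `user_based: True` and `user_based: False` settings used in this notebook.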
Below we will be implementing a function where the input parameters are:
data: A song dataset

user_id: A user-id against which we want the recommendations
top_n: The number of songs we want to recommend
algo: The algorithm we want to use for predicting the play_count
The output of the function is a set of top_n items recommended for the given user_id based on the given algorithm
def get_recommendations(data, user_id, top_n, algo):
    # Creating an empty list to store the recommended song ids
    recommendations = []
    # Creating a user-item interactions matrix
    user_item_interactions_matrix = data.pivot_table(______________)
    # Extracting those song ids which the user_id has not listened to yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    # Looping through each of the song ids which user_id has not interacted with yet
    for item_id in non_interacted_products:
        # Predicting the play_count for those songs not listened to by this user
        est = ___________________
        # Appending the predicted play_count
        recommendations.append(_______________)
    # Sorting the predicted play_counts in descending order
    recommendations.sort(key = lambda x : x[1], reverse = True)
    return recommendations[:top_n] # Returning top n songs with the highest predicted play_count for this user
# Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine
recommendations =___________________________
# Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(_________________)
Correcting the play_counts and Ranking the above songs

def ranking_songs(recommendations, final_rating):
    # Sort the songs based on play counts
    ranked_songs = final_rating.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending = False)[['play_freq']].reset_index()
    # Merge with the recommended songs to get predicted play_count
    ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns = [__________, ___________]), on = ____________, how = 'inner')
    # Rank the songs based on corrected play_counts
    ranked_songs['corrected_ratings'] = ranked_songs['predicted_ratings'] - 1 / np.sqrt(ranked_songs['play_freq'])
    # Sort the songs based on corrected play_counts
    ranked_songs = ________________
    return ranked_songs
Think About It: In the above function to correct the predicted play_count a quantity 1/np.sqrt(n) is subtracted. What is the intuition behind it? Is
it also possible to add this quantity instead of subtracting?
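A small worked example may help with the intuition: the penalty 1/sqrt(n) shrinks as the number of plays n grows, so predictions backed by few plays are discounted more. The song names, predicted values, and play frequencies below are invented for illustration.

```python
import math

def corrected_rating(predicted, play_freq):
    """Subtract a penalty that shrinks as the song has more recorded plays."""
    return predicted - 1 / math.sqrt(play_freq)

# (song, predicted play_count, play_freq), invented values
candidates = [("song_a", 2.0, 4), ("song_b", 1.9, 400)]

for name, predicted, freq in candidates:
    print(name, round(corrected_rating(predicted, freq), 3))

# Rank by corrected value, highest first
ranked = sorted(candidates, key=lambda c: corrected_rating(c[1], c[2]), reverse=True)
print([name for name, _, _ in ranked])
```

Here `song_a` has the higher raw prediction, but with only 4 plays its penalty is 0.5, while `song_b` with 400 plays loses only 0.05 and ends up ranked first. Adding the quantity instead would have the opposite effect: rarely played songs would get a boost, which is an optimistic rather than a conservative correction.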
# Applying the ranking_songs function on the final_play data
Item Item Similarity-based collaborative filtering recommendation systems
# Apply the item-item similarity collaborative filtering model with random_state = 1 and evaluate the model performance
# Predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user
# Predict the play count for a user that has not listened to the song (with song_id 1671)
# Apply grid search for enhancing model performance
# Setting up parameter grid to tune the hyperparameters

param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
'sim_options': {'name': ["cosine", 'pearson', "pearson_baseline"],
'user_based': [False], "min_support": [2, 4]}
}
# Performing 3-fold cross-validation to tune the hyperparameters

gs =
# Fitting the data

gs.fit(_____)
# Find the best RMSE score
# Extract the combination of parameters that gave the best RMSE score
Think About It: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the
list of hyperparameters here.
# Apply the best model found in the grid search
# Predict the play_count by a user(user_id 6958) for the song (song_id 1671)
# Predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user
# Find five most similar items to the item with inner id 0
# Making top 5 recommendations for user_id 6958 with item_item_similarity-based recommendation engine
recommendations =
# Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(_______________)
# Applying the ranking_songs function
Model Based Collaborative Filtering - Matrix Factorization

Model-based collaborative filtering is a personalized recommendation system: the recommendations are based on the past behavior of the
user and do not depend on any additional information. We use latent features to find recommendations for each user.
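To make "latent features" concrete, here is a minimal NumPy sketch of matrix factorization trained with stochastic gradient descent. It is a simplified stand-in and not Surprise's `SVD` implementation (which also learns user and item biases). The function names, the toy triples, and the hyperparameter defaults are assumptions for illustration.

```python
import numpy as np

def factorize(ratings, n_factors=2, n_epochs=500, lr=0.05, reg=0.02, seed=0):
    """ratings: list of (user, item, play_count) with integer ids starting at 0.
    Returns user factors P and item factors Q so that P[u] @ Q[i] approximates play_count."""
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            p_old = P[u].copy()
            # Gradient step on the squared error with L2 regularization
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    errors = [(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errors)))

# Toy (user, song, play_count) triples, invented for illustration
toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 3),
       (2, 2, 5), (2, 3, 4), (3, 2, 4), (3, 3, 5), (3, 0, 1)]

P0, Q0 = factorize(toy, n_epochs=0)   # untrained factors, for comparison
P, Q = factorize(toy)
print("RMSE before training:", rmse(toy, P0, Q0))
print("RMSE after training: ", rmse(toy, P, Q))
print("Predicted play_count for user 0, song 2:", P[0] @ Q[2])
```

The last line predicts a pair that never appears in the training triples, which is exactly what the model is used for when recommending songs a user has not listened to.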
# Build baseline model using svd
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui = 2
# Making a prediction for the user who has not listened to the song (song_id 3232)
Improving matrix factorization based recommendation system by tuning its hyperparameters
# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
'reg_all': [0.2, 0.4, 0.6]}
# Perform 3-fold grid-search cross-validation

gs =
# Fitting data
# Best RMSE score
available hyperparameters here.
# Building the optimized SVD model using optimal hyperparameters
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 1671
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm
# Ranking songs based on above recommendations
Cluster Based Recommendation System

In clustering-based recommendation systems, we explore the similarities and differences in people's tastes in songs based on how they rate
different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.
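The idea of recommending from within a cluster can be sketched as follows. This is a deliberate simplification and not the CoClustering algorithm: it assumes the user clusters are already given (the `user_cluster` dictionary is hypothetical), and it predicts a play_count as the mean over other users in the same cluster, falling back to the global mean.

```python
def cluster_predict(interactions, user_cluster, user, song):
    """Predict play_count for (user, song) as the mean play_count of `song`
    among other users in the same cluster; fall back to the global mean."""
    same_cluster_counts = [
        count for u, s, count in interactions
        if s == song and u != user and user_cluster[u] == user_cluster[user]
    ]
    if same_cluster_counts:
        return sum(same_cluster_counts) / len(same_cluster_counts)
    all_counts = [count for _, _, count in interactions]
    return sum(all_counts) / len(all_counts)

# Toy (user, song, play_count) interactions and assumed cluster labels
interactions = [("u1", "s1", 5), ("u2", "s1", 3), ("u3", "s1", 1), ("u3", "s2", 2)]
user_cluster = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}

print(cluster_predict(interactions, user_cluster, "u4", "s1"))
print(cluster_predict(interactions, user_cluster, "u4", "s2"))
```

The first call averages the play counts of `u1` and `u2`, who share a cluster with `u4`. The second call finds nobody in that cluster who played `s2`, so it returns the global mean instead.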
# Make baseline clustering model
# Making prediction for user_id 6958 and song_id 1671
# Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user
Improving clustering-based recommendation system by tuning its hyper-parameters
# Set the parameter space to tune

param_grid = {'n_cltr_u': [5, 6, 7, 8], 'n_cltr_i': [5, 6, 7, 8], 'n_epochs': [10, 20, 30]}
# Performing 3-fold grid search cross-validation
# Fitting data
# Best RMSE score
available hyperparameters here.
# Train the tuned Coclustering algorithm
# Using co_clustering_optimized model to recommend for userId 6958 and song_id 1671
# Use Co_clustering based optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating
Implementing the recommendation algorithm based on optimized CoClustering model
# Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm
clustering_recommendations = _______
Correcting the play_count and Ranking the above songs

# Ranking songs based on the above recommendations
Content Based Recommendation Systems

Think About It: So far we have only used the play_count of songs to find recommendations but we have other information/features on songs as
well. Can we take those song features into account?
df_small = df_final
# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"
# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data
# Drop the duplicates from the title column
# Set the title column as the index
# See the first 5 records of the df_small dataset
# Create the series of indices from the data

indices =_______________
indices[ : 5]
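The data preparation steps listed in the comments above could look roughly like this on a toy frame. The rows reuse two songs from the song_df preview, while the user ids, song ids, and play counts are invented; `toy` and `small` are illustrative names, not the notebook's variables.

```python
import pandas as pd

# Toy merged data; song details taken from the song_df preview, the rest invented
toy = pd.DataFrame({
    "user_id": [0, 1, 2],
    "song_id": [10, 10, 11],
    "play_count": [1, 2, 1],
    "title": ["Silent Night", "Silent Night", "Butter"],
    "release": ["Monster Ballads X-Mas", "Monster Ballads X-Mas", "Butter"],
    "artist_name": ["Faster Pussy cat", "Faster Pussy cat", "Hudson Mohawke"],
})

# Concatenate the descriptive columns into a single "text" column
toy["text"] = toy["title"] + " " + toy["release"] + " " + toy["artist_name"]

# Keep the needed columns, drop duplicate titles, and index by title
small = toy[["user_id", "song_id", "play_count", "title", "text"]]
small = small.drop_duplicates(subset=["title"]).set_index("title")

# Series mapping a row position to its song title
indices = pd.Series(small.index)
print(indices[:5])
```

Dropping duplicate titles matters because the interaction data has one row per user-song pair, while content-based similarity needs exactly one text description per song.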
# Importing necessary packages to work with text data
import re
import nltk
from nltk.tokenize import word_tokenize
# Download punkt library
nltk.download("punkt")
# Download stopwords library
nltk.download("stopwords")
# Download wordnet
nltk.download("wordnet")
We will create a function to pre-process the text data:
# Function to tokenize the text
def tokenize(text):
    text = re.sub(r"[^a-zA-Z]"," ", text.lower())
    tokens = word_tokenize(text)
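For readers who want to try the cleaning step without downloading NLTK resources, here is a dependency-free sketch of the same idea: lowercase the text, replace non-letters with spaces, split on whitespace, and drop stop words. The tiny `STOPWORDS` set and the name `simple_tokenize` are illustrative assumptions and are not equivalent to NLTK's tokenizer or stop-word list.

```python
import re

# Tiny illustrative stop-word set; NLTK's English list is much longer
STOPWORDS = {"the", "of", "a", "and", "to"}

def simple_tokenize(text):
    """Lowercase, keep letters only, split on whitespace, drop stop words."""
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    return [token for token in text.split() if token not in STOPWORDS]

print(simple_tokenize("Symphony No. 1 G minor: Allegro con energia"))
print(simple_tokenize("March to cadence with the US marines"))
```

Digits and punctuation disappear because the regular expression keeps only the letters a to z, which is the same pattern used in the `tokenize` function above.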