ML Unit 6


Course Name: 6IT4-02:Machine Learning

Unit-VI: Recommender Systems

 Collaborative filtering
 Content-based filtering
 Artificial neural network
 Perceptron
 Multilayer network
 Backpropagation
 Introduction to Deep learning
Recommender System
These days, whether you look at a video on YouTube, a movie on Netflix or a
product on Amazon, you are going to get recommendations for more things to
view, like or buy. You can thank the advent of machine learning algorithms
and recommender systems for this development.
Recommender System
 Recommender systems are one of the most successful
and widespread applications of machine learning
technologies in business.
 Recommender systems are an important class of machine
learning algorithms that offer "relevant" suggestions to
users. They are typically categorized as either collaborative
filtering or content-based systems.
Recommender System
 A recommender system is a subclass of information
filtering that seeks to predict the "rating" or "preference"
a user will give an item, such as a product, movie, song,
etc.
 Recommender systems provide personalized information
by learning the user’s interests through traces of
interaction with that user. 
 Like other machine learning applications, a recommender
system makes predictions based on a user's past
behavior. Specifically, it is designed to predict user
preference for a set of items based on experience.
Recommender System
Mathematically, a recommendation task can be framed as follows:
 A set of users (U)
 A set of items (I) that are to be recommended to users in U
 Learn a function, based on the users' past interaction data, that predicts
how likely a user in U is to like an item in I
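As a rough illustration of this formulation, the sketch below frames the task as a function to be learned from past interactions. All names (Interactions, predict_score) are illustrative assumptions, and the placeholder body simply averages the user's past ratings rather than actually learning anything.

```python
# A minimal sketch of the recommendation task: learn f(user, item) -> score
# from past interaction data. Names and the toy logic are illustrative only.
from typing import Dict, Tuple

# Past interactions: (user, item) -> observed rating
Interactions = Dict[Tuple[str, str], float]

def predict_score(user: str, item: str, history: Interactions) -> float:
    """Predict how much `user` will like `item` from past interaction data.

    Placeholder: returns the user's average past rating. Real systems learn
    this function with the collaborative or content-based methods below.
    """
    user_ratings = [r for (u, _), r in history.items() if u == user]
    return sum(user_ratings) / len(user_ratings) if user_ratings else 0.0
```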


Examples
Some key examples of recommender systems at work include:
 Product recommendations on Amazon and other shopping sites
 Movie and TV show recommendations on Netflix
 Article recommendations on news sites
Recommendation Engine
 Until recently, people generally tended to buy products recommended to them by
their friends or by people they trust. This used to be the primary method of
purchase when there was any doubt about the product.
 But with the advent of the digital age, that circle has expanded to include online
sites that utilize some sort of recommendation engine.
 A recommendation engine filters the data using different algorithms
and recommends the most relevant items to users. It first captures the
past behavior of a customer and based on that, recommends products
which the users might be likely to buy.
Recommendation Engine
Q. If a completely new user visits an e-commerce site, that site will not
have any past history of that user. So how does the site go about
recommending products to the user in such a scenario?

Solution:

 One possible solution could be to recommend the best-selling products, i.e.
the products which are high in demand.
 Another possible solution could be to recommend the products which
would bring the maximum profit to the business.
Recommendation Engine
If a few items can be recommended to a customer based on their needs and
interests, it will create a positive impact on the user experience and lead to
frequent visits.

Hence, businesses nowadays are building smart and intelligent
recommendation engines by studying the past behavior of their users.
Types
There are basically three important types of
recommendation engines:
 Collaborative filtering
 Content-based filtering
 Hybrid recommendation systems
Content Based Filtering
 Content-based recommendation systems take into
account the data provided by the user both directly and
indirectly. For example, age can be used to determine
classes of products or items reviewed and bought by the
user.
 This type of recommendation system relies on
characteristics of the object.
Content Based Filtering
 New content can be quickly recommended to the user.

E.g. if the user has a history of watching action movies, a newly
released action movie will be recommended by this system.
However, this system does not take into account behavior/data about
other users in the system; hence, if a particular action movie receives
very low ratings or negative reviews from other users, it will still
be recommended to the user.
Content Based Filtering
Techniques used in content-based filtering are:
 TF-IDF (Term Frequency - Inverse Document Frequency)
 Cosine Similarity
Content Based Filtering

TF-IDF
This technique is used in information retrieval and text mining.
TF-IDF, as the name suggests, has two terms:
TF calculates the normalized frequency at which a given term appears in
a document.
IDF calculates the importance of a term in general. E.g. the terms
'recommendation', 'system' and 'movie' convey more information about the
document than terms like 'the', 'and', 'are'.
Content Based Filtering

TF-IDF

TF: It measures the frequency of a term in the document.
Since the size of a document may vary, it would be misleading to
use a simple count, so the count is normalized:
TF(w) = (Number of times term w appears in the document) /
(Total number of terms in the document)
Content Based Filtering
TF-IDF

IDF: It measures the overall importance of a given term. Since
commonly used terms like 'is', 'the' and 'are' don't usually provide
information about the document, the IDF for these terms is low.

IDF is calculated as:
IDF(t) = log_e(Total number of documents / Number of
documents containing term t)
Content Based Filtering
TF-IDF

In the content-based filtering technique, TF-IDF can be useful for
determining products which are similar to a given product.
Since this technique is keyword based, it is most useful in
areas with rich textual data, e.g. book recommendation, as shown in the
sketch below.
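Below is a minimal from-scratch sketch of the TF and IDF formulas above, applied to a tiny illustrative corpus; the function names and documents are assumptions for demonstration only.

```python
# A minimal sketch of TF-IDF computed from scratch, following the TF and IDF
# formulas above. The corpus and function names are illustrative.
import math

def tf(term, document_tokens):
    # Normalized term frequency: count of the term / total terms in the document
    return document_tokens.count(term) / len(document_tokens)

def idf(term, corpus_tokens):
    # log_e(total documents / documents containing the term)
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing) if n_containing else 0.0

def tf_idf(term, document_tokens, corpus_tokens):
    return tf(term, document_tokens) * idf(term, corpus_tokens)

corpus = [
    "a gripping action movie with a daring heist".split(),
    "a romantic drama about two poets".split(),
    "a quiet documentary about the ocean".split(),
]
print(tf_idf("action", corpus[0], corpus))  # distinctive term -> positive score
print(tf_idf("a", corpus[0], corpus))       # appears in every document -> score of 0
```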
Content Based Filtering
Cosine-Similarity

As the name suggests, this method calculates the cosine of the
angle between two vectors, and so provides an estimate of the
similarity between two objects as a measure of the angle
between the vectors that represent them.
Cosine similarity is calculated by taking the dot product of the
two vectors and dividing it by the product of their magnitudes.
Content Based Filtering
This algorithm recommends products which are similar to
the ones that a user has liked in the past.
Content Based Filtering
Cosine-Similarity
For example: Netflix saves all the information related to each user in a
vector. This vector contains the past behavior of the user, i.e. the
movies liked/disliked by the user and the ratings given by them. This
vector is known as the profile vector.

All the information related to movies is stored in another vector called
the item vector. The item vector contains the details of each movie, such as
genre, cast, director, etc.
Content Based Filtering
Cosine-Similarity
The content-based filtering algorithm finds the cosine of the
angle between the profile vector and the item vector,
i.e. the cosine similarity. Suppose A is the profile vector and B
is the item vector; then the similarity between them can be
calculated as:

sim(A, B) = cos(θ) = (A · B) / (||A|| × ||B||)
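A minimal sketch of this cosine-similarity calculation in plain Python; the profile and item vectors are illustrative assumptions.

```python
# A minimal sketch of the cosine-similarity formula above.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

profile_vector = [5, 3, 0, 1]   # e.g. a user's affinity for four genres
item_vector    = [4, 2, 0, 1]   # e.g. a movie's weight on the same genres
print(cosine_similarity(profile_vector, item_vector))  # close to 1 -> very similar
```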

Content Based Filtering
Cosine-Similarity
Based on the cosine value, which ranges between -1 and 1,
the movies are arranged in descending order and one of the
two approaches below is used for recommendation:
 Top-n approach: the top n movies are recommended
(here n can be decided by the business)
 Rating scale approach: a threshold is set and all the
movies above that threshold are recommended
Content Based Filtering
Other methods that can be used to calculate the similarity
are listed below; a code sketch of both follows the list.
 Euclidean Distance: similar items will lie in close
proximity to each other if plotted in n-dimensional space. So,
we can calculate the distance between items and, based on
that distance, recommend items to the user. The Euclidean
distance between items x and y is given by:
d(x, y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + ... + (xn − yn)^2)
 Pearson's Correlation: it tells us how strongly two items
are correlated; the higher the correlation, the greater the
similarity. Pearson's correlation can be calculated using
the following formula:
corr(x, y) = Σ(xi − x̄)(yi − ȳ) / ( sqrt(Σ(xi − x̄)^2) × sqrt(Σ(yi − ȳ)^2) )
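Here is a minimal sketch of both similarity measures from the list above; the item vectors are illustrative assumptions.

```python
# A minimal sketch of the Euclidean-distance and Pearson-correlation measures.
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def pearson_correlation(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return num / den if den else 0.0

item_a = [4, 1, 2, 4]
item_b = [2, 4, 4, 2]
print(euclidean_distance(item_a, item_b))   # smaller distance -> more similar
print(pearson_correlation(item_a, item_b))  # closer to +1 -> more similar
```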
Content Based Filtering
A major drawback of this algorithm is that it is limited to
recommending items that are of the same type. It will never
recommend products which the user has not bought or liked
in the past.
So if a user has watched or liked only action movies in the
past, the system will recommend only action movies. It’s a
very narrow way of building an engine.
Collaborative Filtering
The collaborative filtering algorithm uses "user behavior" for
recommending items. This is one of the most commonly
used algorithms in industry, as it is not dependent on any
additional information.

Let us understand this with an example:
If person A likes 3 movies, say M1, M2 and M3, and person
B likes M2, M3 and M4, then they have almost similar
interests. We can say with some certainty that A should like
M4 and B should like M1.
Collaborative Filtering
User-User collaborative filtering
 This algorithm first finds the similarity score between users.
 Based on this similarity score, it then picks out the most
similar users and recommends products which these similar
users have liked or bought previously.
Collaborative Filtering
 In terms of our movies example from earlier, this
algorithm finds the similarity between each user
based on the ratings they have previously given to
different movies.
 The prediction of an item for a user u is calculated by
computing the weighted sum of the user ratings
given by other users to an item i.
Collaborative Filtering
The prediction Pu,i is given by:
Pu,i = Σv (Su,v × Rv,i) / Σv Su,v
where the sum runs over the users v who are most similar to user u.
Collaborative Filtering
Here,
Pu,i is the prediction of item i for user u
Rv,i is the rating given by user v to movie i
Su,v is the similarity between users u and v
Collaborative Filtering
Now, we have the ratings given by users in their profile vectors, and based on
these we have to predict the ratings for other users. The following
steps are followed to do so:

1. For the predictions we need the similarity between user u and
user v. We can make use of the Pearson correlation.

2. First we find the items rated by both users, and based on
those ratings the correlation between the users is calculated.
Collaborative Filtering

3. The predictions can be calculated using the similarity
values. The algorithm first calculates the similarity
between each pair of users and then, based on each similarity,
calculates the predictions. Users having a higher
correlation will tend to be more similar.
Collaborative Filtering
Based on these prediction values, recommendations are made. Let us
understand it with an example:

Consider the user-movie rating matrix:


User/Movie    M1   M2   M3   M4   M5   Mean Rating
A              4    1    -    4    -    3
B              -    4    -    2    3    3
C              -    1    -    4    4    3
Collaborative Filtering
Let's find the similarity between users (A, C) and (B, C) in
the above table. The common movies rated by A and C are
M2 and M4, and those rated by B and C are movies M2, M4
and M5.
Collaborative Filtering
 The correlation between user A and C is more than the
correlation between B and C. Hence users A and C
have more similarity and the movies liked by user A will
be recommended to user C and vice versa.
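A minimal sketch of this user-user similarity on the rating table above, using Pearson correlation over the commonly rated movies and the mean ratings from the table; the dictionary layout and function name are illustrative.

```python
# A minimal sketch of user-user similarity on the rating table above,
# using Pearson correlation over commonly rated movies.
import math

ratings = {
    "A": {"M1": 4, "M2": 1, "M4": 4},
    "B": {"M2": 4, "M4": 2, "M5": 3},
    "C": {"M2": 1, "M4": 4, "M5": 4},
}
mean_rating = {"A": 3, "B": 3, "C": 3}   # mean ratings from the table

def user_similarity(u, v):
    common = set(ratings[u]) & set(ratings[v])
    num = sum((ratings[u][m] - mean_rating[u]) * (ratings[v][m] - mean_rating[v])
              for m in common)
    den = math.sqrt(sum((ratings[u][m] - mean_rating[u]) ** 2 for m in common)) * \
          math.sqrt(sum((ratings[v][m] - mean_rating[v]) ** 2 for m in common))
    return num / den if den else 0.0

print(user_similarity("A", "C"))  # 1.0         -> A and C are highly similar
print(user_similarity("B", "C"))  # about -0.87 -> B and C are dissimilar
```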
Collaborative Filtering
Drawback:
This algorithm is quite time consuming, as it involves
calculating the similarity for every pair of users and then
calculating a prediction for each similarity score.
One way of handling this problem is to select only a few
users (neighbors) instead of all of them.
Collaborative Filtering
There are various ways to select the neighbors:

1. Select a similarity threshold and choose all the users
above that value.

2. Randomly select the users.

3. Arrange the neighbors in descending order of their
similarity value and choose the top-N users.

4. Use clustering for choosing neighbors.
Collaborative Filtering
This algorithm is useful when the number of users is small.
It is not effective when there are a large number of users, as it will
take a lot of time to compute the similarity between all user pairs.

This leads us to item-item collaborative filtering, which is effective
when the number of users is greater than the number of items being
recommended.
Collaborative Filtering
Item-item collaborative filtering
 In this algorithm, we compute the similarity between
each pair of items.
 This algorithm works similarly to user-user collaborative
filtering, with just a little change: instead of taking the weighted
sum of ratings of "user-neighbors", we take the weighted sum of
ratings of "item-neighbors". The prediction is given by:
Pu,i = ΣN (Si,N × Ru,N) / ΣN Si,N
where the sum runs over the item-neighbors N of item i, Si,N is the
similarity between items i and N, and Ru,N is the rating that user u
has given to item N.
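A minimal sketch of this prediction formula; the similarity scores and ratings below are illustrative assumptions rather than values computed from the earlier tables.

```python
# A minimal sketch of the item-item prediction: a weighted sum of the user's
# ratings of neighbour items, weighted by item similarity.
def predict_rating(user_ratings, item_similarities):
    # user_ratings: ratings the user gave to neighbour items N
    # item_similarities: similarity S(i, N) between the target item i and each N
    num = sum(item_similarities[n] * user_ratings[n] for n in user_ratings)
    den = sum(item_similarities[n] for n in user_ratings)
    return num / den if den else 0.0

user_ratings = {"M4": 4, "M5": 3}           # ratings user u gave to neighbours of i
item_similarities = {"M4": 0.9, "M5": 0.4}  # S(i, M4), S(i, M5)
print(predict_rating(user_ratings, item_similarities))  # weighted-average prediction
```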
Collaborative Filtering
item-Item collaborative filtering
Now, as we have the similarity between each movie and the ratings,
predictions are made and based on those predictions, similar movies
are recommended. Let us understand it with an example.

User/Movie    M1   M2   M3   M4   M5
A              4    1    2    4    4
B              2    4    4    2    1
C              -    1    -    3    4
Mean Rating    3    2    3    3    3
Collaborative Filtering
item-Item collaborative filtering

Here the mean item rating is the average of all the
ratings given to a particular item (compare it with the
table we saw in user-user filtering). Instead of finding the
user-user similarity as we saw earlier, we find the item-item
similarity.
Collaborative Filtering
item-Item collaborative filtering

To do this, we first need to find the users who have rated both
items; based on their ratings, the similarity between the items is
calculated.
Let us find the similarity between movies (M1, M4) and (M1, M5).
The common users who have rated movies M1 and M4 are A and B,
while the users who have rated movies M1 and M5 are also A and B.
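A minimal sketch of this item-item similarity on the rating table above, using mean-centered (adjusted) cosine similarity over the common users; the choice of mean-centering is an assumption, but it reproduces the ordering discussed below.

```python
# A minimal sketch of item-item similarity on the table above, using
# mean-centered cosine similarity over the users who rated both items.
import math

item_ratings = {
    "M1": {"A": 4, "B": 2},
    "M4": {"A": 4, "B": 2, "C": 3},
    "M5": {"A": 4, "B": 1, "C": 4},
}
mean_item_rating = {"M1": 3, "M4": 3, "M5": 3}   # mean ratings from the table

def item_similarity(i, j):
    common_users = set(item_ratings[i]) & set(item_ratings[j])
    ci = [item_ratings[i][u] - mean_item_rating[i] for u in common_users]
    cj = [item_ratings[j][u] - mean_item_rating[j] for u in common_users]
    num = sum(a * b for a, b in zip(ci, cj))
    den = math.sqrt(sum(a * a for a in ci)) * math.sqrt(sum(b * b for b in cj))
    return num / den if den else 0.0

print(item_similarity("M1", "M4"))  # 1.0   -> M1 and M4 are very similar
print(item_similarity("M1", "M5"))  # ~0.95 -> slightly less similar
```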
Collaborative Filtering
item-Item collaborative filtering

The similarity between movies M1 and M4 is greater than the
similarity between movies M1 and M5.
So, based on these similarity values, if any user searches for
movie M1, they will be recommended movie M4, and vice
versa.
Q. What will happen if a new user or a new item is added to
the dataset?
This situation is called a Cold Start. There can be two types of cold
start:
Collaborative Filtering
item-Item collaborative filtering

1. Visitor Cold Start:
This means that a new user is introduced into the dataset. Since
there is no history for that user, the system does not know
the preferences of that user, and it becomes harder to
recommend products to them.
Collaborative Filtering
item-Item collaborative filtering

1. Visitor Cold Start (continued):
So, how can we solve this problem?
One basic approach could be to apply a popularity-based
strategy, i.e. recommend the most popular products.
Once we know the preferences of the user, recommending
products becomes easier.
Collaborative Filtering
item-Item collaborative filtering

2. Product Cold Start:
This means that a new product is launched in the market or
added to the system.
User interaction is most important in determining the value of
any product: the more interaction a product receives, the easier
it is to recommend that product to the right user.
Artificial Neural Network
 Neural networks involve long training times and are therefore
more suitable for applications where this is feasible.
 They require a number of parameters that are typically best
determined empirically such as the network topology or
“structure.”
 Neural networks have been criticized for their poor
interpretability. For example, it is difficult for humans to
interpret the symbolic meaning behind the learned weights and
the "hidden units" in the network. These features initially made
neural networks less desirable for data mining.
Artificial Neural Network
 “What is backpropagation?” Backpropagation is a neural network
learning algorithm.
 The neural networks field was originally kindled by psychologists and
neurobiologists who sought to develop and test computational analogs
of neurons.
 Roughly speaking, a neural network is a set of connected input/output
units in which each connection has a weight associated with it. During
the learning phase, the network learns by adjusting the weights so as to
be able to predict the correct class label of the input tuples.
 Neural network learning is also referred to as connectionist learning
due to the connections between units.
Artificial Neural Network
 Advantages of neural networks
include their high tolerance of noisy data as well as their ability to classify
patterns on which they have not been trained.
They can be used when you have little knowledge of the relationships
between attributes and classes.
They are well suited for continuous-valued inputs and outputs, unlike most
decision tree algorithms. They have been successful on a wide array of
real-world data, including handwritten character recognition, pathology and
laboratory medicine, and training a computer to pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques
can be used to speed up the computation process. In addition, several
techniques have been recently developed for rule extraction from trained
neural networks. These factors contribute to the usefulness of neural
networks for classification and numeric prediction in data mining.
Artificial Neural Network
 There are many different kinds of neural networks and neural network
algorithms. The most popular neural network algorithm is
backpropagation, which gained repute in the 1980s.
Multilayer Feedforward Neural Network
 The backpropagation algorithm performs learning on a multilayer feed-
forward neural network. It iteratively learns a set of weights for
prediction of the class label of tuples. A multilayer feed-forward neural
network consists of an input layer, one or more hidden layers, and an
output layer.
Multilayer Feedforward Neural Network
[Figure 9.2: a multilayer feed-forward neural network]
Multilayer Feedforward Neural Network
 Each layer is made up of units. The inputs to the network correspond to
the attributes measured for each training tuple.
 The inputs are fed simultaneously into the units making up the input
layer.
 These inputs pass through the input layer and are then weighted and fed
simultaneously to a second layer of “neuronlike” units, known as a hidden
layer.
 The outputs of the hidden layer units can be input to another hidden
layer, and so on.
 The number of hidden layers is arbitrary, although in practice, usually
only one is used.
 The weighted outputs of the last hidden layer are input to units making
up the output layer, which emits the network's prediction for the given tuples.
Multilayer Feedforward Neural Network
 The units in the input layer are called input units. The units in the
hidden layers and output layer are sometimes referred to as neurodes,
due to their symbolic biological basis, or as output units.
 The multilayer neural network shown in Figure 9.2 has two layers of
output units. Therefore, we say that it is a two-layer neural network.
(The input layer is not counted because it serves only to pass the input
values to the next layer.)
 Similarly, a network containing two hidden layers is called a three-layer
neural network, and so on. It is a feed-forward network since none of
the weights cycles back to an input unit or to a previous layer’s output
unit. It is fully connected in that each unit provides input to each unit
in the next forward layer.
Multilayer Feedforward Neural Network
 Each output unit takes, as input, a weighted sum of the outputs from
units in the previous layer.
 It applies a nonlinear (activation) function to the weighted input.
Multilayer feed-forward neural networks are able to model the class
prediction as a nonlinear combination of the inputs.
 From a statistical point of view, they perform nonlinear regression.
Multilayer feed-forward networks, given enough hidden units and
enough training samples, can closely approximate any function.
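To make the structure concrete, here is a minimal NumPy sketch of one forward pass through a fully connected feed-forward network with a single hidden layer. The input tuple, layer sizes, and random weights are illustrative assumptions.

```python
# A minimal sketch of a fully connected feed-forward pass with one hidden
# layer, matching the description above. Weights and inputs are illustrative.
import numpy as np

def sigmoid(x):
    # Logistic (squashing) activation, also used later by backpropagation
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])          # one input tuple with 3 attributes

W_hidden = rng.normal(size=(3, 2))     # weights: input layer -> hidden layer
b_hidden = rng.normal(size=2)          # hidden-layer biases
W_out = rng.normal(size=(2, 1))        # weights: hidden layer -> output layer
b_out = rng.normal(size=1)             # output-layer bias

hidden = sigmoid(x @ W_hidden + b_hidden)   # hidden-layer outputs
output = sigmoid(hidden @ W_out + b_out)    # network's prediction
print(output)
```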
Backpropagation Algorithm
 The weights and biases are first initialized to small random numbers,
and each training tuple is fed to the network's input layer. The inputs pass
through the input units unchanged; that is, for an input unit j, its output,
Oj, is equal to its input value, Ij. Next, the net input and
output of each unit in the hidden and output layers are computed.
 The net input to a unit in the hidden or output layers is computed as a
linear combination of its inputs.
 To compute the net input to the unit, each input connected to the unit
is multiplied by its corresponding weight, and these products are summed.
Given a unit j in a hidden or output layer, the net input, Ij, to unit j is
Ij = Σi wij Oi + θj
Backpropagation Algorithm

where wij is the weight of the connection from unit i in
the previous layer to unit j; Oi is the output of unit i
from the previous layer; and θj is the bias of the unit.
The bias acts as a threshold in that it serves to vary
the activity of the unit.
Backpropagation Algorithm

 Given the net input Ij to unit j, then Oj,
the output of unit j, is computed as
Oj = 1 / (1 + e^(−Ij))
Backpropagation Algorithm

This function is also referred to as a squashing function,
because it maps a large input domain onto the smaller
range of 0 to 1. The logistic function is nonlinear and
differentiable, allowing the backpropagation algorithm to
model classification problems that are linearly inseparable.
Backpropagation Algorithm

We compute the output values, Oj, for each hidden
layer, up to and including the output layer, which gives
the network's prediction.
Backpropagation Algorithm

Backpropagate the error: The error is propagated
backward by updating the weights and biases to reflect
the error of the network's prediction. For a unit j in the
output layer, the error Errj is computed by
Errj = Oj (1 − Oj)(Tj − Oj)
where Oj is the actual output of unit j, and Tj is the
known target value of the given training tuple. Note that
Oj(1 − Oj) is the derivative of the logistic function.
To compute the error of a hidden layer unit j, the weighted
sum of the errors of the units connected to unit j in the
next layer is considered. The error of a hidden layer unit j is
Errj = Oj (1 − Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to a
unit k in the next higher layer, and Errk is the error of unit k.
The weights and biases are updated to reflect the
propagated errors. Weights are updated by the following
equations, where Δwij is the change in weight wij:
Δwij = (l) Errj Oi
wij = wij + Δwij
Here l is the learning rate, a constant typically between 0.0 and 1.0.
Biases are updated similarly: Δθj = (l) Errj and θj = θj + Δθj.
 Terminating condition: Training stops when (1) all Δwij in the previous epoch are so
small as to be below some specified threshold, or (2) the percentage of tuples
misclassified in the previous epoch is below some threshold, or (3) a prespecified
number of epochs has expired. In practice, several hundred thousand
epochs may be required before the weights converge.
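The following minimal NumPy sketch puts the steps above together for a single-hidden-layer network: forward propagation with logistic units, the output- and hidden-layer error terms, and the weight/bias updates. The training tuples, layer sizes, and learning rate are illustrative assumptions, and only one epoch is shown.

```python
# A minimal sketch of one epoch of backpropagation for a single-hidden-layer
# network, following the equations above (logistic units, Errj terms, and the
# Δw / Δθ updates). Data, sizes, and the learning rate are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])  # training tuples
T = np.array([[1.0], [0.0]])                      # known target values

# Small random initial weights and zero biases
W_h, b_h = rng.normal(scale=0.5, size=(3, 2)), np.zeros(2)
W_o, b_o = rng.normal(scale=0.5, size=(2, 1)), np.zeros(1)
l = 0.5  # learning rate

for x, t in zip(X, T):
    # Propagate the inputs forward
    O_h = sigmoid(x @ W_h + b_h)       # hidden-layer outputs
    O_o = sigmoid(O_h @ W_o + b_o)     # output-layer outputs (prediction)

    # Backpropagate the error
    Err_o = O_o * (1 - O_o) * (t - O_o)          # output-layer error
    Err_h = O_h * (1 - O_h) * (W_o @ Err_o)      # hidden-layer error

    # Update weights and biases: Δw_ij = l * Err_j * O_i, Δθ_j = l * Err_j
    W_o += l * np.outer(O_h, Err_o)
    b_o += l * Err_o
    W_h += l * np.outer(x, Err_h)
    b_h += l * Err_h
```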
Backpropagation Algorithm
 “How efficient is backpropagation?” The computational efficiency depends on
the time spent training the network. Given |D| tuples and w weights, each
epoch requires O(|D| × w) time. However, in the worst-case scenario, the
number of epochs can be exponential in n, the number of inputs. In practice,
the time required for the networks to converge is highly variable. A number of
techniques exist that help speed up the training time. For example, a technique
known as simulated annealing can be used, which also ensures convergence to
a global optimum
“How can we classify an unknown tuple using a trained network?”
To classify an unknown tuple, X, the tuple is input to the trained network, and the
net input and output of each unit are computed. (There is no need for computation
and/or backpropagation of the error.)
 If there is one output node per class, then the output node with the highest value
determines the predicted class label for X.
 If there is only one output node, then output values greater than or equal to 0.5
may be considered as belonging to the positive class, while values less than 0.5 may
be considered negative.
 Several variations and alternatives to the backpropagation algorithm have been
proposed for classification in neural networks. These may involve the dynamic
adjustment of the network topology and of the learning rate or other parameters, or
the use of different error functions.
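A minimal sketch of the two decision rules described above; the output values are illustrative numbers rather than the outputs of an actual trained network.

```python
# A minimal sketch of classifying an unknown tuple from a trained network's
# output values, using the two decision rules above. Values are illustrative.
import numpy as np

# One output node per class: pick the node with the highest value
class_outputs = np.array([0.12, 0.81, 0.35])     # e.g. outputs for classes 0, 1, 2
predicted_class = int(np.argmax(class_outputs))  # -> 1

# Single output node: threshold at 0.5 for the positive class
single_output = 0.73
predicted_label = "positive" if single_output >= 0.5 else "negative"
print(predicted_class, predicted_label)
```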
What is Deep Learning (DL)

• A machine learning subfield focused on learning
representations of data; exceptionally effective at
learning patterns.
• Deep learning algorithms attempt to learn (multiple
levels of) representation by using a hierarchy of
multiple layers.
• If you provide the system tons of information, it
begins to understand it and respond in useful ways.
What is Deep Learning (DL)
Why is DL useful?
o Manually designed features are often over-specified,
incomplete, and take a long time to design and validate.
o Learned features are easy to adapt and fast to learn.
o Deep learning provides a very flexible, (almost?)
universal, learnable framework for representing world,
visual and linguistic information.
o It can learn in both unsupervised and supervised settings.
o It enables effective end-to-end joint system learning.
