Recommender System Notes
Recommender Systems (RSs) are software tools and techniques providing suggestions for items to be of use
to a user. The suggestions relate to various decision-making processes, such as what items to buy, what music
to listen to, or what online news to read. Item is the general term used to denote what the system recommends
to users. A RS normally focuses on a specific type of item (e.g., CDs, or news) and accordingly its design, its
graphical user interface, and the core recommendation technique used to generate the recommendations are all
customized to provide useful and effective suggestions for that specific type of item. RSs are primarily directed
towards individuals who lack sufficient personal experience or competence to evaluate the potentially
overwhelming number of alternative items that a Web site, for example, may offer.
A case in point is a book recommender system that assists users in selecting a book to read. The popular Web site
Amazon.com, for example, employs a RS to personalize the online store for each customer. Since recommendations
are usually personalized, different users or user groups receive diverse suggestions. In addition there are also
non-personalized recommendations. These are much simpler to generate and are normally featured in
magazines or newspapers. Typical examples include the top ten selections of books, CDs etc. While they may
be useful and effective in certain situations, these types of non-personalized recommendations are not typically
addressed by RS research. In their simplest form, personalized recommendations are offered as ranked lists of
items. In performing this ranking, RSs try to predict what the most suitable products or services are, based on
the user's preferences and constraints. In order to complete such a computational task, RSs collect from users
their preferences, which are either explicitly expressed, e.g., as ratings for products, or are inferred by
interpreting user actions. For instance, a RS may consider the navigation to a particular product page as an
implicit sign of preference for the items shown on that page. In seeking to mimic the everyday social process of relying on recommendations from others, the first RSs
applied algorithms to leverage recommendations produced by a community of users to deliver
recommendations to an active user, i.e., a user looking for suggestions. The recommendations were for items
that similar users (those with similar tastes) had liked. This approach is termed collaborative filtering and its
rationale is that if the active user agreed in the past with some users, then the other recommendations coming
from these similar users should be relevant as well and of interest to the active user.
As noted above, the study of recommender systems is relatively new compared to research into other classical
information system tools and techniques (e.g., databases or search engines). Recommender systems emerged as
an independent research area in the mid-1990s.
In recent years, the interest in recommender systems has dramatically increased, as the following facts indicate:
1. Recommender systems play an important role in such highly rated Internet sites as Amazon.com, YouTube,
Netflix, Yahoo, Tripadvisor, Last.fm, and IMDb. Moreover many media companies are now developing and
deploying RSs as part of the services they provide to their subscribers. For example, Netflix, the online movie
rental service, awarded a million-dollar prize to the team that first succeeded in substantially improving the
performance of its recommender system.
2. There are dedicated conferences and workshops related to the field. We refer specifically to ACM
Recommender Systems (RecSys), established in 2007 and now the premier annual event in recommender
technology research and applications. In addition, sessions dedicated to RSs are frequently included in the more
traditional conferences in the areas of databases, information systems, and adaptive systems. Among these
conferences, worth mentioning are ACM's Special Interest Group on Information Retrieval (SIGIR), User
Modeling, Adaptation and Personalization (UMAP), and ACM's Special Interest Group on Management Of
Data (SIGMOD).
3. At institutions of higher education around the world, undergraduate and graduate courses are now dedicated
entirely to RSs; tutorials on RSs are very popular at computer science conferences; and recently a book
introducing RS techniques was published.
4. There have been several special issues in academic journals covering research and developments in the RS
field. Among the journals that have dedicated issues to RS are: AI Communications (2008); IEEE Intelligent
Systems (2007); International Journal of Electronic Commerce (2006); International Journal of Computer
Science and Applications (2006); ACM Transactions on Computer-Human Interaction (2005); and ACM
Transactions on Information Systems (2004).
In general, we can say that, from the service provider's point of view, the primary goal for introducing a RS is to
increase the conversion rate, i.e., the number of users that accept the recommendation and consume an item,
compared to the number of simple visitors that just browse through the information.
Sell more diverse items. Another major function of a RS is to enable the user to select items that might be hard
to find without a precise recommendation. For instance, in a movie RS such as Netflix, the service provider is
interested in renting all the DVDs in the catalogue, not just the most popular ones. This could be difficult
without a RS, since the service provider cannot afford the risk of advertising movies that are not likely to suit a
particular user's taste. Therefore, a RS suggests or advertises unpopular movies to the right users.
Increase user satisfaction. A well-designed RS can also improve the experience of the user with the site or
the application. The user will find the recommendations interesting, relevant and, with a properly designed
human-computer interaction, she will also enjoy using the system. The combination of effective, i.e., accurate,
recommendations and a usable interface will increase the user's subjective evaluation of the system. This in turn
will increase system usage and the likelihood that the recommendations will be accepted.
Increase user fidelity. A user should be loyal to a Web site which, when visited, recognizes the old customer
and treats him as a valuable visitor. This is a normal feature of a RS since many RSs compute recommendations,
leveraging the information acquired from the user in previous interactions, e.g., her ratings of items.
Consequently, the longer the user interacts with the site, the more refined her user model becomes, i.e., the
system's representation of the user's preferences, and the more effectively the recommender output can be
customized to match the user's preferences.
Better understand what the user wants. Another important function of a RS, which can be leveraged in many
other applications, is the description of the user's preferences, either collected explicitly or predicted by the
system. The service provider may then decide to re-use this knowledge for a number of other goals such as
improving the management of the items stock or production. For instance, in the travel domain, destination
management organizations can decide to advertise a specific region to new customer sectors or advertise a
particular type of promotional message derived by analyzing the data collected by the RS (transactions of the
users).
We mentioned above some important motivations as to why e-service providers introduce RSs. But users also
may want a RS, if it will effectively support their tasks or goals. Consequently a RS must balance the needs of
these two players and offer a service that is valuable to both.
Herlocker et al., in a paper that has become a classical reference in this field, define eleven popular tasks that a
RS can assist in implementing. Some may be considered as the main or core tasks that are normally associated
with a RS, i.e., to offer suggestions for items that may be useful to a user. Others might be considered as more
opportunistic ways to exploit a RS. As a matter of fact, this task differentiation is very similar to what
happens with a search engine. Its primary function is to locate documents that are relevant to the user's
information need, but it can also be used to check the importance of a Web page (looking at the position of the
page in the result list of a query) or to discover the various usages of a word in a collection of documents.
Find some good items: Recommend to a user some items as a ranked list, along with predictions of how much
the user would like them (e.g., on a one-to-five star scale). This is the main recommendation task that many
commercial systems address (see, for instance, Chapter 9). Some systems do not show the predicted rating.
Find all good items: Recommend all the items that can satisfy some user needs. In such cases it is insufficient
to just find some good items. This is especially true when the number of items is relatively small or when the
RS is mission-critical, such as in medical or financial applications. In these situations, in addition to the benefit
derived from carefully examining all the possibilities, the user may also benefit from the RS ranking of these
items or from additional explanations that the RS generates.
Annotation in context: Given an existing context, e.g., a list of items, emphasize some of them depending on
the user's long-term preferences. For example, a TV recommender system might annotate which TV shows
displayed in the electronic program guide (EPG) are worth watching (Chapter 18 provides interesting examples
of this task).
Recommend a sequence: Instead of focusing on the generation of a single recommendation, the idea is to
recommend a sequence of items that is pleasing as a whole. Typical examples include recommending a TV
series; a book on RSs after having recommended a book on data mining; or a compilation of musical tracks.
Recommend a bundle: Suggest a group of items that fits well together. For instance, a travel plan may be
composed of various attractions, destinations, and accommodation services that are located in a delimited area.
From the point of view of the user these various alternatives can be considered and selected as a single travel
destination.
Just browsing: In this task, the user browses the catalog without any imminent intention of purchasing an item.
The task of the recommender is to help the user browse the items that are more likely to fall within the scope
of the user's interests for that specific browsing session. This is a task that has also been supported by adaptive
hypermedia techniques.
Find credible recommender: Some users do not trust recommender systems, so they play with them to see
how good they are at making recommendations. Hence, some systems may also offer specific functions that let
users test their behavior, in addition to those required simply for obtaining recommendations.
Improve the profile: This relates to the capability of the user to provide (input) information to the
recommender system about what he likes and dislikes. This is a fundamental task that is strictly necessary to
provide personalized recommendations. If the system has no specific knowledge about the active user then it
can only provide him with the same recommendations that would be delivered to an average user.
Express self: Some users may not care about the recommendations at all. Rather, what is important to them
is that they be allowed to contribute their ratings and express their opinions and beliefs. The user's
satisfaction with that activity can still act as leverage, holding the user tightly to the application (as we
mentioned above in discussing the service provider's motivations).
Help others: Some users are happy to contribute information, e.g., their evaluation of items (ratings),
because they believe that the community benefits from their contribution. This could be a major motivation for
entering information into a recommender system that is not used routinely. For instance, with a car RS, a user
who has already bought her new car is aware that the rating entered in the system is more likely to be useful for
other users than for herself the next time she buys a car.
Influence others: In Web-based RSs, there are users whose main goal is to explicitly influence other users into
purchasing particular products. As a matter of fact, there are also some malicious users that may use the system
just to promote or penalize certain items (see Chapter 25).
In any case, as a general classification, the data used by RSs refer to three kinds of objects: items, users, and
transactions, i.e., relations between users and items.
Items: Items are the objects that are recommended. Items may be characterized by their complexity and their
value or utility. The value of an item may be positive if the item is useful for the user, or negative if the item is
not appropriate and the user made a wrong decision when selecting it. We note that when acquiring an
item a user will always incur a cost, which includes the cognitive cost of searching for the item and the
monetary cost, if any, paid for the item. For instance, the designer of a news RS must take into account the
complexity of a news item, i.e., its structure, the textual representation, and the time-dependent importance of
any news item. But, at the same time, the RS designer must understand that even if the user is not paying for
reading news, there is always a cognitive cost associated with searching for and reading news items. If a selected item
is relevant for the user, this cost is dominated by the benefit of having acquired useful information, whereas if
the item is not relevant, the net value of that item for the user, and of its recommendation, is negative. In other
domains, e.g., cars, or financial investments, the true monetary cost of the items becomes an important element
to consider when selecting the most appropriate recommendation approach. Items with low complexity and
value are: news, Web pages, books, CDs, movies. Items with larger complexity and value are: digital cameras,
mobile phones, PCs, etc. The most complex items that have been considered are insurance policies, financial
investments, travel plans, and jobs. RSs, according to their core technology, can use a range of properties and features of
the items. For example in a movie recommender system, the genre (such as comedy, thriller, etc.), as well as the
director, and actors can be used to describe a movie and to learn how the utility of an item depends on its
features. Items can be represented using various information and representation approaches, e.g., in a minimalist
way as a single id code, or in a richer form as a set of attributes, or even as a concept in an ontological
representation of the domain.
Users: Users of a RS, as mentioned above, may have very diverse goals and characteristics. In order to
personalize the recommendations and the human-computer interaction, RSs exploit a range of information about
the users. This information can be structured in various ways and again the selection of what information to
model depends on the recommendation technique. For instance, in collaborative filtering, users are modeled as a
simple list containing the ratings provided by the user for some items. In a demographic RS, sociodemographic
attributes such as age, gender, profession, and education, are used. User data is said to constitute the user model.
The user model profiles the user, i.e., encodes her preferences and needs. Various user modeling approaches
have been used and, in a certain sense, a RS can be viewed as a tool that generates recommendations by
building and exploiting user models. Since no personalization is possible without a suitable user model
(unless the recommendation is non-personalized, as in the top-10 selection), the user model always plays a
central role. For instance, considering again a collaborative filtering approach, the user is either profiled directly
by her ratings of items or, using these ratings, the system derives a vector of factor values, where users differ in
how much each factor weighs in their model. Users can also be described by their behavior pattern data, for example,
site browsing patterns (in a Web-based recommender system), or travel search patterns (in a travel recommender
system). Moreover, user data may include relations between users, such as the level of trust between them.
A RS might utilize this information to recommend to users items that were preferred by similar
or trusted users.
Transactions: We generically refer to a transaction as a recorded interaction between a user and the RS.
Transactions are log-like data that store important information generated during the human-computer interaction
and which are useful for the recommendation generation algorithm that the system is using. For instance, a
transaction log may contain a reference to the item selected by the user and a description of the context (e.g., the
user goal/query) for that particular recommendation. If available, the transaction may also include explicit
feedback the user has provided, such as the rating for the selected item.
In fact, ratings are the most popular form of transaction data that a RS collects. These ratings may be collected
explicitly or implicitly. In the explicit collection of ratings, the user is asked to provide her opinion about an
item on a rating scale. Ratings can take on a variety of forms (see the sketch after this list):
Numerical ratings such as the 1-5 stars provided in the book recommender associated with Amazon.com.
Ordinal ratings, such as strongly agree, agree, neutral, disagree, strongly disagree where the user is asked to
select the term that best indicates her opinion regarding an item (usually via questionnaire).
Binary ratings that model choices in which the user is simply asked to decide if a certain item is good or bad.
Unary ratings can indicate that a user has observed or purchased an item, or otherwise rated the item
positively. In such cases, the absence of a rating indicates that we have no information relating the user to the
item (perhaps she purchased the item somewhere else).
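As a rough illustration, the four rating forms might be stored as in the following sketch; every user and item identifier and every value here is invented for the example.

```python
# Hypothetical sketch of the four rating forms as simple Python structures.
# All identifiers and values are made up for illustration.

numerical = {("alice", "book_42"): 4}        # e.g., 4 of 5 stars
ordinal = {("alice", "book_42"): "agree"}    # term chosen from an ordered list
binary = {("alice", "book_42"): 1}           # 1 = good, 0 = bad
unary = {("alice", "book_42")}               # presence only: purchase or view observed

# For unary data, the absence of a pair means "no information", not "disliked":
print(("bob", "book_42") in unary)           # False: we simply know nothing about bob
```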
Another form of user evaluation consists of tags associated by the user with the items the system presents. For
instance, in the MovieLens RS (http://movielens.umn.edu) tags represent how MovieLens users feel about a movie,
e.g., "too long" or "acting". In transactions collecting implicit ratings, the system aims to infer the user's
opinion based on the user's actions. For example, if a user enters the keyword "Yoga" at Amazon.com she will
be provided with a long list of books. In return, the user may click on a certain book on the list in order to
receive additional information. At this point, the system may infer that the user is somewhat interested in that
book.
In conversational systems, i.e., systems that support an interactive process, the transaction model is more
refined. In these systems user requests alternate with system actions. That is, the user may request a
recommendation and the system may produce a suggestion list. But it can also request additional user
preferences to provide the user with better results. Here, in the transaction model, the system collects the
various request-response pairs, and may eventually learn to modify its interaction strategy by observing the
outcome of the recommendation process.
Recommendation Techniques
In order to implement its core function, identifying items useful for the user, a RS must predict that an item
is worth recommending. In order to do this, the system must be able to predict the utility of at least some items,
or to compare the utility of items, and then decide what items to recommend based on this comparison.
The prediction step may not be explicit in the recommendation algorithm but we can still apply this unifying
model to describe the general role of a RS.
To illustrate the prediction step of a RS, consider, for instance, a simple, non-personalized, recommendation
algorithm that recommends just the most popular songs. The rationale for using this approach is that, in the absence
of more precise information about the user's preferences, a popular song, i.e., something that is liked (high
utility) by many users, will probably also be liked by a generic user, at least more than another randomly
selected song. Hence the utility of these popular songs is predicted to be reasonably high for this generic user.
This view of the core recommendation computation as the prediction of the utility of an item for a user has been
suggested in the literature. The degree of utility of the user u for the item i is modeled as a (real-valued) function R(u, i), as
is normally done in collaborative filtering by considering the ratings of users for items. Then the fundamental
task of a collaborative filtering RS is to predict the value of R over pairs of users and items, i.e., to compute
R̂(u, i), where we denote with R̂ the estimate, computed by the RS, of the true function R. Consequently,
having computed this prediction for the active user u on a set of items, i.e., R̂(u, i_1), . . . , R̂(u, i_N), the system
will recommend the items i_{j_1}, . . . , i_{j_K} (K ≪ N) with the largest predicted utility. K is typically a small
number, i.e., much smaller than the cardinality of the item data set or of the items on which a user utility prediction
can be computed; that is, RSs filter the items that are recommended to users.
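The following is a minimal sketch of this filtering step: given any rating estimator standing in for R̂ (here a hypothetical lookup table), score the candidate items for the active user and keep the K with the largest predicted utility.

```python
from typing import Callable

def recommend(user: str,
              candidates: list[str],
              predict: Callable[[str, str], float],
              k: int = 5) -> list[tuple[str, float]]:
    """Score every candidate item with R_hat(u, i) and return the K best."""
    scored = [(item, predict(user, item)) for item in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy usage with a made-up predictor that just looks estimates up in a table.
toy_estimates = {("u1", "i1"): 4.5, ("u1", "i2"): 2.0, ("u1", "i3"): 3.8}
print(recommend("u1", ["i1", "i2", "i3"],
                lambda u, i: toy_estimates[(u, i)], k=2))
# -> [('i1', 4.5), ('i3', 3.8)]
```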
As mentioned above, some recommender systems do not fully estimate the utility before making a
recommendation, but they may apply some heuristics to hypothesize that an item is of use to a user. This is
typical, for instance, in knowledge-based systems. These utility predictions are computed with specific
algorithms (see below) and use various kinds of knowledge about users, items, and the utility function itself. For
instance, the system may assume that the utility function is Boolean and therefore it will just determine whether
an item is or is not useful for the user. Consequently, assuming that there is some available knowledge (possibly
none) about the user who is requesting the recommendation, knowledge about items, and other users who
received recommendations, the system will leverage this knowledge with an appropriate algorithm to generate
various utility predictions and hence recommendations.
To provide a first overview of the different types of RSs, we quote a taxonomy provided by [25] that has
become a classical way of distinguishing between recommender systems and referring to them. [25]
distinguishes between six different classes of recommendation approaches:
Content-based: The system learns to recommend items that are similar to the ones that the user liked in the
past. The similarity of items is calculated based on the features associated with the compared items. For
example, if a user has positively rated a movie that belongs to the comedy genre, then the system can learn to
recommend other movies from this genre. Chapter 3 provides an overview of content-based recommender
systems, imposing some order among the extensive and diverse aspects involved in their design and
implementation. It presents the basic concepts and terminology of content-based RSs, their high-level
architecture, and their main advantages and drawbacks. The chapter then surveys state-of-the-art systems that
have been adopted in several application domains. The survey encompasses a thorough description of both
classical and advanced techniques for representing items and user profiles. Finally, it discusses trends and future
research which might lead towards the next generation of recommender systems.
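As a small illustration of the content-based idea (not of the specific techniques surveyed in Chapter 3), the sketch below represents items as made-up binary genre vectors and ranks the remaining items by cosine similarity to an item the user liked.

```python
import math

# Hypothetical items described by binary genre features.
items = {
    "movie_a": {"comedy": 1, "romance": 1},
    "movie_b": {"comedy": 1, "action": 1},
    "movie_c": {"horror": 1, "action": 1},
}

def cosine(x: dict, y: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(x[f] * y.get(f, 0) for f in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

liked = "movie_a"  # an item the user rated positively
ranked = sorted(((i, cosine(items[liked], f)) for i, f in items.items() if i != liked),
                key=lambda pair: pair[1], reverse=True)
print(ranked)  # movie_b shares the comedy feature, so it ranks first
```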
Collaborative filtering: The simplest and original implementation of this approach [93] recommends to the
active user the items that other users with similar tastes liked in the past. The similarity in taste of two users is
calculated based on the similarity in the rating history of the users. This is the reason why [94] refers to
collaborative filtering as people-to-people correlation. Collaborative filtering is considered to be the most
popular and widely implemented technique in RSs.
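A minimal sketch of this people-to-people idea follows: it predicts the active user's rating for an item as the similarity-weighted average of the ratings of like-minded users. The ratings below are invented, and real systems add refinements (mean-centering, neighborhood selection) omitted here.

```python
import math

ratings = {
    "ann":  {"i1": 5, "i2": 3, "i3": 4},
    "bob":  {"i1": 4, "i2": 3, "i3": 5},
    "carl": {"i1": 1, "i2": 5},
}

def sim(u: str, v: str) -> float:
    """Cosine similarity of two users over the items both have rated."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(user: str, item: str) -> float:
    """Similarity-weighted average of neighbors' ratings for the item."""
    neighbors = [(v, sim(user, v)) for v in ratings
                 if v != user and item in ratings[v]]
    num = sum(s * ratings[v][item] for v, s in neighbors)
    den = sum(abs(s) for _, s in neighbors)
    return num / den if den else 0.0

print(round(predict("carl", "i3"), 2))  # estimated from ann's and bob's tastes
```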
Hybrid recommender systems: These systems combine two or more of the above techniques, trying to use the
advantages of one to fix the disadvantages of another. For instance, CF methods suffer from the new-item problem,
i.e., they cannot recommend items that have no ratings. This does not limit content-based approaches, since the
prediction for new items is based on their descriptions (features), which are typically readily available. Given two
(or more) basic RS techniques, several ways have been proposed for combining them to create a new hybrid
system (a sketch of one such combination follows below). As we have already mentioned, the context of the
user when she is seeking a recommendation can be used to better personalize the output of the system. For
example, in a temporal context, vacation recommendations in winter should be very different from those
provided in summer. Or a restaurant recommendation for a Saturday evening with your friends should be
different from that suggested for a workday lunch with co-workers.
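As a rough illustration of one combination scheme, the weighted hybrid below takes a convex combination of the scores of two component recommenders; `cf_score` and `cb_score` are hypothetical stand-ins for any two techniques.

```python
def hybrid_score(user: str, item: str, cf_score, cb_score,
                 alpha: float = 0.7) -> float:
    """alpha weighs collaborative against content-based evidence."""
    return alpha * cf_score(user, item) + (1 - alpha) * cb_score(user, item)

# For a brand-new item with no ratings, a CF component has no signal, so the
# content-based component carries the whole (down-weighted) prediction:
print(hybrid_score("u1", "new_item",
                   cf_score=lambda u, i: 0.0,    # new-item: no ratings yet
                   cb_score=lambda u, i: 4.2))   # item features are available
# -> 1.26 with alpha = 0.7
```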
Application and Evaluation
Recommender system research is being conducted with a strong emphasis on practice and commercial
applications, since, aside from its theoretical contribution, it is generally aimed at practically improving
commercial RSs. Thus, RS research involves practical aspects that apply to the implementation of these
systems. These aspects are relevant to different stages in the life cycle of a RS, namely, the design of the system,
its implementation and its maintenance and enhancement during system operation.
The aspects that apply to the design stage include factors that might affect the choice of the algorithm. The first
factor to consider, the application's domain, has a major effect on the algorithmic approach that should be taken.
[72] provide a taxonomy of RSs and classify existing RS applications into specific application domains. Based on
these specific application domains, we define more general classes of domains for the most common
recommender systems applications:
Entertainment - recommendations for movies, music, and IPTV.
Content - personalized newspapers, recommendation for documents, recommendations of Web pages, e-learning applications, and e-mail filters.
E-commerce - recommendations for consumers of products to buy such as books, cameras, PCs etc.
Services - recommendations of travel services, recommendation of experts for consultation, recommendation
of houses to rent, or matchmaking services.
Data Collection
The logical component in charge of pre-processing the data and generating the input of the recommender
algorithm is referred to as data collector. The data collector gathers data from different sources, such as the EPG
for information about the live programs, the content provider for information about the VOD catalog and the
service provider for information about the users.
The Fastweb recommender system does not rely on personal information about the users (e.g., age, gender,
occupation). Recommendations are based on users' past behavior (what they watched) and on any explicit
preferences they have expressed (e.g., preferred genres). If the users did not specify any explicit preferences, the
system is able to infer them by analyzing the users' past activities.
An important question has been raised in Section 9.2: users interact with the IPTV system by means of the STB
(set-top box), but typically we cannot identify who is actually in front of the TV. Consequently, the STB collects the behavior
and the preferences of a set of users (e.g., the members of a family). This represents a considerable problem,
since we are limited to generating per-STB recommendations. In order to simplify the notation, in the rest of the
paper we will use "user" and "STB" to identify the same entity. The user-disambiguation problem has been
partially solved by separating the collected information according to the time slot it refers to. For instance, we
can roughly assume the following pattern: housewives tend to watch TV during the morning, children during the
afternoon, the whole family in the evening, and only adults during the night. By means of this simple
time-slot distinction we are able to distinguish among different potential users of the same STB.
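A small sketch of this disambiguation step follows; each STB event is mapped to a (STB, time slot) pseudo-user. The slot boundaries are illustrative, not Fastweb's actual ones.

```python
from datetime import datetime

def time_slot(ts: datetime) -> str:
    """Map a timestamp to one of four coarse viewing slots (illustrative)."""
    h = ts.hour
    if 6 <= h < 12:
        return "morning"     # e.g., housewives
    if 12 <= h < 18:
        return "afternoon"   # e.g., children
    if 18 <= h < 23:
        return "evening"     # e.g., the whole family
    return "night"           # e.g., adults only

def pseudo_user(stb_id: str, ts: datetime) -> tuple[str, str]:
    """A per-STB, per-slot identity used in place of a real user id."""
    return (stb_id, time_slot(ts))

print(pseudo_user("stb_042", datetime(2010, 3, 1, 21, 30)))
# -> ('stb_042', 'evening')
```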
Formally, the available information has been structured into two main matrices, stored in practice in a
relational database: the item-content matrix (ICM) and the user-rating matrix (URM).
The former describes the principal characteristics (metadata) of each item. In the following we will refer to the
item-content matrix as W, whose elements w_{ci} represent the relevance of characteristic (metadata) c for item i.
The ICM is generated from the analysis of the set of information given by the content provider (i.e., the EPG).
Such information concerns, for instance, the title of a movie, the actors, the director(s), the genre(s) and the plot.
Note that in a real environment we may have to deal with inaccurate information, especially because of the rate at which new
content is added every day. The information provided by the ICM is used to generate a content-based
recommendation, after being filtered by means of techniques for PoS (Part-of-Speech) tagging, stop words
removal, and latent semantic analysis. Moreover, the ICM can be used to perform some kind of processing on
the items (e.g., parental control).
The URM represents the ratings (i.e., preferences) of users about items. In the following we will refer to this
matrix as R, whose elements r_{pi} represent the rating of user p for item i. Such preferences constitute the basic
information for any collaborative algorithm. User ratings can be either explicit, i.e., expressed directly by the
users, or implicit, i.e., inferred by the system from user behavior.
Explicit ratings confidently represent the user's opinion, even though they can be affected by biases due to user
subjectivity, item popularity, or global rating tendencies. The first bias depends on arbitrary interpretations of the
rating scale. For instance, on a rating scale between 1 and 5, one user could use the value 3 to indicate an
interesting item, while another could use 3 for a not particularly interesting item. Similarly, popular items tend to be
overrated, while unpopular items are usually underrated. Finally, explicit ratings can be affected by global
attitudes (e.g., users are more willing to rate movies they like).
On the other hand, implicit ratings are inferred by the system on the basis of the user-system interaction, and
might not match the user's actual opinion. For instance, the system is able to monitor whether a user has watched a live
program on a certain channel or whether the user has watched a movie without interruption. Although explicit ratings
are more reliable than implicit ratings in representing the user's actual interest in an item, their collection
can be annoying from the user's perspective.
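The sketch below shows one plausible way (thresholds and events are invented) to assemble URM entries r_{pi} from explicit stars plus implicit viewing events.

```python
# Explicit feedback: stars given directly by the user.
explicit = {("u1", "movie_a"): 5.0}

# Implicit feedback: fraction of each program actually watched.
watch_fraction = {("u1", "movie_b"): 0.95, ("u1", "movie_c"): 0.10}

R: dict[tuple[str, str], float] = dict(explicit)   # explicit ratings kept as-is
for (user, item), frac in watch_fraction.items():
    if (user, item) not in R and frac >= 0.8:      # nearly complete viewing
        R[(user, item)] = 1.0                      # record a unary implicit "liked"

print(R)  # {('u1', 'movie_a'): 5.0, ('u1', 'movie_b'): 1.0}
```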
Long Tail
In statistics, a long tail of some distributions of numbers is the portion of the distribution having a large number
of occurrences far from the "head" or central part of the distribution. The distribution could involve popularities,
random numbers of occurrences of events with various probabilities, etc. A probability distribution is said to
have a long tail if a larger share of population rests within its tail than would under a normal distribution. A
long-tail distribution will arise with the inclusion of many values unusually far from the mean, which increase
the magnitude of the skewness of the distribution. A long-tailed distribution is a particular type of heavy-tailed
distribution. The distribution and inventory costs of businesses successfully applying a long-tail strategy allow them to
realize significant profit from selling small volumes of hard-to-find items to many customers, instead of only
selling large volumes of a reduced number of popular items. The total sales of this large number of "non-hit
items" is called "the long tail".
Personalization
Today, personalization is something that occurs separately within each system that one interacts with.
Recommender systems are one technique for personalization; in essence the personalization occurs slowly as
each system builds up information about your likes and dislikes, about what interests you and what fails to
interest you. There are numerous other personalization techniques; most of these rely either on collection of
system usage history which is then employed to change the behavior of the system, or on the user taking the
time and trouble to explicitly personalize the behavior of the system in various ways by setting parameters,
making selections or engaging in dialogs with the system.
There are several problems with this model, at least from the user's point of view. Investments in personalizing
one system (either through explicit action or just long use) are not transferable to another system. (Of course,
from the system operator's point of view, this may be very desirable; it increases switching costs for users and
thus helps lock in a user base.) Information such as likes and dislikes or usage patterns is scattered across
multiple systems and can't be combined to obtain maximum leverage. And the user does not have control of the
information bases that define his or her "profile". If you want to buy books from multiple online booksellers, this
is annoying. But if we are concerned with developing information discovery systems to assist users in a world
of information overload, these problems are critical. People obtain information from a multiplicity of sources,
and personalization has to happen close to the end user; this is the only place where there is enough information
to do personalization effectively, to keep track of what's new and what isn't, what has and has not proven useful.
The user needs to become a hub and a switch, moving data to allow accurate personalization from one system
to another.
Levels of measurement
What a scale actually means and what we can do with it depend on what its numbers represent. Numbers can
be grouped into 4 types or levels: nominal, ordinal, interval, and ratio. Nominal is the simplest, and ratio the
most sophisticated. Each level possesses the characteristics of the preceding level, plus an additional quality.
Nominal
Nominal is hardly measurement. It refers to quality more than quantity. A nominal level of measurement is
simply a matter of distinguishing by name, e.g., 1 = male, 2 = female. Even though we are using the numbers 1
and 2, they do not denote quantity. The binary category of 0 and 1 used for computers is a nominal level of
measurement. They are categories or classifications. Nominal measurement is like using categorical levels of
variables, described in the Doing Scientific Research section of the Introduction module.
Examples:
MEAL PREFERENCE: Breakfast, Lunch, Dinner
RELIGIOUS PREFERENCE: 1 = Buddhist, 2 = Muslim, 3 = Christian, 4 = Jewish, 5 =
Other
POLITICAL ORIENTATION: Republican, Democratic, Libertarian, Green
Ordinal
Ordinal refers to order in measurement. An ordinal scale indicates direction or order of occurrence, in addition
to providing nominal information; it tells us that one value is greater or less than another, but not by how much.
For example, a rating of 9 on a 10-point scale indicates a higher evaluation than a rating of 6, but the spacing
between scale points is uneven.
Ordinal time of day - indicates direction or order of occurrence; spacing between is uneven
Interval
Interval scales provide information about order, and also possess equal intervals. From the previous example, if
we knew that the distance between 1 and 2 was the same as that between 7 and 8 on our 10-point rating scale,
then we would have an interval scale. An example of an interval scale is temperature, either measured on a
Fahrenheit or Celsius scale. A degree represents the same underlying amount of heat, regardless of where it
occurs on the scale. Measured in Fahrenheit units, the difference between a temperature of 46 and 42 is the
same as the difference between 72 and 68. Equal-interval scales of measurement can be devised for opinions
and attitudes. Constructing them involves an understanding of mathematical and statistical principles beyond
those covered in this course. But it is important to understand the different levels of measurement when using
and interpreting scales.
Examples:
TIME OF DAY on a 12-hour clock
POLITICAL ORIENTATION: Score on standardized scale of political orientation
OTHER scales constructed so as to possess equal intervals
Interval time of day - equal intervals; analog (12-hr.) clock, difference between 1 and 2 pm is same as
difference between 11 and 12 am
Ratio
In addition to possessing the qualities of nominal, ordinal, and interval scales, a ratio scale has an absolute zero
(a point where none of the quality being measured exists). Using a ratio scale permits comparisons such as
being twice as high, or one-half as much. Reaction time (how long it takes to respond to a signal of some sort)
uses a ratio scale of measurement -- time. Although an individual's reaction time is always greater than zero, we
conceptualize a zero point in time, and can state that a response of 24 milliseconds is twice as fast as a response
time of 48 milliseconds.
Examples:
RULER: inches or centimeters
INCOME: money earned last year
GPA: grade point average
Ratio - 24-hr. time has an absolute 0 (midnight); 14 o'clock is twice as long from midnight as 7 o'clock
Applications
The level of measurement for a particular variable is defined by the highest category that it achieves. For
example, categorizing someone as extroverted (outgoing) or introverted (shy) is nominal. If we categorize
people 1 = shy, 2 = neither shy nor outgoing, 3 = outgoing, then we have an ordinal level of measurement. If we
use a standardized measure of shyness (and there are such inventories), we would probably assume the shyness
variable meets the standards of an interval level of measurement. As to whether or not we might have a ratio
scale of shyness, although we might be able to measure zero shyness, it would be difficult to devise a scale
where we would be comfortable talking about someone's being 3 times as shy as someone else.
Measurement at the interval or ratio level is desirable because we can use the more powerful statistical
procedures available for Means and Standard Deviations. To have this advantage, often ordinal data are treated
as though they were interval; for example, subjective rating scales (1 = terrible, 2 = poor, 3 = fair, 4 = good, 5 =
excellent). The scale probably does not meet the requirement of equal intervals -- we don't know that the
difference between 2 (poor) and 3 (fair) is the same as the difference between 4 (good) and 5 (excellent). In
order to take advantage of more powerful statistical techniques, researchers often assume that the intervals are
equal.
Data Preprocessing
Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data
quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.
Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data
with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the
company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price,
and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples
have no recorded value. For your analysis, you would like to include information as to whether each item
purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore,
users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for
some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete
(lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy
(containing errors, or values that deviate from the expected); and inconsistent (e.g., containing discrepancies in
the department codes used to categorize items). Welcome to the real world!
This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency.
Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data
warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). The
data collection instruments used may be faulty. There may have been human or computer errors occurring at
data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to
submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is
known as disguised missing data. Errors in data transmission can also occur. There may be technology
limitations such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect
data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input
fields (e.g., date). Duplicate tuples also require data cleaning.
Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as
customer information for sales transaction data. Other data may not be included simply because they were not
considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or
because of equipment malfunctions. Data that were inconsistent with other recorded data may have been
deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing
data, particularly for tuples with missing values for some attributes, may need to be inferred.
Recall that data quality depends on the intended use of the data. Two different users may have very different
assessments of the quality of a given database. For example, a marketing analyst may need to access the
database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet
overall, 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database
for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found
the data inaccurate.
Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses
to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their
sales records on time at the end of the month. There are also a number of corrections and adjustments that flow
in after the month's end. For a period of time following each month, the data stored in the database are
incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not
updated in a timely fashion has a negative impact on the data quality.
Two other factors affecting data quality are believability and interpretability. Believability reflects how much
the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a
database, at one point, had several errors, all of which have since been corrected. The past errors, however, had
caused many problems for sales department users, and so they no longer trust the data. The data also use many
accounting codes, which the sales department does not know how to interpret. Even though the database is now
accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor
believability and interpretability.
Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines
attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the
data. In this section, you will study basic methods for data cleaning. Section 3.2.1 looks at ways of handling
missing values. Section 3.2.2 explains data smoothing techniques. Section 3.2.3 discusses approaches to data
cleaning as a process.
3.2.1 Missing Values
Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no
recorded value for several attributes such as customer income. How can you go about filling in the missing
values for this attribute? Let's look at the following methods (a short code sketch follows the list).
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves
classification). This method is not very effective, unless the tuple contains several attributes with missing
values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring
the tuple, we do not make use of the remaining attributes values in the tuple. Such data could have been useful
to the task at hand.
2. Fill in the missing value manually: In general, this approach is time consuming and may not be feasible given
a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant,
such as a label like "Unknown" or −∞. If missing values are replaced by, say, "Unknown", then the mining
program may mistakenly think that they form an interesting concept, since they all have a value in common,
that of "Unknown". Hence, although this method is simple, it is not foolproof.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value:
Chapter 2 discussed measures of central tendency, which indicate the middle value of a data distribution. For
normal (symmetric) data distributions, the mean can be used, while skewed data distributions should employ the
median (Section 2.2). For example, suppose that the data distribution regarding the income of AllElectronics
customers is symmetric and that the mean income is $56,000. Use this value to replace the missing value for
income.
5. Use the attribute mean or median for all samples belonging to the same class as the given tuple: For example,
if classifying customers according to credit risk, we may replace the missing value with the mean income value
for customers in the same credit risk category as that of the given tuple. If the data distribution for a given class
is skewed, the median value is a better choice.
6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer
attributes in your data set, you may construct a decision tree to predict the missing values for income. Decision
trees and Bayesian inference are described in detail in Chapters 8 and 9, respectively, while regression is
introduced in Section 3.4.5.
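The sketch below illustrates methods 3-5 on a toy pandas table; the column names and values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "risk":   ["low", "low", "high", "high"],
    "income": [56000, None, 22000, None],
})

# 3. Global constant.
filled_const = df["income"].fillna(-1)

# 4. Overall mean (use the median instead for skewed distributions).
filled_mean = df["income"].fillna(df["income"].mean())

# 5. Per-class mean: the average income within the same credit-risk category.
filled_class = df["income"].fillna(df.groupby("risk")["income"].transform("mean"))

print(filled_class.tolist())  # [56000.0, 56000.0, 22000.0, 22000.0]
```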
Noisy Data
What is noise? Noise is a random error or variance in a measured variable. In Chapter 2, we saw how some
basic statistical description techniques (e.g., boxplots and scatter plots), and methods of data visualization can
be used to identify outliers, which may represent noise. Given a numeric attribute such as, say, price, how can
we smooth out the data to remove the noise? Let's look at the following data smoothing techniques.
Binning: Binning methods smooth a sorted data value by consulting its "neighborhood," that is, the values
around it. The sorted values are distributed into a number of "buckets," or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing. Figure 3.2 illustrates some binning
techniques. In this example, the data for price are first sorted and then partitioned into equal-frequency bins of
size 3 (i.e., each bin contains three values). In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin. For example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original
value in this bin is replaced by the value 9.
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin
boundaries. Each bin value is then replaced by the closest boundary value. In general, the larger the width, the
greater the effect of the smoothing. Alternatively, bins may be equal width, where the interval range of values in
each bin is constant. Binning is also used as a discretization technique and is further discussed in Section 3.5.
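A small sketch of both smoothing variants follows. Figure 3.2 is not reproduced here; the nine sorted price values below are assumed to match it, being consistent with the Bin 1 values (4, 8, 15) cited in the text.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]                 # sorted price data
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]  # equal-frequency, size 3

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]
print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```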
Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function.
Linear regression involves finding the best line to fit two attributes (or variables) so that one attribute can be
used to predict the other. Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface. Regression is further described in
Section 3.4.5.
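As a small sketch of regression-based smoothing (the data are invented), fit a least-squares line relating two attributes with numpy and replace the noisy values with the fitted ones.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])     # noisy, roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)  # best-fit line (least squares)
smoothed = slope * x + intercept            # y-values conformed to the line
print(np.round(smoothed, 2))
```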
Outlier analysis: Outliers may be detected by clustering, for example, where similar values are organized into
groups, or clusters. Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure
3.3). Chapter 12 is dedicated to the topic of outlier analysis.
Data Integration
Data mining often requires data integration, the merging of data from multiple data stores. Careful integration
can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the
accuracy and speed of the subsequent data mining process.
Data Transformation Strategies Overview
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and
clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added from the
given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales
data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in
constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0 to 1.0, or
0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g.,
0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized
into higher-level concepts, resulting in a concept hierarchy for the numeric attribute. Figure 3.12 shows a
concept hierarchy for the attribute price. More than one concept hierarchy can be defined for the same attribute
to accommodate the needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database
schema and can be automatically defined at the schema definition level.
Min-max normalization (which maps a value v_i of attribute A to v'_i = (v_i − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A) preserves the relationships among the original data values. It will encounter an out-of-bounds error if a future input case for normalization falls outside of the original data range for A.
In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on
the mean (i.e., average) and standard deviation of A. A value, v_i, of A is normalized to v'_i by computing
v'_i = (v_i − Ā) / σ_A, where Ā and σ_A are, respectively, the mean and standard deviation of attribute A.
Decimal scaling normalizes by moving the decimal point of values of A: v'_i = v_i / 10^j, where j is the smallest
integer such that max(|v'_i|) < 1. Suppose that the recorded values of A range from −986 to 917. The maximum
absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3)
so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
Note that normalization can change the original data quite a bit, especially when using z-score normalization or
decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation
if using z-score normalization) so that future data can be normalized in a uniform manner.
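The three methods, written out as code (the income numbers in the usage lines are illustrative; the decimal-scaling values are those from the example above):

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    """Min-max normalization of v from [mn, mx] to [new_mn, new_mx]."""
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mean, std):
    """Zero-mean normalization: v' = (v - mean) / std."""
    return (v - mean) / std

def decimal_scaling(v, j):
    """Divide by 10^j, with j the smallest integer making all |v'| < 1."""
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))             # 0.716
print(round(z_score(73600, 54000, 16000), 3))             # 1.225
print(decimal_scaling(-986, 3), decimal_scaling(917, 3))  # -0.986 0.917
```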
Discretization by Binning
Binning is a top-down splitting technique based on a specified number of bins. Section 3.2.2 discussed binning
methods for data smoothing. These methods are also used as discretization methods for data reduction and
concept hierarchy generation. For example, attribute values can be discretized by applying equal-width or
equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin
means or smoothing by bin medians, respectively. These techniques can be applied recursively to the resulting
partitions to generate concept hierarchies. Binning does not use class information and is therefore an
unsupervised discretization technique. It is sensitive to the user-specified number of bins, as well as the
presence of outliers.
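A minimal sketch of equal-width discretization (bin count and data invented): the attribute range is split into k bins of constant width, and each value is replaced by its interval label.

```python
def equal_width_labels(values: list[float], k: int) -> list[str]:
    """Replace each numeric value with an equal-width interval label."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        b = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bin
        labels.append(f"{lo + b * width:.0f}-{lo + (b + 1) * width:.0f}")
    return labels

ages = [3, 17, 25, 40, 66, 89]
print(equal_width_labels(ages, k=3))
# ['3-32', '3-32', '3-32', '32-60', '60-89', '60-89']
```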
Distance/Similarity Measures
Similarity: a measure of how close to each other two instances are. The closer the instances are to each other,
the larger the similarity value.
Dissimilarity: a measure of how different two instances are. Dissimilarity is large when instances are very
different and small when they are close.
Proximity: refers to either similarity or dissimilarity
Distance metric: a measure of dissimilarity d that obeys the following laws (the laws of a triangular norm):
d(x, y) ≥ 0; d(x, y) = 0 if and only if x = y; d(x, y) = d(y, x); and d(x, z) ≤ d(x, y) + d(y, z) (the triangle
inequality). Given a distance d, a monotonically decreasing function of it, such as s = 1 / (1 + d), can serve
as the corresponding similarity measure. If s is a similarity measure that ranges between 0 and 1 (a so-called
degree of similarity), then the corresponding dissimilarity measure can be defined as d = 1 − s.
In general, any monotonically decreasing transformation can be applied to convert similarity measures into
dissimilarity measures, and any monotonically increasing transformation can be applied to convert the measures
the other way around.
Distance Metrics for Numeric Attributes
When the data set is presented in a standard form, each instance can be treated as a vector x = (x_1, . . . , x_N) of
measures for attributes numbered 1, . . . , N.
Consider for now only non-nominal scales.
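For instance, the Minkowski family below covers the common numeric-attribute distances (p = 2 gives the Euclidean distance, p = 1 the Manhattan distance); the conversion s = 1 / (1 + d) from above is included as well.

```python
def minkowski(x: list[float], y: list[float], p: float = 2.0) -> float:
    """Minkowski distance between two numeric vectors of equal length."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def to_similarity(d: float) -> float:
    """One monotonically decreasing transformation: s = 1 / (1 + d)."""
    return 1.0 / (1.0 + d)

x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
d = minkowski(x, y)                   # sqrt(9 + 16 + 0) = 5.0
print(d, round(to_similarity(d), 3))  # 5.0 0.167
```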