Data Mining Notes
Lecture - 01
Introduction
Welcome to the course Business Analytics and Data Mining Modeling Using R. This is
the very first lecture and we are going to cover the introduction. First we need
to understand why we need to study this course. If you follow industry news and
related reports, you would see that there is a huge requirement for data scientists, data
engineers, business analysts and other relevant positions which require
expertise in domains like business analytics, data mining, R and other related areas.
So, we are going to cover these things in this course. So, let us start. What is business
analytics?
The primary purpose of business analytics is to assist, aid and drive the decision
making activities of a business organization. Now, if you look at Gartner's definition,
they have defined business analytics as comprised of solutions used to build analysis
models and simulations to create scenarios, understand realities and predict future states.
The domains which are actually part of business analytics are data mining, predictive
analytics, applied analytics and statistics, and generally the solution is delivered in an
application format which is suitable for business users. Now, if we look at analytics as
such, it can be classified into 3 categories: descriptive analytics, predictive analytics and
prescriptive analytics.
Now, first let us understand descriptive analytics. Descriptive analytics mainly
revolves around gathering, organizing, tabulating, presenting and depicting data and
describing the characteristics of what you are studying. Descriptive
analytics is mainly about what is happening. So, we try to answer the
question of what is happening in a particular context by looking at the data.
Now, this is also called reporting in managerial lingo, the reason being that we
generally look at sales numbers, cost numbers, revenue numbers, etcetera, and
different ratios. So, we try to understand how our business is performing, how the
company or organization is performing, or how the industry or the country's economy is
growing in an overall sense. So, what is happening is covered in descriptive analytics.
This is also the first phase of analytics; this is where you actually start. You try to gather a
sense of what is happening and then you look for the other categories of
analytics. Now, descriptive analytics can be useful in the sense that it informs us of the
results that are happening, but it does not inform us why the results are happening or
what can happen in the future. To answer these questions we need the next phase of analytics,
which is predictive.
So, predictive analytics can be defined as using the past to
predict the future. Now, the main idea is to identify associations
among different variables and predict the likelihood of a phenomenon occurring on the
basis of those relationships.
Now, at this point we need to understand the concept of correlation versus causation.
Correlation means that if x is correlated with y, that alone is sufficient for us to
predict something.
But if we are looking for what we should be doing about it, that is, if we know what is going
to happen in the future, which is the predictive analytics part, then
what we are going to do about it becomes part of prescriptive analytics. Now,
prescriptive analytics is where the cause and effect relationship comes into the picture,
and the main idea of prescriptive analytics is to suggest a course of
action. So, generally prescriptive analytics is about recommending decisions,
which typically entails mathematical and computational models; you do lots of
simulations and optimizations to find out what can be done about this future scenario of the
business or the relevant topic.
So, prescriptive analytics is also defined as the final phase of analytics. Now, methods from
disciplines like statistics, forecasting, data mining and experimental design are used in
business analytics. That brings us to the next part, data mining; the core of
this particular course is data mining. So, let us understand what data mining is. A brief
definition of data mining could be extracting useful information from large data sets.
Now, if you look at how Gartner defines data mining, they define it
as the process of discovering meaningful correlations, patterns and trends by sifting
through large amounts of data stored in repositories.
(Refer Slide Time: 06:11)
If you look at some of the examples given here, the first one is related to the medical field.
So, data mining can actually help us in predicting the response of a drug or a medical
treatment on a patient suffering from a serious disease or illness. Another example
could be in the security domain, where data mining can help us in predicting whether an
intercepted communication is about a potential terror attack. Another application of
data mining could be in the computer or network security field, where it can help us in
predicting whether a packet of network data poses a cyber security threat.
Now, our main interest is in the business domain. So, let us look at some of the examples
where data mining can help in a business context. Common business questions where
data mining can help could be like this: which customers are most likely to respond to a
marketing or promotional offer?
(Refer Slide Time: 07:12)
Another example could be: which customers are most likely to default on a loan? So,
banking and financial institutions might be worried about some of their customers
defaulting on loans. So, they would like to identify such customers and then take
appropriate actions.
Another question could be: which customers are most likely to subscribe to
a magazine? So, if a magazine is running an advertising or
marketing promotional offer, it would be important for them to understand the
customers who are likely to subscribe to the kind of content that they are publishing or
selling through their magazine. So, all these give some flavour of the
kind of questions where data mining can actually help us.
(Refer Slide Time: 08:05)
Now, let us look at the origins of data mining. Data mining is mainly an
interdisciplinary field of computer science, and it originates from the fields of machine
learning and statistics. Now, some researchers have also defined data mining as statistics
at scale and speed. Some people also define it as statistics at scale, speed and
simplicity, the main reason being that in data mining we generally do not use the concepts of
confidence and the logic of inference; rather we rely on partitioning and using
different samples to test our models. So, that makes the whole process simpler. That
is where the simplicity comes from.
Now, let us compare the classical statistical setting and the data mining paradigm. The
classical statistical setting is mainly about data scarcity and computational
difficulty.
So, generally in a statistical setting you are dealing with a statistical question where you
are looking for primary data, and that is of course costly to collect. So, you are always
going to face a situation where data is not enough or it is very difficult to get
the data. And if we look at the times when statistics originated, it is an old
discipline, so statistical studies used to face a lot of computational difficulty, where
most of the mathematical computations had to be performed manually. So, that was the time
when the statistical setting originated. Therefore, the classical statistical setting
generally faces the data scarcity problem and the computational
problem. When we look at the data mining paradigm, it is a relatively newer
area, and it has mainly developed or evolved because of the availability of
large data sets and ever improving computing power. So, in the data mining
paradigm we are not faced with the problem of scarce data sets or with computational problems.
Now, let us look at another point. In the classical statistical setting the same sample is
used to compute an estimate and to check its reliability, while in
the data mining paradigm, because we do not have any problem in terms of
data points or data sets, we can fit a model with one sample and then
evaluate the performance of the model with another sample. So, that is another
difference. Now, the third one is the logic of inference. We need confidence intervals and
hypothesis tests because we are using the same sample to compute the estimate
and then check its reliability; therefore, we need to eliminate the possibility that a pattern
or a relationship we see has developed because of chance. So, chance variation
has to be eliminated. Therefore, the logic of inference is important in the statistical
setting, and much stricter conditions are placed on statistical modeling.
When we look at data mining modeling, we are not faced with the problems of
inference and related issues, the reason being that different samples are used. The
reliability or robustness of the model is automatically taken care of because the
model is built on one sample and evaluated on a different sample, a different
partition. If we look at machine learning techniques such as trees and neural networks, they
are also less structured but more computationally intensive in comparison to
statistical techniques. So, they might require more running time, more
computation time, and they are less structured, while statistical
techniques like regression, logistic regression and discriminant analysis are highly
structured techniques.
Now, another important aspect of the emergence of this field, business analytics, data mining,
big data and related fields, is the rapid growth of data.
Our economy has been growing and there has been growth in internet
infrastructure, which has led to many people using the internet and digital technologies.
So, that has also led to growth of data. Automatic data capture mechanisms, for example
bar codes, POS devices, click-stream data and GPS data, have also led to rapid growth
of data. Now, operational databases: whenever you visit a retail
store, whatever items you buy, all those data can be
considered as transactions between the business and the customers, and all those
transactions are recorded in the operational database. Now, for data
analytics or business analytics purposes, these operational databases have to be
brought into a data warehouse, because you cannot actually do any kind of meaningful
analysis on operational databases. So, the data has to be brought into data
warehouses and data marts, from where the data can then be sampled out for the analysis.
Now, another reason for the rapid growth of data is the constantly declining cost of data storage
and improving processing capabilities, so that even smaller
organizations can nowadays invest in related IT infrastructure and develop analytical
capabilities to improve their business. Now, the core of
this course focuses on predictive analytics; mainly we focus on three tasks:
prediction, classification and association rules.
Prediction is mainly when we are trying to predict the value of a variable; we will discuss the
different terminology and concepts in detail later in this lecture. Classification is a task
wherein we are trying to predict the type or class of a particular
variable. Association rules are where we are trying to find out the associations between
different items in transactions.
Now, another important thing related to the data mining
process as such is that we generally try several different methods for a particular goal and
then a useful method is finally selected and used in the production systems.
(Refer Slide Time: 15:46)
Now, how do we define the usefulness of a method? The methods that we select to
perform a particular task, of course, have to be relevant with respect to the goal of the
analysis. The underlying assumptions of the method should also meet
the requirements of the goal and the problem. Size of the data set:
different methods and algorithms are going to impose their own restrictions on the
number of variables and the number of records that are going to be used for analysis. So,
the size of the data set is also going to determine the usefulness of a method.
Types of patterns in the data set: different methods or algorithms are suitable for
finding out or understanding different types of patterns. So, the type of
pattern in a data set is also going to determine the usefulness of a method with respect to
a particular goal.
(Refer Slide Time: 17:24)
Now, let us understand the data set format that is typically used in data mining and
business analytics. Generally the data that we use is in tabular or matrix format:
variables are generally in columns and observations are in rows.
Now, another important thing is that each row represents a household; this is the unit of analysis.
For example, in this sedan car data set the unit of analysis is the household. So, all
the variables that are there in the data set are about this particular
household, the household being the unit of analysis.
Now, the statistical and data mining software that we are going to use in this course is
R and R Studio. So, what is R? R is a programming language and software
environment for statistical computing and graphics, widely used by statisticians across
the world and also by data miners. There are many packages available for different
kinds of functionalities, related to statistical techniques and also to data mining
techniques.
R Studio is the most commonly used integrated development
environment for R. It might be difficult for some users to directly start using R
because they might not be very comfortable with its interface;
R Studio bridges this gap and provides a much better interface to perform your data
mining modeling or statistical modeling using R.
(Refer Slide Time: 19:04)
So, let us look at this sedan car example. This is the R Studio environment. Here you
have four parts: the first part is the R script, where the code is actually
written, and from here you can run the commands that are given in the
script. Then in this part you have the environment, where all your data and
variables are loaded; if a particular data set is loaded into R Studio it
would be shown here, and if a particular variable is loaded into R Studio it would also be
shown here. In this part you have plots and help pages, which we will look at in more detail later.
(Refer Slide Time: 20:36)
So, this is the data that we were talking about. You can see that this is in the tabular
or matrix format.
You have these three variables. Annual income and household area are your
predictor variables; we will discuss the terminology later in the lecture. And this is your
outcome variable, ownership. So, based on these two variables we
want to predict the class of a household, whether they own a sedan car or not.
If we look at the whole data set we have 20
observations, and each observation is about a household; it depicts their annual income in
rupees lakh and the household area in hundreds of square feet.
Now, let us go back. If we run the command head(df), we are going to get the first 6
observations from the data set. So, if you do not want to go and actually have a look at
the full data set, which might be a large file, you might run this command head(df)
and then look at the first 6 observations of the data set.
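As a rough illustration (not the lecture's exact script), a minimal sketch of reading such a data set and inspecting its first rows could look like this; the file name sedan_car.csv, the object name df and the column names are assumptions for illustration only:

    # Assumed file and column names, for illustration only
    df <- read.csv("sedan_car.csv")   # columns: Annual.Income, Household.Area, Ownership
    head(df)          # first 6 observations by default
    head(df, n = 10)  # or any other number of rows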
The summary command is going to give you the basic statistics about the data set
variables; for example, for annual income we can see the min, max, mean and other statistics,
and similarly for household area and ownership. You can see from the ownership data
that this is categorical data; we will discuss later in this lecture what a categorical
variable is. We can see that 10 observations belong to the non-owner category and ten
observations belong to the owner class. We can also look at the other things.
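A hedged sketch of this step, again assuming the data frame is called df and that Ownership is read in as a factor so that summary() reports class counts:

    # Treat the categorical variable as a factor so summary() shows class counts
    df$Ownership <- as.factor(df$Ownership)
    summary(df)   # min, max, mean and quartiles for numeric columns; counts for factors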
Now, let us plot a graph between household area and annual income, annual
income being on the x axis and household area being on the y axis. So, this is the plot.
If you look at this plot, the observations belonging to the non-owner class are mainly in
this part, and the observations belonging to the owner class are mainly in this half. So,
our goal is to classify the ownership of a sedan car.
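The scatter plot described here could be produced roughly as below; the column names, labels and colours are illustrative assumptions, not the lecture's exact script:

    # Scatter plot of the two predictors, coloured by ownership class
    plot(df$Annual.Income, df$Household.Area,
         col = ifelse(df$Ownership == "owner", "blue", "red"),
         pch = 19,
         xlab = "Annual income (Rs lakh)",
         ylab = "Household area (100 sq ft)")
    legend("topleft", legend = c("owner", "non-owner"),
           col = c("blue", "red"), pch = 19)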
Now, as we talked about, different methods can be tried out in data mining
modeling and then the best one is generally selected. One method in this case could
be a set of horizontal and vertical lines. So, we can look at this data and we can create
a set of horizontal and vertical lines; this could be a hypothetical method one,
which could then be used to classify these
observations. Another method could be a single diagonal line. So, we could draw a single
diagonal line somewhere here, and it could also be used to classify these
observations.
Now, if you look at this data, if you are able to draw a line somewhere here, most of the
observations belonging to the owner class will be on the upper
rectangular region and most of the observations belonging to the non-owner class
will be on the lower rectangular region. So, let us do that.
If we look at these two points, they are actually 7.6 and 6. If a horizontal line
can be drawn here between these two points, then two
partitions, two rectangular regions, can actually be created.
If we similarly keep on drawing horizontal and vertical lines to keep separating
these observations, you can see here that only 3 observations
belonging to the non-owner class are there and the others are owner observations, while in this lower rectangular
region most of the observations belong to the non-owner class and only 3 belong to the owner
category.
(Refer Slide Time: 25:48)
Generally we can keep on creating similar lines to classify further. So, finally we will
end up with a graphic where each rectangular region is homogeneous; that
means it contains observations belonging to only one class, either owner or non
owner. So, this set of horizontal and vertical lines could be a method,
a model, to classify these observations.
Similarly, as we talked about, method two could be about finding a single diagonal
line to separate these observations. Now, if we look at the plot again,
a diagonal line could go from somewhere here, and
then we will have homogeneous partitions, homogeneous regions. So, again a
similar process can be adopted to find out that particular line; you would see a line being
drawn there, and we can extend this line and create partitions. This could be
another method, method two, to actually classify the observations.
(Refer Slide Time: 27:09)
Now, as we discussed, we have these two methods: method one, a set of
horizontal and vertical lines, and method two, a single diagonal line. So, now we need to
find out which is the most useful method, which is the best method; that can be done using
different assessment metrics that we are going to cover in later lectures.
Now, let us look at the key terms related to this course. As we go through other lectures
we will come across many more terms and we will discuss them where needed;
first we need to discuss the key terms at this point.
So, the first one is algorithm. We have been using this particular term quite often.
An algorithm can be defined as a specific sequence of actions or set of rules that has to
be followed to perform a task. Algorithms are generally used to implement data mining
techniques like trees and neural networks; for example, for a neural network we use the back
propagation algorithm, which is part of the neural network technique. So, there are a number of algorithms
that are actually required to implement these techniques.
The next term is model. So, what do we mean by model? Here by model we mean a data
mining model. How can we define a model in a data mining context? A data mining
model is an application of a data mining technique on a data set. So, when we apply some of
these techniques like trees and neural networks, which we are going to cover in later
lectures, on a data set, then we get a model.
Our next term is variable. A variable can be defined as the
operationalized way of representing a characteristic of an object, event or phenomenon.
A variable can take different values in different situations. Now, there are generally
two types of variables that we are going to deal with here. One is the input variable;
it is also sometimes called an independent variable, feature, field, attribute or predictor.
Essentially, an input variable is an input to the model.
(Refer Slide Time: 29:37)
The other type of variable is the output variable. Other names for this
variable are outcome variable, dependent variable, target variable or response. So, the output
variable is the output of the model.
Another term that we come across is record, observation, case or row. As we talked
about the tabular or matrix data set, each row represents a record, an
observation or a case. Now, how do we define it? An observation is the
unit of analysis on which the variable measurements that are in the columns are taken, such
as a customer, a household, an organization or an industry. For example, in our sedan car
example the unit of analysis was the household, and we had variables like annual
income and household area which actually measure something related to the household.
Similarly, a customer, an organization or an industry could also be the unit of
analysis, with the variables measured on these.
(Refer Slide Time: 30:53)
Now, let us talk about variables in detail. The two types that we talked about
earlier, input variable and output or outcome variable, are in the modeling sense.
But here we are talking more in the data sense. So, two types of variables are used:
categorical and continuous. Categorical variables can be further classified as nominal
and ordinal, and continuous variables can be further classified into two categories,
interval and ratio variables. So, let us understand these variables.
Now, before we go into the details of what these 4 types of variables mean, why do we need
to understand the types of variables in a data set? As we discussed before,
in a data mining process, in data mining modeling, we generally use many methods
and then select the most useful one or the best one. Therefore, it is important for
us to identify an appropriate statistical or data mining technique, and the understanding of
variable types is part of this process.
Proper interpretation of the data analysis results also depends on the kind of
technique that you are using and the kind of data that was actually analyzed. Now,
another important thing is that data of these variable types are either quantitative or
qualitative in nature. Quantitative data measure numeric values
and are expressed as numbers; qualitative data measure types and are expressed as
labels or numeric codes.
Now, if we look at these 4 types, nominal, ordinal, interval and ratio, the structure of these
variable types increases from nominal to ratio in a hierarchical fashion. So, nominal is the
least structured variable type, followed by ordinal, followed by interval, and then ratio is
the most structured variable type. Now, let us understand nominal variables.
Nominal values indicate distinct types, for example gender. Gender values could
be male or female, so they indicate distinct types. Similarly for nationality, the
values could be Indian, Pakistani, Bangladeshi, Sri Lankan and so on; all these values
indicate distinct types. Similarly, religion could be another nominal variable,
so the values will again indicate distinct types, for example Hindu, Muslim or
Christian. Similarly, pin code could be another example of a nominal variable, where
each pin code actually indicates a distinct location. Employee ID could be a
nominal variable because each employee ID would actually indicate a different person.
Now, only the two operations equal to and not equal to are supported, because these are
distinct types; you cannot say male is greater than female. Greater than, less than,
multiplication and division are not supported;
only equal to and not equal to are supported.
Let us look at ordinal variables. The values indicate a natural
order or sequence, for example academic grades. So, you have grades like A, B, C, D, E,
F. All these grade labels indicate an order, but essentially they are
distinct from each other.
Similarly, Likert scales and the quality of a food item are also examples of
ordinal variables. You come across Likert scales whenever you are trying to fill a survey
questionnaire: when trying to reach respondents and get their responses, you
always get the responses from strongly disagree to strongly agree, or strongly agree to
strongly disagree. So, all those points are actually part of a Likert scale and can
actually be defined as an ordinal variable.
Similarly, the quality of a food item could be good, average or
poor, so that can also be an ordinal variable. Now, 4 additional operations are
supported because the values indicate a natural order: the less than,
less than or equal to, greater than, and greater than or equal to operations are also supported.
If we look at the continuous variable types, the first one is the interval variable. Here, apart
from what we discussed for nominal and ordinal, the difference between two values
is also meaningful. Another important thing about interval variables is that values may be
in reference to a somewhat arbitrary zero point, for example Celsius temperature and
Fahrenheit temperature. Celsius temperature is generally used in India, where we
talk about the temperature being 35 degrees Celsius or 40 degrees Celsius and
so on. Now, when we talk about 0 degrees Celsius, it is not actually an absolute
zero point; rather it can be called an arbitrary zero point. Similarly, location-specific
measures, for example distance from landmarks or geographical coordinates, are all
examples of interval variables.
Next, the operations that are supported for interval variables: apart from what we
discussed for nominal and ordinal, two additional operations
are supported, plus and minus, that is addition and subtraction. Now, the last variable type is the
ratio variable. In ratio variables the ratio of two values is also meaningful, and the
values are in reference to an absolute zero point, for example Kelvin
temperature. In the Kelvin scale, 0 degrees Kelvin
actually means zero temperature, that is, zero in the absolute, real sense.
When we talk about Celsius or
Fahrenheit temperature, 0 degrees centigrade or 0 degrees Fahrenheit does not actually
mean an absolute zero, a real zero in the absolute sense.
Age, length, weight, height and income are also examples of ratio
variables. Apart from the operations that we discussed for nominal, ordinal and interval types,
two additional operations, division and multiplication, are supported for these
variables.
Now, another important thing related to variable types is conversion from one variable
type to another. A higher-structure variable type can always be converted, because as we
discussed these variable types have a hierarchy of structure. So, a higher-structure
variable can always be converted into a lower-structure
variable type, but we cannot convert a lower-structure variable into a higher-structure
variable. For example, a ratio variable can be converted into an ordinal variable, such as age into age
group.
Age group could be values like young, adult, middle age and old, while age itself
could be specific values like 20, 21, 25, 40 and 45, which makes it a
ratio variable. Now, based on these actual numbers you can actually convert
this variable into an ordinal variable, where you say that less than 20 is young, then 20 to
40 is adult, then later on middle age and then old age. So, that kind of grouping can
actually be done. Therefore, a higher-structure variable can always be converted into
a lower-structure variable type, but not the other way around.
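For instance, a ratio variable such as age can be grouped into an ordinal age-group variable using cut(); the break points and labels below are illustrative assumptions:

    age <- c(18, 21, 25, 40, 45, 63, 70)              # ratio variable
    age_group <- cut(age,
                     breaks = c(0, 20, 40, 60, Inf),   # assumed group boundaries
                     labels = c("young", "adult", "middle age", "old"),
                     ordered_result = TRUE)            # ordered factor = ordinal variable
    age_group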
Now, let us discuss the road map of this particular course. The first module, which we
started in this lecture, is a general overview of data mining and its components, which
covers the introductory part and the data mining process. The second module is
about data preparation and exploration; here we talk about the different steps
that are required to prepare the data, explore the data, visualize the data, and other
techniques like dimension reduction etcetera.
The third module is about performance metrics and assessment, where we will try to
understand the metrics that are actually used to assess the performance of
different models for tasks like classification and prediction. The fourth
module is about supervised learning methods; there we are going to cover data
mining and statistical techniques like regression, logistic regression, neural networks
and trees. Those techniques and many others are going to be covered in that
module. The fifth module is about unsupervised learning methods; there we are going to
mainly cover clustering and association rules mining. Then the next module is time series
forecasting, where we are going to cover time series handling, regression-based
forecasting and smoothing methods. Finally, in the last module we will have some final
discussion and concluding remarks.
Now, apart from this road map we will also have two supplementary lectures, one on
introduction to R and the other one on basic statistical methods. It is highly
recommended that you go through these two lectures before proceeding further to the next
lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 02
Data Mining Process
Welcome to lecture 2 of the course Business Analytics and Data Mining Modeling Using
R. In this particular lecture we are going to cover the data
mining process.
So, there are different phases in a typical data mining effort. The first phase
is named discovery. Important activities that are part of this phase
are framing the business problem: first we need to understand the needs of the
organization, the issues they are facing, the kind of resources that are
available, the kind of team that is available, the kind of data repository and other resources that are
available, and from that try to understand the main business problem and then try to
identify the analytics challenge or analytics component that is
part of that particular business problem. Then we start
to develop our understanding of the relevant phenomena, the relevant concepts,
constructs and variables, and we try to develop or formulate our initial hypotheses that are
going to be converted later on into a data mining problem.
So, these are some of the activities that we generally have to do in discovery, the first phase.
The next phase is data preparation. In this phase
we have already understood the problem, especially the analytics problem that we have to
deal with; therefore, based on our initial hypothesis formulation, we will also have
an understanding of the kind of variables that would be required to perform
the analysis. So, we can look for the relevant data from internal
and external sources and then compile the data set that can be used for the
analytics.
Another activity that can actually be performed in this stage is data consistency
checks. For example, data can come from a variety of sources; therefore, we need to check
whether the definitions of fields are consistent or not, we need to look
at units of measurement, and we also need to check whether the data format is
consistent or not. For example, if you have a variable gender in your data set, and in one
particular source it is recorded with the full forms male
and female, and from another source it is recorded as capital M or
capital F, then you have to make sure that the data is consistent when the whole
data is compiled or brought together; these consistency checks have to be done.
Then time periods: the data could belong to a particular
number of years, so we also need to check that, because the analysis or results could
actually be limited by the time period as well. Sampling: we do not always need all
the records that we have prepared in our data set; we generally take a sample of it,
because a smaller sample is generally good enough to build accurate
models.
Now, the next phase in a typical data mining effort is about data exploration and
conditioning.
(Refer Slide Time: 04:05)
So, in this phase we generally do activities like missing data handling. We also check for
range reasonability, whether the ranges of different variables are as per expectation or not.
We also look for outliers; they can come due to some
human error or measurement error etcetera. Graphical or visual analysis is another
activity: we generally plot a number of graphics, for example
histograms and scatter plots, which we are going to discuss in coming lectures. Other
activities that we can do are transformation of variables, creation of new variables and
normalization.
Why we might need to do some of these activities will become clearer
as we go further in coming lectures. Transformation and creation of new variables: for
example, sometimes we might require that sales figures be recorded
in a "more than 10 million or less than 10 million" kind of format; then we might
have to transform our variable. Creation of new variables: if we do some of these
transformations we will end up with some new variables, and sometimes we might use
both forms in our analysis. Normalization: sometimes scale could pose a problem in
a particular data mining algorithm or statistical technique; therefore, normalization might
be required. So, these are some of the activities that we have to do.
Training is generally the largest partition, and all the models are generally built on the
training data set; then fine tuning or selection of a
particular algorithm or technique happens on the validation data set, and the test data
set is used for the final evaluation of that selected method.
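A minimal sketch of such a three-way partition in R, assuming a data frame df and an illustrative 50/30/20 split (the proportions are an assumption, not prescribed by the lecture):

    set.seed(123)                       # for reproducibility
    n <- nrow(df)
    idx <- sample(n)                    # random permutation of row indices
    train <- df[idx[1:round(0.5 * n)], ]
    valid <- df[idx[(round(0.5 * n) + 1):round(0.8 * n)], ]
    test  <- df[idx[(round(0.8 * n) + 1):n], ]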
So, in this particular phase we need to determine the task that we have to
perform, whether it is a prediction task or a classification task. We also need to select the
appropriate method for that particular task; it could be regression, a neural network,
clustering, discriminant analysis or many other methods that we are going to
discuss in coming lectures.
The next phase is model building: building different candidate models using the
techniques selected in the previous steps and their variants. These are run using
training data, then we refine and select the final model using the validation data, and then
the evaluation of the final model is done on test data. So, in this
phase we mainly try out different candidate models, assess their performance, fine
tune them, and then finally select a particular model.
(Refer Slide Time: 07:24)
Now, the next phase is results interpretation. Once model evaluation
happens using performance metrics, we can go on and interpret the results. One part is the
exploratory part and another one could be the prediction part; all those interpretations
can actually be done in this particular phase. Then comes model deployment: once
we are satisfied with the model that we have, we can run a pilot project so that we are
able to integrate and run the model on operational systems. So, that is the last phase.
Now, this is the typical data mining process that one has to follow. Similar data mining
methodologies were developed by SAS and for IBM Modeler, which was previously known as
SPSS Clementine. These are commercial software for statistical modeling and
data mining modeling, and they also follow a similar kind of data mining methodology.
The SAS methodology is called SEMMA, and the IBM Modeler methodology is
called CRISP-DM.
(Refer Slide Time: 08:29)
Now, at this point we need to understand an important classification of data mining
techniques. Data mining techniques can typically be divided
into supervised learning methods and unsupervised learning methods.
Now, what is supervised learning? In supervised learning, algorithms are used to learn
the function f that can map the input variables, generally denoted by X, to the output
variable, generally denoted by Y. So, algorithms are used to learn a mapping
function, a function which can actually map input variables X to output variable Y;
you can also write this as Y = f(X). Now, the main idea behind this
is that we want to approximate the mapping function f such that new data on the input variables
can actually be used to predict the output variable Y with the minimum possible error. So,
that is the main idea. The whole model development that we do, this
learning of the function, is performed so that on new data we are able to predict
the outcome variable with the minimum possible error.
(Refer Slide Time: 09:48)
Now, supervised learning problems can be further grouped into prediction and
classification problems, which we have discussed before. Next is unsupervised
learning. In unsupervised learning, algorithms are used to learn the underlying
structure or patterns hidden in the data. If you look at the examples, unsupervised
learning problems can be grouped into clustering and association rule learning problems.
Now, there are some other key concepts that we need to discuss before
we move further.
Some of these concepts are related to sampling. We do not need to go into the detail
of all the sampling-related concepts; we will cover what is mainly applicable
and relevant to a data mining process. First we need to understand the target
population. How do we define a target population? The target population is the subset of the
population under study; for example, in our sedan car example in the previous lecture, it was
the household that was the target population. We wanted to study households and
whether they own a sedan car or not. Whenever we study a particular target
population, it is generally understood that the results are going to be generalized to the same
target population. Now, when we do an analysis we do not gather all the
data coming from the target population; we take a sample of it. The reasons are, as we have
discussed or indicated before, cost-related problems: we cannot actually go about
collecting data from the whole population because it is going to be a very costly process.
So, the purpose of business analytics, data mining modeling and
other related disciplines is to reduce this cost and still have some useful insights or solve
some of the analytics problems. That is why we require a sample. So, we generally take
a subset of the target population and then analyze the data that we have in that sample. In
our data mining scope we might mainly be limited to simple random
sampling, which is a sampling method wherein each observation has an equal chance
of being selected.
Now, what is random sampling? It is a sampling method wherein each observation does
not necessarily have an equal chance of being selected. Generally, simple random
sampling is used, where each observation has an equal chance of being selected. The
problem with simple random sampling could be that sometimes the sampled
observations might not be a proper representation of the
target population, precisely because of this equal probability
of each observation being selected; this could be one problem.
Now, whenever we do sampling, it is going to result in fewer observations than
the total number of observations present in the data set. There are some
other issues that could further bring down the number of
observations or variables in your sample. For example, data mining algorithms: different
data mining algorithms could have varying limitations on the number of observations or
variables. They might not be able to handle more than a certain number of
observations or more than a certain number of variables, so that could be one limitation
that could again limit your sample size.
Limitations can also be due to computing power and storage capacity. The
available computing power and storage that you have for your analysis can either
lower the speed or limit the number of observations or the number of variables that
can be handled. Similarly, limitations can also be due to the statistical software.
Nowadays, as you already understand, different software have different versions;
generally, if you are using a free version of a commercial software, that can
actually limit the number of variables or number of observations that can actually be
studied. So, those limitations can further bring down your sample size.
Now, while we are discussing limitations related to the number of
observations, we need to understand how many observations are actually required to build
accurate models, because the whole idea is to build accurate, good models,
so that we have good enough results which could be used on production systems later
on. Especially when we understand that we cannot get the data from the whole
target population and have to take a sample, cost is an important factor. So,
it is always better for us to understand the number of observations,
the sample size, that would be sufficient for us to build accurate,
robust models.
(Refer Slide Time: 16:32)
Now, there are many other concepts related to the data mining process that we
need to understand at this point; we will come back to sample size again. The next
concept is the rare event. Typically, when you are dealing with a particular data set,
if it is a classification problem you have an outcome variable, for example the
ownership variable that we talked about, whether a household owns a car
or not, so owner or non-owner; that kind of scenario is there.
Typically the split between observations belonging to the owner class
and observations belonging to the non-owner class would be 60-40 or around 50-50,
that kind of ratio. But it might so happen that, if it is a rare
event, out of let us say 1000 households only 10 or 20 households actually own a
sedan car; in that case the ownership of a sedan car
could actually be a rare event in that particular target population. So, in that case how do we
do our modeling?
Another example would be the low response rate in advertising by traditional mail or email.
You might be receiving different promotional or marketing advertising offers
through traditional mail, post, and emails as well. Now, not everyone is
going to respond to these offers, so again this can also be a rare event. So, how do we do
our modeling? What are the issues that we face in this kind of situation? If you have
a particular class which has very few observations
actually belonging to it, then any kind of modeling that you are
going to do using that particular data set might not give you a
good enough model; it might become very difficult for you to
build a model that will satisfy your main analytics goal.
For example, if you have 100 observations in your data set, 95 belong to the non
owner class and 5 belong to the owner class, and the main objective of your business problem
is to identify people who are owners, then even if you do not build a
model and simply classify every household as a non-owner, you will still get 95 percent
accuracy. So, in this case modeling for the success cases becomes an
issue. How do we solve this problem? We do oversampling. We oversample the
success cases; that means we duplicate many of the data points
which belong to the success class, or we change the ratio, that is, we take the
observations belonging to the owner class and the
observations belonging to the non-owner class in a 50-50 or similar ratio.
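A hedged sketch of oversampling the rare class in R, assuming the sedan car data frame df with an Ownership column; the class labels and target ratio are illustrative:

    # Duplicate (sample with replacement) owner rows until the classes are roughly 50-50
    owners     <- df[df$Ownership == "owner", ]
    non_owners <- df[df$Ownership == "non-owner", ]
    set.seed(7)
    owners_over <- owners[sample(nrow(owners), nrow(non_owners), replace = TRUE), ]
    balanced_df <- rbind(owners_over, non_owners)
    table(balanced_df$Ownership)   # check the new class ratio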
This kind of problem arises mainly in classification tasks. Another concept related to
this problem is the cost of misclassification. When we say that the success class
is more important for us, we are dealing with asymmetric costs here, because
identifying the success class is more important; it is more important for us to
understand which customer is going to respond to our email promotional offer
or marketing offer. So, we are essentially dealing with asymmetric costs.
Generally, the cost of failing to
identify success cases is going to be more than the cost of a detailed review of all cases,
the reason being that if you are able to identify that a particular customer is going to
respond to your offer, then the profit that you can earn by selling the
product or services to that customer would actually be more than the cost of a detailed review
of cases.
So, generally the benefit of identifying success cases is higher, and
that is why modeling becomes worthwhile. Now, another important aspect
of rare events is that when the success class is important for you,
prediction of success cases is always going to come at the cost of misclassifying some
failure cases as success cases. Because if out of 100 observations you have only 5
success cases, you would like to identify all those 5 so that you are able to make a profit
from your offerings; therefore, the model that you build might end up
identifying many failure cases also as success cases. So, the error
is going to be more than usual: if you have a 50-50 split of success
cases and failure cases, your error can be
on the lower side, but in this case the error would actually
increase. But the purpose is to identify success cases to increase the profit.
Now, another important aspect of the data mining process is dummy coding for categorical
variables.
There could be some statistical software which cannot use categorical variables
expressed in the label format. For example, if you have a variable like
gender where male or female, or M or F, are the labels
present in the data set, many statistical software packages might not be able to use this
particular variable directly. Therefore, dummy coding
might be required for these variables.
When we say dummy coding, we actually create dummy binary variables. These take the values
0 and 1, with 0 indicating the absence of a particular type and 1 indicating the
presence of a particular type. For example, if we have
data on the activity status of individuals and it has 4 mutually exclusive and jointly exhaustive
classes, such as student, unemployed, employed and retired, then we can create
different dummy variables: if a particular observation's
activity status is student, its student dummy will have the value 1, and if not it will
have the value 0, and similarly for other observations and other classes.
Now, because these classes are jointly exhaustive, we do not need to create all
four dummy variables. There are 4 types, but we do not need to create 4 dummy
variables, because being jointly exhaustive, if we know 3 of them then the fourth
one is already known. Therefore, we need to create only three dummy variables.
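A small sketch of this in R; the class labels follow the example above, and model.matrix() is one standard way to let R drop one reference level automatically:

    # Activity status with 4 mutually exclusive, jointly exhaustive classes
    activity <- factor(c("student", "employed", "retired", "unemployed", "student"))

    # Manual 0/1 dummies: only 3 are needed, the 4th is implied by the other three
    d_student    <- ifelse(activity == "student", 1, 0)
    d_unemployed <- ifelse(activity == "unemployed", 1, 0)
    d_employed   <- ifelse(activity == "employed", 1, 0)

    # Alternatively, model.matrix() creates dummies and drops one reference level
    model.matrix(~ activity)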
Now, another problem with a larger number of variables is that it is going to
increase your sample size requirements, because you always want to compute a reliable
estimate, and if you have more variables then your sample size
requirement will increase to achieve that higher reliability.
Now, another related problem is overfitting. What is overfitting? Overfitting
generally arises when a model is built using a complex function that fits the
data perfectly. If your model is fitting the data perfectly, then probably it might be
overfitting the data. What actually happens in overfitting is that your model ends up
fitting the noise and explaining the chance variations.
Ideally you would look to avoid explaining chance variations, because
you are looking to understand the relationship which can then be used to predict
future values. Therefore, overfitting is something that is not desirable. Overfitting
can also arise due to a larger number of iterations: if you do more
iterations, that can result in excessive learning of the data, which can also lead to overfitting.
If you have more variables in the model, some of those variables might
have a spurious relationship with your outcome variable, and that can also lead
to overfitting.
Now, the next concept is related to sample size. How many observations would
actually be good enough for us to build an accurate model? Domain
knowledge is important here to understand the sample choice, because as
you do more and more modeling and analytics, you would be able
to understand different phenomena, constructs, concepts and variables, and you will have a
better hunch, or better rules of thumb, for how many observations will
actually be required to build a model for a particular analytics problem. So, domain
knowledge is always going to be a crucial part.
We also have rules of thumb; for example, if you have p predictors, then 10 times p,
that is 10 observations per predictor, can actually be a good enough rule of thumb to
determine the sample size. Similarly, for classification tasks many researchers have
suggested rules of thumb, for example 6 times m times p observations, where m is the
number of classes in the outcome variable and p is the number of predictors. So, that can
actually help you determine a sample size.
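These rules of thumb are easy to compute; the numbers below are purely illustrative:

    p <- 8        # assumed number of predictors
    m <- 2        # assumed number of classes in the outcome variable
    10 * p        # prediction task: roughly 80 observations
    6 * m * p     # classification task: roughly 96 observations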
Now, what is an outlier? An outlier can briefly be defined as a distant data point.
The important thing for us to understand is whether this distant data point is a valid point
or an erroneous point, because if a particular data point is distant from the majority of the
values, or very distant from the mean, more than 3 standard deviations away from the
mean, then it could be due to human error or measurement error. So,
we need to find out whether a particular data point is because of human error or
measurement error; for example, a value of
100 or 150 for room temperature or temperature in a city could be a
human error or measurement error. Sometimes there would be errors due to
decimal points and related typing errors and all that. So, we need to identify
whether an outlier is a valid point or an erroneous value.
So, how do we do that? Generally you can do some manual inspection: you can sort
your values and find out if anything looks out of place; you can
also look at the minimum and maximum values and from there you can try to identify
whether they are outliers or errors. Clustering can also help you: you
can do clustering and then see whether a particular point is an outlier or not.
Domain knowledge is also going to help with this.
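One common heuristic, flagging values more than 3 standard deviations from the mean, can be sketched in R as below; the column Annual.Income is an assumed numeric variable:

    x <- df$Annual.Income
    z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
    df[abs(z) > 3, ]                 # candidate outliers for manual inspection
    summary(x)                       # range check: look at min and max
    head(sort(x)); tail(sort(x))     # smallest and largest values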
Now, the next related concept is missing values. You might come across a data set which is
very important for your particular business analytics problem,
but a few records have missing values. If those records are few
in number, then probably you can remove those records and go ahead with your
analysis, but if the number of such records is large, then that can end up eliminating
most of your observations. Therefore, you need to handle those missing values.
Imputation is one way: you impute those missing values with the
average value of that particular variable, so that could be one solution.
If that is also not desirable, then another option is, if you have many
missing values, to identify the variables where the missing values are; if
those variables are not very important for your analysis, then probably you can think
about dropping them. If those variables are important for your analysis, then
probably you can look for a proxy variable which has fewer missing
values and replace that particular variable with the appropriate proxy variable.
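A minimal sketch of mean imputation in R, assuming the numeric column Annual.Income in a data frame df; dropping incomplete records is shown as the alternative:

    # Mean imputation for a numeric variable (one simple option among several)
    miss <- is.na(df$Annual.Income)
    df$Annual.Income[miss] <- mean(df$Annual.Income, na.rm = TRUE)

    # Or, if the affected records are few, simply drop them
    df_complete <- na.omit(df)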
(Refer Slide Time: 31:15)
In many scenarios normalization is desired in the data mining process. There
are two popular ways of normalization. One is standardization using the z-score, where
each value is subtracted by the mean and divided by the
standard deviation. The other is min-max normalization, where you subtract the minimum
value from each value and then divide by the difference of the max and min.
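Both normalizations can be sketched in R as below, assuming a numeric column Annual.Income:

    x <- df$Annual.Income

    # Standardization (z-score): subtract the mean, divide by the standard deviation
    z_score <- (x - mean(x)) / sd(x)            # equivalently: as.numeric(scale(x))

    # Min-max normalization: rescale to the [0, 1] range
    min_max <- (x - min(x)) / (max(x) - min(x))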
(Refer Slide Time: 32:35)
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 03
Introduction to R
So, in this lecture we are going to cover a basic introduction; we are going to cover the basics of R.
So, before we start let us understand the installation steps, this is a specifically
for windows pc or laptops that is the expectation that many of the students they
would be having windows pc or laptop and these instructions are for the same.
So, first you need to install R, so the a link is given here and depending on the
your operating system whether it is 32 bit or 64 bit you can download the
appropriate file; installation file and the after once you are done with installing
R, then you can go ahead and install this R studio desktop version, this is the
GUI for R. So, the link is, download link is already given here, again depending
on settings all configuration of your desktop or laptop, you can download the
49
appropriate installation file 32 bit or 64 bit and then go ahead with your
installation.
So, these are some of the packages that I have mentioned here. install.packages is the function that is used for installing R packages on your system; some of the packages that we are going to use in this particular course are listed there. You can see install.packages being called with c, the combine function, which combines the names of all the packages so that all of them are installed in one go. You would also see that I have assigned the dependencies argument as TRUE, so if there are any dependencies for these packages, they would also be installed. So, once you are done with your installation of R, RStudio and Java as well, you can go ahead and use this particular function, this particular code, to install these packages.
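A sketch of such an installation line; the exact course package list is not reproduced here, so only xlsx and matrixcalc, two packages used later in this lecture, are shown as examples:

# install several packages in one go, along with their dependencies
install.packages(c("xlsx", "matrixcalc"), dependencies = TRUE)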
Now, let us understand R's GUI. R's graphical user interface is mainly a command-line interface; it is quite similar to the bash shell that we have in Linux, or to the interactive version of the scripting language Python. Many people use Python as well for data mining and statistical modeling, and the interactive version of that language is also based on a command-line interface. Now, RStudio is one of the popular GUIs for R and it has been used for this particular course; most of the R scripts have been written using this particular GUI, RStudio.
(Refer Slide Time: 04:07)
Now, as we have seen before, RStudio has 4 main window sections; we can see them again. The top left section is where we generally write and save our R code; you can see the instructions for installing packages written in this section and saved as a file called installation steps dot R. Most R scripts would have this extension, dot R, so this particular file is saved as installation steps dot R. The bottom left section is for the actual execution of R code and for displaying the generated output; it is also called the console section. So, the top left can be called the script section and the bottom left the console section, and here we actually see the execution of the R code and the related output.
Then, the 3rd window section is the top right section, which can be called the data section or environment section; in this particular section we manage our data sets and variables. As we will see later in this lecture, the variables that we initialize, or the values and data sets we assign to them, would be visible here once they are loaded into the RStudio or R environment. Now, the 4th window section is the bottom right section, where we actually display plots and seek help for R functions; it can be called the plot and help section. Right now the plot sub-section is active, and therefore any plots that we generate would be displayed here. You also have a help sub-section, so if you are looking for help related to some particular function in R, you can type the name of that particular function and the help would actually be displayed there. So, you would see, if we type help here, the documentation showing how the help function can actually be used, along with the details related to arguments, examples, notes, references and everything else.
(Refer Slide Time: 06:59)
Now, data set import: for this course, data sets are mainly available in Excel files. So, we would be importing data sets from Excel files, or we would be creating them in RStudio itself; any hypothetical data set we would be creating directly in RStudio. So, with this information, let us start our R basics and open RStudio.
One of the important things we require before we start is understanding the different packages and loading them into the R environment. For example, because, as we discussed, the data sets would generally be imported from Excel files in this course, the first line of this particular R script loads the library xlsx; this is the package which can actually be used to import a data set from an Excel file.
You would see, here in the console section, that this particular package has been loaded, and the required packages have also been loaded. Now, if we would like to import an Excel file, a data set, for example the sedan car data set that we used in lecture 1, we can actually import it. You would see in this particular environment section, or data section, that a particular file has been imported. If you click on this particular entry, a new window would open in the script section, a file view, and you can see the data set there. We have 3 variables in this particular data set: annual income, household area and ownership, and if you scroll down the data set you can see 20 observations. The same is displayed in the data section as well: 20 observations and 3 variables.
Now, this is one way of importing data from an Excel file; in this case I have used the function file.choose, which allows us to browse for our file in our Windows directories. Another way to import an Excel file into the R environment is to use the function read.xlsx and give the complete path to the particular file. One important difference to note here is that R uses the forward slash for mentioning the complete path, instead of the backward slash used in Windows systems. So, do not forget to change from backslash to forward slash.
Now, another way to import the data set is to set your working directory; that is this command. But first, let us see how the data set, the Excel file, can be imported using the full path name of the file. You can see it is actually the same data set, but as a second import: here df1, with 20 observations and 3 variables.
Another way of importing the data set is to set your working directory using the command setwd, again giving the full path name of your working directory. Just execute this particular command, and because your working directory has now been changed to where your Excel file is located, you can simply write the name of this particular Excel file and import the data set again; executing this code, you would see that another copy of the same data set has been imported.
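A minimal sketch collecting the three import approaches just described; the file path, file name and sheet index here are hypothetical placeholders, not the actual course paths:

library(xlsx)

# 1. browse for the file interactively
df <- read.xlsx(file.choose(), sheetIndex = 1)

# 2. give the complete path, using forward slashes
df1 <- read.xlsx("C:/Users/student/Documents/sedan_car.xlsx", sheetIndex = 1)

# 3. set the working directory first, then use the file name alone
setwd("C:/Users/student/Documents")
df2 <- read.xlsx("sedan_car.xlsx", sheetIndex = 1)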
This is a third instance of the same data set being imported, again with the same 20 observations and 3 variables. Now, once you have imported your data set, you can use it to start your modeling and execute the different steps related to the particular modeling exercise. Because this is an introduction to R, we are covering R's basics; therefore, we will try to understand some of the basic building block structures that are used in R.
So, first let us understand the numeric, character and logical data types: how they are used in R, how we can create them and then access them. How is a numeric variable created in R? You just need to type a name, say i, which could be a numeric variable in your code, then assign a value and execute. You would see in the data section that a variable i has been created and it holds the value 1. If we want to create a character variable, again you can type the name of your variable, in this case country, and assign a value; here we have initialized this particular variable with India as our country. So, again execute this code and you would see in the data section that country has been created and it has the value India. Similarly, a logical variable can be created; in this case we have a logical variable named flag and the value given is TRUE. It could be TRUE or FALSE, these being the 2 options for logical variables. Once we execute this code, you would again see that flag is created in the data section and its value is TRUE.
Now, how do we find out the characteristics of these variables? These 2 functions are useful: the class function can be used to find the abstract class of any particular variable, and the typeof function can be used to find the storage type of any particular variable. For example, for the variable i that we have just created, we can find out its abstract class, which is numeric, and its typeof, which is double. So, this particular variable is numeric and it is stored as a double in memory. Similarly, we can check for country: the class of the variable country is character and its storage type is character as well. So, similarly we can check for flag.
(Refer Slide Time: 15:54)
So, the class of the logical variable flag is logical, and its storage type is logical as well. Now, whether a particular variable is an integer or some other data type can also be examined in R, and coercion from one data type to another can also be done.
For example, if you want to find out whether the variable i is an integer or not, we can use the function is.integer. If we execute this particular code, we will get the answer FALSE, because i was created as a numeric variable and not as an integer variable. So, that can be checked using the function is.integer. Now, it is possible for us to coerce this numeric variable into an integer variable; how can that be done? Let us create another variable j and assign the value 1.5 to it; you would see in the data section that j has been created and can be seen here with the value 1.5. Now, let us check whether this particular variable is an integer or not; because we created it as a numeric variable, the answer should come out as FALSE, which is the case here. Now, let us coerce this variable into the integer data type. That can be done using the as.integer function: we pass the same variable as the argument, store the return value in j itself, and then display the value of j. Let us execute this line; you will see 1. So, the value 1.5, which was stored in j, created as a numeric variable, has now been changed to 1, because the variable has been coerced from the numeric data type into the integer data type. Now, if we again check whether this particular variable j is an integer, we will get the answer TRUE.
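Putting these type checks together, a short sketch:

i <- 1               # numeric
country <- "India"   # character
flag <- TRUE         # logical

class(i); typeof(i)              # "numeric", "double"
class(country); typeof(country)  # "character", "character"
class(flag); typeof(flag)        # "logical", "logical"

is.integer(i)        # FALSE: i is numeric, not integer
j <- 1.5
j <- as.integer(j)   # coerce to integer; the fractional part is dropped
j                    # 1
is.integer(j)        # TRUE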
Now, another important function is length. The length function can help us find the length of a particular variable; for example, the length of i is 1, and the lengths of country and flag, of all these 3 variables, are also 1. Now, the next discussion point is about vectors. Vectors are one of the basic building blocks for data in R; simple R variables like i, country and flag that we just created are actually vectors, and a vector can take values only from the same class. So, let us check whether the variables that we just created, i, country and flag, are vectors or not. Again, we can use the function is.vector, which is going to tell us whether these 3 variables are R vectors or not; you would see that all 3 variables are vectors. Now, for the creation and manipulation of vectors, there is the combine function c, and the colon operator can also be used. So, the c function and the colon operator can be used to create and manipulate vectors. For example, suppose we want to create a character vector of these 3 values: cricket, badminton and football.
(Refer Slide Time: 19:56)
So, this vector v can be created using the c function, passing these 3 strings as arguments; the vector v is created and it has the 3 values cricket, badminton and football. Now, if we want to access individual values, we can do so using brackets: v[1], v[2] and v[3] can be used to access the first, second and third values respectively. You can see v[1] is cricket in the output. Now, the colon operator can also be used to create a vector; for example, v1 created as 1 to 5 is a vector having the values 1, 2, 3, 4 and 5, as you can see here. In the colon operator you mention the starting and ending values, and the values in between are filled in automatically.
We can sum these values using the sum function; you can see that the sum of v1 gives the output 15. Similarly, multiplication by a constant can be done. If we want to access a particular value in the resulting vector, that can also be done using brackets: we wanted to access the 3rd value, so we use v2[3] and we see the output 6. If we want to add 2 vectors, that can also be done with v1 and v2, but here it is important that both have the same number of values. Similarly, if we want to find the values in a particular vector which are greater than 8, for example, in this case there are several values greater than 8 and we want to identify just those values. So, we can execute the line v3 > 8 and we will see the answer FALSE for the first 2 values, because they are less than 8, and TRUE for the remaining 3 values, which are actually greater than 8. Similarly, if we want to access the values which are greater than 8, we can write v3 and, within brackets, v3 > 8; it returns the positions of all the values which are greater than 8, and those values are then accessed through the brackets. Similarly, if we want to access the values which are greater than 8 or less than 5, a similar thing can be done; you can see the values 3, 9, 12 and 15, which are either less than 5 or greater than 8.
Now, sometimes we might be required to initialize a vector first and then populate it. Until now, we have been creating variables and vectors and initializing, that is populating, them at the same instant. Sometimes we might want to do this process in 2 steps: first initialize, then populate. For that we can use the vector function; the vector function can be used to initialize a vector of a given length, and in this case it is being used to create a vector of length 4. By default, if you use the vector function, a logical vector is created; if you want to create a numeric vector, you have to mention the mode as numeric in the vector function, and that can be done. If you want to reassign one of the values, for example the 3rd value, that can also be done; you can see the value 1.4 has just been assigned there. Similarly, if we want to create an integer vector, that can also be done: the mode argument has to be given as integer. You can see that this particular vector v6 was created with length 0, so if we check its length we get the right answer, 0; here the length function can again be used.
Now, from whatever we have done so far it might look like vectors are 1-dimensional arrays; the examples we have gone through give this impression. But if you really look at it in R, they are actually defined as dimensionless, and that can be checked using these 2 functions: if we check the length of this particular vector v5 we get the answer 4, but if we check the dimension using the dim function, we find it is NULL, which means undefined. So, vectors are not actually 1-dimensional arrays; they are actually dimensionless.
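A short sketch of two-step initialization and the dimension check (the variable names v5 and v6 follow the lecture; the exact initial modes are assumptions):

v5 <- vector(mode = "numeric", length = 4)  # 0 0 0 0
v5[3] <- 1.4                                # populate one element later
v6 <- vector(mode = "integer", length = 0)  # empty integer vector
length(v6)                                  # 0

length(v5)                                  # 4
dim(v5)                                     # NULL: vectors carry no dimension attribute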
That brings us to the next discussion point, arrays and matrices. R also has these building blocks. We have the array function, which can be used to structure data into an array; how that can be done we can see through this example. In the array function an initialization value of 0 is given, and we are creating an array with the dimensions 4 states, 4 quarters and 3 years; this particular array is created with these 3 dimensions, as we can see here. Now, I have assigned one of the values as 1000, and we can check that here as well. You would see a 3-dimensional array: this matrix-looking structure in the array is for the first year, then this one is for the second year, and then the third year. Since we created this array for 3 years, we can see 3 matrix structures back to back. Now, that brings us to the matrix: a 2-D array is a matrix, so an array can be thought of as a series of matrices, with a 2-D array being a single matrix. We can use the matrix function to initialize a matrix; you can see a matrix initialized with the value 0 for all the elements, and the number of rows and the number of columns both defined as 3 in this case, using the row and column arguments.
Now, with a different initialization we can create the matrix M1 that you can see here. Matrix multiplication can be done using the operator percent, asterisk, percent (%*%); this particular operator is used for matrix multiplication in R, and this is the result of the matrix multiplication M1 multiplied by M1 itself. If we want to find the inverse of a matrix, we have the matrix.inverse function for that, but to be able to access this function, to be able to use it, we first need to load the library matrixcalc. So, we just load the library matrixcalc, and once it is loaded we can call matrix.inverse and for any (invertible) matrix we will get its inverse. You can see the inverse matrix here. If we want to transpose a matrix, there is the function t, the transpose function, which can be used to get the transpose of a matrix.
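A sketch of these matrix operations; M1 here is a hypothetical invertible 3 x 3 example, not the exact matrix from the slide:

library(matrixcalc)

M1 <- matrix(c(2, 0, 1,
               1, 3, 0,
               0, 1, 4), nrow = 3, ncol = 3, byrow = TRUE)

M1 %*% M1            # matrix multiplication
matrix.inverse(M1)   # inverse, from the matrixcalc package
t(M1)                # transpose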
Now, the next discussion point is data frames. Data frames provide a structure for storing variables of different data types. Until now, in whatever we discussed, vectors, arrays and matrices, we were storing values of the same data type; data frames can be used for storing variables of different data types.
Now, they provide the flexibility to handle many data types and, because of that, they have also become the preferred input format for modeling in R. They can also be understood as a list of variables of the same length. For example, the data set that we imported earlier, the sedan car data set, has 3 variables of the same length: annual income, household area and ownership. Two of the variables, annual income and household area, are of the same type, numeric, but ownership is different. So, we can check whether a particular object is a data frame or not using the function is.data.frame. We will just check this for df, which is TRUE, because the files that we imported at the start of this particular script were actually stored as data frames. Now, another important aspect related to data frames is the dollar notation. Using the dollar notation you can access any of the variables stored in a data frame, for example annual income: we can access it using the line df dollar annual income, and you will get the values that are stored in this particular variable.
We can also check the length of an individual variable stored in a data frame using the same notation and passing it as an argument to the length function; you can see the answer is still 20. Similarly, as we discussed before, the variables in a data frame are actually vectors in R, so we can check the same with is.vector for annual income, household area and ownership, and we will see that these variables are indeed vectors. So, a data frame can also be considered as a collection of vectors of the same length. Now, if we look at one particular variable, ownership, it is actually a categorical variable, which is called a factor in R. If we want to check whether this particular variable is a factor, a categorical variable, we can do so by using the function is.factor, and you will see the answer TRUE.
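A sketch of these checks, assuming the data frame df imported above and the hypothetical column names Annual.Income, Household.Area and Ownership (the exact column names in the course file may differ):

is.data.frame(df)             # TRUE

df$Annual.Income              # access one variable with the dollar notation
length(df$Annual.Income)      # 20

is.vector(df$Annual.Income)   # TRUE: each column is a vector
is.factor(df$Ownership)       # TRUE when the column is stored as a factor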
So, this kind of structure of the variables can be displayed using str, the structure command. Now, the next important operator is the subsetting operator: brackets can also be used to subset a data frame. For example, if you want to access just the 3rd column of a data frame, then instead of using the dollar notation we can use this particular subsetting operator, the brackets, where the first index is for rows and the second is for columns. As I have mentioned 3 here, we will be accessing the 3rd column; you can see the 3rd column, the ownership-related values. Now, if you want to access the first 5 rows, you can mention 1 to 5 using the colon operator for the rows and leave the column position empty; since nothing is mentioned there, all the columns are displayed. If you want to access 2 columns in one go, we can do so by using the combine function inside the brackets; and if we do not remember whether a variable was the first column or the third column, we can use the combine function and mention the actual names of those variables or vectors, and again we can access the same data, as you can see here.
(Refer Slide Time: 35:37)
Now, if you want to retrieve a few records using comparison operators, for example the rows with annual income greater than 8 lpa, that can be done using this particular command: all the rows where the annual income is greater than 8 would be displayed, and you can see here that all the values are greater than 8, from 8.1 and 8.2 up to 10.8.
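A sketch of the subsetting operations just described, again under the hypothetical column names used above:

str(df)                                 # structure of all variables

df[, 3]                                 # 3rd column only (ownership)
df[1:5, ]                               # first 5 rows, all columns
df[, c(1, 3)]                           # columns 1 and 3 by position
df[, c("Annual.Income", "Ownership")]   # the same columns by name

df[df$Annual.Income > 8, ]              # rows where annual income exceeds 8 lpa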
Now, if we want to check the class of df or the type of df, that can also be done. You can see the class of df is reported as data.frame and its type is list; so data frames are basically lists. So, what are lists? Lists are a collection of objects of various types, including other lists. The list function can be used to create a list, and we will also come across the double bracket notation; we will see what that is. So, a list can be generated, for example, from the different objects we have already created, i, j, v and the matrix, plus one more element, a string, that we are adding for this list initialization and creation. Using the list function we can create the list, and you can see here that the list has been created. So, a list can store objects of various types and they can be of different lengths.
So, one big difference is that the objects in a list can be of various types and they can also be of different lengths, while in a data frame the variable types can be different but they have to be of the same length. For example, you would see the double bracket notation in the output: the first list element is the string Roorkee, the second list element is i with the value 1, the 3rd list element is j, again with the value 1, the 4th element is the character vector cricket, badminton and football, and the 5th element is the matrix. Now, if we check the class and length of a particular element of the list, we get the following: to look at the character vector we have to use double brackets, so with l and the index within double brackets you see the character vector that we just saw in the output, and the length of that character vector is also displayed as 3. The structure command can also be used with lists, as well as with data frames, which are also lists.
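A sketch of list creation and element access, reusing the objects created in the sketches earlier in this lecture (i, j, v and M are as assumed above):

class(df)      # "data.frame"
typeof(df)     # "list": a data frame is a list of equal-length vectors

l <- list("Roorkee", i, j, v, M)   # objects of different types and lengths

l[[4]]           # double brackets pull out the element itself: the character vector
class(l[[4]])    # "character"
length(l[[4]])   # 3

str(l)           # structure of all five elements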
(Refer Slide Time: 39:14)
So, you can see the 5 elements of the list, and the structure of all those elements is displayed: for example, Roorkee, which is a character, then a numeric vector and an integer vector, and then the character vector, followed by a numeric matrix. Now, our next discussion point is factors. As we discussed before, categorical variables are called factors in R. They can be ordered or unordered; as we discussed in the previous lecture, categorical variables can be either nominal or ordinal. Here, categorical variables are called factors, and ordinal variables are generally represented as ordered factors, while nominal variables are generally represented as unordered factors.
So, we already have the ownership variable in our data frame; let us check the class of that particular variable and find out whether it is an ordered or unordered factor. As you can see, the class of this particular variable is factor, which is categorical; then let us check whether it is ordered or not, and you will get the answer FALSE, because no order was specified there. Now, there is another function, head, which can be used to display the first 6 values, and if it is a factor variable then the levels are also displayed. So, let us run this, and you would see that the levels non-owner and owner are displayed for this factor variable. In the case of an integer or numeric variable we would have just seen the first 6 values; in the case of a factor variable, R knows that it is a factor, so in the output it also displays the levels that are used in this particular factor variable.
Now, how can we actually go about creating factor variables? For example, we have one particular variable in our data set, annual income. Suppose we want to create groups of those households, a categorical factor variable where some households belong to the lower middle class, some to the middle class and some to the upper middle class; how can that be done? So, let us first create a vector called income group; its mode is going to be character, because we are going to store the strings lower middle class, middle class and upper middle class, and its length would be the same as that of annual income. Let us create this vector. Then any record having annual income less than 6 can be called lower middle class, so let us assign this value; any household having income greater than or equal to 6 and less than 9 can be called middle class, so let us do that; and then any household having income greater than 9 can be called upper middle class. Now let us check the values: you would see that all the values have been assigned as lower middle class, middle class or upper middle class.
(Refer Slide Time: 42:40)
Now, how do we create a factor variable from this? Let us create another variable, income class, which is now going to be a factor variable. We use income group, the character vector that we just created, and the levels argument, because a factor variable is going to have levels, so we need to assign some levels. In this case we already know them, lower middle, middle and upper middle, and you can see here that ordered is TRUE. So, we are ordering these levels: lower middle, then middle, then upper middle; that is the order. We give the names of the levels and we also pass the argument ordered = TRUE, and this particular ordered factor is created. Now, we can combine this variable into our data frame df. The cbind command can be used to combine variables column-wise, so we use this cbind command, and income class is now combined into the data frame. We can check the same using the structure command, str; you can see income class is an ordered factor with 3 levels.
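A sketch of the whole grouping step, under the same hypothetical column name and the income cut-offs mentioned above (the exact level labels are assumptions):

income_group <- vector(mode = "character", length = length(df$Annual.Income))

income_group[df$Annual.Income < 6] <- "lower middle class"
income_group[df$Annual.Income >= 6 & df$Annual.Income < 9] <- "middle class"
income_group[df$Annual.Income > 9] <- "upper middle class"
# note: with these cut-offs a value of exactly 9 would stay unassigned

income_class <- factor(income_group,
                       levels = c("lower middle class", "middle class", "upper middle class"),
                       ordered = TRUE)

df <- cbind(df, income_class)   # add the ordered factor to the data frame
str(df)
head(df$income_class)           # values plus the ordered levels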
(Refer Slide Time: 43:56)
We can again run the head command to get a clearer view of what these levels are and their order. You can see that the values are shown along with the levels, and in the levels you can see that lower middle is less than middle, which is again less than upper middle; so there is an order among these levels, and this particular factor has been created as an ordered variable.
Now, the next discussion point is the contingency table. In this particular course, especially for classification tasks, we will be using contingency tables to understand the results of a particular classification technique or algorithm, so first we need to understand these tables. table is the command which can actually be used to create one, and a contingency table is generally used to store counts across factors. So, let us see this through an example: in df we have the ownership variable and the income class variable that we have just created. Now, let us see, for the owners, what their numbers are across the different classes, and likewise, for the non-owners, what their numbers are across the different classes. So, let us create a contingency table using the table command: we just need to mention these 2 factor variables and it will be done.
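A one-line sketch of that call, followed by the class and dimension checks discussed next (column name as assumed above):

tab <- table(df$Ownership, df$income_class)   # counts of owners and non-owners per income class
tab

class(tab)    # "table"
typeof(tab)   # "integer"
dim(tab)      # 2 3: two ownership levels by three income classes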
(Refer Slide Time: 45:22)
You would see that the first row is about non-owners and the second row about owners, and you will see 3 columns: lower middle, middle and upper middle. For the non-owners, there are 6 lower middle class non-owners, 4 middle class non-owners and 0 upper middle class non-owners. Now, if we look at the owner row, you would see that there are 0 lower middle class owners, 8 middle class owners and 2 upper middle class owners. So, this kind of count we can always get using the table command, and this particular command is going to be useful for us when we do our classification tasks. Now, if we want to find out the class, type and dimension of this particular table, we can execute these lines: you can see table and integer, and the dimensions are also mentioned, 2 and 3. So, this ends the introduction to R. In the next class, the next supplementary lecture, we are going to cover basic statistical techniques.
Thank you.
Business Analytics and Data Mining Modeling Using R
Prof. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 04
Basic Statistics Part-1
Welcome to the course on Business Analytics and Data Mining Modeling Using R. This is our supplementary lecture number two, on basic statistics using R. So, let us start. As we have discussed, there are three types of analytics: the first being descriptive, then predictive and then prescriptive. We are going to cover the descriptive part, and we are going to learn some of the basic statistics using R.
(Refer Slide Time: 00:54)
So, let us open RStudio. As we have done in the previous lecture, we are first required to load this particular library. Why do we need this? Because we want to import the data set from an Excel file. The data set that we want to import is the same one, the sedan car data set. So, let us execute this line; you can see in the data section that this data set has been imported, with 20 observations and 3 variables. Again, let us have a relook at the first six rows of this particular data set; you can see annual income, household area and ownership, the same variables as before.
Now, let us start our descriptive analysis. One of the first functions that is popular and used quite often is the summary function. The summary function in R can help you get an idea of the magnitude and range of the data; it also provides several descriptive statistics like the mean, median and counts, as we will see in the output. So, let us execute summary(df). In df, we have three variables: annual income, household area and ownership.
Looking at the output, let us start with annual income. You can see the values range from a minimum of 4.3 to a maximum of 10.8, with the mean at about 6.8 and the median at about 6.5. You can also see other things like the first quartile and third quartile, which are at 5.75 and 8.15; these quartiles also give you an idea of where the majority of the values lie. Now, let us look at the second numerical variable, household area. Here, all the values lie between the minimum of 14 and the maximum of 24; the majority of the values lie between the first quartile of 17 and the third quartile of 21, with the mean at 18.8 and the median at 18.75. Now, these statistics are mainly for numerical variables. For the categorical or factor variable that we have, ownership, only the counts are displayed; the statistics meant for numerical variables are not applicable.
Now, let us move on to other basic statistical measures; the first one is correlation. How do we compute the correlation between two variables? Correlation is applicable between two numerical variables. We want to find out how a particular variable is correlated with another variable, and that can be done using the cor function. We can pass the two arguments, annual income and household area, and find the correlation between these two variables; the correlation value comes out to be 0.33 for annual income and household area. Correlation generally gives you an idea of the relationship between the variables, and the correlation value lies between minus 1 and 1. In this case it is plus 0.33; values closer to 1 or minus 1 signify a high degree of correlation, and values closer to 0 indicate a low level of correlation between the variables. We will discuss correlation further in coming lectures. Now, the next important statistic is covariance. For covariance we have the cov function available in R, and again we can pass two numerical variables; in this case our example is annual income and household area. Let us execute this line, and you can see the covariance computed between these two variables. Covariance again relates to the spread of values: how much common spread there is between these two variables, the overlap between them, is what the covariance value indicates.
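A sketch of these calls on the sedan car data frame, with the same hypothetical column names as before:

summary(df)                               # range, quartiles, mean and median, counts

cor(df$Annual.Income, df$Household.Area)  # correlation, about 0.33 here
cov(df$Annual.Income, df$Household.Area)  # covariance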
Now, there are other simple statistics that we can compute using simple R functions, such as the mean. The mean was something that was part of the summary function output as well, but if we are interested in just computing the mean of a particular variable, that can also be done using the mean command. Let us execute this for annual income; you can see the value, the same as what is displayed by the summary function. Similarly, the median can also be computed; we have the median function in R that can be used to compute the value. Now, if you are interested in a few more statistics, for example the inter-quartile range: the inter-quartile range is the difference between the first and third quartiles. As we discussed, the summary function gives us the first quartile and third quartile; this is another way to get the same information. So, let us execute this line with the IQR function, passing annual income as the argument, and you will get the value. Now, if we are just interested in the minimum and maximum values, we have a direct function called range with which we can find them, so we do not need to depend on the summary function and can use this standalone function which provides the specific estimate. For the standard deviation there is the function sd available in R, so we can always compute the standard deviation of any variable; for annual income it comes out to be 1.7.
Similarly, if you want to compute the variance of a particular variable, variance meaning the spread of values for that particular variable, that can be computed using the var function in R, as you can see. So, the summary function covers some key statistics, and in one go you can compute them for all the variables in your data frame, in your data set; or, if you are interested in one of those statistics, you can use the corresponding direct function and compute the same.
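A sketch of the standalone functions mentioned, applied to the same hypothetical column:

x <- df$Annual.Income

mean(x)      # average
median(x)
IQR(x)       # third quartile minus first quartile
range(x)     # minimum and maximum
sd(x)        # standard deviation, about 1.7 here
var(x)       # variance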
Now, there are some important functions available in R which we might need to use sometimes, for example to transform a particular variable or to do some specific task which is repetitive in nature; where a function is already available in R, we can simply use it. One such function is apply; in coming lectures we will keep learning about many more useful functions from R. The apply function can be used if you want to apply a function to several variables in a data frame. For example, if you want to compute the standard deviation for all the numerical variables in a data frame, that can be done in one go using this particular function. The first argument is the data frame containing the variables to which you want to apply the function. Variables are generally recorded in columns, and the margin argument indicates the same thing: a margin value of 2 means that the function is to be applied column-wise. The third argument is the function, FUN; in this example we want to compute standard deviation values, so sd, the function we have seen before, is passed on. If you execute this line for the two variables, annual income and household area, you can see that the standard deviation values have been computed.
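A sketch of that apply call over the two numeric columns (the column positions 1 and 2 are an assumption):

# apply sd column-wise (MARGIN = 2) to the numeric variables
apply(df[, 1:2], MARGIN = 2, FUN = sd)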
Sometimes you might have to write your own functions, which are generally called user-defined functions: a predefined function might not be available in R and you might be required to write your own. So, here is an example, a very simple one, just to give you an idea of how you can write your own function and use it in your modeling, data preparation and transformation steps. This function provides the difference between the max and minimum values for all the variables in a data frame. First, you need to come up with the name of your function. Because we want to compute the difference between max and min values for all the variables, the name is mm for max-min and diff for the difference, so mmdiff is the name I have given. Then you have to use the keyword function to define it, and you have to mention the argument that is allowed to be passed, in this case a data frame. Then, within this particular function, I have used the built-in function apply, which again takes the data frame as its first argument and the margin for columns; and within apply I am, in a way, defining one more user-defined function: function(x) max(x) minus min(x). So, let us execute this code, so that this function becomes available for us to use later.
Now you would see in the data section that a functions sub-section has been created, and you would see mmdiff as the function name; this can now be called any number of times in your code. So, let us call this function mmdiff, passing as the argument the data frame restricted to the first and second columns. Let us execute this particular code, and you would see that the difference between the max and minimum values for these two variables, annual income and household area, has been computed, as you can see here. If you want to verify whether your user-defined function, the function written by you, is working fine or not, you can do so.
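A sketch of that user-defined function and its call, as described above (column positions again assumed):

# max-min difference for every column of a data frame
mmdiff <- function(d) {
  apply(d, MARGIN = 2, FUN = function(x) max(x) - min(x))
}

mmdiff(df[, 1:2])   # annual income: 10.8 - 4.3 = 6.5; household area: 24 - 14 = 10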
(Refer Slide Time: 12:39)
So, let us run the summary command and see whether our user-defined function has provided the correct output or not. You can see in the summary the difference between the max and min values for annual income: this is 10.8 minus 4.3, which is 6.5, so that is correct. Next, for household area, its max value is 24 and min value is 14, the difference being 10, and you can see that as well. So, your user-defined function is giving the correct output.
Now, let us move on to our next part, which is about initial data exploration. Beyond the basic statistics that we have just discussed, sometimes we need to understand a bit more: for example, whether there is any potential linear relationship between variables, or what the distribution of the data looks like. For that, some level of visualization is required, so now we are going to discuss some techniques related to visualization. These are some of the things that should be done before starting the formal analysis or formal modeling.
So, one of the most important visual analyses can be done using a scatter plot. For this, I am going to generate some hypothetical data. The function rnorm can be used to randomly generate data which follows a normal distribution. The first argument that I am passing here is 100, which means I want to generate 100 observations or 100 values. So, let us generate the values for x. You would see in the data section that x has been created, a numeric vector of 100 values; the values have been generated randomly and they also follow a normal distribution. Now, we can generate another variable, y; let us generate it as x plus rnorm, in this case specifying the mean as zero and the standard deviation as 0.6. Let us execute this line and generate y. You can see that y, another vector, has been created, having the same number of observations, 100. Now, if we want, we can combine these two variables and create a data frame; that can be done using the as.data.frame command, so these two variables will be coerced and a data frame will be created. Let us execute this line. Now, let us see what the data looks like: in the first six observations you can see x and y, and you can see that these data points have been randomly generated. Let us also look at the summary of this data frame, which is available to us.
Now, the scatter plot. The plot command is a generic command that is available in R and can be used to generate many kinds of plots; in this particular case, we are trying to generate a scatter plot. In the plot command, the first argument should be the variable which is going to be plotted on the x-axis, and the second argument the variable which is going to be plotted on the y-axis. Then las is another argument, available mainly for visual appeal; you can seek help to get more information on las. The next important argument is main, which gives the title of the plot; in this case, we have given the title of the plot as scatterplot of x and y.
Now, it is important for you to label the x-axis and y-axis, because sometimes we are going to use a data frame with the dollar notation, and that would otherwise be taken as the default names for your x-axis and y-axis. In this case we have given the name of the x-axis as x and the name of the y-axis as y. You can also specify limits for your x-axis and y-axis, because otherwise the plot might not look good, with only a small portion of the plot area displaying the data; if you restrict the limits appropriately, your data points will cover the majority of the plot area. As you can see in the summary command that we just ran, the minimum value for x is minus 2.13 and the max value is 2.77, so we can say that all the values are going to lie within the range of minus 3 to 3, and therefore we have given the x limits as minus 3 to 3. Similarly for y, the minimum value is minus 2.79 and the max value is 3.37, so all these values lie within the range minus 4 to 4, and therefore we have given the y limits as minus 4 to 4. So, let us execute this code, and you would see that a plot has been created.
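A sketch reproducing this exploration end to end; the exact random values will differ on each run, and the seed line is an addition not present in the original script:

set.seed(1)                        # for reproducibility (not in the original script)
x <- rnorm(100)
y <- x + rnorm(100, mean = 0, sd = 0.6)
d <- as.data.frame(cbind(x, y))

head(d)
summary(d)

plot(d$x, d$y, las = 1,
     main = "Scatterplot of x and y",
     xlab = "x", ylab = "y",
     xlim = c(-3, 3), ylim = c(-4, 4))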
(Refer Slide Time: 18:32)
And you would see all the values; from this plot, looking at all the data points, you can see that a line could be drawn from one end to the other and it would closely fit the data. So, there seems to be a linear relationship between x and y. Why is this kind of relationship visible in this case? It is mainly because of the way we have generated x and y: if you look at how they were generated, x was randomly generated and then y was x plus a small amount of randomly generated noise, so the linear relationship comes from there.
Now, let us start our discussion on hypothesis testing. What is hypothesis testing about? This is one of the most commonly used statistical techniques. Generally, whenever we are trying to formulate a business problem, one part of it is going to be data mining related, analytics related or statistics related; and in statistical modeling, generally the first step is the formulation of hypotheses, so here also we are going to learn this particular technique. Hypothesis testing is generally about comparing populations: for example, comparing the performance of students in exams across two different class sections. We want to understand how class A students have performed in their exams, and whether that is significantly different from class B or exactly the same; these kinds of comparisons can be performed using hypothesis testing.
So, essentially what we are doing is testing the difference of means from two data samples: one could be class A and the other class B, and we compare the means of these two data samples, so that we can statistically find out whether there is a difference in performance or not. The common approach is to assess the difference and the significance of the same. So, the idea, as we discussed, is generally to formulate an assertion and then test it using data.
Now, what are some of the common assumptions in hypothesis testing? Generally, we start with the assumption that there is no difference between the two samples. For example, in the example that we just discussed, we can assume that the performance of students belonging to class A and the performance of students belonging to class B is similar; there is no difference. That is the starting point for us in hypothesis testing, and it is generally referred to as the null hypothesis, denoted as H0: generally, the null hypothesis states that there is no difference between the two samples.
For the alternative hypothesis: if we have some reason to believe that the performance of class A is superior to class B, or conversely that the performance of class B students is superior to that of class A, then we can state that using the alternative hypothesis, which can be denoted as H a; in it, we state that there is a difference between the two samples. Now, we are interested in a few more examples of hypotheses, and how we can formulate our null hypothesis and alternative hypothesis. Here is one more example, the one we already discussed: students from class A and B show the same performance in the examination is the null hypothesis, and students from class A perform better than students from class B is the alternative hypothesis.
Some more examples are given in this particular slide. For example, for a new data mining model: the null hypothesis could be that the new data mining model does not predict better than the existing model, and the alternative hypothesis that the new data mining model predicts better than the existing model. So, what is going to happen after we do this hypothesis testing? Either the testing results will lead to rejection of the null hypothesis in favour of the alternative, or to acceptance of the null hypothesis.
(Refer Slide Time: 23:32)
Let us look at another example. This one is more related to regression analysis; when we discuss regression analysis in coming lectures, this will seem more relevant to you. An important null hypothesis in the regression analysis case is that a regression coefficient is zero, i.e., the variable has no impact on the outcome. The alternative hypothesis is that the regression coefficient is nonzero, meaning the variable has an impact on the outcome. So, these are some of the examples of hypothesis formulation, and of how we can state our null hypothesis and alternative hypothesis.
Because of the central limit theorem, whenever we have gathered more than about 30 observations from a population, the distribution of the sample mean starts following a normal distribution. Therefore, normality is a commonly occurring property that we generally find across different samples and populations, and it is easier for us to use this particular characteristic, the normal distribution, for our hypothesis testing.
So, generally, as we discussed, hypothesis testing is about the difference of means, so we generally look to test a difference of means. The idea is to draw inferences about two populations; for example, if there are two populations, one being P1 and the other P2, how can we draw inferences about them? That is the main idea. Generally, this is done by comparing means: for population one and population two, the population means are mu1 and mu2. Therefore, we can state our null hypothesis as mu1 being equal to mu2, meaning both populations have the same mean and are therefore the same; and the alternative as mu1 not equal to mu2, meaning there is a difference between the two population means and therefore between the two populations. How do we do this? Because it is generally difficult to get information about the whole population, we take samples; generally, we draw random samples from these populations.
So, our basic approach is to draw randomly generated samples from these populations and then compare the observed sample means. Since mu1 and mu2 are unknown, we take samples from the populations and compute the observed sample means, which can be denoted as x1 bar and x2 bar: x1 bar for population P1 and x2 bar for population P2.
And then we can go about doing a hypothesis test. Two popular hypothesis tests are Student's t-test and Welch's t-test; we will go through them one by one. So, let us first discuss Student's t-test. Some of the basic assumptions related to Student's t-test concern the two populations: for the two population distributions P1 and P2, we assume that they have equal variance. We do not know the variances of these two populations, but we assume them to be equal; only then can Student's t-test actually be performed. Now, let us say we have two samples of n1 and n2 observations respectively from these two populations P1 and P2, and they have been randomly and independently drawn from the two populations. So, these are some of the assumptions related to Student's t-test.
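As a sketch, R's built-in t.test function covers both tests discussed here: var.equal = TRUE gives Student's t-test and the default var.equal = FALSE gives Welch's t-test. The two score vectors below are simulated placeholders, not course data:

set.seed(2)
class_a <- rnorm(30, mean = 70, sd = 10)    # hypothetical exam scores, class A
class_b <- rnorm(30, mean = 65, sd = 10)    # hypothetical exam scores, class B

t.test(class_a, class_b, var.equal = TRUE)  # Student's t-test (equal variances assumed)
t.test(class_a, class_b)                    # Welch's t-test (default)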
Now, the next point is mainly about how the t-statistic is actually computed. If we assume that P1 and P2 are normally distributed, which is generally the case because of the central limit theorem, with the same mean and variance, then the t-statistic follows a t-distribution with n1 + n2 − 2 degrees of freedom, and it can be computed as

t = (x̄1 − x̄2) / (Sp · √(1/n1 + 1/n2)),

that is, the difference x̄1 minus x̄2 divided by the pooled sample standard deviation Sp multiplied by the factor √(1/n1 + 1/n2). The pooled sample variance is defined as

Sp² = [(n1 − 1) S1² + (n2 − 1) S2²] / (n1 + n2 − 2),

where S1² is the variance of the sample drawn from population one and S2² is the variance of sample 2 drawn from population 2; you can see that a kind of weighted average has been taken to compute the pooled sample variance. So, this is the statistic that is computed, under the assumption that the null hypothesis is true. We need to keep in mind that P1 and P2 are taken to be normally distributed with the same mean and variance; assuming the null hypothesis is true, we can then go ahead and compute this particular t-statistic.
So, as we said, Sp is the pooled standard deviation and S1 and S2 are the sample standard deviations; Sp² is the pooled sample variance and S1² and S2² are the sample variances. Now, another point is regarding the shape of the t-distribution. The shape of the t-distribution is generally similar to the normal distribution, and it becomes more so when the degrees of freedom reach 30 or more; as we have more and more observations in our sample, the t-distribution becomes more like the normal distribution. The normal distribution and the t-distribution are also called bell curves, because their shape looks like a bell.
Let us try to understand this particular t-statistic. Going back, t is defined through x̄1 minus x̄2, where x̄1 and x̄2 are the observed sample means. So, if x̄1 and x̄2 are quite close to each other, if the observed sample means are quite close, then the observed t value is also going to be close to 0. When it is close to 0, the sample results are almost exactly what the null hypothesis asserts, so the null hypothesis looks good and in that case we accept it. So, if the observed sample means x̄1 and x̄2 are close to each other, that is, if t is close to 0, then the null hypothesis is generally going to be accepted.
Now, let us understand the next point. If the observed t value is far enough from 0, and the t-distribution indicates a low enough probability for such a value, then it will lead to a rejection of the null hypothesis. That is, if one of the observed sample means is much greater than the other, leading to a large t value whose probability under the t-distribution is low, that can lead to rejection of the null hypothesis. A t value falling in the corresponding tail areas of the curve should happen less than 5 percent of the time; in that case, the null hypothesis would be rejected. So, we will stop here, and we will continue our discussion on basic statistics using R in the next part.
Thank you.
92
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology Roorkee
Lecture - 05
Basic Statistics- Part II
Welcome to the course Business Analytics and Data Mining Modelling Using R. This is the fourth supplementary lecture on basic statistics. In the previous lecture we stopped our discussion at Student's t-test, so let us pick up from there. As we discussed in the previous lecture, the most common type of hypothesis testing that we actually do is about the difference of means.
So, let us take this example: if P1 and P2 are normally distributed with the same mean and variance, then the t-statistic follows a t-distribution with n1 + n2 - 2 degrees of freedom. We talked about this particular formula for the t-statistic in the previous lecture as well: t is computed as the difference between x1 bar and x2 bar divided by the pooled sample standard deviation multiplied by the factor sqrt(1/n1 + 1/n2). The pooled sample variance is a weighted average of the sample variances from population 1 and population 2:

Sp square = ((n1 - 1) * S1 square + (n2 - 1) * S2 square) / (n1 + n2 - 2).
Now, as far as the shape of the t-distribution is concerned, it is quite similar to the normal distribution, and as the degrees of freedom reach 30 or more it closely resembles the normal distribution. So, both the t-distribution and the normal distribution are bell-shaped curves. Now, some specific discussion points on the t-statistic formula. You can see that the numerator of the t-statistic is the difference of sample means. From there you can understand, for the testing of our null and alternative hypotheses, that if the observed t value comes out to be 0, it indicates that the sample results are exactly in line with the null hypothesis, that is, the means are equal.
Similarly, if the observed t value is far enough from 0 and the t-distribution indicates a low enough probability, let us say less than 0.05, then in that case our null hypothesis H0 would actually be rejected. To get a better understanding, let us look at the normal curve.
(Refer Slide Time: 03:18)
So, for a t value which is far enough from 0, the probability of a value that extreme falling under the curve is quite low, let us say 0.05 or less; in that case the null hypothesis would be rejected, because our t value is falling in one of the two tail areas rather than in the main central area, and the probability of falling in those tail areas is only 0.05. So, in this case the null hypothesis would actually be rejected. The same discussion can also help us understand the confidence interval, which we will cover later in this lecture. You might have come across terms like 95 percent confidence interval, 90 percent confidence interval or 99 percent confidence interval. In this case, the small probability we talked about for the t-statistic falling in the tail regions corresponds to a 95 percent confidence interval. We will do more discussion on confidence intervals later during this lecture.
So, another way to understand this is that the t value falls in the corresponding tail areas of the curve less than 5 percent of the time. The low probability that we talked about, 0.05, is generally denoted by alpha, and it is also known as the significance level of the test.
Now, how do we find out whether a null hypothesis is going to be rejected or not? We generally compute a critical value t*, which is determined in such a way that the probability of the magnitude of the observed t value being greater than or equal to t* is alpha. Once t* is determined, we compare it with the observed t value, and if the magnitude of the observed t value is greater than or equal to t*, then the null hypothesis is rejected. So, t* is determined such that P(|t| >= t*) = alpha, with alpha equal to 0.05 for example, and the null hypothesis is rejected if the observed value of t is such that |t| >= t*.
Now, the significance level of a statistical test is the probability of rejecting the null hypothesis when it is actually true. For example, in this particular case we take alpha as 0.05; if the null hypothesis is true, then the observed magnitude of t would exceed t* only 5 percent of the time.
Now, another term that you might have come across is the p-value. The p-value is the sum of the probability of t being less than or equal to minus the magnitude of the observed t value and the probability of t being greater than or equal to the magnitude of the observed t value, that is, p-value = P(t <= -|t observed|) + P(t >= |t observed|). Now, let us open RStudio and go through one example related to Student's t-test.
So, let us first create some hypothetical data for the two variables x1 and y1. We are going to use the rnorm command that we discussed in the previous lecture. For x1 we want 20 observations with mean 50 and standard deviation 5, and for y1, corresponding to the second population, we want 30 observations with mean 60 and standard deviation again 5.
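As a rough sketch of the commands described here (the exact code is not shown in the transcript, so the seed is an assumption added only for reproducibility):

set.seed(1)                           # assumed seed so the samples can be reproduced
x1 <- rnorm(20, mean = 50, sd = 5)    # sample of 20 from the first population
y1 <- rnorm(30, mean = 60, sd = 5)    # sample of 30 from the second population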
You can see that, because of the assumptions related to Student's t-test, we have kept the standard deviation the same while creating these two samples. So, let us compute this: x1 has been created with 20 observations, and these values are randomly generated following a normal distribution. Similarly, executing y1 gives the second sample; in the environment section you can see y1 with 30 observations, again randomly generated and following a normal distribution.
Now, let us come to Student's t-test. We have the t.test() function available in R, and it can be used to run Student's t-test. We pass x1, the first sample, and y1, the second sample, and you would see there is another argument, var.equal, related to the variances: as we understand, in Student's t-test the variances of the two populations are assumed to be equal, so var.equal is set to TRUE. Once we write this particular code, we can execute it and run the t-test.
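A minimal sketch of the call being described, assuming x1 and y1 were created as above:

t.test(x1, y1, var.equal = TRUE)   # two-sample Student's t-test with equal variances assumed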
You would see in the result that it is a two sample t-test, because we are comparing the means of the two samples x1 and y1. The data are mentioned as x1 and y1, the t-statistic is reported as t = -7.1424, and the degrees of freedom are given as 48: there were 20 observations in x1 and 30 in y1, so the total is 50, and subtracting 2 for the two estimated means gives 48.
Now, you can also see that a p-value has been computed; how this computation is done we have discussed in the slides. You would also see that the alternative hypothesis is reported as the true difference in means being not equal to 0; the alternative hypothesis is supported and the null hypothesis is rejected in this particular case. A 95 percent confidence interval is also mentioned, lying between about -14 and -8.289; we will have a discussion on confidence intervals later in this lecture.
Now, you would also see that the mean of x and the mean of y are given there. If you want to compare the result of Student's t-test with the critical t value corresponding to a 0.05 significance level for a two-sided, two-sample hypothesis test, we can do this using the qt() function. In this case the significance level 0.05 is divided by 2, because the distribution is symmetric and there are two rejection regions, so the probability has to be split equally between them; the degrees of freedom are 48, that is 20 plus 30 minus 2, and lower.tail is set to FALSE. With this we can find the critical t value for the 0.05 significance level.
So, let us execute this code; for the given significance level of 0.05 and 48 degrees of freedom, the critical t value comes out to be about 2.01. If you compare this with the observed t value of -7.1424, its magnitude 7.1424 is much greater than 2.01, and therefore the null hypothesis is rejected.
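A small sketch of this check, assuming the samples above:

t_star <- qt(0.05/2, df = 20 + 30 - 2, lower.tail = FALSE)  # two-sided critical value, about 2.01
abs(-7.1424) >= t_star                                      # TRUE, so the null hypothesis is rejected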
Let us go back to our discussion. Another test that can be performed for hypothesis testing related to the difference of means is Welch's t-test. When do we use Welch's t-test? When the population variances are not equal, that is, when the assumption of equal population variances is not reasonable and cannot be met, then we can use Welch's t-test to do our hypothesis testing. The formula for the Welch t-statistic is again the difference between the two sample means, divided by the square root of the sum of the scaled sample variances:

tw = (x1 bar - x2 bar) / sqrt(S1 square / n1 + S2 square / n2).
So, this is the formula for computing the t-statistic. As far as the interpretation is concerned, just as we discussed for Student's t-test, the numerator is again the difference of sample means, so the same points apply: if the numerator is 0 then the null hypothesis looks plausible, and if the numerator is far from 0 with a low associated probability like 0.05, then the null hypothesis would be rejected. In that sense the interpretation of results is the same; the one important difference is that the population variances cannot be assumed to be equal.
Now, another assumption that was applicable in Student's t-test, that random samples are drawn from normally distributed populations, is still applicable. Welch's t-statistic also follows a t-distribution, which, as we discussed, is very similar to the normal distribution and becomes almost normal as the degrees of freedom reach 30 or more.
So, let us do a small example for Welch's t-test; let us open RStudio. We can use the same data to perform Welch's t-test as well, and the same t.test() function can again be used. The only difference is that var.equal is now set to FALSE. So, the third argument var.equal is FALSE, and we pass the same two samples x1 and y1 and run the test on them.
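A sketch of the call (note that var.equal = FALSE is in fact the default in R's t.test()):

t.test(x1, y1, var.equal = FALSE)   # Welch's two-sample t-test; variances not assumed equal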
So, let us execute this particular line. You can see that the name has now changed to Welch two sample t-test, with the two samples being x1 and y1. You can also see the Welch t-statistic, which comes out to be -6.6412, and the degrees of freedom, 30.926; the degrees of freedom computation in Welch's t-test is slightly different from Student's t-test, and we will not go into the details of that. The p-value again has the same interpretation and meaning; here too the p-value is less than the low probability value of 0.05 that we talked about, the alternative hypothesis is that the true difference in means is not equal to 0, and again the null hypothesis is rejected. You can also see the 95 percent confidence interval in the results, roughly -15 to -7.99; we will discuss confidence intervals in a while in this lecture.
Now, the sample estimates, the mean of x and the mean of y, are also given. If we repeat the earlier computation of the critical t value corresponding to the 0.05 significance level and the appropriate degrees of freedom, we will find that the magnitude of the observed t-statistic, 6.6412, is again greater than that critical value, and therefore the null hypothesis has to be rejected.
So, let us go back; the next discussion point is the confidence interval. A confidence interval provides an interval estimate of a population parameter using sample data. Till now what we were looking at was the point estimate, but using a confidence interval we can also provide an interval estimate of a population parameter. The confidence interval in a way also tells us about the uncertainty associated with the point estimate: the point estimate might not be accurate, and the confidence interval is a way of expressing that uncertainty.
Now, another way of understanding the confidence interval is how close x bar is to mu, because our x bar is computed from a sample randomly drawn from the normally distributed population. The confidence interval gives us a range within which, with some stated level of confidence, we can say the population parameter is going to lie.
For example, a 95 percent confidence interval estimate for a population mean straddles the true unknown mean 95 percent of the time. What we mean is that if we compute interval estimates at a 95 percent confidence level, then 95 percent of the time the population mean is going to lie in that range. The same thing can be expressed as mu belonging to the range x bar plus or minus twice sigma divided by the square root of n, as in the small sketch below.
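A minimal R sketch of this interval, using a hypothetical sample and the sample standard deviation in place of sigma (both assumptions made only for illustration):

x <- rnorm(50, mean = 100, sd = 15)                      # hypothetical sample
xbar <- mean(x); s <- sd(x); n <- length(x)
xbar + c(-2, 2) * s / sqrt(n)                            # approximate 95% interval using the "2 standard errors" rule
xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)    # exact t-based 95% interval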
So, using this particular formula the range can be computed, and the population mean will be straddled by this range 95 percent of the time; for a 99 percent confidence interval it will be straddled 99 percent of the time, and similarly for 90 percent. Now, at this point we can discuss two important concepts related to errors: type 1
and type 2 errors. From the classification table displayed in this particular slide, you can see when a type 1 error and a type 2 error can actually occur. A type 1 error occurs if the null hypothesis is rejected while the null hypothesis is actually true, that is, the null hypothesis is true but it has been rejected by our statistical or hypothesis test.
A type 2 error occurs when the null hypothesis is false, but using our hypothesis test we actually accept the null hypothesis. These are the two situations in which type 1 and type 2 errors happen; the other two cells are the correct outcomes, when our test accepts a null hypothesis that is true or rejects a null hypothesis that is false. Now, how do we manage the problems related to type 1 and type 2 errors? For the type 1 error we can look at the significance level, which is denoted by alpha.
So, we can manage this particular error using an appropriate significance level: if we reduce alpha, there is less chance of committing a type 1 error. That is why many researchers prefer a 99 percent confidence interval over 95 percent, because they do not want to commit a type 1 error, and so they reduce alpha depending on their acceptance level. In some research domains even a 90 percent confidence interval is accepted, but in that case there is a higher risk of committing a type 1 error.
Now, the type 2 error is generally denoted by beta, and it can generally be managed using an appropriate sample size: as you keep increasing your sample size, some sort of saturation is reached and there is less chance of committing a type 2 error. Another point related to hypothesis testing is the power of a test. What is the power of a test? The power of a test is about correctly rejecting a false null hypothesis.
If you go back to the table we had and look at the row where the null hypothesis is rejected, the problematic case is when the null hypothesis should actually be rejected but is not, which is the type 2 error. So, the power of a test is computed as 1 minus beta, because whenever there is a type 2 error it reduces the test's ability to correctly reject the null hypothesis. Therefore 1 minus beta is called the power of a test.
The power of a test is also used to determine the sample size, because, as we talked about, one way to manage the type 2 error is selecting an appropriate sample size. So, if we want to find out what the appropriate sample size would be, the power of a test is a good indicator, as in the sketch below.
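For instance, base R's power.t.test() can solve for the per-group sample size given a target power; the effect size, standard deviation and power below are assumed values chosen only for illustration:

# sample size per group to detect a mean difference of 5 (sd = 5)
# at a 5% significance level with 80% power
power.t.test(delta = 5, sd = 5, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")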
Now, the next important statistical test is ANOVA. Till now we have mainly been talking about two populations; what happens to hypothesis testing if you are dealing with more than two populations? In that case ANOVA is used: it is used for more than two populations or groups instead of performing multiple t-tests. If we have more than two populations, one alternative is to perform multiple pairwise t-tests for the different groupings. That is one solution, but it can be cumbersome, the interpretation becomes cognitively very difficult, and the probability of committing a type 1 error also increases, because for every pair you have to do a separate test and interpretation, the tests influence one another, and it becomes very difficult to interpret the results and to manage the overall probability of committing a type 1 error.
Therefore, ANOVA is preferred when more than two populations are involved. Another important point is that ANOVA is in a way a generalization of the hypothesis test used for the difference of two group means. If we were to perform pairwise t-tests for n groups, we would actually have to do n(n - 1)/2 tests to reach any conclusion.
Now, the typical null and alternative hypotheses in ANOVA are as follows: under the null hypothesis we assume that all the population means are equal, which is quite similar to what we do in the difference of means test, and under the alternative hypothesis at least one pair of population means is not equal. The assumptions are also quite similar: each population is normally distributed with the same variance. The testing is essentially about whether the groups are tightly clustered around their own means or spread out across the populations; this is what we are trying to find out.
(Refer Slide Time: 28:18)
Now, there are two important statistics that we compute in the ANOVA process. One is the between-groups mean sum of squares, SB square, which is an estimate of the between-groups variance. With k being the number of groups, ni the number of observations in the ith group, xi bar the mean of the ith group and x bar the overall mean, it is computed as

SB square = (1 / (k - 1)) * sum over i = 1 to k of ni * (xi bar - x bar) square.

The other estimate is the within-group mean sum of squares, SW square, which is an estimate of the within-group variance; here we are trying to capture the homogeneity within a group relative to its heterogeneity with respect to the other groups. With n being the total number of observations, it is computed as

SW square = (1 / (n - k)) * sum over i = 1 to k, sum over j = 1 to ni of (xij - xi bar) square.
(Refer Slide Time: 29:32)
So, once these two computations are done, that is, the between-groups mean sum of squares and the within-group mean sum of squares have been computed, we compare them: if SB square is large relative to SW square, we can say that some of the population means are different, and the null hypothesis would be rejected. This is done using the F-test statistic, F = SB square / SW square, and this F statistic is used to decide whether the null hypothesis is accepted or rejected. So, let us go through a small example for ANOVA; let us open RStudio, where we have created some hypothetical data.
(Refer Slide Time: 30:39)
So, we are talking about ads. There are three options, AD1, AD2 or NOAD at all, and the purchase amount associated with each ad situation. For these three situations we create samples randomly, again using the rnorm function. Let us first create the variable ads: you will see a character vector of ads has been created with a sample size of 100. Then we generate the purchase values: for AD1 the mean is 500, for AD2 the mean is 600 with standard deviation 80, and for the NOAD case the mean is 200, with the standard deviation kept the same across the three situations, because equal variance is part of the assumption. Let us execute this code; you would see that purchase has been created. Now we can create a data frame of these two variables, ads and purchase. Looking at the first six observations of the data, you can see records such as NOAD with their corresponding purchase values.
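The exact code is not shown in the transcript; here is a rough sketch of one way to generate such data, with the seed and the precise construction being assumptions:

set.seed(12345)                                            # assumed seed
ads <- sample(c("AD1", "AD2", "NOAD"), 100, replace = TRUE)
purchase <- ifelse(ads == "AD1", rnorm(100, mean = 500, sd = 80),
            ifelse(ads == "AD2", rnorm(100, mean = 600, sd = 80),
                                 rnorm(100, mean = 200, sd = 80)))
df2 <- data.frame(ads = factor(ads), purchase = purchase)  # data frame name df2 as used later
head(df2)        # first six records
summary(df2)     # split of AD1/AD2/NOAD and summary of purchase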
(Refer Slide Time: 32:14)
Now, if you look at the summary, you can see how AD1, AD2 and NOAD are distributed: 27, 32 and 41, which is the split of the 100 observations. Similarly, if you are interested in the statistics for AD1, especially the purchase part, you can see that the mean purchase is about 493 and the minimum and maximum values are also there; we can do the same exercise for AD2 and for the NOAD situation.
Now, we have the aov() function to perform the ANOVA test. Its first argument is a formula; in this case purchase is being tested with respect to ads, and the data argument is df2, the data frame we have just created. Let us run this and look at the results.
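A minimal sketch of the call, assuming the data frame df2 built above:

anova_fit <- aov(purchase ~ ads, data = df2)   # one-way ANOVA of purchase by ad type
summary(anova_fit)                             # F value, p-value, sums of squares, mean squares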
You can see in the output that the F value and the probability value are there; because the F value is much greater than 1 and the corresponding p-value is very small, the null hypothesis is rejected. You would also see the other numbers, the sums of squares and the mean squares. So, this is how we can perform an ANOVA test. With ANOVA, we have covered the basic statistics using R for this particular course.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 06
Partitioning
Welcome to the course Business Analytics and Data Mining Modelling Using R. We are into our second specific topic, the data mining process. Last time we stopped our discussion at partitioning, so let us pick up from there. In the data mining process, another specific step is partitioning. As we discussed, in statistical modelling we generally use the same sample to build the model and then check its validity. In the data mining process we instead do partitioning, wherein we split the data set into two, three or even more partitions; the largest partition is generally used for model building, and the other partitions are used either for fine tuning the selected model or for model evaluation.
Now, another important point we need to understand is why, among several candidate models, the one we select performs best. It could be due to two main reasons. The first is the acceptable one, the one we want: genuine superiority of the final model over the other candidate models. It might be that the selected model genuinely gives superior performance in comparison to the other candidates. The second reason is the problematic one, which we want to minimise or remove: a chance occurrence leading to a better match between the final model and the data. It might happen that you have three or four candidate models m1, m2, m3, m4, and due to some chance occurrence model m3 happens to match the data better and therefore shows superior performance. We need to manage this situation, and partitioning is one way to do that. Data-driven techniques in particular lack structure; they do not impose any specific structure on the data during modelling, so they might end up producing this latter situation of chance occurrence. Because their main focus is on the data, that can lead to overfitting.
Now, as we said, partitioning the data set into two or three parts can address this problem. Typically three partitions are created: the training set, the validation set and the test set. These partitions are created following predetermined proportions, typically the 60-20-20 rule: 60 percent of the observations go into the training set, 20 percent go into the validation set and the remaining 20 percent go into the test set. That is the typical proportion, but you can change it; it just has to be predetermined, and this predetermined proportion is then used to create the partitions. The records, however, are randomly assigned to the different partitions: the proportion is fixed in advance, but the assignment of records is random. Sometimes the situation might require that records be assigned based on some relevant variable; in those cases that variable decides which record goes into which partition.
Now, let us discuss the role of each of these partitions. The first one is the training partition. This is usually the largest partition, and this same sample is used to build the candidate models: the different models you can think of to tackle your classification or prediction task. The second is the validation partition, which is used to evaluate the candidate models; sometimes we also use this partition to fine tune and improve our model. In those situations, when the validation partition is used for fine tuning or improving the model, it also becomes part of model building.
Therefore, it might create a bias in the model evaluation if this particular partition is also used for evaluation purposes; in those cases a test partition becomes mandatory to evaluate the final model, and that is the role of the test partition. At this point we need to discuss the different types of data sets. The partitioning discussion we just had is mainly applicable to cross sectional data, so let us look at the different types of data sets that are generally used in statistical or data mining modelling. The first one is cross sectional data.
Cross sectional data are observations on variables related to many subjects. The variables could relate to individuals, firms, industries, countries or regions; there can be many variables and many subjects. They are all observed at the same point of time, so a kind of snapshot is taken.
(Refer Slide Time: 06:22)
So, let us picture our data set as a cylindrical pipe. The observations on variables for the many subjects are taken at one cross section: all the observations on the different variables v1, v2, v3, v4 for the different subjects are taken at the same point. This is called cross sectional data. Generally, when we do a cross sectional analysis, the unit of analysis is also specified. The variables might relate to different kinds of subjects, individuals or firms, but there has to be a unit of analysis, because each of the observations we record is going to represent a distinct subject. For example, suppose we have a sample with the variables along the columns and the observations along the rows; each observation then represents a distinct subject, so if the unit of analysis is the individual, each observation will represent an individual.
If the unit of analysis is the firm, then each observation will represent a firm, even though the variables v1, v2, v3, v4 could be about different aspects. The main idea when we do cross sectional analysis is to compare differences among the subjects: if our unit of analysis is the individual, we are trying to compare differences arising among those individuals, and if our unit of analysis is the firm, we are trying to study and compare differences among firms.
Now, the second type of data is time series data. In time series data we have observations on a variable related to one subject: we do not deal with many subjects, there is just one subject, and observations on a variable related to that subject are taken. The variable is observed over successive, equally spaced points in time, so each observation represents a distinct time period. In a time series we have the same subject, say a variable related to subject one, and at equally spaced times the observations are made. The unit of time could differ; it could be days, weeks, months or years, and based on that, the observations are recorded at equally spaced points in time.
The main idea in time series analysis is to examine changes in the subject over time. Another type of data set that you may come across is panel data, sometimes also called longitudinal data. Panel data combines features of cross sectional data and time series data: observations on variables related to the same subjects are taken over successive, equally spaced points in time.
The main idea in panel data analysis is to compare differences among the subjects and to examine changes in the subjects over time. In a way, panel data can be understood as cross sections with a time order. If we go back to our cylindrical tube picture of a data set, we take one cross section, then another, and another; all these cross sections at successive, equally spaced times can be used for panel data analysis. You have the same variables, say v1 to v4, in all cross sections, and the cross sections have a definite time order.
So, we are trying to study the same subjects, therefore the same variables, across different cross sections. Another type of data set is called pooled cross sectional data: observations on variables related to subjects at different time periods, where the subjects need not be the same. They could be different, but the observations are made at different time periods. What is the main idea? The main idea is to examine the impact on subjects of environmental changes caused by some policy intervention or some event.
For example, the population census is one example of pooled cross sectional data. In India the population census generally happens every 10 years, say the 2001 census and the 2011 census. The subjects might change between censuses, but the census happens at different time periods, and with the passage of time we look at different characteristics, features or variables of the population.
So, pooled cross sectional data can be understood as independent cross sections from different time periods. For pooled cross sectional data, one cross section could be the 2001 data on subjects and another the 2011 data on subjects; the subjects need not be the same, and the two cross sections are independent. In panel data the cross sections have a time order, whereas in pooled cross sectional data they do not; these cross sections are independent. Let us now discuss the next phase of the data mining process, that is, model building.
(Refer Slide Time: 13:47)
We will go through this particular phase using an example with linear regression, so let us open RStudio. In the previous lecture we talked about overfitting; let us revisit the same concept through an example. This is again hypothetical data, about predicting future sales from spending on marketing promotions. I am going to create some hypothetical data: you can see this code creates a data frame with promotions as one variable, with different numbers in it.
(Refer Slide Time: 14:44)
These numbers are supposed to be in rupees crores, and then we have sales, again in crores. So, let us create this hypothetical data; you can see the promotions and sales numbers, and we have the same number of observations on these two variables. Let us look at the summary: these are some of the statistics on the two variables promotions and sales. Now let us plot, with promotions on the x axis and sales on the y axis; the x axis label and y axis label are given there, along with the limits on the x and y axes. As we discussed in the previous lecture, the limits can be set using the results from the summary command: the minimum of promotions is 2 and the maximum is 9, so most of the values will lie in the range 0 to 10, which is why the x limit is set as 0 to 10, and similarly for the y limit.
So, we can plot this. In the plot you might see a particular symbol that was supposed to be the rupee symbol; maybe in this system that symbol is not supported, so it is coming out garbled, but let us look at the data. This is sales versus promotions, as sketched below.
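A rough sketch of the data creation and the plot call being described; the actual sales values are not given in the transcript, so the numbers below are assumed purely for illustration:

df <- data.frame(promotions = c(2, 3, 4, 5, 6, 7, 8, 9),    # Rs. crores (assumed values)
                 sales      = c(3, 4, 6, 5, 7, 8, 8, 9))    # Rs. crores (assumed values)
summary(df)
plot(df$promotions, df$sales,
     xlab = "Promotions (Rs. crores)", ylab = "Sales (Rs. crores)",
     xlim = c(0, 10), ylim = c(0, 10))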
Looking at this data we can try to fit different models that help us understand the relationship between sales and promotions. But because this is a business analytics course, our idea is not just to understand the relationship, but to use it in such a way that we can improve our predictions.
(Refer Slide Time: 16:41)
So, suppose we try to fit some complex model, say a cubic curve; you can see the cubic curve that has been fitted in the plot. This is a complex function fitted to the data, and it leads to a perfect, 100 percent match, which in a way is overfitting the model. It is hard to imagine that when you increase your spending on promotions from 4 crores to 5 crores the sales actually drop; that kind of relationship is difficult to justify. This is just an example of how a complex model fitted to the data can lead to overfitting. A better model could be the straight line: most of the observations lie close to this line, and it could be the better model for these observations and this sample.
Now, let us go through our model building example. We are going to use a hypothetical used car data set; it is loosely based on the many posts related to used car sales that are made online, but mainly it is hypothetical data. So, let us load the required library and this particular file.
(Refer Slide Time: 18:45)
From this used car Excel file you would see that a data frame df has been created with 79 observations, and right now it shows 12 variables. But two of these are columns that were deleted in Excel and have still been picked up by R as variables, even though there is no data in them. For that we have a line of code that removes these deleted columns; after running it you can see in the environment section that we have 79 observations with 10 variables, the two removed variables being the deleted Excel columns.
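The transcript does not show the actual import code; one way to do what is described, assuming the readxl package and a file named UsedCars.xlsx (both assumptions), is:

library(readxl)                            # assumed package for reading the Excel file
df <- read_excel("UsedCars.xlsx")          # file name assumed
dim(df)                                    # 79 observations, 12 columns including empty ones
df <- df[, colSums(is.na(df)) < nrow(df)]  # drop columns containing no data (the deleted Excel columns)
dim(df)                                    # 79 observations, 10 columns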
(Refer Slide Time: 19:41)
Let us look at the variable names. We have the brand of the car, the model of the car and the manufacturing year; then the fuel type, whether it is petrol, diesel or CNG; then SR price, which is the showroom price of the car; then the kilometres accumulated; then the price, which is the offered price for the used car; then a variable on transmission, whether it is automatic or manual; then a variable on owners, the number of owners the car has had over its lifetime; and finally airbags, the number of airbags in the car.
You would see that many of these variables are directly relevant for predicting the offered price of a used car. The task we are trying to perform here is prediction of the offered price for used cars using these variables: the accumulated kilometres, the age of the car, which can be computed from the manufacturing year, the transmission type, the number of owners, and so on. These are some of the variables which could be relevant for our prediction task related to the offered price.
(Refer Slide Time: 21:25)
So, let us look at the first 9 records. You can see different used cars from different brands, Hyundai, Mahindra, Maruti Suzuki, Honda; the model names are also available, then the manufacturing year, then the fuel type. Let us go back to our Excel data set and look at some of the dummy coding that has been done there.
(Refer Slide Time: 22:00)
You would see that fuel type has been coded as 0 for petrol, 1 for diesel and 2 for CNG; for transmission, 0 means manual and 1 means automatic. Let us go back.
Again, during our discussion on outliers we talked about how sorting can help in outlier detection: if there is some value of a variable, some measurement, which looks out of place and does not seem to be real, it can be found by sorting. So, let us do that. We have picked three important variables, kilometres, showroom price and manufacturing year. In this case everything looks fine, but sometimes some value might look out of place; it could be due to a typing error or something else.
(Refer Slide Time: 23:16)
Such a value can be easily identified and then handled. As we discussed, age could be another important variable for predicting the value of a used car, so let us compute it: if the current year is, say, 2017, then the age is the manufacturing year subtracted from it. We compute the age and add it to the existing data frame, and since we are no longer interested in the first three variables, the brand name, the model name and the manufacturing year, we can create another data frame by removing these three columns. The variables of interest are then fuel type, SR price (showroom price), kilometres, price, transmission, owners, airbags and age.
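A rough sketch of these two steps; the column name Mfg_Year used below is an assumption, since the transcript does not show the actual code:

df$Age <- 2017 - df$Mfg_Year   # age of the car; column name Mfg_Year assumed
df1 <- df[, -(1:3)]            # drop brand, model and manufacturing year
names(df1)                     # fuel type, SR price, KM, price, transmission, owners, airbags, age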
Now, another important concept related to model building is seeding. Many analysts prefer to set a seed because it provides some control over the randomisation: if you want to use the same partitions in a second or third run, seeding helps you duplicate the same random partitions. In R we have the set.seed() function for this; in this case we have given set.seed the value 12345, though it could be any number you like. Next time you want to create the same partitions, this seed helps you duplicate them. So, let us execute this.
Now, let us move to partitioning. Here we use the sample() function available in R. This function can be used to randomly draw observations and create an index, which in turn is used to create the different partitions. In this particular case we want just two partitions of 50 percent each, a training partition and a test partition. In the first argument we give the range of values we want to sample from, which runs from one up to the sample size, computed with nrow() applied to the data frame. The second argument is 0.5 multiplied by that size, meaning we want 50 percent of the observations in one sample, with the rest going to the other. The third argument, replace, is set to FALSE, so the sampling is without replacement: because we are partitioning, we do not want the same observation to appear again in the validation or test partition. We want some observations randomly picked and assigned to the training partition, and the other observations assigned to the remaining partitions, depending on how many partitions there are.
So, let us execute this particular command, and a new object will appear in the environment section. You would see that an integer vector has been created; these values are indices of different observations, and these indices can be used to assign observations to the different partitions. For example, we can use this index on the data frame to select observations: the integer vector (the partition index) has 39 elements, so the training partition ends up with 39 observations, and you would see that the training data frame has been created with 39 observations and 8 variables. The remaining observations go to the test partition, and you would see that 40 observations have been assigned to it.
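A sketch of the seeding and partitioning steps just described; the object names partidx, df1train and df1test are assumptions based on the transcript:

set.seed(12345)                                          # fixes the random draw so partitions can be reproduced
partidx <- sample(1:nrow(df1), 0.5 * nrow(df1), replace = FALSE)
df1train <- df1[partidx, ]    # training partition, about 50% of the rows (39 of 79)
df1test  <- df1[-partidx, ]   # remaining rows form the test partition (40 of 79)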
Now, once these partitions have been created, we can do our modelling. lm() is the function used for linear regression in R, and its first argument is the formula. In the formula you first pick your outcome variable, the dependent variable you want to predict; here it is the price, that is, the offered price of the car. Then we use the tilde (~), which is how the formula is written, and a dot, which means that all the other variables present in the data frame are picked up as the input, or independent, variables and used as predictors for model building.
Note that generally the dollar notation is used with a data frame, but in this case, because of the way lm() is implemented, you mention the name of the data frame in a separate data argument and only the variable names in the formula; the rest is taken care of within the lm() function. So, let us execute this line and build our linear regression model, as sketched below.
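A minimal sketch of the call, assuming the outcome column is named Price and the training data frame is df1train:

mod <- lm(Price ~ ., data = df1train)   # regress Price on all other variables in the training partition
summary(mod)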
You would see again, in the values part of the environment section, that an object mod has been created, a list of 12 components.
(Refer Slide Time: 29:23)
Let us look at the summary. In the summary you get the results of your regression analysis: you can see the formula, then some statistics related to the residuals, such as the minimum, maximum and median values. Now let us focus on the coefficients part. You would see the different predictors, fuel type, SR price, KM, transmission and so on, with their estimates, and you would also see the p-values; in the results the significance codes are also given.
For example, three stars are used for the highest level of significance (p-value below 0.001), two stars for p below 0.01, one star for p below 0.05, and a dot for p below 0.1; these are the notations. In this output we see predictors with two stars and one star, namely the constant term, the fuel type and the showroom price, and a dot against age as well, so these four terms. If we look at the main variables, excluding the constant term, then fuel type, SR price and the age of the car are the main variables helping us determine the offered price.
Other statistics related to this regression model are also given; for example, the multiple R-squared and adjusted R-squared, which come to close to 61 percent, which is good enough. You would also see from the F-statistic that this particular model is significant, so we can go ahead and interpret the results.
(Refer Slide Time: 31:28)
Now let us look at how this model performs on the other partition. The lm() function returns several components, one of them being the fitted values, and we can compute the residuals using these fitted values; more discussion on regression analysis will come in a later lecture, since in this example we are just going through the data mining modelling process. So, let us compute the residuals and look at the actual value, the predicted value and the error: for each record you can see the actual value, the predicted value and the error, which is the difference between the two.
Now, a similar thing can be done on the test partition. We have the predict() function, which can help us score the test partition: in predict() we first pass the model, mod in this case, and then the test partition as the second argument, so that we can do the scoring of the test partition. Let us execute this line; a sketch of these steps follows.
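A rough sketch of the residual computation and the scoring of the test partition, assuming the outcome column Price and the objects created above:

train_resid <- df1train$Price - mod$fitted.values          # residuals on the training partition
head(data.frame(actual = df1train$Price,
                predicted = mod$fitted.values,
                error = train_resid))
test_pred  <- predict(mod, df1test)                        # score the test partition
test_resid <- df1test$Price - test_pred                    # residuals on the test partition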
We get the predicted numbers; let us again compute the residuals for the test partition and look at them, which gives a similar kind of output. Now we need another library to access some of the metrics for evaluating the performance of the model: rminer is the library we need, so let us install it.
(Refer Slide Time: 34:43)
So, let us load the rminer library so that we have access to some of the metrics for our model evaluation.
(Refer Slide Time: 36:16)
Now, we have the mmetric() function in rminer that can be used to compute the metrics SSE, RMSE and ME; more discussion on these metrics will come in a later lecture. The first argument is the actual value, the price, the second argument is the fitted (predicted) values, and then we specify the metrics we want. Let us execute this code: these are the numbers, the SSE, RMSE and ME values, and similarly we can compute them for the test partition and see those numbers as well.
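A sketch of these two calls, assuming the rminer package's mmetric() function and the objects created above:

library(rminer)
mmetric(df1train$Price, mod$fitted.values, metric = c("SSE", "RMSE", "ME"))  # training partition
mmetric(df1test$Price,  test_pred,         metric = c("SSE", "RMSE", "ME"))  # test partition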
So, in this way the numbers for the training partition and the test partition can be compared, and the performance of the model can be assessed. We will discuss this in more detail when we come to the regression analysis lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 07
Visualization Techniques- Part I
Welcome to the course Business Analytics and Data Mining Modelling Using R. We have completed our first module, which was a general overview of data mining. Now we are moving into our second module, which is about data exploration and conditioning, and data preparation. The first lecture is going to be on visualization techniques, so let us start. You might have come across the popular proverb that a picture is worth a thousand words.
It is also a well known fact that the visual processing capability of the human brain is much higher than its numerical or mathematical processing. That is the main underlying importance of visualization techniques in any modelling process, including data mining and statistical modelling. So, if we, as domain experts, analysts or data scientists, get to look at the data, to see some graphs and plots, we are able to exploit our domain knowledge and expertise in a much improved fashion. That being the basis, we are going to start our discussion on visualization techniques.
Generally, visualization techniques have their primary role in the data mining process during the data exploration and conditioning phase; the different phases we have already talked about in the previous lectures on the data mining process. This primary role can be summed up with the following points. First, we try to understand the structure of the available data; that is one goal that is achieved using visualization techniques.
The second is identifying gaps or erroneous values: sometimes a few rows could be duplicates, some of the values might be missing, and some of the values might look out of place or erroneous; some cells might not have any values at all. Identifying those gaps is also part of what visualization techniques do. Then there is identifying outliers: some of the values may lie far away from the mean or median values where the major chunk of the values lies. Identifying those values, and determining whether they are valid points or erroneous values, is needed before we can move ahead with further analysis.
Next is finding patterns. As we said, the visual processing of the human brain is much better, so if we get to see the data, the plots and the graphs, we can easily spot patterns, which in turn can be helpful in identifying appropriate data mining or statistical techniques and then using them in our modelling process. These are some of the roles where visualization techniques can be useful.
Building on those points, a few other things are missing values, which we already talked about, and identifying duplicate rows and columns. That is also important: sometimes some rows or columns could be duplicates, and we would like to avoid that, because many statistical techniques assume that cases are independent, in which case duplicate rows could be a problem; similarly, duplicate columns would be a problem in statistical modelling techniques where multicollinearity could be an issue. These specific terms will be discussed in more detail when we come to statistical techniques like regression and logistic regression. Another important role played by visualization techniques is in variable selection, transformation and derivation. Sometimes, when we apply visualization techniques to a data set, we are able to identify variables which could be useful for the data mining task, variables which could be transformed to suit our goal in a much better fashion, and we also get some ideas and directions about deriving new variables.
All these kinds of things are possible through visualization techniques. Some examples are given here. Choosing appropriate bin sizes for converting a continuous variable into a categorical variable is something we get an idea about when we look at the data through some of the graphs that we are going to cover in this lecture.
Combining categories: sometimes a categorical variable might have many categories, not all of which are useful for the specific task at hand. It might then be required, or even mandated by the data, that some of the groups be combined, so that the number of categories is reduced and only the meaningful groups are kept for the task, mainly classification. Another important role is assessing the usefulness of variables and metrics. While exploring the data using visualization techniques, we are also able to understand which variables are important and which metrics are going to be used for performance evaluation, etcetera.
This data exploration and conditioning phase is considered a required first step before formal analysis. By formal analysis we mean data mining and statistical techniques such as regression, trees, artificial neural networks and discriminant analysis. Before we go ahead with those techniques, this particular step is more or less mandatory: we apply some of the visualization techniques to the data and do some preliminary processing and preliminary analysis.
Now, let us understand the role of visual analysis a bit more. It can be considered free-form data exploration. Regression analysis is a very structured kind of analysis, but in visual analysis we are mainly exploring, and that too in a free form. We try many plots and graphs, which we are going to cover later in the lecture, and try to learn something about the data that is going to help us in our further analysis.
(Refer Slide Time: 07:23)
As mentioned in the second point, the main idea is to support the data mining goal and the subsequent formal analysis. Techniques in visual analysis range from basic plots, which we will cover, such as line graphs, bar charts and scatter plots, to interactive visualizations. Interactive visualizations cover the multivariate nature of data sets; as we will discuss later, the kind of modelling required is generally multivariate in nature.
Therefore, some of the advanced plots and interactive visualizations can be really helpful for formal analysis. The usage of visualization techniques also depends on the kind of task we have: some charts and plots are more suitable for classification, some for prediction, and some for clustering. So, the data mining task also drives the kind of visualization techniques that we are going to apply.
Different data mining techniques, such as CART (classification and regression trees) and HAC (hierarchical agglomerative clustering), also have their own specific visualization techniques, their own charts and graphs. That is also important to understand here: we are not going to apply everything we learn to every technique that we follow in the subsequent formal analysis. It is going to be task specific, classification, prediction or clustering, and sometimes specific to a particular technique.
Now, let us start our discussion with the basic charts. As I said, we are going to discuss three important charts: the first being line charts, the second bar charts, and the third scatter plots. So, let us have a basic discussion on these charts.
(Refer Slide Time: 10:17)
Generally, these basic charts display one or two variables at a time. They are two-dimensional graphics: we pick one or two variables, one going on the x axis and the other on the y axis. The main idea is to understand the structure of the data, the variable types and the missing values in the data set. These are the situations where basic charts are going to be useful.
For supervised learning methods, the main focus of basic charts is on the outcome variable, which is typically plotted on the y axis. Basic charts can be used for unsupervised learning methods as well; we will see that through examples using R. So, let us move to our next discussion, on line charts.
(Refer Slide Time: 11:38)
Line charts are mainly used to display time series data. We try to see the overall level of the data and the changes that happen over time. Let us learn line charts through an example; let us open RStudio. First, load the library xlsx. The data set that we are going to use is on bicycle ridership; let us understand this particular data set.
If you look at the actual data in this Excel file, you would see that it starts from January 2004 and goes up to March 2017, giving 159 data points. The second column, the second variable, is riders, which is the number of individuals riding bicycles. This is meant to reflect bicycle ridership on the IIT Roorkee campus, but it is mainly hypothetical data.
This hypothetical data has been created for demonstration purposes. Let us import this particular data set, as we have been doing in previous lectures. You can see in the environment section that the data set has been imported, with 159 observations and 2 variables. If you want to view the data in the RStudio environment, you can see that month-year is the first variable and riders is the second; since this is time series data, the data is mainly the number of riders in a particular month.
The second line of code handles something we discussed in the previous lecture: columns that were deleted in the Excel file may still be picked up in the R environment, and we want to get rid of them. This particular apply-based line helps with that. Now, let us have a look at the first 6 observations; for the different months, Jan, Feb, March and so on, we can see the number of riders.
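A minimal sketch of the kind of code being narrated here, assuming the workbook is called bicycle_ridership.xlsx (a hypothetical name) with the data on the first sheet; the apply-based line is one way to drop the all-NA columns that deleted Excel columns leave behind:

    library(xlsx)                                  # read .xlsx files
    df <- read.xlsx("bicycle_ridership.xlsx", 1)   # file name and sheet index are assumptions
    df <- df[, !apply(is.na(df), 2, all)]          # drop columns that are entirely NA
    head(df)                                       # first 6 observations: month-year and riders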
Before generating a line graph, we need to create a time series vector. ts is the function for this; if you are interested in understanding more about a particular function in R, you can do so using the help section. Searching for ts shows that it is the function for creating time series objects, and you get detailed usage of its different arguments there.
Now, let us go back to our code. In the ts function, the first argument is the data: we pass the riders variable from the data frame, because we want to create a time series object out of the number of riders for every month. The start is given as 2004 and 1, 1 being January, the first month of the year; the end is 2017 and 3, March being the third month; and the frequency is 12, mainly because the data is monthly.
In a year we have 12 data points, so the frequency has been specified as 12. Let us create this time series vector; once it has been created you can see it in the environment as tsv, a time series with 159 values. Now, plotting this time series gives us our line graph. plot is the command we have used previously as well; in this case we pass just one variable, tsv, with riders on the y axis, as you can see from the y label, and the x axis used for the time scale.
In this particular code, the time scale is determined by the plot function with its default settings. You would also see another argument, las, which controls the style of the axis labels, that is, how the axis labels are displayed. More information on this argument is available in the help: you can type plot in the help section, and some of the arguments are documented under the par command, which sets the graphical parameters. There you will find las, with the different styles of axis labels described. I have set las = 2, which means I want the labels to always be perpendicular to the axis.
So, the labelling of the tick marks is always going to be perpendicular to the axis. Let us see how it is displayed.
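A minimal sketch of the time series creation and the first line plot, assuming the riders column in the imported data frame is named Riders (the column name is an assumption):

    tsv <- ts(df$Riders, start = c(2004, 1), end = c(2017, 3), frequency = 12)  # monthly series
    plot(tsv, ylab = "Riders", las = 2)   # las = 2 makes axis labels perpendicular to the axes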
(Refer Slide Time: 18:30)
This is the plot. On the x axis, years are displayed; the tick marks for the different years have been picked by the plotting function, while riders was defined by us as part of the time series vector we created. You would also see that the labels for the axes are perpendicular to the axes themselves.
Looking at the plot, you can see it is a line graph: the ridership values for the different points in time have been connected, and that represents a line graph. This helps us understand the main level of the data, the level of values being taken here. A line passing through these points, somewhere in between, can be considered the main level of the series, and you can also see the changes over time. This particular graph looks polynomial; the changes look polynomial in nature.
(Refer Slide Time: 19:57)
Now, if you want to improve this particular graph, that is also possible. The first thing to do is to create a sequence that can be used for the labels and tick marks on the x axis, which is the time axis. What I am doing here is creating a vector of dates using the seq function.
The seq function will create, in this case, a date sequence starting from 2004 and going up to 2017, with equally spaced points two years apart. Let us create it. Another function that can be used for formatting such a vector is format; for example, the sequence I have just created and stored in at1 can be formatted. The format function lets you retain the particular information you want, in your customized format. In this case I am using %b and %Y, which stand for month and year.
I am leaving out the day information and keeping only the month and year using this format function. Let us execute this code, and you will see the labels vector being created. In the environment section you will see labels like Jan 2004, Jan 2006, Jan 2008, and so on. If we want only the year part, we can again use the format function and extract only the year information from the at1 vector; that is also possible, so let us do that.
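A minimal sketch of the label preparation being described, using the variable names at1, labels1 and at2 mentioned in the narration:

    at1 <- seq(as.Date("2004-01-01"), as.Date("2017-01-01"), by = "2 years")  # equally spaced dates
    labels1 <- format(at1, "%b %Y")          # e.g. "Jan 2004", "Jan 2006", ...
    at2 <- as.numeric(format(at1, "%Y"))     # year part only, usable as numeric tick positions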
Another important aspect of creating plots in R is the margins of the plot. The par command (for graphical parameters) has an argument for margins, mar. This gives the margins on the four sides of the plot, the first being the bottom, then the left side, then the top, and then the right side. The margin on all four sides can be defined using this par function.
Let us execute this line. You would see the default margin setting, 5.1, 4.1, 4.1, 2.1; these numbers represent the number of lines of margin on each side when we create a plot. We want to change this because of the way we are labelling the axis: with the labels perpendicular to the axis a lot of space is taken, so a larger margin is required.
So, we need to change the margins. The first margin is for the bottom side, and you would see it is the highest, 8, because we want more margin there; then 4 on the left side, 4 on the top, and just 2 on the right, with 0.1 added to each. Once the margins have been set, we can go ahead and generate our new plot. This new plot is of the same time series vector we created, but the graphic is going to look slightly different, and much better.
In this case we want to create a new plot with custom axes. For that, you have to use the parameters xaxt and yaxt in the plot command. I have assigned them the value "n", which means the x axis and y axis will not be plotted; they disappear, and I have also kept the labels empty. So, there are no labels; just the line graph is displayed without any x axis, y axis, or axis labels, and you would see a box with the graph inside it displayed here.
Now we have to create the axes. axis is the function available in R for this. The first argument 1 means the x axis, and in the next line, 2 means the y axis; this axis function can be used to create both. You would see that at specifies the tick-mark locations, for which we are using at2, and the labels displayed at those tick marks come from labels1. The styling of these labels is again las = 2, so they are going to be perpendicular to the axis.
Let us do this, and you would see an x axis has been created which is slightly different from the previous plot. In the previous plot only the year was displayed; now the month and year are displayed, such as Jan 2004. The main reason for doing this is the kind of data we have: it is monthly ridership data, so it is more appropriate to generate a plot which also shows the month and not just the year. That month and year information is now depicted in the plot. Similarly, we can create the y axis.
We did not particularly want many changes to the y axis, so it has been displayed as is. Now, if you want to label the axes, mtext is another command that can be used to label the x axis and y axis. In mtext as well, the first argument selects the axis: side = 1 means the x axis, and to label the y axis it has to be side = 2. Then there is the text argument, where you specify the label of the axis, and a line argument, which tells how many lines below (or away from) the axis the label should be placed. Let us execute this code, and you would see that the label Month-Year has been created; similarly, for the y axis, Riders has been created.
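Putting the margin, axis and label steps together, a hedged sketch of the improved plot; the exact line values passed to mtext are illustrative choices:

    par(mar = c(8, 4, 4, 2) + 0.1)                           # bottom, left, top, right margins (in lines)
    plot(tsv, xaxt = "n", yaxt = "n", xlab = "", ylab = "")  # suppress the default axes and labels
    axis(1, at = at2, labels = labels1, las = 2)             # custom x axis with month-year labels
    axis(2, las = 2)                                         # y axis, labels perpendicular
    mtext(side = 1, text = "Month-Year", line = 6)           # x axis title, a few lines below the axis
    mtext(side = 2, text = "Riders", line = 3)               # y axis title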
(Refer Slide Time: 28:00)
In the zoomed version of this graph, you can see the Month-Year and Riders labels. This is a much better plot than the previous one, where we had only the year information on the time scale; now we have the month and year.
The next function we can learn about is graphics.off. If you call this function, all the plots disappear from your RStudio environment. There is another command, dev.off; if you run that, only the current plot, the one being displayed, is deleted or erased. You can achieve the same effect using the two buttons here in the Plots pane.
We want to get rid of all these plots, so let us run this, and you would see everything disappear. Now you can check the margins again: once all the devices are closed, the par settings, and many settings related to that function, are reset. We can check this for the margins, because we had changed them, and you can see that the default numbers are set again.
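A small sketch of the two device commands and the margin check (calling dev.off assumes a plot device is currently open):

    dev.off()        # erases only the current (active) plot
    graphics.off()   # closes all open plot devices
    par("mar")       # settings are reset: back to the default c(5.1, 4.1, 4.1, 2.1)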
This was our discussion on line charts; more discussion on line graphs and how to use them further will be covered in the time series forecasting lectures, where we will learn how a line chart can be used before the formal analysis of time series forecasting happens. Let us move to our next basic chart, the bar chart. The main usefulness of the bar chart is in comparing groups using a single statistic; we will see how that is done. Generally, the x axis is reserved for a categorical variable, and we try to understand more about that variable using the y axis. Let us do this through an example.
To understand bar charts and the other charts, we are going to use a data set we are already familiar with, the used cars data set. Let us load this file; you can see the data set has been loaded, with 79 observations and 11 variables. Let us run the command to get rid of deleted columns; in this case there were none.
Let us look at the data as well; let us go back to the original Excel file. We are already familiar with this data set, but let us have another look.
(Refer Slide Time: 31:21)
The data set is about used cars. We have information on each used car such as brand, model, manufacturing year, fuel type, showroom price, kilometres accumulated, the offered price, the transmission (whether it is manual or automatic), the number of owners, the number of airbags, and another variable, C_Price, which has been created manually.
This variable is 0 if the offered sale price is less than 4 lakhs and 1 otherwise. These are the variables in this data set. From the data set we can also see that one of the variables is the manufacturing year.
From the manufacturing year, taking the current year as 2017, we can calculate the age of the vehicle. We do this by subtracting the manufacturing year from 2017 to create age, and once this is done we can use the cbind command to add this column to the data frame.
Let us have a look at the data set again. You would see that age has been added there, and you would also see that some of the variables, like brand, model and manufacturing year, might not be required now, so we will get rid of them.
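A hedged sketch of the preparation steps just described; the file name used_cars.xlsx and the column names Mfg_Year, Brand and Model are assumptions:

    df1 <- read.xlsx("used_cars.xlsx", 1)        # 79 observations, 11 variables
    df1 <- df1[, !apply(is.na(df1), 2, all)]     # drop deleted (all-NA) columns, if any
    Age <- 2017 - df1$Mfg_Year                   # vehicle age from manufacturing year
    df1 <- cbind(df1, Age)                       # append the new column
    df1 <- df1[, !(names(df1) %in% c("Brand", "Model", "Mfg_Year"))]  # keep only the variables of interest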
(Refer Slide Time: 33:04)
Now, these are the variables that we are interested in. Let us stop here and continue our discussion in the next part.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 08
Visualization Techniques- Part II
Welcome to the course Business Analytics & Data Mining Modelling Using R. In the previous lecture we talked about visualization techniques, and we will continue our discussion from the point where we left off. We were discussing bar plots, and in the code we had imported the used cars data set; it is already available in the data frame df1.
We had also created the new variable age, using the manufacturing year that was available in the data set, and appended it to the data frame. Then we had eliminated some of the variables which were not useful for our purpose.
After that, we were left with this particular data frame. Now we have 9 variables, as you can see: fuel type, showroom price, kilometres, price, transmission, owners, airbags, C_Price and age.
Another useful function available in R is str, which we have discussed before as well. It helps us understand the structure of the data set and its different variables. You can see that fuel type is a factor variable with 3 levels: CNG, diesel and petrol.
Another important point I would like to highlight is that for factor or categorical variables created in R, the levels are ordered alphabetically. That is why CNG is displayed first. This has an impact on many functions available in R, where CNG would be taken as the default category.
When we start one of the formal analyses, say regression, we will come across some of these important peculiarities of R and of different R functions, specifically with respect to factors.
For the other variables, you can see that showroom price, kilometres, price, transmission, owners and airbags have all been read as numerical variables. But look closely at the variables transmission and C_Price.
They are actually categorical in nature, because transmission can take only 2 values: 0 for manual and 1 for automatic. This is important, so we need to convert this numeric variable into a factor variable.
Similarly, C_Price was created by us manually, as discussed in the previous lecture, where 1 was assigned for a price equal to or more than 4 lakhs and 0 for cars with a price less than 4 lakhs.
So, only 2 values are possible, 0 and 1. This variable is also categorical, therefore we need to convert C_Price into a factor variable as well. Let us do that. As we discussed in the supplementary lectures, as.factor is the function that can be used to coerce a numeric variable into a factor (categorical) variable. Let us execute this particular line.
First transmission, so we run this line, and then let us also run the one for C_Price.
These 2 variables have now been converted into factor variables. Let us run the str function again. You would see that transmission is now a factor with 2 levels, 0 and 1; the variable has been converted into a factor. You would also see that C_Price has been converted into a factor variable with 2 levels, 0 and 1.
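A minimal sketch of the conversion and the structure check, assuming the columns are named Transmission and C_Price:

    df1$Transmission <- as.factor(df1$Transmission)  # 0 = manual, 1 = automatic
    df1$C_Price <- as.factor(df1$C_Price)            # 0 = price < 4 lakhs, 1 otherwise
    str(df1)       # both now appear as factors with 2 levels, "0" and "1"
    summary(df1)   # counts per level for factors, descriptive statistics for numeric variables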
With this, most of the variables in our imported used cars data set are now stored in their suitable variable types.
Now let us look at the summary results. From the summary you can see that for a categorical variable such as fuel type, counts are displayed for the different categories: there are only 3 records having CNG as fuel type, 52 records having diesel, and 24 records having petrol.
Similarly, for transmission there are 63 records with transmission 0, that is manual, and 16 records with transmission 1, that is automatic.
(Refer Slide Time: 05:29)
Similarly, for C_Price, which is also now a factor variable, there are 48 cars with a price value less than 4 lakh rupees (category 0) and 31 used cars with a price value equal to or more than 4 lakh rupees (category 1).
The other variables are numerical in nature, so the usual descriptive statistics are displayed for them, for example mean, median and max, which we already understand. Let us move to some of the basic plots.
The 2 basic plots that we want to cover are the bar chart and the scatter plot. Let us first start with the scatter plot; let us go back to our slides and understand some of the key things about scatter plots.
Generally, scatter plots are mainly useful for prediction tasks. When we say prediction task, the way a scatter plot is useful is that the focus is on finding meaningful relationships between numerical variables.
A scatter plot is mainly for numerical variables; both axes, x and y, are used for numerical variables, and for a prediction task the focus is on identifying meaningful relationships from the plot. For an unsupervised learning task such as clustering, the focus is on finding information overlap.
We will learn more about why this is useful when we come to the unsupervised learning lectures and start our discussion on clustering, but for unsupervised learning tasks the focus is on finding information overlap between different variables, and a scatter plot can be useful for that.
As said, both axes of a scatter plot are used for numeric variables; in bar charts, the x axis is used for a categorical variable.
The x axis is reserved for the categorical variable, and because the variable on the x axis is categorical we can form different groups. There are going to be different categories, hence different groups, and a statistic can be displayed on the y axis, which helps us understand the differences between groups and compare them.
Let us go back to RStudio and start with the scatter plot.
(Refer Slide Time: 08:10)
The first plot we are going to generate is between kilometres (KM) and price. Before plotting, we need to specify limits on the x axis and y axis so that our plot looks clearer and we get a better picture of the data. Let us first look at the range of these variables; the range values of kilometres and price are going to help us in determining the limits.
You can see in the plot function that for kilometres the range is between 19 and 167, and the x limit I have specified is 18 to 180.
All the values are going to lie within this range. Similarly, on the y axis, the y limit I have specified is 1 to 75, and the price values range from 1.15 to 72, so these values are also going to lie within this range.
Kilometres is on the x axis and price is on the y axis; as discussed before, price is the outcome variable of interest in this data set, which is why it is displayed on the y axis. Labels for the x axis and y axis have been given appropriately using the xlab and ylab arguments. We can run this code, and the graph is displayed.
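A hedged sketch of the scatter plot call being described, assuming the columns are named KM and Price:

    range(df1$KM); range(df1$Price)   # check ranges before fixing the axis limits
    plot(df1$KM, df1$Price, xlim = c(18, 180), ylim = c(1, 75),
         xlab = "KM Accumulated ('000)", ylab = "Offer Price (lakhs)")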
If we zoom into this graph, you would see that there is one extreme outlier: a car whose accumulated kilometres are far less than 25,000, but whose offered price is much higher, more than 70 lakhs in this case.
If you look at the other values, the majority lie within the 0 to 20 lakhs range; this is the only outlier. From this we can understand that most of the used cars in the data set are in a smaller price range.
Therefore, it would not be appropriate to study this extreme outlier along with the other points; we have to restrict our analysis to the smaller range. We can eliminate this particular point and focus on the major chunk of points lying mainly between 0 and 20 in terms of price.
You would also see in this plot some points lying closer to the x axis but far away from the 0 value and from the majority of the values: they have more kilometres accumulated, but the price offered for them is still in the same 0 to 20 range.
Let us go back. First we try to identify that particular outlier point; you would see I have used the condition price greater than 70, since in the graph the price looks like more than 70 lakhs. Let us run this code.
You would see this is observation number 23, having fuel type diesel, showroom price 116 lakhs and offered price 72 lakhs. These are high numbers in comparison to the majority of the other observations.
We can get rid of this observation because it can be considered a very distinct group. Let us take a backup of the previous data frame and then eliminate this point. To eliminate the point we use square brackets on the data frame and specify the row index as -23; the minus sign instructs R to remove that row, and the result is stored back in the same data frame.
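A minimal sketch of identifying and dropping the outlier, again assuming the column name Price:

    which(df1$Price > 70)          # index of the extreme outlier (observation 23 here)
    df1[which(df1$Price > 70), ]   # inspect that record
    df1_backup <- df1              # keep a backup before dropping the row
    df1 <- df1[-23, ]              # remove row 23 and store the result back in df1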
Let us execute this; that observation is now gone from the data frame. Let us again look at the range (min and max values) for both variables, kilometres and price, and re-plot the graphic.
You would see that for kilometres there is not much change, but for price there is some change: earlier the values ranged from 1.15 to 72.
(Refer Slide Time: 13:26)
Now it ranges from 1.15 to 13.55, so even less than 15.
So, the price is now less than 15 lakhs. For kilometres, the new range is 27.5 to 167, whereas the earlier minimum was 19, so the minimum of the kilometre range has increased. Now let us plot this; the x limit and y limit values have been modified appropriately, 18 to 180 on the x axis and 1 to 15 on the y axis instead of 75 as in the earlier case. Let us plot this; this is the new plot that we have.
You would see that the graphic now covers most of the points in a clear fashion. If we try to understand the relationship between these 2 numerical variables, kilometres and price, you would see not much change: a constant (horizontal) line running at a price value of around 4 lakhs could describe these points.
It seems that if we fit these data points with a linear model, kilometres would not be an important factor; the price being offered is largely irrespective of the kilometres. That is the sense we get from the data: it can be represented by a horizontal line. Therefore, kilometres is not such a crucial variable in our analysis, specifically for the prediction task related to price.
These are some of the insights we can get from these basic plots. For example, from the relationship between price and kilometres, we can see that kilometres (KM) might not be such a useful indicator of the offered price; at least, that is what we gather from this particular data set.
Now let us move to the next basic chart, the bar chart. The bar chart we want to plot is between price and transmission. Transmission is the factor variable we have already created, and we can compute the average price for the different groups. We get 2 groups based on the transmission value: 0, that is manual, and 1, that is automatic.
The average price for these 2 groups, manual and automatic, can be computed using this line of code. You can see I am computing the mean of a subset of values. which is another function; you can find more information on it in the help section, but to give you a sense of it, it returns the indices of the observations where the transmission value is equal to 0.
Those indices are returned, and the corresponding observations are selected to create a subset which is passed to the mean function. The dollar notation indicates that the mean is to be computed only for one variable, price.
Similarly, the same thing can be done for the other group. Let us compute the average price; you would see that a numerical variable, average price, has been created with just 2 values, corresponding to the 2 groups: group 0 for manual and group 1 for automatic cars.
These 2 mean values have been computed. Another variable that we want to create before generating the bar plot is Trans; this is just for labelling purposes. It holds the labels that we are going to use in our plot, 0 and 1, which are the names for the 2 groups.
We could have used "manual" and "automatic" here as well; those could also have been the label names for our bar plot, but let us go with 0 and 1 for now. This bar plot is between average price, the variable we have just created, and Trans. Let us look at the range: it is between 3.74 and 5.48. If you look at the barplot function I have written here, the y limit is between 0 and 6, so this range will be covered.
Average price is the first argument, so it goes on the y axis, and the names.arg argument, the labels we have just created from transmission, goes on the x axis. The x axis label (xlab) is Transmission and the y axis label (ylab) is Average Price. Let us execute this line.
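A hedged sketch of the group means and the bar plot; the object names avg_price and Trans follow the narration, while the column names are assumptions:

    avg_price <- c(mean(df1$Price[which(df1$Transmission == 0)]),
                   mean(df1$Price[which(df1$Transmission == 1)]))
    Trans <- c("0", "1")   # group labels; "Manual"/"Automatic" would work just as well
    barplot(avg_price, names.arg = Trans, ylim = c(0, 6),
            xlab = "Transmission", ylab = "Average Price (lakhs)")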
(Refer Slide Time: 18:48)
You can see that for group 0, the used cars with manual transmission, the average price is somewhere between 3.5 and 4, and for group 1, the used cars with automatic transmission, the average price is somewhere between 5 and 5.5.
From this kind of information, it seems that cars with automatic transmission carry more value. Now let us create another bar plot. This time we are going to use only one variable. In the previous plot we had one numerical variable on the y axis and one categorical variable on the x axis.
Now let us focus on just one variable, which has to be categorical, so it goes on the x axis again. This variable is again transmission. What we are trying to find out is the percentage of all records that are manual and the percentage that are automatic.
This is the code we can use to compute this. You can see length: I am finding the length of a vector determined by the which command, where which returns the indices for which transmission is 0.
All those indices are counted using the length function, giving the number of records in the group. This is divided by the total number of records in the transmission vector, again computed using the length function, giving a ratio which is then multiplied by 100 to get a percentage value. Similarly, we can do this for group 1, the automatic cars. Let us execute this line.
You can see the variable pAll has been created with 2 values: 80.8 percent of records belong to group 0, that is manual, and 19.2 percent belong to group 1, that is automatic cars.
Let us generate a bar plot. pAll is the variable, with the label argument Trans and the y limit, which in this case is 0 to 100 because we are using percentages; that is the standard range. Let us create this plot.
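A minimal sketch of the percentage computation and the second bar plot, following the pAll variable named in the narration:

    pAll <- c(length(which(df1$Transmission == 0)) / length(df1$Transmission) * 100,
              length(which(df1$Transmission == 1)) / length(df1$Transmission) * 100)
    pAll   # roughly 80.8 and 19.2 here
    barplot(pAll, names.arg = Trans, ylim = c(0, 100),
            xlab = "Transmission", ylab = "% of all records")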
You can see transmission on the x axis with 2 groups, 0 and 1, and the percentage of all records on the y axis: group 0 is close to 80 and group 1 is closer to 20. So, using one categorical variable, in this case transmission, we can also create this kind of plot.
This again helps us understand the structure of the data: the majority of the cars, around 80 percent, are manual, and only around 20 percent, a much smaller share, are automatic.
This gives us an idea about the structure; for example, if the automatic share were less than 5 percent, it could be treated as a rare category or rare class.
That might have affected our formal analysis. So, these kinds of graphics actually help us gain insights about the data and help us later on in the formal analysis.
Now, let us go back to our slides and discuss the next set of plots, which are distribution plots.
Mainly 2 distribution plots are going to be covered in this course: the first is the histogram and the second is the box plot. As the name says, these are distribution plots and help us understand the distribution of the data; because they show distributions, they are generally applicable to numerical variables.
Looking at the distribution can help us understand, for example, whether the data follows a normal distribution or not, and if it does not, what can be done about it.
These plots help us in that respect. Sometimes we might be required to transform variables so that we are able to achieve a more normal distribution. Sometimes, if we want to convert a numeric variable into a categorical one, the entire distribution displayed by the histogram or box plot can help us with binning, that is, how the bins or groups are to be created. Those insights again help us in creating new variables.
As shown in this slide, the histogram and box plot are about the distribution of a numerical variable; as discussed, we get directions for deriving new variables and for binning a numerical variable. They are useful in supervised learning, specifically in prediction tasks, because they mainly apply to numerical variables; the prediction task is therefore the important setting for this kind of plot, where we can get help with, for example, variable transformation in the case of a skewed distribution.
We will learn more about skew in this lecture and in coming lectures: there could be a right-skewed or a left-skewed distribution, and the transformations that can be applied to reduce some of this skewness in the data can be identified with the help of the histogram and box plot.
They also help in the selection of an appropriate data mining method. For example, if the data set is not able to meet some of the assumptions of a statistical technique, then probably we cannot apply it and we have to go with one of the data mining techniques, because some of those assumptions are relaxed there and those techniques can still be applied.
So, the selection of an appropriate method or technique can also be guided by these plots. Now, some further discussion on box plots: a box plot displays the entire distribution.
For example, the bar charts we plotted so far focused on one or two groups or categories of a categorical variable, with a single numerical value for each reflected on the y axis. That is the kind of information we could get from the bar plot, but in the box plot we get the entire distribution; the whole range of values is covered, so we can have a better look at the full data.
Another thing that can be done is side-by-side box plots. Creating side-by-side box plots again helps us compare and understand the difference between groups, something we did using bar plots, but this can be done with box plots in a much better fashion.
This could be useful in classification tasks, where we can understand the importance of numerical predictors: in a classification task we use some numerical predictors, and side-by-side box plots can help us find out how these numerical predictors can be best utilised and how important they are. Another use of box plots could be in time series kinds of analysis, where we can have a series of box plots and look at changes in the distribution over time.
That can also be done. Let us open RStudio and go through examples for the box plot and the histogram; but let us first cover the histogram on the slides before we go through examples for both of these plots together.
Histograms generally display frequencies covering all the values. In bar plots only a few values are actually covered, whereas histograms cover all the values, with vertical bars used for the frequencies; we will learn more through RStudio.
Again we are going to use the same used cars data. hist is the function that can be used to plot a histogram; we are interested in the variable price, our outcome variable. Let us see the range; it is the same as we saw earlier, 1.15 to 13.55.
(Refer Slide Time: 29:27)
You can see that an x limit of -5 to 20 has been used; why this slightly wider limit was chosen we will see after the plot is created. The y limit is for the frequencies of the different bins, which we will also see once the plot is created. Let us execute this line; you can see the plot.
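A hedged sketch of the histogram call; the y limit value is an illustrative assumption, since only the x limit is stated explicitly in the narration:

    range(df1$Price)   # 1.15 to 13.55 after removing the outlier
    hist(df1$Price, xlim = c(-5, 20), ylim = c(0, 40),
         xlab = "Offer Price (lakhs)", main = "Histogram of Price")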
For better visibility of the histogram we have started the x axis from -5, and on the right extreme also some extra range has been given, so that we are able to visualize the whole histogram in one go in a much better fashion; that is why the wider x limits were given. You can see the frequencies for the different bins there. Because the histogram covers all the values of a numerical variable, we can get a sense of whether the distribution follows a normal distribution or not.
In this case we can see that this seems to be a right-skewed distribution; there is a slightly longer tail on the right side. This distribution does not seem to follow a normal distribution, so it is going to be slightly difficult to apply some of the statistical techniques, for example linear regression. We would therefore be required to do a bit of transformation to make it closer to a normal distribution.
Now let us move to the box plot. For the box plot we are interested in the 2 variables price and transmission. Let us look at the range, which is again going to be the same, and boxplot is the function that is used. In this case it is price versus transmission: transmission goes on the x axis and price on the y axis. The different categories of transmission are displayed on the x axis, and for each of those groups the distribution of price values is displayed on the y axis. The limit for the y axis, as you can see, is 0 to 15, and the labelling you can also understand.
So, this is the box plot; let us have a look.
In the box plot you would see that about 50 percent of the values generally remain within the box. The black line inside the box is the median value; the lower edge of the box is the first quartile and the upper edge is the third quartile.
All the values in the box lie between the first quartile and the third quartile, covering 50 percent of the values, with the median displayed inside. The majority of the values are within these limits, and some values plotted as separate points outside the whiskers can be called outliers.
That was for group 0; for group 1 it is again the same: this line is the median, the first and third quartiles create the box, the other elements remain the same, and this value here is the outlier.
Comparing group 0 and group 1, you can see that the median price value for group 0 is much lower than for group 1; the whole box plot for group 1 sits much higher than the box plot for group 0.
A clear separation between the two groups can be seen there. If we want to look at the mean value for those 2 groups, that can also be done; we will have to compute the means for the 2 groups. This can be done using the by command available in R: the first argument is the variable for which we have to compute the mean, the second argument is the categorical variable defining the groups, and the third argument is the mean function itself. Let us execute this code.
The means are created, and once they have been created we can plot them using the points command. points is the command used to add points to an existing graph. Here pch is the plotting character; the plotting character defined by the value 3 is going to be displayed in the plot, which is nothing but a plus sign.
Let us execute this line; more information on the points command can be found in the help section. You can see the plus signs in the plot: for group 0 the plus sign lies exactly on, or very close to, the median value, while for group 1 the plus sign is much higher than the median. So, the skewness that we saw earlier may be coming from group 1, that is, the automatic cars.
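A minimal sketch of the box plot with the group means overlaid, assuming the column names Price and Transmission:

    boxplot(df1$Price ~ df1$Transmission, ylim = c(0, 15),
            xlab = "Transmission", ylab = "Offer Price (lakhs)")
    means <- by(df1$Price, df1$Transmission, mean)   # group means for manual (0) and automatic (1)
    points(as.numeric(means), pch = 3)               # overlay the means as "+" at x = 1 and x = 2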
Let us stop here and continue from this point. In the next lecture we will create a few more box plots; since the basic charts and these distribution plots are mainly 2D graphics, we will go further into multivariate or multidimensional graphics in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 09
Visualization Techniques- Part III Heatmaps
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lectures we had started covering visualization techniques. We stopped in RStudio, where we were working through some of the examples, so let us go back and complete some of them, and then we will come back and start our discussion on the next kind of plot, heat maps. Let us go back.
Again, we will have to do some of the loading and importing of the data set; we will have to reload the library and everything. So, let us load the library xlsx. Once it is loaded,
(Refer Slide Time: 01:07)
we will mainly be using the used cars xlsx file in this lecture. Let us import that data set; you can see it here. Let us import it.
We will rerun the same lines we ran in the previous lecture. You can see that there are 79 observations and 11 variables in the environment section. Then let us recreate the age variable as discussed in the previous lecture, append it to the data frame, subset the data frame, and also convert the relevant variables to factors. You might remember that in the last session we had eliminated one observation which was actually an outlier, so let us perform the same operation again: this was the observation, let us take a backup and then eliminate it again.
Now, let us have a look at the data set. You can see that df1 now has 78 observations and 9 variables.
Let us go back to the point where we stopped in the previous lecture. We were going through some examples of box plots; as I recall, we
had completed one box plot. Let us discuss another one; this box plot is between kilometres and the categorical price. Let us look at the range of kilometres, because we have to specify it in the y limit since this variable is going to be on the y axis. You can see the range, and the y limit we have specified in this line covers this range for kilometres. Now, let us create the box plot.
Like in the last lecture, this is the box plot that we have. The interpretation of the box plot remains the same as in the last session. If you want to display the mean as well, that can also be done, but for that we have to compute the means first. This is the code we discussed in the previous session as well. The means are computed; you can see the means1 variable has been created and there are 2 values.
Now, let us plot these 2 points, and you can see the plot. Here also, you can compare how the kilometre (KM) variable is distributed for the 2 groups, group 0 and group 1. In comparison to the previous example, the two boxes are closer to each other, but for group 0 the distribution is slightly on the lower side, so there is some difference between box 0 and box 1. As discussed, this kind of box plot can help us understand the difference between groups and can also help us decide whether we need to include an interaction variable, if we see a significant difference in the distribution of data in the 2 groups. We will have more discussion on interaction and other related concepts in coming lectures.
Let us plot another one. This one is between age and the categorical price. Let us look at the range, because age is again going to be plotted on the y axis; we can see the range is 2 to 10.
You can see the limit is specified appropriately, with the other settings remaining similar. We can see the plot, and we can also calculate the means and plot them on the 2 box plots; let us look at the graphic.
In this case you would see the 2 boxes are in the same range, but the median coincides with the first quartile in box 0, while in box 1 it is separate. You can also see that the means are at roughly the same value. So, there is very little difference between the distributions of the 2 groups with respect to age.
Now let us do another example, this one between showroom price and the categorical price. Let us go through the range; you can see it is specified appropriately in this line of box plot code.
(Refer Slide Time: 07:02)
We can plot the box plots and then the means; let us plot them and look at this graph. This one looks more interesting: you can see a much bigger difference between the 2 boxes. For group 0 the showroom price distribution is on the lower side, and for group 1 the showroom price is on the higher side. That is nothing unusual; it is actually because of the way the categorical price was created, and the showroom price follows it, indicating the same difference, since both are related to the pricing of the cars. Therefore, this separation is very clearly depicted in the box plot because both variables are related to price.
Now, let us come to another plot; let us go back to our slides.
(Refer Slide Time: 08:09)
Heat maps are our next topic. Heat maps can be combined with the basic plots and distribution plots; they display numeric variables using graphics based on 2D tables, and we will see how that is possible. We can also use colour schemes to indicate values: different colours and different shades of a colour can be used to indicate different ranges of values.
One particular shade could be used if a value lies between 0 and 0.1, a darker shade if it lies between 0.1 and 0.2, and a still darker colour if it lies between 0.3 and 0.4. In that fashion, with an ordered colour scheme, the intensity of the shade helps indicate whether the value is on the higher side or on the lower side. Any kind of data that we can put in a 2D table format can be displayed using a heat map, and the colour coding can help us understand the data and develop some relevant insights.
Now, as we talked about in the previous lecture as well, our human brains are capable of a much higher degree of visual processing. So, heat maps can really be helpful, especially when we are dealing with a large amount of data. When we have a large number of values, it might be difficult for us to find different insights about the data; therefore, the colour coding in a heat map can help us build our visual perception, and those visual perceptions can be carried forward for subsequent analysis and later used for formal analysis.
Now, as you can see in the slide, the second point about heat maps is that they are useful to visualize correlation and missing values. As we talked about, different colour shades are going to be used; therefore, in the correlation matrix, if there is a high degree of correlation between 2 particular numerical variables, that can be shown with a darker shade, and if there is a low value of the correlation coefficient, a lighter shade could be selected. So, the different shades, the intensity of these colour shades, can help us find out which variables are highly correlated and which variables have low correlation values.
Similarly, missing values can also be spotted. As we talked about in the starting lectures, data is generally displayed in a matrix format or in a tabular format. So, that data can be displayed and, if there are any missing values, they can be represented using white colour, while the cells where values are present can be represented in a darker colour or black. So, it would be easy to spot missing values; heat maps can thus also help us understand the missingness in a particular data set, whether there are too many missing values. Duplicate rows and columns can probably also be spotted because of the colour shades: if the colour shade pattern is very similar for 2 particular rows or columns, or multiple rows or columns, we can do a manual check to find out whether it is a duplicate row or column. So, heat maps can help us find these problems.
So, let us go back to RStudio; first we will cover the correlation matrix. A heat map can be used to create a correlation table heat map, and for that we first need to compute the correlations. In this case, let us have a relook at the data frame, the data set, that we have.
(Refer Slide Time: 13:08)
So, you can see that columns 1, 5 and 8 have been left out in the correlation function, the reason being obvious: these are factor variables, categorical variables, and the correlation computation requires numerical variables. So, let us compute the correlation values among the remaining numerical variables.
You can see a correlation matrix has been displayed; this matrix is symmetric, so the upper half mirrors the lower half, and along the diagonal all the values are 1. This value of 1 is between a variable and itself: a variable is going to be 100 percent correlated with itself, therefore these values are 1, and the other values show the particular correlation coefficients.
Now we can have a different kind of table for the same data.
This is the function symnum. You can see a different kind of depiction here: the variable names are in the rows and in the columns, the 1 representing 100 percent correlation, and some notation is given at the bottom of this particular output. A blank (shown within single quotes) is used for values lying between 0 and 0.3, a dot is used for values between 0.3 and 0.6, a comma for values between 0.6 and 0.8, a plus for values between 0.8 and 0.9, an asterisk for values between 0.9 and 0.95, and B for values between 0.95 and 1. So, you can see there is 1 comma here and then several dots; this comma value must be somewhere between 0.6 and 0.8 and the dots somewhere between 0.3 and 0.6. This is quite similar to what we were talking about with heat maps; a heat map will show the same thing using colours, while in this case the symnum function displays the different values using different symbols. Now, there is symmetry in the matrix.
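A minimal sketch of the two steps just described, computing the correlation matrix and then its symbolic version with symnum() (synthetic numeric columns stand in for the used-cars data, where the factor columns 1, 5 and 8 are dropped before calling cor()):

set.seed(1)
num_df <- data.frame(Price    = runif(50, 2, 9),
                     SR_Price = runif(50, 4, 15),
                     KM       = runif(50, 20, 120),
                     Age      = sample(2:10, 50, replace = TRUE))
cmat <- cor(num_df)   # symmetric correlation matrix, 1s on the diagonal
round(cmat, 2)
symnum(cmat)          # symbols instead of numbers; the legend maps ' ' . , + * B to ranges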
(Refer Slide Time: 15:51)
So, let us get rid of one of the triangles; in this case we just want to keep the lower triangle, so the upper triangular values have been assigned NA. Now let us create the correlation table heat map. In this code you can see the first argument of the heatmap function is the matrix itself; then there are some other arguments: one indicates that the matrix is symmetric, and for colour we have specified the grey palette. We will understand more about colour schemes in R later in this lecture; this particular function, gray.colors, can be used to create a number of grey shades. For example, we want to create 1000 shades starting from the value 0.8 and ending at 0.2; the values can lie anywhere in the 0 to 1 range, but we are restricting ourselves to the 0.8 to 0.2 scale. We are not scaling the data, because these are already correlation values, so they are already standardised; the margins have also been specified.
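A minimal sketch of the correlation-table heat map described above, continuing from the cmat matrix in the previous sketch (argument values such as the number of grey shades are illustrative):

cmat_lower <- cmat
cmat_lower[upper.tri(cmat_lower)] <- NA            # keep only the lower triangle
heatmap(cmat_lower,
        Rowv = NA, Colv = NA,                      # no reordering / dendrograms
        scale = "none",                            # correlations are already standardised
        col = gray.colors(100, start = 0.9, end = 0.2),  # light = low, dark = high
        margins = c(6, 6))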
(Refer Slide Time: 17:18)
So, let us execute this particular code; you can see the output, this is the graphic that we have. In this graphic you can clearly see that the diagonal values are in the darker shade, because each variable is going to be 100 percent correlated with itself; therefore, these values are painted with the darker shade, having perfect correlation. For the other values slightly lighter shades of grey have been used: higher intensity of the shade indicates a higher value and lower intensity indicates a lower value. So, the whitish, light grey squares show that the correlation values for the corresponding pairs of variables are on the lower side, and the slightly darker ones, for example this particular one between Price and SR Price, are on the higher side; we can understand that the showroom price is going to be highly correlated with the price of the car.
So, therefore, the correlation value is going to be on the higher side and, similarly, the colour is of higher intensity, a darker shade of grey. This can help us visualize, and find out, which particular pairs of variables are highly correlated. We can say Price and SR Price, we can also say SR Price and Airbag, Kilometres and Age, and similarly Price and Airbag; these sets of variables seem to be highly correlated, with SR Price and Price being very highly correlated.
(Refer Slide Time: 19:18)
Similarly, the data matrix or missing value heat map can be depicted using the heatmap function. In a missing value heat map, generally, if a value is present it is shown in a darker shade, and if the value is absent it is shown in a lighter shade; typically black is used for a value being present and white for a value being absent. But in this case we are not doing that, because in the data set that we have all the values are present. So, instead, what we are trying to do here, just to give you the feel of a heat map in the data matrix or missing value case, is to depict different shades of grey for the different actual values, first for 6 records and then later for all the records. So, depending on the value, a different colour shade would actually be shown.
So, for the first 6 records we are going to run this code. You can see head is the function that has been used to subset the data frame to the first 6 records, gray.colors specifies the colour scheme, and this time the scaling is slightly different: we want to standardise column wise, so column-wise scaling is going to be done, and margins are also mentioned there. So, let us run this code.
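A minimal sketch of the data-matrix heat map for the first 6 records (again on the synthetic num_df frame; a genuine missing-value map would instead colour is.na() of the data in black and white):

heatmap(as.matrix(head(num_df)),
        Rowv = NA, Colv = NA,
        scale = "column",                          # column-wise standardisation, as described
        col = gray.colors(100, start = 0.9, end = 0.2),
        margins = c(6, 3))
# for all records, drop head(); for a true missing-value map one could colour
# 1 * is.na(as.matrix(num_df)) with a simple black/white palette instead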
(Refer Slide Time: 20:46)
You can see, for the first 6 observations, the different columns, the different variables, and their values. For example, the Airbag column is predominantly white; most of the values in Airbag were actually 0, therefore this whiter shade of grey has been used. Similarly, in this 5th row many cells, many squares, are in the darker shade, so higher values are there in this particular row.
Similarly, you can see that the KM column is slightly on the darker side; that means higher values are there in the KM variable. In the same way we can create the heat map for all the rows, for all the records.
(Refer Slide Time: 21:49)
So, this is the heat map; you can see the indexing for the rows, 1 to 79, because we had 79 observations. Depending on the values, the shade has been selected. Had it been an actual missing value heat map, we would see either black or white: white in the places where the value is absent and black in the places where the value is present. If we were to do that for our data set, the whole table would look black because there is no missing value. Now that brings us to our next discussion.
So, let us come back to our slides; the next discussion is on multidimensional visualization. Most of the visualization techniques or plots that we have talked about were mainly 2D, 2 dimensional. Now, we can also add some features to the 2D plots that we have gone through till now which, in a way, make them multidimensional; some of these features are mentioned in this particular slide.
(Refer Slide Time: 23:21)
These features are going to give that multidimensional feel, so our visual perception can be multidimensional using these features on 2D plots. You can see multiple panels: if we use just 1 scatter plot, only 2 variables can be visualized, but if we have multiple panels we can have pair-wise scatter plots for many variables, and in one go we can look at different variables, their relationships, the information overlap and many other things.
So, multiple panels can give us a multidimensional look using 2D plots. Similarly, colour: colour coding can be done for the different groups of a categorical variable, which also gives us that multidimensional feel and helps us build our visual perception. Size and shape: different sizes and shapes can be used for the points being depicted in the graphic, and that too can give a multidimensional feel from a 2D plot. Animation can be done, which can help us in visualizing changes over time. Some operations like aggregation of data, rescaling and interactive visualization can also be used to obtain that multidimensional feel.
Now, when we create a real multidimensional visualization like a 3D plot, the visual perception is not that clear; it is difficult for us to learn something from 3D plots. Because of the way we have been learning over the years, our learning from 2D plots is much better than from higher dimensional plots. The main idea, again, for these features and operations is to help build the visual perception that is going to support the subsequent analysis.
So, let us go back to RStudio and we will go through some of the plots, the first one being colour coded scatter plots. Before we create some colour coded scatter plots, let us understand the colour schemes in R. There is this function palette which can help us see the default colour scheme in R.
If we run this function you can see the different colours listed: black, red, green3, blue and so on. Therefore, whenever we use the colour argument in any function, for example the col argument in this plot, these particular colours would be picked up in that order: for the first group black would be picked, for a different, separate colouring red would be picked, and for the third, green3 would be picked. So, this particular colour scheme is going to be used to produce different colours in your plots.
If you want to change this colour scheme you can do that; for example, rainbow(6) is one function call that can change your palette. You can pass it to the palette function and then check the values again by rerunning palette.
You can see the colour scheme has now changed to the rainbow 6 colours: red, yellow, green and so on. For now we will stick to the default scheme; let us reset it, and you can recheck that the default scheme is back: black, red, green3.
Now let us create a colour coded scatter plot. This plot is going to be between the variables age and kilometres, KM. The colour feature that we are using is based on the categorical price variable: this categorical variable has 2 groups, 0 and 1, and for those 2 groups different colours are going to be used for the points plotted between age and KM.
So, let us run these 2 lines: the range is 2 to 10, appropriately specified in the plot function; let us run it and you can see the plot.
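A minimal sketch of the palette commands and the colour-coded scatter plot just described (synthetic data; Age, KM and C_Price are assumed column names):

palette()              # show the current colour scheme (black, red, green3, ...)
palette(rainbow(6))    # switch to a 6-colour rainbow palette
palette("default")     # reset to the default scheme

cars <- data.frame(Age     = sample(2:10, 40, replace = TRUE),
                   KM      = runif(40, 5, 120),
                   C_Price = factor(sample(0:1, 40, replace = TRUE)))
# col = C_Price picks palette colour 1 (black) for group 0 and colour 2 (red) for group 1
plot(cars$Age, cars$KM, col = cars$C_Price,
     xlim = range(cars$Age), xlab = "Age", ylab = "KM")
legend("topleft", legend = levels(cars$C_Price), col = 1:2, pch = 1)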
(Refer Slide Time: 28:30)
You can see 2 colours, black and red; as we have already seen, black and red are the first and second options, so black has been used for group 0 and red has been used for group 1. Here we get that 3-dimensional feel: we have KM on the y axis and age on the x axis, and we can see the relationship between KM and age in this scatter plot; as the age of a vehicle increases, the number of kilometres accumulated is, of course, going to be on the higher side. But you can also see that the red points are slightly on the higher side, with only a few red points on the lower side. Therefore, we can understand that the cars with categorical price assigned as 1, which means having a price equal to or more than 4,00,000, have accumulated more kilometres; those cars are being used more often. So, that third dimension is depicted using colour in this case.
Now, another kind of multidimensional visualization that we can create is multiple panels: we can create a separate panel for each group. Let us go through one example.
(Refer Slide Time: 30:19)
This particular example uses 3 variables. Essentially, we are trying to create a bar plot, mainly between price and age, and then different panels are going to be used depending on the transmission: one panel for transmission 0 and another panel for transmission 1. The main bar plot is between price and age, where age is being used on the x axis; therefore, it has to be categorical.
So, therefore, we need to convert age into a categorical variable. Let us start with that. Age groups is the categorical variable that we are going to create out of the age variable. Age was a numerical variable; in this case as.factor has been used to convert it into a factor variable, which will then have labels, and a labels function can be used on a categorical variable to retrieve the different labels that are there in that particular variable.
(Refer Slide Time: 31:55)
So, let us do that; age groups has been created and you can see the different labels, 2, 3, 4, 5: cars with different ages have now been clubbed into different groups. Again, this particular variable is needed for the further computation; we need to run a loop later, you can see a for loop is there, and for that we need these age groups so that we can run through all of them.
So, let us run that. Then we are going to create an average price for each transmission group, transmission 0 and transmission 1. For that we have created these 2 variables, average price 1 and average price 2. Let us initialize them; once the initialization has run, for each age group we are going to run this particular loop and fill data into these 2 variables: for transmission 0 and for each age group we compute the average price, and similarly for transmission 1 and for each age group we compute the average price. Let us run this particular loop.
(Refer Slide Time: 33:21)
Once this loop is done, there could be some groups where the average price cannot be computed because that combination did not match: for transmission 1 some of the age groups had no records, and similarly for transmission 0 there might be some age groups with no records. In those cases NaN would automatically be assigned in R; therefore, we need to convert them to 0.
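A minimal sketch of the age-group creation and the averaging loop described above, continuing the synthetic cars frame (Price and Transmission are assumed column names; levels() is used here to pull out the distinct age groups):

cars$Price        <- round(runif(40, 2, 9), 1)
cars$Transmission <- sample(0:1, 40, replace = TRUE)

age_groups <- levels(as.factor(cars$Age))          # e.g. "2" "3" ... "10"
avg_price1 <- numeric(length(age_groups))          # transmission 0
avg_price2 <- numeric(length(age_groups))          # transmission 1
for (i in seq_along(age_groups)) {
  g <- cars$Age == as.numeric(age_groups[i])
  avg_price1[i] <- mean(cars$Price[g & cars$Transmission == 0])
  avg_price2[i] <- mean(cars$Price[g & cars$Transmission == 1])
}
avg_price1[is.nan(avg_price1)] <- 0   # empty group/transmission combinations give NaN
avg_price2[is.nan(avg_price2)] <- 0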
Once this is done, because we want to create 2 different panels, par is the command that can be used; you can see mfrow is the argument. We want 2 rows and 1 column, because the x axis is going to be the same, so the 2 panels will be stacked vertically. cex is again 0.6; this applies to the labelling and to all the numbers that are going to be depicted, the default is 1 and 0.6 scales down the sizes of all the numbers and text. The margin, which you already know, and the outer margin are also specified here. So, let us run this command.
Let us have a look at the range, because we are going to require that in the bar plot. From these 2 ranges we can see that a 0 to 9 limit would cover the values.
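Before plotting, here is a rough sketch of the two-panel bar plot being set up (continuing avg_price1 and avg_price2 from the previous sketch; the labels and the 0-to-9 limit are illustrative):

par(mfrow = c(2, 1), cex = 0.6, mar = c(4, 4, 1, 1), oma = c(1, 1, 1, 1))
ylim_common <- c(0, 9)                             # same y scale on both panels
barplot(avg_price1, names.arg = age_groups, ylim = ylim_common,
        ylab = "Avg price", legend.text = "Trans 0")
barplot(avg_price2, names.arg = age_groups, ylim = ylim_common,
        xlab = "Age group", ylab = "Avg price", legend.text = "Trans 1")
par(mfrow = c(1, 1), cex = 1)                      # restore the defaults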
(Refer Slide Time: 35:15)
So, let us plot this: the first bar plot with the legend Trans 0 and the name of the y axis, and then the second plot with the name of the x axis and its legend. Now you can see this plot has been created. These are the 2 panels; the scale for the x axis is the same because the same variable is being used on the x axis. On the y axis the variable is the same, but the average price for the different groups could be different; still, we have used the same range, and therefore these 2 panels can be compared value by value. You can see that most of the vehicles in the different age groups having transmission 0 are around an average price of 4,00,000; for some age groups it is slightly lower, and as we move further this average price goes down till age group 8, and then again for age groups 9 and 10 it increases, maybe because those cars had a higher showroom price. If you look at transmission 1, these are automatic cars, so the average price for these cars is slightly on the higher side, more than 5 or closer to 6.
So, automatic cars are of course going to be costlier; therefore, used automatic cars are also going to be on the higher side, and that is reflected in this particular graphic. As the age increases you can see there is a slight decrease: as the age of a car increases the average price goes down, with some exceptions of course, but that is the general sense. So, we will stop here today and will continue our discussion on some more visualization techniques in the next session.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 10
Visualization Techniques- Part IV
Multiple Panel Plotting
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lecture, we were discussing visualization techniques, in particular multiple panels. So, let us restart our discussion from the same point; let us go back to RStudio.
At the end of the last lecture we were trying to cover separate panels for each group; that, I think, we were able to complete, and now in this lecture let us move to the scatter plot matrix. A scatter plot matrix can be really useful in situations where you have many numerical variables and you are trying to understand the relationships between different pairs of variables; that is going to be useful for the prediction and classification tasks in supervised learning methods.
And in the case of unsupervised learning methods it can be useful in understanding the information overlap between 2 variables. If we get to see, in one go, the relationships between different variables, our visual perception can be much better, especially in some situations.
The data set that we are going to use is the same one, the used cars data set. Because we are starting afresh, we need to import this particular data set again; let us reload the library and import the data set.
You can see the data set has been imported, 79 observations of 11 variables; that is visible in the environment section.
Now, let us also compute the age variable, and let us also take a full backup of this particular data frame, which we are going to require later in this lecture. The first 3 variables, as we understood in the previous lectures, were not important for some of the initial visualization techniques that we discussed.
In the previous sessions, we also identified one observation that we wanted to get rid of.
(Refer Slide Time: 03:07)
So, let us do the same today as well: take the backup and eliminate that observation. Now we are ready to start with the scatter plot matrix. As discussed, a scatter plot matrix can be useful to understand the relationships and the information overlap.
You can see we have selected 4 key numerical, continuous variables for the scatter plot matrix. As we understood in the previous lecture, for scatter plots both the variables that go on the x axis and y axis are supposed to be numerical. Therefore, SR Price, Kilometre, Price and Age, these 4 numerical variables, can be used to generate a scatter plot matrix. So, let us execute this code.
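A minimal sketch of the pairs() call being executed (synthetic data with the same four assumed numerical columns):

set.seed(2)
cars2          <- data.frame(Price = runif(79, 2, 9))
cars2$SR_Price <- cars2$Price * 1.6 + rnorm(79, 0, 0.5)     # roughly linear in Price
cars2$Age      <- sample(2:10, 79, replace = TRUE)
cars2$KM       <- pmax(5, cars2$Age * 10 + rnorm(79, 0, 15))
pairs(~ SR_Price + KM + Price + Age, data = cars2)          # scatter plot matrix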
You can see the graphic has been created; let us zoom in.
(Refer Slide Time: 04:21)
Now, here you can see in the diagonal boxes the names of the variables that have been used to create this particular scatter plot matrix: SR Price, KM, Price and Age. The function that we have used to generate this matrix is pairs. In the pairs function we have to pass a formula in which we mention the names of the variables that are going to be used to create the scatter plots, and then the data.
So, let us go back. For example, if you are interested in this particular graphic, then the y axis is going to be SR Price, which is in the same row, and the x axis is going to be represented by KM, which is in the same column. Similarly, if you are interested in another plot, then the variable in the same row, say Age, is represented on the y axis, and the variable in the same column, say SR Price, represents the x axis. That is how you can work out which variables are on the x axis and the y axis.
Now, you can look at the different plots and try to understand the relationships. For example, this particular plot is between SR Price and Price; you can see a linear kind of relationship is visible there, because if you pass a line through the majority of the points it is going to be a straight line. You can understand from the variables themselves that SR Price and Price are both based on prices; therefore, there is supposed to be a linear relationship, and that is clearly visible in the data itself.
Now, if you are interested in Kilometre versus Price, you can see most of the points are clubbed together in one particular group, so there does not seem to be much influence of kilometres on price. We can also look at the plot where Price is on the y axis and KM is on the x axis; because Price is our outcome variable of interest, this particular plot is of more interest to us. Here the data could be represented by a horizontal line, which signifies that there is not much influence of KM in determining Price. Similarly, there are other plots and different kinds of relationships can be seen in them.
Now, if we are interested in a few other plots, for example Price and Age, you would see that because age takes only a few distinct values, for each particular age, cars of different prices are depicted; so it looks like a bar chart kind of plot.
So, for each age value, cars of different prices are shown by different data points. This kind of scatter plot matrix can really be useful in terms of finding and understanding many relationships; it can help us in finding new variables, in identifying interaction terms if required, in grouping some of the categories, and in subsetting the model, that is, running a model on a subset of the full data set. Those kinds of things can be identified using these plots.
Now, let us move to our next point; let us go back to the slides.
(Refer Slide Time: 09:14)
Next we are going to discuss these operations: aggregation, rescaling and interactivity. These operations can sometimes be really useful for the same purposes that we have been talking about. Let us start with rescaling; let us go back to RStudio again. Rescaling can be really helpful if there is crowding of points near an axis, whether the x axis or the y axis, that is, if many points are crowded near those axes.
So, we can rescale the x axis and y axis and get a better look at the data. Let us see how, through an example. We are going to create 4 back-to-back plots; therefore, we divide our plotting region into 2 rows and 2 columns, so 4 plots will be created, and we have appropriately changed the other settings like margin, outer margin and the size of the fonts, text and numbers.
So, let us run this. The first example is a scatter plot between kilometre and price; let us again have a look at the range.
(Refer Slide Time: 10:55)
As you already know, the range can be used to specify the x limit and y limit in the plot function; you can see the limits have been specified appropriately.
Now, let us run this particular plot. As I said, we wanted to generate 4 plots in the same plot area.
So, you can see one fourth of the area has been taken by the first plot. Now let us create the axis labels; we can see Price versus kilometre and you can also see the points. We have talked about this plot many times, noting that there is not much influence of KM on price, but if you want to have a much closer look, then rescaling can be really useful.
Let us see how. Let us change the scale of the x axis and y axis to a log scale. When we talk about log scaling we are essentially changing the spacing of points on the x axis and y axis: points are no longer equally spaced, they follow logarithmic spacing. How can that be done? In the plot function in R there is this argument log which can be used to perform different kinds of scaling: if you just want to change the scaling for the x axis, we can assign log as "x"; if you just want to scale the y axis, then "y" can be assigned; and if you want to change both axes, which is what we are doing here, we assign "xy". This is between KM and price.
Now, let us talk about the limits. In the previous plot we had used 0 to 180 for the x axis. In the case of log scaling you would see that this becomes 10 to 1000, the reason being that 180 is more than 100 and the next appropriate point on a log scale is going to be 1000: the scale runs like 1, 10, 100, 1000, or in the other direction 1, 0.1, 0.01.
So, therefore, we have to make sure that all the values are within the range; an appropriate limit for the values to lie within the plot region could be 10 to 1000 in place of 0 to 180. Similarly, in place of 0 to 15 we can have 0.1 to 100. Let us execute this particular line; you can see now the visibility of all these points is much clearer, and this is mainly because of the change in scale and thereby the change in the spacing of points on the x axis and y axis.
Now, we recreate both the axes, x and y, and also relabel them; now let us zoom in.
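A minimal sketch of the two versions of this scatter plot, on the original scale and on the log-log scale (continuing the synthetic cars2 frame; the limits follow the reasoning above):

par(mfrow = c(1, 2))
plot(cars2$KM, cars2$Price, xlim = c(0, 180), ylim = c(0, 15),
     xlab = "KM", ylab = "Price")                   # linear scale
plot(cars2$KM, cars2$Price, log = "xy",
     xlim = c(10, 1000), ylim = c(0.1, 100),
     xlab = "KM", ylab = "Price")                   # both axes on a log scale
par(mfrow = c(1, 1))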
(Refer Slide Time: 14:50)
So, now you can compare these 2 plots. The horizontal line that could represent these data points is not that easily perceivable in plot 1, but in plot 2 you can pretty much see that this is that kind of relationship: it is close to a horizontal line, so there is not much influence of KM on price.
This is much better visible on the log scale; you can see the tick points 10, 100 and 1000, and most of the points were in the range from about 25 to 120. The points are still lying in the same range, but because of the change in scale and therefore in the spacing of points, the visibility of these data points has changed and we can more easily perceive the relationship.
So, let us go back; now we are going to create a box plot and try to understand how rescaling can be helpful in the case of a box plot. For this example we are using the data frame that we had backed up. Why this particular data frame? In data frame df1 we had eliminated one particular outlier point; now we want it back so that the importance of rescaling can be emphasized in a much better manner.
(Refer Slide Time: 16:45)
Looking at the range, you can see this point, 72, is back, so now we have to change our limits; you can see that. This particular box plot is between Price and Transmission, transmission being the categorical variable on the x axis. Let us plot this, label the axes, and then zoom in and have a look at the plot.
Now, you would see that in this particular case, because of this one observation lying far away in the automatic transmission category, the whole box has been crowded down towards the x axis.
So, the comparison of these 2 boxes is becoming very difficult; therefore, rescaling can really be helpful in this particular case. What we are going to do is use the log scale.
You can see log y has been selected because now only y is the numerical variable, so only its rescaling is required, and the y limits have been changed appropriately so that they cover all the data points; the outlier value 72 is very well within this range. Let us execute this code; you would see a plot has been created, let us label the axes.
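A minimal sketch of the rescaled box plot (synthetic prices with one extreme value standing in for the outlier discussed above):

price_with_outlier <- c(runif(39, 2, 9), 72)        # last value mimics the outlier
transmission       <- factor(c(sample(0:1, 39, replace = TRUE), 1))
boxplot(price_with_outlier ~ transmission,
        log = "y", ylim = c(1, 100),                # log scale on y, wide enough for 72
        xlab = "Transmission", ylab = "Price (log scale)")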
(Refer Slide Time: 18:20)
Now, you see the boxes are looking much better and the comparison can be easily performed. As for the point which was the outlier, you can see the spacing between this particular point and the main box plot has changed to a great extent; this is the result of rescaling, and the comparison can now be done easily.
So, this is the benefit of rescaling in situations where crowding happens. The next discussion point is on aggregation, attaching a curve and zooming in. We will go through some examples and see how aggregation works, how we can append or add a curve to an existing plot, and how we can zoom in, and how these can help us in some of the visual analysis tasks.
(Refer Slide Time: 19:14)
So, again we are going to create 4 back-to-back plots; a similar par call has been specified, you can see 2 rows and 2 columns, and the margins have also been set so that we are able to use the plotting region effectively. Now, for the first plot in this case we are going to use the time series data, the riders data, that we had used earlier. Let us import that particular data set, bicycle ridership dot xlsx; this data set is time series data, the first variable being the month and year, the time-scale related information, and then the number of riders for every month, covering the years from 2004 to 2017.
Let us also create the time series vector that we are going to require later on, and let us also create these helper variables, the tick positions and labels, for the axes.
(Refer Slide Time: 20:52)
Now, let us go back to aggregation. Once we have created the time series vector, let us plot; this plotting we have done before, and you can see the time series has been plotted, covering one fourth of the plotting region because we are going to plot 4 back-to-back graphs. Let us recreate the x axis; from the code that we are using you can understand the kind of labelling that we are doing, you can see day, month and year, and the same is depicted in the plot. The labels have been changed; let us recreate the y axis as well. This is mainly being done to accommodate the axes and to be able to fit 4 graphs in the plotting region.
Now, let us label both the axes, month and riders, and also provide an appropriate title, because in this particular example we are going to append or attach a curve on the raw series. So, the raw series is displayed and the title has been displayed there. Now, lines is one function that can actually be used to add a curve.
If you are interested in finding more information on lines, you can go to the help section; you can see it adds connected line segments to a plot, and it is a generic function.
(Refer Slide Time: 22:47)
So, you will have to provide the coordinates x and y. In this particular case we are providing those coordinates using the lowess function; let us find out what it does. The lowess function is a scatter plot smoothing function; it takes the same x, y points.
These points are being taken from the time series vector, lowess applies the computations of the lowess smoother, and we get a smooth line added to the plot. The colour of the line is red; we are going to use red for this particular line. Let us execute this code; you would see a red curve has been added, you can see that.
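A minimal sketch of adding the lowess smoother to a time-series plot (a synthetic monthly series stands in for the bicycle ridership data):

set.seed(3)
tsv <- ts(200 + 0.02 * ((1:159) - 80)^2 + rnorm(159, 0, 40),
          start = c(2004, 1), frequency = 12)        # synthetic monthly ridership
plot(tsv, xlab = "Month", ylab = "Riders", main = "Raw series")
lines(lowess(time(tsv), tsv), col = "red")           # smoothed curve over the raw series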
(Refer Slide Time: 23:43)
So, this particular curve has been created using the coordinates from the time series vector. If we want a curve which is more representative of the data points, we should be able to approximate what this line graph looks like and then try to plot that representative curve here. For example, this particular time series looks as if it follows a polynomial curve, so probably a quadratic curve can be overlaid on it. Let us do that; we will add a quadratic curve to this plot.
So, let us create t, because t will become a predictor for this quadratic curve. You would see that t has been created, you can see the same in the environment section; it is nothing but a series of numbers 1, 2, 3, 4 and so on, depending on the number of points that are there. And because this is going to be quadratic, let us also create t squared. Once this is done, you can see that we are using the points function, which does a similar kind of thing: it plots points based on the given coordinates.
(Refer Slide Time: 25:15)
So, the coordinates are being passed using the time function: with the time function we are extracting the time values from the time series vector tsv, and then we use the predict function that we have discussed before. We are using linear modelling; in this linear model there are 2 predictors, t and t squared, and they have been modelled against riders.
This prediction gives us the y coordinates, the time function extracts the time values from the tsv vector, and once both of these are available we plot them using the points function; the colour is green for this particular curve.
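A minimal sketch of overlaying the fitted quadratic trend, run right after the plot in the previous sketch (t and t_sq are the time predictors):

t      <- 1:length(tsv)
t_sq   <- t^2
riders <- as.numeric(tsv)
fit <- lm(riders ~ t + t_sq)                       # quadratic-in-time model
points(time(tsv), predict(fit), col = "green")     # fitted values plotted as points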
So, let us execute this line; you would see a green curve has been added.
(Refer Slide Time: 26:15)
You can see this particular curve seems to represent the line graph in a much better fashion, but note that these are plotted points, not a line; the previous one was a line. lines is the function used to create a line and points is the function used to plot points, but because the points are so close together we get the sense of a curve being added, and this one follows a polynomial, quadratic, curve.
So, now let us also add the grid. abline is the function which can be used to create vertical and horizontal lines, and the axTicks function can be used to get the default tick positions that were generated by the plot function. These tick positions can be extracted and horizontal lines can be drawn through them; here they have been drawn at the at2 positions that we computed before.
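A minimal sketch of the grid lines, added to the current plot; axTicks() returns the current axis tick positions, which here play the role of the precomputed positions mentioned above:

abline(v = axTicks(1), h = axTicks(2),
       lty = "dotted", col = "grey")   # dotted grid lines at the axis ticks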
So, you can see that these dotted grid lines have been created appropriately, so that we get a better look at the graph, and if we want to compare some of the values we can easily do so. Now let us come to our plot number 2; this plot number 2 is the monthly average.
(Refer Slide Time: 27:59)
So, we have 13 years in total, and for the different months we want to compute averages of the number of riders for each month, Jan, Feb, March and similarly for all 12 months, and then plot them in a line graph. So, we are required to compute those averages. Let us do this: riders by month is the variable, and there are 12 months, so let us first initialize all 12 values. Then there is our counter.
We have 12 months for every complete year, and in 2017 we had only 3 months, Jan, Feb and March; together these give the total number of months, which is the upper limit of this counter. With this counter variable you can see that in this while loop we are trying to accumulate all the riders month wise.
Index 1 of riders by month represents the month of Jan, index 2 represents the month of Feb, and similarly the last one, index 12, represents the month of December; we accumulate all these numbers and later on average them, so that we get the average number of riders by month, which we will then plot. So, let us run this; there are also the 3 extra months in the year 2017 that we saw in the data before.
(Refer Slide Time: 29:52)
If you want to have a relook at the data you can do that; df is the data frame, and you can see that, starting from 2004, we have the number of riders for every month.
Let us scroll to the last year; you can see we have data for only 3 months. So, the code that we discussed here, these particular 3 lines, adds those 3 numbers as well for the months Jan, Feb and March. Let us execute these lines; now, once we have all these accumulated numbers, let us take the average. As you can see, each of these sums is being divided by the number of months: Jan, Feb and March are counted 14 times, and the rest of the months have been counted 13 times because there were 13 years of data.
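As a compact alternative to the accumulation loop described above, cycle() gives the month index (1 to 12) of every observation in the series, so tapply() can average by month directly (continuing the synthetic tsv series):

riders_by_month <- tapply(as.numeric(tsv), cycle(tsv), mean)   # average riders for Jan ... Dec
round(riders_by_month, 1)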
(Refer Slide Time: 31:01)
So, let us take the averages. Now let us create another time series vector, this time using this particular data, the average number of riders by month. Let us create this time series vector and plot it; you can see this data.
Now, let us recreate the axes: we need to create the tick labels for the x axis and y axis, the labelling for both axes, and then the title, followed by a grid. Now let us look at the graph; this is our graph.
(Refer Slide Time: 34:45)
This is aggregation by month: for every month, the average number of riders over the years is being represented. If you look at this particular graph, we can easily see that in the months of July and August the number of riders is higher in an average sense. If you look at the other months, in the month of June there is a dip in the ridership, and the lowest numbers are in the months of Jan and Feb.
So, the number of riders is on the lower side in Jan and Feb. Because this particular data reflects bicycle ridership on the IIT Roorkee campus, you can see that in the months of July and August, when the semester starts, the environment is much more conducive to bicycle ridership, but in the months of Jan and Feb the weather is cold, so the environment is not that conducive and the ridership is also affected.
So, the same thing is very well captured in this aggregation. Let us move to our next plot; this next plot is about zooming in. We are going to zoom into a particular period, the first 2 years of data. This could also be done by subsetting the data and then plotting it, or we can zoom in using a particular function: window is the function in R that can be used to subset a time series. We can create a new, subset time series from the existing one; tsv is the existing time series and we can create a subset using the start and end arguments. So, we are going to create a subset of the 2004 to 2005 data.
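A minimal sketch of zooming in with window(), again on the synthetic tsv series:

tsv_zoom <- window(tsv, start = c(2004, 1), end = c(2005, 12))   # first 2 years only
plot(tsv_zoom, xlab = "Month", ylab = "Riders", main = "Zoom in: first 2 years")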
So, let us execute this code and plot; you can see a plot has been created. Let us recreate the tick marks and axes, create the labels and the title, zoom in on the first 2 years, and also add the grid; let us go back to the zoomed graph.
You can see this particular line graph represents the first 2 years, and you can have a closer look at the data.
Now, let us move to the next plot; this fourth plot is actually aggregation by year. This is for when we are interested in looking at the data in a global sense, at what the global pattern is. For example, when we aggregated the data by month, that can be thought of as a way to understand the seasonality, if any seasonality is present in the data; there will be more discussion of what these terms actually mean when we come to time series forecasting. But right now we can understand that when we aggregate year wise we get a look at the global picture: the full raw series gives a global trend look, aggregation by month gives us a seasonality look, and year-on-year aggregation also shows us some other, global patterns.
So, let us do that. aggregate is the function; we are aggregating this particular time series using the mean. Let us plot it; for every year these are the average riders, and this is the graph. Let us recreate the axes, title and grid.
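A minimal sketch of the year-wise aggregation (aggregate() on a ts object collapses it to one value per year by default):

yearly_avg <- aggregate(tsv, FUN = mean)             # average riders per year
plot(yearly_avg, xlab = "Year", ylab = "Avg riders", main = "Aggregation by year")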
(Refer Slide Time: 36:39)
Now, let us have a look. You can see, starting from 2004 to 2016, the line graph where the average number of riders is depicted for each year; overall, year on year, there was a dip around 2009, and after that the number of riders has been increasing year on year. So, we get the overall sense of what has been happening over the years; the global pattern can be seen very easily in this aggregation. We will stop here, and in the next lecture we will restart our discussion with scatter plots with labelled points.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 11
In Plot Labels
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lecture we did multiple panel plotting; last time we also covered adding lines to plots while we were doing multiple panel plotting. Now, in this particular lecture, we will start with in-plot labels. So, let us start with in-plot labels; let us go back to RStudio.
First, let us reload the data and the libraries; let us load this library, xlsx.
(Refer Slide Time: 01:17)
And the used cars data set: you can see 79 observations and 11 variables.
Now let us recreate the age variable as we did in the previous lecture. The first 3 variables are not important to us, so we are trying to get rid of those variables.
(Refer Slide Time: 01:55)
Let us also convert these 2 variables, transmission and c price, into factor variables, and let us also eliminate that outlier point, like we did in the previous lectures. Now we are ready to go.
(Refer Slide Time: 02:10)
So, we will start with in-plot labelling. In-plot labelling can be useful when we are trying to understand a large amount of data, especially in clustering: it is easier for us to make a visual inspection of the data and try to understand the different clusters that could be there. We will start with a scatter plot with labelled points.
This particular data frame we have already created, dffb. Now, if we create this particular plot between KM and price, there are a few points which are far away from the major chunk of the points.
So, we are trying to get rid of those points so that we can go ahead with the labelling, because when we do labelling in the plot it can get slightly messy if there are too many far-away points; there is then less scope for labelling within the major chunk of the points. Therefore, we will remove these far-away points.
(Refer Slide Time: 03:36)
So, we will get rid of these points and then look at the rest. This particular scatter plot is between kilometre and price, so let us look at the new range; you can see it is 27 to 113.
The points that we have removed were slightly far away; you can see these numbers, 161, 156, 167, which were the far-away kilometre values, so we wanted to remove those points so that our labelling is much better. Similarly, for the price variable also we have removed some points, observations 23 and 65: the price values for these 2 points are 72 and 13.55; 72 is the outlier and 13.55 is also far away from the major chunk of values.
Let us execute this particular code and create the scatter plot; these are the points between price and kilometre.
Now, if you want to label these observations, text is the command which can actually be used.
If you are interested in knowing more about this particular text command, you can look at its help page.
(Refer Slide Time: 05:19)
You can see x and y, the coordinates, are the first and second arguments. In this particular case KM and price, KM on the x axis and price on the y axis, represent the x and y coordinates.
The limits have been appropriately specified: the limits on the x axis are 25 to 120, again based on the range calculation that we just did, and the range lies within these limits. Similarly, for the y axis the limits are specified as 1 to 9, and the range for the price variable also lies within this range.
So, this plot has been generated; now let us execute the line to label it. The labelling is based on the variable model, so the model name of the car will be the label for each point. If you want to relook at the data, let us look at the first 6 observations.
(Refer Slide Time: 06:21)
So, the name for each of the observations is going to be based on this variable, for example Verna, Quanto, SX4, Beat, Civic. The points are going to be labelled with their respective model names, and then there are some adjustments for where each label is placed relative to its point; that is based on this particular argument, adj.
So, -0.4 and -0.4 are the relative offsets from the points: relative to the x and y coordinates of the actual point, this adjustment decides where the label is plotted. And the character expansion for these labels is going to be 0.5, that is, half of the default size.
So, let us execute this line. You would see the plot is slightly messy now, but every point is labelled with its model name; all the used cars are labelled. From here you can see that in the upper half of the plotting region are the slightly expensive cars, for example Rapid, Cruze, Duster, Fortuner, Verna; these cars are on the upper side of price, which is very understandable. You would also see that in the right part of the plot are the cars with slightly higher mileage, for example Dzire, Verna, Indigo; these cars have probably been used more and have accumulated more kilometres. So, this kind of clustering, this kind of understanding of the data, can also be helpful if the points are labelled.
So, let us move to our next point: large data sets. If we are dealing with a really large data set, what is going to happen when we start our visual analysis? Our plot would be filled with many points, because we might be dealing with 3000, 4000, 5000 or even 10,000 points.
So, if we try to generate a scatter plot, the plot might be filled with too many points. How do we then understand the relationships between variables or the information overlap? There are a few things that can be done while we generate our plots so that we can easily see and understand the data. Here is one example, a particular data set on promotional offers that we are going to use.
(Refer Slide Time: 10:13)
So, let us import this data set; you can see this particular data frame, df3, with 5000 observations. This data set falls in the large data set category: there are just 3 variables, but 5000 observations. Let us have a look at the first 6 observations. This data set has 3 variables: one is income, another is spending, and the third is the promotional offer.
These 3 variables are about customers: the income of a particular customer, the spending that they do, and the promotional offer, that is, whether they accepted or rejected the promotional offer that was sent to them. Using these 3 variables we will try to see how we can visualize this large data set and then try to understand it. Again, the colour scheme is going to be slightly important in this case.
So, let us check our default colour scheme. You can see black and red is the default colour scheme in R; we can change it to grey and black, so our first colour of choice is going to be grey and then black. Let us change it. Now, when we are dealing with a large number of data points, we will have to incorporate some transparent colouring and we will also have to reduce our marker size, that is, the size of the points that we generally see in our plots.
For example, you can see the circles in this particular scatter plot; the marker size is slightly on the higher side for this plot. Therefore, if we are dealing with too many points, say 5000 points as in this example, we will have to reduce the marker size and also make a few more changes, as we will see.
So, the scatter plot that we are going to generate is between income and spending: income on the x axis and spending on the y axis. Let us check the range: 6 to 222 for income, and for spending it is close to 0 up to almost 11. The limits have been specified accordingly, along with the labels and the colour; you can see that the third variable, promo offer, has been used for the colour, and a plotting character of 19 has been specified. The plotting character is the symbol that is actually used to plot a point; you can see circles have been used in this scatter plot.
If you are interested in finding out more about plotting characters, you can search the help section for the points function, where you would see a description of the pch values.
(Refer Slide Time: 13:50)
So, the values 0 to 25 are well defined plotting characters. For example, what we saw in our scatter plot corresponds to the pch value of 1; what we are going to use in this particular scatter plot is the plotting character of value 19, which is a filled black dot.
So, let us use this plotting character. cex is again the character expansion factor; here it is 0.8, so the points are going to be 80 percent of the default size. In the previous lectures we talked about how we can generate a grid behind a plot; this is another way to do it: in the plot function itself you can use the argument panel.first and pass the grid function to it, which is going to draw the grid behind your plot. So, let us execute this line.
(Refer Slide Time: 14:59)
So, this is our plot now. If you look at this plot, there are too many points and the clarity is lost, because most of the points lie in the same region and are overlapping each other. This is what happens when we are dealing with a large data set: it becomes slightly difficult for us to understand what is going on between these 2 variables. The marker size is also playing a role; because of the slightly larger marker size many points overlap, which could have been avoided with a smaller marker size. So, let us make a few changes. There is another concept which is called jittering.
What happens in jittering is that, to avoid overlap between points, we add some random noise, a very small value compared to the actual value of the point. As a result, 2 points which were overlapping might still be close to each other, but both of them become visible instead of sitting on top of each other.
This jittering is generally added to each point. In this particular case it is the income variable, that is, the x-axis values, which we have selected for jittering, so some noise is going to be added to the x values of all the points. You would also see that we have changed the plotting character; if we go back to the help section, you would see that the pch value of 20 is a smaller dot. Therefore, a smaller marker is going to be used, and the character expansion has also been reduced to 0.7, that is, 70 percent of the default size. So, we have reduced the character expansion, changed the marker, and also done jittering, which is adding the random noise, so that the overlap can be avoided.
So, let us execute this line and look at the plot now. You would see that many more points can be seen: there is much less overlap, the marker size has reduced, so the points are now very small, but far fewer of them overlap, and many more points are visible in comparison to the previous graph.
You would also see that most of the customers who accepted the promotional offer seem to be lying in the top-right region of the plot. Now, there is another way to further improve this plot, something we discussed in a previous lecture as well: the log scale.
So, we can transform both the x and y scales to logarithmic scales, and that is going to further improve the visibility of the points. If you again have a look at the plot, you would see that the region between 0 and 100 on the x axis is messier, with more points in it, while in the other region there is more space and the points are further apart, so visibility of points is much better there.
So, we will try to change the scale and see what can be done. You would see that we have used the log argument in the plot function with the value "xy", so scaling of both axes is going to be performed, and the other things remain the same: the colour again comes from promo offer, the plotting character is the same, the character expansion is the same, and the grid is going to be there. So, let us execute this line and you would see a significant change in the plot, mainly because of the change in scale.
Now, because this is a log scale, the space between points having x values between 0 and 100 is much more, and from 100 onwards there is less space; this is because of the usage of the log scale, and similarly for the y axis as well.
So, most of these points are now more spread out and their visibility has actually improved. This kind of jittering and rescaling of the axes can really be helpful when we are trying to visualize a large amount of data.
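As a rough sketch of the two refinements just described (jittering with smaller markers, and then log scales), assuming the same hypothetical column names as before:

# jittered x values, a smaller plotting character and reduced expansion
plot(jitter(df3$Income), df3$Spending,
     xlab = "Income", ylab = "Spending",
     col = as.integer(as.factor(df3$PromoOffer)),
     pch = 20, cex = 0.7,               # pch 20 is a smaller dot, 70 percent of default size
     panel.first = grid())

# the same plot with both axes on a log scale
# (zero values would need to be shifted or dropped before taking logs)
plot(jitter(df3$Income), df3$Spending,
     log = "xy",
     xlab = "Income", ylab = "Spending",
     col = as.integer(as.factor(df3$PromoOffer)),
     pch = 20, cex = 0.7,
     panel.first = grid())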
It can help us especially in unsupervised learning tasks such as clustering, and also sometimes in understanding the relationship between 2 variables when there are too many points. So, let us reset our colour scheme in R. That brings us to our next discussion point: we are now going to start our discussion on multivariate plots.
We are going to discuss a few multivariate plots where not just 2 but more than 2 variables are going to be used, because in the kind of modelling that we generally do, whether statistical modelling or data mining modelling, we are usually in a multivariate environment. Therefore, multivariate plots can sometimes be more useful for gaining insights. So, let us start with them.
So, the first one that we are going to cover is the parallel coordinates plot. We will see, as we go through this particular example, what a parallel coordinates plot is about. Each variable is treated as a separate dimension, and all those dimensions are given some space in our 2D plot. We will see how that is done; it gives us a better picture of each observation, and it can help us understand the kind of relationships between the variables and the observations as well.
So, to be able to create a parallel coordinates plot we need to load the MASS library. We will also change the plotting parameters because we would be creating 2 back-to-back plots: 2 rows and 1 column. We are also changing the cex value, the margins and the outer margins; the margins apply to the plotting region and the outer margins are the space between the plot and the remaining area. So, let us execute this.
For this particular plot we are again going to use df1; let us have a look at the data frame.
This is the main data frame that we are going to use, again the used cars data set. We need to do a certain transformation to be able to use the parallel coordinates plot, so let us look at the structure of the data that we have right now.
(Refer Slide Time: 24:03)
Looking at the structure, you would see that fuel type, transmission and C price are all factor variables, but the parallel coordinates function parcoord that we are going to use requires all the variables to be numeric.
So, we need to change them, and you would see one way of doing it here. We are trying to make the fuel type labels numerical; right now the labels for fuel type are CNG, diesel and petrol, so the first thing is to change them to numbers. This is the code that can be used, with length because there are 3 labels, so we can have 1, 2, 3 instead of CNG, diesel and petrol. Once the label names have been changed for fuel type, we are ready to apply the as.numeric function to all the variables in this data frame.
Even though we had 2 more factor variables, transmission and C price, their label names were already in numeric form, 0 and 1. So, when we do the conversion of these strings to numeric, it is easier, because they are already numbers, just stored in string form.
But the fuel type variable was stored using CNG, diesel and petrol, that is, in text form; the label names were text, so for that one we had to change it first. So, let us apply this change to the data frame.
Now, you would see that all the variables are numerical and that the values have also changed, especially for fuel type; the same is true for transmission and C price. Let us have a look at the first 6 observations.
You would see that transmission, which was earlier 0 and 1, is now 1 and 2, mainly because we have applied the as.numeric function. C price was also earlier 0 and 1 but is now 1 and 2, and fuel type, which was earlier the text CNG, diesel and petrol, is now 1, 2 and 3.
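A sketch of this conversion, with assumed column names (Fuel_Type, Transmission, C_Price) for the used cars data frame df1:

str(df1)                                           # Fuel_Type, Transmission, C_Price are factors

# give the text levels of Fuel_Type (CNG, Diesel, Petrol) the labels 1, 2, 3
levels(df1$Fuel_Type) <- 1:length(levels(df1$Fuel_Type))

# as.numeric() on a factor returns its level codes, which is why the 0/1 factors
# show up as 1/2 and the three fuel types as 1/2/3 after the conversion
df1[] <- lapply(df1, as.numeric)
head(df1)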
Now, once we have done this transformation, we are ready to use the parcoord function. Here we are not going to include our outcome variables of interest: the 2 variables price and C price are not going to be included in the plot. Instead, we are going to use 2 panels, one panel for each group of C price: one panel for C price value 1 and a second panel for C price value 2.
That is, the used cars with value less than 4 lakh are going to be in panel 1, and the used cars with value greater than or equal to 4 lakh are going to be in panel 2.
So, we will try to understand the differences between these 2 groups across variables; this is going to be a multivariate visual analysis. Let us execute this line; you can see that panel 1 has been created. Now let us label the axes.
You would see that the parcoord function also scales all the variables to a common range, shown here as 0 to 100, that is, in percentage terms, so all the variables have been brought to the same scale. Let us also create the grid. Now, let us plot the second panel, in a different colour, with labels and the grid, and then zoom in to look at the plot.
So, panel 1 is for group 1 and panel 2 is for group 2. From here we can compare the 2 panels and try to understand the differences between the 2 groups. If we look at transmission, there are 2 values, manual and automatic. In panel 1 there are only a few observations at value 1, whereas in panel 2 there seem to be roughly equal numbers of observations at both transmission values.
So, an equal number of lines pass through that particular axis. For each variable, Fuel-type, SR-Price, KM, Transmission, Owners, we have an axis; that is where the name parallel coordinates comes from: for each variable we have a coordinate axis and these axes have been placed in parallel. Each line running through the plot represents one observation.
If you look at the fuel type dimension, in panel 1 petrol, CNG and diesel are all present, but in panel 2 only petrol and diesel are present; one category is missing. This kind of comparison can be done using a parallel coordinates plot.
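A sketch of the two back-to-back panels, assuming the converted numeric data frame df1 from above with columns Price and C_Price (C_Price coded 1 and 2); the names, margins and colours are assumptions:

library(MASS)

par(mfrow = c(2, 1), cex = 0.7, mar = c(2, 2, 2, 2), oma = c(1, 1, 1, 1))

vars <- setdiff(names(df1), c("Price", "C_Price"))   # leave out the outcome variables

# panel 1: used cars priced below 4 lakh (C_Price == 1)
parcoord(df1[df1$C_Price == 1, vars], col = "grey40")
grid()

# panel 2: used cars priced at 4 lakh or more (C_Price == 2)
parcoord(df1[df1$C_Price == 2, vars], col = "black")
grid()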
Now, let us go back. We are going to start our discussion on the next point, which is specialised visualization. Till now we have mainly been dealing with cross-sectional data or time series data; now we are going to incorporate some other forms of data. For example, in this particular lecture we are going to cover network graphs, and for that we require network data.
We will go through one example and see how it is different from cross-sectional data and analysis and from time series data and analysis. This is a hypothetical example that I have created.
(Refer Slide Time: 31:28)
This is mainly applicable in the association rules context, which is going to be covered in a much later lecture. This is a bipartite graph, also called a two-mode graph: there are going to be 2 groups and we are going to see the interconnections between these 2 groups by plotting a network graph.
First, because this is mainly in the association rules context, we are essentially dealing with transactions, and in a transaction we generally have items which are purchased together. So, we are going to create a hypothetical data set for the same.
Let us say we have items 1 to 10, represented by the letters A to J. Let us create this; you would see that item 1 has been created with 50 observations drawn from these 10 labels. This is the first item in each transaction. The second item in a transaction is again going to come from the same pool of items, but it cannot be the item that has already been purchased.
So, we have written some code here to do that. First let us create the pool; you can see the pool variable has been created with the 10 values a to j.
Now, for a particular transaction, the item appearing in item 1 cannot be included in item 2, so it is eliminated through this code. You would see the expression minus which(pool == tolower(item1[i])): item1[i] is in upper case, we lower it, and if it matches an element of the pool, that particular index is excluded from the pool from which the sampling is being done.
So, let us execute this code and create a data frame of these 2 variables. igraph is the library that we generally require to deal with network data, so let us load that library.
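A sketch of the hypothetical transaction data described here; the object names (item1, item2, df4) and the seed are assumptions:

set.seed(1)

item1 <- sample(LETTERS[1:10], 50, replace = TRUE)   # first item in each of 50 transactions

pool  <- letters[1:10]                               # the same 10 items in lower case
item2 <- character(50)
for (i in 1:50) {
  # exclude the item already bought as item1 before sampling the second item
  item2[i] <- sample(pool[-which(pool == tolower(item1[i]))], 1)
}

df4 <- data.frame(item1, item2)
head(df4)

library(igraph)                                      # library used for network data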
Now, let us have a look at the data frame that we have just created.
(Refer Slide Time: 34:11)
You can see the first 6 observations with the item names. We can consider each row to be one transaction: row number 1 is a transaction where items A and e were purchased together, row number 2 is the second transaction, and so on. This kind of transaction-based data set is mainly applicable to association rules. Now, let us move on.
From this data frame we will try to create network graph data. The function graph_from_data_frame can be used; we need to pass the data frame and set the directed argument. In this case we are not trying to create a directed graph, so it has been set to FALSE. Let us execute this line. Now, V() refers to the vertices of the graph, and next comes the labelling of those vertices. So far we have created the graph; if you want to see what we have done,
(Refer Slide Time: 35:09)
you can see this graph has 20 vertices and 50 edges, and those edges have been displayed here. Now, let us label these vertices; the labelling can be done using their names themselves. Let us execute this.
Now, there are 2 groups, because in a particular transaction we have the first item and the second item, and we are trying to put them into different groups: vertices 1 to 10 are type 1 and vertices 11 to 20 are type 2. Let us execute. What we are trying to understand is what generally happens in association rules: what goes with what. If item A is purchased, is item B purchased or not? That kind of association, whether one item is purchased along with another, is what we are going to look at through the network graph.
We have created the 2 groups; then we randomly generate the coordinates where these vertices are going to be plotted. This is the creation of the layout coordinates, again done randomly, so x and y coordinates for all the vertices have been created. For the shape of the vertices we have selected circle, and the colour grey. Now we come to the edges: the edge colour selected is black.
Sometimes there might be multiple lines between 2 particular items, because the same pair of items can be bought by more than 1 customer and so there could be more transactions of that kind. We are trying to represent a larger number of transactions through the edge width.
Now, if there is more than 1 edge between 2 vertices we need to remove the duplicates; we are going to do this using the simplify function with remove.multiple set to TRUE, so those extra edges are going to be removed.
We want to use the number of edges that existed between 2 vertices as a weight, because we want to use it as the width of the remaining edge; that is what is being done here. Now let us come to the vertices. The size of a vertex is being defined by the number of edges coming into or going out of it, and we compute that using this code. To have a better visualization we have added 10 to the size of each vertex. Now let us change the parameter settings, the margins, and plot; you can see this.
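Putting the steps just described together, here is a rough sketch rather than the exact lecture script; it assumes the data frame df4 created earlier and that all 20 item symbols occur in the 50 transactions:

library(igraph)

g <- graph_from_data_frame(df4, directed = FALSE)
V(g)$label <- V(g)$name                          # label the vertices by their names
V(g)$type  <- c(rep(1, 10), rep(2, 10))          # group 1 = first items, group 2 = second items

# random layout: the x ranges keep the two groups apart, y spreads the vertices out
lay <- cbind(c(runif(10, 0, 5), runif(10, 10, 15)), rep(seq(1, 10), 2))

V(g)$shape <- "circle"
V(g)$color <- "grey"

# collapse multiple edges between the same pair of items, keeping their count
E(g)$weight <- 1
g1 <- simplify(g, remove.multiple = TRUE, edge.attr.comb = list(weight = "sum"))
E(g1)$color <- "black"
E(g1)$width <- E(g1)$weight                      # wider edge = pair bought together more often
V(g1)$size  <- degree(g1) + 10                   # vertex size from its number of edges, boosted by 10

par(mar = rep(0.1, 4))
plot(g1, layout = lay)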
(Refer Slide Time: 38:26)
You can see group 1, the first items that were purchased by different customers, for example A, F and E, and, for each transaction, the second item that was purchased by the same customer. The differences in line width signify that a particular pair of items appears together in more transactions, and similarly a bigger vertex reflects the involvement of that particular item in a larger number of transactions.
So, a network graph can really be helpful in association rules when we are trying to understand the relationships between different items. We will stop here; in the next class we will start with hierarchical data.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 12
Specialised Visualization Techniques-Network Graph
Welcome to the course Business Analytics and Data Mining Modelling Using R. We have reached the last part of visualization techniques. In the previous lecture we had started our discussion on specialised visualization, and we will continue from there. In the previous lecture we started with network graphs, that is, for network data. We will start from there, then cover tree maps, which are mainly for hierarchical data, and then map charts, which are for geographical data. Till now, before specialised visualization, we were mainly dealing with cross-sectional data or time series data.
These types of data, network data, hierarchical data and geographical data, call for different specialised visualizations and come as different kinds of data sets as well. We will discuss them in more detail as we go along.
Let us open RStudio. For the network graph we are going to use this network data, which, as we also talked about a bit in the previous lecture, is mainly in the association rules context. So, we need to understand a bit more about association rules.
Generally, transactions are in this format: transaction IDs 1, 2, 3, 4 and then the items. When a customer visits a retail store, they generally purchase a few items. Let us say these are the items that were purchased by the first customer, reflected in transaction ID 1; another customer might purchase C and D, another might purchase D and A, and another might purchase B and D.
There are 4 items under discussion for these 4 transactions. They are being purchased by different customers and the transactions are being recorded; you can see here which items have been purchased in each transaction. Association rules mining is generally about finding out what goes with what.
More detailed discussion on association rules will come in a much later lecture, when we will devote much more time to it. For now, we can understand that association rules are about what goes with what: we try to identify which item is being bought along with which item.
With that, we can plan our store layout, promotional offerings and other things; we can bundle some of the items to boost our sales. Those kinds of things can be done if we are able to identify such associations. In this particular case, while discussing network data and network graphs, we are interested in understanding a transaction-based data set in a particular format, where we want to know which item was purchased first and which other item was purchased along with it in the same transaction; similarly for C and D, D and A, and B and D.
This information can also be depicted through a network graph, and from there we can make some sense of it; if we are dealing with a large amount of data, visual analysis can be really helpful.
In the example we are doing here we have just 4 transactions and 4 items, but if we were dealing with a large amount of data, visualizing it as a network graph would be much more helpful for understanding what is happening and for identifying patterns, and that can be really helpful for our association rules mining. We will do just that.
What we are going to do is create a bipartite graph, that is, a two-mode graph. In this case, as I mentioned, we have item 1 in a particular transaction and item 2, the second item purchased in the same transaction; these are the 2 types that we are going to create. First, let us create the network data; this is hypothetical data. Let us say we have 10 items; let us execute this particular code with the letters A to J.
(Refer Slide Time: 06:03)
These letters represent different items. Let us execute this code and create item 1; these are the items being purchased in different transactions. The second one uses the same items, but in lower case; this is mainly for coding purposes. Whether an item is represented in upper case or lower case, it means the same item; we just want to differentiate between what was purchased first and what was purchased along with that first item.
Here we are trying to create a data set where, if an item has been purchased first, it should not appear again as the second item. A customer might of course buy multiple quantities of the same item, but we are not interested in quantity; we are interested in finding out, if A is bought, what second item is bought along with it. That is why we need to exclude it in our coding. Let us create the variable item 2; you would see I am running this loop from 1 to 50, so 50 observations are going to be created, just like item 1, but the same item cannot appear again: if the first item is A, the second item could be anything from b to j but not a. That is what we are doing in the code here. You would see tolower(item1[i]), where item1[i] is converted to lower case and compared with the pool items. Let us create the pool items; these are a to j in lower case, and we eliminate the matching one when we do the sampling for item 2, so that we get a proper item 2 vector.
Now, let us create the data frame of these 2 variables. You would see the data frame has been created with 50 observations of 2 variables. igraph is the library that is generally used for network data and network graphs; let us load this library.
Now, from this data frame we are trying to create the graph. The function being used is graph_from_data_frame, and because it is not a directed graph that we want to create, directed has been assigned FALSE. Let us execute this.
You would see that g has been created in the environment section; this is an undirected graph. If we look at the details of the graph at this point, you would see there are 20 vertices and 50 edges, and you can actually see the edges: these are different item pairs, g and d bought together, then e and j, then d and c, then h and j, and so on. Now let us label these vertices; they can be labelled using their names. Let us execute this line; this is done.
Now, as we said, we are trying to create 2 groups because we want a bipartite graph. That is being done through the type attribute, V(g)$type: type 1 is for the item 1 vertices and type 2 is for the item 2 vertices. We want these 2 different groups so that we are able to create our network graph later on. Let us do this.
(Refer Slide Time: 10:53)
Now, we are trying to create a network layout, that is, how the graph is going to be laid out when depicted.
(Refer Slide Time: 11:17)
All these items are going to be represented by vertices, perhaps circles, arranged as type 1 and type 2: type 1 for item 1 and type 2 for item 2. The elements A, B, C, D are laid out in this fashion on one side and similarly on the other; A could be connected with c, C with d, depending on the transactions that we have. This is the kind of network graph we want to create so that we are able to visualize the data.
Now, using x and y coordinates, we are trying to generate coordinates for all the vertices; runif is the command we are going to use. Because there are 10 items of type 1 and 10 items of type 2, you would see that the range 0 to 5 has been given for one group and 10 to 15 for the other; some space is being left between these 2 groups in the layout because this is a bipartite graph, so that we can easily visualize the vertices or nodes. That is for the x coordinate; for the y coordinate, some spacing between the different vertices is being generated using the sequence function.
Once we have done this layout planning, we can decide on the shape of the nodes. In this case we are going to select circle; the vertices will be in this shape, though we can have squares and other shapes as well.
For this exercise we are selecting circles. Now, there could be many edges between 2 particular nodes if we simply depict the transactions as a network graph, because some other customer might buy the same pair, say B and D, again; such transactions would be there. Therefore, between B and D there could be multiple edges, but in the network graph we do not want to show them that way. We want to show just one edge, but change its width: a much wider edge representing a larger number of connections. That is how we want to show it, so we need to find the multiple edges between 2 vertices.
This is what we are trying to do using the count_multiple function on the graph g that has already been created. If we look at the resulting edge weights, you would see that for the 50 edges we have now, some counts are 2, some are 3, and in this fashion we can find out how many edges run between the same pair of vertices.
Now, we want to remove these multiple edges. This we can do using the simplify function; you would see there is a remove.multiple argument, which we are setting to TRUE, and that is going to remove the multiple edges. Let us execute this line, and in the next line you would see the following.
(Refer Slide Time: 16:00)
Now we are trying to assign the width, because we want the edges to be wider where multiple edges were present. You would see that the weight has been assigned to the edge width of the simplified graph g1, and 0.5 is a factor we use to control the scale so that the network graph looks visually more attractive; 0.5 is just a scaling choice from our side, and depending on how the graph comes out we can change it.
Once simplify has been run, if you are interested in looking at the weights, you can see there are many edges for which the weight value has increased, 4 for some, and 9 is also there. The weights have increased because we have removed the multiple edges and carried their count into the width through the weight.
Now, the same exercise we can do for the vertices. The size of a vertex is also going to be based on the number of edges coming into or going out of that particular node; that is what we are trying to do here, finding that count for every vertex.
Let us execute this code. The 4 here is again a boosting factor for the size, so that the vertices have a size which reflects the number of edges coming into or going out of them, while the items represented by a to j, in capital or small letters, are also clearly visible.
Now, let us come to the colours; we are taking 2 colours: the vertices are going to be coloured grey and the edges black. Let us do this, set the margin parameter to 0.1 on all sides, and plot. Now, let us zoom in; this is our graph.
From this graph you can see that items like g and d have a much greater size; similarly for e and h, there are many more edges coming into these vertices. You would also see that some edges are much wider because they are involved in more transactions; for example, the edge between e and i is quite wide in comparison to the others, because these 2 items have been bought together more often, and the same is reflected in the network graph.
Similarly for c and e: the size of the smaller e vertex is also on the higher side, since e is anyway involved in more transactions in the hypothetical data set that we generated, and c and e are purchased together many times, which is reflected in the width of the edge between c and e. If you want, you can have a look at the data frame that we created for this exercise, df5; the same thing would be reflected in the transactions. For example, you can see in transaction 50 that c and e appear together, and e appears in many transactions, in either the item 1 or the item 2 column.
(Refer Slide Time: 21:11)
e can also be seen in many records, and here again we can see c and e together; the same thing can be verified here.
Now, let us move to the next specialised visualization, which is about hierarchical data. Sometimes we might have to deal with data that is hierarchical in nature: for example, in a university there are departments, and within departments there are labs; similarly, in a business organisation there are different verticals and within them different departments.
Here is a hypothetical data set. You would see there are 5 columns in this particular Excel sheet: item category, sub category, brand, price and rating.
For item category we have electronics, furniture and clothing. For sub category, within electronics we have mobile accessories, computer accessories and wearables; within furniture we have living room furniture, dining room furniture and bedroom furniture; and within clothing we have clothing for women, men, girls and boys. You can see that for every category there are a few sub categories.
A hierarchy can easily be seen in the data. Then there are brands; of course, for every sub category there are going to be multiple brands, which is another level of the hierarchy, and for each brand there are going to be different products. We are not covering the products here, but for each brand we have given some prices which are reflective of the different products, and the ratings given by customers are also available.
We are going to use this hierarchical data to plot tree maps and see what we can understand from those maps about the data, and how it can help in different tasks like prediction, classification and clustering. Let us import this particular data set.
Let us load this library first; the library has been loaded. Let us execute this line and import the data set; you would see df6 has been imported with 30 observations and 5 variables, the same Excel file that we saw just now.
(Refer Slide Time: 25:03)
Let us execute this line as well and look at the structure of the data set: you can see item category is a factor, sub category is also a factor, as is brand, and then we have 2 numerical variables, price and rating. The library that we require for tree maps is the treemap library; let us load this particular library.
(Refer Slide Time: 26:00)
Let us load the treemap library; it has been loaded.
Let us look at the Excel file again; you would see the following.
(Refer Slide Time: 26:55)
If you look at the price column, there are a few items which have very high values; for example, in furniture the values range from 20000 to 90000, whereas the other records that we see are of lower value, up to 5000, around 1000 or even less than 1000.
There is quite a big gap between the different price values, and when we create the tree maps we will see that this is going to impact the way the tree map is drawn. So, we want to reduce the scale of some of these values, especially for the purpose of creating the tree maps. In tree maps, essentially, we are going to be creating different rectangles.
(Refer Slide Time: 27:59)
Tree maps are generally in a rectangular format. This kind of rectangular plot is going to be created: you are going to have the item categories, item 1, item 2, item 3, and within each of them the sub categories, sub category 1, 2, 3, and so on.
Then, depending on how you define the size of these rectangular zones, the way you colour them and the shade of the colour, the plot is going to convey some information, as we will see through an exercise. But we do not want some of the rectangular regions to dominate and shrink the other rectangles so much that the visual perception is lost; we do not want to lose the smaller rectangles, so we need to control for that. If there is too big a gap in the range of the variable used for sizing the rectangles, that could be problematic. What we are doing in this particular line is computing the rectangle size from the price value; price is going to be used to determine the size of the rectangular regions in the tree map.
For any value which is greater than 5000, we reduce that price value: beyond 5000, the value is divided by 10, and that gives the new value which determines the rectangle size. If the value is less than 5000, we keep it as is. Let us create this particular variable and add it to the data frame. Now let us again reset our par margins to 0.1 on all 4 sides. treemap is the function that we are going to use.
There are many arguments that can be passed to this treemap function; if you are interested in more detail you can find them in the help section.
You can see the treemap help page, and you would see there are many arguments that can be used to design your tree maps.
Some of them we are going to use in this exercise. The first argument is the data frame that we need to pass to the function, in our case df6. The second is index: the indexing of a tree map depends on the hierarchy that we have, which here is item category, then sub category, then brand. That has to be passed as a character vector, which is what we have done; index takes a character vector.
The next argument that we are using is vSize, which gives the size for the rectangular regions to be created; we have already created the variable rec.size for this and made sure that the sizes are appropriate, so that the smaller rectangles are not overcrowded or reduced to an insignificant size because of the bigger rectangles.
Another argument is vColor: for colouring the different rectangular regions we are using the rating column, so the ratings of the different products are going to be used to create different shades of a particular colour. type is another argument of this function; "value" is going to be used here, so the actual values determine the shading of the tree map. Then the aggregation uses mean, so mean values are going to be used when aggregating.
The colour scheme is grey; we are taking 4 levels of grey. The font sizes of the labels follow the hierarchy of the data: 11 for item category names, 9 for sub category names and 6 for brand names. We are not giving any title, and we do not want a legend for our tree map. Let us execute this particular function; this is the tree map that we have been able to generate.
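A sketch of this tree map call, with assumed column names (Item_Category, Sub_Category, Brand, Price, Rating) and one plausible reading of the price-capping rule described above:

library(treemap)

# cap very large prices so that furniture does not dwarf the other rectangles
# (this exact formula is an assumption about the rule described in the lecture)
df6$rec.size <- ifelse(df6$Price > 5000, 5000 + (df6$Price - 5000) / 10, df6$Price)

par(mar = rep(0.1, 4))
treemap(df6,
        index           = c("Item_Category", "Sub_Category", "Brand"),  # the hierarchy
        vSize           = "rec.size",            # rectangle size from the capped price
        vColor          = "Rating",              # shading driven by customer rating
        type            = "value",
        fun.aggregate   = "mean",                # aggregate by mean within each rectangle
        palette         = gray.colors(4),        # four levels of grey
        fontsize.labels = c(11, 9, 6),           # item / sub category / brand label sizes
        title           = "",
        position.legend = "none")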
(Refer Slide Time: 33:28)
Let us look at it. You can see that one particular category, furniture, is more dominant because of its higher values, as we anticipated. Living, dining and bedroom are its sub categories, and then come the brand names; the rectangular regions are based on average price, since the aggregation uses the mean value. The shading is based on the rating: if a particular brand is rated highly by customers, the colour intensity is on the higher side, so a darker grey means the brand has been rated highly, while a lighter grey means it has been rated poorly. You can see this in the tree map. Now let us move to the next part, which is geographical data.
To depict geographical data we generally use a map chart. We are going to use another data set; this particular data set has information about internet inclusiveness and the corruption perception index, and we are going to depict this information using a map chart.
(Refer Slide Time: 35:09)
Let us look at the data set first. The first column is the country; we have different country names here, then an index reflecting how inclusive the internet is in that country, and then the corruption perception index.
We are going to create map charts based on these index values and try to compare the internet index across different countries with the level of corruption, to see whether there is any link between the 2; we are going to do that through map charts. Let us import this data set. rworldmap is the library that we require to create these map charts; let us load it.
(Refer Slide Time: 36:38)
(Refer Slide Time: 38:50)
We need to create a device which is suitable for generating maps. We are going to create 2 maps, so we use 2 rows and 1 column. Let us create this device; you can see one device is active. Now, we are going to use the function joinCountryData2Map: the country-level indexes that we had in the data set are going to be joined to the map. nameJoinColumn is Country, the column that we had in our data set, and joinCode is "NAME", meaning we are joining using the name of the country.
The data is available in the data frame df7. Let us run this; you would see 71 country codes have been matched and 1 code failed because of a mismatch. Let us move on.
There is another function, mapCountryData. This function is going to create the map: the joined map data is passed in, along with the respective index column that we want to plot. The catMethod is "pretty", which controls how the values are binned into colours, the colour palette is given as greys from 7 down to 0, and we do not want a legend. Let us execute this code.
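A sketch of the two map charts, assuming a data frame df7 with columns Country, Internet_Index and Corruption_Index (the column names are assumptions):

library(rworldmap)

par(mfrow = c(2, 1), mar = rep(0.5, 4))

# join the country-level data onto the world map using the country names
map_df <- joinCountryData2Map(df7, joinCode = "NAME", nameJoinColumn = "Country")

# map 1: inclusive internet index (the palette may be adapted to the number of categories)
mapCountryData(map_df,
               nameColumnToPlot = "Internet_Index",
               catMethod        = "pretty",
               colourPalette    = gray(7:0 / 7),   # light to dark grey, darker = higher value
               addLegend        = FALSE,
               mapTitle         = "Inclusive Internet Index")

# map 2: corruption perception index
mapCountryData(map_df,
               nameColumnToPlot = "Corruption_Index",
               catMethod        = "pretty",
               colourPalette    = gray(7:0 / 7),
               addLegend        = FALSE,
               mapTitle         = "Corruption Perception Index")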
You would see a map has been created in this device; let us execute the other one, which represents the corruption index. Another map has been created, and now you can compare the 2 maps. In the first one, for the internet index, you would see that the US and Canada are in a higher intensity colour: they have a more inclusive internet, and they also have low levels of corruption, with the higher intensity in the second map reflecting a low level of perceived corruption. In this way you can actually visualize the data.
For example, for India you can see the shade of grey for the inclusive internet index, while the corruption map is in a lighter shade: inclusive internet is there, but the corruption perception score is not at that level. For Russia, the inclusive internet index is much higher, but if you look at the corruption map there is much more perceived corruption in Russia.
From this we can see, in general, that where internet inclusiveness is on the higher side, corruption is on the lower side in those countries, with some exceptions such as Russia. We will stop here, and in the next lecture we will start our discussion on dimension reduction.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 13
Dimension Reduction Techniques- Part I
Welcome to the course Business Analytics and Data Mining Modelling Using R. We are going to start our discussion on the next topic, dimension reduction techniques. First, let us understand the context. In the discussion so far we have been talking about the different variables, the predictors, that we generally require to build a model. When we talk about data mining modelling we are usually dealing with large data sets, so sometimes there can be too many variables, and a subset of these variables might be highly correlated.
We are, of course, looking for some amount of correlation so that a link or association can be established which can be used for prediction, but if the correlation is on the higher side then the information overlap can be problematic: a few highly correlated variables can lead to spurious relationships because they are going to dominate the model.
For example, if we are doing regression analysis, their coefficients will dominate and the coefficients of the other variables might come down; they will dominate the influence on the outcome variable y. We do not want that, we do not want our model or its results to be spurious or meaningless, so we want to get rid of this highly correlated variable problem.
Dimension reduction techniques are used for this purpose. Then there are computational issues as well: if we are dealing with too many variables, computation might take more time, especially if you are running different variants of your techniques and there are many candidate models, so too much time would be required for your modelling exercise. For that reason also we would like to reduce the number of variables.
(Refer Slide Time: 02:35)
Another reason could be the cost of data preparation, exploration and conditioning. We talked about the kind of things we are required to do in data preparation, exploration and conditioning; that might take more time if there are more variables, for example the visual techniques that we covered might require more time, and so the cost of modelling would go up. That could be another reason. On dimensionality, we talked in previous lectures about the principle of parsimony, where we said that we want a model to have as few variables as possible and still be able to explain most of the variation in the outcome variable; that is the desirable property.
Therefore, we would ideally look to reduce the dimensionality of a model: with a large number of variables we always look to reduce their number, but we want to retain the important variables in the model, so we need to identify them, and we also want to get rid of problems such as some variables being highly correlated. Dimension reduction is also called factor selection, feature selection or feature extraction in some domains, especially machine learning and artificial intelligence, so other names are also in use.
(Refer Slide Time: 04:02)
So, what are the different dimension reduction techniques that can be used? These are the few that we are going to cover: domain knowledge, data exploration techniques, data conversion techniques, automated reduction techniques and data mining techniques. Domain knowledge is one that can be very handy: if you understand the phenomenon, the underlying theories that explain it, and the key variables and constructs, then keeping in mind the kind of task you have to perform, you can easily identify the key variables because you have been working in that area. The knowledge you have about the phenomenon, the variables, the constructs and everything else puts you in a better position to identify the key variables.
In that fashion you can reduce the dimension: you might not use the other variables in your modelling, and that is going to help. The same goes for removing redundant variables: if you understand your area, you will immediately see that certain variables are not useful, so you can get rid of them.
Sometimes, if you have been involved in the data preparation and collection exercise, you would also know that some variables might have redundant values because of the way the data was collected; those variables might not be suitable because errors will come into your model, the results you expect might not materialize, and your own arguments and theories might not be justified. So, we can also identify these redundant variables; if we can handle them we should do that, otherwise we can get rid of these variables as well.
Then there are measurement issues for variables: sometimes we also have to see, if we need to repeat the exercise, whether the variable can be measured again and whether it is easy to measure. That is from the project point of view: if you are running a project for your business organisation and there is some business goal and a related business analytics exercise to perform, you will also have to see whether data on a particular variable can be measured and collected again, that is, whether the exercise can be repeated or not. Those kinds of issues could also be there.
Next are data exploration techniques; some of these we have already discussed in previous lectures. For example, descriptive statistics, such as summary statistics, can also help us in reducing the dimension: sometimes, from the summary itself, we can identify the important variables and also the variables we can get rid of. Pivot tables in Excel can also be used to look at summary-level data and make further decisions about your modelling exercise. Correlation analysis can help you identify the variables which are correlated with the outcome and should be included in the model; highly correlated predictors can also be identified there, which helps you determine which variables need closer inspection. So, correlation analysis is also going to be quite helpful in reducing the dimension, as sketched in the example after this paragraph.
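As a small illustration of the idea, here is a sketch; df1 and Price are assumed names for the numeric used cars data frame and its outcome column.

num_vars <- sapply(df1, is.numeric)
cor_mat  <- cor(df1[, num_vars])            # pairwise correlation matrix

round(cor_mat["Price", ], 2)                # correlation of each variable with Price

# flag pairs of predictors that are highly correlated with each other;
# 0.8 is an arbitrary cutoff for closer inspection
high <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high[, 1]],
           var2 = colnames(cor_mat)[high[, 2]],
           corr = cor_mat[high])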
Then we have the visualization techniques that we covered in previous lectures. So, let us go through some exercises for these techniques; let us open RStudio.
(Refer Slide Time: 08:16)
Let us load the library that we have been using in every exercise. The used cars data set is the one that we are going to use for this exercise; let us import it again. We have already seen this data in a previous lecture, but let us look at the first few observations again. By now you should be familiar with this data set; the variables are brand, model, manufacturing year, fuel type, showroom price, kilometres, price, transmission, owners, airbags and C price.
Now, another variable of interest that we have been using in other lectures as well is age. We are going to compute age from the manufacturing year, as we did in previous lectures, and append it to our data frame. Let us take a backup; we are not interested in the first three variables, so let us drop them and have a look at the structure. These are the variables of interest, nine variables, one categorical and the others numerical. Now, let us look at the summary statistics.
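The following is a minimal sketch of these steps; the file name, the reference year and column names such as Mfg_Year are assumptions, not the lecture's exact objects.

```r
# A minimal sketch of the import and preparation steps just described
library(readxl)                                # assuming the data sits in an Excel file

df <- read_excel("UsedCars.xlsx")              # hypothetical path to the used cars data
df$Age <- 2017 - df$Mfg_Year                   # age from manufacturing year (reference year assumed)
df_backup <- df                                # keep a backup before dropping columns
df <- df[ , -(1:3)]                            # drop brand, model and manufacturing year
str(df)                                        # 9 variables: 1 categorical, the rest numerical
```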
Now, as part of the summary statistics we are going to create one more function, count blank. It counts the number of blank cells in a particular variable, that is, in a particular column, so that we can identify any missing values that are represented as blanks. If missing values are represented as NA or in some other format, we can likewise write our own function to find out whether there are any such values.
So, let us create this function; you can see that the count blank function has been created. Next, this is the data frame that we are going to build from the summary statistics we compute. Mean is one of them, then the median; you would see that the first column has been excluded because fuel type is categorical and these summary statistics are mainly applicable to numerical variables. So we have mean, median, then the minimum value and the maximum value, which together tell us the range, then the standard deviation, then the length of that particular column, that is, how many observations there are, and then the count of blanks as well, so that we can easily spot which variables have all their values and whether there are any missing values at all. Let us execute this line, find out more about these variables and look at the results.
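A sketch of the helper function and the summary table is shown below; column positions are assumed, with column 1 taken to be the categorical fuel type.

```r
# Count-blank helper and summary statistics table (column positions assumed)
countblank <- function(x) sum(x == "", na.rm = TRUE)   # number of empty-string cells in a column

num <- df[ , -1]                                       # numerical variables only
sumstats <- data.frame(
  Mean   = sapply(num, mean,   na.rm = TRUE),
  Median = sapply(num, median, na.rm = TRUE),
  Min    = sapply(num, min,    na.rm = TRUE),
  Max    = sapply(num, max,    na.rm = TRUE),
  SD     = sapply(num, sd,     na.rm = TRUE),
  Count  = sapply(num, length),
  Blanks = sapply(num, countblank)
)
round(sumstats, 2)
```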
Now, you can see that all these variables have 79 as their count, so there are 79 observations, and no variable has any missing values; the count blank number is 0 for each variable. So, we are on the safer side on these two counts.
Now, let us look at the minimum and maximum values: you can see the ranges for the different variables, SR price, price, kilometres and so on, and you can also look at the median and mean values. These numbers help you understand the data: the averages give you a sense of the centre of the data, while the minimum and maximum give you its range. You can also compare the average and the median: if the average is higher than the median, the data is probably right skewed; if the mean is less than the median, the data is probably left skewed. That kind of thing can also be understood from these statistics. This brings us to our next part, pivot tables. Pivot tables can also be used here; some level of interactivity can be achieved using Excel pivot tables, and combining categories and understanding the relationship between variables can be done with them, as we will see. So, let us open the Excel file.
First, you will have to select the data. So, let us select all these variables and all the observations; then you can go into the Insert tab, where the PivotTable option is available, and create the pivot table. We have already created one for you, so this pivot table is there.
If you look at the first pivot table, once you create it you get this kind of view, where all the variables are listed at the top right of your Excel file. Below that you have the report filter, column labels, row labels and values areas. In this particular example we have taken the transmission variable on the row side, and in the values section we have taken price, with a count of those values. So, if we look at this table, you would see that for transmission 0 there are 63 cars, and for transmission 1, that is automatic, there are 16 cars.
If you look at the other, more detailed pivot table, you would see that we now have two labels: kilometres on the row side and transmission on the column side, so transmission 0 and 1, and different ranges of kilometres have been created. Once you create such a pivot table you will have to group the values to get this kind of result; in your own time you can learn more about pivot tables.
So, kilometres has been binned into these groups and you can see the values for these bins across the two transmission categories; for price, the average of the price values has been taken, so the values you see in these cells are actually average prices, for example the average price of 5.8 shown in one of the rows. From these values you can easily spot whether some of the categories can be combined. For example, if KM were a variable that we wanted to convert into a categorical variable, decisions about how the binning should be done, the sizes of those bins and the number of such bins can be made here. For instance, if you look at the 79 to 89 range and the 89 to 99 range, the average prices for these two ranges are quite close, so we probably could have clubbed these two bins. An equivalent summary can also be produced in R, as in the sketch below.
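This is a rough R equivalent of the two Excel pivot tables, not the lecture's own code; the column names (Transmission, KM, Price) are assumptions, and the breaks should be adjusted to the actual kilometre range.

```r
# Rough R counterpart of the Excel pivot tables
table(df$Transmission)                                         # count of cars by transmission (0 = manual, 1 = automatic)

km_bins <- cut(df$KM / 1000, breaks = seq(0, 130, by = 10))    # bin kilometres into 10,000-km groups
tapply(df$Price, list(km_bins, df$Transmission), mean)         # average price per KM bin and transmission
```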
(Refer Slide Time: 15:59)
So, we are always looking to reduce the number of categories for a variable, that is, to combine some categories or drop some if required, and that kind of thing can be done easily using pivot tables. The next thing we can do is correlation analysis. As we discussed, correlation analysis can be helpful in identifying highly correlated variables. So, let us open RStudio and understand this through an exercise. Again we are going to use the same data frame, the used cars data set; the categorical variables are not considered for correlation tables because the variables have to be numerical. Here, cor is the function. The underlying basic statistics are something we have covered in our supplementary lectures, so you are advised to go through those videos. Let us compute the correlation between the variables.
(Refer Slide Time: 17:48)
You would see that in this matrix the rows and columns carry the same names; all the numerical variables have been selected: SR price, KM, price, owners, airbag and age. This matrix is symmetric about the diagonal, and all the diagonal values are 1 because a variable is 100 percent correlated with itself.
Now, we want to get rid of the upper triangle. We can do this through the upper.tri function: we set those entries to NA, and then print is the function that can be used to leave the NAs blank. We then have a correlation table showing just the lower triangular half, and we can see the numbers, roughly as in the sketch below. Looking at them, price and showroom price are very highly correlated, 96 percent; 0.96 is the correlation coefficient here.
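This sketch shows the idea; the subsetting to numerical columns is an assumption about the data frame's layout.

```r
# Correlation table with the upper triangle blanked out
num <- df[ , sapply(df, is.numeric)]     # correlation needs numerical variables
cormat <- round(cor(num), 2)
cormat[upper.tri(cormat)] <- NA          # drop the redundant upper triangle
print(cormat, na.print = "")             # print NAs as blanks, keeping the lower triangle
```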
Similarly, we can look at some other numbers, for example airbag and SR price at 0.43, age and KM at 0.53, and airbag and price at 0.35. So, there are some highly correlated pairs, price and showroom price being very highly correlated. These highly correlated numbers are easy to spot, and we can then consider whether some of these variables can be dropped and the dimensionality reduced. In this case, however, SR price is an important variable; it is correlated with price because the showroom price largely determines the resale price of a used car, which is why the correlation is so high. So, we do not need to get rid of it, this being a key variable whose relationship is with the outcome variable of interest, price.
Now, there is another function that can be used to do the same thing, symnum. symnum again displays a correlation table, as you can see here, but instead of the actual numbers it displays different symbols for different ranges. For example, any correlation value between 0 and 0.3 is left blank, while any value between 0.3 and 0.6 is displayed as a dot; you would see four dots there, so four correlation values fall in that range. This makes reading easier, because we are generally looking to identify the variables that are slightly on the higher side, above 0.3.
So, any correlation value above 0.3 marks a pair of variables we are interested in. Values between 0.3 and 0.6 are depicted by a dot, and there are four such values; 0.6 to 0.8 would be depicted by a comma, and we have no such value; 0.8 to 0.9 by a plus and 0.9 to 0.95 by an asterisk, and again we have none; but 0.95 to 1 is depicted by B, and we have one value in this category, price and showroom price. Since we are interested in identifying the highly correlated values, we can spot them immediately through these symbols, so this function can really help. A sketch of the call is given below. So, let us go back to the slides.
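A one-line sketch of the symbolic display: symnum()'s default cut-points (0.3, 0.6, 0.8, 0.9, 0.95) correspond to the ranges discussed above, with blank, ".", ",", "+", "*" and "B" as the symbols.

```r
# Symbolic view of the same correlation matrix
symnum(cor(num), abbr.colnames = FALSE)
```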
Now, another method for reducing the dimension is data conversion techniques. If we have a categorical variable with many categories, we can look at the data and identify categories that can be combined; combining categories, as we discussed, reduces the dimension. Similarly, if a categorical variable can be converted into a numerical variable, that can also help reduce the dimension. For example, suppose we have collected data on age groups rather than on the actual age of an individual, so each record belongs to a particular age group. In that case, depending on the number of age groups, that many dummy variables would have to be created for our modelling exercise, and that increases the dimension. We can avoid this because age is essentially a continuous variable: if there are many categories for age groups, we can treat age as a numerical variable by assigning the mid value of a particular range as the age of each individual in that range. This converts the categorical variable into a numerical one and thereby reduces the dimension, as in the small sketch below.
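A hypothetical sketch of replacing an age-group factor by range midpoints; the groups and midpoints below are made up for illustration.

```r
# Convert an age-group factor into a single numerical variable via midpoints
age_group <- factor(c("20-29", "30-39", "20-29", "40-49"))        # made-up survey data

midpoints   <- c("20-29" = 24.5, "30-39" = 34.5, "40-49" = 44.5)  # midpoint of each range
age_numeric <- midpoints[as.character(age_group)]                 # one numerical variable instead of several dummies
age_numeric
```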
Similarly, for a categorical variable, as we discussed in the previous exercises, if two groups show similar numbers with respect to our outcome variable, we can probably combine them. If one group has very few observations, we can drop it or combine it with the major category. That kind of combining can be performed using data conversion techniques. So, let us open RStudio and do the same thing through an exercise.
In this particular case we are again going to use the same data set. Age is the variable we have in our data set, the age of the used cars; let us convert it into age groups. df$Age was a numerical variable, so we use the as.factor function to convert it into a factor, or categorical, variable, and then we extract the labels of this categorical variable once it is created.
(Refer Slide Time: 23:59)
You would see that age groups have been created and the different labels can be seen in the environment or data section: 2, 3, 4, 5 and many others. For further coding we also need these groups in a numeric format, so we create another variable to store the same information. Then we want to create a bar plot, which we will see later on; for that we need the C price, the categorical price that we had created for this particular data set, broken down by age. So, let us do that.
There are two groups for categorical price, 0 and 1: used cars with a value of less than 4 lakh, and used cars with a value greater than or equal to 4 lakh. For each of these two groups, and for the different age categories, we need to create a variable and compute the corresponding values. The first, C price by age 1, gives the percentage of cars that belong to a particular age group and have C price equal to 0; similarly, for C price 1 we perform the same computation.
(Refer Slide Time: 25:50)
So, let us run this. You would see that two variables, C price by age 1 and C price by age 2, have been created; there were 9 age groups, and 9 values have been computed for each.
Now, let us create a matrix of these two variables. Once this is done, this matrix can be used to draw a bar plot. The x axis is again going to be categorical, as before, with the labels taken from the age groups. The limits have been specified appropriately: since these are percentages, 0 to 100 is the range for the y limit, and because there are 9 groups, the x axis range is set to accommodate them. A sketch of these steps is given below.
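This sketch reproduces the idea of the proportions and the stacked bar plot; the column names Age and C_Price are assumptions about the data frame.

```r
# Stacked bar plot of categorical price (0/1) shares by age group
df$AgeGroup <- as.factor(df$Age)
grp <- levels(df$AgeGroup)

# percentage of cars in each age group falling in price class 0 and in price class 1
p0 <- sapply(grp, function(g) 100 * mean(df$C_Price[df$AgeGroup == g] == 0))
p1 <- sapply(grp, function(g) 100 * mean(df$C_Price[df$AgeGroup == g] == 1))

m <- rbind(p0, p1)                            # 2 x (number of age groups) matrix
barplot(m, names.arg = grp, ylim = c(0, 100),
        col = c("grey30", "grey80"),
        legend.text = c("C_Price = 0", "C_Price = 1"))
```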
(Refer Slide Time: 26:57)
You would see that a bar plot has been generated. Here 0 is represented by the darker shade of grey and 1 by the light grey. For age groups 2, 3 and 4 the proportions are essentially the same, while for the other groups they differ slightly. Looking at the plot, age groups 2, 3 and 4 can probably be combined; similarly age groups 9 and 10 can be combined, age groups 5 and 6 can be combined, and age groups 7 and 8 can also be combined, mainly because the proportion does not change across them. Therefore, we can combine these categories.
So effectively, we can reduce the number of categories from the current 9 to 4, thereby reducing the dimension by 5. This kind of combining of categories can always be done, as we discussed. Now, let us do one more exercise, for time series data. We are going to create hypothetical data: quarterly equipment sales in rupees crores for each quarter between 2012 and 2015. Let us create this sales variable and build a time series from it. The frequency is 4 because this is quarterly data, the sales run from 2012 to 2015, and the time series has been created, roughly as in the sketch below. Now, let us plot this.
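The figures below are hypothetical placeholders standing in for the series used in the lecture; only the structure of the ts() call matters here.

```r
# Hypothetical quarterly sales figures (Rs. crore) for 2012-2015
sales <- c(10, 12, 11, 40,
           11, 13, 12, 45,
           12, 14, 13, 50,
           13, 15, 14, 55)

sales_ts <- ts(sales, start = c(2012, 1), frequency = 4)  # quarterly data
plot(sales_ts, ylab = "Sales (Rs. crore)")
```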
(Refer Slide Time: 29:15)
Now, let us have a look at this plot, starting with the 2012 data. As we saw in the data itself, there are four points per year: the first three quarters and then the fourth. It is always at the fourth point that a peak occurs, that is, in the fourth quarter there is a huge spike in sales, whereas the numbers for the first, second and third quarters are quite close to each other. Since we are looking to combine categories, and we have sales data for different quarters and are interested in analysing quarter-wise performance, we can probably combine the data for quarter 1, quarter 2 and quarter 3, because their performance in this time series is quite similar. We can group them into one category, and the peak quarter could be another group. This is how combining can be done even for time series data. We will stop here, and in the next lecture we will start our discussion on automated reduction techniques; the one we are going to cover in this particular course is principal component analysis, so we will start from there.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 14
Dimension Reduction Techniques- Part II Principal Component Analysis
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lecture we were discussing dimension reduction techniques and covered quite a few. Today's lecture starts with automated reduction techniques, and the particular technique we are going to discuss is principal component analysis, also called PCA. So, let us start our discussion on PCA.
Principal component analysis is mainly used for reducing the number of predictors. As we discussed in the previous lectures, the idea behind reducing the dimension is mainly to achieve parsimony, to follow the principle of parsimony, along with the other reasons we have discussed before.
The role of PCA is similar: it can be used for reducing the number of predictors and hence the dimensions, but it works only with quantitative variables. Only quantitative variables can be used under this technique; for categorical variables we have to rely on the other methods discussed in the previous lecture.
Now, when we are dealing with a large number of variables or a big pool of predictors, we might encounter highly correlated variable subsets. This situation is undesirable in many cases because some of these variables have information overlap: they might be measuring the same information, which can disturb the model, spurious relationships could appear, and the model might become useless.
Therefore, we want to get out of such situations, and principal component analysis can specifically help in this kind of scenario. What is the main idea? The main idea is to find a set of new variables that contains most of the information of the original variables. We do not want to lose information, because we want to retain the explanatory power that the model would have had using the original variables.
So, with the new set of variables we would like to retain most of the information and hence most of the explanatory power of the model. While the idea is to find a new set of variables, the point to note is that we are trying to reduce the dimensions; therefore, this set of new variables is going to be smaller than the number of original variables. A few other objectives could be eliminating covariation and multicollinearity. Any information overlap between two variables can be called covariation, and in regression modelling especially we might encounter the resulting problem of multicollinearity.
As we will discuss in the regression-related lectures, this covariation is not desirable because it leads to multicollinearity, but through PCA it can be eliminated. So, eliminating covariation and multicollinearity could be another objective.
Essentially, while we are looking for a new set of variables that is smaller than the number of original variables, thereby reducing the dimensions, we are at the same time trying to retain the explanatory power of all the variables put together. When we look for such a new set of variables, we are essentially redistributing the variability contained in the original set of variables.
So, let us go through an exercise using R; let us open RStudio.
We will go back to the section where we start our discussion on principal component analysis. Let us import this particular data set, the breakfast cereals data set, and discuss how it is going to be used for our exercise. You can see 35 observations of 18 variables. Let us open this data set in our environment itself. The first variable is the brand name, and you can see a few brand names here, then the product name, then specific details about these products such as the kind of packaging.
The weight depends on the packaging, the price is the corresponding price, and then we have the energy and the other contents or ingredients of that particular cereal. All those details, starting from protein, carbohydrate, sugar, fibre and fat, can be seen in this data set, and at the end, in the last column after the iron-related information, you would see that the customer rating is also there, that is, how customers have rated these particular cereals. All this data is based on the different cereal packets sold in Indian markets; we have selected a few of them and recorded the various details about these cereals.
We are going to use this data set for our principal component analysis. Let us first eliminate any columns having missing values, and then have a look at the rows of data.
The same inspection that we did by opening this file in our environment can be done through this command. You can see in this data frame that the details have been taken from packages of different weights. Since we are essentially going to compare these cereals later on through principal component analysis, we need details like price, energy, protein, carbohydrate, sugar and so on for the same weight, that is, for the same packaging size.
Let us look at the structure of this data frame: except for the brand name and the product name, all the variables are numerical in nature. Let us take a backup of the full data set. What we are going to do now is apply a function we have written that divides all the details, starting from price, energy and protein, by the weight, so that we get the details of all the cereals for the same weight or packaging, namely 100 grams. All these details are now going to be available per 100 grams. Let us execute these lines to get a new data frame. We had not included the customer rating in the earlier line, so let us combine that as well and look at the first 6 observations; you would see that the specific numbers have changed.
Now, all these numbers are for each cereal and for 100 grams of each one. A rough sketch of this rescaling is given below.
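This sketch shows one way the per-100-gram rescaling might be done; the file name and the column names (Brand, Product, Weight, Rating) are assumptions, not the lecture's exact objects.

```r
# Rescale the cereal details to a per-100-gram basis
cereal <- read.csv("BreakfastCereals.csv")                    # hypothetical file

nutrient_cols <- setdiff(names(cereal), c("Brand", "Product", "Weight", "Rating"))
cereal_100g <- cereal
cereal_100g[nutrient_cols] <- cereal[nutrient_cols] / cereal$Weight * 100   # price, energy, protein, ... per 100 g
head(cereal_100g)
```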
Now we can move ahead. Let us select two important variables out of this data set, energy and customer rating, apply our PCA on these two variables, and then proceed further. So, let us focus on energy and customer rating. These are the first 6 observations for energy and customer rating. What we are going to do now is plot a graph of energy against customer rating; let us first look at the ranges, and you can see the minimum and maximum values for energy.
(Refer Slide Time: 10:59)
Energy is in kilocalories, and the customer rating is a percentage between 0 and 100, so the x and y limits have been specified appropriately. Let us plot this graph.
This is the scatter plot we get. If we look at it, some of the observations seem to lie well outside the major chunk of values. We will treat them as outliers for our exercise and try to eliminate them, so that we end up dealing only with this main chunk of points.
Let us find these outliers. Most of them appear to have an energy value greater than 300, so let us identify those points; as you can see, they are observation numbers 32, 33 and 34, with energy values in excess of 300 kilocalories.
These cereals evidently have a high energy content, so we will not include them; we will only analyse the cereals whose energy values lie in a closer range. Let us eliminate these 3 points from our data frame and plot again, which gives a much tighter plot of rating versus energy, along the lines of the sketch below.
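A sketch of the outlier step, assuming Energy, Rating and Product as column names.

```r
# Spot and drop the high-energy outliers, then re-plot
outliers <- which(cereal_100g$Energy > 300)          # observations above 300 kcal per 100 g
cereal_100g[outliers, c("Product", "Energy")]        # inspect them (rows 32, 33, 34 in the lecture run)

cereal_sub <- cereal_100g[-outliers, ]
plot(cereal_sub$Energy, cereal_sub$Rating,
     xlab = "Energy (kcal per 100 g)", ylab = "Customer rating (%)", ylim = c(0, 100))
```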
(Refer Slide Time: 12:30)
All these points we can now visualize. If we look at them and try to draw a line through the cloud, it looks as though, as the energy value increases across the different cereals, the rating comes down slightly, which is expected.
In terms of variability, most of the variability can be captured by rating and energy themselves, because the points are quite aligned with the x axis and the y axis. Let us look at the mean values of these two variables, energy and customer rating. As we discussed in previous lectures, the mean represents the central value of a particular variable; the other values tend to lie around it, so it serves, in a way, as a representative value of that variable.
Now, let us look at the covariance matrix of these two variables. Let us compute the variance of energy, then the variance of customer rating, and also their covariance. The reason we compute all this is that, as discussed, these are the original two variables, energy and customer rating, and when we apply principal component analysis to them we will essentially be redistributing their variability; we want to retain as much information as possible while finding a set of variables that is smaller than the original number.
So, we examine the variability. If you are interested in the correlation between the two, the correlation coefficient comes out to about minus 0.45. Coming back to the variability: the total variability of the original variables energy and customer rating is 1804.333, the share contributed by energy is about 87 percent, and the share coming from rating is about 12 percent, as the sketch below illustrates.
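A sketch of the variability book-keeping behind that split, under the same assumed column names.

```r
# Variance, covariance and each variable's share of total variability
v_energy <- var(cereal_sub$Energy)
v_rating <- var(cereal_sub$Rating)
cov(cereal_sub$Energy, cereal_sub$Rating)      # covariance between the two variables
cor(cereal_sub$Energy, cereal_sub$Rating)      # about -0.45 in the lecture run

total_var <- v_energy + v_rating               # total variability of the original variables
100 * v_energy / total_var                     # share contributed by energy (~87%)
100 * v_rating / total_var                     # share contributed by rating (~13%)
```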
If we go back to the plot, you would see that it is along the x axis, represented by energy, that most of the variability is captured: the values are spread over a wide range along this dimension. Some variability also lies in the perpendicular, orthogonal direction, represented by the y axis and rating, so some variability is contributed by rating as well.
Since most of the variability seems to come from energy, we could simply get rid of the rating variable and use energy alone; we would anyway retain about 87 percent of the information and drop one dimension, rating. But is there a better way, can we retain even more information? That can be seen through principal component analysis. So, let us apply principal component analysis to energy and rating.
First, let us select these two variables into a new data frame, dfpca, and apply the function. prcomp is the function actually used to run principal component analysis in R; if you are interested in finding out more about it, you can go to the help section. A minimal sketch of the call is shown below.
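```r
# Unscaled PCA on the two selected variables, as in the lecture
dfpca <- cereal_sub[ , c("Energy", "Rating")]   # assumed column names
mod <- prcomp(dfpca)
summary(mod)        # proportion of variance: roughly 90% for PC1, 10% for PC2
mod$rotation        # the weights (loadings) defining the new directions
```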
(Refer Slide Time: 17:17)
You can see from the description that it performs a principal component analysis on the given data matrix.
So, we are going to use this function; let us run the code. Once it has run, many things have been computed as part of this function call. Let us look at the new PC directions and at the details stored in the mod object: you can see the different quantities computed by the prcomp function. If you look at the summary, two principal components have been computed. The standard deviation for PC 1 is about 40 and for PC 2 it is about 12.99. Looking at the proportion of variance, PC 1 explains roughly 90 percent of the variance and the remaining 9.36 percent is contributed by the second principal component.
So, from the earlier split of about 87 and 13 we have redistributed the variability to about 90 and 10. This is a small change, but in other scenarios, if we apply principal component analysis to other data sets or other variables, the situation could be different; it could go from a 60-40 split to an 80-20 or a 90-10 kind of redistribution. It depends on the particular variables. In this case the redistribution of variability goes from 87 and 13 to 90 and 10. Now, let us analyse further.
Now we have two new dimensions determined by these two principal components, PC 1 and PC 2. What we will do is add them to the scatter plot that was generated earlier. First we need to compute the slope and intercept to find out the directions.
For adding the PC directions to the plot, note that rotation is one of the values returned in the mod variable, and we are going to use this rotation value. These are nothing but weights; if you look at the rotation value, you can check the weights for the new dimensions z 1 and z 2. For PC 1 and PC 2 you can see the weights. If we want to compute the scores for the new direction PC 1, we can use these two weights, one corresponding to energy, minus 0.98, and another corresponding to customer rating, 0.18; for PC 2 the pattern is just the reverse.
Using these weights we can compute the new scores for these new dimensions. Earlier we had scores for energy and customer rating; the new scores are also pre-computed by the prcomp function and returned in the variable x. Let us look at the first 6 values; these are the pre-computed scores. If you want to see how these scores can be computed, we can take one example and compute the first score. In dfpca, the very first values of energy and customer rating are about 70 and 84, and the weights obtained after applying prcomp are as shown, so we can compute the first score in the following fashion.
Here mod$rotation[1, 1] is the weight we saw in the earlier table, the value minus 0.98, and we subtract the mean from dfpca[1, 1], whose value is 70.1. Similarly, the second weight, corresponding to customer rating, is 0.1835, and the second value, the rating dfpca[1, 2], is 84; it too is centred by subtracting the mean of customer rating and then weighted by this second weight. A sketch of the calculation follows.
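```r
# Reproduce the first PC1 score by hand from the weights and the centre
score_1 <- mod$rotation[1, 1] * (dfpca[1, 1] - mod$center[1]) +
           mod$rotation[2, 1] * (dfpca[1, 2] - mod$center[2])
score_1          # about -1.38 in the lecture run
mod$x[1, 1]      # the pre-computed score returned by prcomp, same value
```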
Let us compute this value; you would see that minus 1.38 has been obtained, and this is the first score for the first direction. You can see in the results that the same value, minus 1.38, is there. This is how we can compute the scores using the weights of the new dimensions. Now, let us add the directions to the plot. First, let us find the slope: it is nothing but a ratio of rotation values, a y value divided by an x value, and we are going to use with(), a function that lets us do computations within a particular environment.
The environment is determined by the first argument, mod in this case, so that we do not have to keep writing notation like mod$rotation or mod$center; we specify the environment, that is the data, in the first argument and can then access the variables directly and do our computation. That is what we are doing here: y divided by x, in this case the weights, which gives the slope.
Similarly, the intercept can be found, and if you are interested in the new centre, that can also be looked at.
(Refer Slide Time: 24:02)
These are the values, 67.47 and 77.5. If we go back to the scatter plot and zoom in, the new centre is going to be at 67.47, 77.5, somewhere around here. Using this new centre we compute the intercept; the slope we have already computed. Once that is done we can add the line. abline is the function that can draw a line given the intercept and slope, so let us plot this line; you can see a line has been plotted. This is PC 1, and if we compare it to the original x axis represented by energy, there is some angle.
Through this line even more variability is being captured, which is why we saw a jump from 87 to 90. The earlier variability captured by energy alone was 87 percent, which would correspond to a horizontal line parallel to the x axis; now there is a slight slope, and this direction captures even more variability, hence the jump from 87 to 90.
Let us plot the second direction. Because the two directions are perpendicular, orthogonal, the new slope can be computed as minus one divided by the slope of PC 1, or alternatively the rotation values can be used. Let us compute this, find the intercept for this line as well, and add it to the plot; a sketch of both lines is given below.
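```r
# Overlay the two PC directions on the scatter plot
slope1     <- with(mod, rotation[2, 1] / rotation[1, 1])   # rise over run for PC1
intercept1 <- with(mod, center[2] - slope1 * center[1])    # line through the new centre
abline(intercept1, slope1)                                 # PC1 direction

slope2     <- -1 / slope1                                  # PC2 is orthogonal to PC1
intercept2 <- with(mod, center[2] - slope2 * center[1])
abline(intercept2, slope2, lty = 2)                        # PC2 direction, dashed
```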
You can see that PC 2 has been added, so these are the two lines, and this is the redistribution of variability. Earlier we had the x axis and y axis represented by energy and rating; now we have PC 1 and PC 2. The redistribution of variability has happened: earlier it was roughly 87 and 13 across the two axes, now we have a 90 and 10 scenario.
The same thing can be seen numerically. From the new z 1 and z 2 values we can find the total variability, which is about 1804; if you go back and look at the earlier total variability that we computed as v 1 plus v 2, both are the same. So, the total variability is unchanged, but the redistribution has happened because of the change in the directions of the dimensions.
The contributions of the new dimensions are now 90.6 and 9.36 percent. Principal component analysis can now be applied to all the numerical variables that we have in the data set. Let us have a look at the data set again; this is the data set we saw earlier. Till now we applied principal component analysis on just two variables, energy and rating, and now we can apply it to all the numerical variables. In this data set the relevant variables are all numerical, but two of them, trans fatty acids and cholesterol, have mostly zero values, so we will not include them in this analysis; we will eliminate them and take the other variables for the principal component analysis. I will stop here, and we will apply principal component analysis to all the variables in the data set in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 15
Dimension Reduction Techniques- Part III Principal Component Analysis
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lecture we were discussing dimension reduction techniques, specifically principal component analysis, and we applied principal component analysis on the breakfast cereals data set using two particular variables, energy and customer rating.
Today's lecture starts with applying principal component analysis on almost all the variables available in the data set.
As we said in the previous lecture, we will include all the numerical variables except two, which have zero values in almost all the cells; we will eliminate those. Different brands generally try to indicate that these two quantities, trans fatty acids and cholesterol, are 0 in their product so that they can market it better; that is why the information has been recorded, but since most of the values are 0 it is not useful for us.
(Refer Slide Time: 01:30)
So, let us select the appropriate data frame. If you want to look at the selected variables, dfpca2 is the new data frame we have subsetted here. You can see that almost all the variables originally available in the data set have been taken for principal component analysis: price, energy, protein, carbohydrate, dietary fibre, fat, saturated fatty acids, monounsaturated fatty acids, polyunsaturated fatty acids, sodium, iron and, last, the customer rating.
So, let us apply principal component analysis; prcomp is again the function we are going to use. Let us execute this code and look at the summary. You would see that 13 principal components have been produced. If we count the number of variables in the new data frame on which we applied principal component analysis, using the names function for example, there were 13, so 13 original variables give us 13 principal components.
Now let us look at the values we got after applying principal component analysis. For PC 1 the proportion of variance is 68 percent, and for the second principal component it is 29 percent; combined, 68 and 29 come to almost 97 percent.
So, these two principal components, PC 1 and PC 2, contribute almost 97 percent of the variability that was there in the original variables, and we can eliminate the other principal components. Since only these two capture most of the variability, the dimension can be reduced from 13 to 2. The proportion of variance captured by the other principal components is less than 3 percent, PC 3 at about 2.8 percent and the rest totally insignificant.
We can take the first two principal components as two dimensions and therefore reduce the dimensionality from 13 to 2. So, let us look at other values.
Let us look at the rotation weights; these are the weights.
(Refer Slide Time: 05:03)
Let us look at a nicer version of this, rounded to three decimal places. Now, consider the first principal component, PC 1, and see which original variables contribute to it: price stands out at minus 0.99, so the first principal component is mainly determined by price, with the other original variables contributing only insignificant amounts. If we look at the second principal component, the dominant value is energy at minus 0.968.
So, the main contribution to the second principal component comes from energy. PC 1 is essentially a price-plus kind of variable and PC 2 an energy-plus kind of variable, with most of the variability captured by price and energy. If you try to make sense of it, we might have the perception that Indian consumers, when they buy breakfast cereals, generally go by price and energy, and the same is reflected here in a way.
The proportion of variance explained by the other principal components was anyway quite small, so it does not make much sense to look at the contribution of the original variables to them. PC 1 and PC 2 are like price plus and energy plus. Let us move forward: what is the problem in this case? There is one problem with the analysis we applied. The variables we are talking about are price, measured in rupees, energy, measured in kilocalories, and protein, carbohydrate and the other contents, measured in grams and milligrams. We have different measuring units, yet the data fed to the principal component analysis was not normalized.
Maybe that is why just two principal components dominated most of the variability. So, let us apply principal component analysis after normalizing all the numerical variables and study the result. Let us run another principal component analysis: this time we again use the same data frame, dfpca2, but now scaling is being done; scaling is provided as an argument to the function, as in the sketch below. Let us execute this.
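A minimal sketch of the normalized run; dfpca2 is assumed to hold the 13 numerical cereal variables selected above.

```r
# PCA with normalization: scale. = TRUE standardizes each variable first
mod2 <- prcomp(dfpca2, scale. = TRUE)
summary(mod2)                           # the variance now spreads over about 7 components
round(mod2$rotation[ , 1:3], 3)         # which originals drive PC1, PC2 and PC3
```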
Let us look at the results; these are after normalization. Normalization is something we have talked about in previous lectures: sometimes some variables, because of their scales, can dominate or unduly influence the results, which is not desirable in most scenarios. Therefore, normalization is a recommended step before going ahead with building or running your model.
These are the results after normalization. If we look at the proportion of variance, we now see that 50 percent of the variance is captured by PC 1, 16.5 percent by PC 2 and 10.3 percent by PC 3, so PC 3 has a much bigger role; PC 4 also plays a bigger role at about 7.5 percent, and similarly PC 5 about 6 percent, PC 6 about 4 percent and PC 7 about 3 percent, with the rest after that. You can see that the first 7 principal components together capture more than 90 percent of the variability in the original variables. So the dimensionality reduction we thought we were getting when we ran principal component analysis without normalization, from 13 down to 2, was not the actual case.
In the first principal component the largest contribution comes from energy, but it is not the only dominant one; you would see similar numbers for protein, carbohydrate, dietary fibre, fat and the other contents, all contributing in a similar fashion. This is how PC 1 is determined. If you look at PC 2, sugar dominates this component with a contribution of about 0.52; looking at the other numbers, sodium is also there, and in fact sodium's number is bigger than sugar's, so sodium, sugar and then iron dominate PC 2.
Similarly, for PC 3 the biggest weight comes from price at minus 0.79, with much smaller weights after that, including about minus 0.31 coming again from sugar.
If we look at principal component one, the weights also carry a minus sign for energy, protein and the rest. So PC 1 mainly signifies a component determined by energy, protein, carbohydrate, fat, saturated fatty acids and many of these contents; PC 2 is mainly determined by sugar and sodium; and PC 3 is mainly determined by price. PC 3 could therefore be called price plus, PC 2 could be called sugar and sodium, and PC 1 could be termed health plus. These could be the new names for the different principal components, the new dimensions, and since we require the first seven principal components, a similar exercise will have to be done for the other principal components as well.
Now, what will we do? We will plot the new dimensions that we have just computed. Let us look at principal component 1 and principal component 2; going back to the earlier results, the proportion of variance explained by PC 1 and PC 2 is 50 and 16.5 percent. So, let us plot these two dimensions and then compare with the original plot we made earlier. Let us look at the range of the first variable, minus 6.96 to 23.02; you can see the limits have been specified appropriately. The range for the second variable is minus 4.14 to 1.49, and again the limits have been specified appropriately. Let us change the margins and the character expansion through the par function, plot, and zoom in to this particular plot; a sketch of the plotting step is given below.
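A sketch of the score plot, assuming the product names align row-wise with the rows used in dfpca2.

```r
# Plot the cereals on the first two normalized principal components
z1 <- mod2$x[ , 1]
z2 <- mod2$x[ , 2]
plot(z1, z2, pch = ".", xlab = "PC1 (z1)", ylab = "PC2 (z2)",
     xlim = range(z1), ylim = range(z2))
text(z1, z2, labels = cereal_100g$Product, cex = 0.6, pos = 3)   # label each point by product name
```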
(Refer Slide Time: 14:51)
A smaller marker has been chosen for generating this plot, and you would see that these are the points for z 1 and z 2.
This is quite different from the earlier case, when we applied principal component analysis to energy and customer rating and saw the redistribution go from 87 and 13 to 90 and 10, and that too without normalization. With normalization we see this kind of scenario: the variability captured by just these two dimensions is somewhat less than in the previous two models we developed.
If we want to analyse further, we can label all these points by their product names.
(Refer Slide Time: 16:02)
We can now analyse these new dimensions further. All the points have been labelled by their product names. For example, as we move along the z 1 direction, we can look at the weights contributing to that direction.
Consider PC 1. As we move along the z 1 direction from left to right, the energy content actually decreases, and protein also decreases, as do carbohydrate, fibre and so on. So, as we move from left to right, the less healthy options tend to be clubbed on the right side of this plot.
We can make a similar kind of analysis for the second direction. Sugar, sodium and iron dominated the second principal component, so as we move from bottom to top in this direction, these three contents, sugar, sodium and iron, decrease.
Therefore, the healthier cereals would lie roughly in the middle and the left mid-section of the plot; that is where the healthier options probably are.
Let us go back to our discussion. Now we look at how principal component analysis can be used in the data mining process, that is, how we actually apply principal component analysis in our data mining modelling and use it to reduce the dimension. The first step is to apply PCA to the training partition; we will have a training partition, a validation partition and a test partition. Once PCA has been applied to the training partition, we will have new predictors: the different principal component score columns become the new predictors.
The original variables will not be used further; the new score columns, which we saw through the mod$x value, can be obtained for all the principal components and they become the new predictors. As we have been discussing, we will also have new names for them, for example health plus, price plus and energy plus, as names for the new principal score columns.
Once we have these new predictors, we are ready to build our model on the training partition. But how do we evaluate the model? For that we require the validation data set and the test data set. The principal component weights obtained when we applied PCA to the training partition can be used to compute the new variables from the validation partition: we apply the weights computed on the training partition to the validation partition to obtain new scores, and these new predictors can then be used to test the model, refine it, and finally test it on the remaining partition, the test partition.
Those are the steps involved. However, if we look at the way PCA is done, the relationship between the predictors and the outcome variable is not exploited; it is generally ignored. That is one limitation of principal component analysis: it does not take into account the relationship between the predictors and the outcome variable. This limitation can be overcome using some other methods.
So, we come to our next category of dimension reduction techniques, data mining techniques. Some of the data mining techniques that we will be covering in more detail in the coming lectures can also be used to reduce the dimensions. The first one we are going to discuss briefly is regression models: we can apply some of the subset selection procedures using regression models.
For example, for a prediction task we can apply linear regression. Using the significance of the coefficients we get for the different variables, we can find out which are the important variables and get rid of the insignificant ones, and possibly also of variables with low coefficient values; some variables can be dropped even if they are significant, if their coefficient values are on the lower side.
Domain knowledge is also important here: sometimes, even though the coefficient value is on the lower side, the variable might still be important, but if that is not the case we can probably drop those variables as well. So, we can drop insignificant variables and some of the significant variables that carry low values.
That is how subset selection procedures based on regression models can be applied. For linear regression, the different subset selection methods that we will cover in coming lectures can be run to find the best subset that explains or fits the data. For a classification task we can apply logistic regression and adopt the same process: find the significant relationships, look at the coefficient values, and thereby determine which variables to drop and reduce the dimensions. We can also see which subsets explain most of the variance, which we can do through the multiple R-squared values. A hedged sketch of this screening idea is given below.
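This is only an illustrative sketch of the screening idea, not the lecture's own procedure; Price as the outcome and df as the data frame are assumptions.

```r
# Screen predictors using a fitted linear regression
fit <- lm(Price ~ ., data = df)
coefs <- summary(fit)$coefficients
coefs[coefs[ , "Pr(>|t|)"] > 0.05, , drop = FALSE]   # insignificant terms: candidates to drop
summary(fit)$r.squared                               # variance explained by the current subset
```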
Regression models can also be used for combining categories, and we can use p-values to do that. If there are a few categories that are insignificant, an insignificant category can be combined with the reference category and its dummy eliminated. If two categories have similar coefficient values, they have a similar influence on the outcome variable, and those categories can also be combined. So, much can be done after analysing the results.
Another technique that can be used for dimension reduction is classification and regression trees. In a coming session we will discuss classification and regression trees in more detail, but to give you an idea: under this technique we develop a classification tree for a classification task and a regression tree for a prediction task. When we build this model using the full pool of variables, a large number of variables, the result is a tree diagram.
The tree gives us the different classification rules or prediction values and incorporates the important variables used to build it. If a variable does not show up in that tree diagram, it can be eliminated directly. That is how the dimension can also be reduced: we can start from a large number of predictors and eliminate those that do not figure in the tree diagram, as the sketch below indicates.
Similarly, in the classification tree we can also combine categories if they fall on the same branch and there is a possibility of similar classification rules for both of them; in that case we can probably combine them. Likewise, for a regression tree, if a particular variable does not come up in the tree diagram, that variable can be eliminated from the analysis.
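A minimal sketch of this tree-based screening in R, assuming the rpart package and a hypothetical data frame df with a categorical outcome class (names are placeholders):

# Variables that never appear as splits in the fitted tree are
# candidates for elimination.
library(rpart)

tree_fit <- rpart(class ~ ., data = df, method = "class")
print(tree_fit)                                   # splits show the variables used

used_vars <- unique(as.character(tree_fit$frame$var))
used_vars <- setdiff(used_vars, "<leaf>")         # drop the leaf marker
setdiff(names(df), c("class", used_vars))         # predictors not used by the tree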
More detail on these two techniques, regression models, whether linear regression or logistic regression, and how subset selection procedures can be developed using them, and how classification and regression trees can be used for dimension reduction, will be given when we come to the respective lectures.
Now, one difference between principal component analysis and these models, regression models and classification and regression trees, is that these models account for the relationship between the predictors and the outcome variable. In a linear regression model, it is the relationship between the outcome variable and the predictors that is incorporated in the modelling process. Similarly, for logistic regression the relationship between the predictors and the outcome variable is incorporated, and any subset selection procedures that are used later on are based on those relationships. So, that is one big difference.
Similarly, in classification and regression trees, while we develop the tree diagram it is the relationship between the variables that helps us reach the terminal nodes, or leaf nodes, and do our classification or prediction. Therefore, it is the underlying relationship between those variables that plays its role in determining the importance of variables. So, we will stop here, and in the next lecture I will discuss performance metrics.
Thank you.
Business Analytics & Data Mining Modelling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 16
Performance Metrics - Part I
Welcome to the course Business Analytics and Data Mining Modelling Using R. In this lecture we will start our discussion on our next module and topic, that is, performance metrics for prediction and classification. In this particular lecture we are going to focus on these two tasks, prediction and classification, and on how the performance of different models, whether prediction models or classification models, can be evaluated.
So, let us start our discussion. First of all, we need to discuss why we require performance metrics. Different models and different methods can be applied to a given prediction or classification task, so how do we select the most useful or the best model from all those candidate models? For that we need some metrics which can help us make that decision, finding the most useful model or the best performing model. Comparison of candidate models is also needed when, as we have talked about, it is not several different methods but just one method and different variants of it, because of the different configurations or options that we select for each variant.
For example, one simple case could be the same regression model run on the same data set with different sets of predictors. For the same method we will then have a number of candidate models, each with a different set of predictors. Therefore, it becomes important for us to have some metrics which can help us decide which one is the most useful or the best performing model. In this particular lecture we are going to focus mainly on classification and prediction and the relevant performance metrics; we will start with classification.
(Refer Slide Time: 02:35)
For classification, the metric that we generally use is the probability of misclassification. If the probability of misclassification is on the lower side, that can help us determine whether a particular model is useful or giving better performance. For different candidate models we can look at the probability of misclassification and thereby compare their performance or usefulness for the task at hand. Then there is the naive rule for classification, which is to classify every record to the most prevalent class. For example, suppose we have three or four classes for which we are trying to build a classification model.
If any new record is classified to the most prevalent class, that is called the naive rule. This serves as a benchmark: the performance of other models and methods can be compared with respect to it. Under the naive rule we do not run a model; we do not include the predictor information, the relationship between the predictors and the output variable, in our model. Irrespective of the information contained in the predictors, we just assign every record to the most prevalent class. For example, suppose there are three classes and the most prevalent one is class 1: out of a hundred observations, 80 belong to this class, and of the remaining 20, 15 belong to class 2 and 5 to class 3. The most prevalent class being class 1 with 80 records, any new record can simply be classified to class 1, and we will still have a good amount of accuracy based on this naive rule, 80 out of 100.
Now, any new method that we apply for this particular classification task, classifying a record into these three classes, class 1, class 2 and class 3, should actually perform better than the naive rule. So, the naive rule serves as a benchmark, and then we can look at the probability of misclassification and compare different candidate models to find out the most useful or the best performing one. There are a few metrics which have been developed based on the naive rule; these metrics use the naive rule to build a measure that can be used to compare the usefulness of different models. One is multiple R square. What is multiple R square? We will discuss it in more detail in coming lectures, but to give a brief definition, multiple R square is the distance between the fit of the model to the data and the fit of the naive rule to the data. That is why we say it is a metric based on the naive rule.

So, the distance between the fit of the model to the data and the fit of the naive rule to the data is what is looked at in multiple R square.
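In formula terms, one standard way of writing this distance (not spelled out explicitly in the lecture) is in terms of SSE, the sum of squared errors of the fitted model, and SST, the sum of squared errors of the naive rule that predicts the sample mean for every record:

$$ R^2 \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}. $$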
If we want a naive rule equivalent for prediction tasks, the sample mean could be the one. Again, if we are trying to predict the value of a particular variable and we are not interested in the information that can come from the predictors, that is, we are ignoring the predictor information, then simply the mean value of that output variable can be taken, and for any new record that mean value can be assigned. That could be the naive rule equivalent for prediction tasks, because the mean is the central value of a variable and we can expect any new value to lie around it, so the error would be minimal if we do not use any other predictor information. The simple sample mean can therefore serve as the naive rule equivalent for prediction tasks.
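A minimal R sketch of both naive benchmarks, assuming hypothetical vectors actual_class (a factor of observed classes) and y (a numeric outcome):

# Classification benchmark: assign every record to the most prevalent class.
naive_class    <- names(which.max(table(actual_class)))
naive_accuracy <- mean(actual_class == naive_class)

# Prediction benchmark: assign every record the sample mean of the outcome.
naive_pred <- mean(y)
naive_rmse <- sqrt(mean((y - naive_pred)^2))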
Now, let us come back to our discussion on classification. A classification matrix is generally used to compute the different performance metrics for classification tasks. You can see here how a classification matrix is displayed. Generally we will have the predicted class along one dimension and the actual class along the other; in this particular case we are talking about a two-class scenario, where one class is represented by 1 and the other class by 0, and similarly the actual class is represented by 1 and 0.
So, we have the predicted class and the actual class, with different counts filled in once the different models have been applied. For a particular model, n_ij represents the number of class i cases classified as class j cases. If we look at n11, this is the number of class 1 cases classified as class 1 cases, a correct classification. The next number, n10, is the number of class 1 cases classified as class 0 cases, an incorrect classification. Similarly, for the second class, class 0, the value n01 is the number of class 0 cases classified as class 1 cases, also an incorrect classification, and the next value, n00, the number of class 0 cases classified as class 0 cases, is a correct classification.
If we look at the diagonal values, n11 and n00, these two values are the correct classifications by the model; the sum of these numbers represents the correct classifications. The off-diagonal values, n01 and n10, represent the incorrect classifications. Some metrics can be computed based on this classification matrix. Most of the methods that we use for a classification task generally result in such a classification matrix, and this classification matrix is then used to compute different metrics and evaluate the performance.
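A minimal R sketch of building such a classification (confusion) matrix and deriving error and accuracy from it, assuming hypothetical vectors actual and predicted that both take the values 0 and 1 (the formulas themselves are stated a little later in this lecture):

# Classification matrix: rows are actual classes, columns are predicted classes.
cm <- table(Actual = actual, Predicted = predicted)
cm

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n   # (n11 + n00) / n
error    <- 1 - accuracy        # (n01 + n10) / n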
Another important point about classification performance is that we build our classification model on the training partition, so the model will fit that data a little more closely, and therefore, as we have been saying, it is not recommended to test the performance of the model on the same partition. It is the validation partition classification matrix that is generally used to judge the performance of a classifier, that is, of a model for classification tasks.
We can still develop a training partition classification matrix, but it is used to compare with the validation partition classification matrix to detect overfitting. It is expected that the model will perform slightly worse on the validation partition, because the model was built on the training partition and will fit the training data more accurately; the performance might therefore go down a bit for the newer partition, that is, the validation partition. However, if there is too much of a gap between the performance numbers of the validation partition classification matrix and the training partition classification matrix, that might indicate overfitting: the model has overfitted the training partition and that is why the performance on the validation partition has come down much further. That is one way to detect overfitting. If the numbers for the validation partition classification matrix are on the lower side but only by a small gap, then probably the model is stable, and the developed model might perform well on new data as well.
Now, there are some performance metrics, as we were saying, which are based on the classification matrix. These are as simple as the misclassification rate, or error, and accuracy. Going back to the classification matrix discussed just before, the off-diagonal values n01 and n10 are the misclassified cases, so the misclassification rate or error is the proportion of the total number of observations which are misclassified, that is, those counted in n01 and n10. For the other metric, accuracy, which concerns the observations classified into their actual classes, we go back to the matrix and see that the two diagonal values n11 and n00 represent the correct classifications; the proportion of these numbers out of the total observations gives the accuracy, and it indicates the usefulness of a particular model.
So, the misclassification rate (error) and accuracy numbers can be used to compare the performance of different candidate models.

These are the formulas for these two metrics based on the classification matrix: error is computed as (n01 + n10) divided by the total number of observations n, and accuracy is one minus error, that is, (n00 + n11) divided by n. These two metrics can be used to compare the performance of candidate models. Another important concept related to performance metrics is the cut-off probability value.
Earlier we talked about the probability of misclassification; let us come back to that discussion. How do we assign different cases to different classes? Generally, data mining or statistical methods compute, using the model, probability values for each record. If there are two classes, for example class 1 and class 0, then for each record we will have the probability of that record belonging to class 1 and the probability of it belonging to class 0.
So, how do we assign a record to a particular class? The cut-off probability value is what is used to perform that assignment. There are different scenarios for how this cut-off probability value can be used. The first one is when accuracy for all the classes is important; that means we want to correctly classify all the cases irrespective of the class. The cases could belong to class 1 or class 0, but if the actual class of a particular record is class 1 we want to classify it as class 1, and if the actual class of a record is class 0 we want our model to classify it as class 0. If that is what we want, then a particular case or observation can be assigned to the class with the highest probability as estimated by the model. As we discussed, for every case or observation we have a probability value for each class; in the two-class case, class 1 has a probability value and class 0 also has a probability value. We compare them, and the case is assigned to the class having the higher probability value.
In some other scenarios, accuracy for a particular class of interest is important. There could be a particular class for which we are more interested in identifying the cases belonging to it; for example, we would be interested in identifying the 1s a little bit more even if it comes at a higher misclassification rate for the class 0 records. In those situations the assignment happens in a different fashion. We would not assign based on the higher probability value; what we generally do is assign a case to the class of interest if its probability for that class is above a cut-off value.
(Refer Slide Time: 17:20)
So, we define a cut-off value in such cases, and if the probability for the class of interest is higher than the cut-off value, that particular case or observation is assigned to the class of interest; otherwise it is assigned to the other class.
For a two-class model the default cut-off value is generally 0.5, which is in principle similar to the naive rule, because with two classes, if a particular probability value is more than 0.5, that is in a sense the more prevalent class, and that is how the assignment happens. In the other scenario, when accuracy for all the classes is important and a record is classified to the class with the highest probability, there too we can see an application of the naive rule: the highest-probability class is, in a way, the most prevalent class for that record, and so the record is assigned to it. The idea is similar to the naive rule when we talk about assignment based on probability values.
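A minimal R sketch of this assignment rule, assuming a hypothetical vector prob1 of model-estimated probabilities of class 1:

cutoff    <- 0.5                            # default two-class cut-off
predicted <- ifelse(prob1 > cutoff, 1, 0)   # assign to class 1 above the cut-off

# Lowering the cut-off favours the class of interest (class 1):
predicted_interest <- ifelse(prob1 > 0.25, 1, 0)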
So, we will do an exercise using Excel. Let us open this particular file and see what data we have.
(Refer Slide Time: 19:04)
Here are 24 observations, as you can see, from row number 2 to row number 25. For each observation we have the actual class, whether the particular observation or case belongs to class 1 or class 0; that is indicated by a 1 or 0 for each observation. We also have the probability of that particular observation belonging to class 1. The probability of belonging to class 0 would be one minus this probability, because we are discussing a two-class scenario. These probability values are arranged in decreasing order, from higher probability to lower probability.
You can see the cut-off probability value for success, the default value of 0.5 as discussed before. Looking at the first observation, the actual class is 1 and it is going to be correctly classified as class 1, because the probability value for this case is more than 0.5; similarly, the second observation is also going to be correctly classified, its value being more than 0.5. The same holds for the next few cases, until we reach observation number 8. There you see that even though the estimated probability is 0.686, which is more than 0.5, the actual class is 0: as per the probability value it would be classified as 1, but the actual class is 0, so this is going to be an incorrect classification.
If we move further we again get correct classifications, with values of more than 0.5 correctly classified as 1, and then further on we encounter another misclassification, where the value for the observation is more than 0.5 but the classification is incorrect. Similarly, as we go down, the probability values start becoming less than 0.5 and the predicted classification also starts becoming 0. Observation number 13, with a value of 0.49, is correctly classified as class 0. Going down further there are a few misclassifications, for example record number 15, in row number 16, with a value of 0.46: this is going to be incorrectly classified as 0, but the actual class is 1.
If we count the number of 1s and 0s in this particular example, the number of 1s is 12 and the number of 0s is 12. Based on the cut-off probability of 0.5 we can construct the classification matrix, and these are the values. In this classification matrix, owner is coded as 1 and non-owner as 0: 10 observations which are actual owners are correctly classified and 2 observations which are actual owners are incorrectly classified; for the non-owner class, 2 observations which were actually non-owners were incorrectly classified as owners and 10 observations which were actually non-owners were correctly classified as non-owners.
These are the numbers, and the accuracy value can be computed: the diagonal values, cells E6 and F7, divided by the total number of observations give the accuracy, and similarly for the error. In this particular case we have created a one-variable table in Excel, with cut-off values ranging from 0 to 1 at every 0.05 interval, so as the cut-off value changes from 0 to 1 we have the corresponding accuracy and overall error numbers. If we focus on the value 0.25, the accuracy and error numbers are 0.63 and 0.38; if we go back to the default value 0.5, the accuracy is 0.83. So, at the default cut-off value the accuracy is much better: for 0.25 it was 0.63, and for 0.5 it is actually 0.83.
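The same one-variable-table exercise can be sketched in R, assuming hypothetical vectors actual (the 0/1 actual classes) and prob1 (the estimated class 1 probabilities):

# Accuracy and error for a grid of cut-off values, as in the Excel table.
cutoffs  <- seq(0, 1, by = 0.05)
accuracy <- sapply(cutoffs, function(k) mean(ifelse(prob1 > k, 1, 0) == actual))
data.frame(cutoff = cutoffs, accuracy = accuracy, error = 1 - accuracy)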
If the cut-off value is 0.75, you would again see the accuracy come down from its value at 0.5. So, whether we reduce the cut-off value from 0.5 to 0.25 or increase it from 0.5 to 0.75, the accuracy goes down. Now, if we look at the classification matrix and the predicted class, 12 observations are being predicted as owner and 12 as non-owner when the cut-off value is 0.5. If we change this cut-off value from 0.5 to 0.25, you would see the classification matrix change, because the appropriate formulas are being used in those cells.
You would now see that in the predicted class 19 observations have been classified as owner and only 5 as non-owner; this is mainly because of the lower cut-off value of 0.25. More observations are being classified as 1s, and therefore we see a larger number of owners, 19, a jump from 12 to 19, and that is why there is a drop in the accuracy value. Similarly, if we change the cut-off value to 0.75, the numbers in the classification matrix change appropriately, with more observations being classified as non-owners: only 6 observations have been classified as owner and 18 as non-owner.

Because the cut-off value is on the higher side, above 0.5, fewer observations are being classified as 1s and more as 0s. So, let us go back to our discussion. If we move away from the cut-off value of 0.5 to 0.25 or to 0.75, in each case the accuracy goes down.
So, why change the cut-off probability value from 0.5 at all? There could be two reasons. One is that we have a special class of interest and therefore we are interested in identifying those rare cases, the cases belonging to the special class of interest, a little bit more, even if it comes at a higher misclassification rate for the other class. The other reason could be asymmetric misclassification costs.

If for one particular class the cost of misclassification is much higher than the cost of misclassification for the other class, then because of this asymmetric misclassification cost we would also be interested in finding more of the observations belonging to the class with the higher misclassification cost, and that would require us to move away from our default cut-off probability value of 0.5.
The next question would be when to incorporate a change in the cut-off value; when do we change it?

For example, in the exercise that we just went through in Excel, we had already run the model; we had the actual 1s and 0s, but we also had the probability values as estimated by the model. So, the exercise of changing the cut-off value from 0.5 to 0.25 or 0.75 or other numbers through the one-variable table in Excel was done after the model had been selected, when we had the probability values, and we were trying to see how the results would change if we changed the cut-off value.
So, one situation is after final model selection: we can incorporate a change in the cut-off value because we have the probability values, since most techniques estimate probability values as well, and it is then easy for us to change the cut-off value and see how the results change. Another situation is before model derivation: we can incorporate the misclassification costs during the model derivation steps themselves, and that would ultimately determine the results. Now, when we have a special class of interest, performance metrics like accuracy and error might not be useful. A few other metrics are popular when we are specifically interested in a particular rare class or a particular class of interest; these metrics are sensitivity and specificity.
(Refer Slide Time: 29:47)
So, we will discuss them in more detail in the next lecture; we will stop here.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 17
Performance Metrics – Part II: ROC/Cut-off Value
Welcome to the course Business Analytics and Data Mining Modelling Using R. Last time we started our discussion on performance metrics. Before we proceed further, I will do some exercises related to what we discussed in the previous lecture. So, let us open RStudio.
As usual, let us load the xlsx library so that we are able to import the data set from an Excel file.
Again, for this particular exercise we are going to use the sedan car xls data set. The particular concept that we want to discuss through this exercise is class separation. As we noted when we started our discussion on classification performance, depending on the data set it might sometimes be easy to apply different techniques and get accurate classifications for the different observations, and sometimes it might be very difficult because of the data set.
Let us see this through an example. We will import this sedan car data; let us run this code. You can see the data set has been imported: 20 observations and 3 variables. Let us remove the NA columns. We are already familiar with this particular data set, and we are going to plot these two variables, annual income and household area, as we have done before; let us plot this.
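The code run here is not reproduced in the transcript; a minimal sketch of these import-and-plot steps might look as follows (the file name, sheet index and column names are assumptions, not taken from the lecture):

library(xlsx)

df <- read.xlsx("sedan_car.xls", sheetIndex = 1)
df <- df[, colSums(is.na(df)) < nrow(df)]        # drop all-NA columns

# Scatter plot of the two predictors, marking the two classes differently.
plot(df$Annual.Income, df$Household.Area,
     pch = ifelse(df$Ownership == "owner", 19, 1),
     xlab = "Annual Income", ylab = "Household Area")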
(Refer Slide Time: 02:09)
We have created this particular plot in previous lectures as well. The important point to note here is that, perhaps partly because this is a small data set, the two classes in this data set, the owner and non-owner classes, show a fairly clear separation. Therefore, it is easier for us to apply different methods, build the different candidate models, and then find the most useful or best performing model.

So, in this particular data set the job is much easier, but sometimes the scenario might be different, for example in the other data set that we have used before, the promo offers data set. Let us import this file and remove the NA columns; the data has been loaded, five thousand observations of 3 variables.
(Refer Slide Time: 03:19)
The earlier data set had 20 observations and 3 variables; this one has 5000 observations, so it is a much larger data set. Let us remove the NA columns, and we are going to plot this data set as we have done in previous lectures. Like last time, we use the gray and black palette that we created, change the margin and outer margin settings, and plot these two variables, income and spending.
Now, if we zoom in to this plot and look at the points belonging to the different classes, you would see that in the left part of this rectangular plotting region there is homogeneity, most of the points belong to one class, but in the right part there is not much clarity: we have points belonging to class 0 and points belonging to class 1. Both classes are present, so the separation between the classes is not very clear. Therefore, when we apply different candidate models on this particular data set, the performance will have to be evaluated much more closely, and we might not get much improvement in comparison to the benchmark cases.
So, for a classification task the class separation is quite important. In the previous lecture we also talked about the classification matrix. One simple example that I have written here is the creation of such a classification matrix; this is just for demonstration purposes, showing how a classification matrix can be created directly. Otherwise, if you have the information in two variables and they are factor variables, then table is the command that can be used to generate the classification matrix.
In the previous session we talked about two metrics, error and accuracy, and this is how we can compute them. In this particular case you can see the 0s classified as 1 and the 1s classified as 0; these two numbers give the number of misclassifications, and this particular code computes the error. Similarly, for accuracy we combine the records of class 0 classified as class 0 and the records of class 1 classified as class 1, and this gives us the accuracy number.
Let us go back to our discussion on performance metrics. In the previous lecture we stopped at this point, so let us start from here. When there is a special class of interest, the performance metrics accuracy and error might not be suitable. In that case, because we have one special class of interest, we might not be much concerned with the classification of the other class and whether its misclassification error is on the slightly higher or lower side; our focus is on one class. In those scenarios we can use these two metrics, sensitivity and specificity.
(Refer Slide Time: 07:15)
Here, sensitivity is about the true positive fraction: for the cases belonging to class 1, how capable a particular model is of identifying such cases. Similarly, specificity measures the capability of the model in terms of identifying the observations belonging to class 0 as class 0.
That is the true negative fraction. Different candidate models can be compared using these two metrics. Look at the names of these metrics: sensitivity asks whether our model is sensitive enough to identify the true positives, that is, the class 1 observations. What proportion of the class 1 observations is identified by the model is captured in sensitivity. You can look at the formula as well: it is n11 divided by (n10 + n11), that is, the number of class 1 observations identified as class 1 divided by the total number of class 1 observations. Similarly, specificity is the true negative fraction, with the formula n00 divided by (n00 + n01), that is, the number of class 0 observations identified as class 0 out of the total class 0 observations. So, the capability of the model in terms of identifying the true negatives, whether we are able to pick out the true negatives through our model or not, is evaluated using specificity.
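A minimal R sketch of these two formulas, assuming hypothetical 0/1 vectors actual and predicted, with class 1 as the class of interest:

n11 <- sum(actual == 1 & predicted == 1)   # true positives
n10 <- sum(actual == 1 & predicted == 0)   # class 1 missed
n00 <- sum(actual == 0 & predicted == 0)   # true negatives
n01 <- sum(actual == 0 & predicted == 1)   # class 0 misclassified as 1

sensitivity <- n11 / (n11 + n10)   # true positive fraction
specificity <- n00 / (n00 + n01)   # true negative fraction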
Now, that brings us to our next discussion, the ROC curve, that is, the receiver operating characteristic curve. This particular curve is generally used to plot sensitivity against 1 minus specificity as the cut-off value changes. For different cut-off values we compute these two quantities, sensitivity and 1 minus specificity, and then plot them.
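A minimal R sketch of this idea, tracing the curve over a grid of cut-off values (hypothetical vectors actual and prob1 as before; the lecture's own exercise builds the same points through an Excel table):

cutoffs <- seq(1, 0, by = -0.05)
roc <- t(sapply(cutoffs, function(k) {
  pred <- ifelse(prob1 >= k, 1, 0)
  sens <- sum(actual == 1 & pred == 1) / sum(actual == 1)
  spec <- sum(actual == 0 & pred == 0) / sum(actual == 0)
  c(one_minus_spec = 1 - spec, sensitivity = sens)
}))

# Step plot of sensitivity versus 1 - specificity, with a reference line.
plot(roc[, "one_minus_spec"], roc[, "sensitivity"], type = "s",
     xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)   # average case scenario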
This particular curve was earlier used for radio signals in World War II, where the received signals indicated whether a particular signal corresponded to an enemy ship or tank, identified through a blip on the screen. That is where the name receiver operating characteristic comes from. But this curve is now used in multiple domains, especially to solve analytics problems where statistical modelling or data mining modelling is being done.
The top left corner of this plot, as we will see through an exercise, reflects the desired performance that we want from our model. So, let us open the Excel file that we used in the last session.
Here, as you can see, in the previous lecture we created the one-variable table where we computed the accuracy number and overall error for different cut-off values; similarly, we can compute sensitivity and specificity for different cut-off values. You can see the values here and how they are being computed.
You can see that sensitivity is the number of owners classified as owner divided by the number of owners classified as owner plus the number of owners classified as non-owner; that is the fraction, and that is how sensitivity is calculated. Similarly, specificity is the number of non-owners classified as non-owner divided by the total number of non-owners; that is how we compute it. Now, a one-variable table can be computed using these two formulas; you can see that I have already created this one-variable table, and for different cut-off values you can check these numbers. Earlier I was talking about the ROC curve.
This data that we have just generated can be used to create our ROC curve. We have taken out this data and copied it into a different worksheet; now we will import this data set into R and create our ROC curve. In the meanwhile, consider the previous table that we created in the previous lecture, with accuracy and overall error.
A plot has been created here of accuracy against the different cut-off values: the cut-off values are plotted on the x axis, and the accuracy numbers and also the error numbers, error being 1 minus accuracy, are plotted on the y axis. You can see, for the example that we discussed in the previous lecture, that as we move from a cut-off value of 0 towards around 0.5 the accuracy keeps on increasing, and over the range from 0.4 to 0.8 the accuracy is, with a bit of fluctuation, stable around a value of 0.8. Then, as the cut-off value increases further, the accuracy goes down again. This is the graphic representation of the data that we had created. So, now let us open RStudio and import the sensitivity and specificity data that we have just created. Let us import this file; you can see 21 observations and 3 variables.
Let us remove the NA columns and also the NA rows; that has been done. What we are interested in is creating an ROC curve. The ROC curve is between sensitivity and 1 minus specificity, so let us compute that. In the data frame that we have, there is the cut-off value, then 1 minus specificity and then sensitivity, and this data frame has been ordered by the cut-off value.

Let us execute this code. Looking at the output, you would see that it starts from a cut-off value of 1 and then, as the value goes down, the different numbers for sensitivity and 1 minus specificity have been computed.
Now, these numbers will be plotted to create the ROC curve. Let us create this particular plot.
(Refer Slide Time: 16:17)
As we discussed, the ROC curve plots the sensitivity and 1 minus specificity points. If we look at the output again, for a cut-off value of 1 both these values are 0, so you can see the first point is here; then for a cut-off value of 0.95 the x value is 0 and the y value is 0.83.
Similarly, as this cut-off value decreases, you would see that the sensitivity value keeps on increasing, and that is reflected in the plot. As the cut-off value moves from 1 towards 0.5, the model improves in identifying the true positives; more true positives are identified as the cut-off value decreases from 1, so the sensitivity value keeps improving. As we move further, you would see that 1 minus specificity also takes on non-zero values and keeps increasing; that means more class 0 records are being misclassified as class 1, that is, more false positives are being produced. As we keep changing the cut-off value, sensitivity increases further, but at the expense of false positives: we gain in terms of identifying more 1s, but that comes at the expense of misclassifying more 0s. Still, the overall sensitivity keeps on increasing.
We are interested in the top left corner; this is the desirable performance of a model, because we are interested in identifying more 1s, so the top left corner is the region we care about. If we look for the point corresponding to the 0.5 cut-off value, we can see that the 11th point reflects it, where 1 minus specificity is 0.16 and sensitivity is 0.83. If we look further to the right, the false positive rate increases quite significantly.

Therefore, as we said, we would like to identify the model which is in this top left region and not go further towards the right. This is the point corresponding to the 0.5 cut-off value: 0.16 on the x axis and 0.83 on the y axis, and it lies in the top left region. So, probably for the cut-off value of 0.5 we are getting good enough performance even when using these metrics, sensitivity and specificity. These were the individual points; to have the actual ROC curve that is generally used, we can change the type of plot from point plotting to step plotting.
Generally, step plots are used for ROC curves. This is the plot: you can see that over the first few changes in the cut-off value, as it decreases from 1, the sensitivity keeps on increasing, and it increases further as the cut-off value decreases further.
Some false positives are also produced by the model as we move further; we see some jump in sensitivity, but more false positives are being classified by the model. So, this is the ROC curve that we want, and we can draw the reference line: this reference line represents the average case scenario, and the other line represents the ROC curve.

Let us go back to our discussion. Another interesting point that we can understand from this exercise is that when we have a special class of interest, we want to identify more of the class 1 members, because the idea in a business context is generally, for example, whether the customer is going to respond to a particular promotion offer or not. In that case we would like to take all those customers having a higher probability as estimated by the model and mail them our offer. Therefore, creating a rank ordering of records with respect to our class of interest becomes more practical.
So, how can that be done? What plots and mechanisms can be used to do this rank ordering of records for the class of interest? It can be done based on the estimated probabilities of class membership. The records having the highest probability of belonging to the special class of interest can be taken first, so that the promotional offer can be sent or mailed to them first. The lift curve is what can be used to display the effectiveness of the model in rank ordering the cases: for the selected model we can draw a lift curve and then find out how effective it is in terms of rank ordering the cases.
How is it actually constructed? Once you have built your model on the training partition, you have your validation partition scores, and using these scores, the estimated probability numbers, we can construct our lift curve. The effectiveness of the model is generally seen using the cumulative lift curve, wherein, ordering by probability, we look at the cumulative number of records which actually belong to class 1, the class of interest. This cumulative lift curve is also called a gains chart. It plots the cumulative number of cases on the x axis and the cumulative number of true positive cases on the y axis.
The plot displays the lift value of the model for a given number of cases with respect to random selection, where we rely only on the overall probability of class membership. Let us say there are 20 observations and ten belong to one particular class, class 1; then the probability of a particular record belonging to class 1 is ten divided by 20, that is, 0.5. Compared with such random selection, how much more is our model going to help us in identifying the cases belonging to class 1, how much lift over this average scenario does our model give? That is what we can see through the cumulative lift chart or gains chart. We will again discuss this through an exercise.
Let us open our Excel file. If we look at the same data, the same 24 observations, their estimated probability values can be seen here, and we also have the actual class corresponding to each record. What we want to compute is the cumulative actual class. These probabilities are pre-arranged in decreasing order: the record with the highest probability of belonging to class 1 comes first, followed by the records with lower probability values in decreasing order, and the actual class membership is given alongside.
Looking at the cumulative actual class, for the first record we have 1; when we get to the second record, which is also a 1, the cumulative number of records belonging to the actual class becomes 2. We continue in this fashion as long as the actual class is 1. When we come to the first record whose actual class is 0 but whose probability value, 0.686, is more than 0.5, it is going to be classified as belonging to class 1 even though it is actually class 0. So, we do not add anything for this record; there is no addition to the cumulative actual class and it remains 7. In the same fashion we keep going and keep accumulating these numbers.
So, once we have prepared this particular data then we can go ahead and create our plot.
(Refer Slide Time: 28:12)
Let us open RStudio. First, let us import this data; you can see 24 observations and 4 variables. Let us remove the NA columns and look at the range of this new variable, cumulative actual class. We know that we have 24 observations, so the plot is going to be between the number of cases on the x axis and the cumulative actual class number on the y axis. Let us execute this code, and you would see the plot has been generated.
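For reference, the same cumulative lift (gains) chart can be sketched directly in R without the Excel step, assuming hypothetical vectors actual (0/1 classes) and prob1 (estimated class 1 probabilities):

# Rank records by estimated probability and accumulate the actual 1s.
ord        <- order(prob1, decreasing = TRUE)
cum_actual <- cumsum(actual[ord])
n          <- length(actual)

plot(1:n, cum_actual, type = "l",
     xlab = "Number of cases", ylab = "Cumulative actual class")
lines(c(0, n), c(0, sum(actual)), lty = 2)   # reference line: random selection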
Now, let us also plot the reference line; that is the reference line, and with that this cumulative lift curve or gains chart has been generated. Let us also create the legend and a few other lines. Looking at the plot, you can see that the dotted line represents the cumulative 1s under random selection, where we randomly select and rely only on the overall probability value, while the other line is the cumulative 1s sorted by the predicted values. We will stop here and continue our discussion, interpreting this particular gains chart, in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 18
Performance Metrics – Part III: ROC Curve
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous session we were discussing performance metrics, and in particular we had created a cumulative lift curve or gains chart.
This was the chart that we created in the previous lecture. As you can see, on the x axis we have the number of cases, increasing from 0 records to 5 records, then 10 records, then 15, 20 and so on. As the number of cases increases, what is the performance of the selected model? On the y axis we have the cumulative actual class.
That is, the cumulative actual class identified by the model as we increase the number of cases. The dotted line is the reference line, showing the cumulative 1s as would be identified using a random selection approach, which relies only on the overall probability of a record belonging to a particular class. When we have 0 observations, this reference line starts from (0, 0), no records.
And no identification or classification. As we move to all the records, that is, 24 observations in this case, you would see that in the data set we had 12 owners and 12 non-owners.
The class of interest being owner, the number of records belonging to the owner class is 12, so the probability of a particular record belonging to the owner class is 12 divided by 24, that is, 0.5. You would see that the last point of the reference line has coordinates (24, 12); this is the reference line, the average case. Now, for how our model is performing, you can see the cumulative actual class values: this stepwise line in black is the performance of our model. As the number of cases increases, the gap or separation between the reference line and the actual line increases; that means the model starts to perform better in comparison to the random selection case.
As we move further there are slight blips in between, where the line takes horizontal steps before the performance improves again. Where we see these slight blips, the horizontal segments, those are actually misclassifications: our model has misclassified a particular observation. The class of interest being class 1, and that being what is represented on the y axis, an observation has been identified as probably class 1 which was actually class 0.
If we had a model which classified all the observations perfectly, each record to its own class, then that model would be represented by the red dotted line.
Now, the data that we have plotted in this cumulative lift curve or gains chart is actually reflecting the rank ordering as well. You would see that the red dotted line, the perfect model, correctly classifies all the cases belonging to the owner class; you can see this happens by 12 cases. Once all those cases have been correctly identified, you see a horizontal line, because from then on all the cases belonging to the non-owner class are being encountered. That is how the perfect model would go. Our model has a few misclassifications, so it deviates from this red line, but if we compare it with the reference line the model gives much better performance.
Now, let us go back to our discussion; we have covered the cumulative lift curve or gains chart.
(Refer Slide Time: 05:45)
Let us open RStudio and generate the decile chart. Before generating the decile chart we need to create a few variables. In a decile chart, as we will see, we create different deciles, which are nothing but bars representing the cumulative number of cases in percentage terms: the top 10 percent of cases and the corresponding bar, then the top 20 percent of cases and the corresponding bar, then 30 percent of the cases and the respective bar, and so on.
In this fashion we plot for all the cases, that is, up to 100 percent of the cases. With bars for 10 percent, 20 percent, 30 percent and so on, we will have 10 such bars, and each bar reflects the same kind of information that was reflected by the gains chart: the lift that is given by the model.
Let us see how we can generate this. As we saw in the gains chart, the comparison was with respect to the reference line; in this case, on the y axis we will have the decile mean divided by the global mean. For every decile or bar we compute the mean of the values in it, and that is divided by the global mean, the global mean being the number of cases belonging to actual class 1 divided by the total number of cases.
This value gives us the global mean, the average case. Let us first create the variable decile cases; this is nothing but a variable representing the number of observations that will fall in each decile or each bar.

So, the first decile will show a bar for about 2 observations, then 5 observations, 7 observations, 10 observations, and so on. Two observations represent 10 percent of the total observations, in this case 24, so 2.4 has been rounded to 2; then 20 percent of the total observations is 4.8, rounded to 5, and in this fashion the different deciles will have different numbers of observations. Let us create a few more variables so that we are able to compute the decile mean for each decile. The global mean has also been computed; as you can see, this value is 0.5. Now let us also compute this particular counter.
Now, in this for loop we run a counter i over all the decile cases; in this example there are 10 such deciles, so the loop runs 10 times. You would see in the second line of the for loop that the decile mean for every decile is being computed: the cumulative actual class is divided by the counter value, which holds the number of decile cases, and that gives us the decile mean for that particular decile.

So, for a particular bar, the cumulative actual class is divided by the number of cases for that decile to give the decile mean, and then the decile value is computed by dividing the decile mean by the global mean. Let us execute this for loop; we have the decile numbers. Now we will create a bar plot of these deciles. Let us compute this, and you would see a decile chart has been generated.
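A minimal R sketch of the decile chart just described, assuming the same hypothetical vectors actual and prob1 (the rounding mirrors the 2, 5, 7, 10, ... counts mentioned above):

ord         <- order(prob1, decreasing = TRUE)
cum_actual  <- cumsum(actual[ord])
n           <- length(actual)
global_mean <- mean(actual)                     # the average case

decile_cases <- round(seq(0.1, 1, by = 0.1) * n)
decile_mean  <- cum_actual[decile_cases] / decile_cases
decile_lift  <- decile_mean / global_mean

barplot(decile_lift, names.arg = seq(10, 100, by = 10),
        xlab = "Percentile of cases (ranked by probability)",
        ylab = "Decile mean / Global mean")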
On the y axis we have the decile mean divided by the global mean, and on the x axis we have the deciles. The first decile represents 10 percent of the cases, and these 10 percent of cases come from the rank ordering that we have done based on the estimated probability numbers.
So, the 10 percent most probable ones come first, just as in the cumulative lift chart, where the most probable ones also came first. Similarly, the second decile takes the 20 percent most probable cases, then 30 percent, and so on, because, as we talked about, in a business context, for example in the case of a promotional offer, we would be interested in identifying the customers who are most likely to respond to that offer, and the most likely ones we would like to mail first.
These two charts that we talked about indicate the performance of our model in this rank ordering sense. You would see that the lift for decile 1 is actually 2 in comparison to the average case; for decile 2, that is, for 20 percent of the cases, it is again a similar number, and for decile 3 also a similar number. As we move towards the right side of the x axis, the decile value keeps on decreasing, because as we go along and try to classify more cases the lift value goes down. The same thing was reflected in the gains chart: comparing with the random or average case, we see the separation, and from a certain point onwards the two lines start closing in on each other; the same thing is reflected in the decile chart as well.
(Refer Slide Time: 12:43)
So, let us go back to our discussion on performance metrics. The next concept that we are going to talk about is asymmetric misclassification cost. We talked about scenarios in which we would want to change the cutoff value: one was when we have a particular class of interest, and the other was when asymmetric misclassification costs are involved. So, what is this concept? It sometimes happens that the misclassification error for the class of interest is more costly than that for the other class.
This generally happens in a business context, for example when we misclassify a customer who is actually likely to respond to the promotional offer. If we had made the offer, the customer would have accepted it and purchased that particular item, and we would have made some profit. So, if we are not able to correctly classify that customer as a responder, we stand to lose the profit we could have made; this is the opportunity cost of a forgone sale. This cost has to be compared with the cost of making an offer to a customer who is not a responder, which would typically be much smaller than the opportunity cost of the forgone sale.
So, it is important for us to correctly identify the class of interest, because failing to do so costs us more. In this scenario the plain misclassification rate metric is not appropriate, the reason being that different costs are associated with misclassifying different classes. There are other considerations as well: it might not just be the opportunity cost of a forgone sale versus the cost of making an offer; there is also the cost of analyzing the data. For example, building a data mining model requires having the data set in the first place, perhaps in the form of data warehouses or data marts, from which the data is extracted for the analytical problem; then a classification model, a classifier, is developed to help us identify class 1 members. All of that incurs a cost, and we should be incorporating that cost as well. So, we need to look at the actual net value impact per record.
So, even when we correctly identify a customer with the help of our data mining model, we have to incorporate the cost of analyzing the data; the actual net value impact per record is a much better measure. Eventually, when asymmetric misclassification costs are present, our goal would not be minimizing the overall error or maximizing the overall accuracy, but rather minimization of cost or maximization of profit.
(Refer Slide Time: 17:00)
So, let us open an Excel file to understand misclassification cost further, before we discuss how we can incorporate misclassification cost to improve our modeling.
Let us say we have this example: we have 1000 observations, this is our sample size, and 1 percent of this sample belongs to the buyers category. So, that would be 10 buyers, and the remaining 990 would be non-buyers. If we use the naive classifier, that is, we do not build a data mining model, then what we do is assign all the cases to the majority class.
So, in this case all 10 buyers would also be classified as non-buyers, which leads to a 1 percent error. We would have 99 percent accuracy, or 1 percent error, which seems to be a good model, but it is of no practical use, the reason being that we are not able to make any money out of it because all the customers have been classified as non-buyers.
So, the opportunity cost for this model, the naive classifier, is going to be 100 rupees, given that the profit from one buyer is 10 rupees and the cost of sending the offer is 1 rupee.
If we look at the sheet, I have already entered the formulas; you can see that the number of buyers multiplied by the profit from one buyer gives us the opportunity cost, that is, 100 rupees. So, this is the cost if we follow the naive classifier. Now let us say we had a data mining model and, using this model, we generated this classification matrix. This particular model correctly classified 970 class 0 members as class 0.
That gives us an error of 2.2 percent: the off-diagonal values divided by all the observations give us the error, 2.2 percent. If we compare this scenario with the naive classifier, the error is on the higher side.
But if we look at the classification matrix, we are now able to correctly identify 8 buyers, and that will give us some profit. How can we quantify that? One way is to create a matrix of profit, showing how much profit we can make depending on the results of the classification model. Because 20 plus 8, that is 28 members, were classified as belonging to class 1, but only 8 of them were correctly classified, from those 8 members we will have 80 rupees, 10 from each buyer, while the 20 which were classified as class 1 but are actually class 0 members each cost us 1 rupee for sending the offer; that cost is reflected through the minus sign, minus 20. So, overall we will have a profit of 60 rupees.
That is from the profit perspective; we can look at the same situation from the cost perspective. The cost of sending the offer is 1 rupee and the profit from one buyer is 10 rupees. We look at the number of class 1 customers who have been incorrectly classified as class 0: there are 2 such customers, and the profit we could have made from each of them is 10, so this is 20. Then there are the members which have been classified as class 1, 20 misclassified and 8 correctly classified, that is 28 offers at 1 rupee each. From the cost perspective the total cost is therefore 48 rupees. Now we can either maximize the profit figure or minimize the cost figure; so our new goal would be maximization of this profit or minimization of this cost. A small R sketch of these profit and cost calculations is given below.
So, let us go back to our slides. Now, as we talked about, the actual classification would not be improved just by minimization of cost or maximization of profit as new goals. So, how do we improve the actual classification? One way is to change the rules of classification; for example, in the previous lectures we talked about changing the cutoff value from 0.5, which we can either increase or decrease.
Another thing we can do is create a new performance metric, the average misclassification cost. Let C0 be the cost of misclassifying a class 0 observation and C1 the cost of misclassifying a class 1 observation. Then the average misclassification cost is (C0 × n01 + C1 × n10) / n, where n01 is the number of class 0 observations misclassified as class 1, n10 is the number of class 1 observations misclassified as class 0, and n is the total number of observations. All the misclassifications, weighted by their costs, appear in the numerator, and this is divided by the total number of observations.
So, we can look to minimize this average misclassification cost; this could be our new metric, and by minimizing it we change our classification rules and improve the classification. A small R sketch of this metric follows.
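Here is a minimal sketch of this metric as an R function; the argument names are my own, and the example call reuses the illustrative confusion-matrix counts and costs from the Excel exercise above.

# Average misclassification cost = (C0 * n01 + C1 * n10) / n
avg_misclass_cost <- function(n01, n10, n, c0, c1) {
  # n01: class 0 observations misclassified as class 1 (cost c0 each)
  # n10: class 1 observations misclassified as class 0 (cost c1 each)
  (c0 * n01 + c1 * n10) / n
}

# Illustrative call using the earlier example (offer cost 1, forgone profit 10)
avg_misclass_cost(n01 = 20, n10 = 2, n = 1000, c0 = 1, c1 = 10)   # (20 + 20) / 1000 = 0.04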
(Refer Slide Time: 24:48)
Now, for this we would have to estimate the C0 and C1 values. But if we look at this formula and take C1 outside, we are left with the ratio C0 / C1; the other term is just a constant. Finding the actual costs of misclassification, C0 and C1, could be a costly process, especially if there are more than 2 classes, but we do not need to do that: it is enough to understand the ratio of these costs, that is, how much more costly it is to misclassify members of one class compared to the other. If we are able to find out the ratio of these two numbers, that suffices to minimize this quantity.
Now, sometimes the sample that we use for building our training partition has one particular class proportion, but the new data on which we are going to apply our model might not have the same proportion of records belonging to the different classes.
So, the incorporation of prior probabilities can further improve the model. In such cases, when we incorporate prior probabilities, the average misclassification cost formula changes and becomes (C0 × p0 × n01 + C1 × p1 × n10) / n, where p0 and p1 are the prior probabilities of class 0 and class 1.
So, even in this case we can incorporate the prior probabilities in our model without having to find out the actual cost values: the constant term can again be separated out, and we just have to know the ratios of these values. A small sketch of this adjusted metric is given below.
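A brief extension of the earlier sketch for the prior-probability-adjusted version; p0 and p1 are assumptions to be supplied by the analyst, and the expression follows the formula as stated in the lecture.

# Prior-adjusted average misclassification cost = (C0 * p0 * n01 + C1 * p1 * n10) / n
avg_misclass_cost_prior <- function(n01, n10, n, c0, c1, p0, p1) {
  (c0 * p0 * n01 + c1 * p1 * n10) / n
}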
That is why you would see that many software packages ask you to specify the ratio of misclassification costs for observations belonging to different classes, and they also let you incorporate the prior probabilities.
So, let us go back to our Excel sheet; the same information is conveyed there: the cost of misclassifying a class 0 observation and the cost of misclassifying a class 1 observation that we need to incorporate, denoted as c0 and c1, and the prior probabilities p0 and p1. We will also have to change our sampling routine: simple random sampling might not work when prior probabilities matter, because if there is a rare class we will have to do oversampling for that class, and for that we use stratified sampling.
So, in such cases we would be minimizing an expression involving not just the cost ratio c0 / c1 but also the ratio of the prior probabilities, that is, the probability of class 0 membership to the probability of class 1 membership. Depending on the case, we minimize the corresponding quantity. So, we will stop our discussion here, and we will discuss asymmetric misclassification cost further, along with some other concepts under performance metrics.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 19
Performance Metrics - Part IV: Asymmetric Classification
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture we were discussing performance metrics, and specifically asymmetric misclassification cost. So, we will pick up our discussion from there. As we were talking about, asymmetric misclassification cost can be important in a business context, so let us open the Excel file so that we are able to recall some of the thoughts from the previous lecture.
(Refer Slide Time: 00:55)
We talked about this example where the sample size is 1000 and we have 1 percent buyers, the others being non-buyers. We talked about what happens in the naive classifier case: if we classify all the records as non-buyers, the 1 percent of buyers would also be classified as non-buyers and we would have a 1 percent error. We also have a hypothetical data mining model whose results take the form of this classification matrix, where 970 members are correctly classified as class 0 members and 8 members are correctly classified as class 1 members; the others are misclassification errors.
That leads to 20 plus 2, the off-diagonal elements, and a 2.2 percent error. We talked about how, despite this higher misclassification error, this data mining model could be preferable to us because of the profits involved, the value that the organization can get from selling a particular item to a customer.
We saw that through this example, where we had a matrix of profit, when the focus is on profits, and also a matrix of cost. In the matrix of profit we saw through this exercise that a 60 rupee profit could be made from the whole data mining exercise, from all the buyers, and from the cost perspective, through the matrix of cost, we had 48 rupees. So, when we take profit and cost into account, our purpose is maximization of profit or minimization of cost; but if we look for an improvement in the classification model itself, we would not see much by using these figures alone.
So, therefore, we talked about another performance metric, the average misclassification cost.
(Refer Slide Time: 03:19)
This was the formula: the average misclassification cost, as we talked about, measures the average cost of misclassification per observation, and this is the formula that we discussed in the previous lecture. We also noted that in this formula the counts n01 and n10 can be considered constant, and therefore the minimization of this quantity essentially depends on the ratio of C0 and C1, the costs of misclassifying a class 0 observation and a class 1 observation. It is much easier for us to estimate that ratio than the individual costs, and therefore the software, when it tries to minimize the average misclassification cost, would essentially be doing so on the basis of this ratio.
We also talked about why many commercial statistical software packages will ask you to specify whether there are any misclassification costs, and why some software will also ask you to specify prior probabilities. So, why would we require prior probabilities in the first place? Sometimes it happens that we build our model on a particular training partition, and the ratio of the different classes does not remain the same when we apply the model to new data; that is, the proportion of class 0 members to class 1 members might not be the same in the new data as in the sample data.
So, if the sample we are going to use to build our model does not have the same proportion of class 0 members to class 1 members as there is in the real data, the data that would also be used for testing, then we might incorporate that in our modeling process. We do that by specifying the prior probabilities, p0 for class 0 members and p1 for class 1 members. Even in this case, taking the ratio of these two prior probabilities suffices to minimize the expression: we would essentially be minimizing in terms of p0 / p1 and c0 / c1, and this is the expression that would effectively be minimized by the software instead of the full expression.
Now, we also talked previously about the importance of the lift curve and how it can be used to evaluate the effectiveness of a model. We saw, in comparison to a random selection case, how the lift curve can tell us about the lift we will get from a particular model as the number of cases increases, and we also saw this through the decile chart. Can we generate a lift curve while incorporating cost? Let us do an exercise where we do this. So, let us open RStudio.
(Refer Slide Time: 07:07)
As usual, we will first load this particular library. The data set that we are going to use is again in the same workbook, cutoff data.xlsx. So, now let us have a look at the data as well.
(Refer Slide Time: 07:31)
The data set that we are going to use is this one. You would see that there are 5 variables, and they are based on the model results. We have a serial number representing the index of each observation, and the data has been sorted based on the estimated probability of class membership, specifically class 1. You can also see that the actual class of each observation is given in the third column, along with the probability of class 1. Any observation with a probability value of more than 0.5 would be predicted as class 1, otherwise as class 0; a small R version of this rule is given below.
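Just to make the cutoff rule concrete, here is a tiny sketch; the probability values are made up and the vector name is an assumption, not a column from the lecture's sheet.

# 0.5-cutoff classification rule on estimated probabilities of class 1
prob1 <- c(0.91, 0.78, 0.52, 0.44, 0.20)       # illustrative probabilities
predicted_class <- ifelse(prob1 > 0.5, 1, 0)   # 1 1 1 0 0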
Now, in this data set you can also see that we are trying to incorporate the misclassification cost concept that we discussed: the cost of sending the offer has been specified as 1 rupee and the value of a buyer as 10 rupees. At this point I would also like to tell you that this is a small data set, so the plot that we are going to generate will look slightly different from what we would get with a full, bigger data set. From this example you can also see that there are 12 class 0 members and 12 class 1 members in this small data set. Depending on the actual class, I have specified the cost; you can look at the Excel formula: if the actual class is 1 then the value of a buyer, 10 rupees, is taken, and if it is 0 then the cost of sending the offer is used. Once this value is there, we can also find the cumulative cost, which you can see here: for the first observation the cumulative cost is the same as the cost, and for each subsequent observation we add the previous cumulative value.
Continuing in this fashion up to the last observation, we have the cumulative cost, and essentially in the lift curve we will plot this cumulative cost against the index of the observation. So, let us open RStudio. You would see there is a slight change in the function read.xlsx that we are going to use here.
(Refer Slide Time: 10:27)
You would see that I have set the sheet index to 5, the 5th worksheet, and also used the column index and other arguments. We just want to import the data of the first 5 columns, that is, columns 1 to 5; the other details, the cost of sending the offer and the value of a buyer, we do not want to include in our imported data set. Let us execute this line.
We get the appropriate data set into our environment. Let us look at the data that we are going to use for this lift curve generation. We have 5 columns: the serial number representing the index of the observation, and the probability of class 1; this data set has been sorted from higher to lower estimated probabilities produced by the model. It is a small data set with 24 observations, 12 belonging to class 1 and 12 belonging to class 0. The third column gives the actual class of each observation, and then we have the cost.
We have specified the cost of sending the offer as 1 rupee and the value of a particular buyer as 10 rupees. Therefore, if the offer is accepted by a buyer, the net value that the organization gets from that buyer is 10 rupees, and if the offer is not accepted then it costs 1 rupee, the cost of sending the offer.
You can see the Excel formula has been specified appropriately: if the actual class is 1, the value of the buyer, 10, is entered as the net value; if the class is 0, minus the cost of sending the offer is entered. In this fashion, for every observation, sorted by the estimated probability of belonging to class 1, we can fill in the cost. So, the numbers in this cost column include the net value from a buyer as well as the cost of sending the offer. The cumulative cost has then been computed using this cost column: for the first observation the cumulative cost is the same as the cost, and for the second observation we add the cost incurred for the first one.
In this fashion the previous cost keeps getting accumulated and we get the variable cumulative cost. Let us import this data into the R environment. You can see some changes in the read.xlsx call: the worksheet that we are calling is number 5, which you can see in the Excel file. Otherwise, you can also import the data from this worksheet by using its name, which could have been provided as well.
Right now we are going to use the index, and you can also see the columns that we want to import, column numbers 1 to 5, specified through the colIndex argument; the cost and value details to the right we do not need for plotting the lift curve. So, let us import this data set; you can see 24 observations of 5 variables. Let us also drop any empty columns, if there are any.
(Refer Slide Time: 14:47)
Let us look at the first 6 observations; you can see the cumulative cost is there. Now we are going to generate a scatter plot between these 2 variables, the serial number and the cumulative cost. Let us look at the range of these 2 variables: the range is 10 to 112 for the cumulative cost, and for the serial number, as we know, there are 24 observations. So, the axis limits have been specified accordingly, from 0 to 25 and from 5 to 140, so that all the values are covered.
(Refer Slide Time: 15:32)
Let us create the scatter plot; you can see this plot, and let us zoom in. Let us also generate the reference line: in this case the reference line connects the initial point, which we have taken as (0, 0), with the last point, the 24th observation and its corresponding value; let us confirm using the Excel file, it is 108.
(Refer Slide Time: 16:09)
Let us generate this line as well, and a legend, and then look at the zoomed plot. A small R sketch pulling these steps together is given below.
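This is a minimal sketch of the import-and-plot steps just walked through; the file name, sheet index, column layout and the assigned column names are taken from the lecture's description and are assumptions about the workbook rather than guaranteed to match it exactly.

# Cumulative-cost lift curve from the 5th worksheet of the cutoff workbook
library(xlsx)

df <- read.xlsx("cutoff data.xlsx", sheetIndex = 5, colIndex = 1:5)
names(df) <- c("serial", "prob1", "actual", "cost", "cum_cost")  # assumed layout

# Recompute the running total, mirroring the Excel cumulative-cost formula
df$cum_cost <- cumsum(df$cost)

plot(df$serial, df$cum_cost, xlim = c(0, 25), ylim = c(5, 140),
     xlab = "Number of cases (rank-ordered)", ylab = "Cumulative cost (net value)")

# Reference line from (0, 0) to the last point, here (24, 108)
lines(c(0, max(df$serial)), c(0, df$cum_cost[nrow(df)]), lty = 2)
legend("topleft", legend = c("Model", "Reference"), pch = c(1, NA), lty = c(NA, 2))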
(Refer Slide Time: 16:16)
This plot essentially shows the cumulative cost on the y axis and the number of cases on the x axis, with a dotted reference line connecting (0, 0) to the last point. You can see the cumulative cost is continuously increasing as the number of cases increases, moving along the x axis from left to right, but at the end there is a slight dip. The same thing is reflected in the Excel sheet: the maximum value we have is 112, after that there is a dip, and the last value is 108.
If we had a much larger data set with many more observations, this dip would be even bigger, to the extent that the slope of the reference line could come down much more and could even become negative. For example, if we consider a case with, say, 20000 observations, the most probable ones would be selected in the initial part of the plot and the non-buyers would come later, decreasing the value of the last point; some part of the plot might even go below the x axis, and therefore the slope of the reference line could become negative.
You can see that the optimal point is where the value is maximum, which is 112 at the 20th observation; that is where we are getting the optimum value. So, let us go back to our discussion. That was the lift curve; the lift curve can be used to find the most probable cases through rank ordering.
(Refer Slide Time: 18:29)
It tells us how many of the most probable cases should actually be sent the offers first, and what the optimal point is, that is, how many customers we should actually send the offer to; in a business context that is what is most desirable. Now, for the lift curve there are 2 scenarios. The first is the one we just generated, lift versus number of records, where the goal was to see the cumulative cost, locate the optimal point, and identify the buyers who are more likely to accept the offer.
We can also generate a lift curve versus cutoff values, if we are interested in finding the cutoff value where the model performs better; that can also be done. So, depending on the goal, whether we are looking for a suitable cutoff value or whether we are checking the effectiveness of the model and the number of most probable records, we can plot the lift curve accordingly.
Now, the discussion we had about asymmetric misclassification cost and the other concepts before that was mainly for the 2-class scenario, class 0 and class 1, where class 1 was our class of interest. Can all these concepts be extended to the m-class scenario?
(Refer Slide Time: 20:21)
If we extend asymmetric misclassification cost to the m-class scenario, that is m greater than 2, we will have a classification matrix with m rows and m columns. This is a much bigger classification matrix, and therefore the kinds of computations and discussions that we had become even more complicated. We will also have to deal with m prior probabilities if the sampling is distorted and the proportions in the original data are different.
Similarly, if we look at the misclassification costs, there could be m times (m minus 1) of them, so understanding the different misclassification costs also becomes more complex. The lift chart that we generated for the 2-class case cannot be produced in a multiclass scenario unless we identify one class as the class of interest, combine the other classes, and reduce the problem to a 2-class scenario; only then could it be used.
Otherwise, the discussion related to prior probabilities and misclassification cost can easily be extended, but the actual implementation becomes more complicated. The next point that we want to discuss is oversampling of rare class members. Sometimes the class of interest has very few members, and if we build a model with very few class members the model might not be really useful. In this case we are required to do oversampling of the rare class members.
That brings us to our discussion on simple random sampling versus stratified sampling, for the case when we have a rare class of interest with very few members belonging to that class.
(Refer Slide Time: 22:42)
Simple random sampling might not give us good enough partitions to build models: if the training partition is randomly drawn, we might not get enough cases belonging to the class of interest, and that can affect our model and, later on, its implementation on new data. In such a situation we have to do oversampling, as I said, and stratified sampling is generally used for this kind of task, especially when one group dominates and only very few members of the other group are present.
So, what could the different oversampling approaches be? One way is to sample more rare class observations; this is the equivalent of oversampling without replacement. From the data set that we have, we sample more of the class 1 members, the class of interest. This is the more desirable approach, but there are some practical problems we might encounter, for example a lack of an adequate number of rare class observations: what if, in the data set itself, the rare class observations are so few that even sampling more of them does not yield a meaningful sample and therefore meaningful modeling?
Those are practical problems we might have to face. The ratio of costs is also difficult to determine when very few rare class members are present. Another approach is to replicate existing rare class observations; this is the equivalent of oversampling with replacement. Some of the observations that are present in our data set are copied, replicated, and then used for the modeling. A small sketch of both approaches is shown below.
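Here is a minimal sketch of the two oversampling approaches just described; the pool sizes, the target counts and the variable names are illustrative assumptions, not figures from the lecture.

# Two ways to oversample a rare class (illustrative pool: 50 class 1 vs 950 class 0)
set.seed(1)
pool <- data.frame(id = 1:1000, y = c(rep(1, 50), rep(0, 950)))
class1_ids <- pool$id[pool$y == 1]
class0_ids <- pool$id[pool$y == 0]
n_per_class <- 40                                   # desired records per class

# (a) Sample more rare class observations (oversampling without replacement):
# take a disproportionately large share of the available class 1 records
idx_a <- c(sample(class1_ids, n_per_class),
           sample(class0_ids, n_per_class))

# (b) Replicate existing rare class observations (oversampling with replacement):
# useful when even the full set of class 1 records is too small
idx_b <- c(sample(class1_ids, n_per_class, replace = TRUE),
           sample(class0_ids, n_per_class))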
So, that is another approach. The typical solution adopted by analysts is to sample an equal number of members from both the classes.
(Refer Slide Time: 25:16)
What they generally do is take the same number of members from class 1 and from class 0, and then use that for their analysis. Now, if you have done your modeling using an oversampled training partition, even if the oversampling is only for the rare class, then for the performance evaluation you have to adjust for that oversampling.
When we use a model that has been developed on an oversampled partition, we will have to score it, and one approach is to score a validation partition without oversampling, that is, a validation partition that does not contain the oversampled cases.
That is the easier, direct and straightforward approach: build your model on the oversampled training partition, and evaluate it using a validation partition taken from the original data set. Another approach is to use an oversampled validation partition and then remove the oversampling effect by adjusting weights. So, these are the 2 approaches. The typical steps taken in a rare class scenario are: build the candidate models on a training partition with 50 percent class 1 observations and 50 percent class 0 observations.
(Refer Slide Time: 27:06)
That is, we take equal numbers and then validate the models with a validation partition drawn using a simple random sample taken from the original data set. So, we will stop here, and the detailed steps of this procedure we will discuss in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 20
Performance Metrics - Part V: Oversampling Scenario
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture we were discussing performance metrics, and specifically, in the last part of the lecture, the oversampling approach for the scenario where we are dealing with rare class members, that is, where the members belonging to the class of interest are very few in the sample. We talked about different things related to this approach.
(Refer Slide Time: 00:52)
We talked about two oversampling approaches: one is that we can sample more rare class observations, and the second is that we can replicate existing rare class observations. We focused more on sampling more rare class observations, which is the equivalent of oversampling without replacement. Let us also understand a few more situations using some graphs.
(Refer Slide Time: 01:21)
For example, consider a 2-class scenario where the misclassification costs are equal, the two classes being class 1 and class 0. If there are many records belonging to class 0, represented by 0 in this graph, and only a few records belonging to class 1, then one particular model could be this line: it separates the records, creates largely homogeneous partitions and also minimizes the misclassification error. One partition is homogeneous, all its records belong to the same class, and in the other most of the records belong to class 0 with just one misclassification error.
So, that is how it works out when we are dealing with equal misclassification costs. Another scenario is when we have asymmetric misclassification costs in the same 2-class setting, class 1 and class 0. In such a situation we might be interested, as we discussed in the previous lecture, in identifying more of the class 1 observations even if it comes at the expense of misclassifying more class 0 observations. If we plot the same points, our model could now be represented by a different separator line: in the upper partition we have all the observations belonging to the class of interest, class 1, and the lower half is homogeneous, containing the class 0 observations. This is more desirable when we have asymmetric misclassification costs, meaning that there is one specific class of interest and we would like to identify more of that class, even at the expense of misclassifying some of the other observations.
Now, coming back to the oversampling scenario where the rare class members are very few in the sample, we talked about the two oversampling approaches that could be used: sample more rare class observations, the equivalent of oversampling without replacement, or replicate existing rare class observations. We also talked about the typical practice followed by analysts, which is to sample an equal number of records from both the classes. And we talked about what has to be done when we follow this practice, irrespective of which oversampling approach we use.
(Refer Slide Time: 05:24)
An oversampling adjustment has to be performed for the performance evaluation of the model. We talked about 2 scoring methods for the validation partition. The first is to build the model on the oversampled training partition and then test it on a regular validation partition, that is, a validation partition without oversampling, drawn from the original data. The second is to use an oversampled validation partition and then remove the oversampling effects by adjusting weights. Sometimes the number of rare class cases, the class of interest cases, is so small that even a validation partition without oversampling is not practical or useful; in such situations we have no choice but to oversample even the validation partition, so we would build our model on the oversampled training partition and also evaluate it on an oversampled validation partition. We also talked about the typical steps taken in a rare class scenario: as we discussed, an equal number of observations from both classes, that is, a training partition with 50 percent class 1 observations and 50 percent class 0 observations.
(Refer Slide Time: 07:00)
Then validate the models; the approach we would actually like to follow, approach 1, is to validate the model with a validation partition drawn using simple random sampling from the original, regular data set. To clarify these steps a bit more, we have listed the detailed steps. The first is to separate the class 1 and class 0 observations into strata; as we said in the previous lecture, stratified sampling is generally used for oversampling, and this very first step reflects that. So, whatever observations we have in our sample, we separate the class 1 observations from the class 0 observations and create 2 strata, 2 distinct sets. Because we will need at least a training and a validation partition, it is advisable to use half of the class 1 records: we randomly select half of them and put them into the training partition, and the remaining half of the class 1 records are reserved for the validation partition, or even for a test partition, as we will see in the next steps.
(Refer Slide Time: 08:56)
So, half the records from the class 1 stratum are randomly selected, and the remaining class 1 records are reserved for the validation partition. The next step is to randomly select class 0 records for the training partition equal to the number of class 1 records selected in step 2, so that we maintain 50 percent of records coming from each class. With this, up to step 4, we have been able to create our training partition, and we can build our model. Later on, when we need to evaluate the performance of that model, we will require the validation partition.
Step 5 onwards deals with the creation of the validation partition. Step 5 is to randomly select class 0 records so as to maintain the original ratio of class 0 to class 1 records in the validation partition, because our first, direct approach is to build the model on the oversampled training partition and test it on a regular validation partition. To build that regular validation partition, we randomly select class 0 records, and since we already have the other half of the class 1 records, we can maintain the original ratio that was there in the regular data set. In this fashion we can create a validation partition. This can be used for performance evaluation, but we may also want a test partition, because sometimes the validation partition is used for fine-tuning the model and therefore becomes part of the model-building process.
A test partition is then required to evaluate the true performance of the finally selected model. So, if you want a test partition as well, you can randomly draw a sample for it from the validation partition created in step 5. In this fashion we can perform our modeling: if the class of interest has very few cases we can oversample the training data set, prepare the validation partition so that it matches the proportions in the regular data set, and then go ahead and evaluate the performance. A minimal R sketch of these partitioning steps is given below.
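A minimal sketch of the detailed steps just listed (strata, a 50/50 training partition, and a validation partition restored to the original class ratio); the data frame, column names and class counts are illustrative assumptions, not the lecture's actual data.

# Steps 1-5: stratified over-sampled training partition and a validation partition
# at the original class ratio (illustrative data: 20 class 1 vs 980 class 0)
set.seed(1)
df <- data.frame(x = rnorm(1000), y = c(rep(1, 20), rep(0, 980)))

# Step 1: separate class 1 and class 0 observations into strata
class1 <- df[df$y == 1, ]
class0 <- df[df$y == 0, ]

# Steps 2-4: half the class 1 records plus an equal number of class 0 records
# form the training partition (50% class 1, 50% class 0)
n_half   <- floor(nrow(class1) / 2)
train1   <- class1[sample(nrow(class1), n_half), ]
train0   <- class0[sample(nrow(class0), n_half), ]
training <- rbind(train1, train0)

# Step 5: remaining class 1 records, plus class 0 records drawn to restore the
# original class 0 : class 1 ratio, form the validation partition
valid1     <- class1[!rownames(class1) %in% rownames(train1), ]
remaining0 <- class0[!rownames(class0) %in% rownames(train0), ]
ratio01    <- nrow(class0) / nrow(class1)
n_valid0   <- min(nrow(remaining0), round(ratio01 * nrow(valid1)))
valid0     <- remaining0[sample(nrow(remaining0), n_valid0), ]
validation <- rbind(valid1, valid0)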
Now, as we discussed, there might be situations where the first approach, a validation partition without oversampling, whose detailed steps we just went through, does not work.
(Refer Slide Time: 11:58)
We talked about how that might not remain practical or useful, the reason being that there are very few class 1 records: using half of them for the training partition and reserving the other half for the validation partition might not be feasible, and the modeling becomes a bit more complicated.
In that case we have to follow the second approach, which is to oversample even the validation partition used for evaluation. We use the oversampled validation partition and later adjust the weights to get rid of the oversampling effects: we apply the model that has been built on the oversampled training partition, test its performance on the oversampled validation partition, and then readjust the weights so that we can remove the oversampling effects.
Because the main idea here is the evaluation of the model, the adjustment that we require is on the validation classification matrix that we get when we test our model on the oversampled validation partition. That adjustment is required there, and the lift curve can also be adjusted accordingly. Now let us go through an exercise in Excel for what we have discussed so far. This is the scenario: we have an oversampled validation set.
(Refer Slide Time: 13:58)
The assumption is that the original response rate of the sample is 2 percent; that means the records belonging to the class of interest are just 2 percent, and the other 98 percent of the records belong to class 0. When we do oversampling we increase this proportion and make it 50 percent, so in the oversampled data set we have 50 percent of records belonging to class 1, the class of interest, and the remaining 50 percent belonging to class 0. You can see that if the validation partition size after oversampling is 1000, then the number of 1s in those 1000 records is 500 and the number of 0s is 500.
Now, let us say we build our model on the oversampled training partition and later apply it to the oversampled validation partition. As a result of that validation exercise we get this validation classification matrix; because of the higher-than-usual response rate due to oversampling, one example of such a matrix could look like this. You can see 390 class 0 members classified as class 0 and 420 class 1 members classified as class 1; in total there are 500 class 0 and 500 class 1 records, 1000 records in the sample. If we compute the misclassification rate in the fashion we discussed before, we take the off-diagonal elements, that is 80 and 110, and divide by the total number of records.
That gives us a misclassification rate of 19 percent. So, when we do oversampling this is the misclassification rate that we get, but it is computed on the oversampled validation partition, and it is going to be somewhat lower than what it would have been in the regular data set scenario. If we look at the percentage of records classified as class 1, we can compute that as well: it comes out to be 53 percent.
So, 53 percent of the records have been classified by the model as belonging to class 1. Now, to evaluate the true performance of the model we need to adjust the weights. There are 2 ways: either we remove 1s so that we get back to the original proportion, or we add 0s to achieve the same thing. The typical strategy is to add 0s to reweight the sample to the original proportion. To see how that can be done, let us say the validation partition size after reweighting is x.
We are going to use one of the utilities available in Excel, specifically Goal Seek; this is an easy equation that we could also solve manually, but since we are using Excel we will use the utility. If we want to add extra 0s so that we reach the original proportion, we need to find the new sample size, which will be much bigger. This is the equation that gives us the new sample size: earlier it was 1000, and now we want a new sample size that has the original proportion of the different class members.
There are 500 1s and we are not going to change that figure; we only want to add 0s. Those 500 1s should now represent the original response rate of 2 percent of the reweighted partition, and therefore we need 98 percent of the records to be 0s. So, we can write the equation as 500 plus 0.98 x equal to x, where x is the total number of records in the reweighted sample; if we solve for x we get the new sample size after the weight adjustment. The same equation can also be written as x minus 0.98 x equal to 500, which is more convenient for using the Goal Seek function in Excel: we want to solve for x, and this is the cell that we have reserved for that value. (The same calculation done directly in R is sketched below.)
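What Goal Seek is asked to solve here can also be done in a line of R; the numbers are the illustrative figures from the Excel example.

# Keep the 500 class 1 records as 2% of the reweighted partition: x - 0.98 * x = 500
original_rate <- 0.02
n_ones  <- 500
x       <- n_ones / original_rate   # 25000 records after reweighting
n_zeros <- x - n_ones               # 24500 class 0 records to be represented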
(Refer Slide Time: 20:22)
To run it, you go into the Data tab, then What-If Analysis, and there you find Goal Seek. In the 'Set cell' box we specify the cell where the formula is, the formula representing the expression x minus 0.98 x, which we have written in cell C16. The value that we are targeting is 500, the right-hand side of the equation, and the cell we are changing is C15, which represents x.
So, we want to change that cell, and if you run it you will get these values; they are already there because I had run this Goal Seek utility before. If you are interested in the formula, it represents the expression x minus 0.98 x and is used by the Goal Seek function as we just saw: C15 is x, and 0.98 is computed as 1 minus B2, where B2 is our original response rate. It is in percentage notation, which Excel takes care of, converting it to 0.02, so 1 minus 0.02 becomes 0.98. Then we subtract 0.98 times x, x again being C15.
Once we run the Goal Seek function we get the value 25000: the size of the validation partition after reweighting is 25000. There were 500 1s, so the remaining number of 0s can be very easily computed, using a formula or manually, for this particular case.
That gives 24500 0s. Once we are done with this calculation, we can adjust the validation classification matrix using this information: the new sample size, the number of 1s and the number of 0s. In the new classification matrix we can fix these totals: the number of 0s has been fixed using 24500, the number of 1s using 500, and the overall total using C15; then we need to adjust the values in the two cells for the class 0 members.
The class 1 members remain unchanged: there were 500 of them and we do not want to make any change, so those values stay the same as in the previous matrix. The other 2 values we need to find are the class 0 cells. The way to do this is to keep the ratio that was there in the matrix we got from the model: 390 and 110 out of a total of 500 class 0 observations, that is, 390 divided by 500 and 110 divided by 500. Maintaining this split between predicted class 0 and predicted class 1, we can compute the new numbers.
The total we already have, so you can compute it: 390 divided by 500 is the ratio, and the total number of class 0 observations is 24500, which gives the new number of correctly classified 0s in that cell. Similarly, the other cell is computed as 110 divided by 500, as represented by C8 divided by D8, multiplied by the total of 24500. The numbers have thus been calculated by actual class, and we can also get the number of records predicted as class 0 and predicted as class 1 using the SUM function. In this fashion we get the new validation classification matrix after reweighting.
Now, once the weight adjustment has been done, we can use this new matrix to compute the adjusted misclassification rate. Following the same procedure as before, we look at the off-diagonal values to find the error, and we get the new number, 21.88 percent. The earlier misclassification rate was 19 percent, slightly lower than 21.88 percent; so, once we remove the oversampling effect, the misclassification rate jumps by almost 3 percentage points. And if we look at the percentage of records classified as class 1, it has come down: earlier we had 53 percent on the oversampled validation partition.
The new percentage, which you can compute as C22 divided by D22, is 23 percent of the records classified as class 1. A small R sketch of this whole reweighting calculation is given below.
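This is a minimal sketch of the weight adjustment just carried out in Excel; the confusion-matrix counts are the illustrative figures from the lecture's example, and the row-scaling mirrors the ratio-preserving calculation described above.

# Oversampled validation classification matrix: rows = actual, columns = predicted
cm_over <- matrix(c(390, 110,
                     80, 420),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

over_error <- (cm_over["0", "1"] + cm_over["1", "0"]) / sum(cm_over)   # 0.19

# Reweighting: class 1 row kept at 500 records, class 0 row rescaled to 24500
n_zeros_new <- 24500
cm_adj <- cm_over
cm_adj["0", ] <- cm_over["0", ] / sum(cm_over["0", ]) * n_zeros_new    # 19110 and 5390

adj_error  <- (cm_adj["0", "1"] + cm_adj["1", "0"]) / sum(cm_adj)      # ~0.2188
pct_pred_1 <- sum(cm_adj[, "1"]) / sum(cm_adj)                         # ~0.2324, about 23%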
So, let us go back to our slides. In this fashion, if we have to use the oversampled validation partition, this is how we go about evaluating the performance of our model: we adjust the weights to get the new validation classification matrix, and then use that to compute the misclassification error.
The lift curve can also be adjusted appropriately, so that we can again assess the effectiveness of the model in the oversampled validation partition case; the same point is made here, a lift curve on an oversampled validation partition.
(Refer Slide Time: 27:51)
How do we create that? The same steps that we followed earlier for the lift curve can be followed here. Let us go back to our Excel file once again and look at what we did when we created the lift curve.
(Refer Slide Time: 28:08)
This was the data, and these were the steps: we sorted the estimated probability scores in decreasing order, from the highest values to the lowest; the actual class was given in one more column; and we had the net value, because this was the case where we were incorporating the net value, and we plotted the curve accordingly. The net value column was used to compute the cumulative value column, and those values, together with the serial number column, were used to plot the lift curve. What we need to change now is one intermediary step: we need to multiply the net value of a record by the proportion of class 1 records in the original data.
So, the net value of a record, for example, if we look at this particular column, the net value column, we need to multiply this value by this ratio, the proportion of class 1 records in the original data. That can then be used to compute the new values, those values can further be used to compute the cumulative values, and once we have those cumulative values we can plot our lift curve on the oversampled validation partition.
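As a reference, a minimal R sketch of this reweighing step as described above is given below; the data frame and the column names prob and net_value are hypothetical, and the class 1 proportion in the original data (here 0.01) is an assumed value used only for illustration.

    p1_orig <- 0.01                              # proportion of class 1 in the original data (assumed)
    df <- df[order(-df$prob), ]                  # sort records by estimated probability, highest first
    df$adj_net_value <- df$net_value * p1_orig   # reweigh the net value of each record
    df$cum_value <- cumsum(df$adj_net_value)     # cumulative value after reweighing
    plot(seq_len(nrow(df)), df$cum_value, type = "l",
         xlab = "# cases", ylab = "Cumulative value")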
Now that lift curve would be adjusted for the oversampling effect, and we would be able to find out the effectiveness of a particular model. There are a few more things that we can discuss, especially in a 2 class scenario. One is that sometimes there would be some records which might not be appropriately or correctly classified by our models. So, can we have some other way to overcome this problem? The records which are difficult to classify by our model can be labelled with a third class option, "cannot say".
Now, once this kind of modeling is done for all the records, most of the records would be classified as class 1 or class 0, and the few records which are difficult to classify can be labelled as "cannot say"; expert judgment can then actually be used to decide whether to classify them as 1s or 0s. So, this kind of configuration can also be used in a 2 class scenario. We will stop here and we will continue our discussion on performance metrics in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Prof. Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 21
Performance Metrics - Part VI: Prediction Performance
Welcome to the course Business Analytics and Data Mining Modeling Using R. So, in the previous lecture we were discussing performance metrics, and we discussed specifically the oversampling scenario, when we have a rare class of interest with very few records. We also talked about, at the end of that particular lecture, the 2 class scenario where some observations are difficult to classify by our model, and we can have a third option, "cannot say"; those records can then be manually inspected by experts and appropriately classified. After that we come to the last leg of this particular module and this particular topic, performance metrics. So, we will discuss prediction performance. Till now, whatever we have been talking about was applicable to classification performance; now we have come to the last part, that is prediction performance.
So, when we talk about prediction performance, we are going to focus on the continuous outcome variable; generally we deal with a continuous outcome variable here. In classification performance we had a categorical outcome variable.
So, most of the things there were dependent on the classes, the 2 class scenario and its extension into the m class scenario, and all those things we have already discussed. Now, in prediction performance the focus is on a continuous outcome variable. Here again we need to touch upon a few things, specifically classical statistical modelling versus data mining modelling: predictive accuracy is the kind of metric that we generally use for prediction performance in data mining modelling, while goodness of fit measures are generally used in statistical modelling.
So, the main idea behind this difference we have already discussed, but to be specific in this particular topic, performance metrics: with goodness of fit in statistical modelling, the main idea is to fit the data as closely as possible. We have one sample and we do not do any partitioning; the same sample, generally primary data, is collected in statistical modelling, the hypotheses that we have are tested, we build our model on that sample, and the same sample is then used to find out the significance of that model.
So, that is generally done using goodness of fit measures, which typically measure how well the model is fitting the data. When we talk about data mining modelling, about predictive analytics, we are focusing on how well we can predict new observations, the future data; there we focus more on predictive accuracy. Therefore, in these two settings the scenario is different, and the measures that we use to assess the performance of the models are also different.
Now, let us discuss a few examples for classical statistical modelling. One would be the goodness of fit measures that are generally used in regression modelling in a statistical setting. These are the 2 measures that we have shown here, R squared and the standard error of estimate. Generally these 2 measures are used to understand how well the model is fitting the data in a statistical setting.
So, R squared, as we have talked about this particular metric before as well in the supplementary lectures, in a way captures the variability in the outcome variable: how much variability in the outcome variable is actually being explained by the model, by the statistical model. Variability, when we talk about it, is about the spread of the data that is there. So, essentially it boils down to the same thing: how closely the model is fitting the data, how much variability in the outcome variable is being explained by the model or with the help of the predictors' information.
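As a reference, a minimal R sketch of how these two goodness of fit measures can be obtained from a fitted regression model is given below; the data frame df and the outcome variable y are hypothetical names used only for illustration.

    fit <- lm(y ~ ., data = df)   # fit a linear regression on the full sample
    summary(fit)$r.squared        # R squared: share of variability in y explained by the model
    summary(fit)$sigma            # standard error of estimate (residual standard error)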
The standard error of estimate that we see in regression modelling also, in a way, tells us how closely the relationship between the predictors and the outcome variable is being followed by the model, in this particular example the regression model. So, the standard error of estimate would actually indicate that fit. Residual analysis is also performed to understand how well the model is fitting the data in a statistical setting: we do some analysis, we apply some visualization techniques as well while we analyse residuals, and we try to understand how closely the model is fitting the data and how that is being reflected in the residual series. Then we come to predictive analytics, to data mining modelling.
So, prediction accuracy and prediction error on the validation partition are the metrics there. Just like the classification performance that we have been talking about, where the metrics are computed on the validation partition, in the same fashion, even for prediction tasks, to evaluate the prediction performance of the model the metrics are all computed on the validation partition, whether it is prediction accuracy or prediction error. That is where we compare the performance of different candidate models and then try to select the best model, the most useful model, using these metrics.
Now, as we talked about in the classification case, the naive rule provides us the benchmark, the baseline model. There we said that if there are m classes, the naive rule would be, for any new record, to assign it to the most prevalent class. So, that becomes the benchmark rule in classification.
So, the benchmark rule does not incorporate the predictors' information: when we say that any new record can be assigned to the most prevalent class, we are not analyzing the predictor information, and that is used as the benchmark, as the baseline model. Similarly, in the case of prediction we use the average value, the average value of the outcome variable, as the benchmark criterion. The average value becomes the reference line, the baseline model, and that average value is actually used to compare the performance of the model.
So, all the candidate models that we might build would be compared to this particular baseline model. The average value is, in a way, the naive rule equivalent for the prediction task. Now let us discuss a few metrics that are applicable in prediction tasks. The first one is the prediction error. For any record i we can always compute the error. Once we build our model on the training partition, we apply the model to the records that are there in the validation partition, or for that matter in the training partition or the test partition. For every record the model is going to give us a score; that is going to be the predicted value, and we also have the information on the actual value of that particular record. The difference between the actual value and the predicted value is defined as the prediction error. So, the prediction error for a particular record is defined as the actual value minus the predicted value; mathematically it can be denoted as e_i = y_i - ŷ_i, where ŷ_i is the predicted value for record i. In this fashion we generally denote it.
Now, there are some popular predictive accuracy measures which are generally used to assess the performance of prediction models. We will discuss them one by one; the first one is average error. If you look at the average error, as the name suggests, and at the formula that is written over there, you can see the errors for all the records, starting from 1 to n; that is, average error = (1/n) Σ e_i. When we say n, it could be any partition where we are trying to compute this particular metric, average error: it could be the training partition, the validation partition, or the test partition, but for performance evaluation it is computed on the validation partition.
So, for each observation, starting from 1 to n, we have the error value that we can compute using the prediction error formula that we just discussed, and we can sum these values and take an average; we divide this summation by n, and that gives us the average error. What does the average error indicate about the performance of the model? If the actual value is bigger than the predicted value, the error is going to be positive for that record; if the actual value is less than the predicted value, the error will have a minus sign before its value. Therefore, we will have both positive and negative error values.
So, when we take an average of all the error values for all the observations that are there, on an average level we will get either a plus or a minus sign. What does that indicate? A plus sign indicates overall under-prediction: on an average level the model is under-predicting the observations, the predicted values are falling below the actual values. If the average error value comes out as minus something, then what we understand from there is that on an average level the model is over-predicting the value of the outcome variable for those observations. So, whether the model is under-predicting or over-predicting can be understood from the average error.
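A minimal R sketch of this first metric is given below; the vectors actual and predicted are hypothetical names for the actual and predicted values on the validation partition.

    e <- actual - predicted   # prediction error for each record
    avg_error <- mean(e)      # average error
    # a positive value means the model is under-predicting on average,
    # a negative value means it is over-predicting on average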
(Refer Slide Time: 12:24)
Now, let us discuss the next metric. The next metric is mean absolute error or MAE, sometimes also called mean absolute deviation or MAD; you can see the formula, MAE = (1/n) Σ |e_i|. How is this being computed? As we talked about, the prediction error for a particular record could be positive or negative. Because, as the name suggests, we want to compute the absolute error, for the error of each record we take its absolute value, and then again we take the average. That is how we get the mean absolute error. This particular metric gives us the magnitude of error: on an average level we get the magnitude of error which we are getting from that particular model, how much error per observation is coming from the model. So, let us move to the next metric. The next metric is mean absolute percentage error or MAPE.
next metric is mean absolute percentage error or MAPE.
Now, as the name is suggesting. So, still we are trying to compute the absolute values,
but now we are interested in percentage values rather than the actual values. So, to
compute the percentage value you would see that the whole formula before the formula
we have this 100 percentage, is it is a multiplied by 100 percent. So, that the percentage
value can be computed; now we look at the actual expression the error value is being
divided by the actual value. So, that we get the difference and then absolute has been
taken. So, it could be on either side positive or negative side. So, we take the that
difference that quantum that is there, the using error values divided by the actual value
411
and we take the absolute value of it and then we take the average for including all the
observations and then the percentage value. So, what this particular metric tells us about
the model that, as you can see in the slide that. On an average level what is the
percentage deviation from actual value that is being given by the model. So, what is the
percentage. So, how the values or you know the kind of deviation that is the predicted
values how what percentage point they are deviating from the actual values for that
particular model. So, that thing we can understand using this particular metric.
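Again a minimal R sketch, with actual and predicted as hypothetical vectors of actual and predicted values:

    mae  <- mean(abs(actual - predicted))                   # mean absolute error (MAD)
    mape <- 100 * mean(abs((actual - predicted) / actual))  # mean absolute percentage error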
Now, let us move to our next metric. This next metric is the more important one among the ones we have discussed, for reasons we will understand. This one is root mean squared error, also called RMSE. If we look at the formula for computing this particular measure, RMSE = sqrt((1/n) Σ e_i²), we sum all the squared errors, take the average, and then take the square root of it. You can see that when we take the square of a particular error, the sign of the prediction error, plus or minus, is taken care of; now we have squared values.
All these values are summed up and then divided by n, so we get the average, the mean squared error, and then we take the square root. Once we take the square root we go back to the same scale: it is in the same scale as the prediction error that is computed on the outcome variable. So, RMSE is very similar to the standard error of estimate, computed on the validation partition. We talked about the standard error of estimate in the context of statistical modelling, where we talked about R squared and the standard error of estimate. If we want one metric in predictive analytics which is quite close to the standard error of estimate that we get there, RMSE is that metric. So, this is something that we get for the model.
From this we can understand, on an average level, with respect to the outcome variable, what the error is. Another advantage of root mean squared error is that it is measured in the same unit as the outcome variable. Because of these reasons, RMSE is one of the popular metrics to evaluate the performance of a model. Whenever we are comparing the performance of different candidate models, RMSE is the one value on which we will rely to evaluate and compare the performance; the main reasons being that it is in the same unit as the outcome variable and quite similar to the standard error of estimate that we have in a statistical study. So, for that purpose we do use that metric.
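A minimal sketch in R, again with hypothetical actual and predicted vectors:

    rmse <- sqrt(mean((actual - predicted)^2))   # root mean squared error, in the unit of the outcome variable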
Now, another metric that we might be interested in is the total sum of squared errors, sometimes called total SSE or simply SSE.
How is it computed? Simply take all the errors, square those errors, and then sum over all the observations; that gives us the total SSE. What does this particular metric indicate? It gives us an overall sense of the error that is being given by the model. So, if we are comparing models, we can look at the SSE value, what the SSE value was for model 1, model 2, model 3 and onwards, and this particular value will also give us a sense of which model is performing better. Though the scale of these errors would be different, comparison is still feasible.
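As a sketch, with the same hypothetical vectors:

    sse <- sum((actual - predicted)^2)   # total sum of squared errors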
So, this is to give an overall sense of the error for a particular model. Now, these measures that we have talked about for prediction tasks, where can we use them? We can use them to compare the candidate models, as we have been discussing, and we can also assess the degree of prediction accuracy that is there. What is the problem with these metrics, can they always be used, or can there be a few issues? Outlier related issues could be there. For example, all the formulas that we talked about generally consider all the observations, and when they consider all the observations they include all the prediction error values, irrespective of whether a value is lying within the majority of the values or whether it is an outlier value. That might complicate the assessment of models.
Another way would be to apply visualization techniques; for example, we can generate a histogram or a box plot of residuals, and from there we can make the same kind of observation we make for a normal distribution when we check for skewness. From similar observations we can then check for the outlier influence.
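A minimal R sketch of these checks, assuming res holds the residuals (actual minus predicted) on the validation partition:

    res <- actual - predicted
    hist(res, main = "Histogram of residuals", xlab = "Residual")   # check skewness and extreme values
    boxplot(res, main = "Box plot of residuals")                    # points beyond the whiskers flag outliers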
Now, another point that we need to understand here is that when we build a model in a statistical setting and when we build a model in a data mining setting, in the data mining setting we are going for higher predictive accuracy and in the statistical setting we are going for the best fit of the data. So, the models that we get in these 2 different settings may or may not be the same. The model that we get when we build a model in a data mining setting, following high predictive accuracy using these metrics, might be different from the model that we get when we build a model in a statistical setting, where we are looking for the best fit of the data.
Now, can there be visualization techniques which could be used to evaluate the performance of prediction models? The lift curve is one which can be used. With the exposure to the lift curve that we have till now, we have been using it quite often; it is going to be relevant only when records with the highest predicted values are sought. As we have seen before, for a lift curve we compute some values, take the accumulated numbers, plot those accumulated values against the number of cases, and then try to assess how much lift we get in comparison to the baseline model.
So, in prediction tasks, consider a situation like the sedan car data set that we have been using: there is a company and it has multiple channels to sell its sedan cars. Different channels might give it different kinds of revenue, especially if the data set is, for example, for used cars. For used cars it becomes difficult to assess the price; for new cars the price is fixed from the manufacturer's side, but for used cars the prices might vary, and therefore the question is how different channels can be used by the firm to sell those cars.
So, the lift curve can be useful in that sense. For example, a particular firm might have a few of its own channels, and it might also use some channels operated by third parties. The firm might look to sell the highly priced cars through its own channels, so that it can make more revenue and therefore more profit, and for some of the low value used cars it might go with the third party channels.
That can be done using the lift curve. So, we will do an exercise in R to see how this can be done.
So, this is the small data set that we have. You can see we already have a column called value: a serial number is there, then the value is there. This value of the car can be considered for premium sedan cars, so the values range from about 8 lakhs to 50 lakhs. Random functions were used to generate these values, and then these values were later sorted in decreasing order. As you can see, once they were sorted, the cumulative scores were also computed, as we have been doing before as well. So, you can see the cumulative scores have been computed.
So, this is how you can generate the data. Now, what we are interested in are, for example, the higher value cars: which first 10 cases have the highest value. We would like to identify those high value cases.
So, let us open RStudio and load the library xlsx. The data set that we have just seen is in this particular file, cutoff data; let us import this particular data set. Let us remove the NA columns and look at the first six observations; we have already seen these observations and the cumulative value. We would like to plot these cumulative values with respect to the serial numbers that are there.
(Refer Slide Time: 26:32)
So, let us look at the range that is there. The range is between 49 and 544; this is the cumulative value range that we have. For the serial numbers, I think we have just 20 records.
So, the range is going to be 1 to 20. You can see the plot call, the way we have been doing before as well: the first argument is going to be plotted on the x axis, that is the serial number; the second argument, cumulative value, is going to be plotted on y; the type of the curve is line; the labelling for the x axis and the y axis, cumulative value for the y axis, is given; and the limits have been appropriately specified. As you can see, we have already computed the range, 0 to 25 for x and then 40 to 550 for y, which covers the entire range that we saw.
So, let us execute this line of code, and this is the plot that we get. As you can see, a smooth lift curve has been created. Now let us also draw the reference line. The reference line would connect the initial point with the last point that we have; let us draw this.
So, this is the reference line for the lift curve that we have. There is good enough separation between the lift curve and the reference line; therefore, the model is useful and gives us some effectiveness in terms of identifying those high value cars. So, for high value used sedan cars the model is giving us some usefulness, and we can take those cars and sell them using our own channels, and the other cars, the low value cars, we can probably push through third party channels.
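A minimal R sketch of this lift curve with its reference line and legend, assuming the imported data frame df has columns SNo and Cum_Value (hypothetical names) with the values already sorted in decreasing order:

    plot(df$SNo, df$Cum_Value, type = "l",
         xlab = "# cases", ylab = "Cumulative value",
         xlim = c(0, 25), ylim = c(40, 550))
    # reference line joining the first and last points of the curve
    n <- nrow(df)
    lines(c(df$SNo[1], df$SNo[n]), c(df$Cum_Value[1], df$Cum_Value[n]), lty = 2)
    legend("bottomright",
           legend = c("Cumulative value sorted by predicted value", "Reference"),
           lty = c(1, 2))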
So, let us also create the legend; you would see the cumulative value sorted by predicted value from this lift curve, and the reference line. Now, the information that we have just extracted using the lift curve can also be obtained using the decile chart, as we had done before for classification tasks. So, let us create a decile chart for the prediction task. In this case also we would like to have 10 deciles, each decile representing 10 percent of the cases, then 20 percent, 30 percent and so on.
(Refer Slide Time: 29:09)
So, we would like to create this variable which will have the information on the number of observations covered by each decile. You can see the sequence, whose length is based on the total number of observations, and the observations would be appropriately distributed across the deciles; you would see 2, 4, 6, 8, 10, 12, and in this fashion we will have the distribution.
Now, a few other variables: decile, that is where we will be computing the decile values that would be plotted using a bar chart later on, and decile mean, which we want to compute; let us initialize this value as NULL. Now, for the global mean in this case, you would remember how we computed the global mean for the classification tasks; here we just need to compute the average of this particular column and we will get the global mean. The global mean comes out to be 27.2 lakhs. Now let us initialize these counters.
(Refer Slide Time: 30:17)
Now, we are going to run this particular for loop over the decile cases, covering all 10 deciles. For each decile the loop computes the decile mean, which for every decile is actually nothing but the cumulative value that we have up to that decile divided by the number of cases in that decile.
That gives us the decile mean, and then we can divide the decile mean by the global mean, and that will be used as our decile value. So, let us execute this particular loop and we will have all the decile values here; you can see all the decile values, starting from 1.76, in decreasing order.
(Refer Slide Time: 31:10)
So, let us look at the range: the range is from 1 to 1.76. Now we are going to generate the bar plot which creates the decile chart. The decile values that we have just computed are passed in; the y limit of 0 to 2 covers the range of decile values; the label of the x axis is deciles, the label of the y axis is decile mean divided by global mean, and the arguments are appropriately specified for each decile. Let us compute.
So, this is the plot that we generate, this is the decile chart that we have. As you can see, decile one, representing the first 10 percent of the cases, has the value that we computed, 1.76. It gives us a lift of 1.76 in comparison to the baseline model, in comparison to the random selection scenario. So, for these first 10 percent of cases we would be interested in selling these used premium sedan cars using our own channels, because we might generate more revenue, more sales, through this. Similarly, for decile 2 also we get a good enough lift value; it is also more than 1.5, it is actually 1.7.
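A minimal R sketch of this decile chart, assuming df$Value holds the car values sorted in decreasing order of predicted value (hypothetical column name):

    n <- nrow(df)
    decile_cases <- round(seq(n / 10, n, length.out = 10))   # cases covered up to each decile
    global_mean  <- mean(df$Value)
    decile <- sapply(decile_cases, function(k) mean(df$Value[1:k]) / global_mean)
    barplot(decile, names.arg = paste0(seq(10, 100, 10), "%"),
            xlab = "Deciles", ylab = "Decile mean / Global mean", ylim = c(0, 2))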
So, in this fashion we can take the appropriate decisions up to the appropriate lift value; a lift value of more than one would be useful for us. In this fashion we can find out the number of cars that we would like to sell through our own channels. So, thank you, we will stop here. That concludes this particular module, and in the next lecture we will start the next module, that is on supervised learning methods, and the very first technique that we are going to start with is multiple linear regression.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 22
Multiple Linear Regression - Part I
Welcome to the course Business Analytics and Data Mining Modeling Using R. So, we would be starting our next module, that is supervised learning methods, and we are going to start with our first technique, that is multiple linear regression. Let us start. Multiple linear regression is one of the most popular models used for statistical modeling as well as data mining modeling. In most of the textbooks and other statistics related courses it is one of the first models that is generally discussed and covered.
Now, in the multiple linear regression model the main idea is to fit a linear relationship between a quantitative outcome variable, that is, Y, and a set of p predictors, for example, X1, X2, X3 up to Xp. One difference between statistical techniques and data mining techniques that we have described before as well is that generally in a statistical technique we have an assumed relationship between the outcome variable and the set of predictors.
So, for example, in this case of multiple linear regression, you would see that the first assumption that we are talking about in the slide is that the relationship, as expressed in the linear model equation Y = beta_0 + beta_1 X1 + beta_2 X2 + ... + beta_p Xp + epsilon, holds true for the target population; that means this linear relationship is actually assumed. Therefore, when we talk about statistical modeling we generally make certain assumptions about the data, certain assumptions about the structure of the data and the relationships between variables, and we also hypothesize a few relationships and then test them through different statistical techniques.
So, the same thing is applicable to multiple linear regression as well, and the first assumption is that the linear relationship, as expressed in this particular equation, is assumed here. Now, beta_0, beta_1, beta_2 and so on up to beta_p are all regression coefficients, and then we have the epsilon term, that is the noise or unexplained part. Generally, in explanatory models the predictors' information is used to explain the variability in the outcome variable, that is Y, and the noise is the unexplained part, something that we are not able to explain with the help of the predictors' information; that goes into the noise. Now, there could be 2 objectives for which we could use this particular technique, multiple linear regression. One objective could be understanding the relationships between the outcome variable and the predictors; this is the typical objective that is followed in a statistical approach. The second objective, which is generally followed in a data mining approach, is predicting the value of the outcome variable for new records.
So, as we will discuss in this lecture, depending on the objective our model building approach might remain the same to a large extent, but the results interpretation and the model evaluation would actually change; that is very closely tied to the objective. The first objective that we just talked about, understanding the relationships, is mainly explanatory in nature. The second objective, predicting values, is predictive in nature, and there predictive modeling would be required.
(Refer Slide Time: 04:42)
Now, applications in data mining: multiple linear regression has many applications in data mining situations, for example, predicting credit card spending, predicting life of equipment, predicting sales. Many such examples we have been talking about in this particular course, and multiple linear regression as a technique can be applied in many of those situations. Mainly, multiple linear regression is used to handle prediction tasks, and that is also very well reflected in the previous slide when we said that the outcome variable is quantitative in nature. When we said that the main idea of the model is to fit a linear relationship to this quantitative outcome variable, it follows that in a data mining situation, when we use multiple linear regression, we would essentially be predicting the value of this particular outcome variable; it is the prediction task where this particular technique is used.
Now, as we talked about, the selection of the model is particularly tied with the goal; the model building process might remain the same to a large extent, but the results interpretation will differ depending on the goal. This is explained in this particular slide with the example of, for instance, predicting the impact of a promotional offer on sales.
(Refer Slide Time: 06:25)
Even though we are trying to predict this impact, this is in a sense an explanatory task, and therefore the results interpretation and the model building exercise would slightly differ from the case where we have a goal of just predicting sales; that would be a predictive goal, a predictive task, and predictive modeling would be required. So, for these 2 examples that we just discussed, predicting the impact of a promotional offer on sales, which is explanatory, and predicting sales, which is predictive, the model building exercise, and later on the model evaluation and results interpretation, would differ between the 2 cases.
Now, as we discussed, the selection of a suitable data mining technique will also depend on the goal itself, whether the goal is explanatory or predictive. To discuss this a little bit more: most of the statistical techniques that we are going to cover in this course, including multiple linear regression, are going to be used in a predictive analytics setting, in data mining modeling, but they are also used in a statistical setting. Therefore, we would like to again differentiate these 2 environments: one is explanatory modelling and the other one is predictive modeling. So, let us go through some of the differences between explanatory modeling and predictive modelling.
(Refer Slide Time: 08:14)
So, when we do explanatory modeling, we want to find a model that fits the data closely; that is our objective. When we talk about predictive modeling, we want to find the model that predicts new records most accurately. Similarly, in explanatory modeling the full sample that we have is used to estimate the best fit model, the model that best fits the data. In predictive modeling, the sample, as we have been talking about in the previous lectures, is partitioned into training, validation and test sets, and it is the training partition that is used to estimate the model. There are other differences as well, for example, the performance metrics.
So, the performance metrics in explanatory modeling measure how closely the model fits the data: we wanted a model that fits the data closely, and we also require performance metrics which measure that same thing, how closely the model fits the data. When we talk about predictive modeling, the performance metrics that we require, the performance metrics that we use, should measure how well the model predicts new observations. Many such metrics we have talked about in our previous lectures on performance metrics.
So, you can see a clear difference between these 2 modeling approaches, explanatory modelling and predictive modeling. Let us move forward.
(Refer Slide Time: 10:10)
So, there are a few more things. For example, the final model that we might have after this model building process and the other phases might not have the best predictive accuracy, because the purpose, as we talked about in the previous slide, was a model that fits the data closely. So, the final model that we get in explanatory modeling might not have the best predictive accuracy. If we look at the predictive modeling scenario, the model that we might finally select out of that exercise might not be the best fit of the data. So, the difference we can understand very clearly.
Now, if we look at predictive modeling, typically we are operating in the machine learning domain; we generally use machine learning techniques, and these machine learning techniques have no assumed structure. We do not force any structure on the data when we use machine learning techniques, and we are generally dealing with large data sets. These are generally secondary data, and because the data set is large we can do other things, for example, partitioning, which can minimize some of the problems that we face during statistical modeling.
Now, for the regression equation that we talked about, we also talked about the regression coefficients beta_0 to beta_p, if there are p predictors, and then another estimate that we need to compute is sigma, that is, the standard deviation of the noise, the noise being denoted by epsilon. These estimates we need to compute about the target population to understand the relationships. Now, these estimates cannot be measured directly, because we do not have the data available on the entire population; that is why we take a sample, and it is on the sample that we apply our estimation techniques and compute these estimates, beta_0 to beta_p and then sigma, the standard deviation of the noise.
So, there are many techniques that could be used to estimate these coefficients, beta_0, beta_1 to beta_p, and sigma, but typically ordinary least squares, OLS, is the technique that is used to compute these estimates from a sample. OLS will compute the sample estimates which minimize the sum of the squared deviations between actual values and predicted values.
(Refer Slide Time: 14:16)
So, let us understand this particular thing by plotting a model. Suppose we have this kind of data and the regression equation that we just saw in the slides, and we were able to find a particular line that fits this particular data as closely as possible. Probably this would be the line: this is the line that we got after applying regression on this particular data set, these observations, and it is closely fitting the data.
Now, how has this line been estimated? This line is also being represented by these estimates, the beta coefficients and sigma. How OLS works is very well defined in the slides: we try to minimize the sum of squared deviations between actual values and predicted values. The line carries the predicted values, while the actual values we can see on the screen as points; the predicted value corresponding to each actual observation is going to be represented somewhere on the line. So, if we connect the corresponding points, the predicted value points on this particular line and the actual observations that are there in this particular graph, these connections would be the deviations, the errors for individual observations or individual cases.
Now, OLS will always try to minimize the sum of squared deviations. Let us say we have n observations, say 100 observations. For each observation we take the squared value of the error, the actual value minus the predicted value, as we talked about in our previous lecture on performance metrics. The squared values of these deviations, and then their summation, is what is actually minimized by OLS, and that is how this line is computed. So, the line that we get, the estimates that we get, beta_0, beta_1 to beta_p and sigma, are computed by following this process: we try to minimize the sum of squared deviations between actual values and predicted values.
Now, if we want to compute the predicted value for different observations, this is how we can do it. You can see on the screen, on this particular ordinary least squares slide, that given the values on the predictors, x1, x2 up to xp, and since we have estimated the betas, which are sample estimates because OLS is applied on the sample, we get the sample estimates, denoted as beta_0 hat, beta_1 hat and so on. We have these numbers and we have the information on the predictors, the values of the predictors. These are going to be used as per this equation and we will get y hat, that is the predicted value, and any difference between the actual value, that is y, and this y hat, the predicted value, will become the error for that particular record, the deviation.
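A minimal R sketch of how these OLS estimates and predictions are typically obtained, assuming train and valid are the training and validation partitions and Price is the outcome variable (hypothetical names):

    fit <- lm(Price ~ ., data = train)      # OLS estimates of beta_0, beta_1, ..., beta_p
    coef(fit)                               # the sample estimates (the beta hats)
    summary(fit)$sigma                      # estimate of sigma, the standard deviation of the noise
    pred <- predict(fit, newdata = valid)   # predicted values (y hat) for new records
    err  <- valid$Price - pred              # prediction errors on the validation partition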
So, following this method, after applying OLS and computing these estimates, the betas and the standard deviation, we can get unbiased predictions. That is, for all the observations, the predicted values that we get can on average be expected to be close to the actual values, because the idea was to minimize this deviation, to minimize this error, and we will also get the smallest average squared error, because that is the method that we have followed. But certain assumptions should hold true. The first assumption is that the noise follows a normal distribution.
So, the noise term that we had in the regression equation should follow a normal distribution, or equivalently we can say that the outcome variable should follow a normal distribution. This is the first assumption. We require this assumption mainly in statistical modeling: there, when we build our model, we build it on the same sample and the reliability of the estimates is also assessed on the same sample. Therefore, the estimates might lack reliability, because in statistical modeling we are always looking for a model which best fits the data, and that might lead to overfitting. Since the estimates might not be reliable, we need to draw confidence intervals to have a range, and then we should be able to claim that those values would fall within those ranges.
So, for us to be able to compute those confidence intervals, those regions, this first assumption has to hold true, that is, the noise should follow a normal distribution. Only then would we be able to derive those confidence intervals, and therefore be able to claim that those particular estimates, whether it is about the mean of the population or anything else, would fall within that particular range, the estimated value plus or minus the range estimated using the confidence interval.
Now, the second assumption that should hold true is the linear relationship. The underlying relationship between the outcome variable and the set of predictors should be linear, because the first assumption, as we can recall, is that the relationship between the outcome variable and the set of predictors follows what is represented in the regression equation. So, the linear relationship should hold true, otherwise the model would not be predicting values in a reliable fashion.
Observations should be independent: all the observations that we have in the sample should be independent of each other; there should not be any dependency. The last assumption is about the variability in the outcome variable: the variability in the outcome variable should be the same irrespective of the values of the predictors. This particular property is also called homoskedasticity. So, this should also hold true, that the variability in the outcome variable is the same.
So, to understand this from the particular graph that we have: if we look at the variability of these points, with the outcome variable generally represented on the y axis and the predictor generally represented on the x axis, the variability of these points remains the same irrespective of the values taken on the predictor axis, the x axis; the variability looks quite similar throughout. So, this is following the last assumption, homoskedasticity, and therefore, if all of this holds true, then we can have unbiased predictions and the smallest average squared error.
Now, if we look at these assumptions, let us again go back to the very first assumption that we talked about, that the noise follows a normal distribution. As we discussed, this is mainly for statistical modeling, because there we use the same sample; when we talk about the data mining approach, the partitioning that we do in data mining modeling allows relaxation from this first assumption.
So, when we talk about data mining modeling, let us say this particular bar is representing our sample; generally we partition the sample into 3 sets. Because of this partitioning, because we are building our model on one particular partition and the model is assessed on the remaining partitions, either the validation partition, if it is not part of the modeling process, or the test partition, the performance of the model is actually evaluated, the performance evaluation happens, on a separate partition.
So, if the model is giving close enough error values, if the numbers that we get from the performance metrics are quite similar, quite close, on both the training and validation partitions, then probably the model is good, and because we have used different partitions we do not need to rely on the first assumption, that the noise follows a normal distribution. That assumption is mainly for a statistical setting where we have just one sample, and the same sample is used for the model building exercise and for the evaluation exercise. There, because the estimates might not be reliable, we need to derive confidence intervals, so that we are sure our estimates are falling within that range. The same thing is expressed in the second point of this particular slide: in statistical modeling the same sample is used to fit the model and assess its reliability, and therefore predictions of new records might lack reliability.
And therefore the first assumption is required to derive confidence intervals for prediction. Now, let us go through an exercise to understand this particular technique.
(Refer Slide Time: 27:07)
So, the data set that we are going to use for this exercise is the used cars data set that we have used before. This data is about used cars, and the task that we are going to handle is the prediction of prices for used cars based on the historical information about those cars. The information that we have on these used cars is the brand name, the model name, the manufacturing year, and the fuel type, which could be petrol, diesel or CNG. Then we have SR price, that is the showroom price in lakhs of rupees; KM, that is accumulated kilometres in thousands of kilometres; price, the offered sale price in lakhs of rupees; whether the car is manual or automatic, represented by either 0 or 1; owners, the number of previous owners; and airbags, the number of airbags in the car. Then we have another variable, C price, which is mainly for the classification task. We would not be using this particular variable in our exercise, because we would be applying regression modelling, which is generally for prediction tasks.
(Refer Slide Time: 28:39)
So, let us have a look at the data that we have. This is the data, as you can see: we have these variables and around 79 observations. With the help of these predictors we want to estimate the price of the used cars, which is there in this particular column, price. All these variables are going to be used in the modeling exercise, and the price in this case is already a continuous variable, a quantitative variable. Therefore, we can go ahead with our modeling exercise and apply a regression model there. So, we will stop here at this point, and we will continue our discussion and build our model in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 23
Multiple Linear Regression - Part II
Welcome to the course Business Analytics and Data Mining Modeling Using R. So, in the previous lecture we started our discussion on this particular technique, multiple linear regression. We discussed the linear regression equation and the different coefficients that we are required to estimate. We talked about explanatory modeling and predictive modeling and a few differences between them, and we also tried to understand the application of multiple linear regression as a statistical technique and how it is different in a data mining environment. We also talked about some of the assumptions when we apply OLS to estimate those coefficients, the betas and sigma.
Then, what are the underlying assumptions that we have to follow, and how those assumptions differ, especially the first assumption, the noise following a normal distribution, and how we get some relaxation from that assumption in a data mining setting. We talked about all those things. Now we will again go through an exercise to understand how linear regression modeling is done and how the different concepts can be put into practice.
So, let us open RStudio. As usual, we will load this particular library, xlsx, because this particular data, the used cars data set that we have, is in an Excel file.
So, let us import this data set. The function that we are going to use is read.xlsx; for the first argument, as usual, we are going to browse for this particular file, then the first worksheet, so we will import the data of the first worksheet, and the header is TRUE because we have the names of all the variables there in the data set. Let us execute this line: you can see in the environment section that this particular data set has been imported, and you can see 79 observations of 11 variables.
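For reference, a minimal sketch of this import step; file.choose() simply opens a file browser to pick the workbook.

    library(xlsx)
    df <- read.xlsx(file.choose(), sheetIndex = 1, header = TRUE)   # import the first worksheet
    str(df)                                                         # 79 obs. of 11 variables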
Let us have a look at this data in the R environment. There is a small icon that you see in the environment section once the data has been imported. Once you click on this particular icon, another tab will open and you will be able to see the data just like you see it in the Excel file. You can see the variable names: brand, model, manufacturing year, fuel type. There are 3 fuel types for these used cars, petrol, diesel and CNG. Then we have the showroom price for each of these used cars, that is, the price when these cars were first purchased as new cars, and then the kilometers, since these cars have been running on the road, the kilometers that have been accumulated since the purchase year.
And then the price: this particular price is the offered price for these used cars. Whether the transmission of the car is manual or automatic, we also have that information: 0 representing manual, and 1 representing automatic transmission.
Now, the next variable is owners, where each number represents the number of individuals who actually owned this particular car. That is also there. We also have information on some of the safety features, for example airbags.
So, the number of airbags that are there in the car, that information is also available. We have another variable in the data set, C price; this variable is mainly meant for the classification task, where any car having an offered value of less than 4 lakhs is represented by 0 and the cars having a value equal to or more than 4 lakhs are represented by 1. So, this is the data set. Let us close this particular tab, and the first thing, as usual, would be to remove the NA columns.
So, we index the data frame within the brackets on the column dimension and apply this particular combination of functions. What apply is going to do is work on the result of the is.na function for this particular data frame df; the 2 indicates that the all function is being applied column wise on this data frame. So, the columns which consist entirely of NA values would be returned as TRUE, and the columns which do have data would be returned as FALSE; a logical vector is returned from this function. Then we have this other operator, not, that is applied on this logical vector, and all the TRUE values are converted to FALSE and all the FALSE values are converted to TRUE.
So, the columns which do have data, which were returned as FALSE by the apply function, become TRUE when not is applied and are retained, and the all-NA columns, which were returned as TRUE, become FALSE and are therefore dropped. This is how this particular line, which we have been using quite often, operates. In this particular data set we did not have any such column, so the result remains the same.
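As a sketch, the line being described is the usual idiom for dropping all-NA columns:

    df <- df[, !apply(is.na(df), 2, all)]   # keep only columns that are not entirely NA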
Now, let us look at the first 6 observations of this particular dataset. You can see the names of these variables, as we saw through the other option in RStudio by clicking that particular icon and looking at the full file in one go. Using the head function we can look at just 6 observations; we need not pull up all the observations, which would require the whole file to be rendered in the viewer. Therefore, if the device where you are running your RStudio does not have sufficient RAM built into it, you are probably better off running the head function and just displaying 6 observations.
So, these are the observations; the variables we have already discussed. One particular column that we can see, the third column, is the manufacturing year, when the car was actually manufactured. The age of the car can actually be computed using this particular vector, and this is what we are going to do next. You can see the age variable: since all the information, these offered prices, is in the context of the year 2017, the current year is 2017, so we can subtract this particular vector from 2017 and, for all the observations, we will get the difference and therefore the age.
So, let us execute this line. You can see in the environment section that the age variable has been created, a numeric vector having 79 values; the age of all the used cars has been computed. Now, age could be a relevant variable because, as we discussed, the task here is a prediction task; we will build a model to predict the offered price of a used car. Age could be an important variable in terms of predicting that particular price, and therefore we would like to have age also in our model.
So, let us append this particular vector to the data frame. cbind is the function, as we have talked about in previous lectures; it can be used to append this variable to an already existing data frame. By default it appends the variable after all the other columns in the data frame. So, let us execute this line; now this variable has been appended. If you want to check, you can call the head function again and you will see that the last column, age, has been created, and you can also see its values for the first 6 observations.
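A sketch of these two steps, assuming the manufacturing-year column is called Mfg_Year (the real column name in the data set may differ):

# Compute the age of each car, with 2017 taken as the current year
age <- 2017 - df$Mfg_Year    # Mfg_Year is an assumed column name

# Append the new vector as the last column of the data frame
df <- cbind(df, age)
head(df)   # the first six rows now show the age column at the end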
Now, if we look at this particular data set, the first two columns, brand name and model, and also the third one, manufacturing year, do not seem to be relevant for our analysis. The manufacturing year we have already transformed into an age variable, the age of the used car, so we will not be requiring that variable; we will get rid of it as well. Since we are building a prediction model, c_price, which was the outcome variable mainly designed for the classification task, will also not be required.
So, we can get rid of these four variables: first brand, then model, then manufacturing year, and then c_price. The remaining variables are the outcome variable, that is price, and the relevant predictors that we want to include in our model. So, first let us take a backup of the existing data frame.

So, let us take the backup; you can see this particular data frame has been created. Now let us eliminate these columns. The combine function c() can be used, and the minus sign before it indicates that we want to drop the columns mentioned inside it and keep all the other columns.
Now, if we are interested, we can have a look at the structure of the data frame. In the present data frame we have all the variables of interest.
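A sketch of the backup and column-elimination steps; the backup name, the column positions 1, 2, 3 and the name c_price are assumptions, so check names(df) against your own copy of the data:

# Back up the full data frame before dropping columns
df_backup <- df

# Drop brand, model, manufacturing year and c_price
df <- df[, -c(1, 2, 3, which(names(df) == "c_price"))]

str(df)   # inspect the remaining variables and their types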
You can see now we have 79 observations of 8 variables, so all the variables are of interest to us for the modeling exercise. Fuel type, for example, would also help us determine the offered price of a used car: whether the car runs on CNG, diesel or petrol.

When these cars are purchased for the first time they are priced differently, and depreciation and other factors also work differently. Therefore the prices of these cars, based on their fuel type, are going to differ, and fuel type becomes an important predictor in this prediction modeling exercise. Now, SR price is actually the showroom price, so it is similar to the price that was paid when that particular car was purchased for the first time. Over the years depreciation is applied to this price and, depending on the condition of the car and other variables, some of which we have in this data set as well, the offered price is adjusted.
Now, how this offered price is determined by the individual seller is what this modeling exercise is about. Another important variable that we have is KM, the number of kilometers that a car has accumulated. That also tells us about the wear and tear the car might have gone through because of the kilometers covered, so kilometers might also indicate how much of the car's value should be depreciated. If the car has been driven less, then probably it has not gone through that much wear and tear.
But if it has travelled more, then probably more wear and tear has happened and the price might be on the lower side; therefore kilometers is also an important variable in this exercise. Price, the offered price, is the variable that we are trying to predict; this is the outcome variable of interest to us. The next variable that we have is transmission. As indicated in this output, transmission is right now a numeric vector, but this variable can have only 2 values, or let us say 2 labels, because it is a categorical variable: the car is either automatic or manual.
But here in this data set it is shown as a numeric vector of zeros and ones. So, we would like to convert this vector from numeric to a categorical variable, a factor variable. Therefore, in the next line we have used the as.factor function to coerce this numeric variable into a factor variable in the R environment, because this variable has 2 labels, 2 categories: automatic or manual.
So, let us execute this particular code, and once it is done you can again run the str function and you will see a change there: the transmission variable has now been converted into a factor variable with 2 levels, 0 and 1.
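A sketch of the conversion; "Transmission" is an assumed column name, so adjust it to match your data:

# Coerce the 0/1 transmission column into a factor (categorical) variable
df$Transmission <- as.factor(df$Transmission)

str(df)   # Transmission should now appear as a factor with 2 levels: "0" and "1"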
So, now the values 0 and 1 are treated appropriately, because this is a categorical variable; these values have become codes indicating 2 different labels. In the str output you will also see that the first few observations shown there appear as 1, 2, 1, 1 and so on, but the actual values, as we saw before using the head function, are still 0 and 1.

So, this is just the way the output of the str function is presented; the values have not changed because of the conversion we just did. If you are interested, you can verify this.
You can again have a look at the actual values: the transmission variable still has the same 0 and 1 values. It is only the str function that presents the output in that fashion; for factor variables it generally shows 1, 2, 3 and so on for the different categories, the different labels that we might have. You can see that even for the fuel type variable, where CNG, diesel and petrol are the 3 labels, the str output shows 3, 2, 3 and so on.

So, these represent the 3 classes 1, 2 and 3, but the actual values are the same; they are not disturbed. You can see that fuel type still contains the strings petrol, diesel and so on; those values are not changed, it is just the representation in the output of the str function.
Now, another important thing that we need to understand here is that the factor variables that we see are nominal variables, the way they have been created; they are not ordinal. Let us also look at how the labeling is done.

For example, if we check the head of the first variable, fuel type, you will see the first 6 values, petrol, diesel, petrol, petrol, petrol and so on, and then the levels CNG, diesel and petrol. This labeling in the R environment is done alphabetically: CNG starts with c, then diesel with d, then petrol with p, so the ordering of the levels is alphabetical. But when we run a classification task, we will have to decide on a reference category, and if we happen to want petrol as our reference category, it would not be made the reference category by default here.
We would have to relevel this variable. This kind of exercise we will do when we discuss a particular technique that is used for classification. The next thing that is required, once we have done all the variable transformations, checked the variable types and transformed them appropriately, is the modeling exercise; all the variables are now ready. Before we go ahead we need to partition this sample. In this particular exercise we will partition this dataset into 60 percent for the training set and 40 percent for the test set.
So, we will not be having a validation set; we will build our model on the training set and then test it on the test partition. sample is the function that can be used to perform this partitioning. In the first argument, as we have discussed before, we need to specify a numeric vector of the observation indices, in this case 1 to the number of rows of this data frame, representing the number of observations. Then the size of the sample is indicated by 0.6 times the number of observations in the data frame.
So, this sample function returns the indices of those observations which have been randomly drawn, which we can then use to partition the data. So, let us execute this line. You will see that part_idx, this index variable, has been created; it is an integer vector because these are indices.
These have been returned by the sample function, and you can see that the indices of 60 percent of the observations, randomly drawn from the data set, have been returned.
Now, we need to partition the data set. We can do this using the bracket operator to subset these observations from the full data set: in the row position within the brackets we mention this index variable, and all those indices are then selected, subsetted, for the training partition. So, let us execute this code. We have been able to create df_train; you can see 47 observations of 8 variables, which matches part_idx, which also had 47 indices in the first place. Now, since we are creating just 2 partitions, all the remaining observations can be left for the test partition.
So, within the brackets, in the row position, we can mention minus part_idx, and the remaining indices are subsetted for the test partition, df_test. So, let us execute this code.
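A sketch of the partitioning step; the seed is an assumption (the lecture does not mention one) added only to make the random draw reproducible:

set.seed(1)   # assumed; not part of the lecture

# Randomly draw 60% of the row indices for the training partition
part_idx <- sample(1:nrow(df), size = floor(0.6 * nrow(df)))

# Rows in part_idx go to the training partition, the rest to the test partition
df_train <- df[part_idx, ]
df_test  <- df[-part_idx, ]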
Now, once this partitioning exercise is over, you can see in the environment section that df_test, another partition, has been created, having 32 observations of 8 variables. With the partitioning done, we can move to our linear regression modeling. The function that is available in R to perform this modeling is lm. If you are interested in finding more details about this function, you can go to the help section, type lm and press enter, and you will see that the help page talks about this particular function: lm is for fitting linear models.
So, you can see in the description that lm is used to fit linear models; it can be used to carry out regression, single stratum analysis of variance and analysis of covariance. All those statistical analyses can be performed using this function.
(Refer Slide Time: 24:33)
Now, if you look at the usage, you can see the function and the arguments that can be passed. The first one is formula, the formula that is going to represent the linear regression model; we need to pass this formula. The second important argument is data. There are many other arguments as well; on your own time you can go through some of them, although they are not typically used.
So, the formula is written in this particular format: the outcome variable of interest is written first, then we use the tilde operator, and then we can write the names of all the predictors from our data set that we want to include in our model. Or we can simply type a dot if we want to include all the variables available in the data set. If you remember, in the data frame that we partitioned we had already excluded the variables that we did not want in the first place, so all the remaining variables form our set of predictors and we would like to have all of them in our model.
So, our formula is going to be price tilde dot, the dot indicating to this function that all other variables should be part of the set of predictors. The data set that we are using is df_train, the training partition, so we will be building this model on the training partition.
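A sketch of this call; "Price" is the assumed name of the outcome column:

# Fit a multiple linear regression of price on all other columns of the training partition
mod <- lm(Price ~ ., data = df_train)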
So, let us execute this line, and you will see that in the environment section a mod variable has been created. This particular variable is actually a list with information on thirteen elements. If you are interested in finding out the names of the elements of this list, you can call the names function on it.

So, these are the thirteen elements. The first one is coefficients, the second one is residuals, then effects, rank, fitted values, assign and so on; many other details have been computed by this function.
If you want to understand all these values, you can again go to the help section for lm and scroll down, and you will see a discussion under the Value subsection about the kind of values that are returned by the lm function. Within this you will find the details: coefficients is a named vector of coefficients, and similarly for residuals, fitted values and so on.
So, details about all these returned values are there. We might not be interested in all of them; summary is the one important function for us to get the relevant output from this exercise. So, let us execute it.
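In code, inspecting the fitted object and its summary looks like this:

names(mod)     # the thirteen components stored in the fitted lm object
summary(mod)   # regression output: coefficients, significance codes, R-squared, etc.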
So, let us look at the summary output. You can see the call, showing the formula and the arguments that we had given to the lm function. Then we have Residuals, some descriptive statistics about the residuals of this model: the minimum, first quartile, median, third quartile and maximum values.
Now, let us come to the important part that we would like to discuss first: the coefficients. You can see the coefficients for all the predictors. The first one is the intercept, that is the constant; this particular estimate represents the beta 0 that we had in our slide.
So, this value against the intercept is beta 0, and the corresponding standard error for this estimate is also given there. The t value and p value have the same meaning that we described in our supplementary lecture on introduction to basic statistics; those interpretations remain the same.
We have also discussed a few more details about these values there, so you can watch those particular lectures. Now, the next important variable that we can see is fuel type. You will notice that instead of one we have 2 entries for fuel type: fuel type diesel and fuel type petrol. This has happened because fuel type is a categorical variable with 3 categories. Because of the way these techniques are implemented in software, they cannot handle textual data directly, and therefore dummy coding has to be performed on categorical variables; depending on the number of categories in a categorical variable, an equal number of dummy variables is created.
So, dummy variables actually represent the different categories. For example, fuel type had 3 categories: diesel, petrol and CNG.
So, for each of these labels, diesel, petrol and CNG, a dummy variable is created. A dummy variable indicates the presence or absence of that particular label: if its value is 1, the car has a fuel type of diesel; if the value is 0, the car does not have a fuel type of diesel. Similarly, the corresponding dummy variable for petrol having value 1 means the car runs on petrol, and 0 means it does not; and the same applies for CNG. So, one dummy variable per label gives us 3 dummy variables for fuel type instead of one variable. But if we have information on any 2 of these 3 dummy variables, then the value of the third is automatically known.
So, in our modeling exercise, if a particular car does not run on diesel or petrol, then of course it runs on CNG. Because of that, if we have information on 2 dummy variables, the third one is automatically known, and therefore in our model we just have to include 2 dummy variables. In general, if there are n classes, we include n minus 1 dummy variables in the model. Now, what happens to the remaining label?
For example, petrol and diesel are selected here, out of petrol, diesel and CNG. The remaining label, CNG, becomes the reference category that we have been talking about, and any results that we get for these 2 dummy variables have to be interpreted with respect to this reference category. We will stop here; more on this dummy coding we will discuss in our next lecture.
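A quick way to see the dummy coding R uses internally for a factor; "FuelType" is an assumed column name with levels CNG, Diesel and Petrol:

head(model.matrix(~ FuelType, data = df_train))
# The output has columns FuelTypeDiesel and FuelTypePetrol (plus the intercept);
# the first level, CNG, is absorbed as the reference category.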
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 24
Multiple Linear Regression – Part III
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture we started our discussion on multiple linear regression; we covered most of the theoretical part of it and we were doing an exercise using RStudio. So, let us start from there; let us open RStudio.

Last time we were able to produce the results of the regression analysis. We are going to reproduce the same, and then we will pick up our discussion from that point.
So, let us load the library and the data set that we have been using. Let us remove the NA columns, create the age variable, take a backup, and then drop the few columns that we do not want.
So, let us eliminate them, and let us also change the variable type. In the previous lecture we also eliminated a few points, so let us again do the same thing here, and let us also do the partitioning.
So, we were discussing the results of this regression analysis; let us come to that part. At this point we were discussing the results of this particular regression, and we talked about the fuel type variable being a categorical variable, a factor variable in the R environment, and how in the results we see the two entries fuel type diesel and fuel type petrol; similarly for transmission, which was converted into a factor variable.
So, we can see transmission1 in the output instead of transmission. How and why is this happening? As discussed in the previous lecture, any categorical variable whose data is in textual format, string format, is automatically treated as a factor variable in the R environment. Any variable whose labels are 0s and 1s, or some other numeric codes, will be treated as a numeric variable.
So, we have to convert that particular variable into a factor variable ourselves, which is what we did for transmission; now our categorical variables are factor variables in the R environment.
Another thing that we were discussing in the last lecture is a categorical variable having many categories, many labels; let us say the classes that it has are C1, C2 and C3.
So, we talked about dummy coding: with 3 labels, 3 dummy variables are created, and we also talked about including only 2 of those dummy variables, let us say D1 and D2, in the model. This is the part that we were discussing in the last lecture: 2 of the dummies are incorporated in our model and the remaining one becomes the reference category.
So, the remaining class, class 3, becomes the reference category, and we talked about how any results and interpretation that we get for these dummy variables, for this categorical variable, have to be with respect to this reference category. We also showed that in the R environment, if the labels are text labels, then by default they are ordered alphabetically.
So, if you do not change the levels, the default applies; for example, in this case we had CNG, diesel and petrol, so that is the ordering of the levels, and by default CNG is treated as the reference category while the remaining 2 become part of the model.
This is what has happened in the results: you can see that fuel type diesel and fuel type petrol are part of the model, their regression coefficients are displayed in the output, and CNG has become the reference category. But if you wanted petrol to be your reference category, and CNG and diesel to be incorporated in your model, then you would have to change this ordering in the R environment.
There are functions available to perform this reordering. This particular change in ordering we will cover in coming lectures, where we will be in a situation that requires a change of reference level. relevel is the function that can be used in the R environment to actually perform this change; we will do this through an exercise in coming lectures. You can see in the help section that relevel is described as reordering the levels of a factor.
So, this function can be used to change the ordering of these levels, and whichever level comes first becomes the reference. Let us say you want petrol to be your reference category: once you have changed the ordering, petrol is going to be your reference category and CNG and diesel will be part of your model.
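A sketch of that change; "FuelType" and the level name "Petrol" are assumptions, so check levels(df_train$FuelType) for the exact spelling in your data:

# Make Petrol the reference category for fuel type
df_train$FuelType <- relevel(df_train$FuelType, ref = "Petrol")
levels(df_train$FuelType)   # "Petrol" now comes first, so CNG and Diesel enter the model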
So, depending on the task at hand, the kind of interpretation you require and your requirement for a reference category, you will have to decide which particular class to treat as the reference and which classes to have in the model, and change the ordering accordingly. Similarly, if your labels are numeric codes, which is often the case, they could be 0s and 1s in a 2-class case; if there are more than 2 classes, the labels could be anything. Depending on the way the data set is presented to you, the labels could be any numeric codes, in any ordering.
There also, if you want to change the labeling, note that in the R environment numeric labels are put in increasing order. For example, if you have levels 1, 2, 3 and 4, the ordering would be 1, 2, 3, 4, and therefore class 1 would become the reference category and the others would be included in the model. If your task requires class 2 as the reference category, then you will need to change this order, again using the relevel function. We will do such an exercise in the coming lectures; then your ordering will change, your reference category will become 2, and the other categories will be part of the model, part of the results that we produce.
So, this is mainly about categorical variables and the labeling and the reference category that we need to take care of. Now, another difference between R and other commercial statistical software is that in other statistical software you might have to perform the dummy coding explicitly and then bring those dummy coded variables into your model. In R, if you have text labels, or once you have declared a variable as a factor variable, which is nothing but a categorical variable, then most of the packages and functions take care of the dummy coding for you.
So, you do not have to explicitly convert your categorical variable into dummy variables of zeros and ones, indicating presence or absence, as we talked about in the previous session. You do not have to do this dummy coding in the R environment if you have converted your variable into a factor variable. In other commercial statistical software you have to perform the dummy coding on your own, although many of them provide specific utilities for the same, so it is quite easy there as well.
But there you have to perform the dummy coding first, then pick your variables and include them in your model; that is how it works in other software. In R, you just need to convert your variable into a factor variable, irrespective of whether the labeling is text based or numeric-code based. Once you convert it, most of the things are taken care of; the only thing you might still need to change is the reference category.
Now, let us look at the results. As discussed, there is the lm function and the formula that is being passed; we can also see the descriptive statistics about the residuals, and then we come to the coefficients part. Here we can see the intercept, then fuel type diesel and fuel type petrol, but neither of these fuel type coefficients is significant. How do we see significance in the results? At the end of each row you would look for an asterisk.
If any row has an asterisk, a star, then that particular predictor has a significant relationship with the outcome variable. Now, there are different levels of significance. The p value is the last column here; the concept of p value we have discussed in the supplementary lectures, so you are recommended to watch those videos. If your p value lies between 0 and 0.001, that is indicated by 3 stars. If it lies between 0.001 and 0.01, that is indicated by 2 stars, and if it lies between 0.01 and 0.05, that is indicated by a single asterisk.
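For reference, this is the legend that R itself prints below the coefficient table in the summary output:

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1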
If your p value is between 0.05 and 0.1, that is indicated by a dot; otherwise the entry is blank. These ranges also correspond to confidence levels. For example, a single asterisk means the p value is less than 0.05, which corresponds to working at a 95 percent confidence level. If the value is less than 0.01, 2 asterisks are used, so when we look for relationships with 2 asterisks or more we are effectively using a 99 percent confidence level, and 3 stars correspond to 99.9 percent.
So, if the p value is less than 0.001, we are working at the 99.9 percent confidence level. Sometimes people might also pick the 90 percent confidence level; in those situations the p value has to be less than 0.1. So, depending on the confidence level that you want, the level of error you are willing to accept, you select what counts as a significant relationship.
The most typical confidence level used is 95 percent, and generally we look for relationships having 3-star significance levels. If we look at the results, any row belonging to a variable in the model that has no such asterisk, no such code, indicates an insignificant relationship.
So, the significant relationships that we can see are SR price, that is the showroom price, having 2 asterisks and therefore significant at the 99 percent confidence level, and KM, which has a single asterisk, so that relationship is significant at the 95 percent confidence level. If we ignore the intercept term, the other predictors do not seem to be significant. Please also bear in mind the data set that we are dealing with.
It has just 75 observations, which is quite a small number. Therefore the results are not that stable, and the significance of the relationships, only 2 of them being significant here, will also change depending on the partition we select. At this moment, the 60-40 partition that we have created, and the observations that happened to be randomly selected into the training partition, are in a way determining the significance of the relationships. Therefore, it is generally advisable to have a much larger data set, so that we can overcome the problems that arise from a smaller sample size.
So, currently what we have is a smaller sample size, and it might not be reflective of the true relationship between the outcome variable of interest and the set of predictors. Another thing that you can see in the regression results is the Estimate column: these are the regression coefficients for all the variables. We also generally check these estimates; if a coefficient is smaller in magnitude than the residual standard error, then even if that coefficient has a significant relationship, we might still consider dropping that variable.

So, the insignificant variables that we see can easily be dropped to reach our final model, and even a predictor whose regression coefficient is smaller than the residual standard error is a candidate for elimination. Dropping insignificant variables, and also variables that have a significant relationship but a very small coefficient, would reduce the variability in the prediction errors, which is the desirable thing for us.
So, this Estimate column gives us the regression coefficients for the different variables; for example, for SR price the value is 0.2, and for KM it is minus 0.02. Then we have the standard error as well: for all these relationships the standard error is given, along with the t value and p value, based on which we can determine the significance of the relationship, as discussed. Other statistics available in these results are the residual standard error and its degrees of freedom, and we can also see the multiple R-squared, which we will discuss in a coming lecture.
So, multiple R-squared captures the proportion of explained variance. In this case the value is 0.6747, so about 67 percent of the variability in the outcome variable is being explained. We also have the adjusted R-squared; there is some difference between multiple R-squared and adjusted R-squared, which we will discuss in this or a coming lecture. You can see that the adjusted R-squared value is slightly on the lower side, about 0.60, that is about 60 percent, lower than the multiple R-squared value.
We will discuss why this value is lower; generally we prefer to look at the adjusted R-squared, which is considered a much better criterion to check the performance of the model. Especially in the statistical sense we look at R-squared because the idea there is to find the model that best fits the data. The R-squared values, whether multiple R-squared or adjusted R-squared, indicate the proportion of variability explained by the model, thereby indicating the fit of the model to the data; therefore we always look for a higher value, with adjusted R-squared being a better indicator than multiple R-squared.
So, we always look at the adjusted R-squared value across the different models, the different variations that we could have of this particular regression analysis, and we look for the model having the higher adjusted R-squared value. Then we have the F statistic, which was covered in our supplementary lecture; this value indicates the significance of the overall model.
Now, let us look at a few more measures. The measures that we are going to compute now are mainly for goodness of fit. You can see that we are computing the residual degrees of freedom for this model, the R-squared value, the sigma value, that is the residual standard error, and the residual sum of squares. So, let us execute these lines.
So, these are a few more statistics that we can discuss: the residual degrees of freedom, which is 36 here, the multiple R-squared value, the same 0.67 that we discussed before, the residual standard deviation estimate, and the residual sum of squares, residual SS. Sometimes we also like to compare residual SS values when there is more than one model, when there are many candidate models.

So, some of these numbers we would like to compare, and then we can decide and find the most useful model, the best performing model.
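One way these quantities might be pulled out of the fitted object (a sketch; mod is the model fitted earlier):

s <- summary(mod)
df.residual(mod)          # residual degrees of freedom
s$r.squared               # multiple R-squared
s$adj.r.squared           # adjusted R-squared
s$sigma                   # residual standard error
sum(residuals(mod)^2)     # residual sum of squares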
Now, let us test the performance of this particular model on the test partition; we are going to score the test partition first. The function that we are going to use is predict. The predict function is quite a generic function: just like the summary function, it is available for many different techniques, many different methods.
With it we are able to score any new data using the model that we have just built. For example, mod represents the model built using the training partition; now we are going to score the test partition, df_test. You can see that we are eliminating the fourth column, which is actually the outcome variable.
We do not need that: it is the model that does the scoring, on a data set having just the predictor information. Just to make clear that we do not need the outcome variable, since that is the variable we are trying to predict, this particular column has been eliminated. So, let us score this test data set. Now we can also compute the residuals, the error values, for the test data set.
So, the outcome variable column represents the actual values; we can subtract the predicted values and find these errors, or residuals. Now let us create this data frame and have a look at the first six observations. You can see the actual value, the predicted value and the residual for the first 6 observations out of the 75 total. As we discussed, the main idea behind scoring the test partition is that we want to compare the performance of our model on the training partition with the test partition.
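A sketch of the scoring and residual computation; the column position 4 follows the lecture, while the name Price and the object names are assumptions:

# Score the test partition; the outcome column (assumed to be column 4, "Price") is dropped
mod_test <- predict(mod, newdata = df_test[, -4])

# Residuals on the test partition: actual minus predicted
test_resid <- df_test$Price - mod_test

head(data.frame(actual = df_test$Price, predicted = mod_test, residual = test_resid))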
So, for this we are going to require a particular library, rminer. This library has functions that can be used to compute the various metrics that we talked about in our previous lectures on performance metrics. So, let us load this library.
Now, mmetric is the function that is used. In this function the actual values are the first argument; you can see that df_train$price is the first argument.
Here we are also computing the metrics for the training partition, so the fitted values, that is the predicted values, are the second argument, and the third argument is a character vector indicating the metrics that we want to compute. In this case we want SSE, the sum of squared errors, then RMSE, and ME, the mean error; these are the 3 metrics we specifically want to compute.
Once this has been done, you can see we are using another function, print, because we are not interested in the value down to the last decimal point; we only need a few decimal places, since the main idea is to compare. So, the print function is used, rounding the results obtained in the previous line and restricting the numbers to 6 decimal places. Let us print these numbers.
So, these are the values for SSE, RMSE and ME. If you look at the residual SS of 66.98 that we had earlier, that is nothing but the SSE; the same number comes up here, along with the new metrics we have computed, RMSE and ME.
Now, let us look at the values on the test partition. Again we call the mmetric function, and you can see that the first argument is now the actual values for the test partition, and the second is mod_test, the predicted values which we have just computed using the predict function; the scoring has already been done. We compute the same metrics, and you can see the numbers.
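A sketch of these two calls, assuming the rminer package is installed and the outcome column is named Price:

library(rminer)

metrics <- c("SSE", "RMSE", "ME")

# Performance on the training partition (fitted values vs. actuals)
print(round(mmetric(df_train$Price, fitted(mod), metrics), 6))

# Performance on the test partition (predicted values vs. actuals)
print(round(mmetric(df_test$Price, mod_test, metrics), 6))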
So, in this case you can see that the SSE value is even lower for the test partition, so the model seems to be performing better on the test partition than on the training partition. This seems to be more of a chance result, because we are working with a smaller data set; it is always advisable to have a larger data set so that we are more sure about the results that we have.
In this particular instance, in the exercise that we have done, our model seems to be performing better on new data, the test partition. If we look at the RMSE value, that is also on the lower side. RMSE is a much better indicator, as we discussed in the previous lecture, because it gives us a number in the same unit as the outcome variable, and the RMSE value for the test partition is also lower than for the training partition. Now let us look at the ME value.
We can see that for the training partition the ME value is 0, so on an average level the model is neither over-predicting nor under-predicting for the training partition. But the ME value for the test partition is negative, so on an average level the model is under-predicting the values there. From these results we can say that the model is robust; it is giving a stable, or even better, performance on the test partition.
However, the main caution is that this exercise has been done on a smaller data set, and therefore the results also depend on the partitions that we created and the observations that were selected into those partitions. So, it is always advisable to have a larger sample size and then trust your results. We will stop here and continue our discussion on multiple linear regression in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 25
Multiple Linear Regression - Part IV
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous few lectures we have been discussing multiple linear regression, and in the last lecture we were discussing the regression results that we had produced. So, let us start with the same exercise; these are the results that we were discussing in the previous lecture.
We stopped at the point where we were comparing the performance of the model that we had built on the training partition and the test partition. We were discussing the importance of the sample size; the point to be understood is that the model we have just built is performing quite well on the test partition as well, but this has to be confirmed with a larger sample size.
Now, there are a few more things we can do to understand the results of our model, for example the box plot of residuals.
A box plot of residuals also gives us some useful information about the model, so let us create one. The range of the residuals is minus 2.27 to 3.31, and you can see that the first argument to boxplot is the residuals, which we have already computed, as we saw in the previous lecture. The title and the labeling for the y axis are also given appropriately, and the limits are set so that the range falls within them.
So, let us execute this line and get the box plot. This box plot seems to be a bit compressed, the main reason being that the y limits are slightly on the wider side; that is why, in every plot that we have been generating, we have always paid attention to the limits. So, let us set the limits appropriately, say minus 3 on one side and 4 on the other, and then we get a more regular box plot, not the compressed one. You can see the change in the result, a much better box plot in this case.
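A sketch of this plot; the residual object, the labels and the exact limits are assumptions:

# Box plot of the training-partition residuals with tighter y limits
boxplot(residuals(mod),
        main = "Box plot of residuals",   # assumed title
        ylab = "Residuals",
        ylim = c(-3, 4))                  # assumed limits covering the range -2.27 to 3.31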
(Refer Slide Time: 02:41)
As we discussed when we covered visualization techniques, the box plot contains some important information for us. For example, in this case we can see the few outliers that are there, and the majority of values, which are represented by the box between the first and third quartiles, and the range where they lie; the majority of values seem to lie between minus 1 and something slightly above 0. From the look of it the distribution appears to be right skewed. We can easily check that. There is another important function that we are going to discuss, the quantile function; if you want to compute the values for different quantiles, this is how you can do it.
For example, we are interested in the first and third quartiles, the reason being that the box generated by boxplot indicates that the majority of values, about 50 percent or more, lie between the first and third quartiles.
So, let us compute these quantiles; you can see the numbers. About 50 percent of the values, as indicated by these quartiles, lie approximately between minus 0.8 and 0.4; that is the range where roughly 50 percent of the values are lying.
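A sketch of the quantile call on the residuals of the model fitted earlier, along with the mean and median used just below for the skewness check:

# First and third quartiles of the training residuals
quantile(residuals(mod), probs = c(0.25, 0.75))

# Mean vs. median as a quick check for skewness
mean(residuals(mod))
median(residuals(mod))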
Now, for the skewness: we can check it by computing the mean and median values.
In this case we can see that the mean value is slightly higher than the median value, and we can also see some outliers there; therefore these residuals seem to follow a distribution that is right skewed. We can check this again by plotting a histogram. Now what we are going to do is plot a histogram of the outcome variable.
So, the outcome variable is price. Let us plot this, and you can see the skewness: this clearly seems to be a right-skewed distribution.
We can confirm the same thing using a few different plots; one popular plot that is generally used is the normal probability plot. We have the qqnorm function to generate it, and we are applying qqnorm to the outcome variable, df$price.

This is the plot that we get. We also want a reference line against which we can compare whether the points follow a normal distribution or not; this is the line that has been added.
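A sketch of these normality checks on the outcome variable; the column name Price is an assumption, and qqline is used here for the reference line:

# Histogram of the offered price: a long right tail indicates right skew
hist(df$Price, main = "Histogram of price", xlab = "Price")

# Normal probability (Q-Q) plot with a reference line
qqnorm(df$Price)
qqline(df$Price)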
In the normal Q-Q plot we see that, as we move from left to right, many observations on the right part start deviating from the reference line that we have drawn; that indicates right skewness. For a normal distribution these points would lie along the reference line, and any deviation indicates skewness, a departure from normality. When we talk about distributions, specifically the normal distribution, we are mainly interested in the residuals and the outcome variable; in this case we have checked both.
Now, for our data mining modeling, the outcome variable or the residuals not following the normal distribution is generally not such a problem, because we assess performance using different partitions; therefore we get some relaxation from the assumption we talked about. But if you are doing statistical modeling, then people generally prefer to do some transformation so that the outcome variable of interest follows a normal distribution.
So, we are going to do the same. From the looks of it we saw that the outcome variable seems to be right skewed, so a log transformation would be appropriate to bring it closer to a normal distribution. Under the log function, the majority of the values lie in a smaller range and the long tail of higher values is compressed; therefore the log transformation seems most suitable to make the distribution look more normal.
So, if we generate a histogram of log of df$price, we will see whether we are getting a normal distribution.
Now, once this has been generated, the values seem to follow more of a normal distribution. If we compare this plot to the previous histogram, this one seems close to normal, whereas in the other plot there was clear right skewness. From this plot what we understand is that we can now use the log-transformed value as the outcome variable of interest in our regression model.
So, let us do this. We are going to run one more model, this time on the log-transformed value. You can see in the lm function that we are taking the log of price, so the model is run on log of price and the set of predictors, on the training partition. Let us execute this; we get the model. Now let us also score the test partition using this model. We are again using the predict function and passing the new model, mod2, this time, because we want to score the test partition using this new model that we have just built. Let us execute this code; the scoring has been done, and now we are again going to use the mmetric function.
You can see that the first argument is the outcome variable for the test partition, and then come the values that we have just computed, mod_test2; we take the exponential of those values because, if you remember, we did a log transformation, so the scores that we get on the test partition are on the log scale.
So, we need to bring those values back to the regular scale, which we can do by applying the exponential function on the scored values; the values then return to the original scale and can be used to compute the metrics.
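A sketch of the log-price model and its evaluation; the names mod2 and mod_test2 follow the lecture, while Price is the assumed column name:

# Fit the regression on the log-transformed outcome
mod2 <- lm(log(Price) ~ ., data = df_train)

# Score the test partition and back-transform the predictions to the price scale
mod_test2 <- predict(mod2, newdata = df_test[, -4])
print(round(mmetric(df_test$Price, exp(mod_test2), c("SSE", "RMSE", "ME")), 6))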
The idea is to compare the performance of this transformed model with the previous model, and that too on the test partition. So, let us compute these values; you can see the numbers. If we look at the RMSE value, it was 1.16 for the earlier model, and now it is 1.11.
So, even in this case the error is further reduced. This might not always be the case once you transform; the typical scenario could be that the RMSE, the error, increases slightly. But in this particular instance, even when we run the model on the transformed outcome variable, the model does even better; you can see that the SSE value, 37 here, is also lower than the previous value.
The previous value was 40. So, with the sample and the partition that we have, after doing the log transformation we are getting even better results. However, the word of caution, as I have said in the previous lecture as well, is that we are doing this exercise on a small sample of just 75 observations; therefore the results are not that reliable, and we should use a larger sample size so that the results are comparable in a much broader sense. So, this was about the regular regression analysis that we have just done.
Now, let us move to our next discussion, which is on variable selection. When we were discussing dimension reduction techniques in our previous modules and lectures, we talked about principal component analysis and how it could be used to reduce the dimensions. We also discussed that some of the data mining techniques could be used to perform dimension reduction; we mentioned at that time that regression analysis could be used for this, and that CART could be used as well. Right now we are going to discuss how regression modeling can be used to perform variable selection, which in a way is also a form of dimension reduction.
Generally, when we are dealing with large data sets we also have a large number of variables, and from that large number of variables we have to select the useful predictors for our prediction or classification task. How do we do that? The main idea of our modeling exercise is to select the most useful set of predictors for a given outcome variable of interest. So, from the large number of variables, the 20, 50, 100 or even 200 variables that we might have in our data set, how do we identify the most useful set of, say, 5 to 10 predictors for our modeling?
Variable selection using regression models could be one solution for this. At this point I would like to mention, as explained in the slide also, that selecting all the variables into the model is not recommended. With the computing software that we have now, you might be tempted to include all the variables in your model.
And then later on select the useful variables from those results. That is not recommended, for a number of reasons. For example, data collection issues in the future: if you have many variables in the model, and later on you are required to rerun the model, because you would like to compare its performance with the previous model, you might not be able to collect data on all the variables in the future.
So, as you increase the number of variables in your model, data collection issues in the future might hamper the comparison, the analysis. You have to bear this in mind: there could be data collection issues if you build your model using all the available variables. Next, measurement accuracy is an issue for some variables. Generally, when we talk about statistical modeling, we are dealing with primary data, as we discussed in previous lectures, and some of these variables are measured using a survey instrument.
Many of these are perceptual variables, and there are often measurement issues, measurement errors, in the data that we collect. Because of that, it is not recommended to include every variable you have in the model; you might choose to eliminate some variables which have measurement accuracy related problems. Then there are missing values: if you have more variables in your model, missing values complicate the problem much more. With more variables in the model, a missing value in even just one cell might lead to removing a larger number of records. So, with more variables there are more chances of having missing values in the data set, and then even more chances of removing records, observations, rows from your data set, depending on the imputation approach that you take.
So, the usefulness of your model can also come into question. Now, another reason is parsimony. As we have discussed before, we would like to follow the principle of parsimony; it is always recommended to try to build a model with as few predictors as possible and still be able to explain most of the variability in the outcome variable of interest. That is the ideal scenario that we want.
So, with respect to this principle of parsimony also, we should be building our model on fewer variables; building the model on all the variables is not recommended. Now, a few more reasons, the next one being multicollinearity.
We have talked about multicollinearity in previous lectures as well, and we will do the same again in coming lectures. So, what is multicollinearity? Briefly, it is about 2 or more predictors sharing the same linear relationship with the outcome variable of interest. As we have talked about, one of the assumptions in regression analysis is that cases should be independent; otherwise it might affect the reliability of the estimates of the regression coefficients.
Just as the cases should be independent on the row side, a similar idea applies on the column side, to the variables: there we say that multicollinearity should not be present, and this is even more applicable to regression analysis and other statistical techniques. Many predictors having the same relationship with the outcome variable is, in a way, very similar to having 2 dependent rows; we do not want that in our data, so we would like to eliminate multicollinearity issues. If there are 2 predictors having the same relationship with the outcome variable and we include both of them in the model, the results might be dominated by the information in these 2 variables, and our model could be rendered less useful. Therefore we would like to eliminate multicollinearity, and when we include more variables in our model there are more chances for multicollinearity to appear: with more variables, there are good chances of a few of them being highly correlated, and therefore multicollinearity could be there in the model.
So, for this reason also, it is recommended that we do our modeling with fewer variables. Now let us move to the next point, that is, sample size issues. A few rules of thumb for sample size we have been discussing before; for example, one particular rule is that the number of observations should be greater than 5 times the quantity (number of predictors plus 2). So, if there are p predictors, we would like the number of observations n to be greater than 5(p + 2). The main logic is that if we have more predictors, our requirement for the number of observations will also go up.
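As a small illustration, here is a minimal R sketch of this rule of thumb, assuming a training data frame df_train in which one column is the outcome and the rest are predictors; the object name is only for illustration:

    n <- nrow(df_train)       # number of observations in the training data
    p <- ncol(df_train) - 1   # number of predictors, assuming one column is the outcome
    n > 5 * (p + 2)           # TRUE if the rule of thumb n > 5(p + 2) is satisfied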
So, if we have 100 variables and we want to include all of them in our model, that would increase the number of predictors, that is, the p value; therefore, our requirement for the number of observations would also go up. So, sample size issues could also be encountered if we have more predictors in our model. Now, a few other things would also be there: for example, the variance of the predictions that we make after modeling might increase due to the inclusion of predictors which are uncorrelated with the outcome variable. If we have more variables in the model, there are more chances of including some uncorrelated variables.
Some predictors which are uncorrelated with the outcome variable will therefore increase the variance of prediction; that is avoidable. Now, another issue that we may face is the average error of prediction. The average error of prediction might increase if we exclude some of the predictors which are correlated with the outcome variable. As I said, when we do any kind of analytics, whether it is based on data mining or on statistical techniques, we are always looking to build a model, to predict values or classes in the different tasks that we do. In all those situations, it is only if the variables we are analyzing are correlated with the outcome that we would be able to do our job. But, as we discussed, if the variables are highly correlated with each other, then of course we would like to avoid that situation; similarly, if the variables are not correlated at all, we would like to avoid that situation as well. Therefore, we are interested in the mid-range, where the variables are slightly to moderately correlated, so that our average error of prediction can be kept in check. If we exclude predictors which are correlated with the outcome variable, our average error of prediction will go up. So, we have to balance both sides. We would not like to have highly correlated variables, because, as we discussed, those variables might dominate the results and bring multicollinearity issues. Similarly, if there are uncorrelated variables, they will increase the variance, which is not desirable, so we would not like to have those uncorrelated variables either. We would like to have, and we also have to make sure that we have, the variables with low or mid-range correlation as the predictors in our model.
This brings us to another concept related to the same discussion, that is, the bias-variance trade-off.
So, the bias-variance trade-off is important especially when you try to include many variables: then your variance is going to be negatively impacted, while when you have a low number of variables, your bias, that is, the average error, is going to be negatively impacted. So, we have to balance between bias and variance. This scenario is sometimes also referred to as too few versus too many predictors. Both extremes, too few or too many predictors, should be avoided; we should have a balanced approach, a balanced bias-variance trade-off, and that balance has to be maintained. If you have few predictors, that is going to lead to higher bias, that is, as we discussed, higher average error, and therefore lower variance. The lower variance is desirable, but you would also not like to have the higher bias. So, that is the trade-off that we need to perform.
So, what can be done? We can drop variables whose coefficients are less than the standard deviation of the noise and which have moderate or high correlation with other variables. As we discussed when going through regression results, variables whose coefficient value is less than the standard deviation of the noise, and which are also moderately or highly correlated with the other variables, are good candidates for dropping. If this is done, we will achieve lower variance, which is desirable.
So, sometimes we accept even a bit more bias to achieve lower variance; lower variance is more desirable for our modeling exercises. That is why most of the recommendations that you would see are generally geared towards achieving lower variance. Now, this brings us to our next discussion, that is, the steps to reduce the number of predictors. So, what could be the steps that we can take? Domain knowledge is one. Using your domain knowledge, having done some work in the same area, and having some knowledge about the different relationships, phenomena, constructs and variables, gives you the expertise to find the predictors which are more sensible with respect to the task at hand, whether that is prediction, classification or statistical modeling.
So, domain knowledge is always going to help you in identifying which variables are going to be useful for the modeling exercise that you are performing; the first reduction can be done based on your domain knowledge. Practical reasons would be another approach. Most of the discussion that we just had applies here as well: the points in the previous slide on why we should have fewer variables, for example data collection issues and missing values, all provide practical reasons to reduce the number of predictors.
A few other things that we have discussed in previous modules, the summary statistics and graphs covered in our visualization techniques lectures, could also be used. Now, in this particular topic, multiple linear regression, our focus is on statistical methods and also on exploiting the computational power that we have nowadays. Two approaches are common or popular. The first is exhaustive search, that is, to search all possible combinations of predictors and find the best subset which fits the data; this is more like a brute force approach.
The first one, exhaustive search, is like a brute force approach, where we check all possible combinations of predictors and identify the subset which gives the best fit to the data. The second approach is partial iterative search; this is more of an algorithm-driven, optimized approach. We are going to discuss these 2 approaches in the coming lectures. First, let us start with exhaustive search and a few points, and then we will continue on the same in the next lecture. In exhaustive search a large number of subsets is examined, because we are going to check all possible combinations. The criteria that we generally use to compare the different subset models would generally be based on adjusted R square, R square and the Mallows' Cp value, which we are going to discuss in the coming lectures. So, we will stop at this point; our next discussion is going to be around exhaustive search and the criteria used to perform it, and then we will also do an exercise in RStudio to understand how this is done. We will stop here.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 26
Multiple Linear Regression-Part V Exhaustive Search
Welcome to the course Business Analytics and Data Mining Modeling Using R. So, in the previous lecture we were discussing multiple linear regression, and specifically we started our discussion on exhaustive search. There are many regression-based search algorithms that could be used to reduce the dimension, or for variable selection, as we have been discussing in the previous lecture as well.
So, let us start our discussion with exhaustive search. Exhaustive search is when we try all possible combinations of predictors: we are looking to examine each possible combination, compare their performances, and find the best subset from those possible combinations. So, essentially, we are dealing with a large number of subsets, and there are different criteria to compare the subset models. Two of them are the same as we use for any regression model: R square and adjusted R square. So, first let us discuss adjusted R square before we proceed.
(Refer Slide Time: 01:43)
So, adjusted R square can be defined using this particular expression: adjusted R² = 1 - [(n - 1) / (n - p - 1)] × (1 - R²), where p is the number of predictors, n is the number of observations, and R² is the multiple R square, the proportion of variability explained by the model. R square is something that we have been using, and we will discuss it a bit more as well; R square is also called the coefficient of determination, and is mainly used in statistical modeling, where we are generally looking for goodness-of-fit measures. Therefore, we are trying to understand how much of the variability is being explained by the model, and R square is the metric for the same.
Now, adjusted R square is, in a way, an improved version of R square, where we actually account for the degrees of freedom, in the sense of the number of predictors. Adjusted R square includes a penalty for the number of predictors; thereby you would see that the adjusted R square value is always slightly less than the corresponding R square value. That is mainly because of the penalty that has been added due to the number of predictors.
So, the more the predictors, the more the penalty, and that has to be accounted for. If we look at R square's definition or formula as well, R square can be computed as R² = 1 - SSE / SST, where SSE is the sum of squared errors (the sum of squared deviations) and SST is the total sum of squares. You can also express this in the following form: R² = 1 - [Σ from i = 1 to n of (yᵢ - ŷᵢ)²] / [Σ from i = 1 to n of (yᵢ - ȳ)²], where yᵢ is the actual value, ŷᵢ (y hat) is the predicted value, and ȳ is the mean of y.
So, in this fashion also you can compute the R square value. R square, called the coefficient of determination, is mainly used to check the goodness of fit.
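As a quick illustration, here is a minimal R sketch of these two formulas, assuming y holds the actual values, y_hat the predicted values, and p the number of predictors; these object names are assumptions for illustration, not objects from the lecture script:

    sse    <- sum((y - y_hat)^2)                      # sum of squared errors
    sst    <- sum((y - mean(y))^2)                    # total sum of squares
    r2     <- 1 - sse / sst                           # R squared
    n      <- length(y)
    adj_r2 <- 1 - ((n - 1) / (n - p - 1)) * (1 - r2)  # adjusted R squared, with penalty for p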
Let us move further. Another way to understand R square is that it equals the squared correlation in a single-predictor model. So, if we just had a single predictor, that is, y being regressed on x1, and we are looking for a linear regression model for the same, then R square would be the squared correlation, that being the single-predictor case. That is also how R square gets its name: in that case the correlation coefficient is generally expressed using a small r, and if you square that you get R square. So, that is how, in the single-predictor case, R square gets its name of coefficient of determination.
Now, as I discussed, adjusted R square introduces a penalty on the number of predictors, to trade off the artificial increase in R square against the amount of information added. That is the trade-off that is generally incorporated in the adjusted R square value.
So, if there are more predictors and they are uncorrelated with the outcome variable, they will just be artificially increasing the R square value and will not be contributing much to the model in terms of the amount of information. Adjusted R square considers this trade-off and imposes a penalty for the same. We do not want an artificial increase in R square; we want some contribution, some information, coming from those predictors in terms of explaining the variability in the outcome variable. A few more things: for example, a high adjusted R square value also means that we will get a low estimated error variance, a low sigma hat squared, so eventually we get a low variance; a high adjusted R square value indicates this as well. Now, another criterion to compare models that can be used in exhaustive search is Mallows' Cp.
Mallows' Cp can be expressed in this form: Cp = SSR / σ̂²_full + 2(p + 1) - n, where σ̂²_full is the estimated value of σ² in the full model, and SSR here is the sum of squared residuals of the subset model, that is, the sum over the observations of (yᵢ - ŷᵢ)², the squared differences between the actual and predicted values. So, this is how we compute Mallows' Cp.
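A minimal R sketch of this computation, assuming fit_sub is the subset model and fit_full is the full model with all predictors, both fitted with lm; the object names are assumptions for illustration:

    ssr_sub     <- sum(residuals(fit_sub)^2)   # sum of squared residuals of the subset model
    sigma2_full <- summary(fit_full)$sigma^2   # estimated sigma squared from the full model
    p  <- length(coef(fit_sub)) - 1            # number of predictors in the subset model (intercept excluded)
    n  <- length(residuals(fit_sub))           # number of observations
    cp <- ssr_sub / sigma2_full + 2 * (p + 1) - n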
So, Mallows' Cp can be used to compare models; how it can be used, we will discuss now. One assumption in Mallows' Cp is that the full model, the model with all predictors, is unbiased. That is the assumption that we make, and that is how we start when we are talking about exhaustive search: because we will be exploring all possible combinations of predictors, the full model will also be considered. The assumption that the full model with all predictors is unbiased also means that eliminating predictors will reduce the variance.
So, if we eliminate predictors, that is going to reduce the variance, and that is the desirable thing for us. Now, how do we find the best subset model using this particular criterion? The best subset model would have a Cp value close to p + 1, with p being a small value. So, the Cp value that we compute using the formula we just saw should be close to p + 1, and p should be small; using these two criteria we can actually find the best subset model. Another important point related to Mallows' Cp is that it requires a high n, that is, a large number of observations in the training partition relative to p, because we would be considering all possible combinations; therefore, with respect to p, we would be requiring more observations in the training partition.
Now, let us open RStudio, and we will understand these concepts through an exercise. As usual, let us load the required library; the data set that we are using is again the same one.
It is the used cars data set; let us import this file. You can see 79 observations of 11 variables in the environment section. Let us remove any unneeded columns and look at the first 6 observations; we are already familiar with this particular data set. Again, you can see brand, model, manufacturing year, fuel type, SR price and KM, with price being the outcome variable of interest and the others, transmission, owners, airbag and C price, being the remaining columns.
Right now we are not interested in C price, so that will also be removed. The first thing that we will do is compute the age variable using the manufacturing year column and append it to the data frame; let us take a backup, and then we will get rid of the first few variables which are not of interest to us, and C price as well. Now we can look at the structure of this particular data frame: you can see 79 observations of 8 variables, and all the variables of interest are in this data frame. Transmission is also a categorical variable, automatic or manual, with the two levels coded using numeric codes, so let us also convert it into a factor variable. After this, let us plot the KM versus price scatter plot.
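The preprocessing narrated above could be sketched in R roughly as follows; the file name, column names and reference year are assumptions based on the narration, not the exact names used in the lecture script:

    df <- read.csv("Used_Cars.csv")              # import the used cars data (file name assumed)
    df$Age <- 2017 - df$Mfg_Year                 # derive age from the manufacturing year (column name and reference year assumed)
    df_backup <- df                              # keep a backup before dropping columns
    df <- df[, !(names(df) %in% c("Brand", "Model", "Mfg_Year", "C_Price"))]   # drop columns not of interest (names assumed)
    df$Transmission <- factor(df$Transmission)   # convert the numeric codes to a factor
    str(df)                                      # expect 79 observations of 8 variables
    plot(df$KM, df$Price, xlab = "KM", ylab = "Price")   # scatter plot of KM versus price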
(Refer Slide Time: 11:30)
You would see from this particular plot that there are a few very clear outliers. The majority of the values are lying in one particular zone, roughly between 0 and 120 on the y axis and between 0 and about 120 to 130 on the x axis; only a small part of the plotting region contains the values.
The other values clearly seem to be outliers, so let us get rid of them. In the y-axis direction, you can see that there is one particular point with a price greater than 70: the SR price is around 100 and the price is around 72. Similarly, in the x-axis direction, that is, for kilometres, we have 3 values greater than 150, and we can identify those rows as well. What we are trying to identify are the row indexes for all these values. Once these 4 rows have been identified, we can use bracket subsetting along with the combine function to drop them, and we are left with only 75 observations; you can see that from 79 we have dropped to 75 observations of 8 variables.
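A rough sketch of this outlier removal step, with the thresholds taken from the narration and the column names assumed:

    out_rows <- which(df$Price > 70 | df$KM > 150)   # row indexes of the clear outliers described above
    df <- df[-out_rows, ]                            # drop those rows with bracket subsetting
    dim(df)                                          # should now show 75 rows and 8 columns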
(Refer Slide Time: 13:00)
Now, again we can plot and see how the scatter plot between price and kilometre has changed: now most of the plotting region is occupied by the points, all the points are close by, and no outliers seem to be there. Now, as usual, we will do the partitioning.
Once partitioning is done, we will skip directly to the part where we discuss exhaustive search. So, let us skip to that part: variable selection, and exhaustive search is the first method that we are going to discuss. leaps is the library that we need to load for this particular method, and regsubsets, that is, regression subsets, is the function that we will be using. So, let us load this particular library. It does not seem to be installed, so let us first install this particular package.
(Refer Slide Time: 14:05)
As we have discussed before, for installing a particular package you just have to pass its name as an argument to the install.packages function; once you do this, and if you have an internet connection, it will start downloading the required packages and install them. Let us reload the library now that it is installed. Now we will be able to use the regsubsets function; we are interested in finding more information about this particular function.
(Refer Slide Time: 14:39)
You can go into the help section for regsubsets, and you will find more information on these functions for model selection. In this case, variable selection and a few other algorithms are covered: model selection by exhaustive search, forward or backward stepwise, or sequential replacement. These are the methods supported by this particular function.
In this function, as you can see, first we have to express the formula, because essentially this is regression based, so regression models will be built. Price is our outcome variable of interest, and the dot represents the other predictors in the data set; as part of the exhaustive search, various combinations of predictors are going to be tried. The data argument is df_train, and the other arguments you can find out about from the help section; we have appropriately specified all those arguments, the most important being method, where we have selected exhaustive for our exercise. Once this is done, we can execute this particular code. The model has been built, and we can also look at the summary.
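The call being described could look roughly like the sketch below; argument values such as nbest and nvmax are assumptions about the script, not a verbatim copy:

    library(leaps)
    exh_search <- regsubsets(Price ~ ., data = df_train,
                             nbest = 1, nvmax = 8,     # one best subset per size, up to 8 terms (values assumed)
                             method = "exhaustive")
    summary(exh_search)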
Now, in the summary output, as we will see in just a bit, the variables included in each subset model are marked with an asterisk. So, first let me show you the summary here.
(Refer Slide Time: 16:21)
So, this is the summary result that we have; in this case you can see the following.
(Refer Slide Time: 16:30)
So, you would see in the results that we have one subset of each size, up to 8 variables, and SR price is selected in every one of them, as the asterisks indicate. We will discuss more of the results once we produce the output in a format suitable for us to analyze. So, let us do that: the function that I have written in the next line, count special character, is going to count the instances of an asterisk in a particular row or column. Let us create this function. We will be using it to count the asterisks that you saw in the summary output for a particular column, and this counting will help us reorder the columns of this particular matrix.
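Since the exact helper from the lecture script is not shown in the transcript, here is a minimal version of what such a counting function could look like:

    # counts how many cells of a row or column of the outmat matrix contain an asterisk
    count_special_character <- function(x) {
      sum(x == "*")
    }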
Since this is a matrix, we would like to reorder the columns: we would probably like to see SR price first, because SR price seems to be present in all 8 models, followed by the next column which appears in the most rows after that, and so on; we would like to order the columns in that sense, and for that we need to count the asterisks as part of the computation. This particular object, outmat, holds the output matrix.
So, we are counting with the apply function, which you are already familiar with; the second argument value of 2 indicates that we are going to apply the function column-wise, and therefore we will get the number of asterisks for each column. Let us compute this; you can see these numbers. Now these numbers are going to be passed on as an argument for ordering the matrix when we construct a data frame for our output.
You can see the columns are ordered by minus om, which we have just computed: the column having the most asterisks appears first, followed by the next column having the most asterisks, and so on; that is why this arrangement has been done. Now, in the data frame, you would see that the first column is the number of coefficients: for each model we want to see the number of coefficients that are there.
That can again be computed by counting the asterisks; this does not count the intercept term, so the intercept, which is also there, is not counted. Again, you can see the apply function: now we are counting row-wise, as the second argument to the apply function is 1. Then we have RSS, which is the residual sum of squares; this is again available in the output itself, so we will use it. Then the Cp values, for which we are interested in just the value up to 2 decimal points, and similarly R square and adjusted R square. So, there are 4 criteria to compare the different subset models: RSS, the residual sum of squares, then Mallows' Cp, then R square and then adjusted R square; we will be using these statistics, these criteria, to compare the different subset models.
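Putting these pieces together, a sketch of how such an output table could be assembled from the regsubsets summary follows; the object names are assumed:

    sum_exh <- summary(exh_search)
    om <- apply(sum_exh$outmat, 2, count_special_character)   # asterisks per column, i.e. per variable
    nc <- apply(sum_exh$outmat, 1, count_special_character)   # asterisks per row, i.e. coefficients per model (intercept excluded)
    out_df <- data.frame(n_coef = nc,
                         RSS    = sum_exh$rss,
                         Cp     = round(sum_exh$cp, 2),
                         R2     = round(sum_exh$rsq, 2),
                         AdjR2  = round(sum_exh$adjr2, 2),
                         sum_exh$outmat[, order(-om)])         # variable columns reordered by how often they appear
    out_df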
The last part is actually a matrix indicating the presence or absence of the different variables in each particular model. So, let us compute this and look at the output: you can see the RSS value for all 8 models.
The first model has one coefficient, or one variable, the second model 2 variables, then 3 variables, and in this fashion up to 8 variables. Now, although we had 7 predictors, there was one categorical variable, fuel type, represented by 2 dummy variables, just like in the regression output where we had 2 variables for fuel type, diesel and petrol; that is why 8 variables are showing. So, the RSS values, that is, the residual sum of squares, are there, the Cp values are there, and the R square and adjusted R square values are there. If we focus first on the adjusted R square value, you can see that, starting from the one-variable model to the 2-variable model, the value keeps increasing, and when we reach the 4-variable model, that is, this 4th row, the adjusted R square value has peaked at a value of 0.72.
And after that, as we include more variables, the 5-variable model, the 6-variable model, you would see that the adjusted R square value is decreasing. That is because, as we discussed before, adjusted R square imposes a penalty on the number of predictors; the increase in R square relative to the further increase in the number of predictors is not good enough, the penalty dominates, and the adjusted R square value decreases after the 4-variable model. But if we look at the R square column, you would see that the R square value keeps on increasing: at the 4-variable model it reaches 0.74, and then it increases to 0.75; if we add more variables and look at the 6, 7 and 8-variable models, the value remains at the same 0.75. Maybe it is increasing at the third, fourth or fifth decimal point, so essentially we can say the R square value keeps on increasing. So, even though a predictor might not be contributing useful information to the model, the R square value keeps on increasing. Looking at the adjusted R square value, the 4-variable model, the 4-variable subset, is probably the one that we would like to select. Similarly, if we look at Mallows' Cp, as we discussed, we want the Cp value close to p + 1, and we would also like a low p value.
So, if we look at this, and specifically start at the 3-variable model, you can see a Cp of 1.73, and p at this point is 3, so p + 1 is 4; the gap between 1.73 and 4 is a bit more than 2. If we look at the next value, for the 4-variable model, you can see about 1.9, roughly 2, and the number of variables is now 4, so p + 1 is 5; the gap is about 3 here, whereas the earlier gap for the 3-variable model, from 1.73 to 4, was about 2.3. So, the 3-variable and 4-variable models are the candidates. Then, if you look at the 5-variable model, the Mallows' Cp value increases to 3.21; with p equal to 5, p + 1 makes 6, so the gap there is similar. If we move further, the Mallows' Cp value is 5.03 and p is 6, so p + 1 comes out to be 7, and the difference is even smaller;
but, as we discussed, we are also interested in a low p value. So, if we go by Mallows' Cp, then probably we would select the 3-variable model: the difference between 1.73 and 4 is about 2.27, but the value of p is low, that is, 3. If we compare it with the 5-variable or, say, the 6-variable model: for the 6-variable model the difference from 7 is less, but it has a higher p value; for the 5-variable model, with a Cp of 3.21 and p + 1 equal to 6, the difference is still more than 2; and for the 4-variable model the difference is also more than that. So, looking at the Cp value, the 3-variable model would be selected, while looking at the adjusted R square value, the 4-variable model would be selected.
And looking at the R square value, the 5-variable model would be selected, because that is the highest R square that we can have if we just look at 2 decimal points; after that, variables keep getting added and the R square value increases only at the third or fourth decimal point, if at all.
Now, if we look at the variables that are included in these models, you can see the first column, and that is why we had ordered this particular matrix by the number of asterisks: we can immediately identify that SR price is the most important variable, in the sense that it appears in all 8 models, followed closely by age, which appears in 7 of these models. Then fuel type petrol, which appears in 6 of these models, then kilometre, which appears in 5 of these models. In that sense, we can also understand the importance of the variables: SR price, appearing in all the models, is definitely the most important, which is also expected, because we are trying to predict the used price of a car and the showroom price is the main indicator, so it is not surprising.
But let me also tell you, as I have indicated before, that this particular analysis is based on just 75 observations, a very small data set; therefore, the results are subject to change with the partitioning that we do. Every time we change the partitioning, the results might change significantly. If we did this same exercise using a much larger data set, then on repeating the exercise there would probably only be a small change.
Now, if we are interested in some more information, for example the coefficients of the subset models, there is the coefficient function, which will give us the coefficient values for the different subset models that we have just computed, the eight models. The second argument is actually an index into the same output that we saw in the summary, or rather the data frame that we constructed: 1 to 8, as 8 models are there. So, let us look at the coefficient values, starting from the first: for the one-variable model, the model is based on SR price, the one predictor, and the SR price coefficient value is about 0.246.
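The coefficients of each subset model can be retrieved as sketched below; the model object name is assumed:

    coef(exh_search, 1:8)   # list of coefficient vectors, one for each subset model of size 1 to 8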
(Refer Slide Time: 28:59)
And then, if you look at the 2-variable model, SR price and age are there; moving forward, fuel type petrol, SR price and age are there, and then fuel type petrol, SR price, KM and age, and, as we move further to the 5-variable model, we see that owners is also there; in the 6-variable model, airbag has also appeared. In this fashion you can see the coefficient values and which variables are present, depending on whether it is the 4, 5 or 6-variable model. If we recall the results, using adjusted R square as the primary criterion for our exercise, the 4-variable model was the one selected; for that 4-variable model we can see the variables that are there: fuel type petrol, SR price, KM and age.
So, probably these are the important variables which are contributing to some extent, and this is reflected here: for example, showroom price is definitely important for predicting the price of a used car; kilometres is also important, and you can see that it is negatively related here; and then you can look at age, which is also, and rightly, negatively related. You can also see that fuel type petrol is negatively related, which also makes sense, in that diesel cars carry a higher price at the showroom, at the time of first purchase, and even for used cars, as per this data, petrol is priced lower.
Because the reference category here is CNG, with respect to CNG, petrol is negatively priced. So, with this discussion we will stop here, and in the next lecture we will discuss some of the partial iterative search algorithms for variable selection.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 27
Multiple Linear Regression–Part VI Partial Iterative Search
Welcome to the course Business Analytics and Data Mining Modeling Using R. So, in the previous lecture we were discussing multiple linear regression, and we concluded our discussion on exhaustive search. The next approach that we use for variable selection, and also for dimension reduction, is partial iterative search. There are a few algorithms under this particular approach that we are going to cover. So, let us start our discussion. Partial iterative search is computationally cheaper in comparison to exhaustive search, where we try out all possible combinations of predictors, which is more like a brute force approach; partial iterative search works through different algorithms and is computationally cheaper, but of course there are some pitfalls, for example, the best subset is not guaranteed.
So, there is always this potential of missing some good subsets of predictors. What we actually get is close to the best subset; that is something we can say: close-to-best subsets we are definitely going to get out of applying partial iterative search.
This particular approach is preferred when we are dealing with a large number of predictors, because the computational time required for an exhaustive search would then be on the higher side; therefore, we would prefer to apply partial iterative search in those situations. Otherwise, if we are dealing with a moderate or low number of predictors, exhaustive search is better, because we get some sort of guarantee of producing the best subset model. So, between these 2 approaches there is going to be a trade-off between computational cost and the potential of finding the best subset. If we want to minimize the computation time, then probably partial iterative search is the way to go; if we do not want to compromise on the potential of finding the best subset, then probably we should employ exhaustive search.
Now, under this partial iterative search approach we have 3 algorithms that we are going to discuss; the first one is forward selection.
In the forward selection algorithm, we start with 0 predictors, that is, with no predictor, and we add predictors one by one. As we go along, the main idea in forward selection is that the strength of each variable as a single predictor is what is considered: because we are adding predictors one by one, if a predictor is significant then it will remain in the model, and if it is not then it will be excluded. Therefore, for a particular variable to be in the model under the forward selection algorithm, its strength as a single predictor should be on the higher side; that is also the limitation of the forward selection approach. So, we start with no predictors and then we start adding them one by one; if a predictor is significant, it will remain there. The second approach is backward elimination. In this particular approach we start with all the variables, and then we start dropping them one by one: the most insignificant variables are dropped first, and in that order we keep on building the models.
And we keep on dropping until we reach the point where all the predictors that are present are significant. For this backward elimination approach there is no obvious limitation, except that the computational time it requires would be slightly on the higher side in comparison to the other partial iterative approaches. The third one is stepwise regression. In this, just like the forward selection approach, we start with no predictor and then add predictors one by one; however, at each step we can also consider dropping insignificant ones. So, as we move along, we keep on adding predictors one by one, and if there are some insignificant ones we can consider, at any step, whether we would like to drop them.
(Refer Slide Time: 05:23)
So, these are the 3 main approaches; let us understand them a bit more using an exercise. In partial iterative search, first we are going to start with the forward selection approach. As you can see, the data set that we are going to use is the same, the used cars data set; it is preloaded, partitioning is already done, and the number of observations we have is 75 after excluding all the outliers, with only the 8 variables of interest, one being price, the outcome variable of interest. Again, you can see that we are using the regsubsets function, and the formula is specified as price tilde dot, which includes all of the predictors in the data set. The next important argument, data, is specified as df_train, and then method is forward, which tells the function that we would like to apply the forward selection algorithm. So, let us execute this code.
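A sketch of the forward selection call, under the same assumptions as before:

    fwd_search <- regsubsets(Price ~ ., data = df_train,
                             nbest = 1, nvmax = 8,
                             method = "forward")    # forward selection instead of exhaustive
    summary(fwd_search)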
(Refer Slide Time: 06:27)
Now, let us generate the summary of this particular model. As we did in the previous lecture, we would like to count the special character, the asterisk in this case; the function count special character has already been created, as you can see, so we can use it to count the number of asterisks in the different columns of this particular matrix. As we discussed in the previous lecture, this result is going to be used as an index vector for us later on.
Now, as discussed before, first we will have the number of coefficients in the different subset models that we generate, then the RSS, that is, the residual sum of squares, then Mallows' Cp, then R square, followed by adjusted R square, and then the output matrix covering all the variables and whether they are present in a particular model or not. So, let us execute this and get the output. As you can see, in this forward selection case, the adjusted R square starts from 0.48 and keeps on increasing up to the 4-variable model, that is, this 4th row. So, in terms of adjusted R square, the results of forward selection and exhaustive search seem to be the same. After reaching the 4-variable model and the highest adjusted R square value of 0.72, the adjusted R square value starts decreasing; as we discussed, there is going to be a penalty for the increase in the number of predictors relative to the contribution, the amount of information.
So, the 4-variable model is the model that we have to select if we follow the criterion of adjusted R square. If we look at the R square values, they are also very similar to, in fact exactly the same as, what we had in exhaustive search: 0.49, 0.66, 0.73, and then finally reaching 0.74 and then 0.75. The numbers are also the same for the Cp values. Since we have the results of exhaustive search with us, you can compare: this particular table is the result of exhaustive search, and you can see the same numbers; the output that we have got from forward selection happens to be the same. This is just for this particular case, this data set and the partitioning that we performed; if we change the partitioning, the results might also change, and since we are dealing with a small data set, we might not see much difference between these 2 approaches.
Now, if we look at the variables, you can see showroom price, SR price, being present in all 8 models, from the one-variable model to the 8-variable model; then age is present in 7 models, starting from the 2-variable model; similarly fuel type petrol is there, then KM, and then come owners and the other variables. If we are interested in understanding the coefficients, we can run the coefficient function, passing the model as the argument, and we will get them.
(Refer Slide Time: 10:19)
So, you can see that for the one-variable model the coefficient comes out to be about 0.2459, with SR price being the only variable; then, if we look at the 2-variable model, we have SR price and age, and you can see that age is negatively related here.
(Refer Slide Time: 10:39)
And then fuel type petrol, SR price and age, just like in exhaustive search; we are getting the very same results, the same models. Then, after fuel type petrol and SR price, we see kilometres appearing; it is happening exactly like what happened in exhaustive search. Now, the one change that we have made here is in the method argument, where we have specified backward as the algorithm. So, let us execute this code and also generate the summary of the output. Here again we are going to do the same thing, counting the special character, and the other things remain the same; let us execute this code.
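Correspondingly, a sketch of the backward elimination call, with the same assumed object names:

    bwd_search <- regsubsets(Price ~ ., data = df_train,
                             nbest = 1, nvmax = 8,
                             method = "backward")   # backward elimination
    summary(bwd_search)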
(Refer Slide Time: 12:00)
Oh, we did not create om2, so let us compute it, and then produce this data frame. Now, again, if we look at the results of backward elimination, we see the same numbers once more: the adjusted R square value again starts from the same 0.48, 0.64. Up to 2 decimal points, these values are the same for all 3 algorithms, whether exhaustive search, forward selection or backward elimination; there might be some small differences if we looked at, say, 6 decimal points. In the adjusted R square column, you can see it peaking at the 4-variable model with a value of 0.72; similarly, R square peaks at the 5-variable model with a value of 0.75, and the Mallows' Cp numbers are the same, so we would again have to select the 3-variable model, for which the Cp value is 1.73.
The RSS, residual sum of squares, numbers also seem to be the same. Let us look at the variables: you can see again that SR price is present in all 8 models, age is present in 7 models, followed by fuel type petrol, present in 6 models, and then kilometre, KM, which is present in 5 models. If we want to look at the coefficients for the different variable models, we can do the same using the coefficient function.
So, let us look at this: again you would see that SR price is present in the one-variable model, and you can look at its coefficient; age is again negatively related, as expected, in the 2-variable model; then in the 3-variable model we again have the entry of fuel type petrol, with a negative regression coefficient; and in the 4-variable model we see KM as well, with a negative regression coefficient. In the same fashion, the results again seem to be very similar. So, let us move to our next algorithm, that is, sequential replacement; for sequential replacement there is something that we have not discussed yet.
(Refer Slide Time: 14:29)
So, what happens in sequential replacement is this: we start with all the variables, and from there we identify the best subset model. Then, for each variable in that subset, we try whether it can be replaced by some other variable. So, in the selected subset model, for each variable we try out a replacement; if there are 4 variables in the subset model, for each variable we try out different other variables as a replacement, and we therefore get 4 more models. Out of those models, we again check which one is performing better, select the best one if there is an improvement, and proceed further; the same thing is applied again and again until we are not able to find a better model. This is what we call the sequential replacement algorithm. The stepwise regression approach that we specified would also be somewhat similar, as you can see, in that it involves this kind of replacement.
(Refer Slide Time: 15:56)
Some variations might differ, and the name may also vary, depending on the implementation that we follow. In this case, as you can see, we are again using the regsubsets function, and the one difference is the method: sequential replacement, seqrep, has been specified. So, let us execute this and run this model.
Let us also compute the number of asterisks in each column of the output matrix, and then compute this matrix. Now we can see that, again, the adjusted R square value peaks at the 4-variable model with the same numbers, R square peaks at the 5-variable model with the same numbers, for Cp the 3-variable model again seems to be the more appropriate one, and the RSS numbers are the same as well.
(Refer Slide Time: 16:56)
For the variables also, SR price is present in all the models, followed by age, then fuel type petrol and KM; the results seem to be exactly the same. Now, if we are interested in looking at the coefficient values, we can do that.
(Refer Slide Time: 17:19)
So, using the coefficient function, again, as we can see, the same results are there, with no change: SR price first, then SR price and age, then fuel type petrol, SR price and age, and then fuel type petrol, SR price and the entry of kilometre, KM; in the same fashion we get the same results. For stepwise regression there is another function that is available to us, called step. In this case we have to pass the lm call as the first argument to the step function; in the lm call we specify the formula as usual in the first argument and then the data. The direction is specified as both, so we can both add and remove variables; both kinds of operations can be performed. If you are interested in finding out more about this particular function, you can go into the help section, type step and look it up.
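A sketch of this stepwise call with the step function, with assumed object names; note that here the search starts from the full model, matching the AIC trace described below:

    full_lm <- lm(Price ~ ., data = df_train)         # full model with all predictors
    step_model <- step(full_lm, direction = "both")   # stepwise search: add and drop predictors using AIC
    summary(step_model)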
(Refer Slide Time: 18:16)
So, you can see that different options are there; this particular function also allows us to run backward, forward and both. As we discussed, stepwise regression starts just like forward selection, and at each step we can consider dropping insignificant variables; that is what direction both indicates, meaning stepwise regression. Using the same function we can also build backward and forward selection, which we have just done using regsubsets. In this particular function, AIC is used to compare the different subset models; for more information you can always look at the other arguments. So, let us use this function and compute this. These are the results that we get: you can see that we start with an AIC value of 6.89, and the formula that we start with is price against the other variables, fuel type, SR price, KM, transmission, owners, airbag plus age. So, you can see all 7 predictors are present, and we start with an AIC value of 6.89.
Now, if we look at the possible additions or eliminations, you can see that if we eliminate transmission we reach an AIC level of 4.917, which is a lower AIC value, followed by the option of removing airbag.
(Refer Slide Time: 20:23)
Then we would have 5.146, and then 5.925, and then the existing model, which is represented by none; none means that if we do not drop any variable, this is what we will have, with all the variables there. So, we have 3 candidate models here: dropping transmission, dropping airbag, and dropping owners; these are the alternatives that we have. In the next step, you can see that we have fuel type, SR price, KM plus owners plus airbag plus age: one variable has been dropped, and the one that has been dropped is transmission. That was the first option, because we could achieve the lowest AIC value, 4.917; that particular option has been selected and transmission has been dropped.
Now, again from this model: if we further drop airbag, we achieve a much lower AIC value of 3.147; if we drop owners, we reach 3.939; and if we do not drop anything, we keep the model that we have at present. So, 2 candidate models seem to be performing better than the present model with respect to the AIC criterion. If we again drop airbag to get the first of them, you would see that we reach an AIC value of 3.15, and the variables that we have in the model are fuel type, then SR price, then KM, then owners and then age. So, among the options that we have is dropping owners.
Then we will have an AIC value of 2.113, which is less than the value for the current model, 3.147, that is, 3.15. Further, we could drop kilometre, or we could add airbag back, but that is a step we have already taken and it would take us back to the previous models; so we select the first option and drop owners. You will see that we reach an AIC value of 2.11, and the variables that we now have are fuel type plus SR price plus KM plus age. These are the same 4 variables, if you remember, as in the final model that we got using the regsubsets function, where, using the adjusted R square criterion, the final model selected was the 4-variable model having the same variables: SR price, KM, age and fuel type.
In this case also, as you can see, the current model has an AIC of 2.11, and there is no other model that improves on it: among the options, the first row, none, is the same model, so no other model can improve this further. So, we talked about Mallows' Cp, adjusted R square and R square, and using a different criterion like AIC, by running stepwise regression, we also get the same result.
Now, as I said, the results that we are getting here are with respect to the sample that we have: we are dealing with a very small sample, a small number of observations, and a particular partitioning. As I said, if we change the partitioning, that is, the observations that are randomly selected into the training partition, the results might also change. If you want to see what happens when we change the partitioning, we can repeat a few of the models that we have just built. So, let us change the partitioning: we have regenerated these partitions, as you can see. Now, let us go back to the same point, variable selection, and look at exhaustive search again, once the partitioning has been done.
(Refer Slide Time: 25:22)
We can rerun this, and you would see that the results we get might be slightly different; you will have to look at this. You can see how the numbers have changed: at least the numbers for the different criteria have changed with the different partition. The adjusted R square now also starts at 0.48, but the earlier run was different; we have the previous results and output as well. If we go back, you can find the previous results: yes, we started with 0.48, then 0.64, then 0.71, and then 0.72. Now, with the new partition, you can see that we start with 0.48, then 0.57, then 0.61.
So, we do not reach that 0.72 level, and you see that at the 4-variable model we reach a peak adjusted R square value of 0.63; it remains 0.63 for the 5-variable and 6-variable models, at least if we look only at 2 decimal points, and then for the 7 and 8-variable models the value drops again.
So, using adjusted R square as the criterion, we now have 3 options, the 4-variable to 6-variable models, instead of just one as in the previous partitioning; you can see that just by changing the partitioning this has happened. If we look at the R square value, there the result remains the same: the 5-variable model, having a value of 0.68, is going to be selected. Now let us look at the Cp values.
Those Cp numbers have changed significantly. For the 3-variable model the value is 4.6, and the value we need to compare it with is 4, so the difference is 0.6, right. Then we look at the next value: it will be compared with 5, and the value is 3.58, so there the difference is more than 1.
So, probably the same model is again going to be selected here in this case as well. But if we look at the variables now, the column for SR price, you would see it is still present, and then age is also present, followed by KM. Now, KM is present in 5 of the models and fuel type petrol is present in 4 of the models. If you go back to the previous results that we had, it was fuel type petrol which was present in 6 of the models, and KM was in 5 of the 8 models. So, that has changed.
So, SR price and age are still present in 8 models and 7 models respectively, but KM and fuel type petrol have changed their places, with KM coming into 5 models and fuel type petrol into 4 models. So, you can see that once we change the partitioning, the results change. This is mainly due to the small sample size that we are dealing with; if we had a much larger sample, the results probably would not change with the partitioning, because we would have more observations to learn from, to build our model from, and therefore the results would be more robust.
The same exercise can be applied to the other algorithms that we have discussed: with a change in the partitioning, their results would also change. So, with this we would like to stop here.
This also concludes our discussion on multiple linear regression. In the next session we will start with k-NN.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 28
Machine Learning Technique K-Nearest Neighbors (K-NN)- Part I
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture, we concluded our discussion on multiple linear regression. In this particular lecture, we are going to start our discussion on the next technique, k-nearest neighbors. This particular technique is more of a machine learning algorithm, a data mining technique, whereas the previous one, multiple linear regression, was a statistical technique. So, let us start our discussion on k-nearest neighbors or k-NN.
As we have been discussing in previous lectures about the difference between statistical techniques and data mining techniques, one specific difference is that in statistical techniques we assume some form of relationship between the outcome variable of interest and the set of predictors; for example, in multiple linear regression and other regression models we typically assume that there is a linear relationship between the outcome variable of interest and the set of predictors.

In k-NN, no such assumptions about the relationship between the outcome variable and the predictors are made. So, the very first point about k-NN that we have specified in this particular slide is that there is no assumption about the form of relationship between the outcome variable and the set of predictors.
Now, another important fact about k-NN, and typically about other data mining techniques as well, is that they are nonparametric methods; k-NN specifically is a nonparametric method. For example, in the multiple linear regression that we concluded in previous lectures, we had to estimate the betas and the sigma value; here we do not have such parameters to estimate. There are no parameters from an assumed functional form: those betas came from the linear relationship that we had assumed between the outcome variable and the predictors, and so they had to be estimated. No such parameters, in the functional-form sense, are there to be estimated here, which is why k-NN is a nonparametric method.
Then how do we build the model; what is the model about; what do we actually do in k-NN modelling? What we actually do is learn from the data: useful information for modeling is extracted using the similarities between records based on predictor values. Each row in the data set represents a record, and using the predictor values of each record we can find the similarities between different records; those similarities become the basis of our modeling exercise.
So, how do we measure the similarities between records? Typically, distance-based similarity measures are used. Let us move forward. As we said, we learn from the data; we learn from the data in statistical techniques as well, but there we assume something about the structure of the data or the relationship between variables, whereas here we do not have such assumptions.
(Refer Slide Time: 04:32)
Let us discuss a few more things. We talked about the similarity between records and that specifically distance-based similarity metrics are used. One of the popular metrics used for distance is the Euclidean distance. For two records with predictor values denoted by x1, x2, ..., xp (if there are p predictors) and w1, w2, ..., wp respectively, the Euclidean distance between these two records is

D_Eu = sqrt((x1 - w1)^2 + (x2 - w2)^2 + ... + (xp - wp)^2).

This is the typical Euclidean distance formula that can actually be used to compute the distance and therefore the distance-based similarity.
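As a quick illustration of this formula, a minimal R sketch is given below; the two example records are hypothetical and not taken from the lecture data.

    # Euclidean distance between two records given as numeric vectors of
    # predictor values (assumes both vectors have the same length p).
    euclidean_dist <- function(x, w) {
      sqrt(sum((x - w)^2))
    }

    # Hypothetical example with p = 2 predictors:
    euclidean_dist(c(6, 20), c(5, 15))   # distance between two records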
The main idea is that the predictor information is used as the coordinates of a particular record; therefore, if the distance between the predictor values of two records is smaller, we can say that those two records are similar or closer. This particular distance-based metric, the Euclidean distance, is quite popular in k-NN, the main reason being its low computation cost compared to other distance metrics. Some of the other metrics that could be used are the statistical distance (Mahalanobis distance) and the Manhattan distance. So, there are some other metrics as well which can also be used.
Some of these other metrics we will cover in later discussions on other data mining and statistical techniques. In k-NN, the Euclidean distance is the most popular metric, as we said, especially because of its low computation cost. The main reason is that we need to compute many distances in k-NN, and therefore we would prefer a metric which is computationally less intensive; as expressed in the last line of this particular slide, Euclidean distance is preferred in k-NN due to the many distance computations that we have to perform.
Now, another important aspect of k-NN is the scaling of predictors. Because of the distance metric that is going to be used, different predictors having different scales and different units can, just because of their scale, dominate the distance. If you look at the Euclidean distance formula, it is computed from the differences between the predictor values of two points; therefore, the difference for a predictor which has a higher scale might dominate the overall distance value and hence dominate the results. So, scaling is very important in k-NN.
Therefore, before we start our k-NN steps, it is advisable to standardize the values of the predictors. So, how do we apply k-NN in a classification task? The main idea is to find k records in the training partition. Here also we would be doing the partitioning, into training, validation and test partitions, and the records belonging to the training partition are then used to find the k records which are neighboring the new observation to be classified.
(Refer Slide Time: 09:05)
The new observation could be in the validation partition, or, if we have the third partition, the test partition, it could be there, or it could be a completely new observation as well.
So, the new observation could be in the validation partition or the test partition, and we need to find the k records from the training partition which are closest to, most similar to, or neighboring this new observation. These k neighbors are then used to classify the new observation into a class: typically the predominant class among the neighbors is assigned as the class of the new observation. These are the two main steps for k-NN, especially for the classification task.
So, let us move forward. The next important point is how we find the neighbors and then perform the classification.
(Refer Slide Time: 10:18)
If we look at the steps for this, first we need to compute the distance between the new observation and the training partition records. That is why, when we talk about why the Euclidean distance metric is generally preferred in the k-nearest neighbor technique, the main reason is exactly this first point: we need to compute the distance between the new record that we want to classify and every training partition record.
If we have a large number of observations in our sample, and therefore a large number of observations in our training partition, the number of distance computations we will have to perform would be much more in comparison to other scenarios. That is why the other distance metrics we talked about, the statistical distance (also called Mahalanobis distance) and the Manhattan distance, are not typically used. Then, once the distance between the new observation and the training partition records has been computed, we can go ahead with point number 2, that is, to determine the k nearest or closest records to the new observation.
The distances have been computed for all the records in the training partition, so out of all those distance values we can find the k nearest or k closest records to the new observation. The next step is then to find the most prevalent class among the k neighbors. Once the k neighbors have been identified in step number 2, we can find the most prevalent class among them; a majority rule decision takes place here. Among all k neighbors, the class which is more common, the majority class, becomes the predicted class of the new observation. These are the typical steps that we apply in k-NN modeling for classification tasks.
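To make these three steps concrete, a small base-R sketch is given below; the data frame and column names (train, x1, x2, class) are hypothetical and used only for illustration, since the actual exercise later uses the knn function from the class package.

    # A minimal base-R sketch of the three steps just described, assuming a
    # training data frame 'train' with numeric predictor columns 'x1', 'x2'
    # and a factor column 'class', and a numeric vector 'new_obs' of length 2.
    knn_classify <- function(train, new_obs, k = 3) {
      # Step 1: Euclidean distance from the new observation to every training record
      d <- sqrt((train$x1 - new_obs[1])^2 + (train$x2 - new_obs[2])^2)

      # Step 2: indices of the k nearest (smallest-distance) training records
      nearest <- order(d)[1:k]

      # Step 3: most prevalent (majority) class among the k neighbors
      votes <- table(train$class[nearest])
      names(votes)[which.max(votes)]
    }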
So, let us understand some of these concepts using an exercise in R; let us open RStudio. As a first step, as we typically do, we need to load the xlsx library so that we are able to import data sets available in an Excel file. The data set that we are going to use here is the sedan car xlsx file, the sedan car data set.
(Refer Slide Time: 13:38)
Let us look at the data set first. So, this is the one; we have used this particular data set in previous lectures as well. In this data set we have three variables: annual income, household area and ownership, with annual income and household area being numerical variables and ownership being the categorical variable.
For the classification task that we shall be performing, ownership is the categorical outcome variable, and we will be using the two variables annual income and household area to classify the ownership of a particular observation. Two different classes are there: non-owner and owner. The unit of analysis here is the household; for different households we have information on their annual income and their household area, and then whether they own a sedan car or not.
Typically, we would expect that households having a higher annual income are more likely to own a sedan car; similarly, households which have a larger household area are also expected to own a sedan car, because you would need more space in your household to make the parking arrangement for your car. So, these two numerical variables are going to be used to build our k-NN model, specifically a classifier model.
Let us import this data set. Once we have loaded the xlsx library, the read.xlsx function is available for us. We can call this function, and as we have been doing in previous lectures, the file.choose function allows us to browse to this particular Excel file. There are other ways of specifying the file, which we have discussed in the supplementary lectures as well: you can mention the absolute path of the file name, or you can change your working directory using the setwd function, and once that is done, if the Excel file has been copied or stored in that working directory, then you can use the name of that particular file, that is, the sedan car xlsx file, in the first argument and it would be easily imported.
So, there are different ways of importing a particular file. Let us execute this code.
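A minimal sketch of this import step is given below; the exact file name (SedanCar.xlsx), the sheet index and the path are assumptions for illustration.

    library(xlsx)

    # Either browse to the file interactively...
    df <- read.xlsx(file.choose(), sheetIndex = 1)

    # ...or give the file name directly after setting the working directory:
    # setwd("C:/path/to/data")                    # hypothetical path
    # df <- read.xlsx("SedanCar.xlsx", sheetIndex = 1)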
And you can see in the environment section that we have imported the data set. It has been stored in the data frame df, with twenty observations of three variables. If you want to confirm the data that we had in the Excel file, you can just click on this icon.
(Refer Slide Time: 17:04)
This is the small worksheet-like icon that appears in the environment section, under the data sub-section. Clicking it, you can immediately see the annual income and the household area; you would immediately be able to view this particular file and this particular information.
Another important aspect of the R or RStudio environment is that once we import a particular data set, it is loaded into memory. Therefore, whenever we click on this small icon, or whenever we want to access the full data or a few observations or rows, the output is produced quite quickly. In comparison, opening the Excel file takes a bit more processing and a bit more time, especially for larger Excel files; that is the main difference. Once we import it, the data set is loaded into memory and can be accessed in a much faster way.
The next line is about removing the NA columns. When we have the data set in an Excel file and happen to delete some irrelevant columns in that file, those deleted columns can still be picked up as variables having NA values in the R environment. Therefore, if we had such columns, the imported data set would show something like twenty observations of five variables if two irrelevant columns had been deleted in the Excel file.

To get rid of such situations we have this next line, where the operation is applied in the column position. For a data frame, as we have talked about in the supplementary lectures, within the brackets the first value is for the rows and the second value is for the columns. For the second value we are applying the apply function: the first argument passes is.na over the data frame, the second argument is the margin, 2, which means that the function specified in the third argument is going to be applied column-wise on the different variables of this data frame. So, if the data frame has columns containing only NA values, that is identified and those columns are appropriately removed by subsetting with these brackets.
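A sketch of this NA-column clean-up line, following the description above, might look as follows; using all as the function applied column-wise is an assumption consistent with that description.

    # Drop columns that contain only NA values
    df <- df[, !apply(is.na(df), 2, all)]

    str(df)   # inspect the structure of the cleaned data frame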
So, let us get rid of any such columns, and now let us look at the structure of this particular data set.
As you can see, using the str function, the structure function, we get to know about the three variables: annual income and household area, the two numerical variables, and then ownership, the categorical variable, or factor variable as it is called in the R environment, with two labels, non-owner and owner.
(Refer Slide Time: 20:32)
Now, let us look at the first 6 observations. So, this is how it looks. One of the first things that we do in a modeling exercise in R is to generate a scatter plot between the predictors or important variables of the model. In this particular case we have two variables, annual income and household area, and you would see that the axis ranges have been appropriately specified, as we have been using this particular data set in previous lectures as well.
So, let us generate this scatter plot and also the legend for the same. These are the points; different plotting characters are used to denote the owner and non-owner classes among these points.
On the x axis we have annual income and on the y axis we have household area. We would like to classify these observations as belonging to the owner or non-owner category. As we can see, this is a small data set and we can already see a clear class separation between these points; however, we would like to do the same classification task using a model.
For example, suppose we have a new observation, a particular household having annual income of 6 lpa and household area of 20, that is, 2,000 square feet, because we have household area in hundreds of square feet. Let us also plot this observation using the points function; the points function can be used to add points to an existing plot, the current plot.
You can see that type is one of the arguments we are specifying, with value 'p', and a different plotting character has also been selected. This point would then be added into the existing plot. The idea is that this is the new observation we want to classify: to this particular observation we want to apply k-NN and classify it into one of the two classes, owner or non-owner.
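A sketch of this scatter plot and of adding the new point is given below; the column names, plotting characters and legend position are assumptions for illustration.

    # Scatter plot of the two predictors, with plotting character by class
    plot(df$Annual.Income, df$Household.Area,
         pch = ifelse(df$Ownership == "owner", 17, 1),
         xlab = "Annual Income", ylab = "Household Area",
         main = "Sedan car ownership")
    legend("topleft", legend = c("owner", "non-owner"), pch = c(17, 1))

    # add the new observation (income 6, household area 20) to the current plot
    points(6, 20, type = "p", pch = 4)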
Now, as we have discussed, scaling is an important aspect of modeling using k-NN, because different predictors could be on different scales, and some of the predictors having a higher scale could dominate the Euclidean distance metric that we use. They could therefore dominate and influence the results, and we could end up with meaningless classifications. So, before proceeding further, let us standardize the values; and first, let us take a backup of this particular data set.
Now, as you can see, we have selected just two columns, annual income and household area, columns one and two, these being the numerical variables; the scaling is to be applied only on these two variables. You can see we are using the scale function, which can be used to apply different kinds of scaling.
In this particular case the first argument is the two variables subsetted from the data frame, so we would be applying scaling on these two variables; the second argument is about centering of the values, which has been specified as TRUE, and then the scaling of the values, which has also been specified as TRUE.
So, both centering and scaling would be performed using this particular function.
So, centralization and scaling both would be performed using this particular function.
If you are understand and finding out more information on this particular function you
can type scale in the help section and you look at more information on this particular
very function you can see scale is used for scaling and centering of matrix like objects a
scale is generate the function and whose default method centers and all scales the
columns of a numeric matrix right you can see center either a logical value or numeric
vector of length equal to the number of columns of x.
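A sketch of this standardization step might look as follows; the backup object name df.bkp is an assumption.

    df.bkp <- df                                   # keep the original values

    # centre and scale the two numeric predictors (columns 1 and 2)
    df[, 1:2] <- scale(df[, 1:2], center = TRUE, scale = TRUE)

    head(df)   # first six observations after standardization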
(Refer Slide Time: 25:58)
Now, let us look at the first 6 observations. As you can see, after the scaling has been applied the values have changed. Earlier the scales were different; now the scaling has been done so as to eliminate their influence.
Now, let us do the partitioning. The data set that we have is mainly for illustration purposes and is very small, so the results might not be that stable. It is generally recommended to have a larger data set; data mining modeling is actually applied on larger data sets, and the process is suitable for larger data sets where we can create partitions having a sufficient number of observations.
That way the results could be stable and reliable. In this case, because this is for illustration purposes, we are going to partition this particular data set: we have twenty observations, of which 15 are going to be used in the training partition and five records are going to be used in the test partition. So, let us do the partitioning here and then create the training partition using the indices that we have just generated.
partition using the index that we have just generated.
So, let us create this in the test partition as we have been talking about in the previous
lecture also sample function can be used to create the indices can actually allow you to
select the random and indices belonging to different observations different observation in
your sample in this case you can see that second argument is specified as 15 because we
want to draw 15 indexes we want to randomly select 15 indices to create the partition
right the placement is also this is this process is done without replacement.
So, the first partition that is the training partition; so, all these 15 indices and the
appropriate sub setting of the data frame d f is performed in the second partition that test
partition the remaining five observation of the 20 total that would be left for the testing
partition now once this partitioning has been done.
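A sketch of this partitioning step is given below; the object names (partidx, dftrain, dftest) and the seed are assumptions for illustration.

    set.seed(1234)                                        # optional, for reproducibility
    partidx <- sample(1:nrow(df), 15, replace = FALSE)    # 15 random row indices

    dftrain <- df[partidx, ]                              # training partition (15 records)
    dftest  <- df[-partidx, ]                             # test partition (remaining 5 records)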
(Refer Slide Time: 28:28)
We can start with our modeling process. As you can see, the library class is actually used to perform the k-NN modelling, so let us load this particular library. Another important aspect of k-NN is the value of k. Till now we have been using k only as a reference, but k actually signifies the number of neighbors, the number of neighboring records that are going to be used to perform the computations.
We will start with an example of 4-NN; that means the value of k is being taken as four, but later on we will see how the appropriate, optimal value of k can be selected. So, let us execute this code. The next function that we are going to use for k-NN modeling is the knn function.
Here the first argument is the training data set, the training partition, which we have appropriately specified; the second argument is the test partition, which we have again specified, and you can see that only the numerical variables are being used. Then, because this is a classification task, you can see the third argument, cl: there we are specifying the class labels using the third variable in the training partition, that is, ownership. From this argument the classification information would be used by the k-NN model.
We have also specified the value of k as 4. So, let us execute this code and see.
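A sketch of this 4-NN call using the knn function from the class package is given below; the column positions (1:2 for the predictors, 3 for ownership) are assumptions for illustration.

    library(class)

    mod <- knn(train = dftrain[, 1:2],      # numeric predictors, training partition
               test  = dftest[, 1:2],       # numeric predictors, test partition
               cl    = dftrain[, 3],        # class labels (ownership) from training
               k     = 4)

    summary(mod)                            # counts of predicted classes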
(Refer Slide Time: 30:12)
Let us execute the summary function. In the summary you would see that out of the five observations that we had in the test partition, three have been classified as non-owner and two have been classified as owner. Here we are interested in looking at the classification matrix, so we can use the table function.
In the table function, the first argument is always for the actual values and the second argument is for the predicted values. So, let us use this. You can see that for the model we built and the classification that we got, as the classification matrix shows, all three non-owners have been correctly classified as non-owners and the two owners have been correctly classified as owners.
Now let us look at the misclassification error. This we can compute using the mean function, where we compare the actual values with the predicted values; whatever is not equal is counted and an average is taken, which gives us the misclassification error. We would see that the misclassification error in this case is 0. Because this was a small data set, the results that we have depend on the partition that we do, on the observations that are selected into the training partition and the observations that are left over for the test partition.
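A sketch of the classification matrix and the misclassification error computation described above might look as follows; the column position of the actual ownership values is an assumption.

    # Classification matrix (actual vs. predicted)
    table(Actual = dftest[, 3], Predicted = mod)

    # proportion of test records whose predicted class differs from the actual class
    mean(dftest[, 3] != mod)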
In this particular case we got 100 percent accuracy and 0 percent error, but if you repeat the exercise and do the partitioning again, this being a small data set, the results might change significantly. However, as we have discussed in previous lectures, a larger data set is generally recommended so that our results remain stable even if we repeat the exercise.
So, I will stop here, and in the next session we will start our discussion on how to select a suitable, appropriate and optimal value of k for k-NN modeling.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 29
Machine Learning Technique K-Nearest Neighbors(K-NN)- Part II
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture we started our discussion on another technique, k-nearest neighbors or k-NN. We discussed most of the theoretical part: what k-NN is about, k-NN being a data mining technique, and its difference from statistical techniques like multiple linear regression.
We discussed all those things, and we also did an exercise in R where we used the sedan car xlsx data set and did the partitioning, normalisation and then the modeling; we also looked at the results. The next important aspect of k-NN is the selection of the value of k: how do we find the appropriate, optimal value of k?
In the small exercise we did in the previous lecture, we had used the value of k as 4 for illustration purposes, but the optimal value of k could be different. So, how do we find it? First we need to understand certain things about this particular aspect of k-NN, choosing an appropriate value of k. For example, let us take the value of k as 1.
(Refer Slide Time: 01:54)
With a value of k of 1, this is going to be a powerful technique, especially for a large number of records in the partition. With k being 1, fewer computations have to be performed, and because the value of k is 1, we are essentially dealing with just one neighbour. Therefore, once we identify the closest neighbour, the class of that neighbour itself is going to be the predicted class of the new observation.
For example, take the plot that we had generated in the previous lecture; we can use that plot, this one. If this is the new observation, represented by the x symbol, and we are required to find just the one closest neighbor of this new observation, then it would probably be this point; this point seems to be the closest one. So, the class of this closest point would actually be assigned to this particular new observation.
So, if the value of k is 1, the number of computations would be much less and we can easily classify new observations and assign the class. If the value of k is greater than 1, then as we move towards higher values of k a smoothing effect starts to take place; smoothing in the sense that if the value of k is more than 1, let us say 4, 5, 6, 7 or even 10, then more neighbours are used to classify the record, and this in a way controls overfitting issues. When we have a low value of k, let us say k equal to 1, we might end up fitting to the noise; therefore, if we have a larger value of k, the smoothing effect will in a way control some of the overfitting issues arising due to noise.
The next point talks about exactly this: a low value of k might lead to more chances of fitting the noise, while if we opt for a high value of k we are more likely to ignore the local patterns in the data. If we go for a high value of k, the smoothing would be heightened and the global effects would become dominant, so the local patterns might be ignored.
Therefore, we have to trade off between the benefits from local patterns versus global effects. If we target the local patterns and select a small value of k, we might end up fitting the noise, which we do not want; if we go for the global effects and the smoothing effect so that we are able to control overfitting issues, we might end up ignoring the local patterns and we might end up ignoring the predictor information as well.
For example, as the next point says, if we take the value of k as n, that is, all the observations that are there, then that would essentially lead to the naive rule, that is, a majority-vote decision. If you take all the observations, the class which is the most prevalent among all of them would be assigned to the new observation, irrespective of the predictor information. We would end up ignoring the predictor information, and more often than not, almost in all cases, for every new observation the majority class would be assigned. We do not want this situation either; we would like to balance between ignoring the predictor information and fitting to the noise.
When we have a low value of k, we want to fit to the local patterns, the genuine patterns that could be there, and we would not like to fit to the noise. When we have a higher value of k, we would not like to ignore the predictor information or have too much smoothing. So, we need to balance the local patterns and global effects, the incorporation of predictor information and the avoidance of fitting to the noise.
Let us move further. The appropriate value of k will also depend on the nature of the data. For example, if the data that we have is somewhat complex and carries irregular structures, then we would probably be better off having a low value of k, because a higher value of k and the resulting global, smoothing effects might not be suitable for this type of data; we would be better off capturing the local patterns because of the complexity involved in the data.
However, if we have a regular data set with regular structures that we are familiar with, in those cases a higher value of k might be preferred. We understand those structures, and therefore we would be better off focusing on the smoothing effects so that we are able to control the overfitting issues, which mainly arise due to fitting to the noise.
Now, let us look at the typical value of k that is generally suitable: between 1 and 20, typically.
(Refer Slide Time: 08:35)
So, we take a value of k between 1 and 20. Even within this range, an odd value of k is preferred, to avoid ties in the majority class decision. Because we are discussing a classification task, we have to make our prediction decision using the majority class rule once the k neighbours have been identified. If the value of k is an even number, 4, 6, 8 or 10, then sometimes there could be ties between two classes: suppose the value of k is 8 and there are two classes, one having 4 neighbours representing it and the other class being represented by the remaining 4 neighbours.
In that case there would be a tie and it would be difficult to assign a class to the new observation; therefore, to avoid those ties, the ideal value of k would be an odd value. Now, having discussed all these points, the low value of k, the high value of k, the nature of the data, structures that are complex and irregular or regular, the typical range of k and the odd values, how do we find the best value of k? It would be based on performance: the classification performance on the validation partition can actually be used to find the best value of k.
So, what we are going to do is try to understand, through an exercise in R, how we select the best value of k.
(Refer Slide Time: 10:31)
So, let us open RStudio. The data set for the exercise is the same data set that we were using in the previous lecture, the sedan car data set. We already know the 3 variables that are there: the annual income, the household area and the ownership. So, here is what we are going to do.
In this particular data set the total number of observations is 20, and out of these 20 observations we had partitioned 15 into the training partition and the remaining 5 into the test partition. We had also built a model in the previous lecture using a k value of 4.
Now, what we are going to do is try the different possibilities: the possible value of k could be from 1 to 15, depending on the observations that are there. There are 15 observations in the training partition, and this is less than the typical upper value that we talked about, 20. Therefore, we will loop over all of these, 15 times; the value of k will range from 1 to 15.
What we are going to do is, for different values of k, build a k-NN model, and for each of those models we will predict the class of the observations that are there in the test partition; we will also rescore the training partition, that is, predict the observations of the training partition itself. Then we will record the classification errors in the training partition and the test partition. As you can see, we have initialized 4 variables, 4 vectors: mod train, which is for the model built using the training partition and then scoring the training partition itself.
The next one is mod test, which is for scoring the test partition; let us initialize this as well. Then we have error training: this variable is to record the misclassification error in the training partition for different values of k.
(Refer Slide Time: 13:27)
Then we have another vector, error test, to record the misclassification error associated with different values of k for the observations in the test partition.
Once this is done we can run this loop. You can see that for both the training and test partitions, as usual, we have selected the first two numerical variables, the income and the household area. Also notice that within these two lines of code the third argument, cl, remains the same: the finding of the k neighbours is based on the training partition, and it is the test argument that differs between the two lines. Here it is dftrain, so in the first line we are building the k-NN model on the training partition and then scoring the observations of the training partition itself; in the second line of code we are building the k-NN model, which essentially means identifying the k neighbours from the training observations, and then scoring the observations belonging to the test partition.
In every iteration, for every value of k, we build these models, do the scoring on the training and test partitions, and record the errors. So, let us execute this particular code; we would see that different variables have been created. Right now we would like to focus on error test and error train; you can see two numerical variables in the environment section, each having 15 values. Once this loop has been executed, we can move to the next line of code. We are constructing a data frame where the first column represents the value of k, between 1 and 15, the second column is the training error, which has been captured in the variable error train, and the third column is the misclassification error on the validation side, that is, error validation, which has been captured in the error test variable. So, let us construct this data frame.
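A sketch of the loop and the error data frame described above is given below; the object names (error.train, error.test, errdf) and the use of percentage error values are assumptions for illustration.

    library(class)

    error.train <- rep(0, 15)
    error.test  <- rep(0, 15)

    for (k in 1:15) {
      # score the training partition itself
      mod.train <- knn(train = dftrain[, 1:2], test = dftrain[, 1:2],
                       cl = dftrain[, 3], k = k)
      # score the test (validation) partition
      mod.test  <- knn(train = dftrain[, 1:2], test = dftest[, 1:2],
                       cl = dftrain[, 3], k = k)

      error.train[k] <- mean(dftrain[, 3] != mod.train) * 100   # error in %
      error.test[k]  <- mean(dftest[, 3]  != mod.test)  * 100
    }

    # table of k versus training and validation error, rounded to two decimals
    errdf <- data.frame(k = 1:15,
                        error.training   = round(error.train, 2),
                        error.validation = round(error.test, 2))
    errdf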
Now, we are interested in just two digits after the decimal point, so let us round these values.
So, these are the values; this is the output we have. For a value of k of 1, you can see the training error is 0 and the validation error is 0. When the value of k is 1, we essentially use just one observation: for any observation in the training set we use just one close-by observation, and this yields the result that the error is 0, as almost all the observations are correctly classified using this value of k. The same is applicable to the validation error; there also, as we can see, the value is 0. Typically this is not the case; this is mainly because of the way the partitioning has been done.
If we look at the different values of k, you can see that for a value of k of two the training error is 20, and then for 3, 4 and 5 the error keeps on increasing as we increase the value of k. So, as we move from 1 to 15, the error in the training partition keeps on increasing, while the validation error remains at 0. This is a peculiarity of this particular data set and this particular partitioning being captured, not the typical case; only at k values of 14 and 15 do we see different errors.
So, probably we need to do this partitioning and this modelling again, because some peculiarity of this small data set, with just 5 observations in the validation partition, is being captured and we are not getting a typical kind of result, or even close to one. So, what we will do is execute the partitioning again and then execute the following code again.
So, let us do this. Again we draw 15 random indices; we would like to have a different set of 15 observations in the training partition and 5 observations in the test partition, so that we are able to get a more typical result. We are using a small data set, which is why some peculiarity in the results can be seen. Let us execute this; you can see we are recreating these partitions, and hopefully we get a different partition this time and therefore a different result.
We will have to run from here once again; I execute this code and let us construct this particular data frame again. Now there is some change; you can notice some change there. Still, if we look at these results, the training error, as you can see, keeps on increasing for different values of k: it increases, then there is a slight drop, and then it keeps on increasing again with a few drops. We are mainly interested in the validation error, and there also we again see some peculiarity in the results.
What we actually expect is that the validation error would start on the higher side, then decrease, and, as we go for more values of k, increase further; so there is going to be a minimum, and we are interested in the very first minimum that could be there.
Because of the smaller data set we are not able to see those kinds of results, so we will try once more. As you can see, I am initializing these variables and executing the loop again, and now, I think, yes, this time we are getting more typical results; this is more of a typical scenario. This particular result is probably replicating the regular scenario we would get with a good enough, larger sample size.
Here you can see that the training error starts from 0, and in the validation it starts from a higher value and then, as we increase the value of k, it goes down and then remains the same; some peculiarities are still there, but it is better than the previous cases. So, it decreases or remains the same, and then further it increases at the extreme points.
The optimal value of k would actually have to be identified using the error in the validation partition. The error in the validation partition would typically be more than the error in the training partition, because we have built the model on the training partition; in the case of k-NN we are essentially using the same observations to identify the k neighbours and then using those k neighbours to classify the observations of that same training partition. You would see that in the training partition the error starts from 0, then keeps on increasing, with a few dips, and then increases further.
Typically, the validation error will start on the higher side, then decrease, and then increase further. If we look at the data, probably the value of k equal to 3 is the optimal value: for k values of 1 and 2 the error is 40, and then there is a dip and it becomes 20.
As we keep increasing the value of k further, the error rate does not change, and later on, as we reach k values of 14 and 15, there is again an increase. So, let us use this particular result for our further discussion.
We can draw a plot between the error rate and the value of k; we are mainly interested in the validation error. Let us look at its range, 20 to 60, and we will have to specify a range here that covers it appropriately: 0 to 65 would cover it on the y axis, and 0 to 16 is the limit on the x axis. If we look at the range of the training error, that is between 0 and 40, so it will also lie in this range. So, these ranges for the x axis and y axis are appropriately specified. Let us generate this scatter plot.
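A sketch of this error-versus-k plot might look as follows, assuming the error data frame errdf built earlier; the plotting characters and legend are assumptions.

    plot(errdf$k, errdf$error.validation, type = "b", pch = 19,
         xlim = c(0, 16), ylim = c(0, 65),
         xlab = "Value of k", ylab = "Error rate (%)")

    # add the training-error curve to the same plot
    lines(errdf$k, errdf$error.training, type = "b", pch = 1, lty = 2)

    legend("topright", legend = c("validation error", "training error"),
           pch = c(19, 1), lty = c(1, 2))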
(Refer Slide Time: 24:23)
Now, this is the scatter plot of the validation error. This would typically be smoother if there were more values of k to choose from. If we plot a small number of values of k, the plot will look like this, but if we have a large number of values of k to pick from, and a larger sample size, then the error values will also smooth out and this curve would be much smoother. We also plot the curve for the training partition.
(Refer Slide Time: 25:15)
If we look at the plot, we would see that in the training case we start from 0, the error is on the lower side, and as we keep increasing the value of k the error keeps on increasing, with a few dips here and there. If we look at the validation curve, the validation error is always above the training partition error: we start from a higher value and then it decreases. The kind of curve that we are getting is mainly because of the smaller data set that we are using; the typical curve that we would actually expect would be something like this.
So, suppose you have a larger data set. What is expected is a curve with the different values of k on the x axis and the error rate here on the y axis; this could be the error rate on the validation partition or on the training partition. Typically, the error rate on the training partition would be smaller in comparison to the error rate on the validation partition.
For the validation partition, the error will start on the higher side, and as we increase the value of k it will probably decrease, reach the optimal minimum, and then further it might increase; in that fashion it might go. This would typically be applicable for validation. If we want to understand the curve for the training partition, again we believe that for a low value of k, if our k-NN model starts fitting the noise, then even for low values of k the error could be on the higher side.
But essentially, when we are talking about the training partition, especially in the k-NN case, the error is going to be much lower. It might start somewhere in this range, and as we move further the error will come down, and then for still more values of k it will come up again; this would be mainly applicable for training.
Now, in this particular range of values of k, starting from k equal to 1 up to a few small values of k, if we are not fitting the noise then probably this particular curve would be drawn differently; it might look like this. So, there can be different variations of this curve depending on the data set, on the modeling exercise that we have, on the predictor information that we have and on how well the model is able to classify.
This is the typical scenario that we expect, which in a rough sense is reflected here in this particular curve, but we had to do many iterations to reach this particular result, the reason being that we are dealing with a smaller data set; for a small data set the results sometimes might not be as expected in a regular scenario. It is therefore recommended that a larger data set should actually be used in data mining exercises.
That way we are able to get the usual results and then analyse them. Now, once this is done and you have a table of this kind with the error values, you would be interested in the best value of k. In this particular case we are dealing with just 15 different values of k, so it is easy to spot the best value out of all these values; it is probably k equal to 3. But if we were running this particular exercise for a large number of values, you could use the min function to find the minimum of these error values; we can see it is 20.
Now, for this particular minimum we would like to find the lowest value of k for which this error was achieved. This lowest error could be achieved for several different values of k; there could be many minima points, and this particular graph could also look different depending on the data set, going up and down in that fashion.
So, there could be two such points. How do we find the best value of k then? It is generally the first one that we would like to pick; that is the value of k we would choose, because otherwise we might be fitting to noise or to some peculiarity in the data set. So, it is generally the first one that we pick.
We can do this using this particular code. Within the brackets, on the row side, we are trying to identify the index where this best error value occurs, using which.min, which returns the very first such index. For different values of k the indices of all rows where this value of 20 occurs could be in the result, but we want to pick the first one, and that is what is selected here. We would see that 3 has been selected, which we had also identified by looking at the table itself; since we are dealing with a smaller data set, it was easy for us to do that.
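A sketch of this step, picking the lowest k that achieves the minimum validation error, is given below; errdf is the assumed name of the error table.

    min(errdf$error.validation)                  # lowest validation error, e.g. 20

    # which.min returns the index of the first occurrence of the minimum,
    # i.e., the lowest k achieving the lowest validation error
    bestk <- errdf$k[which.min(errdf$error.validation)]
    bestk                                        # expected to be 3 here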
Now, once this exercise is done and we have identified the optimum value of k, we can also see how we can actually use this optimum value of k to predict the class of a new observation. In the scatter plot that we had generated, we had used this particular new observation, annual income 6 and household area 20. So, how can we score, how can we classify, this particular observation?
Again we are using the knn function: the first argument, train, is the training partition that we have specified; then in the test argument we have specified just one point, it is not the test partition but just the new observation, 6 and 20, created using the combine function. The third argument, cl, is the class, the ownership variable from the training partition, and for the k value, as you can see, we have picked up the best k value. Let us look at the classification.
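A sketch of scoring this single new observation with the best k might look as follows; the object names dftrain and bestk are assumptions.

    new_obs <- c(6, 20)                              # annual income 6, household area 20

    # knn() treats a plain vector in 'test' as a single row (one new case)
    knn(train = dftrain[, 1:2], test = new_obs,
        cl = dftrain[, 3], k = bestk)                # predicted class, e.g. "owner"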
So, as per our model, this particular observation would be classified as owner. For a different point, let us say an observation having income as 5 and area as 15, let us run and score this one as well; this one also would be classified as owner.
In this fashion, for specific points, we can classify them using k-NN. We also had a look at how we can score a test partition and how we can identify the optimum value of k. So, we will stop here, and in the next lecture we will start our discussion on the remaining points of k-NN and also on how k-NN can be used for the prediction task.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 30
Machine Learning Technique K-Nearest Neighbors(K-NN)- Part III
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture we were discussing k-nearest neighbors, k-NN; we covered the theoretical aspects and also did an exercise. The discussion and the exercise that we had were mainly specific to the classification task. So, let us continue the discussion on k-NN.
There are a few more things that we wanted to discuss with respect to the classification task, so we will do that. The example that we had done was mainly about the majority decision rule. Sometimes, as we have talked about in the performance metrics lectures as well, we might have a class of interest, and therefore we would be interested in classifying the records into that class of interest.
(Refer Slide Time: 02:24)
Now, this majority decision rule can be connected with the cutoff probability value that we have been using, specifically in the context of a class of interest, or even when we have equal misclassification costs, that is, when we do not have a specific class of interest and are looking to minimize the overall misclassification error; in that case also we can connect the majority decision rule with the cutoff probability value.
Take the two-class scenario, where we have to classify our records either to class 1 or to class 0. The majority rule is then similar to a cutoff value of 0.5. For example, if we have 10 records, and out of those 10 records 6 belong to class 1 and 4 belong to class 0, then as per the majority rule class 1 would be assigned because it is more prevalent: more records belong to class 1, it has the majority, so this particular class would be assigned to the new observation.
Now, let us just compute the probabilities using the same example: 10 records, 6 of them belonging to class 1 and the remaining 4 belonging to class 0. The probability of the record belonging to class 1 would be 6 divided by 10, that is 0.6, and the probability of it belonging to class 0 would be 4 divided by 10, that is 0.4. In this two-class scenario, suppose we follow the cutoff value of 0.5.
Since 0.6 is more than this cutoff value, the new observation would again be classified into class 1. So, whether we use the majority rule or the cutoff value concept in the two-class scenario, we get the same result; the majority decision rule can easily be connected with the cutoff probability rule that we talked about in previous lectures.
The same idea extends to the m-class scenario. With m classes, the majority rule again
picks the most prevalent class among the k neighbors and assigns it to the new
observation. In terms of probability values, if there are, say, four classes, each will
have some probability of the new observation belonging to it, for example 0.3, 0.25, 0.25
and 0.2. Here too, the class with the highest probability value is the one assigned to
the new observation. So, this concept extends easily to the m-class scenario: the
majority decision rule can be connected with probability-based values, and the cutoff
probability value can be used to classify new observations.
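To make this equivalence concrete, here is a minimal R sketch (the class labels and neighbor counts are hypothetical, not from the course dataset) showing that, among the k nearest neighbors, the majority vote and the 0.5-cutoff rule give the same answer in the two-class case, and that the same idea generalizes to m classes by picking the class with the highest proportion:

    # Class labels of the k = 10 nearest neighbors (hypothetical example)
    neighbor_classes <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)

    # Majority decision rule: most prevalent class among the neighbors
    majority_class <- names(which.max(table(neighbor_classes)))

    # Equivalent cutoff formulation: proportion of neighbors in class 1 vs cutoff 0.5
    p_class1 <- mean(neighbor_classes == 1)          # 6/10 = 0.6
    cutoff_class <- ifelse(p_class1 > 0.5, 1, 0)

    majority_class   # "1"
    cutoff_class     # 1

    # m-class extension: assign the class with the highest proportion among neighbors
    m_neighbors <- c("A", "B", "A", "C", "A", "B", "D", "A", "C", "B")
    class_props <- prop.table(table(m_neighbors))
    names(which.max(class_props))                    # class with the highest proportion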
The next thing we want to talk about is k-NN for the multi-class scenario. So far, our
discussion and the exercise we did were mainly for the two-class scenario. If we look at
the k-NN steps that we discussed earlier, most of them can be easily extended to an
m-class scenario, that is, more than two classes.
(Refer Slide Time: 07:16)
The first step is to compute the distance between the new observation and the training
partition records. Whether it is a two-class or an m-class scenario, this step remains
the same. The second step, determining the k nearest or closest records to the new
observation, also remains the same. The third step, finding the most prevalent class
among the k neighbors, is where the details depend on whether we have two classes or m
classes, and that most prevalent class becomes the predicted class of the new
observation. Here too we can work with probability values: the class having the highest
probability value among the k neighbors is the one assigned. So, the two-class procedure
and the exercise that we did can be easily extended to the m-class scenario.
Now that we have seen how the majority decision rule connects with the cutoff-probability
method for assigning classes, let us come to the class of interest. Until now we have
been talking about the situation where we do not differentiate between classes and simply
want to minimize the overall error. But in some situations, as discussed in the
performance metrics lectures, we have a class of interest, and we would like to identify
more records belonging to that class even if it comes at the expense of misclassifying
records from the other classes.
So, how do we change our steps? Instead of the majority rule, we compare the proportion
of the k neighbors belonging to the class of interest to a user-specified cutoff value.
So, we no longer look at the probability values of all the classes among the k neighbors
and pick the class with the highest value. Instead of following the majority rule, or the
equivalent cutoff-probability method, we focus on the class of interest and compute the
proportion of the k neighbors that belong to it; in probability terms, this is the
probability of a record belonging to the class of interest among those k neighbors. Once
we compute that probability, we compare it to a user-specified cutoff value, because we
are more interested in finding members of the class of interest.
Therefore, even in a two-class scenario, for the class of interest we might specify a
cutoff value lower than 0.5; it could be 0.4, 0.3 or 0.2, depending on the scenario. If
the class of interest is quite rare in the data set, we might lower it further to 0.2 or
0.1. So, depending on the situation, the user-specified cutoff value may be brought down
from 0.5 to as low as 0.1.
Now, the proportion of the k neighbors belonging to the class of interest, that is, the
corresponding probability value, is compared to the specified cutoff value. If it is
greater than the cutoff, the new observation is classified to the class of interest;
otherwise it is not. We are not particularly concerned with classifying records into the
other classes; our main focus is the class of interest, so we compute only the relevant
probability for that class and compare it to the specified cutoff. Eventually we are able
to identify more records belonging to the class of interest, although typically at the
cost of more misclassifications in the other classes.
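A small R sketch of this rule, again with hypothetical neighbor labels, and assuming the class of interest is labeled 1:

    # Labels of the k nearest neighbors; class of interest assumed to be "1"
    neighbor_classes <- c(0, 1, 0, 0, 1, 0, 0, 0, 1, 0)
    cutoff <- 0.2                               # user-specified, lower for a rare class

    p_interest <- mean(neighbor_classes == 1)   # proportion of neighbors in class of interest
    predicted  <- ifelse(p_interest > cutoff, 1, 0)
    p_interest; predicted                       # 0.3 and 1: classified to the class of interest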
Let us move forward. Till now, our discussion of k-NN modeling has been with respect to
classification tasks. Can k-NN be used for prediction tasks? Yes, it can.
As you have seen, in classification tasks the outcome variable has to be categorical, and
if it is numerical it has to be converted into a categorical variable through binning.
For prediction tasks, on the other hand, the outcome variable has to be a numeric or
continuous variable. So, let us discuss how k-NN for a prediction task differs from k-NN
for a classification task.
Again, the main idea is to find the k records in the training partition which are nearest
to the new observation to be predicted. This step does not change; it remains as it is.
Now, let us look at the second step. The k neighbors identified in the first step are
used to predict the value of the new observation. Earlier, in the classification task,
they were used to predict the class of the new observation, using the majority decision
rule or the class having the highest probability value. Now we want to predict a value;
how do we do that?
Instead of taking the most prevalent class, because this is a prediction task we take the
average value of the outcome variable among the k neighbors identified in the previous
step, and this average becomes the predicted value for the new observation. Sometimes
researchers or analysts prefer a weighted average, and generally the weights are computed
such that the weight for a neighbor decreases as its distance from the new observation
increases.
(Refer Slide Time: 15:13)
In the figure on the slide, different points represent neighboring records around the new
observation. Suppose we have k = 4; based on the distances, we identify the 4 nearest
neighbors. Comparing these neighbors, one of them has the smallest distance to the new
observation, followed by the second, the third and the fourth.
As the distance increases, the weight decreases, so we give more preference to the record
which is closer to the new observation: the closest neighbor gets the largest weight,
followed by the next closest, and so on. So, as the distance between the new observation
and a neighboring record increases, its weight decreases. This kind of weighted average
is also sometimes used in k-NN for prediction tasks.
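One common way to implement this, sketched below in R with hypothetical neighbor distances and outcome values, is to use inverse-distance weights so that closer neighbors contribute more to the prediction (the specific weighting scheme is an assumption here; the lecture only requires that weights decrease with distance):

    # Hypothetical outcome values and distances of the k = 4 nearest neighbors
    neighbor_y    <- c(12.0, 15.5, 11.2, 18.3)
    neighbor_dist <- c(0.8, 1.4, 2.1, 3.0)

    # Simple average
    mean(neighbor_y)

    # Inverse-distance weighted average: weight decreases as distance increases
    w <- 1 / neighbor_dist
    weighted.mean(neighbor_y, w)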
Now, the performance metrics also change. Earlier we were looking to minimize the overall
misclassification error; that was the metric for classification tasks. In the performance
metrics lectures we also talked about various metrics that could be used for the
prediction task, including RMSE. For k-NN prediction, RMSE is generally used, but we can
also use other prediction error metrics discussed in those lectures.
If we compare the steps of k-NN for the classification task and the prediction task, the
first step remains the same: we identify the k nearest records. Then, for classification
we find the most prevalent class, whereas for prediction we take the average value of the
outcome variable among the k nearest records as the predicted value. Let us now go
through the steps of finding neighbors and making the prediction in more detail.
Let us go through these steps once again. Step one is to compute the distance between the
new observation and the training partition records. If there are more records in the
training partition, more such distances have to be computed, even for a single new
observation. Once this is done, step two is to determine the k nearest or closest records
to the new observation, which is now easy: we sort the records in increasing order of
distance and pick the first k records, those with the smallest distances, as the k
nearest neighbors. Step three is to compute the average or weighted average of the
outcome variable values among these k neighbors, and that average or weighted average
becomes the predicted value of the new observation.
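Putting these three steps together, here is a minimal base-R sketch of k-NN prediction (the data frame, column names and k value are illustrative assumptions, and the predictors are assumed to be already normalized, as discussed in the earlier k-NN lectures):

    # Illustrative training partition: two numeric predictors and a numeric outcome
    train_X <- data.frame(x1 = c(1.0, 2.1, 0.5, 3.2, 2.8, 1.7),
                          x2 = c(0.9, 1.8, 0.4, 3.0, 2.5, 1.2))
    train_y <- c(10, 14, 8, 20, 18, 12)
    new_obs <- c(x1 = 2.0, x2 = 1.9)
    k <- 3

    # Step 1: Euclidean distance between the new observation and every training record
    d <- sqrt((train_X$x1 - new_obs["x1"])^2 + (train_X$x2 - new_obs["x2"])^2)

    # Step 2: indices of the k nearest (smallest-distance) records
    nn <- order(d)[1:k]

    # Step 3: simple or inverse-distance weighted average of the outcome among neighbors
    mean(train_y[nn])
    weighted.mean(train_y[nn], w = 1 / d[nn])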
Let us talk about a few more specific points on the k-NN algorithm. Some of the
advantages of k-NN are quite obvious. First, simplicity: as you will have understood from
the very simple steps we have been discussing, simplicity is a key advantage of the k-NN
algorithm. Second, it is a nonparametric approach, so we do not have to estimate any
parameters of an assumed functional form; in multiple linear regression, for example, we
had to estimate the betas and other parameters. Here we do not assume any functional
form, linear or otherwise; we just measure similarity, and there are many distance-based
similarity metrics that can be used for this. So, a simple and nonparametric approach:
these are some of the advantages of k-NN.
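As an illustration of this simplicity, R's class package provides a one-line k-NN classifier (an assumption here; the course exercises may use a different function); the sketch below uses the built-in iris data purely as an example:

    library(class)                        # provides knn(); install.packages("class") if needed
    set.seed(1)

    idx   <- sample(nrow(iris), 100)      # simple split into training and validation parts
    train <- iris[idx, 1:4]
    valid <- iris[-idx, 1:4]

    # Single call: distances, neighbor search and majority vote are handled internally
    pred <- knn(train, valid, cl = iris$Species[idx], k = 5)
    table(pred, iris$Species[-idx])       # classification matrix on the validation part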
And, as we have talked about, if we are dealing with a very large data set, we can bring
the value of k down to a lower value so that the computation problem can be handled, and
a large number of observations can still be classified or predicted.
(Refer Slide Time: 21:24)
Now, let us look at some of the problems that we might encounter in k-NN modelling. The
first is the computation time needed to find the nearest neighbors for a large training
partition. If the optimal value of k is on the higher side, we have to do many more
computations; and because for every new observation we have to compute distances to each
of the records present in the training partition, a large partition means many more
computations and hence a longer computation time.
Dimension reduction techniques can be applied to manage this problem. If we are able to
reduce the number of variables, the set of predictors used for k-NN modelling, then this
issue can be handled: we will have fewer computations to perform. Since we use the
Euclidean distance metric to compute the distances, fewer coordinates mean a shorter
Euclidean distance formula and fewer computations. So, dimension reduction techniques
definitely help; we should keep just the most useful predictors in our model, and that
also reduces the computation time for k-NN.
Another approach to handle this problem is to optimize the neighbor-search steps using
efficient data structures designed for such operations. When we want to find the k
nearest neighbors and the value of k is on the higher side, the search operation can take
a lot of computation time, but there are many optimized, more efficient search algorithms
that can be used. Efficient data structures, for example trees, are also available that
can significantly reduce the search time. By optimizing the steps to find neighbors in
this way, we can reduce the computation time for finding the nearest neighbors.
Another thing that can be done is the identification and pruning of redundant records
from the training partition, so that these records are not included in the
neighbor-search steps. If the data set is quite large, there will be many records in the
training partition that are not really required for the neighbor search: they are always
crowded out by other records falling in the same class, especially in a classification
task. For such records we can avoid the search, and even the distance computation,
because they are never going to figure in the k-neighborhood. So, some of those records
can be identified and pruned, which reduces the number of computations required when we
perform the neighbor-search steps.
The next problem associated with the k-NN algorithm arises from the curse of
dimensionality. If there are more predictors, another problem appears, specifically in
the k-NN context: as noted in the first point under curse of dimensionality, the sample
size requirement depends on the number of predictors. With more predictors, the number of
observations we require in our data set to build a useful model also increases. More
predictors mean more cases, more observations and records, which in turn increases the
number of computations in the k-NN algorithm, because the training partition will also
have more records, and for each new observation we have to compute the distances and then
search for the k neighbors. So, those computations also increase.
In the previous point we talked about dimension reduction techniques; this is one of the
things that should be done, applying dimension reduction so that we retain only the
useful set of predictors. Otherwise other problems come in, including the one related to
sample size: the number of observations required will be on the higher side if we include
more predictors. As we discussed, if we do not handle this curse of dimensionality, more
predictors mean a bigger required sample size and therefore more observations, and that
eventually leads to more computations for the neighbors, whether for the distance
computation used to identify the nearest neighbors or for selecting the k nearest
neighbors.
Even with these limitations, the k-NN algorithm in some situations outperforms many other
algorithms, because depending on the requirement the optimal value of k can be brought
down and we still get useful results. So, in many situations it remains one of the more
useful algorithms.
With this we conclude our discussion on k-nearest neighbors. In the next lecture we will
start our discussion on Naive Bayes.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 31
Naive Bayes - Part I
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the
previous lecture, we concluded our discussion on k-NN, k-nearest neighbors. In this
particular lecture, we are going to start our discussion on Naive Bayes. So, let us
start.
Before we go into Naive Bayes, let us discuss the complete or exact Bayes, which is the
basis for Naive Bayes modelling. For complete or exact Bayes, specifically for
classification, let us write down the steps that are required; you will see some
similarity between these steps and the k-NN algorithm. In complete or exact Bayes for
classification tasks, the first step is to search for records in the training partition
having the same predictor values as the new observation to be classified.
Depending on the number of predictors in our dataset, say three or four, we take the
values the new observation has for all those predictors and find, in the training
partition, all the records which have exactly the same values. The values should match
exactly; we need exact matches on every predictor. So, the first step is to find all such
records in the training partition which have the same predictor values as the new
observation to be classified.
Once we have the list of all such records, in the second step we find the most prevalent
class. We are discussing this for the classification task, so the outcome variable of
interest has classes: in a two-class scenario we have class 1 and class 0, and in an
m-class scenario we have more than two classes. Among the records listed in step one, we
have to find the most prevalent class of the outcome variable.
In the third step, this most prevalent class is assigned as the class of the new
observation. You can see the similarity with the k-NN approach in terms of steps. In k-NN
we also had very simple steps: compute the distances between the new observation and the
observations in the training partition, search for the k nearest neighbors, look for the
most prevalent class among them and assign it to the new observation. Here, instead of
computing distances, we look within the training partition for the records which have the
same predictor values as the new observation; the next step is the same, finding the most
prevalent class among the records identified in step one, and then assigning that class
to the new observation.
Now, as we have talked about before, the class-of-interest scenario is different from the
usual or typical scenario, where we focus on minimizing the overall misclassification
error and do not have any preference for a particular class. Sometimes we are interested
in a particular class and would like to identify the members belonging to it; that is, we
have a class of interest.
In that case, as we have discussed for previous techniques as well, a user-specified
cutoff value for the class of interest has to be established first. Establishing this
cutoff value depends on our expertise and on the level of misclassification error that we
are willing to tolerate for the other classes while still identifying more records
belonging to the class of interest. If the class of interest also happens to be a rare
class, more judgment is required to arrive at that value. So, we need to establish this
cutoff value first.
The next step is again to search for records in the training partition having the same
predictor values as the new observation to be classified; this is the same as in the
approach where we go by the majority decision rule or the most prevalent class. Once this
list is known to us, we find the probability of a record belonging to the class of
interest among the records identified in the previous step.
Then, as discussed for k-NN and even earlier, this computed probability value is compared
with the user-specified cutoff value from the first step, and if it is greater, the new
observation is assigned to the class of interest. So, these are the two approaches: one
where the classes are treated equally, or we do not have a class of interest, and we
simply take the most prevalent class; and one where we have a class of interest, so we
specify the cutoff value, compute the probability of belonging to that class, compare the
two numbers and then make the class assignment. These are the steps.
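A small base-R sketch of the exact Bayes steps, with an illustrative data frame and column names that are assumptions, not the course data:

    # Illustrative training partition with two categorical predictors and a class label
    train <- data.frame(x1 = c("a", "a", "b", "a", "b", "a"),
                        x2 = c("p", "q", "p", "p", "q", "p"),
                        y  = c(1, 0, 0, 0, 0, 1))
    new_obs <- list(x1 = "a", x2 = "p")

    # Step 1: records with exactly the same predictor values as the new observation
    matches <- train[train$x1 == new_obs$x1 & train$x2 == new_obs$x2, ]

    # Steps 2-3: most prevalent class among the matches becomes the predicted class
    class_probs <- prop.table(table(matches$y))
    names(which.max(class_probs))

    # Class-of-interest variant: compare P(class of interest) with a user-specified cutoff
    cutoff <- 0.2
    ifelse(class_probs["1"] > cutoff, "class of interest", "other")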
The steps we have talked about so far are for the complete or exact Bayes; we have not
yet started our discussion of Naive Bayes, only of the main Bayes principle underlying
it. For complete or exact Bayes, in both of the scenarios we discussed, the regular
scenario and the class-of-interest scenario, the underlying probability concept is
conditional probability.
(Refer Slide Time: 08:11)
In scenario number one, the regular or typical scenario where we are interested in
minimizing the overall error, we assign the new observation to the class with the highest
probability value, and that value is computed using conditional probabilities. The steps
that we talked about earlier are simply being expressed here in probability terms: we
compute the conditional probability for each class, given the predictor information, and
once that is done the new observation is assigned to the class with the highest
probability value.
In the class-of-interest scenario, if the probability value for the class of interest is
greater than the cutoff value for that class, we assign the new observation to the class
of interest. The main idea is that when we find the records in the training partition
having the same values as the new observation, we are essentially computing this
conditional probability, because in the next step we find the most prevalent class, that
is, the class having the highest probability value. So, the probability value in question
is the conditional probability given by the expression on the slide.
As you might have understood by now, the predictors that we include in Bayes modeling
have to be categorical. In k-NN and other techniques used for classification tasks, the
predictors can generally also be continuous variables. But in the Bayes model, not only
does the outcome variable have to be categorical, as in any classification task, the
predictors must also be categorical; if they are numerical variables, they have to be
converted into categorical variables through binning.
So, if we are going to use Naive Bayes modeling, or Bayes modeling in general, all the
variables, whether predictors or the outcome variable, should be categorical, and
continuous variables have to be converted into categorical variables through binning.
This is an important difference: among the techniques that we cover in this course, this
is the one technique that requires all the predictors to be categorical.
From this you will also understand that Bayes modeling is effectively not used for
prediction tasks; it is used mainly for classification. Even though theoretically we
could apply a Bayes model to a prediction task, it would be very difficult to find
records in the training partition having the same values as the new observation, because
matching the exact numeric values of a new observation with records in the training
partition is very hard. This difficulty increases many fold as the number of predictors
increases, and even one mismatch means that record cannot be selected for the further
steps. Therefore, it becomes impractical to apply the Bayes model for prediction tasks.
So, most of our discussion in this topic, whether for complete or exact Bayes or for
Naive Bayes, will be around the classification task, and both the predictors and the
outcome variable have to be categorical or converted to categorical variables through
binning. Now we will do an exercise using Excel: we will apply complete or exact Bayes,
see what issues we might encounter and how it can be applied to a real problem.
Let us open this Excel file. This particular example is about audit data. In the
financial domain, firms are required to submit certain financial documents and statements
to regulatory bodies for inspection, and before that they are required to get them
audited by accounting firms. The accounting firms have to deploy considerable human
resources, systems, analytic solutions and software to analyze the financial statements
and reports submitted by their clients, and they have a strong incentive to find
fraudulent statements; otherwise they can be penalized by the regulatory body, since they
certify that the reports are legitimate and the responsibility lies with them. It is
therefore very important for them to find fraudulent or erroneous reporting.
If an accounting firm has long-standing clients, it will also have information about
previous legal troubles that may have occurred for different clients. The firm might be
interested in knowing whether this historical information about past audits can be used
to find potentially fraudulent reports or statements. That would make the task of
auditing much easier, because the firm could then do more intense scrutiny of the few
identified reports, those where the chances of being fraudulent are on the higher side.
Let us say we have 1000 such reports, so the sample size is 1000. The variable we have
from the historical information on the accounting firm's clients is whether a particular
client had prior legal trouble or not, that is, whether in previous years the reports
submitted by that client were found to be problematic and led to legal trouble. With that
information, can we identify some of the fraudulent reports and the truthful reports?
Based on this data set we can produce a summary table, a classification-style cross
table, using the pivot table option in Excel. On one side we have prior legal trouble,
which is the predictor; only one predictor is used here, and as we discussed, with more
predictors the chances of finding exact matches go down. Here x = 1 means prior legal
trouble and x = 0 means no prior legal trouble, and then we have whether the report was
fraudulent or truthful.
So, 50 fraudulent reports had prior legal trouble, 50 fraudulent reports did not have
prior legal trouble, 180 truthful reports had prior legal trouble, and 720 truthful
reports had no prior legal trouble. This is the hypothetical example we have, and the
totals for all these combinations have also been computed. From this, we can compute some
of the conditional probability values we talked about.
Suppose we want to compute the probability of a particular report being truthful given
that it had prior legal trouble. Out of all the reports which had prior legal trouble,
that is 230, 180 were found to be truthful. So, the conditional probability is computed
as 180 divided by 230, which gives the probability of a report belonging to the truthful
class given prior legal trouble; this number comes out to be about 0.78.
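The same summary table and conditional probabilities can be reproduced in R (the lecture exercise uses Excel; this is just an equivalent sketch built from the counts quoted above):

    # Counts from the hypothetical audit example (rows: prior legal trouble, cols: report class)
    audit <- matrix(c(50, 180,    # prior legal trouble = 1: fraudulent, truthful
                      50, 720),   # prior legal trouble = 0: fraudulent, truthful
                    nrow = 2, byrow = TRUE,
                    dimnames = list(prior_legal = c("1", "0"),
                                    class = c("fraudulent", "truthful")))

    # Conditional probabilities of each class given prior legal trouble = 1
    prop.table(audit["1", ])     # fraudulent ~0.22, truthful ~0.78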
Now, let us apply the most probable class method here. Once the records having exact
matches have been identified, as per the steps we discussed, we compute the most
prevalent class among those records. We already have the probabilities of the different
classes, fraudulent and truthful, which are 0.22 and 0.78. The higher number is 0.78, so
under the most probable class method the new observation, a report with prior legal
trouble, would be classified as a truthful report. That is the conclusion the most
probable class method leads to.
This whole example is quite simple because we are dealing with just one predictor. Now
let us apply a different approach, the cutoff probability method, for when we are
interested in one particular class, that is, we have a class of interest. In this problem
the class of interest is typically going to be the fraudulent report: as an accounting
firm we are more interested in identifying the fraudulent financial reports, because we
would like to do more intense scrutiny of such reports. So, the class of interest is the
fraudulent class.
For this we have to specify the cutoff probability for the class of interest; let us say
the cutoff probability is 0.2, as indicated on the slide. Then any probability value we
compute by following those steps, finding the exact-match records from the training
partition (which in this example is already summarized for us in the table) and computing
the class probabilities among them, is compared with this cutoff.
For this example, if the cutoff value is 0.2 and we compare it with the computed
probability value, which is 0.22, we see that the probability value is greater: 0.22 is
greater than 0.2. In that case the new observation is classified into the fraudulent
class. So, you can see that depending on the method our answer can change, because the
cutoff value for a class of interest is specified by the user when we are interested in
one particular class. In the typical case, where we simply want to minimize the overall
error, we would just go with the majority rule or most prevalent class, and then, as we
saw, the observation would be classified as truthful.
For the m-class scenario, the conditional probability formula we talked about can be
expressed as follows. We want the probability of a particular observation belonging to
class C_i given the predictor information x_1 to x_p. In the numerator we have the
probability of the predictor values x_1, x_2, ..., x_p given class C_i, multiplied by the
probability of a record belonging to C_i. In the denominator we have, for every class,
the same kind of expression: the probability of those predictor values given that class,
multiplied by the probability of belonging to that class, summed over all the classes.
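Written out in standard notation (a reconstruction of the slide expression), the exact Bayes formula is

\[
P(C_i \mid x_1, \ldots, x_p) = \frac{P(x_1, \ldots, x_p \mid C_i)\, P(C_i)}{\sum_{j=1}^{m} P(x_1, \ldots, x_p \mid C_j)\, P(C_j)}
\]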
This is the usual conditional probability (Bayes) formula that you might have studied
earlier, in school or during your graduation. Using this conditional probability we can
find the exact probability value, which can then be used to find the most prevalent
class, or compared with the user-specified cutoff value in the case of a class of
interest. You can see that when we have just one predictor it is quite easy to perform
these computations, do the comparisons, follow the steps and arrive at the
classification.
But what we are talking about is finding exact matches: for the new observation we have
to find the records having the same predictor values, and this becomes quite difficult as
the number of predictors increases. For example, suppose in a university we have to find
a student who is doing a B.Tech in a particular branch, say computer science and
engineering, who has also taken an elective on data analytics, who has a particular grade
in that subject (recall that all predictors here have to be categorical), who is male or
female, and who belongs to a particular city or state.
As we keep increasing the number of predictors and their values, it becomes more and more
difficult to find exact matches, even though these predictors are categorical and have a
limited number of labels. So, that becomes the practical limitation of applying complete
or exact Bayes.
As you can see on the slide as well, the limitation of complete or exact Bayes is that
even with a small number of predictors, many new observations to be classified might not
get any exact matches. As in the example we just discussed, with only four, five or six
predictors, say gender, hometown, undergraduate program, a particular elective course and
the grade in that course, you can see that it becomes quite difficult to find exact
matches among the thousands of students who might be on campus. If I specify the city,
the branch of engineering, the name of an elective course, a particular grade in that
course and the gender, then out of, say, ten thousand students we may end up with just
five or ten, and once they also have to come from the same city, hardly two or three.
So, with just five or six variables the number of exact matches comes down to a very
small number; it is hard to find exact matches, and therefore many probability values
cannot be computed. How are we going to classify a new observation if there are no exact
matches? We would not be able to, because we need to find the exact matches and then find
the most prevalent class within them. And even if we do find a few exact matches, all the
classes might not be covered among them, and the probability values computed from so few
matches might not be meaningful.
The idea is that there should be many more matches, so that the probability values we get
are reasonably stable for the given sample and problem, and our classification is more
accurate and less dependent on the particular data; the classification results should be
more stable. So, we need more matches; even if we find matches, having very few of them
is also problematic.
The next point in the limitations is that the probability of a match can reduce
significantly on adding just one variable to the set of predictors. If we add one
additional predictor, and that predictor has just two classes occurring with equal
frequency, it would reduce the number of exact matches, and hence the probability of
finding a match, by a factor of two. If we include a predictor with, say, five classes,
that would reduce the probability of finding exact matches by roughly a factor of five.
Therefore, because of these limitations it is quite difficult to apply complete or exact
Bayes in our modeling exercises. So, what is the solution? The solution is Naive Bayes:
an assumption, a simplification of complete or exact Bayes, that allows us to apply Bayes
modeling to different classification problems.
We will stop here and continue our discussion on Naive Bayes in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 32
Naive Bayes - Part II
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the
previous lecture we started our discussion on Naive Bayes, beginning with complete or
exact Bayes. We went through the different steps required for modeling with complete or
exact Bayes, and we also discussed a few more points about the class of interest and the
limitations of complete or exact Bayes.
While discussing the limitations we noted that in complete or exact Bayes, even with a
small number of predictors, many new observations to be classified might not get exact
matches: the combination of predictor values taken by a new observation may simply not be
found in the training data set.
We also talked about how including even one or two more predictors, each with a few
categories, decreases the probability of finding a match. Because of these limitations,
applying complete or exact Bayes in typical data mining modeling exercises becomes
difficult. So, what is the solution? That is the part we are going to cover in this
particular lecture.
Instead of complete or exact Bayes we can switch to what is called Naive Bayes, which is
the core of this particular lecture and topic. Let us talk about what happens in Naive
Bayes. In Naive Bayes, all the records are used, instead of relying on just the matching
records.
As we discussed, in complete or exact Bayes, for a new observation that we want to
classify, the values of the different predictors taken by that observation have to be
matched with the training partition records and the matching records have to be
identified; sometimes that is simply not possible. In Naive Bayes we do not rely only on
the matching records; the procedure incorporates the entire dataset while computing the
Bayesian probability values. Of course, this comes with some relaxation, some
assumptions, that we build into Naive Bayes. So, let us talk about the modification we
perform on complete or exact Bayes to convert it into Naive Bayes.
Let us discuss the Naive Bayes modification. Suppose there are m classes of the outcome
variable, and take a particular class, class i. These are the steps for Naive Bayes, and
we will then analyze how they differ from complete or exact Bayes. The first step is: for
class i of the outcome variable, compute the probabilities P_1, P_2, up to P_p, one for
each of the p predictors.
For example, suppose predictor x_1 has three categories, and one of those categories is
the value the new observation takes for that predictor. For that value we compute the
probability of observing that value among the records belonging to class i; that is P_1.
Similarly, the second predictor x_2 also has some value taken by the new observation;
being a categorical variable (recall from the previous lecture that all the variables,
outcome and predictors, are categorical), it has several categories, and for the category
taken by the new observation we compute the probability of that value among the class i
records; that is P_2. In this fashion, for each of the predictor values x_1, x_2, ..., x_p
taken by the new observation, we compute the probability of that value given class i.
Once we have computed these values P_1, P_2, ..., P_p, the next step is to multiply them
together, P_1 multiplied by P_2 and so on up to P_p, and then multiply the product by the
proportion of records belonging to class i, that is, out of all the records in the
training data, the fraction that actually belong to class i. So, we compute
P_1 x P_2 x ... x P_p x P(C_i).
Now, the third step says that these two steps have to be repeated for all the classes. We
described them for a specific class i, but there could be m classes of the outcome
variable, C_1, C_2, C_3, up to C_m, and for each of those classes we have to carry out
step 1 and step 2 and arrive at the step-2 value for that class. Once these computations
have been done for all the classes with respect to the new observation's values, we come
to the next step.
To compute the probability of the new observation belonging to class i, we divide the
value computed in step 2 for class i, that is, the product of those probabilities
multiplied by the proportion for that class, by the sum of the values computed in step 2
for all the classes. Since steps 1 and 2 have been repeated for all the classes, we sum
the resulting values, that sum becomes the denominator, and dividing the class-i value by
it gives us the probability of the new observation belonging to class i.
(Refer Slide Time: 09:41)
In the next step we execute what we did for class i for the remaining classes of the
outcome variable; this has to be performed for all the other classes as well. From this
you can see that we have to compute a lot of probabilities, followed by multiplication
and division operations, and that has to be done for all the classes. Once we execute
this for all the classes, we have the probability of the new observation belonging to
each of the m classes of the outcome variable.
(Refer Slide Time: 10:55)
Let us recap these steps. What happens in the Naive Bayes modification is slightly
different; some calculations differ from complete or exact Bayes, which we will discuss
shortly. In the first step we compute, for each predictor value taken by the new
observation to be classified, the probability of that value for class i; as discussed,
for each of the values x_1, x_2, and so on, we compute these probabilities. We then
compute the expression in which all these probabilities are multiplied together and then
multiplied by the proportion of records in that class, and this is executed for every
class. Once we have these values for all the classes, the value for each class is divided
by the sum of all of them, and that is how we get the probability for each class.
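These steps can be translated almost directly into base R. The sketch below uses a small made-up data frame with two categorical predictors; the names and values are illustrative assumptions, not the course data:

    # Illustrative training data: two categorical predictors and a class label
    train <- data.frame(x1 = c("a", "a", "b", "a", "a", "b"),
                        x2 = c("p", "q", "p", "q", "p", "q"),
                        y  = c("c1", "c1", "c2", "c2", "c1", "c2"))
    new_obs <- list(x1 = "a", x2 = "q")

    classes <- unique(train$y)

    # Steps 1-2: for each class, product of P(x_j value | class) times P(class)
    numerators <- sapply(classes, function(cl) {
      sub <- train[train$y == cl, ]
      p1  <- mean(sub$x1 == new_obs$x1)      # P(x1 value | class)
      p2  <- mean(sub$x2 == new_obs$x2)      # P(x2 value | class)
      p1 * p2 * (nrow(sub) / nrow(train))    # multiplied by the class proportion
    })

    # Steps 3-4: divide each class value by the sum over all classes
    posteriors <- numerators / sum(numerators)
    posteriors
    names(which.max(posteriors))             # predicted class for the new observation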
Let us now look at the Naive Bayes formula and compare it with the complete or exact
Bayes formula that we saw in the previous lecture. We want P(C_i | x_1, x_2, ..., x_p),
the probability of an observation belonging to class C_i given the predictor values x_1,
x_2, up to x_p; there are p predictors, and for each of them the observation has some
value. This probability is computed using the following expression: in the numerator, the
conditional probability of x_1 given class C_i, multiplied by the conditional probability
of x_2 given C_i, and so on up to the conditional probability of x_p given C_i, and then
multiplied by the proportion of records that belong to class C_i. In the denominator, the
same expression appears for every class: first for C_1, then the similar expression for
C_2, and so on up to the last class, the m-th class.
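In standard notation (again a reconstruction of the slide expression), the Naive Bayes formula is

\[
P(C_i \mid x_1, \ldots, x_p) = \frac{P(C_i)\, \prod_{j=1}^{p} P(x_j \mid C_i)}{\sum_{k=1}^{m} P(C_k)\, \prod_{j=1}^{p} P(x_j \mid C_k)}
\]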
(Refer Slide Time: 15:05)
Now let us compare this with the formula for complete or exact Bayes, which we saw
earlier: the usual conditional probability expression for the probability of an
observation belonging to C_i given the predictor values. You can see one difference, in
the numerator as well as in the denominator: the joint probability of the observation
having these predictor values given class C_i, P(x_1, x_2, ..., x_p | C_i), has been
replaced by the product of the individual probabilities.
This is what the next few points explain. The Naive Bayes formula is directly derived
from the exact Bayes formula after making the following assumption: that the predictor
values x_1, x_2, ..., x_p occur independently of each other for a given class. This is
the approximation we have made; P(x_1, x_2, ..., x_p | C_i) was the probability we were
supposed to compute in complete or exact Bayes, and under Naive Bayes it becomes the
product of the individual conditional probabilities.
Once we say that these values occur independently of each other, the joint expression can
be written in this product form: as you might have studied in your probability courses,
if events occur independently, then the probability of them occurring together is the
product of their individual probabilities. That is exactly what has been done here, under
the assumption that these predictor values occur independently of each other, so the
joint conditional probability can be decomposed into this product form.
So, from the joint conditional probability we move to a product of simpler conditional
probabilities, simpler in the sense that the predictor values are treated as independent
of each other given the class. This is the approximation that has been made. The
assumption may not always be true in practical situations; however, it has been observed
that the results we get by applying Naive Bayes are quite good in comparison to other
techniques, even when the assumption is not met.
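In practice these probabilities are rarely computed by hand; in R, one commonly used implementation (an assumption here, the course may use a different package) is naiveBayes() from the e1071 package, shown below on the audit example rebuilt from the counts used earlier:

    library(e1071)   # provides naiveBayes(); install.packages("e1071") if needed

    # Rebuild the audit example as individual records from the earlier counts
    audit <- data.frame(
      prior_legal = factor(rep(c("1", "1", "0", "0"), times = c(50, 180, 50, 720))),
      class       = factor(rep(c("fraudulent", "truthful", "fraudulent", "truthful"),
                               times = c(50, 180, 50, 720))))

    model <- naiveBayes(class ~ prior_legal, data = audit)

    # Posterior probabilities for a new report that had prior legal trouble
    predict(model, data.frame(prior_legal = factor("1", levels = c("0", "1"))),
            type = "raw")    # ~0.22 fraudulent, ~0.78 truthful, matching the Excel exercise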
Let us move forward. As we discussed, for classification the Naive Bayes formula works
quite well. Another important aspect is that we do not require the probability values to
be accurate in absolute terms; we only need a reasonably accurate rank ordering of these
values. Essentially, once we compute the probability values, we need to find the highest
one, and the new observation is assigned to the class with that highest probability
value. So, we only need to know which probability values are higher and which are lower,
that is, the rank ordering of these values.
Why does it matter that some of these probability values may not be accurate? Because of
the assumption we made in Naive Bayes: we converted the joint conditional probability of
the predictor values into a product of individual conditional probabilities, assuming the
predictor values are independent of each other. Because of that approximation, our
probability values will more often than not be somewhat inaccurate, but that does not
create a problem for the modelling exercise, because we only need the order of those
values. The rank ordering lets us find the highest rank, and the new observation is
assigned to the class with the highest probability value. That is why, even when the
assumption of independent predictor values is not met in practice, the Naive Bayes model
works quite well: we only need accuracy in the rank ordering, not in the actual
probability values.
So, for the same reason that we have been discussing, we can use the numerator only and drop the denominator, which is common for all the classes. Let us go back to the formula that we discussed, the Naïve Bayes formula. If we are just arranging the rank ordering of the probability values for the different classes, you can see that, as per this expression, the denominator is going to be the same for all the classes, because it is the summation of the values computed for all the classes.
So, for the comparison of probability values to create the rank ordering, the denominator does not matter. We can just focus on the numerator, compare the numerator values, and we will still end up with the same rank ordering. That is another simplification that can be performed in Naïve Bayes: we will not have to perform the calculations related to the denominator and can focus only on the numerator and the computations required for it.
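Put compactly, under the regular minimum-error objective the classification rule therefore reduces to comparing numerators only (same notation as above):

  assign the new observation to the class C_i that maximizes  P(C_i) \prod_{j=1}^{p} P(x_j \mid C_i)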
Now, till now what we have been discussing mainly applies when we try to minimize the overall classification error, or equivalently maximize the overall classification accuracy. As we have discussed for other techniques, multiple linear regression or k-NN (specifically k-NN, because there we discussed mainly classification tasks), the question now is: when we have a class of interest, how are our steps actually going to change?
So, the earlier steps were for the typical scenario, where we are looking to minimize the overall error. When we have a class of interest, we want to identify more records belonging to that class, even if it comes at the expense of misidentifying or misclassifying some of the records belonging to other classes; the focus is on one particular class of interest. For example, in the financial reporting example that we discussed in the previous lecture on Naïve Bayes, an auditor at an accounting firm would be interested in finding the financial reports which might be fraudulent, and not so much in identifying the truthful reports, because they would like to do a more serious scrutiny of the reports which might be fraudulent.
So, they would like to identify more of the reports which could be candidates for serious scrutiny, that is, the potentially fraudulent reports, and therefore it is important for them to identify those statements even if it comes at the expense of misclassifying some of the truthful reports as fraudulent.
So, let us discuss the steps when we have a class of interest. The first step is typically going to be specifying a cutoff value for the class of interest. As we have been talking about in the previous lectures, the default cutoff value is typically 0.5; when we have a class of interest, we are interested in identifying more members of that particular class, and we are not too focused on the other classes. Therefore, we would like to identify more records of this class. This cutoff value can be treated as a parameter of the model, or as a slider: we can slide this cutoff value, so it can sometimes be 0.2, sometimes 0.4. The main idea is to identify more of the class 1 records. If there are many classes, of course, the probability of a record belonging to class 1, the class of interest, is going to be slightly on the lower side.
So, if we change the cutoff value to, let us say, 0.2, then any new observation whose estimated probability is more than 0.2 is going to be classified as belonging to the class of interest; otherwise it would belong to one of the other classes. This is the typical idea: we specify a cutoff value which can help us identify more of the records belonging to class 1.
So, once we have done our modelling exercise and estimated the probabilities of belonging to class 1 for the different observations, we can always slide the cutoff value; as we slide it, the results in the classification matrix will change, and therefore we can find the appropriate cutoff value which helps us identify more class 1 records, more class-of-interest records.
So, as the first step we have to specify a user-specified cutoff value. When we were discussing performance metrics, we had done one particular exercise using Excel, where we were changing the cutoff value and saw that the results in the classification matrix were also changing. That is the exercise I am referring to. We had also created a one-variable data table with different cutoff values and the respective classification error and accuracy numbers. That kind of exercise can help us in establishing a cutoff value for the class of interest.
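As a minimal illustration of this cutoff exercise in R (a sketch only; the object names prob1 and actual are hypothetical, not from the lecture's files), one can tabulate the classification matrix for a few candidate cutoffs:

  # prob1: estimated probabilities of belonging to the class of interest
  # actual: true class labels, a factor with levels "interest" and "other"
  for (cutoff in c(0.2, 0.3, 0.4, 0.5)) {
    pred <- ifelse(prob1 > cutoff, "interest", "other")
    cat("cutoff =", cutoff, "\n")
    print(table(Actual = actual, Predicted = pred))
  }

Each printed table plays the same role as the one-variable data table in the Excel exercise: it shows how the error on the class of interest trades off against the error on the other classes as the cutoff slides.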
So, once the cutoff is specified we move to step number 2, which is quite similar to the corresponding step in the regular scenario.
(Refer Slide Time: 28:40)
So, once these conditional probabilities are computed, we move to the next step, where we multiply them together, P(x 1 | C 1) through P(x p | C 1), one for each predictor value of the new observation, and then multiply the product by the proportion of records belonging to the class of interest, which serves as the probability of a record belonging to that class. Once this is done, we have this particular value for the class of interest.
Now, we have to execute these two steps for all the classes. This is especially required if we are not just focusing on the numerator and want to compute the full Naïve Bayes probability; in that case step 3 is required for all the classes. Even if we are not using the denominator, we would still be required to compute these values for comparison purposes, especially in the regular scenario, not so much in the class-of-interest case. The next step, to compute the probability of the new observation belonging to the class of interest, is to divide the value computed in step 2 by the summation of the values computed in step 2 for all the classes.
So, in step 3 we talked about computing the step 2 value for all the classes. Now all those values are summed up, and that sum is used as the denominator to divide the value computed in step 2 for the class of interest; that gives us the probability of the new observation belonging to class i. In this case, as I clarified, we use both the numerator and the denominator, not just the numerator.
So, once this is done, we classify the new observation to the class of interest if the computed probability value is greater than the cutoff value defined in step 1. For example, if the value comes out to be 0.25 and the cutoff value is 0.2, then the new observation is going to be classified to the class of interest.
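Written out for a class of interest C_1, the steps above amount to the following (same notation as before; the cutoff is the value chosen in step 1):

  P(C_1 \mid x_1, \ldots, x_p) = \frac{P(C_1) \prod_{j=1}^{p} P(x_j \mid C_1)}{\sum_{k=1}^{m} P(C_k) \prod_{j=1}^{p} P(x_j \mid C_k)}

  classify the new observation to C_1 if this probability exceeds the cutoff.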
So, with this we will stop here, and in the next lecture we will learn Naïve Bayes modeling through an exercise in Excel and also try to do an exercise in R.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 33
Naive Bayes – Part III
Welcome to the course Business Analytics and Data Mining Modeling Using R. So, in the previous lecture we were discussing Naive Bayes, and specifically we looked at the Naive Bayes formula. We also looked at the steps for the same, both when we have a class of interest and when we do not have a particular class of interest, that is, the general or typical scenario that we have also discussed in previous sessions. So, these are some of the steps that we discussed previously.
So, what we are going to do in this particular lecture is a small exercise to understand Naive Bayes and the various steps of Naive Bayes modeling, the computations and other things. We will do an exercise using Excel, followed by an exercise in R.
So, in the previous session we had gone through one example related to complete or exact Bayes. You can also see that classification matrix here, where we had computed the probability of being fraudulent given prior legal trouble. That particular example had just one predictor, and we had computed the probabilities of belonging to the fraudulent class and the truthful class given that predictor's information; based on that, using the most probable class method and the cutoff probability method with a cutoff of 0.2, we had seen what the assigned class would be. Now the same thing can be extended, especially using the Naive Bayes calculation, if we have 2 predictors, so let us go through an example with 2 predictors.
So, the first predictor in this case is prior legal trouble, as in the previous case, and the second one is company size; the status of the financial reports, whether they were found to be truthful or fraudulent, is also given here. So, this is our data, a small data set, shown in the highlighted section. As you can see, for each observation, whether the firm had prior legal trouble or not (yes or no) has been indicated, the company size has been specified as small or large, and the status of the financial report, truthful or fraudulent, is also given.
Now, if we were to follow complete or exact Bayes modeling, how do we go about the different calculations and computations? Let us see. Say we have to compute the probability of a particular financial report being fraudulent given the prior legal trouble and the company size. In this case, because we have just 2 categorical predictors and both of them have 2 classes each (prior legal trouble is either yes or no, and company size is either small or large), we will have 4 scenarios.
So, especially if we are interested only in identifying the fraudulent reports, for the fraudulent class we have these 4 scenarios: first, prior legal trouble is yes and company size is small; second, prior legal trouble is yes and company size is large; third, prior legal trouble is no and company size is small; and fourth, prior legal trouble is no and company size is large. Since we are interested mainly in the probabilities for the fraudulent class, these are the four probabilities we will calculate.
So, this is how we can do it: these details I have written down in 3 separate columns, mainly to be able to use Excel functions and to perform the computations. You can see the status F, then prior legal, and then the company size, for all 4 probabilities that we want to compute, as specified here. Now, to compute the exact Bayes probability, this is how we can do it: you can see I am using the COUNTIFS function in Excel. What does it do? We can look at the details if you are interested; let the COUNTIFS documentation load.
So, this is what we are going to use to perform our counts here.
(Refer Slide Time: 06:27)
So, the COUNTIFS function applies criteria to cells across multiple ranges and counts the number of times all the criteria are met. The criteria that are specified are applied across their respective ranges, and then the count is taken.
(Refer Slide Time: 06:50)
So, you have to specify criteria range 1 and then the criteria; in this function you can specify multiple ranges and the associated criteria for each of those ranges, which is exactly what we want. The reason is that we want to compute this probability: you can see that the first range here is the prior legal trouble column, and the first criterion is the value yes.
So, you can see here in this particular colour; the colour coding is quite visible. This first criterion helps us identify all the yes observations, that is, all the firms which had prior legal trouble. There are 1, 2, 3 and 4 such firms here. Then, if we look at the second range and criterion, it is for the company size, and the criterion is small: in that column we want to find all the firms which have a small size. The next range and criterion is the status column, shown here in green, where we apply the criterion that the report is fraudulent; those observations would be identified. Now the COUNTIFS function will apply all these criteria together, and the count will be taken of those observations which satisfy all of them.
So, a firm has to have prior legal trouble as yes, company size as small and status as fraudulent; only those observations would be counted. That is the numerator part. Then comes the denominator, where we again have criteria, but only 2 of them: first, that the firm had prior legal trouble as yes, and second, that the company size was small.
So, that is the denominator: out of all the firms which had the predictor information of prior legal trouble yes and company size small, we would like to find how many had status fraudulent, which is what the numerator counts. Once we execute this particular Excel formula we get the number 0.5. If you want to check whether the formula worked correctly, you can do so from the table itself: look at the yes-and-small firms. There are just 2 such cases, one with status truthful and one with status fraudulent. Out of these 2 exact matches on the 2 predictor values, only one has status fraudulent, so the probability is 1 divided by 2, which is 0.5. So, the formula is computing this exact Bayes probability correctly. In the same fashion we can compute the exact Bayes probability for the second scenario, that is, the probability of a firm submitting fraudulent financial reports given that it had prior legal trouble as yes and the company size is large.
So, in the same fashion, using the COUNTIFS function available in Excel, we can compute this particular value as well. Because this is a small data set, we can also find the probability from the data itself: for yes and large, as we discussed for the complete or exact Bayes case, we have to find the exactly matching records. There are 2 such records, and both are fraudulent, so 2 out of 2.
Therefore, the probability is going to be 1, and the same thing has been computed using the Excel formula as well. In the COUNTIFS formula the criteria ranges and criteria have been specified: in the numerator all 3 criteria, that the report is fraudulent, that prior legal trouble is yes and that company size is large; and in the denominator only the 2 criteria for the predictor information, yes and large. That gives us the correct exact probability for this particular case.
Similarly for the next case, where we want to compute the probability of a firm being fraudulent given the predictor information of prior legal trouble being no and company size being small: in the same fashion we can compute it. For no and small there are 3 exact matches in this small data set, but all 3 of them are truthful, so none of them is fraudulent, and the probability is 0 divided by 3, that is 0. Similarly, for the last scenario, no and large, we again look for the exact matches; there are 3 of them, the first 2 were truthful and the last one was fraudulent.
Therefore, the exact probability is going to be 1 out of 3, that is 1 divided by 3, which is 0.33, and the same thing has been computed using the Excel formulation as well. So, in this fashion we can do the complete or exact Bayes calculation, and based on that we can perform the further steps.
Now, if we were to perform the same steps using the Naive Bayes calculation, how would we do it? We are interested in the same 4 scenarios for identifying fraudulent cases, because, as we talked about, an accounting firm would be interested in identifying the fraudulent reports first, so that it can decide about serious auditing or serious scrutiny of those reports.
So, for these 4 scenarios, the same ones as for the exact Bayes calculation, we have again written the details in 3 separate columns for the calculation of the probabilities. In this case, you can see how we are trying to compute the Naive Bayes probability: we have first generated the required conditional probabilities. Let us first look at the proportion of records which belong to each of the classes. For the proportion of records which belong to the fraudulent class, look at the formula: the denominator is the count of all the records in the status column, and the numerator has the criterion specified as fraudulent, so the formula counts how many of those records actually have fraudulent as the status.
So, we get the appropriate number, which is 0.4; you can verify this by just looking at the data set, because it is quite small: 4 out of the 10 observations are fraudulent, so the proportion is 0.4. Similarly, for the truthful class we can compute in the same fashion: 6 of the observations in that column satisfy the criterion, so the formula that we have written there returns the value 0.6.
Now, let us have a look at the other conditional probabilities that we have computed. We have 2 predictors and 2 classes in the outcome variable. First, for the fraudulent class, these are the values with respect to the predictor prior legal trouble. In the numerator, you can see, we are trying to count the records where the predictor prior legal trouble is yes, that is one criterion, and where the record is also fraudulent. These 2 criteria go into the numerator count, and that count is then divided by the denominator, which is nothing but the number of records which are fraudulent out of the total records. Using this we can find the probability of a record having prior legal trouble given that it belongs to the fraudulent class. The same thing we can do for the second predictor, that is, company size.
So, again we can have a look at the formula. Here, for the value small, we look at the records for which the company size is small, out of the fraudulent cases that we have, and compute the proportion of fraudulent firms with that predictor value; this comes out to be 0.25. After this we can perform the same computations for the 3 other scenarios, as you can see.
So, these values have been computed. Similarly, the same thing has been applied for the truthful class. As you can see from the Excel formula, for the first predictor, prior legal trouble, with the value yes, we count the records where this predictor value is yes and the class is truthful, out of all the records which are truthful.
So, this value comes out to be 0.17. The same has been done for the other 3 scenarios, and for the next predictor, company size, the same type of computations has been performed.
Now, once these values have been computed, let us look back at the Naive Bayes formula that we have gone through. These are the probability values we are trying to compute: P(x 1 | C i), P(x 2 | C i) and the proportion P(C i). The proportion we have already computed, and P(x 1 | C i) and P(x 2 | C i) we have also computed, for each class: for the fraudulent class the values for predictor 1 and predictor 2, for the truthful class the values for predictor 1 and predictor 2, and for all the scenarios.
So, once these values are available we can compute the overall Naive Bayes formula. In this particular cell, as you can see, the numerator is the product of 3 cells: O8, multiplied by P8, multiplied by the cell holding the class proportion P(C i) for the fraudulent class.
So, we are computing the numerator of the probability and then dividing by the total; that gives us the value 0.53. To compute the actual probability, as we talked about, you first compute the numerator and then divide by the sum of the same expression computed for all the classes, and this is what we have done. Similarly for the next scenario: here also we look at the formula for the fraudulent class, which is the one we are interested in because we want the probability of belonging to the fraudulent class.
So, these values are O9, multiplied by P9, multiplied by the prior value in P7; these 3 values form the numerator, and in the denominator we have this value plus the corresponding value for the truthful class, which gives the actual probability as per the Naive Bayes formula. Similarly for the third scenario, and similarly for the fourth scenario.
So, we are exactly following the Naive Bayes formula, this one, for the Naive Bayes calculation, as we have discussed.
(Refer Slide Time: 24:31)
This one was for the exact Bayes calculation. Now, once these computations have been done, we can do a comparison of the 4 scenarios. If we look at the complete or exact Bayes values: for the first scenario, the probability of a report belonging to the fraudulent class given prior legal trouble yes and company size small was 0.5 under exact Bayes, and here under Naive Bayes it is 0.53. For the second scenario, prior legal trouble yes and company size large, it is 1 in the exact Bayes calculation and 0.87 here. For the third scenario, prior legal trouble no and small firm, it was 0 in exact Bayes and is 0.07 here. For the fourth scenario it is 0.31 here, against 0.33 in exact Bayes. So, if we look at the actual probability values that we get from the Naive Bayes calculation, they are quite close to the exact Bayes calculation.
However, we no longer have to deal with one overwhelming problem that we face in the exact Bayes calculation, which is that we have to find the exactly matching records; that we do not have to do, because we use the entire data set in the Naive Bayes calculation. So, you can see close numbers in both the exact Bayes and Naive Bayes calculations, and based on these probability values we can then decide whether to classify a report as truthful or fraudulent.
If we are not in a rare-class scenario, that is, we do not have a special class of interest, then we have to compute these values for the truthful class as well, for all scenarios, and then compare, as we had done in the exact Bayes example where we computed both the fraudulent and the truthful values; the most probable class method can then be applied to find out whether the observation is going to be assigned to the truthful class or the fraudulent class. If we follow the cutoff probability method with the same cutoff of 0.2, then, as per the Naive Bayes calculation or even the exact Bayes calculation, we can use these probabilities to assign a new observation to the appropriate class: for example, the first, second and fourth scenarios would be assigned to the fraudulent class, and the third one would be assigned to the truthful class.
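As a cross-check, the same small example can be reproduced in R. The data frame below is reconstructed from the counts quoted in the lecture (2 yes-and-small firms, 2 yes-and-large firms, 3 no-and-small firms and 3 no-and-large firms, with the stated statuses); it is a sketch of the Excel sheet, and the object names are mine, not the original file's:

  # reconstructed 10-record fraud example
  fraud.df <- data.frame(
    prior.legal = c("yes","yes","yes","yes","no","no","no","no","no","no"),
    comp.size   = c("small","small","large","large","small","small","small","large","large","large"),
    status      = c("truthful","fraudulent","fraudulent","fraudulent",
                    "truthful","truthful","truthful","truthful","truthful","fraudulent"))

  # exact Bayes for prior legal = yes, company size = small: use only the exact matches
  m <- fraud.df$prior.legal == "yes" & fraud.df$comp.size == "small"
  mean(fraud.df$status[m] == "fraudulent")                       # 0.5

  # Naive Bayes for the same scenario: class proportion times per-predictor conditionals
  pF   <- mean(fraud.df$status == "fraudulent")                  # 0.4
  pT   <- 1 - pF                                                 # 0.6
  numF <- pF * mean(fraud.df$prior.legal[fraud.df$status == "fraudulent"] == "yes") *
               mean(fraud.df$comp.size[fraud.df$status == "fraudulent"] == "small")
  numT <- pT * mean(fraud.df$prior.legal[fraud.df$status == "truthful"] == "yes") *
               mean(fraud.df$comp.size[fraud.df$status == "truthful"] == "small")
  numF / (numF + numT)                                           # about 0.53, as on the sheet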
So, with this, let us do a small exercise in R. What we will do first is get familiar with the data set that we have here.
So, this is the data set that we are going to import into the R environment, and then we will do an exercise in R where we will be applying Naive Bayes modeling. Let us look at the variables that we have.
(Refer Slide Time: 28:34)
These are the variables. The first one is the flight carrier; this particular data set is about flights. We have the carrier, the date, then the source, then the scheduled time of departure and the actual time of departure, then the scheduled time of arrival and the actual time of arrival, then the destination, and then the day of week.
Here 1 represents Sunday and 2 represents Monday; we have information for just 2 days, so the day is either Sunday or Monday. Then we have the flight status, whether the flight was delayed or on time. This is based on whether the actual time of departure was less than or the same as the scheduled time of departure: if it was less than or the same, then the flight is on time; if it was more than that, then it is delayed.
So, the main problem is a classification task, and the predictors that we are going to use in the modeling, as you can see from the data set itself, are categorical predictors. The task is to predict the status of a flight: depending on the predictor information, whether the flight is going to arrive on time or is going to be delayed.
So, we will stop here in this particular lecture and continue this exercise in R in the next one.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 34
Naive Bayes – Part IV
Welcome to the course Business Analytics and Data Mining Modelling Using R. So, in the previous lecture we did an exercise for Naive Bayes in Excel. We also talked about a particular data set on flights, and we are going to use it to do a modelling exercise in R. So, let us open RStudio.
So, first let us reload the xlsx library, because we have the data in Excel format and we would like to utilize the functions available in this library to import the data set, as we have been doing before.
(Refer Slide Time: 01:20)
So, the file is flight details dot xlsx. Let us import the data set. Here you can see 108 observations of 13 variables, but we do not actually have that many variables: there are some NA columns and some NA rows in this data frame, as you can see in the data set. We would like to get rid of these NA rows and NA columns, because we do not really have that many variables and rows.
So, as we have talked about in previous lectures and R exercises, the first line of code will remove the NA columns; let us execute it. The second line of code will remove the NA rows; let us execute that as well. How this works we have discussed in previous lectures. Let us look at the first 6 observations: first we have the flight number, given for the different carriers, and the flight carrier is also there.
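A minimal sketch of this import step (the file name flight_details.xlsx and the data-frame name df are assumptions for illustration; the lecture's actual names may differ):

  library(xlsx)
  df <- read.xlsx("flight_details.xlsx", sheetIndex = 1)

  # drop columns that are entirely NA, then rows that are entirely NA
  df <- df[, colSums(is.na(df)) < nrow(df)]
  df <- df[rowSums(is.na(df)) < ncol(df), ]
  head(df)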
(Refer Slide Time: 02:40)
So, as we will see, there are three carriers; we have data on three carriers, and these are the 2 dates. Then we have the source, and then the scheduled time of departure. Here you would see that the date 1899-12-30 appears before the departure time; why this particular date is coming in there is an important detail that we will discuss. All the variables which store time, the actual time of departure, the scheduled time of arrival and the actual time of arrival, have this particular date appended before the time.
So, as we discussed in the previous class, if the actual time of a flight is less than or equal to the scheduled time, then it is on time; if it is more than the scheduled time, then it has been classified as delayed. So, let us run the structure function.
(Refer Slide Time: 04:30)
You can see that flight carrier has 3 levels, Air India, Indigo and Jet Airways, as you saw in the first 6 observations. So, these are the three carriers we have information on. Then the date: there are 2 dates, 30th and 31st of July 2017.
The next variable is source, a factor variable with 3 levels: “BOM” for Mumbai, “DL” for Delhi and “MAA” for Chennai airports. Then we have the scheduled time of departure, the actual time of departure, the scheduled time of arrival and the actual time of arrival; these are in the POSIXct format, and the times are given there. We will discuss why this particular date is appearing with them. Then destination, with three levels: “BLR” for Bangalore, “HYD” for Hyderabad and “IXC” for Chandigarh. So, these are the 3 airports.
Then there is the information on days: we have flights on 2 days in our data set, with 1 representing Sunday and 2 representing Monday. This is currently shown as a numerical variable, so we will be required to convert it into a factor or categorical variable. Then we have the flight status, with 2 levels, delayed and on time, as you can see here.
So, let us talk about this peculiar date that is being appended to the arrival and departure times. If we look at the Excel file that we have, this information is not there: the scheduled time of departure and the actual time of departure in the Excel file do not carry any calendar date.
But when this flight data set is loaded into the R environment, the format in which R imports this particular data requires the time to also have a calendar date. Since the data is being imported from Excel into R, the default calendar date available in Excel is taken, and that default is the calendar date of day one in Windows Excel.
In Windows Excel, day one is 1900-01-01, that is, the first of January 1900. But that value is not what is shown here; instead 1899-12-30, that is, 30th of December 1899, is shown. That is because Excel incorrectly treats 1900 as a leap year, and this being the case, for the dates which come after early 1900 an adjustment has to be made. After this adjustment, the date which was meant to be day one in Windows Excel effectively becomes 1899-12-30. So, this is the date that is taken by default when we import data from Windows Excel (we are running a Windows operating system and importing the Excel data into the R environment), and this is why that date is being appended here. Now, how do we correct this?
So, let us look at a few lines of code. First, let us take a backup; this is the data exactly as it was imported. Now what we will do is format these particular times, the arrival and departure times, into a time-only format, so that we are able to get rid of the date information.
(Refer Slide Time: 09:39)
So, let us execute this. Once the values have been formatted, if you are interested, you can type the name here and you will see that all the values are now in this time-only format. If you check the class, you will see that this is now a character vector of length 107. So, the time part has been extracted; now we will convert it into the POSIXlt format that is available in R.
So, we use the paste function, where we first paste the date that is stored in another variable of the data frame, the date column which has the correct information, as we saw in the previous output. Let us look at it again: this is the date column where we have the correct dates. That particular date is going to be used, and then the combined value is stored in the POSIXlt format. So, let us execute this.
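A sketch of this correction (column names such as STD and DATE are assumptions for illustration, not necessarily the names in the lecture's file):

  # keep only the clock time, then re-attach the correct calendar date
  std.time <- format(df$STD, format = "%H:%M:%S")          # character vector of times
  df$STD   <- as.POSIXlt(paste(df$DATE, std.time),
                         format = "%Y-%m-%d %H:%M:%S", tz = "Asia/Kolkata")
  # the same lines would be repeated for the other time columns (ATD, STA, ATA)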
(Refer Slide Time: 10:51)
Now let us look at the values. You can see the dates are correctly specified: where it is supposed to be 30th of July 2017 it is that, where it is supposed to be 31st of July 2017 it is that, and the time format is also now IST.
If you are interested in the earlier time zone, we can still look at it using the backup that we had taken. There you can see that, by default, not only was the appended calendar date the day-one date of Windows Excel, and incorrect at that, but the time zone was also taken by default as GMT.
So, we have been able to change this, and now the correct dates are appended along with IST. You might be wondering why we have calendar date information in a time at all. This is the R format: if you try to create a POSIXlt or POSIXct object without the calendar date, it will automatically take the system's date and append it to the time information. So, let us do the same exercise for the other time columns, the arrival and departure columns, and execute those lines of code.
(Refer Slide Time: 12:30)
And now let us look at the first 6 observations. You can see the dates are being correctly specified, and we have already seen that the time zones have also been appropriately corrected. We can also have a look at the structure of this particular data frame, and in the structure as well the correct dates are now there.
Now, let us also take a backup of this data frame. But, as we have discussed, the particular task is to predict flight delays, and the predictor information that we would like to model does not actually incorporate the actual dates; rather, we are going to use time intervals of departure.
So, we are not interested in the actual date: it does not matter to us what date is appended to the departure and arrival times, but we are interested in the time intervals during a particular day. Therefore, having different dates appended to the arrival and departure times would complicate the process, as you will see in the later part of the code.
To simplify the coding that we are going to do later on, we would like to have the same date everywhere: even though the data is on 2 days, 30th of July 2017 and 31st of July 2017, we would like to have just one date here, so that the computations we are going to do later on are easy, because we are interested in time intervals on any calendar date.
We are not bothered about the specific date. So, let us restore this data frame from the first backup that we had taken, and we are again going to use the strptime function; this is going to format the information that we have into the format that we want, as we will see just now.
If you are interested in finding more detail on this particular function, you can type its name in the help section; it is one of the date-time conversion functions to and from character. It takes a character vector in x and formats it as per the specified format, and if the time zone is NULL, not specified, then the system's time zone is taken. This is what has happened here: if we look at the first 6 observations, you will see that all the times have now been appended with the system's date, here 8th of August 2017.
So, all the time-related variables, the departure and arrival times, have been appended with this particular date, even though the actual dates as per the date column were different. But since we are not interested in the dates, we do not mind, because we will be calculating time intervals.
So, there we would like to have the same date instead of different dates. We can have a look at the structure as well, and you can see the new date mentioned here. We have been converting the data from one format to another because of this issue, wherein the departure and arrival times are being appended with a calendar date as well.
So, we saw that by default R takes the 1899-12-30 value; then we changed it to the correct values as per the information in the data set, 30th July and 31st July; then, because the later computations and modelling do not actually focus on the specific dates, and the computation is simpler if we have the same date for all the times, we have appended the current date of the system, as you can see here. It will be easier for us to compute the time intervals later on.
Now, as we discussed, what we are going to do is take the actual departure time, the variable ATD, and break it into appropriate time intervals.
So, we are converting a time variable into a categorical variable; we are going to do binning for this. So, let us look at the range.
So, it was for this range computation that we changed the calendar date for the arrival and departure times. Had we kept the actual dates, as in the earlier data frame, and run range on ATD there, you would see that the range also reflects the calendar date: the earliest value is an observation on 30th July and the latest is one on 31st July. This we do not want, because we want to create time intervals irrespective of the day.
So, if instead we have the same date for all observations, the range depends only on the clock time: the first time is 1:10 and the last one is at 20 hours. This is the difference; to create the time intervals we have to focus on the clock time only.
So, this is what we are going to do here: we will create different intervals. We create a variable called breaks as a sequence of time points, and, as usual, the strptime function can be used to convert a character vector into the appropriate date-time format in R.
So, as per the specified format, the values from 0 up to 24 hours will be converted, in steps of 6 hours, so we are going to create 4 time intervals within the 24 hours of a day.
The first interval, as you can see in the labels in the next line of code, is 0 to 6, that is, between 0 and 6 am; the second is from 6 to 12; the third is 12 to 18; and the fourth is 18 to 24. So, we are breaking the 24 hours of a day into 4 time intervals, and later on, as you will see, we will bin the actual departure time into one of these intervals.
So, if a flight has an actual departure time which lies between 0 and 6 am, it will be given the value 0 to 6; if a flight has an actual departure time between 6 am and 12 pm, it will be given that particular category; and similarly for the other flights. This is what we are trying to do here.
So, let us execute this code; you can see this breaks variable has been created, along with the labels. Now we can use the cut function with the breaks information that we have just created. The first argument of the cut function is the variable which we want to cut, which we want to bin.
So, we want to bin the actual departure time into 4 bins; the labels have already been appropriately defined, and as per the breaks each observation is going to be binned. There is one more argument that you will see, right = FALSE, which indicates whether the right-hand value of a range is counted in that range or not; because this is FALSE, the right value is not going to be included.
So, each interval is closed on its first value and open on its last value: the first bin can take the value 0 but not 6; 6 will fall in the 6-to-12 category; similarly, 12 will fall in the 12-to-18 category, and a flight departing at 18 will fall in the 18-to-24 category. In this fashion we are able to create this binned variable. You can see in the environment section that this departure variable has been created, and if you are interested in the actual values you can inspect them.
So, all 107 values have been assigned a class, a bin to which they belong. From the output of this categorical variable that we have just created, you can see the labels as well.
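A minimal sketch of this binning step (object names df and ATD as assumed earlier; the breaks are built with seq() here for brevity, whereas the lecture builds them via strptime()):

  # 6-hourly cut points over the single calendar date used above
  atd    <- as.POSIXct(df$ATD)
  breaks <- seq(from = trunc(min(atd), units = "days"), by = "6 hours", length.out = 5)
  labels <- c("0-6", "6-12", "12-18", "18-24")

  # bin actual departure time; right = FALSE keeps each interval closed on the left
  df$DEPT <- cut(atd, breaks = breaks, labels = labels, right = FALSE)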
So, using the cut function we are by default creating a factor variable, and you can see the labels have also been appropriately specified. Once we have binned the actual departure time into this particular variable, we can add it to our existing data frame. As we talked about, let us also look at the day variable in more detail: you can see it is stored as a numerical variable even though it is categorical.
So, we would like to convert it into a factor or categorical variable using the as.factor function. Let us execute this code and look at its levels; we can see the levels are 1 and 2. To bring more clarity to these labels, we will change the labels to Sunday and Monday, because 1 represents Sunday and 2 represents Monday.
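A sketch of this conversion (the column name DAY_WEEK is an assumption for illustration):

  df$DAY_WEEK <- as.factor(df$DAY_WEEK)          # levels "1" and "2"
  levels(df$DAY_WEEK) <- c("Sunday", "Monday")   # 1 = Sunday, 2 = Monday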
(Refer Slide Time: 24:41)
So, let us change the labels as well. Once this is done, we can have a look at the structure again. You can see that the day is now appropriately shown in the output as Sunday and Monday, and you can also see the destination and another new variable, departure, DEPT, with 4 levels.
So, now we have been able to create the variables in the format that we want. Let us look at the first few observations as well: there is one more variable, DEPT, and the day has also changed and now shows Sunday and Monday, because we changed the labels.
Now, there are certain variables we are not interested in, so we will get rid of them: for example, the flight number, which we are not going to use; the date column, which we also do not need; and the raw departure and arrival time columns, columns 5 to 8, which we do not require any more because we have already created the transformed variables from them.
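A sketch of this step (the column positions are my assumption from the description above; adjust if the actual layout differs):

  # drop flight number, the date column and the four raw time columns,
  # assuming they sit in positions 1, 3 and 5 to 8 of this data frame
  df <- df[, -c(1, 3, 5:8)]
  str(df)   # should leave carrier, source, destination, day, flight status and DEPT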
(Refer Slide Time: 25:57)
So, let us get rid of those columns and look at the variables that we are interested in. We are now reduced to a data frame of 6 variables: 5 predictors and the outcome variable of interest, flight status. The other variables, as you can see, are the flight carrier with 3 levels, then the source, then the destination, 3 airports each, then the day, that is Sunday or Monday, then the flight status as the outcome variable, and then the departure variable, that is, the time interval of departure for a particular flight.
So, with this we are ready for our modelling; we have appropriately performed our data processing. You can see the first 6 observations as well; these are the observations, and now we are ready for the modelling exercise. What we are going to do first is partition this particular data set: we will keep 60 percent of the observations in the training partition and 40 percent of the observations in the test partition.
So, let us do the partitioning using the sample function: first create the training partition and then the test partition. Now, the package that we require to perform Naive Bayes modelling in R is e1071, so let us load this particular library.
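A sketch of the partitioning and package loading (the 60/40 split follows the description above; dftrain and dftest are assumed names):

  set.seed(1)                                          # for a reproducible split (my assumption)
  idx     <- sample(nrow(df), round(0.6 * nrow(df)))   # 60 percent for training
  dftrain <- df[idx, ]
  dftest  <- df[-idx, ]

  library(e1071)                                       # provides the naiveBayes() function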
(Refer Slide Time: 27:45)
Now, let us look at the naiveBayes function. The first argument is the formula, which we express in the same form as we have been doing for other functions in previous lectures. Flight status is the outcome variable of interest here, so it is specified appropriately, followed by the tilde and a dot, the dot indicating that all the other variables are to be counted as predictors. As you have seen, the data frame now contains just the outcome variable and the predictors, nothing else, so we can use the dot here, and the data argument is the training partition, dftrain. Let us execute this code.
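A sketch of this call (using the assumed names from above; the outcome column is written here as flight.status, which may differ slightly from the exact name in the file):

  mod <- naiveBayes(flight.status ~ ., data = dftrain)
  mod$apriori   # class counts of the outcome in the training partition (printed as prior proportions)
  mod$tables    # per-predictor conditional probability tables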
Let us look at the attributes of this particular variable mod. The class is naiveBayes, so this is a naiveBayes object, and these are its components: apriori, tables, levels and call.
So, let us look at the output in more detail. First we have the call, that is, the naiveBayes call that we made; then we have the a priori probabilities. These are the actual proportions of records belonging to the delayed class and to the on-time class of the outcome variable.
So, as you can see, 37.5 percent of the records belong to the delayed class and 62.5 percent belong to the on-time class of the outcome variable; this is with respect to the training partition and not the full data set. After this we see the tables of conditional probabilities: here, for flight status and flight carrier, we have the probability values for the different levels.
So, this is just like the exercise that we did using Excel, except that here we have more than 1 or 2 predictors: for the first predictor, flight carrier, and for the different classes of the outcome variable, we have the probability values.
These are exactly the quantities in the Naïve Bayes formula that we have talked about: the probability of each value of a predictor given that the record belongs to the delayed class or the on-time class of the outcome variable. You can see these probability values for the other predictors as well.
If you want to have a relook at the Naïve Bayes formula, we can again go back to it, as you can see.
(Refer Slide Time: 30:59)
So, these are the values we are interested in: P(x 1 | C i), the probability of a particular predictor value given the class C i, whether delayed or on time. Since these predictors are categorical, they can take only a few values; for example, SRC can take 3 values, and for each of these values, with respect to each class of the outcome variable, we have a probability. So, this table actually has all the information that we require to compute the Naive Bayes numerator, the denominator and the actual Naive Bayes probability.
If you want to access the attributes that we saw one by one, this is how we can do it: first the apriori component, and then the tables, which we have already gone through.
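In practice we do not have to multiply these table entries by hand; the e1071 package can combine them for us through predict(). A short hedged sketch (dftest is the assumed name of the test partition from above):

  # posterior probabilities per class, computed from apriori and the tables above
  head(predict(mod, newdata = dftest, type = "raw"))

  # predicted class labels (most probable class)
  head(predict(mod, newdata = dftest, type = "class"))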
So, with this we will stop here, and we will continue our modelling with the same data set in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 35
Naive Bayes - Part V
Welcome to the course Business Analytics and Data Mining Modeling Using R. So, in the previous lecture we were discussing Naive Bayes, and specifically we were doing a modeling exercise in the R environment. We imported the data set and performed some processing and transformations. We also looked at some of the issues we encounter related to dates in the arrival and departure times, and how we were able to correct those dates, which arise because of importing the data from Excel on a Windows PC into the R environment.
Later on, we created the different time intervals based on the actual departure time. We also did the partitioning and the modeling part of the exercise, and we looked at the tables of conditional probability values for the different outcome-variable and predictor combinations, that is, for the different categories of the predictors and the outcome variable. So, we looked at all those things.
Now, some of these conditional probabilities can also be computed using a pivot table in Excel. Though we rely on the computation in R using this particular function, these are essentially mathematical computations; the technique is mathematical in nature, and the probability computations are the crux of Naive Bayes modeling. So, what we will do is export the training partition that we have created into Excel format; in the process we will also learn how to export data into Excel format.
(Refer Slide Time: 02:12)
So, then we will create a pivot table and see how the same conditional probabilities that we have computed using the naiveBayes function can also be obtained using a pivot table.
So, let us first find an appropriate folder for this. If this is the folder, then we can specify its name here; as we have discussed before, we need forward slashes and not backward slashes in the R environment to be able to use the absolute path of files. write.xlsx is the function we use if we want to export data into an Excel file. So, the training partition data is going to be exported here; we execute this line, it is processed, and a file, flight details 1, has been created.
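A sketch of this export (the folder path and file name below are placeholders, not the lecture's actual path):

  library(xlsx)
  # note the forward slashes in the path
  write.xlsx(dftrain, "C:/some/folder/flight_details_1.xlsx")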
(Refer Slide Time: 03:37)
So, let us open this file. You can see the first column is nothing but the serial numbers, that is, the indices of the rows that were selected when we created the partition. We do not need this column any further, so we will delete it. Then we have the other columns, the 6 variables that we require for our modeling exercise. So, these are the 6 variables as you can see here.
Now, to create the pivot table we will select the data set including the header, using Ctrl + Shift and the down arrow. Once this is done, we can go to the Insert tab within Excel, where we can see the pivot table option; we click on it, get the drop-down menu, and the first option is pivot table. So, let us create this; the range is already selected.
(Refer Slide Time: 04:53)
Now, we want this pivot table to be created on a new worksheet, so we will just do that, and this is what we get. From here we can create pivot tables for different pairs of variables. Flight status definitely has to go into the column area, and then, for example, if we want to calculate the values for destination, that is what we can do.
So, the column label is flight status and the row label is destination; as you can see, a pivot table has been created here, and the count is the default aggregation taken for flight status. A few things have to change: on left clicking we see the value field settings. There, summarize by is fine, but under show value as we would like to change it to percentage of column. As you can see, once we did this the numbers were converted into percentages.
Now, we can compare this with the results that we got in R to find out whether everything matches. Let us look at the results that we had for the outcome variable and the destination. You can see 3 categories, BLR, HYD and IXC, and the numbers 50, 29.17 and 20.83; the same numbers are here. Similarly, for on time we have 47.5, 30 and 22.5. So, you can see the probability computation can also be performed by simply creating a pivot table from the data; we were able to compute these conditional probabilities. The computations that we performed using the function in R can also be performed using a pivot table in Excel.
So, let us move forward. Now, if we are interested in accessing specific values of the conditional probability tables, for example just the conditional probabilities related to flight carrier, we can express it in this format, because the tables attribute that we have is actually a list. When we execute this, you can see the conditional probabilities for the outcome variable and the flight carrier.
(Refer Slide Time: 07:45)
If we want just the first row and the third value, that is the value for Jet Airways, this is how we can use the bracket notation to access it; you can see 0.625. If we do not want to use numbers for the row and column, we can also mention the row and column names, in this fashion: on time and Indigo for the second example, which gives the value 0.375. So, let us execute this; this is how we can access the values from the Naive Bayes tables in R.
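The access pattern described above can be sketched as follows, assuming the model object mod was built with e1071's naiveBayes and that the predictor and level names used here (Flight.Carrier, Indigo, and so on) match those in the session; otherwise treat them as placeholders.

    # A priori class counts and the list of conditional probability tables
    mod$apriori
    mod$tables$Flight.Carrier
    # Access a single value by position (row 1, column 3) or by row and column names
    mod$tables$Flight.Carrier[1, 3]
    mod$tables$Flight.Carrier["on time", "Indigo"]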
So, let us go through an example where we will try to compute the values by accessing
these numbers.
(Refer Slide Time: 08:46)
So, as we have shown, these probability numbers can also be computed using Excel, and in R we can access the individual probability values. Now, using these values, we will go through one example where we will try to compute the probability values. So, this is the example.
We want to classify an Indigo flight from MAA to IXC, departing between 0 and 6 am on a Monday. You can see that in this example all the information related to the different predictors is available. Now, using this predictor information, how do we go ahead? As you know, in the complete (exact) Bayes approach we would have to find exact matches; instead, we are going to use the Naive Bayes computations.
First, let us see whether there are any exact matches. This is how we can do it: in the training partition data frame we check whether the flight carrier has the value Indigo, and similarly for the source, destination, day and departure interval; we mention the appropriate values and find the matching rows. You can see this expression has been written as the row condition, and all columns are selected. Let us execute this: you can see 0 rows, so no such observation is there in the training partition. If we just change the departure interval from 0-6 to 6-12, let us see if there are any matches.
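A sketch of this exact-match check, with assumed column names and level labels mirroring the example flight (the actual names in the session may differ):

    # Rows of the training partition that match the example flight exactly;
    # an empty result (0 rows) means the exact Bayes approach cannot be used
    dftrain[dftrain$Flight.Carrier == "Indigo" &
            dftrain$SRC == "MAA" &
            dftrain$DEST == "IXC" &
            dftrain$Day == "Monday" &
            dftrain$Dep.Interval == "0-6", ]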
(Refer Slide Time: 10:41)
So, we just changed the interval, and you can see there is one observation which matches all these values. So, we have a match for all the other predictor values except the departure time interval; for the 0-6 interval we did not have any matches, and therefore the complete (exact) Bayes approach cannot be applied.
So, let us come to the Naive Bayes formula. We are going to compute the numerator values first. As we discussed in the Excel exercise as well, first we need to compute the value for delayed given the predictor information as per the example. You can see how we access this: first the proportion of observations belonging to the delayed class, then delayed and Indigo, delayed and MAA, delayed and IXC, delayed and the 0-6 departure time interval, and then delayed and Monday. In this fashion we can access the individual conditional probability values and multiply them, with the prior proportion also included. This is how we can compute the value; you can see p 1 has been computed.
Let us print the value to more digits: we get 0.0007064 as this probability value. Now, for the on time class we can perform a similar computation: from the model you can see the proportion of records belonging to the on time class, then for flight carrier on time and Indigo, for source on time and MAA, for destination on time and IXC, then the 0-6 departure time interval for on time, and the last value, that for Monday. So, let us compute this, and let us also print the value up to five significant digits. Now these are the values for the numerator part.
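A sketch of the two numerator computations, again assuming an e1071 naiveBayes object mod and placeholder predictor, class and level names:

    n <- sum(mod$apriori)   # total number of training records

    # Numerator for the delayed class: prior proportion times the conditional probabilities
    p1 <- (mod$apriori["delayed"] / n) *
          mod$tables$Flight.Carrier["delayed", "Indigo"] *
          mod$tables$SRC["delayed", "MAA"] *
          mod$tables$DEST["delayed", "IXC"] *
          mod$tables$Dep.Interval["delayed", "0-6"] *
          mod$tables$Day["delayed", "Monday"]

    # Numerator for the on time class, built the same way
    p2 <- (mod$apriori["on time"] / n) *
          mod$tables$Flight.Carrier["on time", "Indigo"] *
          mod$tables$SRC["on time", "MAA"] *
          mod$tables$DEST["on time", "IXC"] *
          mod$tables$Dep.Interval["on time", "0-6"] *
          mod$tables$Day["on time", "Monday"]

    signif(p1, 5); signif(p2, 5)   # print up to five significant digits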
(Refer Slide Time: 13:09)
As we have discussed before, if we want to compute the actual probability value we can do so by dividing p 1 by the sum of the two values we have just computed; that gives us the actual probability value.
So, these are the actual probability values. Now, once we have them, depending on whether we want to use the most probable class method or whether we have a class of interest, in the latter case we would like to specify the cut-off value and compare against it; based on that we can classify the observation.
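For instance, the conversion from the two numerators to posterior probabilities is just a normalization:

    # Posterior probability of each class given the predictor values
    p.delayed <- p1 / (p1 + p2)
    p.ontime  <- p2 / (p1 + p2)
    p.delayed; p.ontime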
Now, once we are done with building our model using the training partition, let us score our test partition and evaluate the performance of this Naive Bayes model on it. What we are going to do here is use the predict function, as we have done for previous techniques as well, and score the test partition.
We have two options here: type is the argument that can be used within the predict function. If we use type as class, we get the predicted class membership, and if we use type as raw, we get the estimated probability values. So, let us compute both of these: the actual class membership and the estimated probability values.
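A sketch of both calls, assuming the test partition is in a data frame called dftest:

    # Predicted class labels and estimated class probabilities for the test partition
    pred.class <- predict(mod, newdata = dftest, type = "class")
    pred.prob  <- predict(mod, newdata = dftest, type = "raw")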
(Refer Slide Time: 14:41)
So, let us look at the classification table that we can generate using the table function. You can see the actual class, stored in flight status for the test partition, and the predicted class that we just computed.
These are the results. As you can see from this classification matrix, 6 delayed flights have been correctly classified as delayed, but 13 of them have been incorrectly classified; for on time, 4 have been incorrectly classified and 20 correctly classified. So, this gives us the idea: 26 observations have been correctly classified and 17 incorrectly classified. This is mainly because of the smaller sample size; that is why we are not getting that good a result.
Now, we can also look at the full information, the predicted class, the actual class, the probability values and the other variables, to understand what has happened in the modeling process. As per the results, you can see the predicted class, the actual class and these probability values here again.
Again, you can see the ordering of classes for our outcome variable: delayed and on time, with delayed as the first (reference) category and on time as the other category included in the model. The probabilities shown here are the estimated class membership probabilities, and the observation is assigned to whichever class has an estimated probability above 0.5. Looking at the predicted values alongside these probabilities, you can see how each observation has ended up being classified as delayed or on time. If we want to look at the actual classification accuracy and the misclassification numbers, this is how we can do it.
So, a comparison between these scored values and the actual values gives us the accuracy: that was 0.6, and the error was the remaining 0.39, or about 0.4.
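The classification matrix, accuracy and error described above can be sketched as below, with Flight.Status as the assumed name of the outcome column:

    # Classification (confusion) matrix: actual versus predicted class
    cm <- table(Actual = dftest$Flight.Status, Predicted = pred.class)
    cm

    accuracy <- sum(diag(cm)) / sum(cm)   # correctly classified / total observations
    error    <- 1 - accuracy
    accuracy; error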
Now, we can also compare the performance of the model on the test partition with its performance on the training partition. So, let us score the training partition using the same model and the same predict function.
(Refer Slide Time: 18:43)
We can score this and look at the classification matrix here; you can see many more observations are classified correctly in this case, as expressed by the accuracy and error numbers.
So, the model is performing much better on the training partition, the main reason being that it was built on the training partition itself, and it is performing slightly worse on the test partition. This is also because of the smaller sample size that we have. Now, for this model, let us generate the lift curve.
So, this is how we can do it. How many cases do we have? For the test partition, for which we want to generate the cumulative lift curve, we have 43 observations. So, let us create this.
Now, let us access the actual values for the test partition using its flight status variable. We need to change the labels, because we need cumulative values based on the actual classification of the observations; so we would like to change the labels to 1 for delayed and 0 for on time.
So, let us change these labels. Once this is done, because this is a factor variable, let us convert it into a numeric variable so that we can apply mathematical operations to it. You can see it is now a numeric variable of 0s and 1s, and we will be able to compute the cumulative values.
So, using the probability values that we computed earlier for the delayed class, and the actual class variable that we have just created, we can create a data frame of these two variables.
(Refer Slide Time: 21:05)
Then we can sort these variables in decreasing order of the probability values. If we are interested in looking at a few values, let us look at the first 6 observations in this data frame.
You can see the probability values and the actual class; let us order them, as we have done in previous lectures. Looking at a few observations again, you can see these values have now been sorted in decreasing order of the probability values. Now we can compute the cumulative values; let us go through the code for the same. The first value stays the same, and for the remaining values we run a loop in which the previous value is added to obtain the cumulative value. Once this is done, let us look at the range: we have 43 cases, and the cumulative values range from 1 to 19, so the axis limits have been specified accordingly, along with the other plot settings.
So, we can generate the plot. This is the cumulative lift curve that we have; now, if we want to compare its performance with the reference line, this is what it is.
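A compact sketch of this cumulative lift curve; cumsum replaces the explicit loop used in the lecture, and the column and level names are assumptions:

    # 1 for delayed, 0 for on time, paired with the estimated probability of delay
    actual  <- ifelse(dftest$Flight.Status == "delayed", 1, 0)
    df.lift <- data.frame(prob = pred.prob[, "delayed"], actual = actual)
    df.lift <- df.lift[order(df.lift$prob, decreasing = TRUE), ]   # sort by probability

    cum.delayed <- cumsum(df.lift$actual)   # cumulative delayed flights captured so far
    cases <- nrow(df.lift)

    plot(1:cases, cum.delayed, type = "l", xlab = "# cases ranked by P(delayed)",
         ylab = "Cumulative delayed flights", main = "Cumulative lift curve")
    # Reference line: what the average (naive) case would capture
    lines(1:cases, (1:cases) * sum(actual) / cases, lty = 2)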
(Refer Slide Time: 22:35)
Let us look at the plot. You can see that this model is performing better than the reference line in terms of identifying the most probable delayed flights; so the model does quite a good job with respect to the average case, which is what the reference line represents. However, let us also look at the prior probabilities and compare the numbers we computed for the Naive Bayes model with what would have happened had we followed the naive rule.
Had we classified all the observations as belonging to the majority class, following the naive rule, then every observation would have been classified as on time and the error would have been just 24 divided by 64, that is, 0.375.
The model is doing well on the training partition, but if we look at its performance on the test partition the error is 39.5 percent, whereas under the naive rule the error is 37.5 percent. So, the naive rule is actually doing slightly better than our model; but it is the cumulative lift curve that we saw which indicates the usefulness of the model in terms of identifying the most probable delayed flights.
So, this is the modeling exercise that we wanted to perform. Let us go back to our discussion on Naive Bayes and cover a few more points.
(Refer Slide Time: 24:57)
So, if we look at Naive Bayes as a technique, it is quite simple and easy to understand. It is based on probability values, so the interpretation is quite easy, and it can give us surprisingly good results in some situations, especially when the assumption of independence of the predictors is actually met; it can then even outperform other techniques.
This good performance can come, as mentioned in the first point, despite the assumption of independent predictors being far from true; even when this assumption is not met, we often get good performance from Naive Bayes. However, it requires a large number of records, because we would like to cover as many scenarios as possible: the predictors are mainly categorical in nature, so we would like to observe most of the possible scenarios in the training data, so that for any new observation there are always probability values to compute and a class to assign. Hence it requires a large number of records.
Now, if we do not have a large sample size, then a few classes or predictor values might not be represented in the training partition records. In such a situation a zero probability is assumed, and it becomes difficult to classify that observation. This case cannot be handled easily by other techniques either, but the situation is more complicated in Naive Bayes. The main reason is the information in the other predictors: for example, if out of 4 or 5 predictors it is just one predictor value that is not present in the training partition for a particular class, a zero probability is assumed for it, and so the information in all the remaining predictors is not counted, which is not the case with other techniques.
Apart from this, Naive Bayes is mainly suitable for classification tasks, as we have discussed in previous lectures. Another important point: it is good for classification, but not for estimating probabilities of class membership. As we have talked about, in the Naive Bayes formulation the numerator is sufficient for us to achieve the rank ordering of the different classes of the outcome variable, and that is what we are interested in, so the model as such will do a good job; however, if we are interested in the actual probability values, Naive Bayes is not a suitable technique for such scenarios.
So, we will stop our discussion on Naive Bayes here. In the next lecture we will start our discussion on classification and regression trees.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 36
Classification and Regression Trees- Part I
Welcome to the course Business Analytics and Data Mining Modeling Using R. In this lecture we are going to start our discussion on classification and regression trees. So, let us start. In short, this is also known as CART, and many statistical software and data mining packages implement this particular procedure.
So, this is the most common procedure for classification and regression trees; the more general name for this kind of technique is decision trees, and under that we have this particular algorithm, CART. It is a flexible, data-driven method, where we do not assume any functional form between the outcome variable and the set of predictors; the kinds of conditions that we generally have in statistical techniques are not applicable here.
So, it is a data-driven method. How does it work? It is based on separating observations into homogeneous subgroups by creating splits on predictors. As we did in an exercise in the starting lectures, where we had observations on sedan car owners, their income and their household area, and the outcome variable was ownership, owner or non-owner; there, one of the methods we tried out was creating rectangles in that particular graph, and that is quite similar to what is actually done in CART.
So, partitions are created over the observations that we have.
(Refer Slide Time: 02:25)
So, we separate those observations into homogeneous subgroups, the same thing that we did in the very first lectures, the introductory lecture series. In classification and regression trees we separate observations into homogeneous subgroups by creating splits on predictors: depending on the values of the different predictor variables, we find the best one to create the partitions, to create homogeneous subgroups, and that is how we keep on separating observations and creating homogeneous groups. This technique can be used for both prediction and classification tasks, as we will see in this lecture series.
Now, how is the model represented? Typically the model is represented by a tree diagram, which you might be familiar with. Generally we start with the root node; at the root node we have to find the best predictor, and a particular value of that predictor, which together create the optimal split. Based on that, the root node is split, and this process continues for both the left subtree and the right subtree. So, the model is represented by a tree diagram and easy-to-interpret logical rules.
Once our tree diagram has been constructed, the rules that we get out of this technique are very simple. For example, if there are 2 variables, age and income, and we have to classify responders and non-responders, then out of this tree we might get rules like: if age is greater than 25 and income is less than fifty thousand, then class 1; if age is greater than fifty and income is greater than sixty thousand, then class 0 (or class 2).
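Just to illustrate, such a rule translates directly into a one-line condition; the variable names and thresholds below are only the illustrative ones from the example above:

    # IF age > 25 AND income < 50000 THEN class 1, ELSE class 0
    predicted.class <- ifelse(df$age > 25 & df$income < 50000, 1, 0)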
So, these are the kind of logical rules that we get, and they are very easy to interpret: it is easy to identify the observations, the records, which meet these criteria, and easy to implement them. Because of these simple logical rules and the ease of implementing them, classification and regression trees have been applied in a wide range of domains and are very popular across domains, not just analytics; they have been used in engineering, medicine and other areas as well. Another important thing about decision trees in general, and classification and regression trees (CART) in particular, is that they are a very simple model.
Therefore, when you first encounter a problem, as we talked about in the analytics challenge, if you are able to identify certain prediction or classification tasks, then classification and regression trees is one technique where you do not have to think too much about applicability.
You do not have to worry much about assumptions, the data, and the various other checks that we have been doing in previous lectures; you do not have to follow as many steps as with other techniques and you can simply apply it. So, it is almost an automatic choice irrespective of the type of prediction or classification task; this technique can be applied in different problem situations and also across different domains.
So, let us discuss classification trees further. We will start with the classification task first.
(Refer Slide Time: 06:50)
So, for classification trees there are 2 typical steps that we follow to build a classifier model based on the CART procedure. The first step is recursive partitioning. What happens in recursive partitioning, as we have talked about, is the partitioning of the p-dimensional space: if our set of predictors has p variables, we partition this p-dimensional predictor space using the training data set. This is how it starts: every time we keep on partitioning, and the process goes on in a recursive fashion.
After we do the first partition we get 2 partitions; on each of those 2 partitions we further apply this recursive partitioning approach, and the process continues till we achieve pure, homogeneous subgroups, pure homogeneous partitions, out of this process.
So, the first step, recursive partitioning, is mainly about partitioning the observations and creating homogeneous subgroups in a recursive fashion. The second important step in building classification trees is pruning. In pruning, the tree that we have built using the training partition is cut back: because we keep building that tree till we reach pure homogeneous subgroups, the fully grown tree classifies all the training observations correctly, and that would actually overfit the data.
So, to have a good model which is not fitting to the noise but is extracting and using the predictor information, we have to prune the tree back to a level where it is not fitting to the noise. Pruning is that process: pruning the built tree using the validation data. The second partition, the validation partition, which we also discussed in the initial lectures, can be used to fine-tune or refine the model; in this technique you would see that in the pruning step it is the validation data that is used to prune the fully grown tree built using the training partition. So, there are 2 main steps: recursive partitioning and pruning.
In recursive partitioning we first partition the p-dimensional space of predictors and try to create homogeneous subgroups; once this is done, we prune the tree so that the model, the tree diagram that we have, fits the predictor information and not the noise.
So, let us discuss each of these steps in more detail, starting with recursive partitioning. The first thing we do is partition the p-dimensional space of predictors, as we talked about, into non-overlapping multi-dimensional rectangles. Of course we are in the p-dimensional space; in the first step we partition this space using one particular predictor, and the partitions that we create are non-overlapping and, of course, multi-dimensional.
So, we are going to create non-overlapping multi-dimensional rectangles; that is the first step. As we discussed, the partitioning process is recursive in nature: we keep applying this recursive partitioning process to the results of the previous partitions. Now let us also understand the detailed steps of recursive partitioning. The first step is about finding an optimal combination of one of the predictors, let us say x i, and the split value v i that is going to be used to create the split, the partition. So, we need to find this optimal combination, the particular predictor x i and a particular value v i of that predictor, to create the first split of the p-dimensional space into 2 parts: part 1 contains all the records with x i values less than or equal to v i, and part 2 contains the records with x i values greater than v i. So, using that value v i, all the observations or records with x i less than or equal to v i go into part 1, and all the observations or records with x i greater than v i go into part 2.
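In code, one such split is simply a pair of subsets; xi and vi below are placeholders for the chosen predictor column and split value:

    # Part 1: records with x_i <= v_i; Part 2: records with x_i > v_i
    part1 <- df[df$xi <= vi, ]
    part2 <- df[df$xi >  vi, ]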
So, this is how we create these 2 parts; and, as we talked about, the selection of the particular predictor x i and the splitting value v i has to be an optimal combination. In the next step, step one is applied again on the 2 parts. This is the recursive part of recursive partitioning: the same process is applied on the results of the previous step. So, step one is applied again on the 2 parts, and the process continues, creating more rectangular parts. As you can see in step three, the partitioning process continues till we reach pure or homogeneous parts, that is, till all the parts, all the subgroups, all the partitions that we create have observations belonging to the same class. Till we achieve that scenario this process continues, as you can see in the slide as well.
That is, all the observations in a part belong to just one of the classes.
So, for every subgroup, partition, rectangle that we create, the observations belonging to it should belong to just one of the classes; the process continues till we reach that point. After we follow these three steps of recursive partitioning, we end up with a fully grown tree which is able to classify all the observations, all the records, in the training partition, with hundred percent accuracy and zero percent error, because we have been able to create homogeneous groups having members of just one of the classes.
So, let us understand this recursive partitioning process through an exercise in R.
(Refer Slide Time: 15:17)
So, as we have the data in Excel format, let us load the library that we typically use to import data sets from Excel files, and run this line.
So, this is loaded. The particular Excel file, the data set that we are going to start with, is the sedan car data set that we have used in previous lectures as well. Let us import this data set; as you can see in the environment section, it has been imported into the data frame df, and we have twenty observations of three variables. Let us remove the NA columns, if there are any. Once that is done, let us look at the structure of this data frame. As you can see, this is the data set we have used in previous lectures as well; there are three variables, annual income, household area and ownership.
So, we have these 2 predictors, annual income and household area, giving us a 2-dimensional predictor space, and our outcome variable of interest is ownership. We would like to apply the recursive partitioning step of the CART procedure to this data set. Let us also look at the first few observations.
These are the first few of the 20 observations of this data set.
(Refer Slide Time: 17:07)
So, as you can see, the income is the annual income in rupees lakhs per annum, and the household area is in hundreds of square feet; these are the units of measurement for these 2 variables. The unit of analysis is the household, and we would like to classify households into the owner or non-owner category, that is, whether they own a sedan car or not.
So, let us create a scatter plot of this. Let us first set some of the graphics parameters: as we have been doing, using the par function we set the margin parameter, mar, to these numbers. Now let us look at the ranges of these 2 variables, annual income (this is its range) and household area.
(Refer Slide Time: 18:04)
This is the range. As we have been doing in previous lectures, while creating the plot we also specify limits on the x axis and the y axis which are close to, or slightly wider than, the ranges of the variables to be plotted. You can see the x limit here, and the range of annual income lies within it; similarly for the y axis the limit is 13 to 25, and the range of household area, fourteen to twenty four, is within this limit. So, now let us create the scatter plot. This is the plot that we have; a legend has also been generated for it.
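A sketch of this scatter plot, assuming the data frame df has columns Annual.Income, Household.Area and Ownership; the axis limits are indicative, and the dashed line previews the hypothetical first split discussed next:

    par(mar = c(4, 4, 2, 1))
    plot(df$Annual.Income, df$Household.Area,
         pch = ifelse(df$Ownership == "owner", 19, 1),
         xlim = c(3, 12), ylim = c(13, 25),
         xlab = "Annual income (Rs. lakhs)",
         ylab = "Household area (100s of sq. ft.)")
    legend("topleft", legend = c("owner", "non-owner"), pch = c(19, 1))
    abline(h = 18.8, lty = 2)   # hypothetical first split on household area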
So, these are the twenty observations that we have, and we will be applying the CART procedure, the CART algorithm, to this data set to create homogeneous subgroups or rectangles.
So, what happens in the first step? Let us assume that the first split is going to be on this particular value, 18.8 of household area. This value lies somewhere around here, close to the 19 mark, somewhere between 18 and 20; the split line will probably go like this, in between these 2 close points, and we will get our first partition. So, let us create the first split. Again, these splits are hypothetical, based on our visual inspection, our visual analysis, which suggests that this is probably the split created using this value.
So, the variable that we have selected here is household area and the value is 18.8. Out of the 2 predictors that we had, household area and annual income, we have selected household area and the particular value 18.8 as the optimal combination. Of course, this is not the result of actually running the CART procedure; it is one example we are going through for illustration purposes.
So, household area with the value 18.8 is taken as the optimal predictor-value combination, and this is the kind of split we will have. If we look at the upper rectangle here, 7 observations belong to the owner class and the remaining three belong to the non-owner class; so it is not purely homogeneous, but most of the observations belong to the owner class. If we look at the lower rectangle, the lower subgroup, we have 7 observations belonging to the non-owner class and three belonging to the owner class; here the majority is with the non-owner class, but again this is not purely homogeneous either.
Now, if the variable that is going to be selected as the split variable is numerical, then the midpoints between pairs of consecutive values of that variable are the candidates for possible split values. These midpoints can then be ranked as per the impurity reduction, the reduction in heterogeneity, in the resulting rectangular partitions.
That is how we can pick the optimal predictor-value combination: for a numerical predictor, the midpoints between pairs of consecutive values are the possible candidates, and we can rank them with respect to the impurity in the resulting rectangular partitions.
So, let us go through this process for the annual income variable. Let us first sort this variable.
These are the values; we have 20 observations, shown here in sorted, increasing order, starting from 4.3 and 4.7 and going up to 9.4 and 10.8. Once these values have been sorted, we can start computing the midpoints. The number of values here is 20, and we can create a diff: applying diff to the sorted annual income values gives us these values.
So, we sort the annual income values and take the differences. If you want to understand more about this particular function, diff, what we are doing here is computing lagged differences; it returns suitably lagged and iterated differences, and the default lag, as you can see, is one. So, for the sorted values of annual income the lag-one differences are returned, and we divide these values by 2, because to each value in the sorted sequence we will add half of the difference to the next value. In the first part of this expression you would see we have used sort; let us also look at the sort function in the help section. Sort orders the variable, and by default, as you can see, decreasing is false, so the sort is in increasing order. After that we have taken head; let us look at that as well.
(Refer Slide Time: 25:56)
So, let us look at the head function, specifically its second argument, n. This is a single integer: if positive, it is the size of the resulting object, the number of elements, and if negative, all but the last n elements of x are returned.
(Refer Slide Time: 26:13)
So, once we take head of this with n = -1, the last value is not included. You can see we have 19 values: the full sorted output runs from 4.3 to 10.8, while this one runs from 4.3 to 9.4, and so on.
So, to compute the midpoints, we first compute this series and then add half of the lagged difference to it. Let us compute this; this is what we get. If you look at these values, they are the midpoints: for 4.3 and 4.7 the midpoint value is 4.5, and for the next pair, 4.7 and 4.9, the midpoint value is 4.8. So, from here you can see how we have been able to compute the midpoints: because the midpoints are computed from pairs of consecutive values, we first sorted the values, then removed the last value using head, and then added half of the differences obtained by taking diff.
So, diff computes the gap between consecutive values: if we look at the output of diff, the first value is 0.4, the difference between 4.3 and 4.7, the first pair; similarly the next ones are 0.2 and 0.3. Half of these gaps gives us the remaining part needed to compute the midpoint values. The same process can be applied to household area as well; these are its values sorted in increasing order.
Once this is done, as you can see, the same code is used: again we leave out the last value and then add half of the difference between each pair of consecutive values. So, we get the midpoints for these 2 variables, one set for annual income and one for household area.
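The midpoint computation for both variables can be sketched as below (column names assumed):

    # Candidate split values for a numerical predictor: midpoints of consecutive sorted values
    inc.sorted <- sort(df$Annual.Income)
    inc.mid    <- head(inc.sorted, -1) + diff(inc.sorted) / 2   # drop last value, add half the gap
    inc.mid

    area.sorted <- sort(df$Household.Area)
    area.mid    <- head(area.sorted, -1) + diff(area.sorted) / 2
    area.mid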
So, these particular sets of points are the possible candidates for split values. In the partitioning process, when we try to identify the optimal combination of predictor and split value, we have to try out all these combinations: for annual income we have 19 midpoint values, and for household area another 19. Out of these 38 predictor-value combinations we have to find the optimal one, and that is going to be used for our first split.
So, we will stop here, and we will continue our discussion with the possible set of split values for categorical variables in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 37
Classification and Regression Trees - Part II
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture we started our discussion on classification and regression trees. We talked about the two steps, recursive partitioning and pruning, began our discussion on recursive partitioning, and also started an exercise on the same. We were discussing the possible set of split values, what those values could be, how we can compute them, and how we can get an idea about them using R.
In the previous lecture we talked about what the possible set of split values could be when the variable, the predictor, is numerical. We looked at annual income and household area from the sedan car data set that we are using for this exercise, and computed the midpoint values for these two variables: we have two variables and 19 midpoint values for each of them, with twenty observations in total. So, we have about 38 predictor-value combinations, and out of these 38 combinations, if the implementation of the algorithm follows this process, we have to select the one optimal combination which is going to reduce the impurity, that is, the heterogeneity.
(Refer Slide Time: 01:51)
That could be there in the resulting partition. So, the predictor-value combination whose resulting partition has the least impurity, that is, the most homogeneous partition, is the one that would actually be selected for the first split.
So, what if our variable is categorical? In that case, the set of categories that we have is divided into two subsets. For example, suppose we have a particular variable with these categories. From this we can form many possible candidate splits; there are different options here. For example, this could be one: we have to create two parts, so one category goes into one part and the other categories go into the other part, part 1 and part 2. In this fashion there could be various other candidates: it could be B on one side and the others on the other side, or similarly C and the others. So, there could be many combinations of these splits, many predictor and split value combinations, in this case also.
So, for a categorical variable this is how we create the different combinations of variable and split value: two subsets. For a variable with 4 categories A, B, C and D, all the two-subset combinations are the possible candidate split values.
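A small sketch of enumerating these candidate subsets for four categories; note that complementary subsets describe the same split, so only half of the listed subsets are distinct splits:

    # All non-empty proper subsets of {A, B, C, D}; each one is the set of levels sent to part 1
    cats <- c("A", "B", "C", "D")
    splits <- unlist(lapply(1:(length(cats) - 1),
                            function(k) combn(cats, k, simplify = FALSE)),
                     recursive = FALSE)
    splits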
Now, let us talk about the impurity measures that we could use in this particular algorithm, classification and regression trees. We are going to cover two measures, mainly the Gini index and the entropy measure. So, let us start our discussion with the Gini index.
Both these measures, whether the Gini index or the entropy measure, in a sense measure the impurity: first the impurity of the original rectangle, the original group in our data, and then, once we create the partitions, the two parts, part one and part two, we can compute the impurity of each of those parts using these metrics. Later on we can compare and see whether there has been a decrease in impurity after we have made a particular split. So, how do we measure the impurity of the different partitions? These are the two metrics which can be used: the Gini index and the entropy measure.
So, let us talk about the Gini index first. For an outcome variable with m classes, the Gini impurity index for a rectangular part is defined as Gini = 1 - sum over k = 1 to m of P k squared, where P k is the proportion of the rectangular part's observations belonging to class k. So, if we have m classes, for each class we have to compute the proportion of observations belonging to that class in that particular rectangular part.
For example, if we had the full original rectangle with all the observations, then for each class, c 1, c 2 up to c m, we would compute the proportion of observations belonging to class c 1 in that rectangular part, the proportion belonging to class c 2 in the same rectangular part, and so on for all classes c 1 to c m: these are the proportion values P k. We then square them and take the summation, and once we subtract this summation from one, the result actually represents the impurity.
So, this gives us the impurity index for the rectangular part. Once we create a partition, once we make a split, we have two more parts; for those two parts we can again use the same formula to compute their impurity values, combine the two (typically as a sum weighted by the number of observations in each part), and then compare the result with the impurity of the original rectangular partition to see how much impurity has been reduced because of the partitioning. So, this is one particular metric that we can use.
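A small illustrative sketch of this comparison, using class counts like those in the sedan car split discussed earlier (10 owners and 10 non-owners overall, split into 7/3 and 3/7 parts); with equal-sized parts a plain average of the two impurities suffices, otherwise a size-weighted average would be used:

    # Gini impurity of a rectangular part
    gini <- function(classes) {
      p <- table(classes) / length(classes)   # proportion of each class in the part
      1 - sum(p^2)
    }

    all.obs <- c(rep("owner", 10), rep("non-owner", 10))
    part1   <- c(rep("owner", 7),  rep("non-owner", 3))
    part2   <- c(rep("owner", 3),  rep("non-owner", 7))

    gini(all.obs)                     # 0.5, the maximum for two equally represented classes
    (gini(part1) + gini(part2)) / 2   # 0.42: the split has reduced the impurity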
Before moving to the second metric, the entropy measure, let us understand the range of the Gini values: they lie in the range 0 to (m - 1)/m.
(Refer Slide Time: 08:04)
So, if there are m classes, this is the range for the Gini index, and if there are just two classes the range is 0 to 0.5. How do we compute these ranges? In a two-class scenario, when the representation of both classes is equal, the proportion is 0.5 for each class. If you go back to the expression, 1 minus the summation over P k squared, and use 0.5 for both classes, you get 1 - (0.25 + 0.25) = 0.5, and that is the highest value. So, when we have equal representation from all the classes, the Gini index value is at its highest, because that is the situation where the impurity is highest: the observations belonging to the different classes are in equal proportion. If, in a particular rectangular partition, most of the observations belong to one particular class, then of course the impurity is low, because very few observations belong to the other class.
As this ratio evens out and the observations belonging to the different classes come into equal proportion, the impurity reaches its highest value, and that is also indicated in this range. So, in the m-class scenario the range is 0 to (m - 1)/m, and in the 2-class scenario the range is 0 to 0.5.
Let us talk about the next metric, the entropy measure. For an outcome variable with m classes, the entropy for a rectangular part is defined as: entropy = minus the sum over k = 1 to m of P k times log base 2 of P k. This is how we compute the entropy value. As we discussed for the Gini index, P k stands for the same thing, the proportion of class k members in the rectangular part. So, we compute that proportion, take its log base 2, multiply the two, sum over the m classes, and the negative of that sum is the entropy value.
The range of the entropy value is 0 to log base 2 of m for the m-class scenario, and 0 to 1 for the 2-class scenario. How?
For the two-class scenario, the highest impurity occurs when the members of the two classes are in equal proportion, in equal numbers. In that case P k is 1/2, or 0.5. If you plug that value into the expression, log base 2 of 1/2 is -1, so the minus signs cancel out, and with P k being 1/2 you get 1/2 for that class; for the other class you get the same value, and summing them, 1/2 plus 1/2 is 1. So, this is how the range comes about.
So, the highest impurity scenario is when all the classes have equal proportion, equal representation, in a particular rectangular part; that is when the impurity is highest, and that also gives us the upper end of the range for the entropy values and for the Gini index. To understand more about these two metrics, we will do a simple exercise in R. So, let us go back.
So, let us first understand the plot of the Gini values versus P 1, the proportion of observations in class 1; this is for a 2-class case. Let us see how the plot looks as we vary the proportion of observations belonging to class 1. seq is the function that we can use to generate the different proportions; let us compute this. You can see P 1 has been created, as shown in the environment section, and if you look at the specific values, the proportions range from 0 to 0.1, 0.2, 0.3, and so on up to 0.9 and then 1.
(Refer Slide Time: 13:20)
Against these proportion values, the P 1 values, we are going to compute the Gini index values and then plot them. As we are already familiar with the Gini index formula, let us first initialize the gini variable and then run this loop, for i in 1 to length of P 1, that is, eleven values in total. For each of those proportion values we compute the Gini index. This is how we can express the Gini index formula in R: 1 minus, and within the parentheses, for each class we take the proportion value, square it, and sum these values over all the classes.
(Refer Slide Time: 14:32)
So, let us compute this. We see that a gini numeric vector has been created, again with 11 values: 11 Gini index values corresponding to the different proportion values. So, let us plot this.
And this is the plot, Gini index versus proportion. From here you can clearly see that as the proportion increases from 0 to 0.5, the Gini index reaches its highest value, 0.5, as we talked about; and as we increase the proportion P 1 further, the Gini index value starts decreasing. It keeps decreasing, and when the proportion reaches 1 it becomes 0. So, this is how the values behave for the Gini index.
The same thing we can do for the entropy measure as well. Let us plot a graph of entropy versus P 1, the proportion of observations in class 1. P 1 we have already defined, so let us initialize the entropy vector; within the for loop you can see how we have written the code for calculating the entropy value: for each class we have one expression, the proportion multiplied by the log base 2 of that proportion, and once we sum all these expressions we take the negative of the sum. Let us run this loop to find the entropy values; you can see 11 values have been created. You would see that the first value is showing as NaN; this is mainly because the proportion value there is 0 and the log of 0 is not defined. So, because of that we get this particular value.
So, let us plot. Here you would see that in the plot function we are also using the spline function, which will smoothen the plot that we generate. Let us see what happens. This is the plot; you can see it is much smoother than the plot that we created for the Gini index. Again, as we move from 0 to 0.5 you would see that the entropy measure reaches its maximum, the value 1 at 0.5, and as the proportion P 1 increases further this value goes down to 0. So, this was about the two metrics, the Gini index and the entropy measure. Let us talk further about our technique, classification and regression trees.
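The two plots described above can be reproduced with the short vectorized sketch below (the lecture used an explicit for loop; the NaN values of entropy at the endpoints, where log of 0 appears, are simply dropped before smoothing):

    p1 <- seq(0, 1, by = 0.1)   # proportion of observations in class 1

    gini.vals    <- 1 - (p1^2 + (1 - p1)^2)
    entropy.vals <- -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))   # NaN at p1 = 0 and 1

    plot(p1, gini.vals, type = "b", xlab = "Proportion of class 1", ylab = "Gini index")

    ok <- is.finite(entropy.vals)
    plot(spline(p1[ok], entropy.vals[ok]), type = "l",
         xlab = "Proportion of class 1", ylab = "Entropy")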
So, the next important point is the tree diagram, or tree structure, that we create. Given the recursive partitioning steps we talked about, let us understand how the tree diagram is built. Each split of the p-dimensional space into two parts, which is of course part of recursive partitioning, can be depicted as a split of a node in a decision tree into two child nodes. So, we can have a root node; let us say this is our root node and it corresponds to the original partition. Then each split that we perform can be denoted using two nodes: this is part 1 and this is part 2. In this fashion the splits that we are talking about can be depicted.
So, for the p-dimensional space we start with the root node; when this space is partitioned, two parts are created, and this can be represented as a decision node having two child nodes. Once we have these two parts, these two child nodes, the same process is applied on them again, and the tree keeps growing till the point we have created homogeneous partitions, homogeneous groups. So, the first split creates the branches of the root node, as we can see. Now, there are two types of nodes in the tree structure: the first one is a decision node, which is depicted with a circle here, and the second one is a terminal or leaf node, which is typically depicted using a rectangle.
These terminal nodes typically correspond to the final rectangular parts. When we talk about just the recursive partitioning step, where we build the fully grown tree and obtain pure homogeneous parts, those parts become the terminal nodes. For example, if this was the root node and we created two partitions, and further partitioning of these led to homogeneous rectangles, then we can represent those nodes, because they are going to be the terminal nodes, the leaf nodes, using these rectangles. These are the decision nodes: a predictor and predictor-value combination is applied at each decision node, while a terminal node indicates the actual class, because it is now a pure homogeneous group; it is going to be either class 1 or class 0. In this fashion the tree structure is built.
So, there are two types of nodes, decision nodes and terminal nodes. Decision nodes are the ones where we apply the predictor-value combinations and create a split, and the terminal nodes or leaf nodes are the ones where we finally end up with a pure homogeneous group, and therefore we can label them with the class name, class 1 or class 0 if it is a two-class case.
Now, let us understand the steps to classify a new observation using tree-based models. Once the tree has been built, a new observation to be classified can be dropped down the tree: it is dropped down from the root node, and depending on the different comparisons it will take different branches, until finally it ends up at a terminal node or leaf node.
So, the first step: the new observation to be classified is dropped down the tree starting from the root node, which is also the first decision node. At each decision node the appropriate branch is taken until we reach a leaf node. For example, let us say this is variable X1 and the corresponding split value for this variable is V1, and the split is created: values less than V1 go to this side, values greater than V1 go to the other side, giving two parts. In this fashion, here again we will have another variable X2 with value V2, and here X3 with value V3, and then observations having a value less than V2 will go here, those greater than V2 will come here, and similarly on the other side.
In this fashion we continue till the new observation reaches a terminal node or leaf node, where it is finally classified as per the class of that particular terminal node. So, finally, at the leaf node the majority class is assigned to the new observation. Now, this is the case when we do not have any special class of interest and we are trying to maximize the overall accuracy or minimize the overall misclassification error; but when we have a special class of interest, as we have been talking about in previous lectures for other techniques, the steps change a bit. For a class-of-interest scenario, the proportion of records belonging to the class of interest is compared with the user-specified cutoff value for that class.
Once we reach the leaf node, typically, when we talk about full recursive partitioning, it is going to be a purely homogeneous partition, so there is no such problem; but if the tree is not a fully grown tree, if it has been pruned back (pruning we will discuss in coming lectures), then the leaf or terminal node might not be homogeneous and there could be some observations belonging to other classes. So, how do we decide? When we do not have any special class of interest and we are looking to maximize overall accuracy, in those situations we can just look at the majority class in the terminal node and assign that class to the new observation.
But when we have a class of interest, we will compute the proportion of records belonging to that class of interest and then compare this proportion value with the user-specified cutoff value, because that is the class of interest: we would like to identify more observations belonging to that class, even if it comes at the expense of misidentifying more observations belonging to the other classes. So, this final step is going to change depending on whether we have a class of interest or not.
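A minimal sketch of these two classification modes, assuming a fitted rpart classification tree mod (built later in this lecture), a new observation new_obs with the same predictor columns, and the sedan car class labels owner and non-owner:

# Sketch: classifying a new observation with and without a class of interest
pred_class <- predict(mod, newdata = new_obs, type = "class")   # majority class of the leaf

cutoff <- 0.3                                                    # illustrative cutoff value
prop_owner <- predict(mod, newdata = new_obs, type = "prob")[, "owner"]
pred_cutoff <- ifelse(prop_owner >= cutoff, "owner", "non-owner")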
So, what we will do is go through a simple exercise in R. Let us go back to R, but before that, let us also go through this exercise where we compute the impurity using the two metrics that we talked about.
(Refer Slide Time: 26:03)
So, for the sedan car example that we have discussed before, let us look at the summary of the ownership variable. We have 10 observations belonging to the non-owner category and 10 observations belonging to the owner category. Now, the different metrics that we talked about, the impurity indices, how can we compute them? For the original partition, the original rectangle, the gini index and entropy value can be computed in this fashion: you can see 1 minus the squared proportions, because 10 observations out of 20 belong to the non-owner category, and the other class is accounted for in the same way. So, this would be the gini value. The entropy value we can also compute in this fashion; you can see 10 observations belong to the owner class and the remaining 10 belong to the non-owner class. So, in this fashion we can compute the entropy value.
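A small sketch of this computation; the object names gi_org and em_org are chosen here to match the values referred to a little later:

# Original partition: 10 owners and 10 non-owners out of 20 observations
p_owner <- 10 / 20
p_nonowner <- 10 / 20

gi_org <- 1 - p_owner^2 - p_nonowner^2                                # Gini index = 0.5
em_org <- -(p_owner * log2(p_owner) + p_nonowner * log2(p_nonowner))  # entropy = 1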
(Refer Slide Time: 26:50)
Now, for the first split that we had created earlier, let us look at the graph. This was the graph; you can see here we had created the first split at the household area value of 18.8, and using this let us compute the gini index and entropy measure values. Let us zoom into this particular plot. In the upper rectangular part you can see we have 7 observations belonging to the owner class and 3 observations belonging to the non-owner class, so it is 7 out of 10 owners and 3 out of 10 non-owners for the upper rectangular part. So, gini for the upper rectangle is going to be 1 minus the square of 7 divided by 10, minus the square of 3 divided by 10. In this fashion we can compute the gini value for the upper rectangle. Similarly, the entropy value for the upper rectangle can also be computed using the same approach. So, let us compute these two values.
Now, if we look at the graph again, you can see that the lower rectangular part is symmetric to the upper rectangular part in terms of the proportion of observations belonging to the owner and non-owner classes. The upper rectangle is dominated by owners and the lower rectangle is dominated by non-owners, but the proportions are symmetric, so the values for the gini index and entropy measure are going to be the same. So, we can simply assign the same values for the lower rectangle as well: the gini value for the lower rectangle is going to be the same as for the upper rectangle, and similarly the entropy value for the lower rectangle is going to be the same as that for the upper rectangle.
(Refer Slide Time: 28:51)
Once this is done, for split 1 we can compute the gini index value. We will add these two values for the upper and lower rectangles, and you can see we are also multiplying each value by its proportion of observations: 10 out of 20 observations in the upper rectangle, 10 out of 20 in the lower rectangle. This will give us the impurity index after the first split, and likewise for the entropy value of the first split.
So, you can see in the environment section the values have been created: the entropy after split 1 is around 0.88 and the gini after split 1 is around 0.42, and the original values are also there, gi_org is 0.5 and em_org is 1. Now we can compute the difference, the delta, the reduction that has happened in impurity, that is, the gini delta and the entropy delta. You can see the entropy delta is around minus 0.12 and the gini delta is minus 0.08. So, we can see there is a reduction in impurity; therefore, the first split of course helps us in achieving more homogeneous parts, which is also very clearly visible from the plots as well.
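Continuing the sketch above, the impurity after split 1 and the reduction relative to the original partition might be computed like this (object names again illustrative):

# Upper rectangle after split 1: 7 owners and 3 non-owners out of 10;
# the lower rectangle is symmetric, so it takes the same values
gi_upper <- 1 - (7/10)^2 - (3/10)^2                          # about 0.42
em_upper <- -((7/10) * log2(7/10) + (3/10) * log2(3/10))     # about 0.88

# Impurity after split 1, weighting each rectangle by its share of observations
gi_split1 <- (10/20) * gi_upper + (10/20) * gi_upper         # 0.42
em_split1 <- (10/20) * em_upper + (10/20) * em_upper         # about 0.88

# Reduction in impurity relative to the original partition
gi_delta <- gi_split1 - gi_org                               # about -0.08
em_delta <- em_split1 - em_org                               # about -0.12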
So, I will stop here; the other partitions, their gini values, and the remaining exercises and discussion will continue in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 38
Classification and Regression Trees- Part III
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lectures we have been discussing classification and regression trees; specifically, we were talking about classification trees, we talked about recursive partitioning, we did a few exercises, and we talked about the impurity metrics as well, doing exercises related to them.
So, for the sedan car dataset that we were using, we talked about the first partition, the hypothetical partition that we created, and we also computed the gini index values and entropy values for the original partition and then after the first split. We then compared whether there is any reduction in impurity or not.
So, let us continue that same exercise; now we will do the second split, and then we will move further to applying classification and regression trees in R. This was the scatter plot of household area versus annual income, and this was the first split we created; as we discussed, the upper and lower rectangles being symmetric, we computed the impurity metric values for both. Similarly, the second split can also be done, and it can be done at this point: at an annual income value of 7, extending from a household area value of 0 up to 18.8.
We can create this separator and again we will have these 2 rectangular partitions. Now you can see this particular partition is purely homogeneous, because all the observations here, 2 of them, belong to the same class, that is, owner. The other partition will have 7 observations belonging to the non-owner class and just 1 observation belonging to the owner class.
So, the reduction in impurity can be clearly seen, and here this part has already become homogeneous. In this fashion we can keep on creating partitions, and that also explains the typical process that happens in classification and regression trees. So, what could be our third partition?
So, this could be another line; you can see that after this partition we will have another area that is homogeneous, with all its observations belonging to the same class, while this one will have just one observation belonging to the other class. In this fashion we can continue our partitioning process till we create partitions which are all purely homogeneous. Now you can see that any partition you look at is a pure homogeneous partition; the observations belong to the same class.
So, now all the observations have been correctly classified; the performance of this particular classifier is going to be 100 percent. But of course, overfitting has happened. So, how do we adjust, how do we get rid of this overfitting in our model? This we will discuss in coming lectures. Let us continue with our exercise based on the sedan car dataset and look at the structure of the data frame that we have. This is the sedan car dataset that we have been using: annual income, household area, ownership.
The function that we use for modelling is rpart, and there we have a particular argument, method. In this method argument we have to assign the value class for a classification tree and anova for a regression tree; since we are doing a classification task right now, we use class. So, what we will do, as we have been doing for modelling with the previous techniques, is call rpart and store the result in mod: our outcome variable here is ownership, and the model is built on the other variables, the 2 predictor variables annual income and household area; method is class because we want to do classification, and data is df.
So, we are going to use all the observations in this modelling process; this is mainly for demonstration of the classification tree, so we will use all the observations here. Then another argument that we have is control. This control argument can be used to call another function, rpart.control, to set a few more parameters. To understand more about these functions, let us first type rpart in the help section; you would see that the package name is rpart, the function is rpart, and let us look at some of the arguments here. Formula we have already specified; then there are data, weights, subset and other arguments; you can see control is one more argument and parms, the parameters argument, is another one. Some of these we are going to use. So, let us focus on the control argument and scroll down.
(Refer Slide Time: 06:33)
So, this control is essentially a list of options that control details of the rpart algorithm. What those options could be, we will have to check from the function rpart.control; as you can see in the code itself, we are calling the rpart.control function and passing the arguments within it. So, let us look at the definition of rpart.control. This is the function that sets various parameters and control aspects of rpart. You can see the first argument is minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.
(Refer Slide Time: 07:24)
You can see the default value is 20; however, we have specified it as 2, so that even if only 2 observations are there, we would like to go for the split. Essentially, we are trying to simulate the scenario of a full-grown tree, and for that purpose we are doing this. So, for minsplit we have specified the value 2: if there are 2 observations we would like a split to be attempted.
Now, the next argument is minbucket: the minimum number of observations in any terminal node. Here you can see for minbucket we have specified 1, so the minimum number of observations in a terminal node can be 1 as per our specification. The default value is based on the computation round(minsplit/3): the minsplit value is divided by 3 and then rounded, and that is the default minbucket value. But as we said, we want to simulate the full-grown tree process, and therefore minbucket has also been kept at a very low value of 1.
Then maxcompete is the number of competitor splits retained in the output. This is 0 because we are not interested in other competitor splits; we are just interested in the best, optimal predictor-value combination, so it has been specified as 0. You can see maxsurrogate, the number of surrogate splits; in these also we are not interested, so this has also been specified as 0 in the code. Then you would see xval, which is the number of cross-validations.
(Refer Slide Time: 09:28)
What happens in the rpart function is that some of the observations used for the model-building process are also reserved for cross-validation; however, we would not like to perform cross-validation here, rather we would like to use a validation partition later on, in another example that we will see. So, at this point, because we are using all the observations of the data, we do not want to do cross-validation. Cross-validation is very similar to what we do with a validation partition, but in that case observations from the training partition itself are used to perform the validation exercise.
Having understood some of these parameters, let us also discuss one more, the parms parameter, which is there on the previous page, the rpart help page. So, let us go back.
(Refer Slide Time: 10:36)
So, parms is the set of optional parameters for the splitting function. What we have done is select the split element: the splitting index can be gini or information. In this case, though the default is gini anyway, we have explicitly specified gini, so that we understand how we can use this argument of rpart and specify the required splitting index. So, split is an element of the list that we pass to the parms argument, and within that list we specify it as gini.
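A sketch of the call just described; the data frame and column names (df, Ownership, and the two predictors) are assumptions, so adjust them to the actual data set:

library(rpart)

# Full-grown classification tree on the sedan car data (all observations)
mod <- rpart(Ownership ~ .,                  # ownership against the two predictors
             data = df,
             method = "class",               # classification tree
             control = rpart.control(minsplit = 2,      # try a split even with 2 observations
                                     minbucket = 1,     # allow leaves with a single observation
                                     maxcompete = 0,    # keep no competitor splits
                                     maxsurrogate = 0,  # keep no surrogate splits
                                     xval = 0),         # no cross-validation
             parms = list(split = "gini"))   # splitting index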
So, once we have understood the arguments and the function, let us execute this code. You can see the model mod has been built; now let us look at the tree that we have constructed. Let us first set some of the graphics parameters: margin 0, outer margin 0, and the xpd value as NA, because we would like to use the whole device region; for more information on xpd you can look at the help for par and related functions.
Let us execute this and now plot: we would like to plot the tree that we have just built. So, this is the tree, and now we have to add the labelling information, which we can do using the text command. In the text command, again passing the mod object and other arguments, we can fill the tree with more details.
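A minimal sketch of this plotting step:

par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0), xpd = NA)  # use the whole device region
plot(mod)    # draw the tree skeleton
text(mod)    # add split labels and node information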
(Refer Slide Time: 12:34)
So, this is our tree. You can see annual income is the first predictor that has been selected, and the specific value that has been selected is 5.95; so the predictor-value combination is annual income and the value 5.95. If this value is less than 5.95 we go to the left branch, which is classified as non-owner; that is a terminal node, so we immediately get a terminal node in the left branch. In the right branch, if the value is not less than 5.95, we go to the right subtree, and there again we have to find another split, another predictor-value combination. This comes out to be household area this time, at 19.5, and again we get two partitions: the right partition is actually a terminal node, and in the other part we have annual income less than 8.05 and further partitioning, on the left side annual income 6.1; the others are terminal nodes. So, from this, let us compare to the hypothetical example that we had done.
So, here the first split is based on the annual income predictor and the specific value is 5.95; however, if we look at the splits that we had done manually, our hypothetical splits, the first split was at 18.8 using household area. In this particular graph you can see that first split we had created at 18.8 using household area, but the first split as per the tree built using the CART technique is annual income at 5.95.
So, for annual income 5.95, let us look at the original plot and see where this value is going to be. The value is going to be somewhere here, 5.95, very close to 6, so the partition is going to run somewhere here, in this fashion, and it will go all the way up. If the partition is created here, it will probably pass between these 2 points. So, if we want to draw this partition, let us use the abline function that we had used earlier, when the first split we created was at 18.8. Now, let us use abline to create this partition.
(Refer Slide Time: 15:41)
Now, we have to use the v argument, since this is going to be a vertical line, and the value is going to be 5.95; once we do this, we get this line. Let us look at it. This is the line, the first split as per the rpart algorithm, the value 5.95 for annual income, and you can see how the observations are being split using this partition. The left partition that we get is purely homogeneous; it will have 6 observations, all belonging to the non-owner class. The other partition is of course dominated by the owner class: it is going to have 4 observations belonging to non-owners and the remaining 10 observations belonging to owners. So, this first partition is probably reducing impurity much more than the partitioning we had done at 18.8 using household area; this is the optimal combination, annual income 5.95. If we were to compute the delta, the decrease in the impurity index, whether gini or entropy, we would see a much bigger decrease in this particular case.
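A small sketch of this step, assuming the scatter plot is redrawn from columns named Annual_Income and Household_Area:

plot(df$Annual_Income, df$Household_Area,
     xlab = "Annual income", ylab = "Household area")
abline(v = 5.95, lty = 2)    # first split found by rpart
abline(h = 18.8, lty = 3)    # earlier hypothetical split, for comparison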
So, let us go back to our modelling. Once this is understood, there is another way to create the tree diagram. The tree diagram that we had created using the plot function is this one; if you would like a more presentable tree diagram, a nicer or prettier version, then we have another package, that is rpart.plot. We will have to install this package and load the library; first we are required to install it, as it is not already installed, so let us install this particular package.
Once this package is installed, we will load it and then we will have access to another function, prp, which can help us in plotting a nicer or prettier version of the tree diagram. So, let us load the library. Now, for this prp function you can see the first argument is of course the model.
And then there are various other arguments that I am passing here; if you are interested in more details, you can always type prp in the help section and find out more about this particular function. You can see it is described as plotting an rpart model, so it is especially designed for this kind of plotting, and you would see a huge number of arguments that can be passed to this function. Too many arguments, so we would like to understand just the few that we are going to use.
For example, the second argument that we have is type, which indicates the type of plot that we want to create. In this case we have said type 1. What is type 1? Label all nodes, not just the leaves. The default type would just label the leaves, but we would like to label all nodes, that means decision nodes as well as terminal or leaf nodes, not just the leaf nodes. Therefore, we have changed the type to 1.
The next argument is extra, and the value that we have given is 1. Let us look at what it means: when the extra value is 1, the number of observations that fall in each node is displayed, so the split of observations across the classes is going to be shown; that is the extra information that is displayed. Had we opted for extra 0, then no such information would have been displayed; there are many other extra options that you can look at in your own time. Now, the next argument is under, which is specified as TRUE.
Let us look at this particular argument: it applies only if extra is greater than 0, which is the case since we have selected an extra value of 1, so the under argument can be applied. What does it do? If it is TRUE, it will put the text under the box. A box will be seen at each node, decision or terminal, and the extra information that we talked about using the extra argument is going to be displayed under the box.
Then we have varlen, another argument. Let us look at what this is going to do: it controls the length of the variable names in the text at the splits. Varlen is 0, and 0 means use the full names; there are other options also, which will abbreviate the names depending on the requirement. Because we have assigned the value 0, we do not want to abbreviate the names, so full names will be seen at the splits. Similarly, other arguments are also there, such as cex, compress, Margin and digits, which you can go through in your own time.
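Putting these arguments together, the call might look like this sketch:

install.packages("rpart.plot")   # needed only once
library(rpart.plot)

prp(mod,
    type = 1,        # label all nodes, not just the leaves
    extra = 1,       # show the per-class observation counts in each node
    under = TRUE,    # print that extra text under the node box
    varlen = 0)      # use full variable names at the splits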
So, let us create this plot. This is the plot that we get, a much nicer or prettier version of the tree diagram. As you can see, here is annual income 5.95: if the value is less than this we take the left branch, if it is not less than this we take the right branch. You can see the root node itself has been labelled as non-owner, even though there are equal numbers of observations belonging to the owner and non-owner classes; that is because there is no majority class, so the function has just gone with the alphabetical ordering of non-owner and owner, and since n comes before o, non-owner has been used.
So, alphabetical ordering is used because there is no majority here. Also, because we set type 1 and extra 1, the arguments that we understood from the help section, the labelling of decision nodes is also being performed here: depending on the majority, that label is displayed. You would see this is a terminal node that we have; all 6 observations belong to this category, non-owner, and again you can see the split 6 and 0 is in the same order, non-owner first and then owner. We have had this particular discussion before as well: when you define a variable as a factor variable in the R environment, the labels of that factor variable are going to be alphabetically ordered, and the same ordering scheme is again being used by this tree diagram.
In this fashion we can understand this particular tree diagram. This is a full-grown tree, and because we had just 2 predictors and just 2 classes in our outcome variable, this particular tree is quite small; however, if we are dealing with more than 6, 7, 8 predictors and we are trying to build a full-grown tree, the tree diagram is going to be much messier and many more decision nodes and terminal nodes are going to be displayed. Therefore, the visualization or graphical analysis would also become more difficult in those tree diagrams.
So, let us go back. If we are interested in visualizing the tree diagram as it is after the first split, or the second or third split, that can also be done in R. For that to be performed, first we need to understand the node numbering, that is, how node numbering happens in a tree diagram. In the same function that we have used for plotting the tree, prp, there is another argument, nn. This nn argument, as you can see here in the code, is specified as TRUE in this case; therefore, node numbers for all the nodes are going to be displayed, and a character expansion for the node numbering is also set, so that the numbers are not too large in comparison to the diagram.
So, let us create this tree diagram with node numbering and zoom in; now you would see all the nodes have been numbered. The root node becomes number 1, then this left terminal node is 2, then 3, and then you would see that suddenly after 2 and 3 we get to node number 6. That is because the tree diagram uses unique node numbering: even if a particular node is not present, its number is still counted, the numbering is just skipped for that node. For example, the root node is 1 and it can have 2 children, a left child and a right child; the left child is going to be number 2 and will be displayed if it is present, the right child is going to be number 3 and will be displayed if it is present.
Then we come back to the left part of the tree diagram and look at the second node, that is, the left child of the root node. In this case we see that there is no further partitioning, there are no further children of this particular node; but had there been, they would have been given the numbers 4 and 5. Those numbers are not used, but they are still counted, and the next node at the same level, the third level, gets the next number, which is 6 here. This fashion gives us unique node numbering for all the nodes, and we also get to know which node numbers are not part of the tree. This gives us some flexibility for later processing and computation, as we will see.
Similarly, after 6 we keep moving at the same level, so this is the 7th node; then you would see that from 7 we jump to 12, because had node number 2 had 2 children, and had those again had 2 more, 4 grandchildren would have been there at this level, taking the numbers 8, 9, 10 and 11. They are not part of the tree, so the numbering comes to 12. In this fashion the numbering can be worked out.
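A sketch of the node-numbered version; in rpart's numbering a node n has children 2n and 2n + 1, which is why absent nodes leave gaps:

prp(mod,
    type = 1, extra = 1, under = TRUE, varlen = 0,
    nn = TRUE,       # display the unique node numbers
    nn.cex = 0.8)    # keep the numbers small relative to the diagram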
Now, why we have done this kind of unique node numbering, where a number is counted even if a particular node is not part of the tree, we will understand in a few seconds. The first split that we create, in this case at the annual income value 5.95, is the split at the root node. If you just want to display up to the root-node split from the full-grown tree that we have developed, which is represented by mod, you can call the snip.rpart function. Let us look at this particular function, snip.rpart.
It snips subtrees of an rpart object. Using this function we can snip the tree, that is, remove some of the nodes in the full-grown tree and look at just the part that we are interested in. For example, if we are interested in the nodes that have been created after the first split, that is, the root node and then the 2 child nodes, those 3 nodes along with the corresponding terminal nodes, then we will have to snip off the remaining part. So, how do we snip the remaining part? Let us again look at the plot.
So, this is the plot. After the first split we have root node 1 and the 2 nodes here, 2 and 3; the other nodes, like 6, 7 and the rest, we would like to snip off, we would like to remove them. Using the snip.rpart function, we pass the full model for the full-grown tree, and then you can see we are passing the unique node numbers of the nodes which we want to snip: 6 and 7, which are present, then 12 and 13, and then 24 and 25 at the last level. We would like to get rid of these nodes, so this part would be removed and we would be left with the part which we would actually see after the first split. So, let us perform this snipping.
Now, once this is done we get another model, another object of rpart type, and we can plot it using the prp function. Now we are able to visualize just the snipped tree, the part after the first split: you can see the root node, then the 2 child nodes, and then the terminal nodes.
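A sketch of this snipping step, using the node numbers mentioned above:

# Snip off the subtrees below the first split and plot what remains
mod_sub <- snip.rpart(mod, toss = c(6, 7, 12, 13, 24, 25))
prp(mod_sub, type = 1, extra = 1, under = TRUE, varlen = 0, nn = TRUE)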
So, this is how we can see what happens after the first split. In the root node we had a 10 and 10 split, a similar proportion; this was the predictor-value combination, annual income 5.95. The left child is a terminal node immediately, while the second one, labelled owner, has further partitioning below it, so the second split is also in a way indicated here, with non-owner and owner labels. In this fashion we can focus on and display just the part that we are interested in; here we can see the first split and its 2 children. So, we will continue this discussion in our next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 39
Classification and Regression Trees- Part IV
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lecture we were discussing classification and regression trees; in particular, we were doing an exercise in R using the sedan car data set that we have been using. So, let us redo a few of the steps so that we are able to resume from the same point where we ended in the last lecture.
Let us load the library and import the data set quickly. In the last lecture we were able to create the model for this particular data set, so let us come to that point. This is the part, these were the variables; so let us reload the rpart library.
(Refer Slide Time: 01:31)
So, rpart is the function that we used last time; let us build the model and the plot as well. This was the plot that we had created last time, and we also talked about a nicer or pretty version of the plot of the tree model.
We also talked about the node numbering and the requirement of node numbering if we want to snip off some particular part of the tree. Why we need this will become clearer as we go along and discuss more in this particular lecture. For node numbering, as we talked about in the previous lecture, we have to specify the nn argument as TRUE, so that node numbers are assigned, and then the character expansion parameter can also be used.
Once node numbering is done, you can see in the plot the full-grown tree using the full data set that we have, with node numbering also done. In the previous lecture we had looked at the first split, where we just wanted to have the nodes up to the point of the first split. So, this is the first split, and we would like to have just that one decision node and then the other, terminal, nodes.
In that case, as we talked about, we have to look at these node numbers and see which ones we want to keep. To revisit our example, if we want to keep just these 3 nodes, this first decision node, this terminal node, and the other one which is also a decision node followed by its terminal nodes, that is, if we want to keep these 2 decision nodes and the corresponding terminal nodes, then we will have to snip off the nodes starting from 6 and 7; we talked about the unique node numbering scheme that we have in tree models in general and tree algorithms.
So, we have to specify them using the snip.rpart function, which will allow us to remove some of the unwanted nodes. If we just want to focus on the first level or the first few nodes, we can remove the other nodes using this snip.rpart function.
The first argument, as we talked about, is going to be the model, that is, mod, the rpart object, and the second argument is where we specify the node numbers which we want to get rid of. You can see 6 and 7, 12 and 13, 24 and 25; these are the node numbers that we want to get rid of. We can run this particular function, and you would see that the new model, the subtree model, has been saved in mod_sub, and now we can plot this particular tree model.
You can see this is the new model, or rather the snipped-off version of the same model.
(Refer Slide Time: 04:49)
You can see 2 decision nodes here and the corresponding terminal nodes as well. If we want to keep just 1 decision node, the first split that we talked about, we can remove this one as well, and then we will end up with just 1 decision node. So, in the full tree that we had, we will have to remove node 3 as well: if we also pass node number 3 as an argument here and then do our snipping, this node will also be removed, and the new plot will have just 1 root node.
You would see just 1 decision node, that is, the root node, which is also indicative of the first split, and the others, as you can see, are the terminal nodes where we get the final classification.
As we talked about, the full-grown tree could be large; if the data set is quite large and there are more variables, as we will see through one more example data set, the whole full-grown tree could be quite messy and difficult to understand.
Therefore, this exercise will help us in displaying or plotting just the first few levels, so that we can look at some of the most important variables that are being used to create the splits or partitions. Similarly, if we wanted to have 3 splits, as we talked about, we can go back to the full-grown tree.
So, this was the tree. If we want to have these 3 splits, split 1 and then 2 and 3, we will have to remove the other nodes. The other nodes would be 12 and 13 and 24 and 25, so we can get rid of these nodes, mention them appropriately here, get the new subtree, and then plot it.
Now you can see this new one here has just the 3 splits, 1, 2 and 3, and the corresponding terminal nodes as well.
(Refer Slide Time: 07:21)
In this fashion, depending on the levels, we can snip the full-grown tree, and the importance of this particular process we will realize as we discuss further. Now, we are interested in the different attributes that this rpart model object is actually carrying, which we can examine using the attributes command. This command is generally applicable to the other techniques as well, to the different models that we have been building in previous lectures while discussing those techniques.
There also, the attributes function can be used to find out the different results and information stored in the model object. This is an rpart object, and you can see the information that is there. For more details on what the frame component is about and what the other components are about, you can always go to the help section and find out more about the rpart object; if you look at the rpart.object page in the help section, you would see frame as the first value, then the second value, and so on.
The frame component is quite big; it is actually a data frame containing lots of information. Then where, call and the other attributes that you can see here can be easily seen and understood. The summary function is also there, and we are interested in looking at the summary. What this generic summary function contains includes the CP table, which we will discuss again later in the discussion.
But since we are on the rpart.object help page, you can see that there is a description of cptable: a matrix of information on the optimal prunings based on a complexity parameter. That is what is generally displayed in the CP table.
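A sketch of how these components can be inspected (the cptable and variable.importance components are discussed here and just below):

attributes(mod)          # components stored in the rpart object, class, etc.
names(mod)               # frame, where, call, cptable, variable.importance, ...
mod$cptable              # complexity parameter table, used later for pruning
mod$variable.importance  # relative importance of the predictors
summary(mod)             # CP table, variable importance and per-node split details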
Let me also remind you of the rpart.control function that we talked about in the previous lecture: if we open the page for rpart.control, you would see there is one specific argument, xval, that is, the number of cross-validations. The rpart function in this particular package also does some cross-validation.
As we talked about in the previous lecture, some of the observations within the data that is supplied to the rpart function, within the training partition, are reserved for cross-validation, and that is later used to produce the CP table and other information.
We will discuss them as we go along. Coming back to the summary output, you would see that the variable importance is also mentioned here. One aspect of the classification and regression tree technique is that it can also serve as a variable selection or dimension reduction technique; some such approaches we have discussed before, for example dimension reduction techniques and regression-based variable selection approaches.
Classification and regression trees are also one of those techniques which help us in identifying the more important variables, the more important predictors. In this case you can see variable importance is also part of the output: annual income has about 77 percent importance, and household area the remaining 23 percent.
If we had more variables, we would have got a much bigger list. From this we can say that, in terms of classifying sedan car ownership, annual income is more important than household area. The same thing is also reflected in the plots: you can see the first split is based on annual income. Let us now look further at the summary output that we were discussing.
(Refer Slide Time: 12:02)
For every node, some information about the split that has been performed on that node, and more information about the node itself, is always available in the summary output; you can see node number 1 with 20 observations. Let us go back to the full plot as well.
So, this was the full plot. You can see 20 observations in the root node, all 20, and the class counts were 10 and 10; you would also see the expected loss value there, and how many observations go to the left and how many to the right, the left son and right son, so different terminologies are used. You can see 6 observations in this particular node; let us zoom in. In the left child, which is also a terminal node, you can see 6 observations, and you would see that the right son has 14 observations, node number 3.
For node number 3 with its 14 observations, we have to combine the counts of all the nodes under it, which add up to 14. So, this right child is actually a subtree in its own right, and within this subtree you can see there are 14 observations. 6 and 14, that was the partition at the root node.
So, the split is done based on this particular classification rule, which you can also see, and the improvement value is there as well. Now, if we scroll down further, we will also see the other nodes, node number 2 and node number 3, with their 6 and 14 observations respectively; node 2 was a terminal node,
so no split was done there, as you can see from the information. In node number 3 a split was done; you can see the split rule, the classification rule, is also mentioned, along with the left son, the number of observations, and so on. In this fashion we can always find out the details of the splits that have been performed by the rpart function.
Now, with this, let us discuss further using another example, a larger data set having around 5000 observations. Let us look at this particular data set: this is the promotional offers data set, and 3 specific variables in it
(Refer Slide Time: 14:51)
we had used in a previous lecture: income, spending, and the third variable, the promotional offer. There, based on income and spending, we used these 3 variables for classification and other visualization techniques. Now this is the full data set, where we have information on income and spending, the promotional offer is our outcome variable of interest, and then we have information on age, pin code, experience, family size, education and online activity status.
This particular information is about a particular firm, and we are trying to build a classifier for whether a particular customer is going to respond to their promotional offer or not. The promotional offer column is the outcome variable, and the other variables that we have in the data set will play the role of predictors: income, spending, age, pin code and others.
Using these variables we will build a classifier where we would like to classify whether a particular observation, a particular customer, is going to respond to the promotional offer or not. Among the other variables, age is the age of the customer or respondent, pin code is the location of that particular customer, experience is the professional experience of that customer, and family size is the size of that customer's family; then we have education, whether the customer is twelfth pass (HSC), graduate or postgraduate, so we have some information about education, and then the online activity status, whether the customer is active online or not, with 1 indicating that the customer is active online and 0 indicating that the customer is not.
With this information we would like to build a classifier to predict whether the customer is going to accept the promotional offer or not. With this background, let us come back to RStudio and start our classification tree model. First, let us import this particular dataset; since there are 5000 observations, R will take a bit more time to import this file. As we have talked about, once the data is imported it is actually stored in memory, and therefore it takes a bit more time to bring all the observations into memory. Now this is done, and you can see in the environment section df1, a data frame of 5000 observations of 9 variables.
Let us remove NA columns, if there are any; there were none. If you want to have a look at the first 20 observations again, you can do this using subsetting with brackets. The data has been correctly imported into the R environment, as you can see from the first 20 observations; all the variables are there.
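A sketch of this import step; the file name is an assumption, since the lecture imports the file through RStudio's Import Dataset dialog:

df1 <- read.csv("promotion_offers.csv", header = TRUE)      # assumed file name

# Drop columns that are entirely NA, if any, and glance at the data
df1 <- df1[, colSums(is.na(df1)) < nrow(df1)]
df1[1:20, ]                                                 # first 20 observations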
(Refer Slide Time: 18:22)
Now let us look at the structure of this data set. You can see income is numeric, spending is also numeric, and promotional offer is again shown as numeric; therefore, we would be required to make it a factor variable, because there are just 2 values, 0 or 1, 1 meaning acceptance and 0 rejection of the promotional offer. Then we have age, which is numeric,
so no problem there. Pin code is again a variable that we talked about in our starting lectures, when we discussed categorical variables: pin codes or zip codes are essentially categorical variables; however, it is quite difficult to incorporate these variables in models, because for different customers, different individuals, the pin codes can differ, there can be too many pin codes, and that would add to the dimensions or the number of variables that we include in our model.
So, dimensionality could be a problem, the curse of dimensionality, because of the categorical variable having so many categories, since each pin code is going to represent one particular category. If there are 5000 observations and they cover 250 locations, we will end up with a categorical variable with 250 labels.
That can create problems in our modelling: too many predictors, too many dimensions. So, how do we overcome this situation? One solution could be to model this particular variable as a numeric variable, because even though this is a numeric code with many categories, a location can also be associated with latitude and longitude, which are numeric in nature.
These codes can in that sense represent latitude and longitude, because generally when these codes are assigned there is some method, some process, that is adopted to assign them; therefore, they might in a way represent latitude and longitude and might in that sense behave like an ordinal variable.
Therefore, since there are too many categories, they could also be treated as a numeric variable; however, in this particular exercise that we are going to perform, we will treat them as a categorical variable, and we will see how we are able to reduce the large number of categories into a few groups, a smaller number of categories, and then use them as a factor variable.
We will see that. Then the next variable is experience, which is numeric, so just fine; family size is numeric, which is also fine; education is a factor with 3 levels, so appropriately specified again, no need to change it; then online is mentioned as numeric, but we will have to change it to a factor variable with 2 categories, active or not active. A quick sketch of these factor conversions follows; after that, let us start with our pin code variable.
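A sketch of the structure check and the factor conversions just described; the column names Promotional_offer and Online are assumptions for illustration:

str(df1)    # income, spending, age, experience, family size numeric; education a factor

# Recode the two 0/1 variables as factors (assumed column names)
df1$Promotional_offer <- factor(df1$Promotional_offer,
                                levels = c(0, 1), labels = c("reject", "accept"))
df1$Online <- factor(df1$Online,
                     levels = c(0, 1), labels = c("not active", "active"))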
What we plan to do is grouping: we will try to group the categories of pin code. How do we go about grouping these categories? As you can see, in our classification task we are trying to classify, or predict, the class of the promotional offer variable, the outcome variable of interest. Keeping in mind that this is a supervised approach, a supervised algorithm, everything is driven by the outcome variable; essentially the whole focus is on decreasing the overall error in classifying this particular variable, promotional offer, and therefore the pin codes can be grouped with respect to this variable, promotional offer.
What we can actually do is group together the areas which have a similar sort of acceptance rate, or a similar sort of rejection rate. Because the task is prediction of the promotional offer variable, in terms of the predictive importance of the pin code variable, the areas, pin codes, cities or towns having a similar acceptance rate level can be grouped, and likewise those having similar rejection levels can be grouped together.
Having them as separate individual categories versus grouping the locations with similar success rates would not impact the model much, and with grouping we will end up with fewer dimensions. So, we would be able to reduce the dimensionality by clubbing together the similar types of locations, where the acceptance or rejection rate is similar.
A similar kind of exercise we had also done with visualization techniques, but that was limited to generating a bar plot; now we will go through it in an actual modelling exercise.
First, we would like to look at the number of categories that we will have to handle. Pin code is the variable; you can see as.factor is being used to coerce it into a factor variable, and then the table function is being used, so that we are able to see the number of categories and the frequency of each category as well. So, let us look at this table, t_pin; we are interested in looking at the actual values.
(Refer Slide Time: 24:28)
We can do this, and you can see the frequency for each of these pin codes. For example, the pin code 110001 has appeared 54 times in the data set, and for the other pin codes similar numbers are displayed; so the frequency for the different pin codes we can easily see. Now, the pin names for these pin codes we can easily extract: we can do this using the dimnames (dimension names) function, passing this table object, t_pin, and taking the first element.
The result is actually a list, the first element of which is going to give us the pin names. We can store this so that we can use it later when we generate the bar plot, or for other things. As we talked about, because we want to group these categories as per their success rate, their acceptance or rejection rate for the promotional offer, we will have to count, for each of these pin codes, the acceptances, so that we can compute the acceptance or rejection proportion or rate.
Let us initialize this variable, the count of pin code acceptances, c_pincode. Once this is initialized, you can see I am running a loop here: for all the pin names which we have just computed, we are going to run this loop, and in this loop, within the combine function, you can see the pin code variable is being converted into a character vector.
This is so that it can be compared with x, which is a character vector, since pin names is a character vector; you can confirm this by looking at the environment section, where pin names is stored as a character vector. Pin code, which is essentially stored as a numeric vector right now, is therefore converted to character so that the comparison can be performed. Then, for a particular value x, that particular pin code, we look at the observations with that pin code; for example, the pin code 110088 appears 54 times in the data set, and out of those 54 observations we want to know how many times the customer accepted the promotional offer.
So, we would like to count the number of acceptances. Once all the observations with that particular pin code are selected, and within those the ones where the promotional offer was accepted, we get the indices of those observations using the which function; then, when we apply the length function to that indices vector, we get the count.
So, for each pin code the success count, the acceptance count, the number of acceptances, is obtained through this code and stored in this c_pincode variable. What is going to happen is that for each of the pin codes in pin names, this count of acceptances is computed by this for loop.
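A sketch of these steps, with column names following the earlier sketches (Pincode, and Promotional_offer with the label accept):

# Frequency table of pin codes and count of accepted offers per pin code
t_pin <- table(as.factor(df1$Pincode))
pin_names <- dimnames(t_pin)[[1]]            # the 96 distinct pin codes

c_pincode <- integer(0)
for (x in pin_names) {
  # indices of observations in this pin code whose offer was accepted
  idx <- which(as.character(df1$Pincode) == x & df1$Promotional_offer == "accept")
  c_pincode <- c(c_pincode, length(idx))
}
range(c_pincode)                             # 0 to 12 in the lecture's data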
Let us execute this code. Now you would see in the environment section that we have created the c_pincode variable; this is an integer variable, since all the values in it are counts, so all of them are integers. We can see some of the values in the environment section itself; there were 96 pin codes.
For all those 96 pin codes we have the count of successes; you can see in pin_names, the character vector that we had created, that there were 96 pin names. So, there were 96 pin codes, and for all those 96 pin codes, 96 locations, we have been able to count the number of observations, the number of customers, who actually accepted the promotional offer.
So, once we have this information. Let us look at the range before we keep generate the
plot. So, you can see the maximum number of acceptance is 12 right, and the minimum
is 0
With this information, we can generate a bar plot and analyze visually what the acceptance and rejection proportions have been. Let us create this bar plot; you can see the limits on the y axis are appropriately specified as 0 to 13, so the range falls within these limits. Let us generate the plot. First we need to correct the graphics settings; then, when we generate the plot again, we get it in the desired format.
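A minimal sketch of the bar plot call, under the same assumed object names; the label-size settings are illustrative additions so that 96 pin codes fit on the axis.

```r
range(c_pin_code)                        # should show 0 and 12
barplot(c_pin_code,
        names.arg = pin_names,           # pin codes on the x axis
        ylim = c(0, 13),                 # y limits covering the observed range
        las = 2, cex.names = 0.5,        # rotate and shrink the 96 labels
        ylab = "No. of promotional offers accepted")
```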
(Refer Slide Time: 30:07)
So, you can see on the y axis we have the number of promotional offers accepted, ranging from 0 to 12, and on the x axis the different pin codes are mentioned. For all 96 pin codes we can see the corresponding bars.

From here we can further analyze and identify, by looking at this plot, whether there are any groups of pin codes that can be clubbed or grouped together. At this point we will stop here, and we will continue our discussion on classification and regression trees in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 40
Classification and Regression Trees- Part V
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lecture, we were discussing classification and regression trees; specifically, we were talking about the promotional offers data set. We were looking at the variables of that data set, and we identified pin code as a categorical variable with too many categories, 96 of them, as we saw in the previous lecture. We wanted to find a way to reduce the number of categories to a fewer number, so that the dimensionality problem is addressed.

So, we generated a bar plot, which you can see here, and we wanted to analyze it to understand whether some of the categories can be grouped together. If we look at this graphic, we can see that some of the bars have zero values; for some pin codes zero offers have been accepted in those locations, here and here. Of course, since the task is to predict the class of our outcome variable, the promotional offer, we can club those locations into one group, because the level of acceptance or rejection in those two or three locations is the same; and this is quite a big plot.
So, all those locations which have a similar acceptance level can probably be grouped. For example, the first bar and the third bar have the same acceptance level of five, so probably we can group them. We can identify many other bars which are at a similar acceptance level and group them as well. We could also group them using ranges; depending on the exercise and on how suitable a particular grouping is with respect to our model and its performance, we can decide the grouping.

For example, the pin codes with an acceptance count of 0 to 5 could form group 1, those with 5 to 8 group 2, and those with 8 to 12 group 3. In this fashion also we can group these locations, and under that scenario we would end up with three groups. However, what we are going to do in our exercise is to group on the exact acceptance count, whether it is 0 or 1 or 2 and so on, because the maximum value is only 12. So, for each acceptance level we are going to create a different group, which gives us 13 groups in all. Depending on the situation, different grouping strategies can be used: we can do the range-based grouping, or we can do this count-based grouping.
The count-based grouping may seem less suitable, but in our exercise we are going to perform it; the range-based grouping could be another mechanism. However, for a range-based grouping we would have to justify it, we would have to try and understand why those particular ranges were chosen: probably the locations in the lowest range have lower levels of acceptance of promotional offers, those in the middle range have a medium level of acceptance, and, as per our data set, the locations in the highest range have a slightly higher level of acceptance.

In this fashion also we can perform the grouping. As you can understand, we started with 96 pin codes; from those 96 we could end up with three groups using ranges, or with thirteen groups using the count-based approach. For our exercise we are going to perform the latter, though the former can also be done. Once this is understood, we will have to do a few more computations so that we are able to place all the records into the appropriate new categories that we are going to create.

What we are going to do is treat the acceptance count of a pin code, the count that we have just computed, as its level. So, if a particular pin code has a zero acceptance count, its level becomes 0; pin codes with an acceptance count of one get level 1; and those with an acceptance count of 5 or 10 or 12 get the corresponding level. All the locations are assigned a level depending on their acceptance counts. This is mainly to simplify our coding and computation, so that we can easily group them.
(Refer Slide Time: 06:56)
In this particular loop, what we are trying to do is assign the count of a pin code as its level. Pin codes having the same count will have the same level, and that is how they will be grouped. In the loop you can see x in pin names, so the loop will run once for each pin name, that is, each pin code. First we compute the indices where the pin code equals x, that is, we select all the records having the same pin code. Once the indices of those records are known, stored in the index variable, the pin code of those records is assigned this number, which is nothing but the acceptance count for that particular pin code.

You can see c_pin_code, where we have the counts; we again identify the indices where the same pin code appears, and once this is known, the count is repeated, using the rep function, as many times as the length of the index vector. So, all the records where the same pin code appears receive that count, and it is assigned at the indices we have already computed. Let us execute this code and then it will be clearer; a sketch of the loop follows.
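A minimal sketch of this relabelling loop, under the same assumed names as before:

```r
# Replace each pin code with its acceptance count, so that pin codes with the
# same count end up at the same level (assumed column name: pincode)
for (x in pin_names) {
  index <- which(as.character(df1$pincode) == x)   # records with this pin code
  count <- c_pin_code[which(pin_names == x)]       # acceptance count for x
  df1$pincode[index] <- rep(count, length(index))  # assign the count as the level
}
```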
(Refer Slide Time: 08:50)
Once it is done, you would see df1 pin code; you can also look at the environment section. If you print this variable, you would see all the records, 5000 of them. All the values in the pin code variable are now counts: 7, 3, 4, 4 and so on. These counts represent the acceptance level of the corresponding location, and if the count is the same, the records are at the same level and can be easily grouped. In this fashion, we are grouping them.

Now, let us convert this variable into a factor variable. If we look at the values of the variable again once it has been converted, you can see the levels 0 to 12, so 13 levels are there, and the variable is now a factor. There are other variables that we wanted to transform into factor variables: education, which had three levels, the promotional offer and also online. So, let us convert them.
Let us look at the structure now. Education, I think, was already a factor, so we repeated the exercise for it. You can see that the promotional offer is now appropriately shown as a factor variable, and pin code now has 13 levels, so we have brought the dimensionality down from 96 to 13; one of the levels is going to be taken as the reference category. Education and online have 3 and 2 levels respectively. So, now all the variables are of their desired type.
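The conversions described above could look roughly like this; the column names are assumptions carried over from the earlier sketches:

```r
df1$pincode     <- as.factor(df1$pincode)      # 13 levels: 0 to 12
df1$promo_offer <- as.factor(df1$promo_offer)  # 2 levels
df1$education   <- as.factor(df1$education)    # 3 levels (possibly already a factor)
df1$online      <- as.factor(df1$online)       # 2 levels
str(df1)                                       # check the resulting variable types
```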
Now we can go ahead and start with our partitioning exercise. In this particular data set we have 5000 observations; out of these, we will take 50 percent, that is 2500 observations, as the training partition; out of the remaining 2500 observations, we will take 1500 observations as the validation partition and the remaining 1000 observations as the test partition.

So, let us sample. The partitioning in this exercise is slightly different: in the previous lectures on other techniques, when we did partitioning we just created two partitions, training and test. There, the training indices were randomly drawn using the sample function and the remaining indices were assigned to the test partition. Now, if you look at these four or five lines of code for partitioning, first we randomly draw 2500 observations for the training partition.

Let us do this. You can see part idx has been created in the environment section, an integer vector of 2500 values. These observations can now be safely assigned to the training partition, and the training partition is created with the randomly drawn 2500 observations of all nine variables. Next, we again call the sample function, and this time on the remaining observations: you can see the indices vector where we take minus part idx, that is, the remaining indices. Out of those indices we again randomly draw 1500 further observations for our validation partition. In this fashion we create this second index.
(Refer Slide Time: 12:57)
Because of the way we have randomly drawn these indices, with replacement set to false, there is no overlapping observation between the training and the validation partition. If you want to check this, you can use the intersect function: it will tell us whether part idx and part idx 1 share any values. If we run the intersect function here, you would see no values, so these are two different sets of indices. Now we can safely create our validation partition by selecting the 1500 randomly drawn indices. The remaining indices, that is, the ones left after removing both part idx and part idx 1, are going to form the test partition. In this fashion we can do our partitioning.
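A sketch of this three-way partitioning, assuming df1 has 5000 rows; the seed is an illustrative addition for reproducibility:

```r
set.seed(1)                                               # assumed, for reproducibility
part_idx  <- sample(1:nrow(df1), 2500, replace = FALSE)   # training indices
df1_train <- df1[part_idx, ]

remaining <- (1:nrow(df1))[-part_idx]                     # indices not used for training
part_idx1 <- sample(remaining, 1500, replace = FALSE)     # validation indices
intersect(part_idx, part_idx1)                            # integer(0): no overlap
df1_valid <- df1[part_idx1, ]

df1_test  <- df1[-c(part_idx, part_idx1), ]               # remaining 1000 observations
```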
Now we come to the next part: once our partitions have been created, we can use the training partition to build our model, similar to the exercise that we did for the sedan car data set. We are using the rpart function, and within it the first argument is the formula. In the formula, you can see promo offer is our outcome variable, and it is being modelled against all other variables, which are the predictors. Method is class, for a classification model. The data is appropriately specified as df1 train, the training partition. The rpart.control function, which we talked about in the previous lecture, controls certain aspects of our tree model: its complexity parameter is 0, because we want to grow a full grown tree; minsplit is two observations, minbucket is one observation, and xval is 0. All these parameters we have already talked about.

If we do not specify xval as zero, the default value is 10, and some observations would then be used for cross validation by the rpart function, which we do not want. We would like to use all the observations just for building the model; for validation we have the validation partition. So, we do not want any observations used for the cross-validation exercise that is built into the rpart function, and xval has to be 0. The other parameter is split, set to the gini metric that we have discussed in previous lectures. Let us execute this code and build the model.
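A sketch of the rpart call with these settings, using the assumed names from the partitioning sketch above:

```r
library(rpart)

mod1 <- rpart(promo_offer ~ .,                 # outcome modelled against all predictors
              data    = df1_train,
              method  = "class",               # classification tree
              parms   = list(split = "gini"),  # impurity measure
              control = rpart.control(cp = 0,        # no complexity penalty: full tree
                                      minsplit = 2,  # split even two-observation nodes
                                      minbucket = 1, # leaves may hold one record
                                      xval = 0))     # no built-in cross-validation
```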
Now mod1 is created, as you can see in the environment section. Let us look at the tree. We first set the graphics parameters, the margin, the outer margin and xpd; you can look at the par function for more detail. xpd basically controls clipping so that the plot can be drawn in the device region; if you are interested in more detail, you can find it in the help section. Once the graphics parameters are set, we can generate the plot.

This is our basic plot; let us add the labels. You can see the plot is quite messy here. This is what we were talking about: if we have a very large data set and we generate a full grown tree, it is going to be quite messy. You can see there are too many splits, because, as we discussed for the full grown tree, we keep partitioning the observations; in the sedan car data set exercise we kept on partitioning till all the observations were classified correctly.

The same kind of thing happens in a full grown tree: we continue to build the tree model till all the partitions we create are homogeneous, that is, all the observations in a partition belong to the same class. Because of that there are too many partitions, and the full grown tree is going to be quite big, as you can see here. If you want a nicer or prettier version of this plot, prp, which we have used previously, is the function that can be used; the relevant package we have already talked about.
We can generate this, and you would see that this is another way of representing the full grown tree. It is a slightly better version, but again, because the tree is quite big, this one also looks messy. However, we can still look at a few things: for example, the first split is done using the income variable, with a value of 101.5, and then the other splits use spending and education, then further spending, income, family size and income. You can also see pin code, and the different categories of pin codes used in a split are listed separated by commas. Had we used pin code as an ordinal variable, then we would have seen a numeric kind of split value; since it already had too many categories, we could have treated it as a numeric variable, and then the split would have had some numeric value. Because we have treated it as a categorical variable, we see specific categories as part of different sub-trees.

We will discuss more on this as we go along; let us come back. The snipping exercise that we had done with the previous data set is something similar to what we will have to perform in this case, because the full grown tree is quite large here. Suppose we want to see just the first four levels, so that we are able to understand what the rules are, what the important variables are, and how the splits are happening. How can we go about this? First we need to do the node numbering, as we discussed for the previous data set.
You can see the node numbering has been done; all the nodes have now been numbered 1, 2, 3, 4, 5 and so on. Because this tree is quite large, most of the node numbers are visible in this case; earlier some of the numbers were missing, but now you can see all the initial node numbers, 1, 2, 3, right up to 13, 14, 15, quite in sequence. Using these node numbers, we can snip off the parts of the tree that we are not interested in.

Toss is the second argument that we use in the snip.rpart function, as you can see in this particular line. We need to create this argument, a variable containing the actual node numbers which we want to snip off. Let us first compute toss 1. This uses mod1 and its frame: we talked about the rpart object, and one of its attributes is frame, whose row names actually hold these node numbers. For more detail you can always look up the rpart object page in the help section, where you can see the description of frame.
For any rpart object there is going to be a frame attribute, which is a data frame with one row for each node in the tree, and the row.names of frame contain the unique node numbers, which follow a binary ordering indexed by node depth. These are the node numbers that are shown in the tree; the same numbers are stored in the frame attribute. From row.names we extract that information and then convert those names into an integer vector.

Let us compute this toss 1. You can see toss 1 has 109 values, since the full grown tree that we built has 109 nodes. The unique node numbers assigned to all those nodes by the algorithm have been captured. Once these numbers have been captured, we can sort them. All of this we are doing so that we are able to identify the nodes which we want to snip off.

Before we sort, these numbers might not be in order. If you have a look, you would see 1, 2, 4, 8, 16, 17 and then suddenly 34. The way they are recorded in row names is slightly different: 1, 2, 4, 8, then 16, 17, then 34, 68 and so on. In this fashion the row names are recorded, which is a slightly different ordering, and therefore we will have to sort them.
Let us create another variable with the sorted values. Once these values have been sorted, we would like to identify the nodes which we want to get rid of. How can we do this? Let us again zoom into the full grown tree with node numbers. For example, take the first four levels, levels 1, 2, 3 and 4. After these four levels, we would like to get rid of the remaining part of the tree, that is, the nodes starting from node number 16. Going back to the code, you can see the same thing mentioned for toss 2: from node number 16 up to the last node, the length of toss 2. From these node numbers, starting at 16 and going to the last node, I would like to get rid of the nodes, so that we see just the first four levels.
(Refer Slide Time: 24:42)
Let us compute this. This is done. Now, as you can see, I am using the snip.rpart function here, on mod1, with the toss 3 argument containing the nodes to get rid of. So, this is done.
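A sketch of the node-numbering and snipping steps; the variable names toss1, toss2 and toss3 follow the lecture's naming, and snipping at everything from node 16 onwards leaves only the first four levels visible:

```r
toss1 <- as.integer(row.names(mod1$frame))   # unique node numbers in the full tree
toss2 <- sort(toss1)                         # put them in increasing order
toss3 <- toss2[toss2 >= 16]                  # nodes from number 16 onwards

mod1_snip <- snip.rpart(mod1, toss = toss3)  # snip the tree at those nodes
```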
Now we can use the prp function to print the tree, and you would see just four levels of it. This is quite clear and easy to understand. You can see the first split is income less than 101.5, then there is a split using spending, and a split using education on the right part. On the left part there is spending, then further spending, income, family size and income. Income and spending remain the two important variables here; family size is also there, and pin code is also visible in this part. So, income, spending, education, family size and pin code are the important variables; however, income and spending seem to occur more often.

Now, since the full grown tree has been developed, to understand more about it we can look at a few more things, for example the number of decision nodes. Again, for the rpart object, the splits attribute contains the information about the variables used for the splits, so from this attribute we can obtain the number of decision nodes: 54 decision nodes have been used. Then we look at the terminal nodes. The total number of nodes can be found from frame, since, as we saw in the help section, frame contains a unique row for each node.
Therefore, it can give us the total number of nodes, 109, which we already know. Once we subtract the number of decision nodes, we get the number of terminal nodes, 55. So, the number of decision nodes is 54 and the number of terminal nodes is 55, one more than the number of decision nodes. This is a property of binary trees: the number of terminal nodes, or leaves, is always one more than the number of decision nodes.
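One way to compute these counts from the rpart object, shown as a sketch:

```r
n_total    <- nrow(mod1$frame)                 # one row per node, e.g. 109
n_decision <- sum(mod1$frame$var != "<leaf>")  # nodes that actually perform a split
n_terminal <- n_total - n_decision             # leaves: always one more than splits
```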
Now, suppose we are interested in a table containing the variables which have been used for splitting and the corresponding split values, that is, the predictor-value combinations. By default, the output that we get from the summary function is more descriptive; if we would like a tabular output, we have to write the code for it ourselves. That is essentially what we are trying to do here: capture the split variable information and the split value information, in particular the split value for each split variable. As we saw, in the rpart object the splits attribute has the information on the variables used for the splits, so for all those variables we want to extract the split values from the model.
(Refer Slide Time: 28:20)
So, let us compute the split values. We have two counters: j is the counter for split variables and i is the counter for split values. Then split value is the variable where we are going to record all the split values. Let us initialize this variable and also the counters. In the loop you can see x in mod1$frame$var, so the loop runs over all the variables recorded in frame, since frame has this information for every node. Then we place a check using as.character on x against the leaf marker.

If the node is a leaf node, we would like to skip it and go to the else part; if it is not a leaf node, we continue. Within this, the split variable could be a factor or numeric, so we do another check. If the split variable is not a factor, that is, it is numeric, then the split value can simply be found using the splits attribute that we discussed: its index column contains the split value, and j is the counter for split variables, so for that split variable the value in the index column is immediately recorded here. The else part of this inner check essentially deals with factor variables.
(Refer Slide Time: 30:11)
If we go to this part, we have another variable, cl, initialized as NULL, for the case where the split variable is a factor. k is another counter, ranging from 1 to the largest number of levels among the factors; of the factor variables we have, pin code had the largest number of categories, thirteen, so k will lie between 1 and 13. We run a loop over those categories: you can see k in 1 to the number of columns of csplit, the attribute that actually contains the information about factor splits. Within this loop we create a temp variable in which we record this information using the index.

For a particular factor variable and its levels, there are going to be different categories, and the information recorded in temp tells us whether a particular level goes into the left child. For a categorical variable with, say, four categories, once the tree is being constructed and that variable is the split variable, we have to check which categories go to the left side and which go to the right side: whether a goes to the left, and b, c, d go to the right, for example.
The same information is being captured in this code. Once we have checked whether a particular level, represented by k, goes into the left child, the left branch, that level is recorded in the cl variable that we had created. k runs over all the levels, up to the maximum number of levels, so it can be used for all the variables and all their levels. Then, in the next line, in the else part itself, this loop runs for all the levels, and we end up recording all the levels that go to each side.

Those levels are nothing but the specific values for the categorical variable, just as a numeric variable has a specific value used for the split. The different levels, which level has gone to the left part and which has gone to the right part, actually represent the split value. You can see the split value variable that we had initialized: split values for numeric variables are stored in the if part, and in the else part we store the values for categorical variables. So, in this part of the code we keep storing the values for the categorical variables.
Now, the second else part, the outer else section, is for the leaf node. If we arrive at a leaf node, then we assign the split value as NA, because a leaf node does not create a split and will not have any split value, and then we continue with our counter. So, this is the code; this is how we can go about extracting the split values.
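A rough sketch of this extraction, not the lecture's exact code. It walks mod1$splits in step with mod1$frame, skipping the competing and surrogate splits that rpart also records, and uses csplit for categorical predictors; the training data frame name df1_train is an assumption carried over from the earlier sketches.

```r
split_var   <- as.character(mod1$frame$var)    # variable at each node ("<leaf>" for leaves)
split_value <- rep(NA_character_, length(split_var))
row <- 1                                       # current row in mod1$splits
for (i in seq_along(split_var)) {
  if (split_var[i] != "<leaf>") {
    if (!is.factor(df1_train[[split_var[i]]])) {
      # numeric predictor: the cut point sits in the 'index' column
      split_value[i] <- as.character(mod1$splits[row, "index"])
    } else {
      # categorical predictor: 'index' points into csplit; value 1 = level goes left
      crow <- mod1$splits[row, "index"]
      lev  <- levels(df1_train[[split_var[i]]])
      split_value[i] <- paste(lev[mod1$csplit[crow, seq_along(lev)] == 1],
                              collapse = ",")
    }
    # jump past the competing and surrogate splits recorded for this node
    row <- row + 1 + mod1$frame$ncompete[i] + mod1$frame$nsurrogate[i]
  }
}
```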
So, we will stop here and continue our discussion in the next lecture from the same point.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 41
Classification and Regression Trees- Part VI
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lecture we were discussing classification trees, and specifically we were doing an exercise in R. Let us start from the point where we left off in the previous lecture: we were building a classification tree model using the promotional offer data set. Let us redo a few steps so that we are able to resume from the same point where we left off. Let us reload the promotional offer data set that we used in the previous lecture and import it into the R environment. As we saw, this is quite a large data set, 5,000 observations, so it will take slightly more time.

We also discussed the pin code: how we were actually using it and how we were grouping its different categories. Let us repeat some of those steps so that we are able to reach the same point. So, pin code, promotional offer.
(Refer Slide Time: 02:00)
Online, all of these have to be converted to factor variables as we did in the previous lecture, followed by the partitioning, and then we build the model again. We need to load the rpart library; once it is loaded, we can build the model. We had also seen the tree diagram using different functions, the plot function as well as the prp function.
(Refer Slide Time: 02:44)
Then we started our exercise of finding the split values for the different partitions. Let us go back; this was where we stopped, computing the split values. This discussion we have already done. Once we have extracted these split values for every split variable, that is, the variables that were used to develop the full grown tree, we can build a data frame which will give us some useful information. With the split values that we had computed, we can now see here the same thing which we saw in the full grown tree, where we could actually follow from the root node what the split variable and value combination was, and then the other splits with their predictor and value combinations. The same information is represented here in a tabular format, and that is why we required the extracted split values, so that we could create this table. You can see the node number, the unique node number as we discussed for the tree diagrams, and the split variable that has been used at that particular node.

There is also the corresponding split value that has been used to create the split, the number of cases and the class for that particular node. In this fashion we can look at each of the different nodes: for node number two the split variable was spending, this is with respect to the full grown tree as we saw in the previous lecture, and we can see the split value for spending and the number of cases in that node. Now we want to have a look at the full tree diagram again.
(Refer Slide Time: 04:56)
That was quite a big tree model, and the prp function will give us the tree diagram. We need to load the right library to be able to use this function; once the package is installed, we will be able to use it. Let us reload the library; the name was different, that was the problem, it was actually rpart.plot, which provides the prp function. So, we have to use that, and now the function is available to us. Let us plot this; you can see it was quite a big plot.
(Refer Slide Time: 06:33)
For the split variable and value combinations, this is the table that we have generated. From here, along with the tree model, you can get a better sense of what happened in our tree diagram. You can see the first node, the root node, with income less than 101.5; that is node number 1, and the same value is visible in the table as well. Similarly, the second node splits on spending, and the same thing is presented here; you can see the value is also the same.

The information in the tree diagram, specifically the split variable and split value, can also be presented in this tabular format to understand what the important predictor-value combinations are. This is especially useful if we have quite a big tree, such as a full grown tree; in those situations the tabular format might be more convenient. You can see the next node number is 4, which comes here, and its split variable and value combination is spending less than 2.59. Then there is node 1, node 4 and then node number 8; in this fashion you can follow the table.
So, this is the table for the split variable and value combinations. Let us move forward. Once we have built this model, let us look at its performance on the training partition itself. Since this tree model is a full grown tree, what we expect is that all the observations will be correctly classified, because when we develop a full grown tree we keep creating partitions, we keep doing splits, till we are able to create pure homogeneous groups, rectangles or partitions. predict is the function, as we have been using with previous techniques as well; it can be used to score a particular data set. So, we pass the model mod1 and again score the training partition itself, dropping the third column, the outcome variable, with minus c(3).

We are not including the dependent, or outcome, variable in the data passed for scoring, and the type is class. With this we can look at the performance; this is the classification matrix showing actual versus predicted values. All the 2251 actual class 0 records have been correctly classified as class 0, and you can see the 249 actual class 1 records have also been correctly classified.

If we compute the classification accuracy and error, you would see they are 1 and 0 here, that is, 100 percent accuracy. The performance is 100 percent because the tree model is fitting the data completely: the whole training partition has been fit with 100 percent accuracy and 0 percent error.
Now, let us look at the performance of this full grown tree model on the validation partition and the test partition. We have already created these two partitions, so let us apply the same function.
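A sketch of this scoring step as a small helper function; the assumption that the outcome variable sits in column 3 and is named promo_offer carries over from the earlier sketches, and the accuracies in the comments are the ones reported in the lecture's run.

```r
score_partition <- function(model, data) {
  pred <- predict(model, newdata = data[, -3], type = "class")  # drop the outcome column
  cm   <- table(actual = data$promo_offer, predicted = pred)    # classification matrix
  acc  <- sum(diag(cm)) / sum(cm)
  list(matrix = cm, accuracy = acc, error = 1 - acc)
}

score_partition(mod1, df1_train)   # expected: accuracy 1, error 0 (full grown tree)
score_partition(mod1, df1_valid)   # about 97.7 percent accuracy in the lecture's run
score_partition(mod1, df1_test)    # about 97.4 percent accuracy in the lecture's run
```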
(Refer Slide Time: 10:15)
After predicting, let us look at the classification matrix; here you would see a few errors: 22 class 0 records have been incorrectly classified as class 1, and 12 class 1 records have been incorrectly classified as class 0. Let us look at the accuracy and misclassification error numbers: you can see the accuracy comes to about 97.7 percent and the misclassification error to roughly 2.3 percent.

So, there is this much difference in the performance of the model when we apply it to new data. This is of course expected, because it is not possible for a completely over-fitted model to perform equally well on new data. However, 97.7 percent is still quite good performance; it is just expected to be lower than on the training partition. Now, let us look at the performance on the test partition; the accuracy is 97.4 percent in this case. If we compare with the previous performance, in the validation partition it was 97.7 and here it has decreased to 97.4. So, comparing training to validation to test, we go from 100 to 97.7 to 97.4; the performance is good for all the partitions, but it is decreasing.

If we are interested in looking at the information the way R presents it, we can use the summary function, as we have been doing for other techniques, and look at the details. This is quite a big output, the reason being that there are too many nodes, because this is a full grown tree. You can see the summary output here.
(Refer Slide Time: 12:25)
The summary shows the call of the function, details such as the number of observations, and then the CP table. The CP table contains the complexity parameter values; we will discuss what these values are about in detail later on. nsplit is the number of splits, and then there is the relative error; I will discuss all three of these terms later on.

Now let us look at the variable importance. If we compare this to the previous data set that we had used, the sedan car data set, we had just two variables there; now, in this case, we have quite a few variables. Looking at the variable importance, income comes out to be the most important variable with 44 percent importance, followed by education, then family size, then spending, then pin code and age. From this we can also understand, as we have said, that classification and regression trees can work as a variable selection or dimension reduction approach. From here we get a rank-ordered list of the variables which are more important, and using that list along with domain knowledge and expertise, and looking at these numbers, we can understand which variables are important and which to retain in the model.

Now, let us look at the splits and how the partitions are being created. For node number one we can see some specific details: the complexity parameter value, the predicted class, the expected loss and other details. The expected loss is nothing but this: the root node is classified as class 0, because the majority of its members belong to class 0, so the loss is the proportion of records which belong to class 1 and would be incorrectly classified.
That is the loss in this case. The node probability of this node is 1, this being the root node. The class counts are also there, and the probabilities as well, as you can see here; you can see that 1926 observations will go to the left son and the remaining 574 observations to the right son.

The same thing you can visualize in the tree diagram itself; however, these node details are not given there because of space constraints in the diagram. Now, the primary split that has been performed is income less than 101.5, which we can see in the tree diagram as well, and you can also see the improvement value here, that is, the improvement expected if this split is done. Similarly, for node number two we can look at the other details: 1926 observations, the complexity parameter value and so on, as we saw for the root node.

Now look at this split, spending less than 2.86, with improvement 3.31; this is node number 2, and the split is performed in this fashion using this predictor-value combination, for which you can look at the improvement and other details. Similarly, this detail is available in the summary output for all the node numbers. The significance of some of these numbers that we have been talking about, for example the complexity parameter, we will discuss later in the lecture, as mentioned. Now, that brings us to our next part, that is, pruning. Let us go back to the slides.
As we discussed in some of the earlier lectures on classification and regression trees, there are two primary steps in the CART procedure, the CART algorithm: the first is recursive partitioning and the second is pruning. Till now our discussion has revolved around recursive partitioning, where we have tried to create the full grown tree and achieve homogeneous subgroups.

Now, as we saw, once we develop a full grown tree it over-fits the whole data. So, how do we avoid this? Pruning is the next step that can help us avoid over-fitting; that is why we perform pruning. A full grown tree leads to complete over-fitting of the data and, as we have been saying, to poor performance on new data: because of the over-fitting, the performance on new observations will decrease.

Now, for tree models, and in general for the other models we have been talking about, the focus is typically on overall error; we always look to minimize the overall error of classification models, including tree models. So, what happens when we try to minimize the overall error? For any technique, and for tree models as well, this error is expected to decrease up to the point where the relationships between the outcome variable and the predictors have been fitted.
As long as the relationship between the outcome variable and the set of predictors is being captured, being tapped, by the model, the overall error will continue to decrease, as we keep building our model using that relationship and the information in those predictors. However, there is going to be a point where all the predictor information that could be captured, depending on the strengths and weaknesses of the particular technique, has been captured, and this holds for tree models as well.

At that point the model will start fitting to the noise: once all the predictor information that could be useful for the classification or prediction task has been used, the model will start fitting the noise, and the overall error will start increasing. Now, why does this happen, specifically for tree models? As you can see in the slide, it is due to splits involving a small number of observations.
As we keep doing splits using predictor-value combinations, initially every split gives us a substantial reduction in overall error. But as we go along, most of the predictors' information has been built into the model, and once we start fitting to the noise we also reach the point where we are dealing with a small number of observations: we keep creating more and more splits even though most of the observations have already been put into the right rectangles, the right partitions, and we are chasing the few observations which are still to be correctly classified.

We saw this in the exercise using the sedan car data set, where we kept creating partitions using the graphic: eventually, even if there was just one point incorrectly classified in a particular group, we created a partition for that single point. So, you can see that we end up dealing with a small number of observations as we do too many splits.
Because of this, tree models, and other models in general, start fitting to the noise after some point, once the predictor information has been modelled, and the overall error then starts to increase. The same thing happens when we develop a full grown tree: we keep on splitting, we keep on creating partitions until we generate pure homogeneous groups, and some of the last splits, which are based on just a few observations, are actually fitting to the noise.

So, how do we overcome this complete over-fitting of the data? There are two approaches. One is to stop tree growth before it starts over-fitting the data or fitting the noise.
(Refer Slide Time: 22:13)
We have to find the point where the tree has stopped using the predictors' information to build the model and has instead started over-fitting the data or the noise. How can that be done? One criterion is the number of splits or the tree depth: if we can find the optimal number of splits or depth level, the number of splits after which over-fitting or fitting to the noise starts to happen, then probably we can stop the tree growth there and we will get the optimized tree.

A second criterion could be the number of observations in a node required to attempt a split: how many observations should be there before a split is attempted. If, for a particular problem and data set, keeping everything in mind, we are able to determine that, then we can also use it to stop tree growth. A third criterion is an accepted level of reduction in impurity: once we create a split we expect a certain reduction in impurity, measured using the two important impurity metrics we talked about, the Gini index and the entropy measure. So, what is the level of reduction that we expect once we create a split and form the different subgroups?
If we are able to determine an accepted level of reduction in impurity, that too can be used to stop tree growth; this is another criterion. Now, for these three things that we talked about, whether it is the optimal number of splits, the optimal number of observations in a node to attempt a split, or the accepted level of reduction in impurity, you would see that with all three approaches it is difficult to determine these thresholds: how do we determine the right number of splits or the right number of observations?
The second approach is to prune the full grown tree back to a level where it does not over-fit or fit the noise. First we develop the full grown tree, and then we start pruning it back to the level where it does not over-fit the data or fit the noise. One way this can be performed is using the validation partition: we construct our tree using the training partition, and then we start pruning it back using the validation partition.

In the starting lectures of this course we talked about the importance of partitioning into training, validation and test partitions, but for some of the techniques that we have tried so far we had been creating just two partitions, training and test. In this particular technique you see the importance of the validation partition: the training partition is used to build the tree, the full grown tree, and the validation partition is used to refine and fine-tune the model, that is, to decide how far to prune back the tree so that it stops fitting the noise or over-fitting the data. The validation partition therefore also becomes, in a way, part of the modelling process, and hence the importance of the test partition, which acts as the new-data partition on which we can apply our model and check its performance.
So, in this technique both the training partition and the validation partition become part of the modelling process. In the pruning approach that we just discussed, the idea is to remove tree branches: when we talk about pruning the tree back to a certain level, the idea is to remove the branches which do not reduce the error rate any further. If some branches are not able to decrease the overall error, then probably we should remove them.

That is the main idea. To implement this approach we need to be able to identify or determine the branches or nodes which are not decreasing the overall error further, and then remove them, snip them off. With this, let us look at a few other aspects of pruning, as we talked about.
We find the point where the error rate on the validation partition starts to increase. As discussed, we construct the full grown tree using the training partition and prune it back using the validation partition. So, we build the model using the training partition, then score the validation partition, check the error rate and find the point at which it starts to increase; that is the level up to which we have to prune the tree. Another concept related to this pruning is the cost complexity parameter, or complexity parameter CP, which is generally used in the CART algorithm.
Complexity parameter values are typically used to perform the pruning in many implementations of CART. So, what is the complexity parameter? If you look at the definition here, CP is error, nothing but the overall misclassification error that we have been talking about, plus PF times TL, where PF is the penalty factor and TL is the tree length or tree size: CP = error + PF x TL.

So, for the tree size we introduce a penalty factor, and error plus this penalty gives us the CP value. If there are two candidate tree models and they both have the same misclassification error, but the two models have different tree sizes, that is, different numbers of nodes, then it is the smaller tree model that would be selected based on the complexity parameter, because there is a penalty on tree size. The model having the bigger tree length, the bigger tree size, would have an additional value added to its error component, and therefore its CP value would be higher than that of the tree having the same misclassification error but a smaller size.

So, the complexity parameter, or cost complexity parameter, can in a sense be used to control the size or length of the tree, and it can also be used to prune the tree back. As we have talked about, one approach is to use the validation partition, track the error rate and identify where it starts to increase; the second approach is to compute the complexity parameter values and use them to find the optimal length of the tree.
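A toy illustration of the cost-complexity idea, not taken from the lecture's code; the error, penalty factor and tree sizes are made-up numbers:

```r
cp_value <- function(error, penalty_factor, tree_size) {
  error + penalty_factor * tree_size          # CP = error + PF * TL
}

cp_value(error = 0.023, penalty_factor = 0.001, tree_size = 55)  # bigger tree: 0.078
cp_value(error = 0.023, penalty_factor = 0.001, tree_size = 20)  # smaller tree: 0.043, preferred
```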
So, we will stop at this point, and we will continue our discussion on pruning in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 42
Pruning Process- Part I
Welcome to the course Business Analytics and Data Mining Modeling Using R. Let us continue our discussion on classification and regression trees. In the previous lecture, we were discussing pruning, the second step of building classification tree models, and the different approaches to avoid over-fitting in the full grown tree model.

We talked about the pruning approach and two ways of doing it: using the validation partition to find the exact point up to which we should prune back the tree, or using cost complexity, or complexity parameter, values to control the tree length. Now, the next related concept is the minimum error tree. As we discussed, in one approach the validation partition can be used to find the point from where the error starts to increase.

The tree with the minimum classification error on the validation partition is called the minimum error tree. First we build the tree model, the full grown tree as we discussed, and then we look at different candidate models obtained by pruning it back to various levels. For the different levels of pruning we get different pruned models, which are going to be the candidate tree models. For all those models we can look at the misclassification error, and the pruned tree for which the misclassification error is minimum is going to be the minimum error tree.
Let us move forward; another related concept is the best pruned tree. This is the tree that we would like to determine and then use on our new data and for deployment as well. So, what is the best pruned tree? The best pruned tree can be determined by adjusting the minimum error tree for sampling error.

The minimum error tree can move to some extent because of sampling error, that is, due to changes in the samples. So, how do we find the best pruned tree? If we are able to adjust for sampling error, that is, if we are able to identify a range around the minimum error tree, then the best pruned tree is going to lie within that range of the minimum error tree.

That range is actually the adjustment for sampling error. Typically, how is that done? We take the smallest tree in the pruning sequence which lies within one standard error of the minimum error tree. So, we need to find the minimum error tree in the pruning sequence, and then, within one standard error of its error rate, check which trees fall in that range; the smallest tree among them is going to be the best pruned tree.
Typically, the pruning sequence is going to be sorted. So, if, let us say, the minimum error is 0.1 and the standard error is 0.01, then within the range up to 0.11 we have to find the tree having the smallest number of nodes; that is going to be the best pruned tree. Let us try and understand these concepts through an exercise in R to get more clarity. Let us open RStudio.
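Before the exercise, here is a toy sketch of the one-standard-error adjustment described above; the candidate tree sizes, validation errors and standard error are made-up numbers echoing the 0.1 / 0.01 example:

```r
size <- c(5, 9, 15, 25, 55)                   # terminal nodes of candidate pruned trees
err  <- c(0.16, 0.105, 0.10, 0.11, 0.13)      # validation error of each candidate
se   <- 0.01                                  # standard error of the minimum error

min_err   <- min(err)                         # 0.10: the minimum error tree (15 nodes)
threshold <- min_err + se                     # 0.11: the allowed range
min(size[err <= threshold])                   # 9: smallest tree in range = best pruned tree
```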
So let us start our pruning process. We have already done the modelling in the
previous lecture, where we built the full-grown tree model using the promotional offers
data set. You can see here the model that we had built.
You can see the root node and the other nodes of the tree. This is the full-grown tree:
as we said, it is quite messy, there are far too many nodes, and it is completely
overfitting the data. Now we would like to prune it back to a level where it is no longer
overfitting the data, that is, no longer fitting the noise. So let us start the pruning
process by looking at the total number of nodes in this full-grown tree. You can see
there are 101 nodes in total; let us also look at the number of decision nodes.
There are 50 decision nodes, and the number of terminal or leaf nodes is 51. As we
discussed in the previous lecture, because we are building binary trees, the number of
leaf (terminal) nodes is always one more than the number of decision nodes. That is
exactly what we see here: 50 decision nodes, 51 leaf nodes, and the total is the sum of
these two numbers, 101.
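These counts can be read directly off the frame attribute of the rpart object; a small
sketch, assuming the full-grown tree is stored in an object called mod1 as in the
lecture's script:

# Every row of mod1$frame is a node; leaf nodes carry var == "<leaf>"
n_total    <- nrow(mod1$frame)                  # total nodes (101 here)
n_decision <- sum(mod1$frame$var != "<leaf>")   # decision nodes (50)
n_leaf     <- sum(mod1$frame$var == "<leaf>")   # terminal / leaf nodes (51)
c(total = n_total, decision = n_decision, leaf = n_leaf)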
Now let us look at the node numbers. As we discussed in the previous lectures, the
rpart object has a frame attribute, and the row names of this frame are nothing but the
unique node numbers assigned to the nodes of the tree. If we check the class, these row
names are stored as character values.
So they have to be coerced into integers, because they are really node numbers. Let us
look at the first few values of these row names: 1, 2, 4, 8 and so on. Looking back at the
tree plot, if we number it, this would be node number 1, this node number 2, this node
number 4, then 8, and so on; in this fashion you can see how the row names correspond
to the node numbering. This piece of code will give us all the unique node numbers
present in the full-grown tree.
Let us execute this and store the result in toss1. If you are interested in the unique
node numbers, you can see them here. We had 101 nodes in total, so there are 101
elements in this integer vector, and the unique node numbers run 1, 2, 4, 8, 16 and so
on. Not all integers appear: only the node numbers actually present in the full-grown
tree are recorded in toss1.
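A sketch of this extraction step, again assuming the model object is called mod1 (the
names toss1 and toss2 follow the lecture's script):

# Node numbers are the row names of the frame, stored as character strings
toss1 <- as.integer(row.names(mod1$frame))
head(toss1)          # only the node numbers actually present in the tree

# Sorted version, used below to walk through the nodes in increasing order
toss2 <- sort(toss1)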
Now we sort this vector, because the idea is to identify, and then remove, the branches
that are not reducing the error. So let us sort these values and store them in toss2. You
can see the node numbers are now sorted, 1, 2, 3, 4 and so on up to 915, the last node
number. Although we have just 101 nodes, node numbers are unique identifiers, so
their range is much larger: node numbers from 1 up to 915 appear in the full-grown
tree. Once this is done, we initialize the counter for the nodes to be snipped off; as
discussed, we want to prune the tree back to the level beyond which the error does not
improve. So i is our counter, and we also initialize a list to hold the pruned model
objects along with the training and validation scoring vectors.
In these list variables we store the rpart model objects of the pruned trees, and in the
error vectors, one for training and one for validation, we store the overall
misclassification error at each level of pruning. As we keep removing branches we get
a subtree at each step, and for each of those subtrees we record the error rates.
That way we can compare the error rates later on and identify the point from which
the validation error starts to increase; we did a similar exercise in k-NN to find the
optimal value of k. So let us do the same thing here. We initialize these variables, and
now you can see that we are starting a for loop.
In this loop the counter takes values from the var column of the frame attribute of the
rpart object. If you look at the help section for the rpart object, you will see that the
frame attribute also records the variables used for splitting, so the split variables are
stored there and the counter runs over all of them. Now let us look at the next line,
which is an if condition.
The condition checks that the current entry is not a leaf and that the counter is within
the length of toss2. Recall that toss2 is the series of node numbers from which we
create the different pruned models: depending on how far we keep removing branches,
we obtain different subtree models, the different pruned models. So i is the counter, the
length of toss2, which we have already computed, is its maximum value, and at each
iteration we compute the actual nodes that we would like to snip off.
We are already familiar with the snipping function from the previous lectures,
snip.rpart. Its second argument is the set of unique node numbers that we would like to
get rid of, and the toss3 variable records exactly that: depending on the counter value,
the node numbers from position i + 1 up to the last one are the ones we want to
remove.
So we move from the smallest tree towards the full-grown tree; in that fashion we
create the different pruned models, from the smallest subtree up to the full-grown tree,
and toss3 captures the nodes to snip at each step. In the next line snip.rpart is called
with these node numbers and we get a subtree model. Once that is done, the model is
recorded in the list we created; we build many more subtree models in the same way
and later on compare their error rates.
Once each model has been built, we also score the training and validation partitions,
as you can see here, and we record the overall error on these two partitions for all of
these pruned models. After that the counter i is incremented, the next subtree is
computed, and its performance is recorded.
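A simplified sketch of this snip-and-score loop is given below. It assumes the
full-grown tree mod1, the sorted node numbers toss2 from above, and training and
validation data frames df_train and df_valid whose outcome column is called target;
the partition and column names are placeholders, not the lecture's actual ones, and the
lecture's loop additionally skips leaf entries.

err_train_v <- numeric(0)
err_valid_v <- numeric(0)

for (i in 1:(length(toss2) - 1)) {
  toss3 <- toss2[(i + 1):length(toss2)]        # node numbers to snip off at this step
  sub_tree <- snip.rpart(mod1, toss = toss3)   # pruned subtree

  pred_train <- predict(sub_tree, newdata = df_train, type = "class")
  pred_valid <- predict(sub_tree, newdata = df_valid, type = "class")

  err_train_v <- c(err_train_v, mean(pred_train != df_train$target))
  err_valid_v <- c(err_valid_v, mean(pred_valid != df_valid$target))
}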
So let us execute this loop; it will take some time to compute. Now that we have the
results, we create a table with the number of splits and the corresponding errors on the
training and validation partitions. Let us look at this table, which is based on the
computation we have just done.
(Refer Slide Time: 15:08)
You can see, for one decision node, the error on the training partition and the error on
the validation partition; for two decision nodes, the corresponding errors; and so on.
The training error is decreasing, and the same is true for the validation error, but the
rate of decrease in the training partition is much higher. The errors keep decreasing,
and in the validation partition we are interested in the minimum value, 0.016.
The validation error decreases up to the model with 9 decision nodes, where it reaches
0.016; for one more node the error stays the same, and then it starts to increase, to
0.018. So the models with 9 and 10 decision nodes are the two candidates recording the
minimum error on the validation partition.
These two record the minimum error, and after that the validation error starts to
increase. The training error, however, still keeps decreasing, because from this point
onwards we start overfitting the data, fitting to the noise. So the error keeps decreasing
on the training partition while it starts increasing on the validation partition.
As we go all the way down to the full-grown tree, the training error reaches 0, while
the validation error ends up much higher than the minimum we saw. At this point I
should also mention that this table is based on the node number sequencing: you can
see the node numbers 1, 2, 3 and so on, because we had sorted the node numbers. The
sequence used to snip off the nodes is that sorted sequence 1, 2, 4, 8, 16 and so on; the
raw version was toss1 and the sorted one is toss2.
(Refer Slide Time: 17:47)
So this is the sorted sequence of node numbers, and the snipping, the branch removal,
in this example follows that sorted sequence. However, what we should really do is
snip off branches in the pruning sequence, the order in which the tree grew, based for
example on the complexity parameter values. Once the first split has been created, we
need to determine whether the next split should be on node number 2 or on node
number 3. Let us understand this.
For the root node, suppose a split has been performed on predictor 1 with value 1. We
then have two options for the next split: we can always find an optimal split for node
number 2 and an optimal split for node number 3. Which should be the next split? That
is determined by the impurity reduction, say P2, V2 on one side and P3, V3 on the
other. If the impurity reduction in the right subtree is larger, then split number 1 is
followed by split number 2 there, and for the third split we again compare the
remaining nodes to see which gives the larger reduction. That ordering is called the
pruning sequence, or, a better term, the nested sequence.
So we have the most important split, the second most important split, and so on: a
sequence of splits ordered by importance, where importance means impurity reduction.
That is the sequence we should actually use to remove branches. In our exercise,
however, we simply go in sorted node number order: node number 1, then node
number 2, then node number 3, and so on.
This sorted order is not the same as the sequence of splits based on importance. The
exercise can still be done, but what we are doing here is just following the node
numbers, not the nested sequence based on importance. Ideally the least important split
should be removed first; that is how the tree should be pruned, whereas we are
following the node number order.
Even so, this lets us understand how a tree can be pruned back to some level using the
different functions available in R. With that, we can move further: we can look at the
tree after the last snipping we did, and we would like to plot the error rates that we have
just computed. So let us look at the plot.
(Refer Slide Time: 22:00)
This is the plot of error rate versus the number of splits that we have just computed.
These computations are, as just discussed, based on the node number ordering and not
on importance. Even so, we can see that for the training partition the error keeps
decreasing, while for the validation partition the error decreases and then, at some
point, starts to increase. Going back to the table we generated, the minimum was at the
models with 9 and 10 decision nodes.
Those two points, for 9 and 10 decision nodes, are somewhere here on the plot, where
the validation error is at its minimum. So this is the region corresponding to the
minimum error tree, and within one standard error of this point we can determine the
best pruned tree. Beyond this point the picture is clear for the remaining splits.
After the 9th and 10th splits, all the remaining splits are essentially fitting the noise,
overfitting the data. With this, let us move forward. What we do now is identify the
minimum error tree and then the best pruned tree. We can simply use the min function
to find the error of the minimum error tree, which is 0.016.
This matches what we found by looking at the table. Next we want the number of
decision nodes in that tree, which we can also find from the code: we look for the
number of splits at which the error is minimum, and we get 9. There were two rows
with the same minimum, 10 and 9 decision nodes, and the first of them has been
selected.
The minimum of the two was selected because the min function was applied to the
matching split numbers, and out of 10 and 9, 9 is the smaller number, so that is what is
recorded here. The minimum error tree therefore has 9 decision nodes. Now let us look
at the standard error we talked about: the best pruned tree must lie within one standard
error of the minimum error tree, and this is how we compute that standard error.
We take the variance of the validation error vector, divide it by the length of the same
vector, and take the square root; that gives us the standard error. Now, to find the best
pruned tree, we compute another value: since this is a sequence, we have to move
towards the left, towards smaller trees, and the trees we consider must have errors
within the range given by the minimum error plus this standard error. With that we can
compute the threshold value.
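Continuing the earlier sketch, and treating the position in the validation error vector as
the number of decision nodes (as in the table above), the minimum error tree and the
one-standard-error threshold can be located like this; err_valid_v is the vector recorded
in the loop sketch.

met_err  <- min(err_valid_v)                           # minimum validation error
met_size <- which.min(err_valid_v)                     # first (smallest) tree attaining it
std_err  <- sqrt(var(err_valid_v) / length(err_valid_v))
met_1std <- met_err + std_err                          # one-standard-error threshold

bpt_size <- min(which(err_valid_v <= met_1std))        # smallest tree within the threshold
c(minimum_error_tree = met_size, best_pruned_tree = bpt_size)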
Our best pruned tree will be the smallest tree whose validation error is below this
threshold, the minimum error plus one standard error. How do we find that tree? We
can use the table we have already computed, so let us do the exercise with the table
first: 0.0180 is the value we want to compare against.
We move towards the left, towards smaller numbers of nodes, and the first row we
reach is the 8th row, with eight decision nodes. That tree is selected as the best pruned
tree, because its validation error is smaller than the threshold we are comparing against.
So, within one standard error of the minimum error tree, the best pruned tree has eight
decision nodes.
The same thing can be computed with this piece of code: we look for the rows whose
error is greater than the minimum but no larger than the threshold we just computed,
and whose number of nodes is smaller than that of the minimum error tree. There could
be more than one such row, so we take the first, because it has the smallest number of
nodes. Running this, we again get eight; even had the table been a bit longer, we would
still get the right answer.
Even with two or three candidates we would get the right answer. Now that we know
eight decision nodes are required, we can follow the same process as earlier: build the
toss variable needed to snip the tree and remove all the remaining nodes. Note that here
too we are following the node number sequence.
So we keep the nodes numbered 1 to 8 and snip off the nodes from 9 onwards.
However, this is not the proper way to do it: we should remove the least important
splits first and follow that sequence rather than the sorted node number sequence used
in this example. Still, let us do it as an exercise. We compute toss3, call the snip.rpart
function with this argument, obtain the subtree, and print it; this is the subtree we have
generated. It is not the desired result, since we have simply used the node numbers for
the exercise, and with this approach we will always get a tree of this kind, fairly
balanced, because we are not following the importance-based split criterion.
Nevertheless, this tree is much smaller than the full-grown tree we had earlier.
It has now been pruned to this level, and we can also see the important variable and
value combinations and other details. Once this tree has been built, we can apply it.
We apply this tree to the training and other partitions to look at the performance of the
pruned model. Again we use the predict function: scoring the training partition gives
an accuracy of 0.974; then we test it on the validation partition, keeping in mind that
the validation partition has also been part of the modelling exercise here.
The performance on the validation partition is slightly better than on the training
partition. With the full-grown tree the validation performance kept deteriorating, but
here we should not be surprised, because this is a pruned model, so it performs at least
as well as on the training partition. What we expect is for performance to remain fairly
stable: even with the best pruned tree we expect the performance on new data, a test
partition or any other partition, to drop somewhat, but it should stay stable for any new
observations we want to predict.
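A small scoring sketch for this step, assuming the best pruned tree is stored in an object
bpt and the partitions and outcome column carry the placeholder names used earlier:

# Accuracy of a classification tree on a given partition
acc <- function(model, df) {
  mean(predict(model, newdata = df, type = "class") == df$target)
}

acc(bpt, df_train)   # about 0.974 in the lecture's run
acc(bpt, df_valid)   # slightly higher than on training here
acc(bpt, df_test)    # roughly 0.98 on the test partition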
Looking at the test partition, the performance there is around 98 per cent, which is
even better, so the best pruned tree is performing quite well on new data. However, I
would again like to remind you that we have not followed the actual process to arrive
at this best pruned tree; we will do that in a later lecture. Another important concept I
want to cover here is the complexity parameter. As I mentioned, the rpart function by
default reserves some of the observations for cross-validation: the remaining
observations are used to build the tree models, and the reserved observations are then
used to test those tree models, much like what we did here to assess performance.
Based on the errors on those reserved observations, the number of splits at which that
performance is best becomes another candidate for obtaining the pruned tree. We will
stop here and continue the exercise in the next session, where we will see how the
built-in complexity parameter in rpart can be used to obtain the pruned tree.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 43
Pruning Process - Part II
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the
previous lecture we were discussing classification trees; in particular, we did an
exercise in R on the promotional offers data set, and we focused especially on the
pruning part of the modelling.
When we pruned the full-grown tree back to a level where it does not overfit the data
or fit the noise, the pruning sequence we followed was the node number ordering, not
the nested sequence. We talked about this briefly in the previous lecture, where the
discussion went as follows: suppose this is our root node.
At the root node we have predictor 1 and value 1, the predictor-value combination on
which the split is performed, so some observations fall on one side and the rest on the
other. Similarly, for the next split, we have to see at which of these nodes the optimal
split gives the greater reduction in impurity.
Suppose the higher impurity reduction happens at this particular node, say on
predictor P2 with value V2; the first one was split 1 and this is split 2. After this split is
performed, some observations go to one side and some to the other. Again, for the next
split, we check among these three nodes which node and which predictor-value
combination gives the highest reduction in impurity.
Suppose the reduction in impurity is now highest at this node, with this predictor-value
combination; that is split 3. Again some observations go into one part and some into
the other. For the next split, among these four nodes we check which one gives the
largest reduction in impurity; say it is this node, with predictor P4 and value V4, and
that becomes split 4.
From this we want to derive the pruning sequence. If we follow the unique node
numbering that we discussed in the previous lecture, this is node number 1, this is 2,
this is 3, then 4, 5, 6, 7.
In our pruning sequence, the first split happened at node number 1, the second at node
number 3, then at node number 2, then at node number 6: that is the order in which the
first four splits happened in this example. When we prune the full-grown tree back to a
certain level, we have to follow this splitting pattern. If there are n splits, then n equals
the number of decision nodes in the full-grown tree.
Therefore, when we start pruning the full-grown tree back to the desired level, we start
deleting the least important splits, the splits that achieved the least reduction in
impurity, and work our way back up until we reach the point where the error on the
validation data is minimized. The pruning exercise we performed in the previous
lecture, however, was not based on this.
There we just looked at the node numbers and pruned in increasing node number
order, whereas the optimal way of pruning is the one described here. So today we will
do an exercise in R in which we follow this pruning sequence, and we will understand
a few more points through the exercise. Let us start by loading the xlsx package; we
can skip past the steps we have already done.
As in previous lectures, let us also load the rpart package and one more package that
we will need, rpart.plot. Now let us move to our data set, the promotional offers .xlsx
file, and import it into the R environment.
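A sketch of this setup step; the file name and sheet index are assumptions, so adjust
them to your local copy of the data:

library(xlsx)
library(rpart)
library(rpart.plot)

# Import the promotional offers data set (file name assumed)
df <- read.xlsx("promo_offers.xlsx", sheetIndex = 1)   # about 5000 observations, 9 variables
str(df)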
(Refer Slide Time: 06:36)
Let us run this. It will take some time, because this data set has 5000 observations, so
it takes a little longer than the smaller data sets we have used elsewhere.
Once the data set is loaded, we will go through some of the steps we performed in the
previous lecture, and when the pruning-specific steps start we will discuss what is new.
You can see that all 5000 observations of 9 variables have been loaded into the R
environment. Now let us remove the NA columns and look at the structure: these are
the variables.
(Refer Slide Time: 07:38)
Some steps we have to go through quickly again, for example the grouping of
categories, so that we reach the same point as before. We have already discussed this
code, so we just run through it. Now our data frame is ready and all the variables have
the appropriate data types, numeric and factor.
Now let us do the partitioning; these steps have also been discussed already. Next we
build the full-grown tree using the same code as before. As discussed, the xval value is
set to 0 here; by default it is 10 and is reserved for cross-validation, but we just want to
grow the full tree. Let us plot it.
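For reference, a sketch of a full-grown tree call of this kind; the xval = 0 setting
reflects what is described here, while the formula, partition name, and the cp and
minsplit values are assumptions chosen so that the tree grows fully:

# Grow the full tree: cp = 0 (and a small minsplit) lets it grow fully,
# xval = 0 switches off rpart's built-in cross-validation (default xval = 10)
mod1 <- rpart(target ~ ., data = df_train, method = "class",
              control = rpart.control(cp = 0, minsplit = 2, xval = 0))
prp(mod1, type = 1, extra = 1)   # plot the (very busy) full-grown tree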
This is our full-grown tree, quite messy, as we saw in the previous lecture. Now let us
move to the point we want to discuss further. The table of split variable and value
combinations we have already gone through, so we will not repeat it, and the same goes
for the performance of the full-grown tree. Let us come back to the pruning process
where, as I said, we previously followed a different pattern; now we will follow the
actual, desired pattern based on complexity.
For the pruning process, let us look at the number of nodes in this tree: there are 69
nodes in total, 34 decision nodes and 35 terminal nodes. As for the node numbering,
you will now see differences from the steps performed earlier. toss1 is the quantity we
want to compute at this point, which we will later pass to the snip.rpart function. As
discussed, the rpart object has a frame attribute whose row names are the node
numbers; we take these row names and convert them into an integer vector to get the
unique node numbers, but their ordering is not yet the desired one.
Now we create a data frame containing these node numbers from toss1 along with the
variable involved at each node: for a decision node, the predictor used for the split, and
for a leaf node simply the label leaf, as we saw in the tables in the previous lecture. For
each node we also include the complexity value, which is likewise stored in the frame
attribute, in its complexity column. Let us create this data frame and scroll through it:
the first column is toss1, the unique node numbering for the rows.
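A sketch of this step; the column names node, var and cp are illustrative choices rather
than the lecture's exact ones:

# Node number, split variable and complexity value for every node of the tree
toss1 <- as.integer(row.names(mod1$frame))
dfp   <- data.frame(node = toss1,
                    var  = mod1$frame$var,          # "<leaf>" marks terminal nodes
                    cp   = mod1$frame$complexity)   # per-node complexity value
head(dfp)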
(Refer Slide Time: 11:24)
The ordering of these node numbers we already discussed in the previous lecture,
where we showed the numbering on the tree itself.
The numbering goes 1, then 2, then 4, 8 and so on. Now that this data frame exists,
look at the second column: it shows the variables involved, such as income and
spending, and leaf for the terminal nodes. For each node it records the splitting
predictor if it was a decision node, or marks it as a leaf (terminal) node otherwise,
together with the corresponding complexity value, the complexity parameter concept
we talked about that is used to control the size of the tree.
So the complexity value for each node is listed; this is the value at which that part of
the tree will collapse. Based on this we can perform our pruning: it will control the tree
size and give us the best pruned tree and the minimum error tree. Before that, we would
like to order this data frame.
(Refer Slide Time: 12:54)
We order it in decreasing order of complexity values. The starting nodes, from which
the first and subsequent splits happen, carry higher complexity values: at the root the
complexity value is highest, which is why it was split number 1, and the next highest
value comes right after it.
That is why it was split number 2, followed by split number 3 and split number 4. So
we order this data frame by complexity values, and once we do, we obtain exactly this
sequence: the first split has the highest complexity value, so it comes first, and the
second split has the next highest complexity value, so it comes next.
Once we order this data frame of per-node complexity values, we get the splitting
sequence. Let us execute this code. Notice that before ordering we first remove the leaf
nodes: we do not want leaf nodes at this point, and there are many of them here,
because pruning is driven by the decision nodes. Once the leaf nodes are removed from
the data frame, we get a new one.
This is the new data frame, DFP1, with 34 observations, equal to the number of
decision nodes; we already saw these numbers in the earlier output, 34 decision nodes
and 35 terminal nodes. So once the leaf nodes are removed, the number of observations
in the new data frame, as you can see in the environment section, is 34, equal to the
number of decision nodes.
Now we can order it: as discussed, we want the nested sequence of splits based on
complexity, so we order this data frame by the complexity values and obtain the
desired nested sequence. Let us execute this code. The ordering has been done, and if
we scroll back to the table, the first entry is the income variable: this is split number 1,
with its complexity value. The second split has the same complexity value; we will
discuss later what happens when several rows share the same complexity value and,
given that, why income was the first split and education the second.
Then comes family size, and the third split is the one with the third highest complexity
value; in this fashion the complexity values keep decreasing. This is how the tree is
developed when we grow the full tree: this sequence determines how the splits take
place and how the tree is built, and when we decide about pruning the full-grown tree
this is the sequence we have to follow, which is why the approach used in the previous
lecture is not the desired one. Notice also that because we sorted this data frame, the
row names still show the original row numbers from the original data frame DFP1.
The sorting has been performed but the row numbers are still the old ones, so we reset
them to reflect the new data frame, DFP2. Looking at the table again, the row numbers
are now also sequential, 1 to 34, for the 34 decision nodes. Once this is done, we can
compute the toss argument to be passed to the snip.rpart function: this toss2 argument
is simply the first column, the node numbers, of the data frame we have just created.
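A sketch of these ordering steps, continuing from the dfp data frame sketched above:

dfp1 <- dfp[dfp$var != "<leaf>", ]                 # keep decision nodes only (34 here)
dfp2 <- dfp1[order(dfp1$cp, decreasing = TRUE), ]  # most important split first
row.names(dfp2) <- NULL                            # renumber the rows 1..34
toss2 <- dfp2$node                                 # node numbers in splitting order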
Let us create toss2. Now we start the pruning process and, as in the last lecture, after
every pruning step we record the model and apply it to score the training and validation
partitions, so that later on we can compare the error rates. We follow the same steps as
in the last lecture, but this time with the actual pruning sequence, the nested sequence.
So we have the counter i for the nodes to be snipped off, then the same variables as
before: a list to store the pruned models, and the variables that will hold the scores, the
values returned by predict, for the training and validation partitions.
Let us initialize them, along with the two vectors for the training and validation
errors; these two vectors are the important ones for plotting and for identifying where
the error on the validation partition is minimized. Now, as discussed in the previous
lecture, the loop runs over the splits; in this case you can see that we run it over all the
decision nodes in this particular column of DFP2.
DFP2 now contains only the decision nodes (it has just 34 observations, as you can see
in the environment section, equal to the number of decision nodes in this tree), so we
run the loop over the number of decision nodes. Some of the checks from the previous
lecture, such as the code eliminating leaf entries, are no longer needed, because we are
dealing only with decision nodes. In the if section, i is compared with the length of
toss2, the total number of nodes, because we prune node by node, starting the process
from i equal to 1 and going up to the final node in the sequence.
So first we snip all the nodes, then the nodes from the second element of the sequence
onwards, that is, position 2 in the sequence as stored in toss2, not literally node number
2, then from the third element onwards, and so on. If we print toss2 we can see its 34
values, the unique node numbers in splitting order. In this fashion snip.rpart is called
each time the loop runs, and we also record a few more things.
For example, the CP table: once we snip and obtain the new subtree model, we need to
correct its CP table, and the code for that is there. Once the CP table is corrected, we
also have to correct the variable importance, and the code for that is there as well. For
this we use an importance function taken from the source code of the prune.rpart
function, where that importance function is defined.
(Refer Slide Time: 21:53)
We use that same source code directly in our exercise because this function is not
exported by the rpart package: once we load the package we do not have access to it,
since it is only called internally within rpart. That is why we bring the source code in
here so that we can use it.
So we create this function here, and you can see in the environment section that the
importance function has been created. We will not go into its details; it is called once
we create a subtree model, in order to update the variable importance of that subtree
accordingly. Then we keep storing these subtree models, score the training and
validation partitions as we did in the last lecture, store the error rates for the two
partitions, and increment the loop counter.
Let us execute this code. It is done. One thing I would like to point out: although we
have been storing the models, we cannot access all of them. This list is quite large,
around 3 MB, while holding only two elements: the rpart objects we were trying to
store are quite big, so not all of the subtree models have been kept, only two of them.
However, we are interested only in the error rates.
Let us create the data frame as we did in the previous lecture and look at the values:
for the decision nodes in this ordering sequence we have the training and validation
errors. For the first decision node the training error is slightly lower than the validation
error, and after the second split the two are about the same.
So there is not much decrease in error after the second split; after the third split the
error has decreased significantly for both the training and the validation partitions. As
in the last lecture, if you scroll down, the training error in the second column keeps
decreasing until it becomes 0 or close to 0, while the validation error keeps decreasing
up to one point and then starts increasing.
You can see the point where the validation error is minimum; after that point it holds
for a few more nodes and then starts increasing, and keeps increasing. With this, we
can also create a plot to visualize the same information we saw in the table.
(Refer Slide Time: 25:45)
This is the plot we saw in the previous lecture as well, now with the correct pruning
sequence. The upper curve is for the validation partition and the lower curve is for the
training partition. For the training partition the error keeps decreasing until it becomes
0; for the validation partition it keeps decreasing up to some point and then starts
increasing.
Somewhere in this zone we need to find the minimum error tree, as we did in the last
lecture, and then within one standard error we have to find the best pruned tree. Let us
look at this value: the minimum error is the one we already saw in the table, and the
corresponding number of decision nodes is 8, so the minimum error tree is obtained at
8 decision nodes. Looking at the graph again, 8 decision nodes is somewhere around
here, probably along this flat stretch of the curve: all of these points represent the
minimum error on the validation partition, whether at 8, 9 or 10 decision nodes.
(Refer Slide Time: 27:16)
We can look at these values if we are interested in how many node counts attain the
same minimum error: the subtree models with 8, 9 and 10 decision nodes all have the
same error on the validation partition. However, we will just use the smallest of these
trees, and then we look at the standard error of the error rate.
This is the value. Now we look at the range within which we need to find the best
pruned tree: as in the last lecture, it should have an error greater than that of the
minimum error tree but less than the one-standard-error threshold. This is the code for
it, which we have already discussed, and the best pruned tree now comes out at 5
decision nodes. To confirm this, we can go back to the table: node count 8 is where the
minimum error tree is, and within the allowed range this particular tree, within one
standard error of the minimum error tree, gives us the best pruned tree. This part we
have already discussed.
Once the best pruned tree has been identified, we can go ahead and create the model:
we want BPT to contain this many decision nodes, so we generate toss3, call snip.rpart,
and we have the best pruned tree. Now let us plot it and look at the result.
Unlike the earlier tree from the previous lecture, where following the sorted node
number order gave a balanced-looking tree, we now get the right tree, and it is not
balanced: first income, then education, and so on, following the optimized split
sequence in this example. This is the best pruned tree, with 5 decision nodes, and you
can see the important variables: income, education, family size and spending all figure
here.
Now we can check the performance of this tree on the different partitions: the training
accuracy is 98.56, the accuracy on the validation partition is close to it, and on the test
partition it is 97.4, also close, so the performance is quite good. There is another way to
carry out this process, which we discussed in the previous lecture as well, based on the
complexity value.
We can use the complexity value: now that we have identified the best pruned tree
following the actual split order, we can find the corresponding complexity value,
because the prune function we used in the previous lecture takes a CP value and prunes
the tree based on it. However, we will also see some of the problems with this
function.
For example, let us find the complexity value for the best pruned tree we just
identified, the tree with 5 decision nodes. CP best is the corresponding complexity
value, and the associated node number in toss1 is 15. Using this value, we can go back
to the table and check how many nodes are involved. The value we just saw is 0.0515,
and you can see it there against that toss1 entry.
(Refer Slide Time: 31:55)
Counting 1, 2, 3, 4, 5, this row is also at 5 decision nodes, so the corresponding tree is
the same. However, and here we come to the problems with the prune function, it may
happen that some of the preceding rows also carry the same complexity value. In that
case, if we prune using the complexity value instead of the number of decision nodes,
even though we have followed the process of finding the minimum error tree and,
within one standard error, the best pruned tree, the pruning will happen at that earlier
level, and the tree might shrink from 5 to 3 or 2 nodes in some scenarios, in some runs.
Even with this data set, if we repeat the whole exercise, then because of sampling,
different observations would be selected into the training partition and we could end
up with a somewhat different model, owing to the limitation on the sample size, even
though this is a larger data set.
So we can get a different pruned tree using the prune function; in this particular case it
comes out the same. We use the prune function, pass the full-grown tree model mod1
and the complexity value down to which we would like to prune, and we can see the
result.
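A sketch of this complexity-based pruning call; bpt_size is assumed to be the index of
the best pruned tree in the ordered sequence (5 in this run), taken from the earlier
sketches:

cp_best <- dfp2$cp[bpt_size]        # complexity value at the best pruned tree's last split
bpt_cp  <- prune(mod1, cp = cp_best)
prp(bpt_cp, type = 1, extra = 1)

Note that prune() snips every decision node whose complexity value is less than or
equal to the supplied cp, so when several nodes tie at that value the returned tree can be
smaller than the one chosen by counting decision nodes, which is exactly the behaviour
discussed next.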
This tree has 1, 2, 3, 4 nodes, whereas earlier we had 5: in the internal processing of
the prune function one more node, the spending one, has been removed. So that is the
tree we get if we follow that complexity value: the tree collapses at that value and only
4 decision nodes remain. We can further compare this case with the minimum error
tree, so let us plot the minimum error tree as well.
(Refer Slide Time: 34:30)
This is the corresponding complexity value, again following the prune-function
approach. These are the nodes and this is the value; we prune at it, and this is the model
we get. You can see the minimum error tree model is much bigger even when we
follow the prune function: you can count 1, 2, 3, 4, 5, 6, 7 nodes there, compared with
the size of 8 we saw when following the pruning sequence, since the last node has
again been removed. If you are interested in looking at other things, for example the
CP table, those aspects we have already discussed. With this we stop here, and in the
next lecture we will start our discussion on regression trees.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 44
Pruning Process - Part III
Welcome to the course Business Analytics and Data Mining Modelling Using R. In
the previous few lectures we have been discussing classification and regression trees:
the issues faced while building classification models, the recursive partitioning step,
and pruning. Specifically, we have been discussing the pruning step, the different
scenarios, how pruning can be performed, and the problems that can be faced while
pruning; in the previous lecture we talked about the pruning sequence.
We will focus a bit more on this in this lecture. So far we have performed pruning in
two ways. One was purely for illustration, for demonstration purposes, where we
followed the node numbering order, the node numbering sequence: the unique node
numbers generated for a tree model were used to perform the pruning.
As we said, that was just for illustration, and the tree resulting from that pruning
process looks quite balanced because of the node numbering sequence being followed.
However, it is not the optimal way of pruning the full-grown tree back to a level where
it does not overfit the data or fit the noise. Therefore we talked about creating a pruning
sequence based on the complexity values, and in that discussion we also noted that we
have to follow the splitting order.
So if this is the root node, the first split is on the predictor-value combination we
talked about; then we have to see which of the right or left child nodes gives the further
impurity reduction, and the next split is performed on that node. For the following split
we again compare among the remaining nodes, and in this fashion the splitting order is
created; we said that this gives us the pruning sequence, the desired sequence.
In this lecture we are going to discuss combining these two approaches, because we
still face some issues when pruning is based on the complexity parameter values alone.
We will understand this, as in the previous lecture, in terms of the complexity (CP)
values.
(Refer Slide Time: 03:48)
In the R exercise of the previous lecture we sorted our decision nodes and splits by
their complexity values, so the toss variable captured the sequence of nodes in splitting
order. Say, for example, the highest value in this column is 0.3 for some nodes, then
comes some lower value, and so on down the column; based on this sorted complexity
order we created our splitting sequence. Now we will cover the problems we might
encounter.
As we go down the tree, many nodes will have the same complexity value. There may
be many such decision nodes with identical complexity values, and then which node
should be pruned first becomes a question mark: how do we decide which node to
prune first?
As the tree grows further, some nodes will carry the same complexity value, so how
do we decide our pruning sequence? We have sorted the nodes, so this is the optimized
pruning sequence, with the root node having the highest value once the full-grown tree
has been developed.
We start pruning from the least important split, that is, from the node with the smallest
complexity value, and move upwards in this fashion. However, when we reach these
zones where several nodes share the same complexity value, what the prune.rpart
function in R does is prune the tree at the node that yields the smallest tree.
For example, if these were the nodes and, let us say, this particular node also had the same value, then out of all the nodes which share that complexity value the prune.rpart function will prune from here: it will prune all the nodes from this point and give us the tree with the smallest number of nodes. So, the smallest subtree would actually be returned by the prune.rpart function that we used in the previous few lectures. However, in the method that we are following we would like to follow the splitting sequence, of course, but whenever we encounter these groups of nodes having the same complexity value we would like to prune the node with the higher node number.
Let us say these are nodes 22 and 44, then 45 and 90, and then here we have 180 and 181, in this fashion. As per this particular sequence, as we go about pruning different nodes we would like to prune those with the higher node numbers. Therefore, when four or five nodes have the same complexity value, we would like to prune the node with the highest node number first; this is probably the node at the rightmost part of the tree carrying that same complexity value, so that is the node we prune first, and then we move further.
So, for the exercise that we had performed we will have to make certain changes in the code from the previous lecture. Let us get to the same point and then discuss further. Let us first load these packages: xlsx, rpart and rpart.plot.
So, let us load these packages. Once they are loaded, let us import the data set that we have been using in the previous few lectures, the promotional offers data set.
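As a hedged sketch of these setup steps (the workbook file name and sheet index are assumptions, not the lecture's actual paths):

library(xlsx)
library(rpart)
library(rpart.plot)
# import the promotional offers data set from an Excel workbook (hypothetical file name)
df <- read.xlsx("promotional_offers.xlsx", sheetIndex = 1)
str(df)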
So, let it load, and then we will look at some of the code that we will have to change. I am also going to correct a few minor problems in the code that we used in previous lectures; so there are some minor corrections, and we would also like to accommodate what we have just discussed.
To come back to this point: earlier we did an exercise where the node numbering sequence was followed, and then an exercise where the splitting sequence was followed, but the splitting sequence also runs into this problem. Now, merging these two approaches, we still follow the splitting sequence for pruning, but for the few groups having the same complexity values we follow the node number sequence.
So, the nodes with the higher node number would be pruned first. The code has to be changed for this, and we will see it now; the data is already loaded here. Let us perform some of these steps, grouping categories as we have been doing in previous lectures, and create the pin code categories.
(Refer Slide Time: 10:55)
So, now let us convert these variables to the appropriate data types, and let us look at the partitioning as well.
Now everything is in order in the structure output: the promotional offer variable is a factor, pin code is a factor with 13 levels, online education and everything else is appropriately specified. Now let us do the partitioning, and then fit the model that we have been using, mod 1; so this is done. Now let us move forward to the place where we were performing pruning.
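A hedged sketch of the partitioning and the full grown tree mod1 described here; the 2500-record training size follows the lecture, while the outcome column name, seed and formula are assumptions:

set.seed(1234)
idx     <- sample(nrow(df), 2500)            # 2500 training records, as in the lecture
dftrain <- df[idx, ]
dfvalid <- df[-idx, ]
# full grown classification tree: cp = 0 and xval = 0 so nothing is pruned yet
mod1 <- rpart(Promo_Offer ~ ., data = dftrain, method = "class",
              control = rpart.control(cp = 0, xval = 0, minsplit = 2))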
(Refer Slide Time: 11:40)
So, here we were performing pruning. This time, in this particular run, we have 87 total nodes, 43 decision nodes and 44 terminal nodes.
One thing you must have observed by now is that every time we run this model, the total number of nodes, the number of decision nodes and the number of terminal nodes in the tree keep changing, even though we have a good amount of observations.
We have 2500 observations in the training partition, yet these numbers have been changing. This is mainly because classification and regression tree models are quite sensitive to changes in the sample: every time we construct a tree we get a slightly different result, because the randomly drawn observations change each time, and therefore the tree changes too; the model is sensitive to sample changes. We will discuss this further later.
So, now let us go back to what we were trying to do here: merging the two approaches, the splitting sequence and the node number, as just discussed. Let us first create this toss argument, which will contain the node numbers that will be used to determine the pruning sequence.
As we did in the previous lecture, let us create this data frame. It has toss1, the variables used for splitting, and the complexity values for the corresponding nodes. Once this data frame is created you can see all 87 nodes of the tree in it.
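A minimal sketch of how this node/variable/complexity data frame can be assembled from the frame attribute of the rpart object (the object and column names are assumptions):

toss1 <- as.numeric(rownames(mod1$frame))    # unique node numbers in R's internal order
dfp   <- data.frame(node = toss1,
                    var  = mod1$frame$var,          # splitting variable; "<leaf>" for terminal nodes
                    cp   = mod1$frame$complexity)   # complexity value of each node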
(Refer Slide Time: 13:40)
Let us remove the terminal nodes using this particular code; you would see the terminal nodes are removed now. Now, let us look at this piece of code: we wanted to create a nested sequence of splits based on complexity, and the commented-out line is the one that we used in the previous lecture.
The improvement on this is that the data frame is now ordered, or sorted, first by the complexity values and then by the node numbers. You would also see a minus sign inside the order call; it is there because we want to sort the data in decreasing order of complexity values, while for a group of nodes having the same complexity value we want them in increasing node number sequence, that is, lower node numbers first followed by higher node numbers. To understand more about the order function you can look at the help section, where you will also find examples of how this can be managed.
So, the sorting that we are performing here is by multiple columns. Let us sort the data; this is done. Now let us also change the row numbers. You can see the 43 decision nodes, and they have now been sorted; if you want, we can confirm this through one or two examples.
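A hedged sketch of the terminal-node removal and the two-key sort described here, decreasing complexity first and increasing node number within ties (the data frame names are assumptions):

dfp1 <- dfp[dfp$var != "<leaf>", ]            # keep only the decision nodes
dfp2 <- dfp1[order(-dfp1$cp, dfp1$node), ]    # sort by complexity (descending), then node number (ascending)
rownames(dfp2) <- 1:nrow(dfp2)                # reset the row numbers after sorting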
So, let us look at the unsorted list first; for this last group you would see that the node numbers are already in the desired sequence, so it is fine. Now let us move further to find another example: for this group the complexity values are the same for several nodes and the node numbering is also in the desired order, but there might be situations where the node numbers of a group are not in increasing order.
Let us see this one: this is quite a large group, but it also seems to be in the desired order, and this one seems fine as well. Probably in this particular run everything was already well in place; however, if we run it again we might not get this data frame in the sorted node numbering order.
Once we run this particular code, the frame will have decreasing complexity values, and within it the rows having the same complexity value will be in increasing node number order; after running this code there will not be any problem even if one was there in the first place. Once this is done, we will be able to use this pruning sequence to perform the pruning.
So, now the pruning would start with the least important split, that is, the nodes having the least complexity values, and within a group of nodes having the same complexity value the node with the higher node number would be pruned first. With this we will repeat the exercise from the previous lecture; a few minor corrections have also been made. For example, the different models for different numbers of decision nodes that we were building are now going to be recorded in list format, so that we are able to access them easily; earlier this part of the code was not functioning properly, so we have made certain minor changes to be able to record them.
These three variables, mod1 split v, mod1 train v and mod1 valid v, have been initialized as lists, and therefore within the loop itself we will be able to record all the models. The code for the for loop is otherwise the same as we saw in the previous lecture.
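A hedged, simplified sketch of this loop; snip.rpart() is the rpart function for snipping off subtrees by node number, while the object names and the exact slicing of the pruning sequence are assumptions:

mod1_split_v <- list()    # successively pruned trees
mod1_train_v <- list()    # scoring on the training partition
mod1_valid_v <- list()    # scoring on the validation partition
toss2 <- dfp2$node        # pruning sequence: most important split first
for (i in 1:length(toss2)) {
  toss3 <- toss2[i:length(toss2)]                       # nodes snipped off in this iteration
  mod1_split_v[[i]] <- snip.rpart(mod1, toss = toss3)   # roughly a tree with (i - 1) decision nodes
  mod1_train_v[[i]] <- predict(mod1_split_v[[i]], dftrain, type = "class")
  mod1_valid_v[[i]] <- predict(mod1_split_v[[i]], dfvalid, type = "class")
}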
(Refer Slide Time: 19:00)
Let us run this; since we are calling the importance function here, we will have to load it first. Once this is done, let us come back to the same point and run this again; now that the importance function is loaded into memory, this will run.
(Refer Slide Time: 19:30)
So, now let us look at the environment section: you would see that mod1 split v is a large list having 43 elements; that means all the successive models built in this particular loop have been recorded, and the same is true for the mod1 train scoring and the mod1 valid scoring on the validation partition for all the models.
Once this is done, you would also see one slight change: this particular toss3 now runs from i to the length of toss2; that means the models that we build in the loop range from 0 decision nodes up to one less than the total number of decision nodes, so the last one within the loop has 42 decision nodes. This is appropriately mentioned here as 0; this is a minor correction to the earlier value. The data frame is very much similar to what we created in the last lecture, so let us create it.
(Refer Slide Time: 20:40)
Now, the model with all the decision nodes is nothing but the full grown tree model, so we will add a row for it in this particular data frame. You can see that the last row here is nothing but mod1, the full grown tree model; we are adding the relevant details: the number of nodes, the performance on the training partition and the performance on the validation partition.
For that we first have to score these two partitions, so we will go back to that part of the code, score the two partitions using the full grown model, and then come back and create that row again. Let us create this value; now we have the data frame.
So, you can see in this particular output that the number of decision nodes runs from 0 to 43; the model with 43 decision nodes is the full grown tree model, and therefore in this lecture's output you would see that its error on the training partition is 0, that is, the tree completely fits the data, which leads to overfitting. We also have the corresponding numbers for the case of 0 decision nodes, that is, just the root node. Now this data frame can be used to find the minimum error tree and the best pruned tree, which we will do in a while.
So, now, since we have made these changes, we are able to access all the models that we developed in this loop for different numbers of decision nodes. This one is the last model, the model having one decision node less than the full grown tree; it is in mod1 split v, and we can also access it using the list (double bracket) notation. If you want, you can also look at the full model for comparison.
(Refer Slide Time: 23:10)
So, this is the last model, with just one node pruned, and if you want to compare it with the original model that can also be done. Here we can create the tree diagram for the full grown tree model; from this it might sometimes be possible to visually spot which node, the least important one, has been removed.
The same thing we can also cross-check using the data frame that we had created, DFP2. From this pruning sequence, the node to be snipped off first should be a spending node. So, let us look at this data frame; this is for the full grown tree model, so the last node in the sequence should be the one snipped off. This is one terminal node, and this seems to be the one decision node, and it is related to spending.
Now let us check whether this was the same one in the last tree in the loop, the last model. It will take some time to load; the node number here is 163, so probably this should be it. There is one more spending node here, so this could be 163, and we will have to compare it with the full tree model to visualize which particular node has been deleted.
So, I think node number 163 could be the one; we can see here that the split on spending is visible there, and this one is also present there. Right now we will not try to spot this particular node exactly; there is simply one node less. So, let us execute these three lines: you would see that there are 85 total nodes, 42 decision nodes and 43 terminal nodes in the last tree, which pruned off just the one least important node.
If we look at the full grown tree, the number of decision nodes there is 43, so one node has been removed. However, another way to spot this is using this particular output of mod1, as you can see here; however, this output is in the node numbering sequence.
So, from the earlier output we expected that the node that has been removed is node number 163. Let us look for that particular node here in the full grown tree model results; 163 is somewhere here, so this is where it is, and now we will look for the same thing in the last output that we had in the loop.
(Refer Slide Time: 27:48)
So, now, instead of looking at this one, we will look at the different models that we have stored. We can see mod1 split v; using the double bracket notation we can access the first tree, the one after the first snip, and this we can plot again. You would see it is nothing but the root node: there is just one node in total, no decision nodes and just one terminal node. This is what happens after the first snip; similarly we can access the other models.
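For example, under the same assumed names, the stored models can be pulled out by position:

mod1_split_v[[1]]          # tree after the first snip: just the root node, no decision nodes
prp(mod1_split_v[[43]])    # last tree in the loop, one decision node less than the full grown tree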
Now, let us create the plot between the number of splits and the error rate, with the error rate on the y axis; looking at the range, it lies between 0 and around 10.
So, let us create this plot, and add the validation curve as well. You would see the plot is quite similar to what we have been creating in previous lectures: for the training partition the error keeps decreasing till it reaches 0, while for the validation partition the error keeps decreasing, reaches a minimum point somewhere here, and then starts increasing.
So, this minimum point is the one up to which we would like to prune the tree. Let us find this particular point: as we have done in the previous lecture, using this code and the data frame that we have created, we can find the minimum error and the corresponding number of decision nodes, which is 6; if you are interested you can look at the data frame once again, where you can see 0.0173 and 6 decision nodes.
(Refer Slide Time: 29:45)
So, this is the point where the minimum error occurs: the minimum error is 0.0173 and the number of decision nodes is 6. This being the minimum error tree, let us now try to find the best pruned tree; for that, let us compute the standard error of the error rate on the validation partition.
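One common way to compute this, sketched here under the assumption that the error column and partition objects are named as below:

err_valid <- dfp3$err_valid                       # validation error of each candidate tree (assumed name)
min_err   <- min(err_valid)
se_min    <- sqrt(min_err * (1 - min_err) / nrow(dfvalid))
cutoff    <- min_err + se_min                     # about 0.0198 in the run discussed here
best_row  <- min(which(err_valid <= cutoff))      # smallest tree within one standard error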
Now, let us compute this particular value so that we are able to find the best pruned tree; this code we have already discussed. In this particular case the best pruned tree also comes out to be the same as the minimum error tree. If we look at the table we get a better idea: the row that we were looking for should have a validation error less than this cutoff value of 0.0198.
(Refer Slide Time: 30:34)
If we go back to the table, to the row with six decision nodes and 0.0198, and move up, we do not see any such value. Therefore, the minimum error tree itself becomes the best pruned tree, because within one standard error there is no other option available. Once this is done, we can create the toss3 argument and create our best pruned tree diagram.
If you look at the best pruned tree diagram now, this is the best pruned tree that we have: we have income, education, family size and spending, and these are the important variables visible in it. Income is the most important, coming at the top of the tree and occurring again early on; spending is also there, and education also seems to contribute to this tree model, appearing at the second level and again further down. So, these are some of the important variables that we can see.
Now, the same process can also be carried out using rpart's prune function; for that we will have to specify xval as 10, which is the default value, so let us go through it one more time as we did in the previous lecture. Let us build this model and look at the CP table, and from it we will try to find the row for which xerror is minimum. This xerror is similar to the validation error that we computed earlier. This is the lowest xerror value, and you can also see the corresponding CP value; let us record it, and once this is done we can also create the plot using the plotcp function.
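For reference, a hedged sketch of this cross-validation based route (formula and data names as assumed earlier):

mod_cv <- rpart(Promo_Offer ~ ., data = dftrain, method = "class",
                control = rpart.control(cp = 0, xval = 10, minsplit = 2))
cpt    <- mod_cv$cptable
cp_min <- cpt[which.min(cpt[, "xerror"]), "CP"]   # CP at the minimum cross-validated error
plotcp(mod_cv)                                    # relative error versus CP / tree size
mod_cv_pruned <- prune(mod_cv, cp = cp_min)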
(Refer Slide Time: 32:51)
So, this gives us a graphic of CP versus the cross-validated relative error; the size of the tree is shown on the top axis and the CP values on the bottom axis. Generally it is considered that the first tree which falls below this dotted line, towards the left part of the plot, is the best pruned tree. Many points could be below this line; in this particular run you can see that the subtrees in roughly the range of 4 to 22 splits are all below this particular reference level. So, the best pruned tree and the minimum error tree seem to be around this mark; typically the best pruned tree is the first point which comes below this line, so here it could be around four or five splits.
The corresponding CP value can then be used to prune the tree back to that level. This is quite a similar approach to what we have done, but it is based on the complexity values alone, and we have already discussed some of the problems that we might encounter there. As per the process that we have adopted, sorting by complexity values and then further sorting by node numbers will probably give us the best models. So, we will stop here, and in the next lecture we will start our discussion on regression trees.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 45
Regression Trees
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous few lectures we completed our discussion on classification trees, and in this particular lecture we will move on to regression trees. Before that, let us discuss a few more points about classification trees. Out of classification trees we get simplified classification rules: each terminal or leaf node in our final tree model is equivalent to a classification rule.
So, for any particular tree we can always formulate the classification rules based on the final tree model. For example, let us say the first split is income less than 10 and the next split is age less than 35: observations with income less than 10 go to one side, all the points with greater values go to the other side, and then the age less than 35 split is applied. The classification rule then reads like: if income is less than 10 and age is less than 35, then assign class 1. These kinds of simplified classification rules can be obtained from a tree model, and that gives us ease of implementation, because such rules are very easy to understand and easy to implement.
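If you want such rules printed directly from a fitted tree, the rpart.plot package provides rpart.rules(); the model name below is an assumption:

library(rpart.plot)
rpart.rules(best_pruned_tree)   # one row per leaf: the conditions leading to it and its predicted class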
Now, when we finally build and select a final tree model, all the terminal nodes, as we discussed, will represent classification rules. These rules can often be simplified further: as we have seen in previous lectures in the different exercises that we have done, the same variable might occur again at some level.
So, let us say income comes up again further down; there are then going to be two conditions based on the income variable in the classification rule. In those situations we can look at those values and simplify the rule. Similarly, out of all the rules that we have, one per leaf node, we can identify redundant rules: if one rule is applied and a second rule is essentially a sub-rule of it, such redundant rules can be identified and removed from the list.
Now, let us start our discussion on regression trees. For regression trees the outcome variable should be numerical. Classification trees were for the classification task, so typically we were looking to predict the class of a new observation; in regression trees the outcome variable is numerical, so we look to predict the value of a new observation.
As far as the steps to build the tree model are concerned, they are quite similar to those of classification trees, with a few differences: the prediction step is of course different from the classification step, the impurity measures are certainly different for the prediction task, and the performance metrics are also different.
So, let us discuss these differences one by one, starting with the prediction step. The value of a leaf node is the predicted value for a new observation that falls in that leaf node. In a regression tree, when we want to predict the value of a new observation, just as in classification trees that observation is dropped down from the root node, it follows a particular path, and finally it reaches a terminal node.
The value of that terminal or leaf node is going to be the predicted value for that particular new observation. Now, how is the value of a leaf node decided in a regression tree? As we know, in a classification tree it is majority voting that is used and the class is assigned based on that; in a regression tree we compute the average of all training partition records which fall in that same terminal node.
So, the value of a leaf node is computed by taking the average of the training partition records constituting that leaf node. When the tree is built using the training partition, certain training records drop down to that particular leaf node, and the average of those training observations becomes the predicted value for the leaf; any new observation that is dropped down the tree and lands in that leaf node gets the value of that leaf node as its predicted value. So, this is the difference in the prediction step; otherwise, in terms of building the tree, the steps are very similar to what we have discussed for classification trees.
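A tiny, hedged illustration of this prediction rule with made-up numbers:

y_in_leaf <- c(9.2, 8.7, 9.5, 9.1)   # hypothetical training responses that fell into one leaf
mean(y_in_leaf)                      # 9.125, the value predicted for any new record reaching this leaf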
(Refer Slide Time: 06:43)
Now, let us discuss impurity measures. The impurity measures that we used for classification trees were Gini and entropy; in regression trees the impurity measure is different: here we use the sum of squared deviations from the mean of a leaf node. So, for a given leaf node, we take all the observations of the training partition that became part of that node during the tree-building process, and their squared deviations from the node's mean value, summed up, become the impurity measure. This is equivalent to squared errors, since the mean value of the training records in the leaf node is also taken as the predicted value; the deviations are therefore between actual and predicted values, and so this measure is also equivalent to squared errors.
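In symbols, assuming leaf node t holds the training responses y_i with mean y-bar_t, the impurity being described is:

\mathrm{Imp}(t) = \sum_{i \in t} \left( y_i - \bar{y}_t \right)^2, \qquad \bar{y}_t = \frac{1}{n_t} \sum_{i \in t} y_i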
Now, the lowest impurity is 0, and it occurs when all the observations that fall in a leaf node have the same actual value of the outcome variable. Let us say there are 10 observations, 1, 2, 3 up to 10, and they all have the same value, say 0.5; then the mean will also be 0.5 and the deviations become 0.
So, the lowest impurity is recorded when all the observations in a particular leaf node have the same actual value. Now, before discussing CART further, let us do a regression tree exercise using R. Let us go back to our RStudio environment; this is the script for regression trees, so let us load it. For regression trees we are going to use the cars data set that we have used with some of the previous techniques as well. Let us import this particular data set; you can see 79 observations of 11 variables. Let us remove the NA columns and look at the first 6 observations.
(Refer Slide Time: 09:38)
You are already familiar with this particular data set: you can see brand, model, manufacturing year, fuel type, SR price (that is, showroom price), KM, price, transmission, owners, airbags and C price. As we have been doing for this data set with previous techniques as well, let us compute the age variable out of the manufacturing year and add it to the data frame. Now let us take a backup of the data frame, and since we are not interested in the first few columns and also C price, we will get rid of them.
(Refer Slide Time: 10:17)
So, now what we have is just 8 variables and 79 observations. You can see fuel type is appropriately specified as a factor variable with 3 levels, CNG, diesel and petrol; then we have SR price and KM, and then transmission, which should actually be a factor variable but is stored in a numeric code format. Therefore we will have to change it to a factor variable. As we know, any categorical variable having values in text format, like CNG, diesel, petrol, is automatically stored as a factor variable in the R environment; for transmission we will have to do it ourselves, and the other variables, owners, airbags and age, are fine as numeric variables.
So, let us change transmission using as.factor, and you would see in the new output of the structure function that transmission has also become a factor with two levels. Now let us perform the partitioning.
(Refer Slide Time: 11:25)
So, we will go for a 60/40 split: 60 percent for the training partition and 40 percent for the test partition. Let us create these two partitions; once this is done we can go ahead and build our model.
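A hedged sketch of the type conversion and the 60/40 partition (the data frame and column names are assumptions):

cars$Transmission <- as.factor(cars$Transmission)   # numeric code converted to a factor
set.seed(1234)
idx     <- sample(nrow(cars), round(0.6 * nrow(cars)))
dftrain <- cars[idx, ]      # 60 percent training partition
dftest  <- cars[-idx, ]     # 40 percent test partition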
So, again we are going to use the same package and the same function, rpart; let us load this particular package. Now, if we look at the rpart function call, the arguments are quite similar to what we used in the classification tree exercises, with one difference: the method has changed to anova.
So, for regression trees we have to specify the method as anova, whereas for classification trees the method was class; the other arguments are quite similar and do not change. You can see the xval value is specified as 0 (by default it is 10, as we have talked about in previous lectures), because we want to first develop the full grown tree, and for the same reason the cp value is also set to 0.
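A hedged sketch of this full grown regression tree call (the formula and object names are assumptions):

library(rpart)
mod1 <- rpart(Price ~ ., data = dftrain, method = "anova",
              control = rpart.control(cp = 0, xval = 0, minsplit = 2))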
So, let us run this; the model is built. Now let us look at the number of decision nodes, 46 in this case, and the number of terminal nodes, 47; these are the numbers of nodes in the full grown tree model. Now we will move straight to the pruning process, which, as we discussed, is similar to what we did for classification trees. First we will record the node numbering: as we have talked about, node numbering follows a particular order in R, and you can obtain it from the frame attribute of the rpart object, so let us compute this.
(Refer Slide Time: 13:30)
So, these are the node numbers. As we have explained in previous lectures, node numbering in the R environment typically happens in this fashion: suppose this is our tree; numbering starts at 1 and keeps going down one side, and once the full depth along that path is reached it moves back up to the nearest unvisited node and continues, and in this fashion the node numbering is recorded for a particular tree. So, as you can see, 1, 2, 4, 8 lie in the left part of the tree, and the sequence continues 16, 32, 64 until it ends at 64; then we go back to the nearest node, 65, go down again to 130, then 260, then 520, 521, then again we go back. In this fashion the node numbering happens, as we saw on the board.
As we have done for classification trees, we will create this data frame: the first column records these node numbers, then we record the variables that have been used for a split, and then the complexity value for each of these nodes.
(Refer Slide Time: 15:00)
So, you can see that the total number of nodes in the tree is 93, and the same number of rows is present in this particular data frame. As in the case of classification trees, we first get rid of the leaf nodes; the code is quite similar, actually the same. Let us remove these nodes, and you would see that DFP1 now has just 46 observations, the same as the number of decision nodes in the tree: there were 46 decision nodes, and the same number of rows appears here in DFP1.
So, now DFP1 contains just the decision nodes; the rows for the leaf nodes have been removed. Next comes the nested sequence of splits based on complexity: as we did finally for the classification trees, the first ordering is based on the complexity values and then we further order by node number, so that if a particular group of nodes has the same complexity value, the rows in that group are ordered by their node numbering. Let us execute this to get the sequence, and let us also change the row names. You can see the total number of rows is now 46, that is, the 46 decision nodes, and if you look at the complexity values they have been sorted in decreasing order.
(Refer Slide Time: 17:00)
So, the higher complexity values come first: the root node, then node number 2, then 5, 11 and then 4. As we move down you would see that, even for the regression tree, some of the rows have the same complexity value; you can see the same complexity value here for nodes 19 and 39. There again, as we did for classification trees, we would first like to remove 39 and then 19; the sorting works the same way for classification and regression trees. We can find similar groups of rows further down which have the same complexity value; here once again you can see one. The sequence within such a group has to be based on node numbers, because we would like to keep the smaller subtree, so the nodes with higher numbers should be pruned first.
Similarly, as we go down we can find many more nodes having the same complexity value, so the ordering, and hence the pruning, has to be done accordingly; here also you can find such a case, and everything seems to be appropriately ordered. Once this is done, we can record another toss argument which holds the desired sequence of nodes to be pruned off, and we can have a look at toss2 again: this is the desired sequence of nodes to be pruned off.
(Refer Slide Time: 18:35)
So, you can see the pruning will start from the last node, that is, node 131; it would be pruned first, then 351, 260 and then 311, and in this fashion the pruning of the full grown tree would proceed. However, in the code that we are going to run we will build all such models: a model with no decision nodes, with one decision node (the first one), with 2 decision nodes (the first and second), with 3 decision nodes (1, 2 and 5), with 4 decision nodes (1, 2, 5, 11), and so on. Let us start executing the code for the same.
So, the loop index is i, and you can see these 3 variables, which are used either to store the different models or to record the scoring for the different partitions; let us initialize them. Because this is a regression tree, the metric that we are going to use also changes: earlier, in classification trees, we computed the misclassification error and used it to identify the minimum error tree and the best pruned tree. In this case we will use the RMSE value, and for that we are going to use the rminer package, so let us load it.
(Refer Slide Time: 20:07)
So, within the loop we will be computing the RMSE value for each model, and that is then used for finding the minimum error tree and the best pruned tree; the code is quite similar to what we had for the classification tree, with one change. You can see in the predict function that the type is now vector, because we would like to predict numeric values instead of classes, and in computing the error for the different partitions and models you can see that the RMSE value is being computed. These are the few noticeable changes in regression trees in comparison to what we did in classification trees.
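A hedged sketch of the changed scoring step inside the loop; mmetric() is rminer's metric function, while the object names are assumptions:

library(rminer)
pred_test <- predict(tree_i, dftest, type = "vector")            # numeric predictions from one model
rmse_test <- mmetric(dftest$Price, pred_test, metric = "RMSE")   # RMSE on the test partition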
So, let us run this loop; but before running it let us load the importance function that is part of this code. Once it is loaded we can go back to the loop and run it.
(Refer Slide Time: 21:12)
So, you can see here in the environment section that the model split list is a large list of 46 elements; we had 46 decision nodes, so the same number of models has been developed, and the scoring for the test partition and the training partition is there as well: you can see a list of 46 here and a list of 46 there, so we have the scoring for all 46 models. Once this is done, as we did for the classification trees in the previous lecture, we will create the data frame having the number of decision nodes, in the same fashion as last time, along with the training error and the test error; so let us create this data frame.
Now, this will have all the rows starting from 0 decision nodes up to one less than the total, and to include the model with all the decision nodes we have to bring in the full grown tree details; that is what we are doing now. The last row we are adding manually: you can see that we are scoring the training partition using the full grown tree model and adding that to the data frame, and then we are scoring the test partition and adding that information as well. Now we have the data frame: we had 46 decision nodes, and we have 47 models and their performance.
(Refer Slide Time: 23:05)
So, the first model is the one with 0 decision nodes, that is, just the root node acting as a terminal node, and we can see the corresponding error values; then, as we move down, we see the models with different numbers of decision nodes. For the training partition you can see that the training error is highest when we have 0 decision nodes, it keeps decreasing, and by the time we have 46 decision nodes it reaches 0. So, the training error finally decreases to 0; for the test partition the error starts from this particular value, 2.56, keeps decreasing, reaches what seems to be the minimum value in this particular row, and then starts increasing again.
(Refer Slide Time: 24:17)
So, once we have this particular data frame, let us go ahead and plot the error rate curves for the two partitions, training and test; looking at the range, it goes up to about 1000 here and 256 there, so within 0 to 1000 we will have these two plots. Now, we would see some differences with respect to the regression tree: for the training partition the error starts quite high and keeps decreasing until it becomes 0, while for the test partition it again comes down to a minimum level and then starts increasing.
So, from this we need to find the point where the RMSE value is minimum for the test partition. This can be done using the same code that we used for the classification tree; let us find the minimum error on the test partition using the third column. This is the value, the same one that we identified in the table.
(Refer Slide Time: 25:18)
So, this is the value, and you can see that the corresponding number of decision nodes is 5; with 5 decision nodes we get this minimum value, and the same thing comes from this code, where you can see 5 has been recorded. Now let us look at the standard error; this is the standard error used to find the best pruned tree within one standard error of the minimum.
So, let us compute this particular value. The error should then be less than 0.98; but, as we can see when we go back to the table, the minimum is 0.9531, and as we move upwards in the table we do not see any row where the value is within one standard error of it, that is, less than 0.98.
Therefore, it seems that the best pruned tree is the same as the minimum error tree; let us confirm this using the code, and you can see it is 5, so the best pruned tree is indeed the same as the minimum error tree. What we will do now is create the toss argument again: in this case we would like to keep just the five decision nodes in the pruned tree model and snip off all the remaining nodes. So, let us create this, and let us load the rpart.plot library so that we can plot the tree. This is the tree that we get out of this regression tree exercise.
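A hedged sketch of snipping back to this five-split tree and plotting it (the names and exact bookkeeping are assumptions):

library(rpart.plot)
keep_nodes <- dfp2$node[1:5]                    # the five most important splits
snip_nodes <- setdiff(dfp2$node, keep_nodes)    # every other decision node gets snipped
best_tree  <- snip.rpart(mod1, toss = snip_nodes)
prp(best_tree, type = 2, extra = 1)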
(Refer Slide Time: 27:16)
So, if we look at this tree, at the root node the first split predictor and value combination comes from KM: KM greater than 23.5. If yes, all the observations with KM greater than this value go to that side, and we immediately get a terminal node as the right child; on the other side we have SR price, with the condition SR price less than 10.475, and further down we again have SR price. So, KM seems to be the topmost and most important variable, followed by SR price at the second level; then we see KM again, and again SR price. You can see that we have 5 decision nodes, and out of these five SR price occurs thrice and KM occurs twice.
So, out of all the variables that we had, KM and SR price are eventually determining this particular tree model, and they seem to be the most important variables for this prediction task. As you can see, at each terminal node we have some value: if we want to predict a new observation, it has to be dropped down this particular tree, and once it reaches a particular leaf node, the value of that leaf node is going to be its predicted value.
So, let us come back. The same exercise that we carried out by minimizing the error on the test partition across models with different numbers of decision nodes has an alternative, quite similar mechanism using rpart's prune function. What we do is set xval to 10, as you can see, and build this model; once this is done, let us look at the CP table that is part of the output. Here we need to identify the row where xerror is minimum, which we can do using this particular code.
So, the corresponding CP value would be recorded, 0.01250. If we look at the table, this is the minimum xerror; you can see from the value itself that the minimum xerror is 0.7478825, and the corresponding CP value can also be seen. Now this value can be used to prune the tree; you can see the number of splits is 6. We can also create the CP plot for this model, as we did for the classification tree.
(Refer Slide Time: 30:46)
As we talked about, a similar approach applies here, wherein the first tree which falls below the dotted reference line can be taken as the best pruned tree; however, here all the points are below this particular line, so the model is not performing that well on the cross-validation side. One specific reason for this is that we have just seventy-nine observations in the full data set, out of which some are used for the training partition and the remaining for the test partition; because of this smaller sample size we are getting this kind of result. Even so, as we can see, probably this particular point, which corresponds to 6 splits, can be taken as the pruned tree as per this plot. So, we will use the value recorded in cp1, corresponding to the minimum xerror value, to build the model, and this is the model that we get.
(Refer Slide Time: 31:57)
If we look at this particular model, you would see that there is one more node in comparison to what we saw in our own exercise: owners is also there, but otherwise KM and SR price remain the two important variables, with one extra node for owners. This particular model comes from using rpart's prune function. We can also look at the number of nodes, 6 decision nodes and 7 terminal nodes, and with this we have completed our exercise in R for this part.
Now let us discuss a few more important points related to the CART algorithm. Some of its advantages are that it can be used as a variable selection approach, and that no variable transformation is required: you do not need to transform your variables or derive new ones, because the recursive partitioning approach used to build the tree eliminates any requirement for variable transformation; the tree is essentially going to be the same, because it is the midpoint values (or subsets, in the case of categorical variables) that are used to create the partitions.
Therefore variable transformation is not required. CART is also robust to outliers: again, because the recursive partitioning approach does not rely on the specific values of outliers, the model remains robust to them. It is a non-linear and nonparametric technique: we did not make any assumption about the relationship between the outcome variable and the set of predictors, so it is non-linear and has no parameters in that sense. It can also handle missing values: for the same reason, because of the recursive partitioning approach and because mid-values are used to find the possible split points, missing values can be handled quite well by this technique.
Now let us look at some of the disadvantages or issues with the CART algorithm. It is sensitive to changes in the sample data, as we have seen in our classification tree exercises and the regression tree exercise: in the regression tree we had a very small sample and saw that the performance on the test partition was not good, and in the classification trees, even though we had a large enough sample size, every time we ran the model the tree used to change because of the different observations that became part of the training partition. Not just the best pruned tree, even the full grown tree model used to change, and therefore this particular technique is sensitive to sample data changes. Now, if we look at the approach of CART, the main steps are recursive partitioning and then pruning. The recursive partitioning approach, in a way, captures a predictor's strength as a single variable; that is what is actually modelled, not its strength as part of a group of predictors.
So, the modelling does not consider the strength of a group or set of predictors; rather it relies on the strength of one predictor at a time, and that could be one drawback: there could be sets of predictors which, put together, might give better performance using other techniques. Another comment on CART is that it might not fit linear structures or relationships between predictors well. It is typically understood that, because the recursive partitioning approach is used, if a vertical-and-horizontal separation kind of scenario exists in a particular data set then the CART algorithm will perform well.
However, if a diagonal-line kind of partition would be more suitable for the data, then this technique is probably not going to give better results. In that case a solution could be to derive new predictors based on the hypothesized relationship: if some diagonal line is a better separator of the observations, then a new variable can be derived to express it and used in the CART models.
Another problem with the CART algorithm is that it requires a large data set: its robustness depends a lot on having a large data set, and even then the results can change from run to run. It also has high computation time: because we use a large data set, recursive partitioning and pruning are involved, and a lot of sorting-related work is part of this process, it requires high computation time. With this we conclude our discussion on classification and regression trees; in the next lecture we will start our discussion on logistic regression.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 46
Logistic Regression-Part I
Welcome to the course Business Analytics and Data Mining Modeling Using R. In this particular lecture we will start our discussion on logistic regression. Logistic regression is the equivalent, for a categorical outcome variable, of the linear regression technique that we have gone through in previous lectures. Linear regression, which we discussed before, was mainly for a numeric, continuous outcome variable; logistic regression is an equivalent technique for categorical outcome variables. So, typically, linear regression is used for prediction tasks and logistic regression is used for the classification task. However, although the outcome variable is categorical, the predictors can be categorical or continuous. So, let us understand more about logistic regression.
Logistic regression is typically applied to the following tasks. The first is the classification task, where we predict the class of a new observation. This particular technique can also be used for profiling, for example for understanding similarities and differences among groups. So, classification and profiling: in terms of the way the modeling is done there is not much difference, and the modeling steps for both tasks are largely the same; however, the objectives are different. In classification tasks we would like to predict the class of a new observation, while in profiling we would like to understand similarities and differences among groups. So, let us move forward: what are the steps for logistic regression?
First, in logistic regression, because this is typically for classification tasks, we estimate the probabilities of class membership. If it is an m-class scenario, we would like to estimate the probabilities of belonging to class 1, class 2 and so on up to class m; if it is a two-class scenario, we would like to estimate the probabilities of a new observation belonging to class 1 and to class 0.
So, that becomes the first step: we estimate these probabilities of class membership. The second step is about classifying observations using these estimated probability values. As we have discussed for other techniques, the first approach is typically the most probable class method, wherein we look at all the estimated probabilities of class membership and assign the observation to the class with the highest probability value.
That is the typical scenario where we are looking to minimize the overall misclassification error and we do not have any specific class of interest; this is the usual method that we follow. For a two-class case, a cutoff value of 0.5 is equivalent to this most probable class method: if we have just two classes, we want to apply the most probable class method and we are just looking to minimize the overall misclassification error, then, since the probability of a record belonging to class 1 is, say, p, the probability of that record belonging to class 0 is 1 minus p, and therefore a probability cutoff of 0.5 can be used as an appropriate cutoff value for the two-class scenario under the most probable class method.
However, that might lead to overfitting; those things we have already discussed in that particular lecture. The second scenario is the class of interest, wherein we have a user-specified cutoff value. Here we might have one particular class which is a low probability event, and therefore we would like to identify more members of that particular class; we have already talked about different scenarios using the cost-based method, misclassification costs and other things in a previous lecture. For the class of interest, if we are dealing with a two-class case, typically a value that is greater than the average probability value for the class of interest but less than 0.5 can be used; this is just a rule of thumb based on modeling experience.
However, depending on the probability values, and if we have the cost matrix, we can find a good enough or optimized cutoff value in the class-of-interest scenario. So, these are the two typical steps: first we estimate the probabilities, and then these probabilities are used to classify observations. Now, let us move forward.
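As a hedged illustration of these two steps in R (glm with a binomial family is the standard base-R route; the data frame, formula and variable names here are assumptions):

# Step 1: estimate class-membership probabilities with a logistic regression model
fit   <- glm(Promo_Offer ~ Income + Spending + Age, data = dftrain, family = binomial)
p_hat <- predict(fit, newdata = dfvalid, type = "response")   # estimated P(class = 1)
# Step 2: classify using a cutoff; 0.5 corresponds to the most probable class method
pred_class <- ifelse(p_hat >= 0.5, 1, 0)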
So, now I will discuss more about the logistic regression model as such. As we talked about, this is for the classification task, and this particular model is typically used in cases where a structured model is preferred over data-driven models; we have discussed a few data-driven models, like naive Bayes and k-NN.
However, when we require a structured model, when we have some assumed functional form between the outcome variable and the set of predictors, or we understand that there is some sort of relationship between the predictors and the outcome variable and we are looking to model it, then logistic regression is the equivalent structural technique for the classification task, just as linear regression is for prediction tasks. Those could be the cases where it is used. Now, if we try to model the categorical outcome directly as a linear function of predictors, where the predictors could be continuous or categorical, the range of that linear function would be minus infinity to plus infinity.
So, there are going to be certain problems, some of which are listed on this slide as well. The first is the inability to apply various mathematical operators, because the categorical outcome variable is more often than not a nominal variable. As we have already discussed in one of the earlier lectures, only the equal to and not equal to operators can be applied to a nominal variable, and therefore linear modeling cannot be applied directly.
The second is the variable type mismatch, which is also obvious: the outcome variable is categorical, while the set of predictors may contain both continuous and categorical variables, so the interpretation and the modeling do not match.
The third thing to understand is the range reasonability issue. For example, if we express the categorical outcome variable in numeric code format, where 0 represents the first class, 1 the second class, and the m-th class is represented by m minus 1, then the LHS, that is the outcome variable side, takes one of the values from 0 to m minus 1.
On the RHS we have the predictors, and since some of those predictors could be numeric variables, the range of the linear combination could be minus infinity to infinity. Therefore, the range reasonability issue also arises.
So, to summarize: only the equal to and not equal to operators can be applied, there is a variable type mismatch, and there are range reasonability issues. Because of these, a categorical outcome variable cannot be directly modeled as a linear function of predictors. Let us move forward.
So, what do we do in the logistic regression model then? Instead of using the outcome variable Y, which is categorical, a function of Y called the logit is used in the model.
Let us understand what this function is about. Before we reach the standard logistic regression model and discuss the logit and other related concepts, let us think about modeling the probability value itself as a linear function of predictors, specifically in a two-class scenario. If we have just two classes, as mentioned on this slide, we only need to estimate one probability value, the probability of belonging to class 1; the other is 1 minus this probability.
So, the probability value P would be expressed as a linear function of the predictors: P = beta 0 plus beta 1 x 1 plus beta 2 x 2 plus up to beta p x p, if there are p predictors.
So, what is the problem with this formulation? If we are in a two-class scenario and express the probability of class 1 membership as a linear function of the predictors, what problems arise? Let us look at the LHS range.
The LHS range improves from a set of just two values, 0 and 1, to the continuous range from 0 to 1, because the LHS is now a probability. However, it still cannot match the RHS range, which is minus infinity to infinity.
So, what can we do? Can we bring the RHS range down to the 0 to 1 level? It is possible with a transformation, and it has to be a non-linear one: we need a non-linear transformation such that the linear function of predictors on the right hand side starts taking values only between 0 and 1.
As you can see on the slide, a non-linear function of this form, called the logistic response function, can be used to perform this transformation. The probability value P can be expressed as 1 divided by 1 plus the exponential of minus (beta 0 plus beta 1 x 1 plus beta 2 x 2 plus up to beta p x p), where the expression inside is the linear function of our predictors. You can see that the earlier RHS now sits as the negative power of the exponential.
Now, irrespective of the values taken by this linear function of predictors, the value of the right hand side will be in the 0 to 1 range, so the left hand side and the right hand side are both in the same range, 0 to 1.
The left hand side is a probability value, which in a two-class case, or for that matter any case, lies between 0 and 1, and the right hand side is now also between 0 and 1. The exponential term in the denominator can range from 0 to infinity: when it approaches 0 the expression approaches 1, and when it approaches infinity the expression approaches 0. Therefore, the right hand expression stays in the range 0 to 1.
Once these two ranges are reconciled, range reasonability holds and we can move forward.
(Refer Slide Time: 16:02)
The previous equation can be rearranged in this form: on the LHS we have P divided by 1 minus P, and on the RHS we have the exponential of beta 0 plus beta 1 x 1 plus beta 2 x 2 up to beta p x p. In this formulation the range has changed, but it is now 0 to infinity for both sides, LHS and RHS.
If we look at the RHS, this is more of a proportional form; if we use this model the interpretation would be in percentage terms. However, we can apply one more transformation to bring it to the standard form equivalent to linear regression, which is the objective.
We would like to reach the same standard formulation that we use in linear regression for logistic regression as well. Now, the LHS expression P divided by 1 minus P is nothing but the definition of odds. Odds is another measure of class membership, defined as odds = P / (1 - P); that is, the odds of belonging to a class is the ratio of the probability of class 1 membership to the probability of class 0 membership.
It is the ratio of the chances of success to the chances of failure. So, instead of writing P divided by 1 minus P, we can replace it with the odds, which is this other measure of class membership.
The odds metric is popular in sports, horse racing, gambling and many other areas. You can also see that the range of the odds metric is from 0 to infinity. The previous equation can now be rewritten as odds = e to the power (beta 0 plus beta 1 x 1 plus beta 2 x 2 up to beta p x p). The range is now 0 to infinity for both LHS and RHS; we had already matched these ranges and the same continues to hold.
Now, if we take the log on both sides of this equation, we get the log of odds on the left and, on the right, the linear function of predictors that we wanted: beta 0 plus beta 1 x 1 plus beta 2 x 2 up to beta p x p. Once we take the log, let us check range reasonability again. On the LHS, the log function takes values from minus infinity to infinity, and the RHS, being a linear function, also takes the same range, minus infinity to infinity.
So, the ranges match, and this formulation, the log of odds expressed as a linear function of the predictors, beta 0 plus beta 1 x 1 and so on, is the standard formulation of the logistic model.
This mirrors the formulation that we typically use in linear regression modeling. The term on the left hand side, the log odds, is called the logit. As we said, instead of using the categorical outcome variable we will use the logit. So, this is the expression for the logit: the logit is nothing but the log odds, which is log(P / (1 - P)). This logit is used as the outcome variable in the model instead of the categorical Y.
Now, odds and logit, as we have understood, can both be written as functions of the probability of class 1 membership. To understand the relationship between these three metrics better, let us look at a few plots in RStudio.
(Refer Slide Time: 21:12)
What we will do is create some plots for this: odds versus the probability of class 1 membership, and also logit versus the probability of class 1 membership.
We will try to understand the range reasonability as well as how the values behave. Odds, as we talked about, can be expressed as P divided by 1 minus P. The curve function can be used to plot this mathematical relationship. Its first argument is the expression, in this case p divided by 1 minus p.
This is the expression we want to plot. Here p can range from 0 to 1 because it is a probability value, and that range is specified in the second and third arguments, from 0 and to 1. The type of the plot is going to be "l", that is, a line plot, and the xname argument is p because it is the value of p that we are varying from 0 to 1.
The label for the x axis is going to be probability of success, because these values are probabilities of class 1 membership, and the label for the y axis is odds.
Let us execute this code; we see a plot here, let us zoom in.
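A minimal sketch of the call described above, using only base R; the exact styling arguments of the lecture's script are not shown, so treat this as an approximation:

# odds as a function of the probability p, plotted as a line over [0, 1]
curve(p / (1 - p),
      from = 0, to = 1,
      type = "l",
      xname = "p",
      xlab = "Probability of success",
      ylab = "Odds")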
(Refer Slide Time: 22:57)
Now, you can see that on the x axis the values range from 0 to 1, and on the y axis the values range from 0 to infinity; however, we are only showing 0 to 100. On the y axis we have the odds. As the values increase from left to right on the x axis, from 0 to 1, the value of the odds keeps increasing.
As we approach about 0.8 the increase in the odds values becomes sharp, and as we reach 1 on the x axis the odds value on the y axis goes to infinity. So, this is the plot for the odds.
If we consider the typical default cutoff value of 0.5, we can read off the corresponding odds value at that point on the plot.
The corresponding odds value is going to be 1. Let us move forward. Similarly, we can generate a plot for logit versus p, the probability of class 1. The logit can be expressed as the log odds, which is nothing but log(p / (1 - p)). The same thing can be expressed in the curve function, where the first argument is log(p / (1 - p)). The second and third arguments remain the same, from 0 to 1, because probability values range from 0 to 1. The type of plot is again "l", the xname is p, since p is the variable being varied, the rest are styling details, and we have specified appropriate labels for the x axis and y axis.
Let us create this plot as well, and also create the x axis: you can see that the last argument in the curve function is xaxt = "n", which means the x axis was not plotted; we will create the x axis separately. Let us do this, and now let us zoom into the plot.
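A minimal sketch of the logit-versus-probability plot described above, again in base R; the x-axis tick positions below are illustrative choices, not the lecture's exact values:

# logit as a function of the probability p, with the default x axis suppressed
curve(log(p / (1 - p)),
      from = 0, to = 1,
      type = "l",
      xname = "p",
      xlab = "Probability of success",
      ylab = "Logit",
      xaxt = "n")
axis(1, at = seq(0, 1, by = 0.1))   # draw the x axis separately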
From here you can see that the values along the x axis range from 0 to 1 for this plot, and on the y axis we have the logit, whose values range from minus infinity to infinity; however, in this particular plot we mostly see values from minus 4 to 4. As we move from left to right along the x axis, that is, as the probability values increase from 0 towards 1, you see that at 0.5 the logit value becomes 0. That is because the expression is log(p / (1 - p)), and p / (1 - p) equals 1 when p is 0.5.
Log of 1 is 0, so at the probability value p = 0.5 the logit is 0. When the probability values are close to 0, the logit values are close to minus infinity, that is, large in magnitude on the negative side; as we increase the values from 0.5 towards 1 the logit values keep increasing, and as we consider values close to 1 the logit values approach infinity.
So, the range for the logit is minus infinity to infinity, while the probability ranges from 0 to 1. This is the relationship between odds, logit and probability value. From this we can understand that, for the standard formulation of the logistic model that we just saw, once we estimate those beta values it is always possible to compute the corresponding probability value from a logit value.
With this, let us do one exercise in R. The data set for the logit model that we are going to develop is the promotional offers data set, which we also used for the previous technique, classification and regression trees. Since we are already familiar with the data set and its variables, we will first start by building a simple logistic model, just like simple linear regression, where we regress the outcome variable on one particular predictor.
In this specific case, we will build the model between the promotional offer variable and the income variable. Let us first load the xlsx package and import the data set, the promotional offers data set that we have already seen in the previous technique and those lectures. We have 5000 observations, and the task is to build a classification model for whether a particular customer or new user is going to accept or reject a promotional offer. Now, using logistic regression, we will try to build such a classification model. You can see that the data set is now imported into the R environment.
Let us apply the structure function. We can see the variables income, spending and promotional offer. First we would like to build a simple logistic regression model between promotional offer and income, income being the predictor and promotional offer being the outcome variable.
We are familiar with the other variables as well: age, pin code, experience, family size, education, and online. First let us take a backup of this data frame. In this logistic regression exercise we would not like to include pin code because, as we have understood, it has too many categories to deal with, so for now we will not consider this variable for logistic regression.
Let us make that change in the data frame. Now, two variables need conversion: promotional offer, our outcome variable, should be made a factor variable, and so should online, which indicates whether the particular customer is active online or not. Let us change these two variables and look at the structure again; now all the variables are in the appropriate format: income and spending are numeric, promotional offer is a factor, age, experience and family size are numeric, and education and online are factors. As we have been doing for the other techniques, let us first partition the data: 60 percent of the observations go into the training partition and 40 percent into the test partition. Let us create this 60-40 partitioning.
Now let us build the model. As we said, we will just build a model between these two variables, promotional offer and income, that is, with a single predictor. glm is the function used to build a logistic regression model. In the glm function we first pass the formula for our model, promotional offer ~ income, and the second argument is the family we have to pick. The family tells the function that we would like to build a logistic regression model: you can see family binomial, since logistic regression belongs to the binomial family, with link logit because we would like to build a logit model, and the data is the training partition. Let us run this code to fit the logistic model.
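A sketch of the steps described so far, under assumed names: the file name, the data frame df, the partitions dftrain and dftest, and the column names Promotional.Offer, Online and Income are illustrative stand-ins for whatever the lecture's actual script uses.

library(xlsx)                                    # package used in the lecture to import the data
df <- read.xlsx("promotional_offers.xlsx", 1)    # hypothetical file name, first sheet
df$Promotional.Offer <- as.factor(df$Promotional.Offer)
df$Online <- as.factor(df$Online)

set.seed(1)                                      # 60-40 partitioning
train_idx <- sample(nrow(df), round(0.6 * nrow(df)))
dftrain <- df[train_idx, ]
dftest  <- df[-train_idx, ]

# single-predictor logistic regression: promotional offer ~ income
mod <- glm(Promotional.Offer ~ Income,
           family = binomial(link = "logit"),
           data = dftrain)
summary(mod)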
Let us look at the summary. In this summary you see that we have an intercept term and an income term, with estimates for both.
So, we have beta 0 and beta 1, beta 1 being for income, and we can see that both of these estimates are significant. Now, if we want to express the final model, we have to extract these beta 0 and beta 1 values. This is how we can extract them: the model object we have just created has a coefficients attribute, and within the coefficients we have just these two values, which we can extract in this fashion. The unname function simply removes the names from this coefficients attribute; we are not interested in the names, only in extracting the values, so that we can write the formulation of the final fitted logistic regression model. Let us execute these two commands. The fitted model can then be written as the probability of a particular customer accepting the promotional offer, given their income level x, expressed as 1 divided by 1 plus e to the power minus (b0 plus b1 times x).
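A sketch of the coefficient extraction and the fitted probability model described above, continuing with the assumed object mod from the earlier glm call; the helper p_accept is hypothetical, introduced only for illustration.

b0 <- unname(mod$coefficients[1])   # intercept estimate
b1 <- unname(mod$coefficients[2])   # income coefficient estimate

# fitted model: P(offer accepted | income = x) = 1 / (1 + exp(-(b0 + b1 * x)))
p_accept <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))
p_accept(100)   # predicted acceptance probability at an income of 100 (illustrative value)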
So, the fitted single-predictor model can be expressed in this form. If we want, we can also plot it. Let us look at the range of the income variable and create a plot between income, the predictor in this case, and promotional offer, the outcome variable.
Now let us add the curve of the model we have just fitted as well: using the curve function we can express the same model and add it to the plot.
(Refer Slide Time: 34:07)
So, this is the plot for our logistic response model: the dots that you see are our observations, and this is our fitted model. The income values run along the x axis on their own scale, and promotional offer, which here is nothing but the probability value, ranges from 0 to 1.
At this point we will stop; more on logistic regression and on this particular plot we will discuss in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 47
Logistic Regression - Part II
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture, we started our discussion on logistic regression. We talked about the logistic regression model and its standard formulation, we talked about different issues, for example the categorical outcome variable and the inability to model it directly as a linear function of predictors, and then we discussed the steps through which the standard logistic regression formulation is actually arrived at. So, we talked about this probability model, right.
(Refer Slide Time: 01:01)
That is, the probability model using the logistic response function. Then we looked at the odds model, the odds expressed as an exponential whose power is the linear function of predictors. That was the odds model, and then the logit model, the last one, which is the standard formulation: log odds, also called logit.
So, logit = beta 0 plus beta 1 x 1 up to beta p x p. We talked about these three models, with log odds, or logit, being the standard formulation of the logistic model. It is this logit, or log odds, model that is used to estimate the parameters, the beta coefficients, and from it we can always derive the odds model and also the probability model. We also did a small exercise in R where we generated a few plots and looked at the relationship between odds, logit and probability, and we built one simple logistic regression model with promotional offer as the outcome variable and income as the single predictor, and we also plotted the fitted model in the previous lecture.
(Refer Slide Time: 02:37)
So, let us go back to the point where we stopped in the previous lecture.
Let us load the data set again. This is the package, and promotional offers is the data set that we used in the previous lecture, so let us reload it. In the previous lecture we stopped at the plot, so we will start our discussion from the same point. Later on we will also build the model using all the predictors, and then we will discuss some of the interpretation related issues that we generally face in a logistic regression model.
The data is now loaded; let us remove the NA columns; this is the structure of the data.
Let us take a backup. We would not like to use the pin code variable, so let us get rid of it, then convert promotional offer and online to factor variables and check the structure again. These are the variables that we used in the previous lecture. Then we do the partitioning. We also talked about the function glm; glm is the function that can be used to build a logistic regression model in R.
So, let us build this. We had done up to this point in the previous lecture: we can see the estimate for the constant term, that is the intercept, and for income as well, both being significant. Using these two parameter estimates we can specify our probability model, as we talked about in the previous lecture, because the main idea is to estimate the probabilities.
The first step is to estimate the probability values and then use them to perform the classification in the second step. We can extract these parameters, b0 for the constant term and b1 for income, using the unname function to remove the names carried by mod$coefficients, and then extract the individual values for the constant term and income as per this code.
Once this is done, you can see in the commented out section that the fitted model can be written in this fashion: the probability model is P(promotional offer = yes | income = x) = 1 divided by 1 plus the exponential of minus (b0 plus b1 times x).
Here x is income, and the b0 and b1 values have already been estimated by the logistic regression model. Now, these values can be plugged into this expression, and for any value of x, that is income in this case, we can compute the corresponding probability value.
(Refer Slide Time: 06:20)
In the previous lecture we then moved on to creating a plot for the same model, so let us do that again. The range of income is 6 to 205, which we capture appropriately through the xlim argument, and this scatter plot is between income and, as you can see here, promotional offer.
Now, promotional offer is the categorical variable we had converted into a factor; if we want to plot it in a scatter plot we have to convert it back into a numeric variable, and when we do so we will have just the two values 0 and 1. This is how we convert it back to numeric: promotional offer is first coerced into a character vector using the as.character function, and then that character vector is coerced using as.numeric. This two-step coercion gives the desired result, a numeric version of promotional offer.
Once this is done we can plot it. You can see the type of plot is "p", which means we just want to plot the points corresponding to the x axis and y axis values; we are not plotting a line, rather we are plotting points. Let us create this plot.
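A sketch of the scatter plot described above; the data frame dftrain and the column names are carried over from the earlier assumed code, and the xlim values are illustrative.

plot(dftrain$Income,
     as.numeric(as.character(dftrain$Promotional.Offer)),  # factor -> character -> numeric (0/1)
     type = "p",                  # points only
     xlim = c(0, 210),            # income roughly ranges from 6 to 205
     xlab = "Income",
     ylab = "Promotional offer")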
(Refer Slide Time: 07:57)
We create this plot and see that, because promotional offer takes just the two values 0 and 1, some of the observations are at 0, in fact the majority of observations are at the 0 level, and the remaining observations are at 1. All the points sit at these two extreme values, 0 and 1, because promotional offer, originally a categorical variable, has been converted into numeric form. So, just two extreme values are depicted for all the points here.
Now, as we talked about and just saw, the relationship is non-linear in essence because we use the logistic response function; that non-linear function is used to fit these points. So, treating promotional offer as numeric, with income already being a numeric variable, after creating this scatter plot we try to fit it using our logistic response function, which is non-linear.
This is how we can add this curve. The estimates beta 0 and beta 1 we have already computed, and they are used directly through the expressions mod$coefficients[[1]] and mod$coefficients[[2]], with double brackets so that we do not have to worry about the names that are there.
This is another way to strip the names from the coefficient result. Then x is the variable that is going to be varied, so in the xname argument you can see that x is mentioned, and this curve is going to be added to the existing plot since the add argument is TRUE. Let us create this, just as we did in the previous lecture.
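A sketch of adding the fitted logistic response curve to the scatter plot, continuing with the assumed object mod from the earlier glm call; the from/to range is illustrative.

curve(1 / (1 + exp(-(mod$coefficients[[1]] + mod$coefficients[[2]] * x))),
      from = 0, to = 210,   # span of the income axis
      xname = "x",          # x stands for income here
      add = TRUE)           # overlay on the existing scatter plot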
So, what we have done in this plot is first create a scatter plot between the outcome variable of interest, promotional offer, and income, the single predictor, and then add the fitted curve. The fitted curve here is the logistic response function, whose expression and estimates we have already seen.
You can see that some of the 0 values lie along this curve, and as the values of income increase the fitted probability value keeps increasing. These probability values are the ones we are going to use to classify observations into class 0 or class 1. This is the logistic curve that we are using to fit the data, and eventually, as the income value increases further to the right, the curve approaches 1.
You can see that the first few observations belonging to class 0 are fitted by this non-linear curve near its lower end; if there had been a few observations in the top right corner, they would have been fitted by the curve as it approaches 1, but in this case most of the points are on the lower side.
In another problem or another data set, some of the points could have been up there as well. Some observations are fitted where the curve approaches 1, some where it starts from 0, and the remaining observations lie in between.
Now, for new observations, once we estimate the probability values using this logistic response function, we will have to use a cutoff, as you can see from here, to classify these observations into either class 0 or class 1. From this plot we can also understand another thing. Income is the only predictor we have modeled in this single-predictor logistic regression model, so the x axis, in a way, also represents the logit values, and the logistic response curve gives the corresponding probability values. So, this curve also gives us an indication of the relationship when the logit is on the x axis and the probability values are on the y axis.
As the logit values increase along the x axis, the probability values also increase. Let us come back to our discussion in the slides. In the logistic regression model we predict the logit values, as we saw in the output.
Let us scroll through the output once again; this was the single-predictor model that we had.
So, we predict the logit values and, therefore, the corresponding probability of the categorical outcome: using those output values and the logistic response function with the same parameters, we can compute the probability values. These predicted probability values become the basis for classification; the first step was to estimate the probabilities, and once we have them they can be used for classification. In essence, in the logistic regression model we are fitting a model just like a multiple linear regression model: we have logit values, which can be considered a numeric outcome variable, being fitted as a linear function of predictors, just like linear regression.
In a way, we can define logistic regression as a prediction model used for a classification task. We first fit a prediction model: instead of using the categorical outcome variable itself, we use a function based on it, the logit, and that logit variable is modeled as a linear function of the predictors. Essentially, we build a prediction model, then from those predicted values we compute the probabilities and use them for the classification task.
Now, let us talk about the estimation technique. The estimation technique we use in multiple linear regression is typically ordinary least squares, OLS. That method cannot be used here because the logistic regression formulation is non-linear: we are estimating probability values through the logistic response function, which is a non-linear function. Therefore, the least squares estimation method is not considered an appropriate technique for estimating parameters in a logistic regression model.
Instead, another method, called the maximum likelihood method, is used to estimate the parameters. The main idea behind the maximum likelihood method, or MLM, is to find the estimates that maximize the likelihood of obtaining the observed training data under the model.
(Refer Slide Time: 16:40)
So, we look to compute the estimates that maximize the likelihood of obtaining the observations if those very estimates were used.
Of course, this method requires a number of iterations to reach those estimates, and as you can see from the main idea itself, it is less robust than the ordinary least squares estimation technique that we use in linear regression.
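A small sketch, not from the lecture, illustrating under the assumed objects from earlier that glm fits by maximum likelihood: the log-likelihood reported by the fitted model can be recomputed from its predicted probabilities.

ll_reported <- logLik(mod)                       # log-likelihood reported by glm()

p_hat <- predict(mod, type = "response")         # fitted probabilities on the training data
y     <- as.numeric(as.character(dftrain$Promotional.Offer))
ll_manual <- sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))

ll_reported
ll_manual                                        # should closely match the reported value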
So, the maximum likelihood method is less robust than OLS, and the reliability of MLM estimates also depends on a number of things, for example the outcome variable categories having adequate proportions. Logistic regression is probably not suitable for situations where very few records in the training partition belong to class 1 or class 0; we would like to have an adequate number of records in all classes, which improves the reliability of the estimates, along with an adequate sample size with respect to the number of estimates.
This technique is sensitive because of the estimation method we are using, the maximum likelihood method. Because of that, we require an adequate sample size with respect to the number of parameters to be estimated; depending on the number of coefficients, we need an adequate sample size so that the estimates we get are reliable. Some of the other issues that we face in logistic regression are similar to linear regression, for example collinearity.
Collinearity related issues are very much similar to those in linear regression: if some of the independent variables in our model are collinear, that could be problematic for our logistic regression results as well, just as in linear regression, so much of the discussion we had for the multiple linear regression technique also applies here.
Let us go back to the RStudio environment.
(Refer Slide Time: 19:30)
What we are going to do now is use the same promotional offers data set to build a model with all the predictors. Till now we have built a model with a single predictor, income; these are the predictors, as we already know.
(Refer Slide Time: 19:55)
Now, we would like to build a logistic regression model of promotional offer versus all the other predictors: income, spending, age, experience, family size, education and online. The function we use for this is the same, glm, with promotional offer now regressed against all the predictors and the other arguments unchanged.
Let us execute this and look at the summary results.
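A sketch of the full model described above, continuing with the assumed data frame dftrain; the dot on the right-hand side of the formula means "all remaining variables" in that data frame.

mod_full <- glm(Promotional.Offer ~ .,
                family = binomial(link = "logit"),
                data = dftrain)
summary(mod_full)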
Looking at the output, the first thing we get is the call, the actual code that we used to perform this logistic regression modeling.
(Refer Slide Time: 20:49)
You see the deviance residuals here, the basic statistics about the residuals, and then we see the coefficients for all the predictors.
The intercept is significant. Then we see income, which was significant when it was used as a single predictor, and in the presence of the other predictors this variable still remains significant. The second one is spending, and you can see a dot here; this predictor is also significant at that level. For a dot, the p value has to be less than 0.1, which is the case for spending: the p value is 0.07249. So, at the 90 percent confidence level the spending variable is also significant.
The income variable, meanwhile, is significant at the 99.9 percent confidence level. If you look at the other variables: age is not found to be significant, experience is also not found to be significant in this particular problem and data set, while family size is significant at the 99.9 percent confidence level, as we can understand from the three-star significance. The next variable, education HSC, is also significant at the 99.9 percent confidence level. After this we have education post grad, which is not found to be significant, as you can see from its quite high p value, and then we have online.
The online variable had two levels: 1 if a particular customer is active online and 0 if not, and you can see the dummy variable online1 being used. This also comes out to be insignificant, as you can see from its p value.
From this model we can see three predictors, income, family size and the dummy variable education HSC, that are significant at the 99.9 percent confidence level, and spending also seems to be significant at the 90 percent confidence level.
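A small sketch, not from the lecture, for pulling the coefficient table out of the summary of the assumed mod_full object and flagging the predictors at the significance levels discussed above.

coefs <- coef(summary(mod_full))          # estimates, std. errors, z values, p values
coefs[coefs[, "Pr(>|z|)"] < 0.1, ]        # predictors significant at the 90% level or better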
Now, if we look at the interpretation of these results, we can see that the income coefficient is positive, with a value of 0.063936. What we can say is that for a unit increase in income we would see a corresponding increase in the logit value, and because of that some increase in the probability, and therefore in the probability of accepting the promotional offer. Similarly, the spending coefficient comes out to be 0.0921550, also positive, so a unit increase in spending will also have a corresponding influence on the logit values; however, this is only at the 90 percent confidence level.
So, some analysts might not consider this variable. Still, spending has a corresponding influence on the logit values, and therefore a corresponding change in the probability values and in the acceptance of the promotional offer. If we look at the other significant variable, family size, it is significant and the coefficient is also on the higher side, 0.487969.
This is significant at the 99.9 percent confidence level, so a unit increase in family size will have a corresponding increase in the logit value, the probability value will also change, and that indicates an increased level of acceptance of the promotional offer.
The only dummy variable that is significant is education HSC. As we can understand, in this case the education variable had three categories, HSC, grad and post grad, and because of the alphabetical ordering of these categories, grad has been taken as the reference category, while HSC and post grad appear as dummies in the model.
Therefore, the interpretation of these dummy variables is with respect to the reference category, grad. Education HSC is significant, with three-star significance, and its coefficient is quite large but with a negative sign, minus 4.351366; the direction is negative. From this, what we can say is that for customers having education up to the HSC level, that is, customers who are twelfth pass,
they have significantly lower levels of acceptance of the promotional offer in comparison to the reference category, graduates. The second dummy variable, education post grad, was not found to be significant, so we cannot make any distinction between customers who are graduates and those who are postgraduates; however, customers with twelfth pass education have a lower level of acceptance in comparison to the reference category, graduates. In this fashion we can go about interpreting the results of logistic regression.
Now, we can look at a few more plots to understand this better, for example probability versus odds and probability versus logit. In the previous lecture we generated two plots, odds versus probability value and logit versus probability value; now we will look at how the plot looks when the odds are on the x axis and the probabilities are on the y axis, and likewise with the logit on the x axis.
As you can see, in the curve function we have to write the equation: odds divided by 1 plus odds equals the probability value, and we vary the odds from 0 up to a large enough value such as 100 or 200, since the odds can range from 0 to infinity. The type of plot is going to be a line, and the xname is odds.
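A sketch of the probability-versus-odds plot described above; the upper limit of 200 is just a convenient stand-in for "large", since the odds run to infinity.

curve(odds / (1 + odds),
      from = 0, to = 200,
      type = "l",
      xname = "odds",
      xlab = "Odds",
      ylab = "Probability of success")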
With this, let us create the plot. As you can see, as we move along the x axis from left to right, that is, as the odds increase, the probability of success also increases.
The change in probability for a given change in odds is sharpest while the odds are small; once the odds exceed one, the probability of success is already above 0.5 and further increases become gradual. That value of one corresponds to the default cutoff we have been talking about for the typical two-class scenario: under the most probable class method the default probability cutoff is 0.5, and if we use odds instead of probability as the cutoff criterion, the corresponding value is one, because probability is odds divided by one plus odds. So, if the probability value is 0.5, the corresponding odds value is one.
So, once the odds value crosses one you can see probability values above 0.5; this is the relationship. Similarly, we can look at the curve of probability versus logit. Now the equation involves the logit values: since we are building a logistic regression model, from the estimates and the predictor values we can compute the logit values, and from those logit values we can compute the probability values. This is the expression, the exponential of the logit value divided by one plus the exponential of the logit value. Because the logit can range from minus infinity to infinity, I have given a wide enough range, minus 100 to 100; the type of plot is again a line, the xname is logit, and the other arguments are quite similar.
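A sketch of the probability-versus-logit plot described above; the range of minus 100 to 100 follows the lecture, standing in for the full range minus infinity to infinity.

curve(exp(logit) / (1 + exp(logit)),
      from = -100, to = 100,
      type = "l",
      xname = "logit",
      xlab = "Logit",
      ylab = "Probability of success")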
Let us create this plot. This is the plot, as you can see here.
This plot looks quite similar to what we saw for the logistic response function. As the values increase along the x axis from the negative side, from minus 100 to minus 50 up to 0, there is only a very slight increase in the probabilities, which remain close to 0; closer to a logit value of 0 we see a sudden increase in the probability values, and after that, as we move towards the higher values, the curve remains close to 1.
So, for large negative logit values the probability values are close to 0, for positive logit values the probability is close to 1, and near 0 there is the switch from values close to 0 to values close to 1; that is where the significant change happens. With this we will stop here, and we will continue our discussion on logistic regression in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 48
Logistic Regression-Part III
Welcome to the course Business Analytics and Data Mining Modeling Using R. In previous lectures we have been discussing logistic regression, and in the previous lecture we talked about different aspects of model interpretation in logistic regression. We will discuss the interpretation of results further, along with the issues that arise because of the non-linear function and the three different models that we typically use: the logit model, the odds model and the probability model. So, let us start.
So, let us write a particular logit model, the one we typically use. Suppose we have p predictors and this is our model; how do we interpret the results?
Once we have built the model, we have estimates of all these values, all the betas. So, how do we interpret the results? As you know, the logistic regression model is typically used for classification tasks, our outcome variable is categorical, and we have already understood all the issues that follow from that.
We look to estimate probability values, and for that we need this logit formulation. Once the estimates have been computed, we can use them to compute the corresponding probability values, as we saw in the previous lecture.
The interpretation, however, remains slightly tricky. For example, suppose there is a unit increase in X 1 or X 2, one of these predictors, that is, a one unit change in those values.
The corresponding change in the logit value is determined by the estimates beta 1, beta 2: if the previous value of the term was beta 1 times X 1 and there is a one unit change in the value of X 1, then you can see that the corresponding change in the logit value is going to be just beta 1.
So, the betas are the additive factor. Irrespective of the actual values of X, if we interpret the model in terms of change, the change depends only on the beta values. The same thing is mentioned here: beta plays the role of an additive factor, so if there is a unit change in any predictor value, the corresponding beta change is seen in the logit value.
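A small sketch, using the assumed mod object from the earlier single-predictor exercise, showing the additive-factor interpretation: a one-unit change in income shifts the logit by exactly its coefficient, regardless of the base value.

logit_at <- function(x) predict(mod, newdata = data.frame(Income = x), type = "link")
logit_at(101) - logit_at(100)     # difference equals the income coefficient
unname(coef(mod)["Income"])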
If beta is positive, an increase in X leads to a corresponding increase in the logit value; if beta is negative, the direction changes, and any increase in X, the predictor's value, leads to a corresponding decrease in the logit value. In the previous lecture itself we understood the relationship between logit and probability values.
We had created the plot showing the relationship between P and the logit values. From it we can also understand that if there is a one unit change in a predictor's value and beta is positive, the logit value will increase, therefore the probability value will also increase, and that could lead to an increase in acceptance of, for example, the promotional offer in the example we have been using.
So, an increase in the probability value translates into an increased acceptance level for the promotional offer; however, as you can see from the curve, if the increase happens in the flat zone, the probability might still not reach the level at which the observation would be classified as class 1.
The zone where the probability value starts to change significantly is around a logit value of 0, as we saw in the previous lecture: when the logit value is close to 0 we see sudden spikes in the probability values, and then finally the curve approaches one.
For interpretation purposes, then, if we are using the logit model: if beta is positive, a one unit change in X 1 or X 2, the predictor's value, gives the corresponding change in the logit value, and therefore a higher probability value and a higher acceptance rate for the promotional offer in the example we have been using, and this holds for any value of X.
For any value of X, the interpretive statements remain the same: the logit increases by the additive factor beta 1 or beta 2, depending on the predictor. Now, let us move to the odds model, where the interpretation changes.
(Refer Slide Time: 07:53)
Let us look at the odds formulation now. If we go back to it, the formulation was odds = e to the power (beta 0 plus beta 1 X 1 plus beta 2 X 2 up to beta P X P), if there are P predictors. That is the odds model formulation, and since these betas have already been estimated, the values of these coefficient estimates can be plugged directly into this model as well, and we have the odds model.
We can rearrange this as e to the power beta 0, which acts like a constant multiplicative factor, times e to the power beta 1 raised to X 1, times e to the power beta 2 raised to X 2, times e to the power beta 3 raised to X 3,
and so on, until finally we have e to the power beta P raised to X P for the Pth predictor. For this model, if we make a unit change in X 1, so that X 1 becomes X 1 plus 1, the corresponding term becomes e to the power beta 1 raised to (X 1 plus 1), and effectively the extra factor we get is e to the power beta 1. All of these terms are multiplied together.
So, this model is a multiplicative model, and what we actually get is a multiplicative factor e to the power beta: for each predictor X 1, X 2 up to X P, a one unit change in that predictor's value changes the odds of accepting the offer by a factor of e to the power of its beta.
That is, the odds value will increase by this factor: if the odds value was, say, 1.2 earlier, it now becomes e to the power beta times 1.2, provided all other variables are kept constant. Compare this with the logit model we discussed, where beta was an additive factor.
In the logit model, a one unit change increased the logit value by beta, that is, plus beta; here the change is by e to the power beta, a multiplication, for a unit increase in the predictor's value.
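A small sketch, using the assumed mod_full object, of this multiplicative interpretation: the exponential of a coefficient is the factor by which the odds change for a one-unit increase in that predictor, other variables held constant.

exp(coef(mod_full))             # odds multipliers for each predictor
exp(coef(mod_full))["Income"]   # e.g. the odds factor per unit of income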
However, as you can see, this is an exponential formulation, so the odds values range from 0 to infinity. Therefore, even when a beta coefficient is less than 0, the odds value still remains greater than 0. Recall that in the logit formulation the values could be negative: the betas which are negative become multiplicative factors between 0 and 1 in the odds model, and the betas which are positive become factors greater than 1. So, for example, if beta 1 were positive and beta 2 were negative, then e^(beta 1) would be greater than 1 and e^(beta 2) would be less than 1; however, both values would still be greater than 0. In other words, negative values in the logit model transform into factors smaller than 1 in the odds model, and positive values transform into factors greater than 1.
The interpretive statements can still be made irrespective of the predictor value: for any value of X, a one-unit change in X leads to a corresponding increase in the odds value by the multiplicative factor e^beta. So, for both the logit model and the odds model, this is how we can go about interpreting the results.
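To make this concrete, here is a minimal R sketch on simulated data (not the lecture's promotional-offers objects; every name below is illustrative): coef() gives the additive effects on the logit scale, and exp(coef()) gives the multiplicative effects on the odds scale.

# Minimal sketch on simulated data: additive effect on the logit scale
# versus multiplicative effect exp(beta) on the odds scale.
set.seed(1)
x1 <- rnorm(500)
x2 <- rnorm(500)
y  <- rbinom(500, 1, plogis(-1 + 0.8 * x1 - 0.5 * x2))
fit <- glm(y ~ x1 + x2, family = binomial)

coef(fit)        # a one-unit change in x1 adds roughly beta1 to the logit
exp(coef(fit))   # the same change multiplies the odds by roughly exp(beta1)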
However, the probability model behaves differently. The probability model can be written as P = e^(logit) / (1 + e^(logit)); the expression appearing in the exponent is nothing but the logit, so, as we have seen before, we can write the model in this form. From here you can see that if there is a unit change in X 1, that is, X 1 becomes X 1 + 1, the corresponding change in the probability value P is not constant: because of the way the expression is structured, the change in P for a one-unit change in X 1 depends on the actual value of X 1.
In the other two formulations, a one-unit change in X 1 changed the logit by beta 1 and changed the odds by a factor of e^(beta 1); in the probability model, however, we can see from the expression that the change in the probability value depends on the actual value of X 1, and X 1 cannot be eliminated from that change.
Therefore, when we talk about the probability model, the interpretation of the results should be made for specific observations, because the change in probability for a one-unit change in X 1 also depends on the actual value of X 1. So, the probability model is discussed with respect to specific observations, while the general interpretation of predictors and their importance can be done through either the logit model or the odds model.
Now, there is another important aspect that we would like to discuss here: the difference between odds and odds ratio. In many domains these two terms are used quite frequently; however, they are not the same. The term odds, as we have been using it in our logistic regression model, is the ratio of two probabilities: the probability of an observation belonging to class 1 divided by the probability of it belonging to class 0.
The other term, odds ratio, which is also popular in many domains, is slightly different: the odds ratio is the ratio of two odds. So, if we have a categorical variable with m classes, and we take two of them, say class m1 and class m2, then the odds ratio for these two classes is computed as the odds of class m1 divided by the odds of class m2. So, the odds ratio is a ratio of two odds values, whereas odds itself is a ratio of two probability values.
In terms of interpretation, an odds ratio greater than 1 means, in this example, that the odds of class m1 are higher than the odds of class m2; whereas if we simply say that the odds are greater than 1, it means that the probability of belonging to class 1 is greater than the probability of belonging to class 0. So, we should be clear about this difference between the definitions of odds and odds ratio, so that there is no confusion when it comes to the logistic regression model.
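A tiny numerical illustration of the distinction, with made-up probabilities:

# Odds: ratio of two probabilities within one group.
# Odds ratio: ratio of the odds of two groups.
p_m1 <- 0.8                     # P(class 1) in group m1 (made-up number)
p_m2 <- 0.2                     # P(class 1) in group m2 (made-up number)
odds_m1 <- p_m1 / (1 - p_m1)    # 4
odds_m2 <- p_m2 / (1 - p_m2)    # 0.25
odds_m1 / odds_m2               # odds ratio = 16: odds of m1 far exceed odds of m2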
So, what will we do now? We will come back to RStudio and to the model that we built using all the predictors.
(Refer Slide Time: 22:16)
So, this was the model mod 1. We had used the promotional offers data set and regressed the promotional offer variable against all the predictors; the summary results can be seen in the output here. This is the model that we built in the previous lecture, and we had already talked about the significant predictors: income, spending, family size and education HSC. If we restrict ourselves to the 99.9 percent confidence level, then family size, income and education HSC are the significant predictors.
In the previous lecture we also created the two plots, probability versus odds and probability versus logit, and by analyzing those plots we saw how the odds model and the logit model can be used to make interpretations about probability values.
What we will do now is score the training and test partitions using the model that we have just built. If we look at the test partition, df test, which we had already created, it has 2000 observations. We are going to use the predict function to score this partition with mod 1, our logistic regression model with all the variables; for clarity, we are not including the outcome variable when scoring the model. Then we have another argument, type, which controls whether the function returns the estimated logit values or the probability values.
(Refer Slide Time: 23:42)
So, let us look at the help page for predict dot glm, because the model gives us logit values and we need to compute the probability values from them. You can see that predict has an argument type, and within type we have three options: link, response and terms. As you can see, I have used response first; let us see what these options are about. The help describes type as the type of prediction required.
(Refer Slide Time: 24:19)
The default is on the scale of the linear predictors; the alternative, response, is on the scale of the response variable. For a default binomial model the default predictions are therefore of log-odds, that is, probabilities on the logit scale, while type response gives the predicted probabilities. So, if we want the probability values directly, we have to use the response type. Then we have the terms option, which returns a matrix giving the fitted values of each term in the model formula on the linear-predictor scale. So, these are the options that we have: link, which gives the logit values, response, which gives the probabilities, and terms.
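Here is a minimal, self-contained sketch of these type options on a toy model (not the lecture's mod 1 object; the data are simulated):

# Toy logistic model just to demonstrate predict()'s type argument.
set.seed(2)
d   <- data.frame(x = rnorm(100))
d$y <- rbinom(100, 1, plogis(0.5 * d$x))
fit <- glm(y ~ x, data = d, family = binomial)

head(predict(fit, newdata = d, type = "link"))      # logit (log-odds) scale
head(predict(fit, newdata = d, type = "response"))  # probability scale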
(Refer Slide Time: 25:48)
So, let us use these options and see what values we get; mod test holds these scores, and we look at its first 6 values. These are the probability values for the test partition observations. From these estimated probabilities we can then determine the classification. The next one is mod test l, where we use the type link.
So, let us compute this one as well and look at its first 6 values. From here you can see that the link type gives us the logit values. These values are on the negative side, and in the earlier output, which contained the probability values, the corresponding probabilities were quite close to 0; so negative logit values go together with probabilities close to 0. As we move forward you can see one particular observation for which the logit value is positive, and there the probability value is also close to 1.
So, this is consistent with the plot that we saw earlier; the same kind of pattern is visible here. Now we just need the probability values, the mod test values that we have just computed, to classify the observations into the different categories, because this is a two-class case.
So, the default cutoff value of 0.5 can be used, which corresponds to the most probable class method. ifelse is the function we can use for this: if the mod test value is greater than 0.5, we assign that observation to class 1, and otherwise to class 0.
In this fashion we will have the classification. So, let us compute this and look at the first 6 values of the result. You can see that for the different observations, as indicated by the indices in the test partition, the classification has been done appropriately: the first three values had probabilities quite close to 0, so all of them have been classified as class 0; for observation number 10 the probability value was close to 1, as also indicated by its logit value, so that observation has been classified as class 1; and the others, because of their small probabilities and negative logit values, have been classified as class 0.
Once the classification is done we can create our classification matrix from df test promotional offers, which holds the actual values, and the predicted values that we have just computed in mod test c.
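A hedged sketch of this classify-and-evaluate step on toy data (the lecture's objects df test, mod test and mod test c are not reproduced here; everything below is an illustrative stand-in):

# Score, classify at cutoff 0.5, and build the classification matrix.
set.seed(3)
d   <- data.frame(x = rnorm(2000))
d$y <- rbinom(2000, 1, plogis(-2 + 1.5 * d$x))
fit <- glm(y ~ x, data = d, family = binomial)

prob <- predict(fit, newdata = d, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)                # most probable class at cutoff 0.5
cm   <- table(actual = d$y, predicted = pred)   # classification matrix
cm
sum(diag(cm)) / sum(cm)                         # classification accuracy
1 - sum(diag(cm)) / sum(cm)                     # error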
(Refer Slide Time: 29:33)
So, we will have our classification matrix. As you can see in the matrix, 1800 observations have been correctly classified as class 0 and 113 observations have been correctly classified as class 1, while 68 and 19 observations have been incorrectly classified. From these we can compute the classification accuracy and error numbers.
So, for the logistic regression model built on this particular data set we get 95.65 percent classification accuracy, and the error is 4.35 percent. Now, if you are able to recall, previously we had used the same data set with classification and regression trees, and the performance that we saw there, especially for the training partition, was around 98 percent. As we discussed in those lectures, classification and regression trees typically overfit the data: here the performance is about 95 percent, there it was 98 percent, and when we pruned the classification tree the performance dropped to around 97 or 96 percent.
So, comparing the two: for a structured model like logistic regression the performance was close to 95.6 percent, while for a data-driven model like classification and regression trees it was around 97 percent, so there is roughly a 2 percent increase that we can see with the data-driven model. With this we will stop here, and we will continue our discussion on logistic regression in the coming lectures.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 49
Logistic Regression - Part IV
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous few lectures we have been discussing logistic regression, and in the previous lecture specifically we talked about how to interpret the logit model, the odds model and the probability-based model, and we also understood the differences among them in terms of interpretation. So, in this particular lecture let us continue with the exercise in R that we have been doing. We have been using the promotional offers data set, and we would like to complete that exercise.
So, let us load the xlsx library and then the promotional offers data set that we are already familiar with, which has 5000 observations; let us load it into the R environment.
In the previous lecture we built the model, and we also understood and interpreted the results that we got for the promotional offers data. Now we will check the performance of this particular model on the test partition, and for the training partition as well, and we will look at some of the charts like the cumulative lift curve and the decile chart for this data set. As you can see, the observations have now been loaded; in the environment section you can see the 5000 observations. Let us remove the NA columns and look at the structure once again.
So, let us take a backup of this particular data frame. As we talked about in the previous lecture, we are not interested in the pin code variable, which has too many categories, so we would not like to consider it in this model; let us get rid of that column. Now we are left with two categorical variables: online activity, indicating whether a particular individual is active online or not, and promotional offer, our outcome variable of interest, indicating whether the customer accepts the offer or not.
So, let us convert these two to factor variables, that is, categorical variables. These are the variables that we would like to take forward for our modeling exercise: income, spending, promotional offer, and then age, experience, family size, education and online activity.
(Refer Slide Time: 03:00)
So, we follow the same 60 percent, 40 percent partitioning as in the previous lecture; let us do the partitioning. 60 percent of the observations go into the training partition: as you can see in the environment section, df train has 3000 observations on 8 variables, and the test partition has the remaining 2000 observations on 8 variables.
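A minimal sketch of such a 60/40 split; df here is a simulated stand-in for the 5000-row promotional-offers data frame and all names are illustrative:

# Simulated stand-in data frame with 5000 rows.
set.seed(4)
df <- data.frame(income = rnorm(5000, 60, 15),
                 offer  = rbinom(5000, 1, 0.1))

train_idx <- sample(seq_len(nrow(df)), size = 0.6 * nrow(df))  # 3000 row indices
df_train  <- df[train_idx, ]                                   # training partition
df_test   <- df[-train_idx, ]                                  # remaining 2000 rows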
(Refer Slide Time: 03:43)
Now, as we talked about in the previous lecture, glm is the function that can be used. We have already built and discussed the model with a single predictor, so let us move to the model with all predictors. As you can see, in this particular model the formula is promotional offer tilde dot, which means we are going to build the model using all predictors; the other parameters remain the same. So, let us run this model and look at the summary.
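A runnable sketch of the all-predictors call with the tilde-dot formula (simulated stand-in data; the lecture's actual column names differ):

# glm() with family = binomial fits the logistic regression; the formula
# outcome ~ . uses every other column of the data frame as a predictor.
set.seed(5)
df_train <- data.frame(income      = rnorm(3000, 60, 15),
                       spending    = rnorm(3000, 2, 1),
                       family_size = sample(1:4, 3000, replace = TRUE))
df_train$offer <- factor(rbinom(3000, 1, plogis(-8 + 0.12 * df_train$income)))

mod1 <- glm(offer ~ ., data = df_train, family = binomial)
summary(mod1)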
(Refer Slide Time: 04:23)
The results of this particular model we have also discussed; however, there is one slight change in the results. In the previous run, which we did in the last lecture, the spending variable was significant at the 90 percent confidence level; now, as the sample has changed, it is significant at the 99.9 percent level. So, in today's run income, spending, family size and education HSC are the significant variables; three of them were significant at the 99.9 percent level in the previous run as well, and in this lecture's run spending also comes out to be significant. This slight change can happen when we run a particular model multiple times; a larger sample size can give us more stable, more robust results, because the model results depend on the observations: for the training partition we randomly draw observations from the full data set and then use them for training.
So, the observations that are used for the model change every time we run the partitioning, and therefore slight differences in the model can be seen. To repeat the model using the same observations, as we talked about in some of the initial lectures of this course, the set dot seed function can be used; set dot seed allows us to use the same partitioning, the same observations in the training partition, for the modeling as well.
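A quick illustration of how set dot seed fixes the random draw, and hence the partition and the fitted model, across runs:

# Same seed, same draw: the sampled indices are identical both times.
set.seed(100); sample(1:10, 3)
set.seed(100); sample(1:10, 3)   # repeats the draw above exactly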
We have already discussed the results of this particular model, so now let us move forward and check its performance.
So, we would like to score the test partition for probability values, and this is how we can do it with the predict function. As we discussed, the third argument, type, needs to be specified as response to get the probability values, and as link to get the logit values, and then we have to classify the observations manually based on the probability values. So, let us score the probability values and the logit values.
(Refer Slide Time: 07:22)
Then we can classify the observations in this fashion. The cutoff value is 0.5; since this is a two-class case, a cutoff of 0.5 is equivalent to the most probable class method. So, let us use it.
Now let us look at our classification matrix, which we can generate with this code. You can see that, out of the 2000 observations in the test partition, as shown in the environment section as well, 1775 observations have been correctly classified as class 0 members and 125 observations have been correctly classified as class 1 members; the off-diagonal elements, 65 and 35, are the observations which have been incorrectly classified into class 0 or class 1. So, we can go ahead and compute our classification accuracy, which comes out to be 95 percent in this particular run.
If you remember, in the previous run that we did in the last lecture we also got a similar number, near about 95 point something. So, in terms of performance metrics the model is quite stable and robust; in the previous run we got similar performance. The remaining part is the error, which is 5 percent.
Now we can collect the important variables for this particular modeling exercise: the predicted class, which is stored in mod test c, and the actual class, which is stored in the promotional offer variable, and create a data frame of these key variables. We can also include the probability values, mod test, and the log odds, mod test l, that we have already computed, so that we can look at this data frame as a table and see how our model has performed. We can also add the test partition's predictor variables to the same data frame. So, let us look at the first 6 observations of this particular data frame.
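A hedged sketch of assembling such a diagnostic data frame on toy data (for brevity the toy model is scored on the data it was fit on, whereas the lecture fits on the training partition and scores the test partition; all names are illustrative):

# Toy model plus a data frame of predicted class, actual class,
# probability and log odds for inspection.
set.seed(6)
toy <- data.frame(income = rnorm(200, 60, 15))
toy$offer <- rbinom(200, 1, plogis(-8 + 0.12 * toy$income))
fit <- glm(offer ~ income, data = toy, family = binomial)

prob  <- predict(fit, type = "response")
logit <- predict(fit, type = "link")
diag_df <- data.frame(predicted = ifelse(prob > 0.5, 1, 0),
                      actual    = toy$offer,
                      prob      = round(prob, 2),
                      log_odds  = round(logit, 2),
                      income    = toy$income)
head(diag_df)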
You can see the predicted class and the actual class. Our accuracy is 95 percent for this particular model; however, within the first 6 observations themselves you can see one error: for that observation the actual class was 0, but it has been predicted as 1. If we look at the probability value for the same observation, it is 0.57, which is why it has been classified as class 1 even though the actual class is 0.
If we look at the other rows: for the first row the probability value is quite low, so it has been correctly classified as class 0; for row number 2 the probability value is 0.98, quite close to 1, so it has been correctly classified as class 1; and the erroneous row has a probability value just above 0.5, so it has been classified as class 1 even though the actual class was 0. For the 3 remaining rows the probability value is much less than 0.5, quite close to 0, so all of them have been correctly classified as class 0.
You can also look at the log odds values. As we had seen in the plot of probability versus log odds, that is, the logit values, logit values on the negative side correspond to probability values quite close to 0; you can see the same thing in all the rows where the logit values are negative. Similarly, as we saw in the plot in the previous lecture, positive logit values typically mean a higher probability value, close to 1; that is reflected in row number 2, with its positive logit value and correspondingly high probability. We also saw that when the logit value is around the 0 mark there is a sudden change in the probability value; all the variation in the probabilities comes around the region where the logit value is near 0.
For example, you can see a logit value of 0.28, near the 0 mark, and the corresponding probability value near 0.5, in this case 0.57. These are the cases where the logit value is close to 0, which means the probability value will be close to the 0.5 mark in either direction; those are the cases which are difficult for a model to classify, and as we can see for row number 3, the model was not able to classify that observation correctly.
The predictor variables have also been added to this particular table, so they can be analyzed as well: income, spending, age, experience, family size, education. If we look at the most interesting row, that is row number 3, you can see that the income level, spending and age are on the higher side, along with experience, family size and education. So, we can look at the specific values for a particular observation and understand the results further. Another thing that is possible here is to have a look at the observations which have been incorrectly classified by the model.
So, if you are interested in those values, we can work within this same data frame. First we store it in a data frame, let us say df, and then within df we can look for the rows where the predicted class is not equal to the actual class; or, even more interesting, the rows where the probability value, that is, the third column, is close to 0.5. So, let us compute those: we take the third column and require it to be, say, less than 0.6, and for the same observations greater than 0.4, and keep all the rows which satisfy this.
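A small sketch of that borderline-row filter (diag_df here is a toy data frame with a prob column; in the lecture the probability sits in the third column of their data frame):

# Toy diagnostic data frame; only the filtering logic matters here.
set.seed(7)
diag_df <- data.frame(predicted = rbinom(50, 1, 0.5),
                      actual    = rbinom(50, 1, 0.5),
                      prob      = runif(50))

borderline <- diag_df[diag_df$prob > 0.4 & diag_df$prob < 0.6, ]  # near the 0.5 cutoff
head(borderline, 20)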
(Refer Slide Time: 16:00)
So, we can see the results here. There could be too many observations in this case, so let us take just the first few, say twenty observations, which we can do in this fashion; let us scroll. Now we can see the observations for which the probability values range from 0.4 to 0.6, as per our criteria. That was the range where the logit values are close to 0 and where we see a sudden change in the probability values.
Now let us look at some of these observations. The probability values are close to 0.5 and the logit values are close to 0. If we check whether these observations have been correctly classified, you can see that the first row is incorrectly classified, as are the second and third; it is the fourth row where we see a correct classification, and looking further, this next one is again incorrectly classified. So, you would see that most of the observations with probability values in this range have been incorrectly classified; very few, such as this other observation, have been correctly classified.
(Refer Slide Time: 17:46)
So, very few observations seem to be correctly classified out of the twenty observations within this 0.4 to 0.6 range that we have seen.
In a sense, from this kind of analysis we can see that our model is able to correctly classify the clear-cut records, but when the situation becomes borderline, that is, when the probability values are quite close to 0.5 and the logit values are close to 0, the performance of the model goes down and most of those observations are incorrectly classified. However, if we look at the overall picture, the model gives us 95 percent accuracy, and that is mainly because of the many observations which are easier to predict.
So, in this kind of situation we would require expert knowledge: the observations which have probability values close to 0.5 can be identified, and closer scrutiny with the help of experts can be used to classify them.
(Refer Slide Time: 19:05)
Now let us look at the cumulative lift curve for this particular model. For this, as we have done for some of the techniques in previous lectures, we will create a data frame whose first column is the probability of class 1, which is stored in mod test, and whose second column is the actual class. You can see the code is slightly adjusted so that we get the actual class in numeric form, because later on we will be computing the cumulative actual class. Promotional offer had been converted into a factor variable with labels 0 and 1.
So, we first need to change it to a character variable, so that the factor labels are dropped and the values are in the correct form, 0 and 1, and from that we can convert to numeric 0 and 1. A direct conversion from factor to numeric might lead to errors, with the values not in the desired format: if we convert a factor variable directly to numeric, the numeric code for class 0 can become 1 and the numeric code for class 1 can become 2. We would like the numeric code for class 0 to be 0 and for class 1 to be 1, because we need certain computations based on those values. So, this code gives us the desired values: the factor labels 0 and 1 become the numeric values 0 and 1.
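A tiny illustration of why the detour through as.character matters:

f <- factor(c(0, 1, 1, 0))
as.numeric(f)                 # 1 2 2 1 -- internal level codes, not what we want
as.numeric(as.character(f))   # 0 1 1 0 -- the desired 0/1 values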
So, let us create this data frame and look at the first 6 observations. We can see the probability values in the first column and the corresponding actual class in the second; please note that these are the estimated probabilities alongside the actual class. We have gone through this exercise before as well. The next thing we do is sort this data frame in decreasing order of the probability values: the order function can be used, with the decreasing argument set to TRUE, so that we get the values in decreasing order. Let us run this code and look at the observations.
(Refer Slide Time: 21:44)
Now you can see that the very first row has the highest probability value, followed by the observation with the second highest probability value; the first few observations all have probabilities close to 1, 0.99-something, and their actual class is also 1. With this transformation of the data frame we can go ahead and compute the cumulative actual class; cumsum is the function that can be used to perform this computation. As you can see, we apply it to the second column and store the cumulative numbers in the cumulative actual class variable. Let us compute this, add this variable to the data frame, and look at the first 6 observations.
Now you can see the probability, the actual class and the cumulative actual class, with the running counts 1, 2, 3, 4, 5, 6. Now let us plot our cumulative lift curve. First let us look at the range for the x axis: 1 to 2000, that is, the number of observations in the test partition; and the range for the y axis, that is, the range of the cumulative actual class: 1 to 190. From that we can also understand that, out of the 2000 observations in our test data set, 190 observations belong to class 1.
So, now let us plot; you can see that the x and y limits are appropriately specified so that we focus mainly on the data points in the plot region. Let us generate this plot; this is the plot. Let us also create the reference line and a legend for the same.
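A self-contained sketch of the whole lift-curve construction on simulated scores (names and axis limits are illustrative; the lecture's own data frame differs):

# Simulated predicted probabilities and actual classes.
set.seed(8)
n <- 2000
prob   <- runif(n)
actual <- rbinom(n, 1, prob * 0.2)                             # rare positive class

lift_df <- data.frame(prob = prob, actual = actual)
lift_df <- lift_df[order(lift_df$prob, decreasing = TRUE), ]   # sort by score, high to low
lift_df$cum_actual <- cumsum(lift_df$actual)                   # cumulative class-1 count

plot(seq_len(n), lift_df$cum_actual, type = "l",
     xlab = "Number of cases", ylab = "Cumulative actual class",
     main = "Cumulative lift curve")
abline(a = 0, b = sum(actual) / n, lty = 2)                    # baseline reference line
legend("bottomright", legend = c("Model", "Baseline"), lty = c(1, 2))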
(Refer Slide Time: 23:47)
Let us look at the plot. This is our cumulative lift curve. As we discussed when we generated cumulative lift curves for other techniques, we look to identify the first few most probable observations, and from this plot we can understand the ability of the model to identify those most probable ones. The lift relative to the reference line indicates this: the solid line represents the model and the dotted line represents the reference case, the baseline scenario of the naive rule.
From this we can say that, in terms of identifying the most probable ones, that is, the customers who are more likely to accept the promotional offer, this model does a good job and provides a very good lift in comparison to the benchmark case. We can see that the lift is quite high in the initial part of the curve, and as we look to identify more such cases the lift starts decreasing; that is because there are only 190 observations in total which actually fall into that category, the customers who have accepted the offer.
So, as we go about reaching that number, and you can see the x axis runs up to 2000 while this particular mark is 190, the performance of the model starts merging with the performance of the naive rule. However, in terms of identifying the most probable ones, what we are looking at is the top left corner of the plot: if we are looking to identify about this many observations, the model gives us quite good performance in comparison to the naive rule; even at this point we would be able to identify more than 150 of the individuals who are most likely to accept the offer, which is quite close to 190.
So, in terms of identifying the most probable ones, the model does quite a good job. Now, the same information can be further understood using the decile chart, as we have done for previous techniques as well.
(Refer Slide Time: 26:35)
For the decile chart we first have to compute the global mean: you can see that we take the cumulative actual class variable, pick the value corresponding to the last observation, and divide by the total number of observations; that gives us the global mean, 0.095. Then we set up the decile cases: we would like to have 10 deciles, each decile representing an additional 10 percent of cases, so the first decile covers the first 10 percent of cases, the second decile 20 percent, the third decile 30 percent, and so on. You can see that this sequence is multiplied by the number of observations, which gives us the appropriate number of observations for each decile. Once this is done we need a counter for the deciles, a variable to store the ratio of the decile mean to the global mean, and the decile mean itself for each decile. Let us initialize these variables; in the for loop, as you can see, we run over all the values in decile cases, that is, the 10 deciles and the number of cases in those respective deciles, and once we run this we will have the numbers. Let us look at the range of the decile values.
(Refer Slide Time: 28:13)
The range is 1 to about 7.3, and you can see that the limits on the y axis have been appropriately specified as 0 to 10; you can also see that the other arguments, for example the x axis labels, deciles 1 to 10, are appropriately specified. So, let us create the decile chart.
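A self-contained sketch of the decile computation and bar plot (simulated scores again; the lecture's loop-based script differs in detail, but the idea of decile mean over global mean is the same):

# Decile-wise lift: mean response in the top k cases over the global mean.
set.seed(9)
n <- 2000
prob   <- runif(n)
actual <- rbinom(n, 1, prob * 0.2)
sorted <- actual[order(prob, decreasing = TRUE)]     # actual class, sorted by score

global_mean  <- mean(sorted)                         # overall class-1 rate
decile_cases <- round(seq(0.1, 1, by = 0.1) * n)     # cases covered up to each decile
decile_lift  <- sapply(decile_cases,
                       function(k) mean(sorted[1:k]) / global_mean)

barplot(decile_lift, names.arg = 1:10,
        xlab = "Decile", ylab = "Decile mean / global mean",
        main = "Decile-wise lift chart")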
So, this is the decile chart, which can be created using the barplot function, as we have done before. The information that we saw in the cumulative lift curve is being depicted here in a different format, the bar chart format of the decile chart. You can see the deciles: as I talked about, the first decile represents the first 10 percent of cases, the second 20 percent, the third 30 percent, and so on. On the y axis we have the decile mean divided by the global mean, so each bar gives us an idea of how well the model performs in comparison to the average case in identifying the most probable ones, the customers who are most likely to respond, most likely to accept the promotional offer. You can see that for the first 10 percent of cases the lift is quite high, more than 7; for the first 20 percent of cases the model still gives a good lift of more than 4; and for the first 30 percent of cases the lift is still more than 2, close to 3.
In this fashion, just like in the cumulative lift curve, as we look to identify a larger number of customers who are likely to accept the offer, the lift value goes down; the same is reflected in the decile chart. If we look at deciles 4, 5, 6, 7, meaning we are looking to identify the most probable 40 percent, 50 percent, 60 percent of cases, the lift goes down accordingly. Typically we can go up to the decile where the lift value is still greater than 1; here that is near about the eighth decile, that is, about 80 percent of the cases, where the lift seems to be near 1. So, from this we can also understand that, out of the 190 customers who have accepted the promotional offer, about 80 percent of them can be easily identified by the model.
We can further look at some measures of goodness of fit; these are some of the values.
(Refer Slide Time: 31:20)
We will discuss some of these values in a later lecture. Right now, let us look at the performance on the training partition; the performance we have seen till now was for the test partition, so let us do the same exercise on the training partition itself. Let us compute the probability values, followed by the logit values, followed by the classification, just as we did for the test partition, and then look at the classification matrix.
Here we can see that, out of 3000 observations, the majority have been correctly classified, as shown by the diagonal elements. Now let us look at the classification accuracy: 0.959, which is more than the performance on the test partition; that is expected, because these are the observations on which the model was built. The error is about 4 percent. We can also create the cumulative lift curve and the decile chart for this particular partition.
So, this data frame is created; let us look at the first 6 observations, and then order it by decreasing probability value. Most of the values in the first 6 observations are now close to 1. Let us compute the cumulative values and look at them; once this is done we can go ahead and create our lift curve.
(Refer Slide Time: 33:04)
So, we can see the curve for the training partition here. Because the model is also doing well on the test partition, both the training and test lift curves look quite similar. With this we will stop here, and we will do another exercise in the next lecture to understand the logistic regression model further.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 50
Logistic Regression - Part V
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous few lectures we have been discussing logistic regression, so let us start from the point where we left off in the previous lecture. We were doing an exercise in the R environment, which we completed using the promotional offers data set. What we are going to do in this particular lecture is use another data set and another problem, and go through the complete analysis that is required in logistic regression.
The data set that we are going to use for this lecture's exercise is on flight details. We have used this particular data set before; the file is flight details dot xlsx. So, let us import this data set into the R environment.
(Refer Slide Time: 01:26)
So, we have 108 observations and 13 variables; however, there are some NA rows and NA columns, so let us get rid of them. Let us first remove the NA columns, then the NA rows, and you would see 107 observations and 13 variables. Let us look at the first six observations. This particular data set we had used before when we discussed naïve Bayes.
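A hedged sketch of this import-and-clean step, assuming the xlsx package is installed and a file named flight details dot xlsx sits in the working directory (the file name comes from the lecture; the sheet index and exact arguments are assumptions):

library(xlsx)

df <- read.xlsx("flight details.xlsx", sheetIndex = 1)
df <- df[, colSums(is.na(df)) < nrow(df)]    # drop columns that are entirely NA
df <- df[rowSums(is.na(df)) < ncol(df), ]    # drop rows that are entirely NA
str(df)                                      # expecting 107 observations, 13 variables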
Now let us go over it again. Most of the variables in this particular data set are factor variables, as you can see in the first six observations. We have flight number, flight carrier, then date, then source and scheduled time of departure, then actual time of departure, then scheduled time of arrival, then actual time of arrival, then destination, and then day, whether it was Sunday or Monday.
We have data on just 2 days; then flight status and then two additional variables. When we used this particular data set for the naïve Bayes technique we did not have these two variables, distance and flight time; so we have added the distance between the source and the destination and also the flight time.
Let us look at the structure of this data set and then discuss further. Flight number, as you will see, is something that we are not going to use in this particular analysis.
(Refer Slide Time: 03:07)
We have flight numbers for each of the flights, 97 unique values out of the 107 observations, so some of them must be repeating. Then we have the flight carrier; we have 3 carriers, Air India, Indigo and Jet Airways. Then we have the date: we have flights on 2 days, that is 30th July 2017 and 31st July 2017.
However, as we did in naïve Bayes, we would not like to use this information; the date is not important for our analysis, so we will not consider it, and instead we will look at specific departure time intervals of the flights. The main problem remains the same: this is a classification problem that we are going to model using logistic regression, where we would like to predict the delays of flights.
The other variables we are familiar with: day, flight status, distance, flight time. Flight time, as you can see, is currently a factor, so it has to be converted into a numeric variable; distance is fine, and flight status is also fine, except that we would like to change its labels, which are delayed and on time. Since we are modeling for delays, that is, trying to predict delayed flights, our reference category has to be on time and the modeling has to be done with respect to delayed. So, we need to change these labels for our outcome variable, flight status: we would like to predict whether a particular flight is going to be on time or delayed, with the focus on delayed, so on time is going to be our reference category. Just as in other techniques and examples, where we had classes 1 and 0, the focus was always on class 1, the members belonging to class 1; we always looked to build a model which would classify an observation into class 1.
So, here, as I talked about, we require changing the levels. Destination is fine; day we will also have to change: day is meant to be a factor variable with 2 levels, Sunday and Monday, but right now it is numeric, so we will be required to change this variable as well. From the scheduled and actual times of departure and arrival, these four pieces of information, we will only derive other variables: flight status has already been derived using some of them, and we will derive another variable, the departure time interval, using them as well. After those derivations we will not be using these variables.
Source is appropriately stored as a factor with 3 levels; date we will not be using; flight carrier we will be using, and it is appropriately stored; flight number, again, we will not be using. So, let us start some of these transformations. Before we go ahead, let us take a backup of this particular data set. Then, because we are not interested in the actual dates of those flights, and because we need to use these particular columns for certain variable derivations, we would like to change the dates so that the same date appears for all the flights; that way the various derivations that we want to perform will not run into issues.
(Refer Slide Time: 07:23)
So, now you can see that in all these 4 columns the dates have been changed; earlier it was 1899, and we already discussed during naïve Bayes why that 1899 date was appearing there. This is what we require before we go for further variable transformations. In the first six observations not much has changed, only these four variables; the dates have been changed, as you can see.
Now, the first variable transformation that we are going to perform is on the departure time: we would like to break the departure time into appropriate intervals. Let us look at the range of the actual time of departure; since we have excluded the date information and are using the same date for all the observations, the range can be captured appropriately for our analysis.
So, we have flights ranging from 00:40 hours to 20:00 hours; that is the range. Now we would like to break this time into appropriate intervals. As we did during naïve Bayes, we break the departure time into 4 intervals; you can see the labels as well: 0 to 6, 6 to 12, 12 to 18 and 18 to 24. So, in the 24-hour format we create four categories. For this we require the breaks variable, which is used in the cut function, as you can see in the next line of code: cut will bin the different observations in the actual time of departure using these breaks. Let us compute this; you would see in the environment section that we have the breaks variable with 5 values, and let us also create the labels.
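A self-contained sketch of binning departure times with cut(), using toy POSIXct times (the lecture works on its actual-time-of-departure column; the names and times below are illustrative):

dep <- as.POSIXct(c("2017-07-30 00:40", "2017-07-30 07:15",
                    "2017-07-30 13:05", "2017-07-30 19:50"), tz = "UTC")

breaks <- as.POSIXct(c("2017-07-30 00:00", "2017-07-30 06:00",
                       "2017-07-30 12:00", "2017-07-30 18:00",
                       "2017-07-31 00:00"), tz = "UTC")
labels <- c("0-6", "6-12", "12-18", "18-24")

dep_interval <- cut(dep, breaks = breaks, labels = labels)
dep_interval   # factor with levels 0-6, 6-12, 12-18, 18-24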
(Refer Slide Time: 09:48)
So, these breaks can be used to create the 4 categories; let us execute this. You would see that the departure-interval variable has been created as a factor with 4 levels, so we have it in the appropriate format; let us append this variable to the data frame.
Now let us focus on the other variables. We had noticed that day is meant to be a factor variable, a categorical variable with flights on 2 days, Sunday and Monday; however, it was stored as numeric. So, let us change it to a factor and look at the levels, 1 and 2. We would also like to change these level names from 1 and 2 to Sunday and Monday specifically, so we will do that.
Once this is done, we will focus on another variable, flight time. The flight time information is in the time notation hours colon minutes colon seconds, and we would like to change it into a format useful for our analysis: we would like to convert all those flight durations into minutes. as dot difftime is the function that can be used; because these are time intervals, differences between two times, this particular function applies, and you would see that we are able to convert the values into minutes.
First you would see that this particular variable has been changed. If we look at the structure and go to flight time, you can see that an object of class difftime has been created; it is now an atomic vector and all the values are in minutes. You can see the units, mins, here; all the values that were time intervals have now been converted into minutes.
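A minimal sketch of that conversion with as dot difftime (the toy durations match the 84, 94 and 79 minutes mentioned just below; the variable name is illustrative):

fl_time <- c("01:24:00", "01:34:00", "01:19:00")   # hours:minutes:seconds strings
fl_mins <- as.difftime(fl_time, format = "%H:%M:%S", units = "mins")
fl_mins                # time differences of 84, 94, 79 mins
as.numeric(fl_mins)    # plain numeric minutes, ready for modeling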
With this we are almost there. You can see the flight times, 84 minutes, 94 minutes, 79 minutes; all the flight durations have been appropriately converted. We have also created the departure-interval variable, and each flight has been correctly labelled as per its newly created category. Let us now focus on some other transformations; before that, let us take a backup.
Now, as we have talked about, some of these variables we will not be considering: variable 1, that is flight number; then date, that is column number 3; and then columns 5 to 8, which are the scheduled time of departure, actual time of departure, scheduled time of arrival and actual time of arrival. These 4 columns also we will not be taking into the model.
So, let us get rid of these columns. These are the variables that we are left with, and they are in the appropriate format, so we would like to use them in our logistic regression model: flight carrier, source, destination, day, flight status (our outcome variable, for which we still need to change the levels, as we talked about), distance, flight time in minutes, and the departure interval.
Once this is done, these are the first six observations. If we want to take a random sample of 20 rows, the sample function can be used in this fashion; this particular command will give us 20 randomly selected rows, so we can have a look at the values for different randomly drawn observations.
(Refer Slide Time: 14:12)
Now let us focus on the outcome variable. The levels are delayed and on time, in that particular order. We would like to change this into numeric codes, because later on our code will be much easier to write if we have numeric codes: we will be creating the lift curve and the cumulative lift curve, for which we need numeric codes so that we can compute the cumulative actual class. For all of that we would prefer the levels of the outcome variable to be 1 and 0.
So, let us change this. Here, 1 corresponds to delayed and 0 corresponds to on time, because our task is to predict delayed flights, so 1 has to be assigned to delayed. Let us execute this and look at the first six observations of the outcome variable, flight status; you can see the levels have changed to 1 and 0. However, the ordering is 1 and then 0, so 1 would become the reference category, meaning delayed would become the reference category, and we do not want that. So, we would like to change this, and the relevel function can be used to perform this kind of change: in relevel you first pass the factor variable, and the second argument specifies the reference category; the other levels are adjusted appropriately.
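A tiny sketch of this re-labelling and re-levelling step on a toy vector (the lecture applies it to the flight status column):

status <- factor(c("delayed", "ontime", "ontime", "delayed"))
status <- factor(status, levels = c("delayed", "ontime"), labels = c(1, 0))
levels(status)                   # "1" "0" -- delayed (1) would be the reference
status <- relevel(status, ref = "0")
levels(status)                   # "0" "1" -- on time (0) is now the reference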
So, let us execute this code and look at the structure of this particular variable. Now you can see 0 and 1 in the correct order, with 1 representing delayed and 0 representing on-time flights. Let us also look at the first six observations; this variable is now in the desired state. At this point, if we want to do some descriptive analysis, we can do so, as we did for naïve Bayes as well: once the data is ready for the logistic regression model and all the key variables are there in the final data frame, we can write this particular file to disk and then apply Excel-based pivot tables to produce some summaries for further analysis.
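A hedged sketch of that export step, together with an R-side cross-tab using xtabs() that mirrors the kind of count the pivot table gives (toy data; the column names are illustrative stand-ins for the lecture's variables and the output file name is an assumption):

# Toy flights data frame, written to disk for Excel pivot tables,
# plus an in-R count of delayed flights by source and carrier for Monday.
set.seed(10)
flights <- data.frame(
  carrier = sample(c("Air India", "Indigo", "Jet Airways"), 107, replace = TRUE),
  src     = sample(c("BOM", "DEL", "MAA"), 107, replace = TRUE),
  day     = sample(c("Sunday", "Monday"), 107, replace = TRUE),
  status  = sample(c(0, 1), 107, replace = TRUE))

write.csv(flights, "flight_details_clean.csv", row.names = FALSE)

xtabs(~ src + carrier, data = subset(flights, status == 1 & day == "Monday"))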
So, I have already created one pivot table; let us open it. This pivot-table exercise we have already gone through before.
(Refer Slide Time: 16:58)
We just have to select the data, get it into an appropriate presentable format, select all the columns and rows, and then create the pivot table using the Insert tab and the pivot table option within it; the pivot table will be ready for us in this fashion.
I have already selected a few of the important variables here, with appropriate filters. In the row labels I have SRC, that is, the source; in the column labels I have the flight carrier, the three carriers that we have, Air India, Indigo and Jet Airways; and the three sources are BOM, DEL and MAA, that is, the Mumbai, Delhi and Madras airports. In the values area we have the count of flight status, that is, the number of flights. In the report filter I have flight status and day, as you can see here. For flight status I have pre-selected 1; this can be changed by clicking on the filter, so you can see the count of all flights, or of the delayed flights, with 1 representing delayed, or of the on-time flights, with 0 representing on time.
Day is also a report filter, so we can filter on it as well: we can have all days, Sundays and Mondays together as the total, or either Sunday or Monday, and the numbers will be reflected accordingly. This descriptive table will change depending on the chosen levels of the report-filter variables.
So, let us look at some of these numbers. When we are interested in the count of delayed flights, the flight status filter is set to 1, as you can see here, so we are looking at the delayed flights in these descriptive, summary statistics. Further, we have selected Monday, so these are the numbers for delayed flights on Monday. We can see that when the source is the Mumbai airport, BOM, the total number of delayed flights is 16, and more of the delayed flights belong to Jet Airways.
(Refer Slide Time: 19:56)
In this table, once you click on a cell you will also get the specific observations behind it. Now, if we look at the second row, Delhi, we see that the total numbers of delayed flights for the Mumbai and Delhi airports are similar; both are very busy airports in our country. For the Madras airport we have just 7 delayed flights. Across the 3 carriers, Jet Airways has more flights and, in terms of counts, more delayed flights.
So, that is in terms of just the counts, and this information is for Monday. Now, with respect to delayed flights, let us look at what happens during Sundays; let us select Sunday. You would see that during Sundays we did not have any flight in our data set originating from the Madras airport, so that row is gone.
(Refer Slide Time: 21:04)
This also highlights a problem in the modeling exercise that I have talked about in previous lectures: if some combinations of values are not covered in the training partition, prediction can fail when a new observation falls into one of those uncovered combinations.
The same thing is reflected here. For example, on Sunday we do not have the source MAA, Madras; there are no flights from that source, and the number of flights overall is quite small in comparison to Monday. Still, more of the delayed flights are again from Mumbai and again from Jet Airways. In terms of counts, the total number of flights on Sundays is lower, and therefore the total number of delayed flights is also lower.
If we look proportion-wise, however, there does not seem to be much difference between Sunday and Monday, and we should expect the same thing in the results when we build our logistic regression model: proportion-wise, there does not seem to be any difference in delayed flights between Sunday and Monday. A similar exercise can also be performed for on-time flights. We can go to the flight status report filter and select 0, which represents the on-time flights. Now we can see the Sunday numbers: Bombay 3 flights on time, Delhi 2 flights on time and Madras 1 flight on time. So we did have a flight on Sunday from Madras, but it was on time.
We can also change the day filter to Monday to see the difference. We expect more flights on Monday, and that is the case: 15, 22 and 19 on-time flights. Looking at these Monday numbers, more Indigo flights seem to be on time; however, Indigo also runs more flights on Mondays, so that could be the reason.
This kind of analysis can be done using pivot tables, and it tells us what to expect when we move to a formal technique like logistic regression. It can also help with grouping, as we discussed in the previous lecture: we can understand which source or destination categories can be grouped, or which days can be grouped. Here we have just 2 days, so the question is whether we should include day at all, since it does not seem to be a significant predictor of delayed flights as per the descriptive stats that we saw.
These kinds of insights and decisions can be derived from descriptive stats. Now we will go back to the R environment and move to our next step, that is partitioning. This data set is quite small, just 107 observations, and for the training partition we would like to have more observations so that the model is a bit more stable.
The majority of the predictors are factors, that is, categorical variables, so there are going to be many combinations of values that we would like to cover in the model. Because of this small sample size, we would like to keep more observations in the training partition.
Let us do this partitioning: 90 percent for the training partition and the remaining 10 percent for testing. You can see our partitions are created: 96 observations in the training partition and the remaining 11 observations in the test partition. Now, as we did in the previous lecture, the same glm function can be used. Flight status is our outcome variable and the other variables are used as predictors; everything else remains the same. Let us run this code, get the model and look at the summary.
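A minimal sketch of this partitioning and model fit is shown below; the object names df and FlightStatus are assumptions, and the exact code in the course script may differ.

# Hypothetical 90/10 partition and logistic regression fit
set.seed(42)                                         # for a reproducible partition
train_idx <- sample(nrow(df), round(0.9 * nrow(df))) # about 90 percent of rows
df_train  <- df[train_idx, ]
df_test   <- df[-train_idx, ]

mod <- glm(FlightStatus ~ ., data = df_train, family = binomial)
summary(mod)                                         # coefficients and significance codes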
If we look at the results of the logistic regression model we have just created, we see that because of the smaller data size most of the variables do not show much significance. However, three variables are significant at the 90 percent confidence level; you can see the small dot that marks this level. The first one is Flight Carrier Indigo.
Since its coefficient is negative, we should expect the logit value to decrease, and therefore the probability decreases as well, so such a flight is more likely to be on time rather than delayed.
We can interpret the results in this fashion. However, before interpreting, we should first check whether the variable is significant at our accepted confidence level; once a variable is significant, we can look at its coefficient value to understand the size and direction of its impact.
So Indigo flights seem to be less delayed, or more often on time, in comparison to Air India. The same cannot be said about Jet Airways, because that relationship is insignificant. The other significant coefficient is the source dummy for Madras airport; it seems that flights originating from Madras airport also tend to be less delayed, that is, more often on time.
(Refer Slide Time: 29:19)
The other variables do not seem to be significant. For example, the day Monday dummy: as we saw in the pivot table, proportion-wise there did not seem to be much difference between flights on Mondays and Sundays, and in the logistic regression results this dummy variable likewise does not come out significant.
So what we expected from the pivot table exercise is reflected in the model. Distance is also highly insignificant, so distance is not a key predictor here. From this we can also understand which variables can be dropped and another model can be built; for example, since distance is highly insignificant, it can probably be dropped.
However, as I have pointed out in earlier lectures, in data mining modeling our goal is prediction. Even if a particular predictor is insignificant, if it is of practical importance for prediction, classification or other data mining tasks, we would still like to keep it in the model.
In this case distance does not seem to be of much practical importance and is highly insignificant, so it can probably be dropped. Flight time, on the other hand, was quite close to being significant at the 90 percent confidence level and is also practically relevant for predicting delayed flights, so we should keep it anyway. The next one is departure time interval.
With respect to the reference category, that is 0 to 6 hours, the three departure interval dummies do not seem to be significant.
So probably the departure time intervals also do not matter much for the data set that we have. What we can do is look at another modeling approach. From this model we can further understand which variables are important, which are insignificant, what impact the significant ones have, and whether another model with only the important variables can be built. As we discussed, even an insignificant variable can still be kept in the model if it provides some practical importance. With this we will stop here and continue our discussion on this modeling exercise on flight delays.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 51
Logistic Regression-Part VI
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous few lectures we have been discussing logistic regression, and in the previous lecture we were working through a modeling exercise using the flight details dataset.
Let us go back to that exercise in the R environment. We have built the model, discussed the results, and talked about alternative models that could be created by dropping a few variables and merging a few categories.
We will do that, but before that we would like to check the performance of this model and look at some plots. First, let us check the performance on the training partition itself. We have the fitted values with us; fitted values are nothing but the estimated probability values for all the observations that we used in the previous lecture.
We are using the same model results again in this lecture. First we will use these fitted values to classify observations: if a fitted value, that is an estimated probability, is greater than 0.5 then the class is 1, otherwise 0; here 1 means a delayed flight.
Let us run this. Once we have the classifications we can create our classification matrix: the actual values are stored in flight status and the predicted values are what we have just computed. Looking at the classification matrix, the diagonal elements are 46 and 20; these are the correctly classified observations. The off-diagonal elements are 18 and 12; these are the incorrectly classified observations.
Now let us look at the classification accuracy. It comes out to be 0.6875, so 68.75 percent accuracy. This is still reasonable, because we had a very small sample size, an even smaller training partition, and many factor variables, so many combinations of values might not be modeled or might not have enough observations to support a good model. Still, we got 68.75 percent accuracy; the remainder is the error.
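A sketch of this classification step, assuming the model object mod and the training frame df_train from the earlier sketch:

# Classify training observations at a 0.5 cutoff and compute accuracy
pred_train <- ifelse(mod$fitted.values > 0.5, 1, 0)
cm_train   <- table(Actual = df_train$FlightStatus, Predicted = pred_train)
cm_train                                    # 2 x 2 classification matrix
accuracy <- sum(diag(cm_train)) / sum(cm_train)
accuracy                                    # about 0.6875 in the lecture's run
1 - accuracy                                # corresponding error rate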
Now we can look at the performance of our model on the 10 percent of observations we had left out as the test partition. First, let us score the test partition to obtain probability values; that is done. Then let us also score the logit values and classify the observations. Once this is done, let us generate the classification matrix.
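A sketch of the scoring step on the test partition, again with assumed object names (mod, df_test, FlightStatus):

# Score the test partition and build its classification matrix
prob_test  <- predict(mod, newdata = df_test, type = "response")  # probabilities
logit_test <- predict(mod, newdata = df_test, type = "link")      # logit values
pred_test  <- ifelse(prob_test > 0.5, 1, 0)
table(Actual = df_test$FlightStatus, Predicted = pred_test)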
(Refer Slide Time: 03:29)
This is the classification matrix; we had just 11 observations. 3 plus 2, that is 5 observations, have been correctly classified and the remaining 6 have been incorrectly classified. So our model does not seem to be performing well on new data, just about 45 percent accuracy. That is expected: performance can drop from the training partition to the test partition, especially in this case, because we had a very small sample size and too many factor variables.
As we discussed in the initial lectures on factor variables, the number of categories also plays a role in determining the minimum number of sample observations we should have. We have many factor variables, each with quite a few categories, and that puts a higher demand on the sample size. So the error is much higher in the test partition.
Now let us look at the lift curve. Even though the model is not performing as well as we would expect, because of the sample size and other issues we discussed, the lift curve still gives us important information about how well the model identifies the most probable cases, the flights most likely to be delayed, relative to an average-case scenario.
As we have been doing for plotting the cumulative lift curve, first we create a data frame where the first column holds the probability values and the second column holds the actual class. Let us create this and look at the first 6 observations.
These are the values for the test partition; we have 11 observations in total. Once this is done, we sort this data frame by the probability values in decreasing order. The code is written there; these are the sorted values, with the data frame arranged in descending order of probability.
Now we compute the cumulative actual class and add it to the data frame; this kind of exercise we have done before as well. Then let us create the cumulative lift curve and look at the range. As I said, we have just 11 observations, and this is the range for the cumulative actual class: out of 11, about 7 observations belong to the delayed class. So let us create the plot.
(Refer Slide Time: 06:36)
We need to correct our axis limits first, so let us make the y limit 8; everything else seems fine. Now let us run it again. This is our plot; let us add the reference line. The reference line can be corrected further: since the cumulative actual class for the first record is 1, the reference line should start at (1, 1). That gives an improved version of the reference line. For this we create the plot once again, add the reference line, and then add the legend. This is the plot that we have.
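A sketch of the cumulative lift curve construction for the test partition, assuming prob_test and df_test from the earlier sketches:

# Cumulative lift curve for the test partition
lift_df <- data.frame(prob   = prob_test,
                      actual = as.numeric(as.character(df_test$FlightStatus)))
lift_df <- lift_df[order(lift_df$prob, decreasing = TRUE), ]   # sort by probability
lift_df$cum_actual <- cumsum(lift_df$actual)                   # cumulative delayed count
plot(1:nrow(lift_df), lift_df$cum_actual, type = "l",
     xlab = "Number of cases", ylab = "Cumulative delayed flights",
     xlim = c(1, nrow(lift_df)), ylim = c(0, 8))
lines(c(1, nrow(lift_df)), c(1, sum(lift_df$actual)), lty = 2)  # reference line from (1, 1)
legend("bottomright", legend = c("Model", "Reference"), lty = c(1, 2))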
You can see that the lift curve goes below the reference line at one point but picks up again; for most of the curve, and especially in the starting region, the solid line for our model remains above the reference line. That means our model does a better job of identifying the most probable delayed flights than the reference, that is, the average case.
With what we have understood from this exercise and the results we discussed, we can go ahead and remodel this dataset, this delayed flights prediction problem. Let us look at the logistic regression results once again.
In these results three variables were found to be significant: flight carrier, specifically the Indigo dummy variable; source, where one of the dummy variables is significant; and destination, where one dummy variable is also significant.
Day, however, is not significant, but it could be of practical importance. In our small data set we have flights on just 2 days, Sunday and Monday, but on working days the schedule and the traffic might be heavier, with more flights running than on weekends. So the day variable carries important information, and because of its practical importance we would like to keep it despite it being insignificant in our small data set and model. Distance, on the other hand, is highly insignificant, so it is probably not important; in any case the aircraft fly at high speed, so distance does not matter much.
It is the operational factors which matter more. Flight time, of course, can be important: if the flight time is longer, there are more chances for some factor to come into play and delay the flight. So we would like to keep flight time, and you can also see that its p-value is on the lower side; it missed being significant at the 90 percent confidence level only by a small margin.
It was left out by a small margin. The departure intervals we would also like to keep because of their practical importance: the time at which a flight departs or arrives can play an important role in whether it is going to be delayed. However, we would like to look at the categories and find a grouping of them that gives some improvement in our classification model.
From the results it seems that the 18-to-24 category, relative to the reference, has a smaller p-value, so we will combine the reference category (0 to 6) with the 18-to-24 category, and group the other two categories, 6 to 12 and 12 to 18, into one. The latter group we can call day, and the 18-to-24 interval together with the reference interval 0 to 6 will form the night group.
So two categories will be part of the day group and two categories will be part of the night group, and we will see how this performs in the model.
(Refer Slide Time: 12:08)
So let us start the R modeling exercise. Before that, let us clear some of these variables and data frames from the environment section, because they might create problems in this particular code.
The library is already loaded, so let us import the data set again, remove the NA rows and NA columns, and look at the first 6 observations and the structure. All this we have already gone through in the previous lecture. Let us take a backup and change the time variables into an appropriate format so that we can derive some variables, like we did in the previous lecture. These are the observations. Now, here we do a certain variable transformation differently, based on our learning from the previous modeling.
As you can see, the range is of course going to remain the same, but the grouping, the categories we are going to create for the departure time interval, is different. As we discussed, we would like to create two groups, day and night: all flights departing between 6 and 18 hours would be grouped as day, and the remaining flights would be grouped as night.
We could make other choices as well; we should not restrict ourselves to the cutoffs 6 and 18, and these can be changed depending on how useful they are for predicting delayed flights. So we have to keep working on these categories and how they are created. For this model we will go with these hours, 6 to 18 as one group and the remainder as the other.
We are using ifelse to perform this categorization. Once it is done, you can see in the environment section that a character vector has been created for all observations, specifying whether each is a night flight or a day flight. Let us add this to our data frame and convert it into a factor variable. As we discussed, we would like to include this departure interval in our model even though it was found to be insignificant, because of its practical importance.
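A sketch of this regrouping, where dep_hour stands for the scheduled departure hour derived earlier (an assumed name):

# Regroup departure time into two levels, day (6 to 18 hours) and night (the rest)
dep_group <- ifelse(dep_hour >= 6 & dep_hour < 18, "day", "night")
df$DepTimeInterval <- factor(dep_group)   # append to the data frame as a factor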
So let us convert it into a factor variable as well. As we did in the previous lecture, we would also like to change the labels for day from 1 and 2 to Sunday and Monday. Flight time, as we discussed, is also of practical importance, so we would like to have it in our model, and as in the previous lecture we use as.difftime to convert the values of this variable into a suitable format. These are the values; some of the conversion transformations have been performed. Looking at the first 6 observations, you can see flight time is appropriately represented. We still have some variables to drop, so we will be taking a subset.
Before that, let us take a backup of this data frame. You can see that we are getting rid of column number 1, that is flight number, and column number 3; we do not want those. Then columns 5 to 8, that is scheduled time of departure, actual time of departure, scheduled time of arrival and actual time of arrival, are dropped as in the previous models.
The next one is the tenth variable, that is day. Even though it is of practical importance, we are not including it in this model because it was insignificant; we will see what the results look like once it is excluded. So that is also gone, and then we have distance, which we are also getting rid of. So these variables are dropped.
(Refer Slide Time: 16:57)
Now these are the remaining variables: flight carrier with three levels, source and destination with three levels each, flight status, for which we will have to correct the labels as we did in the previous lecture, flight time, and departure, which now has just two levels, day and night. Let us look at the first 6 observations; these are the observations.
Now let us work on the outcome variable. We go through the same piece of code as in the previous lecture to change the labels delayed and on time. As we said, because we generally create a cumulative lift curve, we require this variable in a 0/1 numeric-coded format so that we can do certain computations later on. So we change it: delayed is 1 and on time is 0, and then we relevel it so that the reference category is 0. Let us execute this; the reference category is now 0, as you can see in the first 6 observations.
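A sketch of this recoding and releveling, with the column name FlightStatus and its labels assumed:

# Recode the outcome as 0/1 and make 0 (on time) the reference category
df$FlightStatus <- ifelse(df$FlightStatus == "delayed", 1, 0)
df$FlightStatus <- relevel(factor(df$FlightStatus), ref = "0")
head(df)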
(Refer Slide Time: 18:13)
Partitioning is the same, 90 percent for the training partition because of the smaller sample size; in this modeling exercise also we follow the same split, 90 percent for training and 10 percent for the testing partition. Once this is done, let us create the model.
All the arguments of the model remain the same, so there is hardly any change; let us look at the results. Now the level of significance has gone up. You can see flight time is now significant at the two-star level; that means it is significant at the 99 percent confidence level.
This could be because of the smaller sample size: the particular observations that happen to be selected for the training partition have an influence on the results. In any case, flight time is now significant at the 99 percent confidence level, so that is one change.
In the last model, flight time had just missed significance at the 90 percent confidence level by a small margin. Another change concerns source: in the previous model the Madras dummy variable for source was significant at the 90 percent confidence level, and this time it is significant at the 95 percent confidence level, that is, one-star significance.
For destination, the same dummy variable that was significant last time is again significant at the 90 percent confidence level, so there is no change for this one. One change we do see is in flight carrier: the Indigo dummy variable, which was significant at the 90 percent level in the previous model, now misses the 90 percent level by a somewhat larger margin. However, because of their practical importance, even if some of these variables are insignificant we would like to use them in our modeling and in our predictions, that is, in scoring new observations.
If we look at the last variable, the departure interval night dummy, it does not seem to be significant in this case. It seems we will have to work further on the departure time intervals to understand whether there is any real difference between groups of intervals; in the previous model we had created 4 groups and in this model we have created two. This also underscores the importance of descriptive analysis: probably we should have focused more on this particular grouping in our pivot table analysis to find out which groups, if any, can be created that help with the prediction, the classification, of delayed flights.
The other variables, such as the destination airport, we have already discussed. Now let us look at the performance of this model. Measures of goodness of fit can also be computed; we will discuss them later, but if you are interested in seeing the values, these are some of the numbers that could be interesting for that discussion.
One observation is that the multiple R squared is quite low for this model, about 17 percent. That is another reason for the low performance of the models in the previous exercise, and maybe in this exercise as well. From the low R squared value we can say one thing: probably we need to think about additional variables that could raise the R squared by a certain margin.
Let us look at the performance on the training partition itself. We have the model's fitted values, the estimated probabilities, so let us classify the training partition observations and look at the classification matrix. You can see 44 plus 24; these are the correctly classified observations, and 17 and 11 are the incorrectly classified ones. So despite the low R squared value and many insignificant variables, the model does reasonably, because most of the variables are of practical importance.
We can see that the accuracy on the training partition is about 70 percent, 70.8 to be precise. If you remember, the previous model's performance was around 68 percent, and this is more than 70 percent, so there is about a 2 percent increase after the transformations and remodeling we did this time.
This modeling exercise, these two models and their performance, underscores some important points. In data mining modeling, for prediction and classification tasks, even if some variables are insignificant, if they are of practical importance they can help us predict new observations; that is quite clear from the performance numbers as well.
So despite the small sample size, we got this much performance. Now let us look at the performance on the test partition; we still have 11 observations there, like last time. The outcome variable we would like to leave out when scoring; counting 1, 2, 3, 4, yes, it is the fourth column, so we exclude it for clarity.
The probability values have been estimated and the logit values extracted; let us classify the observations and look at the classification matrix for the test partition. We can see that 6 plus 2 are the correctly classified observations and 2 plus 1 are the incorrectly classified ones.
The model seems to be doing a good job this time; however, as I would like to point out, this is partly due to how the partitioning happens, that is, which observations go to training and which are left in the test partition. Because of this, there are going to be swings in performance.
What we really require is a larger sample size, with a larger number of observations in the training partition, and then we can check the performance on the test partition. In this particular case the performance of the model is better on the test partition, which is not always expected, but here it is performing well.
Now we will also look at the lift curve this time, to see how our model is doing in terms of identifying the most probable delayed flights. As we did in previous exercises, let us create a data frame of probability values and actual class; this is for the training partition.
(Refer Slide Time: 26:45)
These are the observations: probability values and the corresponding actual class. Let us sort them by probability in decreasing order; the first observation has the highest probability, together with its actual class. Now let us compute the cumulative actual class column, add it, and look at the observations. Once we have this, let us check the range and create the cumulative lift curve.
The x limit, 1 to 96, is covered, and the y limit, roughly 0 to 45, is also covered, so we can go ahead and create our plot. The reference line in this case should start at (1, 1), so we get a better reference line this time. This is our reference line; let us add the legend as well and look at the cumulative lift curve.
(Refer Slide Time: 27:53)
As we can see, for the majority of its length the model curve remains above the reference line, which indicates the usefulness of the model. As we have discussed, sometimes a model may not give much improved overall performance compared with the reference case, typically because of sample size problems; however, even with a smaller sample size, in terms of identifying the most probable ones the model does a better job.
You can see that in identifying the most probable cases the model stays above the reference line. As we discussed in the previous lecture, we are mainly interested in the top left corner of this curve: we would like to identify as many observations correctly as possible, probably somewhere here.
Let us create the cumulative lift curve for our test partition. First, create the data frame of probability values and actual class, look at the first 6 observations, and sort them in decreasing order of probability. We can see the same pattern here.
(Refer Slide Time: 29:14)
Let us compute the cumulative actual class values. In this case, as you can see, the first record has a 0 value, so the first point of the reference line should reflect that. Let us look at the range again for the test partition: the x limit is appropriate, and the y limit, after a small change, is also appropriate. So let us create the curve; this is the curve. Because we have very few observations in the test partition, the model curve may sometimes touch the reference line or even go below it. The first coordinate of the reference line should be (1, 0), so that is fine. We can then create our reference line and add the legend.
As we can see, because there are so few observations, in some parts of the plot the model goes below the reference line; however, in the initial part of the plot the model is clearly above the average case. This is for a very small test partition of just 11 observations. Again we can say the usefulness of the model is that it does a much better job of identifying the most probable ones.
With this, in logistic regression modeling we have so far done two exercises: one using the promotion offers data set and this one using the flight details data set. Some important points remain, such as goodness-of-fit measures, because logistic regression is a classical statistical technique and, as we discussed, is quite popular in statistical modeling.
(Refer Slide Time: 31:07)
In the next lecture we will discuss some of those aspects related to statistical modeling and build a few more models that could be more useful in a classical setting. We will also try to understand the difference between logistic regression and linear regression, and look at why linear regression is not suitable when the outcome variable is categorical in nature.
We also discussed in previous logistic regression lectures that logistic modeling can be used for profiling, that is, understanding the similarities and differences between two groups. How a logistic regression model can help in profiling tasks is something we would also like to understand. Some of these things we will be doing in the next lecture, so at this point we will stop.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 52
Logistic Regression-Part VII
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous few lectures we have been discussing different aspects of logistic regression, and we will continue that discussion in this lecture as well.
In previous lectures we used the flight details data set and the promotional offers data set; some details regarding those modelling exercises could not be covered, so we will do that now and discuss them. Let us import the data set. First we load the library xlsx, since we are again going to use the flight details data set.
(Refer Slide Time: 01:00)
Let us import this. Now let us remove NA columns and NA rows and look at the observations. Let us also check the structure of the data frame. We will follow some of the steps we have gone through in previous lectures, so we will just move quickly through them.
(Refer Slide Time: 01:34)
In the previous lecture's exercise we had used a different grouping for departure time. I have again made certain changes to this grouping; it is not very important, but let us run the model with a new grouping for the departure time interval. This is the range we are already familiar with.
The breaks are now 0 and 12, that is, 0 hours and 12 hours. This is how we are creating the departure time variable: if the timing falls between the two breaks, 0 hours and 12 hours, it goes into the first category, 0 to 12; otherwise it goes into the second category, 12 to 24. Let us create this variable and append it to the data frame, change it to a factor variable, do the same for the day variable, change the labels, and also convert flight time to the appropriate format. This is the structure that we have. After taking a backup we do not want to carry forward some of the variables, so let us get rid of them. Looking at the structure again, these are the remaining variables, and in the first 6 values you can see everything is fine.
Now let us work on the outcome variable as we have been doing in previous lectures: change it to a numeric code and change the reference category. This is what it becomes; this is fine.
(Refer Slide Time: 03:30)
Now we can move ahead and do our partitioning into 2 partitions, training and testing, with 90 percent of observations for the training partition and 10 percent for the test partition. The same glm function can be used to model this again. These are the results.
(Refer Slide Time: 03:53)
Let us look at the results once again. You would notice that we have been running this model repeatedly on the same data set, and every time the significance levels have been changing. As I have been explaining, with a smaller data set the results, mainly the significance levels, will change slightly depending on which observations end up in the training partition.
Now you can see that flight carrier Indigo has become significant at the two-star level, destination has also become significant, and flight time also shows a higher level of significance.
More importantly, if we look at the p-values, the Madras dummy for source also has a smaller p-value and is significant anyway; flight carrier and destination are significant at 90 percent or better. However, the new grouping that we have created out of the departure time intervals still does not come out significant, although its p-value is now smaller.
(Refer Slide Time: 05:14)
With this, we will discuss an important aspect of logistic regression, that is, measures of goodness of fit.
Just like multiple linear regression, logistic regression is primarily a statistical technique, and in statistical modelling the main objective is to fit the data, as we have discussed many times before. In multiple linear regression we have metrics, multiple R squared and adjusted R squared, which are used to assess how well the model fits the data.
Similarly, since logistic regression is also a statistical technique, we need metrics to measure its goodness of fit, that is, how well the model fits the data. Because there are certain key differences between logistic regression and linear regression, the metrics differ, and we will talk about some of them now.
You can see in the code that I have created a vector here, gf. The first element is from mod 3, the model we have just computed, and it is the residual degrees of freedom: df.residual is one of the values returned by the glm function and gives us the residual degrees of freedom. Then we have deviance, which is again a value returned by the glm function. Then there are a few other things mainly for descriptive purposes: for example, a table of the outcome variable divided by the total number of observations, which gives us the percentage of successes in the training data, and then we have the number of iterations.
As we talked about, the estimation technique used in logistic regression is different from the one used in multiple linear regression. We said that the maximum likelihood method is typically used for logistic regression; however, we have been using the glm function, and if we look up some of its arguments we can get more detail about the estimation technique it uses.
Specifically, if we look at the glm.fit method, we can see that iteratively reweighted least squares is used: the glm function we are using estimates the model by iteratively reweighted least squares, which is quite similar in approach to the maximum likelihood estimation technique we talked about.
So this is the maximum likelihood method (MLM) that we talked about in our discussion, as we can also see in these slides; the approach is quite similar. In particular, in R's implementation, the glm function we are using applies iteratively reweighted least squares to estimate the coefficients, which is quite similar to MLM, and as we discussed, a number of iterations have to be performed to estimate these parameters so that we get the model that best fits the data.
The number of iterations indicates exactly that. Then we have one metric which is quite similar to the multiple R squared of linear regression; it is computed using the deviance values, the null deviance and the residual deviance, which we can see among the returned values. Deviance is one of the returned values, and null deviance is also returned; the null deviance corresponds to the naive rule. Then 1 minus the deviance divided by the null deviance gives us a value quite similar to the multiple R squared of multiple linear regression.
This value gives us a metric for understanding the goodness of fit of a logistic regression model. The deviance on its own can also be used; it is quite similar to the sum of squared errors (SSE) in linear regression. So we have deviance, analogous to SSE, and a metric similar to multiple R squared, and these can be used to assess the goodness of fit of a logistic regression model.
Let us compute some of these: the residual degrees of freedom, the deviance (similar to SSE in linear regression), the percentage of successes in the training data, the number of iterations, and a multiple-R-squared-like metric.
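A sketch of these computations, assuming the model object is called mod3 and the training outcome column FlightStatus is coded 0/1:

# Goodness-of-fit quantities for the logistic regression model
gof <- c(residual_df = mod3$df.residual,                       # residual degrees of freedom
         deviance    = mod3$deviance,                          # analogous to SSE
         pct_success = mean(df_train$FlightStatus == 1),       # share of class 1 in training data
         iterations  = mod3$iter,                              # IRLS iterations used by glm
         pseudo_R2   = 1 - mod3$deviance / mod3$null.deviance) # multiple-R-squared-like metric
round(gof, 4)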
(Refer Slide Time: 11:34)
Let us create a data frame with suitable row names; these are the values. You can see the residual df is 87 here. Recall that in the training partition we have 96 observations, and if we go back to our summary results we can count the predictor coefficients: 1, 2, 3, 4, 5, 6, 7, 8, so 8 coefficients besides the intercept, and 87 is the residual degrees of freedom.
(Refer Slide Time: 12:06)
So 87 plus 8 makes 95, that is n minus 1; that is how the degrees of freedom have been computed, so this value is correct. Then we have the deviance value, which some commercial statistical software also reports as a standard deviation estimate. This is analogous to the SSE, the sum of squared errors, in multiple linear regression. Then we have the number of iterations used to arrive at this particular model. Then we have a value similar to multiple R squared: about 20 percent of the variability in the outcome variable has been explained by this model. As we said, it is computed as 1 minus the deviance of mod 3 divided by its null deviance.
Further, in terms of deviance, the null deviance represents the naive-rule value. We have to see how much reduction in deviance our model has achieved and whether that reduction is significant or not; this can be tested using a chi-square test.
We have the function pchisq. There we can use these two values, taking the difference between the null deviance and the deviance, which is the reduction in deviance from the naive rule achieved by our model.
The number of coefficients used can serve as the degrees of freedom for the test, because those are the degrees of freedom spent to reduce the deviance: from the 95 (that is, 96 minus 1) available degrees of freedom we have come down to 87 residual degrees of freedom, so 8 coefficients have been used.
That information can be used to perform the chi-square test and find out whether the reduction has been significant or not. The third argument, lower.tail, is specified as FALSE, and we can compute the chi-square p-value; we can see that it is a small value. Therefore the reduction in deviance is significant, which is also clear from the difference between the deviance and null deviance values.
We can also inspect these directly: this is the null deviance, let us look at its value, and this is the deviance, let us look at that value too. The difference shows a large enough reduction in deviance, which is why it came out as a significant difference. These are some of the metrics that can be used to measure and assess the goodness of fit of a model in a statistical setting, as I mentioned; in that setting these metrics are the more useful ones.
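A sketch of this chi-square test on the reduction in deviance, with mod3 as the assumed model object:

# Test whether the reduction in deviance from the null model is significant
dev_reduction <- mod3$null.deviance - mod3$deviance      # reduction achieved by the model
df_used       <- mod3$df.null - mod3$df.residual         # coefficients spent (8 here)
pchisq(dev_reduction, df = df_used, lower.tail = FALSE)  # small p-value => significant reduction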
In statistical modelling we usually stop once we have built the model; typically all the available observations are used, and then we assess the model with respect to some of these metrics. Now let us move forward to our next discussion point in logistic regression, namely whether linear regression can be used for a categorical outcome variable.
There are some situations where linear regression can be used with a categorical outcome variable, which we will discuss later; right now we are discussing the more important points concerning the general applicability of linear regression to a categorical outcome variable.
Technically it can be done: we can treat the categorical outcome variable as continuous, that is, code it numerically and keep it as a numeric variable, and linear regression will produce results. However, whether those results are meaningful is what we need to understand. Technically it can be applied, but there are going to be anomalies that would lead to spurious modelling. What are some of these?
Number one: predictions can take any value, not just the dummy values. For example, in the binary logistic regression models we have been building on some of these data sets, the outcome variable typically has two classes, class 1 and class 0, coded 1 and 0. However, when we apply a linear regression model to a categorical outcome variable, the prediction can take any real value, not just the dummy values 0 and 1. So that is one challenge: how do we map predicted values, which can be any real number, to the actual outcome values 0 and 1?
Second, the outcome variable, and hence the residuals, do not follow a normal distribution. As we discussed for linear regression, one of the important assumptions is that the outcome variable, or the residuals, should follow a normal distribution; that is clearly not the case here, since a categorical outcome variable takes just 2 values, 0 and 1.
The deviation from the normal distribution is substantial; the outcome actually follows a different distribution, the binomial distribution, so that is one problem. Another assumption we talked about in multiple linear regression is homoscedasticity; if we apply linear regression to a categorical outcome variable, this assumption is also violated. Homoscedasticity means the variance of the outcome variable is expected to be constant across all records, and to apply multiple linear regression we want the outcome variable to have this property.
So the variance should be constant across all records; however, if we look at the variance of our categorical outcome variable, it is n times p into (1 minus p), and because this depends on the value of p, the variance changes as p changes. When the probability p is close to 0 or close to 1 the variance is on the lower side, and it is highest when p is around 0.5. Therefore the variance is not constant: it varies with the probability value across records, violating homoscedasticity.
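A small illustration of this point; the values of p below are arbitrary:

# The Bernoulli variance p*(1 - p) changes with p, so a 0/1 outcome cannot be homoscedastic
p <- c(0.05, 0.25, 0.5, 0.75, 0.95)
data.frame(p = p, variance = p * (1 - p))
# largest at p = 0.5, shrinking toward 0 as p approaches 0 or 1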
These are some of the problems we can directly see, and they explain why linear regression cannot, in a general sense, be applied to a categorical outcome variable. What we will do now is an exercise in R to understand this same aspect.
What we will do is apply a linear regression model to a categorical response and examine its applicability and some of the anomalies or violations that could arise. For this purpose, as you can see, the comment reads "multiple linear regression model for a categorical response", and promotion offers is the data set we are going to use for this exercise. Let us import the data set and let it load.
Once the observations have been loaded into the environment, as we can see in the environment section, we will go through some of the usual steps. It has been loaded: df 2 has 5000 observations and 9 variables. Let us remove NA columns or NA rows, if there are any, and look at the structure. This is the data set, and we are already familiar with it.
(Refer Slide Time: 22:41)
Let us go through some of the transformations we have done in previous lectures as well: take a backup, and then select just these 4 variables, that is income, promotional offer (our outcome variable), family size and online activity. So these are the variables selected for this exercise: income, promotion offer, family size and online.
The outcome variable is going to be promotional offer. As you can see, we have commented out the lines of code that we used earlier to convert these numeric variables into categorical variables. Promotional offer and online are actually categorical (factor) variables, but we are not converting them into factors because we are going to apply linear regression modelling, so we will keep them as numeric.
Partitioning is 60 percent and 40 percent in this case, so let us do the partitioning; df 2 train has 3000 observations and 4 variables. Now the lm function is going to be used, with promotional offer regressed against all the predictors present in df 2 train. Let us run this.
Now let us look at the summary of the results. We can see that the intercept is significant, income is significant and family size is significant. Looking at the estimates, the income coefficient is quite a small value and family size is about 0.03; online is not significant, as we can see.
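A sketch of this fit, where df2 and the column name PromoOffer are assumed names; the outcome is deliberately kept numeric:

# 60/40 partition and a linear regression on the 0/1 promotional offer outcome
set.seed(42)
train_idx <- sample(nrow(df2), round(0.6 * nrow(df2)))
df2_train <- df2[train_idx, ]
df2_test  <- df2[-train_idx, ]
mod4 <- lm(PromoOffer ~ ., data = df2_train)
summary(mod4)                         # coefficients, R-squared, F-statistic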
(Refer Slide Time: 24:55)
We can also look at the other values: adjusted R squared and multiple R squared. The multiple R squared is about 27 percent, and the p-value is quite small, so the model is significant. However, as we discussed, certain problems, certain anomalies, could be present.
So what we will do is compute some of those quantities to check whether the anomalies are there in this particular case. We run an ANOVA to extract some of the parameters: the sums of squares, mean squares, F statistic and the probability values for the predictors income and family size are given there in the ANOVA table.
What we are going to do is compute these values in a format that can be used for tabular presentation later on. For mod 4 we have the F statistic value, which is returned as part of the summary of the model: when we apply summary to the model object, you can see this is nothing but the F statistic for the model, along with its degrees of freedom, the regression degrees of freedom and the residual degrees of freedom. These are going to be stored in a data frame.
First we have the regression degrees of freedom, then the residual degrees of freedom and then the total; this data frame is about degrees of freedom, as indicated by its name, DF. Then we will compute the sums of squares; these values we extract from the ANOVA table.
(Refer Slide Time: 27:21)
First the sum of squares for the regression, then for the residuals, and then the total; this is recorded in the variable SS. Then we look at the mean squares, which are also extracted from the ANOVA table results; you can see the mean square column, and these values are being extracted.
For the first 3 values we use the head function, whose role here is a bit different; in both of these computations we have used head. You can see the second argument is minus 1, which gives us all the values except the last one in the vector. For example, for the mean square or sum of squares columns there are 4 values in each of these two vectors; excluding the last value, which corresponds to the residuals, the first 3 values are taken. That is what we want: it gives us the sums of squares, or mean squares, corresponding to the regression terms.
So first the regression part, then the residuals; let us compute this. Then let us also extract the F statistic from the vector we have already seen, and after that we compute the corresponding probability value.
So, the probability value corresponding to the F test: that is how it is being computed. pf is
the function that can be used; for more detail you can go into the help section and find
more information about pf. You can see it belongs to the F distribution.
So, pf can be used to compute the corresponding p value. What we need is the F statistic as
the first argument, then the regression degrees of freedom as the second argument, the
residual degrees of freedom as the third argument, and then we specify lower tail as false.
So, we will get the corresponding p value. Let us record it in this format; once this is done
we can create the table data frame. Let us assign names for it and have a look at this
particular table once computed.
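A minimal R sketch of this extraction, assuming mod 4 is the fitted lm model (written here as mod4); the component and column names follow base R's summary and anova outputs:

# aov_tab has one row per predictor plus a Residuals row
aov_tab <- anova(mod4)
fstat   <- summary(mod4)$fstatistic        # named vector: value, numdf (regression df), dendf (residual df)

df_reg <- unname(fstat["numdf"]); df_res <- unname(fstat["dendf"])
DF <- c(df_reg, df_res, df_reg + df_res)                 # regression, residual, total

ss <- aov_tab$"Sum Sq"
SS <- c(sum(head(ss, -1)), tail(ss, 1), sum(ss))         # regression SS, residual SS, total SS
MS <- c(SS[1] / DF[1], SS[2] / DF[2], NA)                # mean squares (not defined for total)

Fval <- unname(fstat["value"])
pval <- pf(Fval, df_reg, df_res, lower.tail = FALSE)     # p value for the overall F test

tab <- data.frame(DF = DF, SS = SS, MS = MS,
                  F = c(Fval, NA, NA), p = c(pval, NA, NA),
                  row.names = c("Regression", "Residuals", "Total"))
tab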
(Refer Slide Time: 29:50)
So, in the first column we have the degrees of freedom. For regression there are 3, because,
remember, we used just 4 variables, one being the outcome variable, so 3 predictors;
therefore the regression degrees of freedom is 3. The total number of observations is 3000,
so the total degrees of freedom is n minus 1, that is 2999, and the residual degrees of
freedom is 2996.
The sum of squares for regression and for residuals is also there. You can see that the
residual sum of squares is much higher; earlier we saw that the explained variance, the
multiple R square that we had computed, was on the lower side, and that is also indicated
here. The mean square errors are also there, and then we have the F statistic and the
p value, which is a small value.
(Refer Slide Time: 31:00)
So, this gives us some information about the model. Now we will use this model to score a
particular observation. Let us do a prediction for a new observation: say the new
observation is a customer with an annual income of rupees 5 lakhs, with 2 family
members, who is not active online.
So, this is the information about the particular customer, and we want to predict whether
the customer is going to accept the promotional offer or not. Annual income is 5 lakh,
family size is 2, and the customer is not active online. We can use the predict function: the
first argument is, as usual, the model object mod 4, and then in a data frame we pass on
the values of the predictors, for example income 5.
The values should be in the same units as were used for the modelling exercise in the
training partition. So, you can see income 5, family size 2, and online 0, because this
particular customer is not active online. Once we do this we will be able to score this
particular observation. You can see the prediction comes out to be a negative value; that
was one of the first points that we discussed in the slide.
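A small sketch of this scoring step; the predictor names below (Income, Family, Online) are assumptions used only for illustration, with income in lakhs to match the training data's units:

new_obs <- data.frame(Income = 5, Family = 2, Online = 0)   # hypothetical column names
predict(mod4, newdata = new_obs)
# The result can be any real value, for example a negative number such as -0.14,
# even though the outcome was coded only as 0 or 1.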
(Refer Slide Time: 32:11)
Let us go back; you can see here that the predictions can take any value, not just the
dummy values. We can see that a negative value has been obtained here, and that is one
anomaly that we can clearly see. The set of values for the outcome, offer or no offer, as we
already know, is 0 and 1, and the predicted value comes out to be minus 0.14.
So, one difficulty is: how do we do our classification in this case? Now, let us look at the
residuals. The second anomaly that I discussed is that the residuals of the outcome variable
do not follow a normal distribution. Let us plot a histogram and find out whether this is
being followed.
(Refer Slide Time: 33:07)
When we plot a histogram of the residuals, you can see that a normal distribution is clearly
not being followed: there is one grouping here and another grouping there, one with lower
values and lower frequency and the other with higher frequency. So, this clearly looks
bimodal, like two different groups. It is definitely not following a normal distribution, and
the distortions due to the underlying binomial distribution can be seen here.
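For reference, a one-line sketch of this check on the fitted model:

hist(residuals(mod4), main = "Histogram of residuals", xlab = "Residual")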
So, the exercises and the discussion that we have been doing were mainly pertaining to
classification tasks. As we talked about, logistic regression, being a statistical technique, is
also used in statistical modelling, and the kind of tasks that we generally do in statistical
modelling are quite similar to what we can call profiling tasks.
As we talked about in the starting lecture on logistic regression, profiling is about
understanding the similarities and differences between two groups. So, logistic regression
can also be used to understand which variables bring out the similarities or differences
between groups; let us discuss that aspect as well.
In profiling tasks the situation is slightly different. In a classification task we typically
build our model and look at the performance of that classifier using the classification
matrix, the overall accuracy or overall error metric, and some variations when we have a
class of interest.
Those are the things that we typically do; we also typically look at the lift chart, especially
when we have a class of interest, and the decile chart, to see whether, despite a higher
error, the model is still useful for the class of interest with respect to the naive rule or the
average case.
Those are some of the things we do in classification tasks; in profiling tasks, in addition to
what we follow in classification, apart from model performance on the validation partition
we also assess the model's fit to data on the training partition. As I said, in a statistical
technique typically the whole sample is used; however, since we are using the training
partition for model building, the model's fit to data is assessed on the training partition,
while model performance is assessed on the validation partition. Some of these things we
talked about when we discussed goodness of fit measures: the deviance, and 1 minus
deviance divided by null deviance, which is the equivalent of multiple R square; some of
those things apply here as well.
So, the model's fit to data is assessed on the training partition; however, we still focus on
avoiding overfitting, because, as we talked about, when we have metrics which look for
goodness of fit and we do modelling to maximize goodness of fit, it can lead to overfitting.
So, we would still like to avoid overfitting and still have good classification performance as
well.
This should also be looked at from the perspective of model performance, because, as we
have talked about, in a data mining model we would like to keep the insignificant
variables in the model if they provide some practical importance in terms of scoring new
observations.
In logistic statistical modelling we would simply drop the insignificant variables, because
we are interested only in understanding the phenomenon, the underlying relationship.
Profiling is quite similar to that approach; however, because we are doing data mining
modelling, we have to balance between these two, between prediction performance and
the profiling task itself.
So, we have to really see whether the predictors can be dropped, just like in statistical
modelling, or whether they have to be kept in the model because they also provide some
practical significance for scoring new observations; that balance has to be achieved.
So, we have to avoid overfitting, and we have to look at the usefulness of predictors in both
contexts: the data mining context, from the prediction point of view, and the statistical
context, in terms of finding and understanding the variables which differentiate the
groups, which bring out the similarities or differences between groups.
This kind of exercise is done in profiling tasks, and the goodness of fit metrics we have
already understood through an exercise in R. First we look at the overall fit of the model;
only if the overall fit of the model is good do we go ahead and look at the individual
variables.
So, the first step typically, in profiling or statistical modelling, is to look at the overall fit of
the model. In this particular case, logistic regression, as we talked about in a previous
lecture, the deviance is a metric that could be used; we also said that it is the equivalent of
SSE in linear regression, and that 1 minus deviance divided by null deviance is the
equivalent of multiple R square in linear regression.
So, these are two metrics that can be used to assess the overall fit of the model, and once
this is done we look at the individual predictors. We look at whether they are significant or
not, as I talked about, and whether we can strike a balance between prediction and
classification performance on one side and the profiling or statistical modelling context on
the other.
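A minimal sketch of these two overall-fit checks for a fitted logistic regression; the object name logit_mod and the formula are assumptions:

# logit_mod <- glm(y ~ ., data = df_train, family = binomial)   # hypothetical model
dev       <- deviance(logit_mod)         # residual deviance, analogous to SSE
null_dev  <- logit_mod$null.deviance     # deviance of the intercept-only model
pseudo_R2 <- 1 - dev / null_dev          # analogous to multiple R square
pseudo_R2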
(Refer Slide Time: 40:04)
With this we move to our next discussion. Till now the exercises that we have been doing
were mainly focused on binary classification; in a binary logistic regression model we just
had 2 classes, class 0 and class 1. Can logistic regression be extended to a scenario where
we are dealing with more than 2 classes, with m classes? Yes, it is possible, so we will
discuss some of those things.
The first one is multinomial logistic regression. When the categorical variable is nominal,
that is, the classes have no order, we can apply multinomial logistic regression. What
happens in multinomial logistic regression is that, out of the m classes we have, we first
pick one as the reference category, and for the remaining m minus 1 classes we create
separate binary logistic regression models.
So, for each of the m minus 1 classes apart from the reference category class we will create
a separate binary logistic regression model; that means for class 1 we will have a binary
scenario with the probability of the observation belonging to class 1 and the probability of
it not belonging to class 1, and that for each of the m minus 1 classes.
So, we will be dealing with m minus 1 binary logistic regression equations, and using them
we can compute all those probability values with the help of the predictors. The
probability for the remaining reference class can always be computed from the m minus 1
probability values for the m minus 1 classes: we can just subtract their sum from one to
get the probability value for the reference class. Once we have the probability values for
all m classes, we can apply our usual most probable class routine, where the class having
the highest probability value is assigned to the new observation.
So, this is how we can go about applying logistic regression to an outcome variable with m
classes. This is called multinomial logistic regression, and it is applicable mainly to a
nominal categorical variable, that is, a categorical variable having nominal classes.
The second scenario is when we have a categorical variable with ordinal classes, that is, an
ordinal variable. In those cases we can apply ordinal logistic regression, and within ordinal
logistic regression we again have 2 scenarios. As we have understood in some of the initial
and supplementary lectures, for ordinal variables the order among the different labels is
also meaningful, so the less than or equal to and greater than or equal to operations are
also applicable.
The first scenario is a large number of ordinal classes. If our outcome variable, a
categorical variable with ordinal classes, has a large number of ordinal classes, then one
solution is to treat that ordinal variable as a continuous variable and apply multiple linear
regression.
(Refer Slide Time: 44:10)
So, when we have a categorical outcome variable with m ordinal classes and m is large,
then, as I talked about, we can treat this particular variable as a continuous variable and
apply multiple linear regression.
One justification for this is, as we talked about earlier, that in the binary situation we have
just two values for the categorical variable, while the predicted values from multiple
linear regression can take any real value; that was the main problem there.
But when m is large, the set of possible values is much bigger; for example, if there are 50
groups, the number of values that can be taken by this particular ordinal variable is much
larger.
Therefore, the predicted values can be mapped more easily, they will be close to some of
these values, and multiple linear regression can still be applied; so this is one way. When
we have an ordinal variable with a large number of classes, when m is large, then instead
of logistic regression we can still apply multiple linear regression; that is the first scenario.
The second scenario is when we have a small number of ordinal classes. If m, the number
of ordinal classes, is small, then we will run into the same problem as in binary
classification; a similar problem would be there, since we will have only a few values, say
0, 1, 2, 3.
So, with a small number of ordinal classes we will run into the same problem. What we do
then is use a different version of logistic regression called the proportional odds or
cumulative logit method, as indicated in the slide.
What we do in this particular method is create separate binary logistic regression models
for m minus 1 cumulative probabilities. When we discussed multinomial logistic
regression, we had a separate binary logistic regression model for each of the m minus 1
classes.
However, here we will have separate binary logistic regression models not for the m minus
1 classes, not for the presence or absence of each class, but for m minus 1 cumulative
probabilities.
Let us understand what we mean by that with the 3 class example written in the slide: a 3
class case with ordinal classes C1, C2 and C3 and a single predictor x1. Our logit equations
could be something like this: the logit for C1 could be alpha 0 plus beta 1 x1, and the logit
for C1 or C2 would have a different intercept plus the same beta 1 x1. So, the first logit
equation is just for the observation belonging to C1; the second is for the observation
belonging to C1 or C2, which gives us the cumulative sense. As we talked about, for ordinal
variables the order is important, which means the different classes can be compared.
Therefore, C1 or C2 is meaningful here, and if we look at the right-hand side of the
equations you can see that beta 1, the coefficient of x1, is the same in both equations, so
the comparison can be made.
So, only the intercepts differ between the equations, while the slope coefficient is common.
This is appropriate when the ordering of classes is meaningful, that is, when one class can
be said to be lower or higher than another, just like categories such as low, medium, upper
medium and high, or scales from strongly agree to strongly disagree. For those kinds of
ordinal classes, where the order is meaningful, we can set up cumulative probability values
like this, and the same beta coefficient can be used for both these logistic regressions.
From these equations we can compute the cumulative probability values, and once the
cumulative probability values for these classes have been computed, the actual probability
values for C1, C2 and C3 can be derived using the formulation that converts a logit into a
probability.
Once we have the probability values for all the classes, we can again apply the most
probable class method and assign the class based on the highest probability value. In this
fashion we can apply ordinal logistic regression to a scenario with a small number of
ordinal classes.
So, what we will do is go to R Studio and do an exercise for this: logistic regression
modelling for more than 2 classes, m greater than 2. We will create a hypothetical dataset
here; the number of observations, n, is 1000 in this case, as you can see.
Now we are going to create a data frame with 2 variables, x1 and x2. As you can see, we
are using runif for both x1 and x2, so the values will lie between 0 and 100, and n values,
that is 1000 values, will be created for each variable.
So, let us create this data frame and look at some of the observations. For the first 6
observations you can see that all the values for both variables x1 and x2 lie between 0 and
100.
Now we will create a categorical outcome variable with 3 classes. This particular dataset is
going to be used for both scenarios, the multinomial scenario and the ordinal scenario; it is
just for illustration purposes, so we are not specifying whether the variable is ordinal or
not, we are simply going to use it for both scenarios.
So, we are using the transform function to create our categorical outcome variable. You
can see that we compute y as 1 plus an ifelse: if the value of 100 minus x1 minus x2, plus a
random term drawn from a normal distribution with standard deviation 10, is less than 0,
then, using this information based on x1, x2 and the random term, the observation gets
one class; otherwise a second computation decides whether it gets the next class or the
last one.
Let us compute this. Once this is done, we look at the observations: you can see another
variable y, a categorical variable, has been created, taking values such as 1 and 3. Let us
look at the structure of this particular data frame.
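A sketch of this data construction, assuming the form described above; the exact cut-offs and noise scale of the second condition are not stated in the lecture, so the values below are guesses used only for illustration:

set.seed(1)                                   # for reproducibility (not in the lecture)
n  <- 1000
df <- data.frame(x1 = runif(n, 0, 100),
                 x2 = runif(n, 0, 100))
# y takes values 1, 2 or 3 depending on two noisy linear cut-offs in x1 and x2
df <- transform(df,
                y = 1 + ifelse(100 - x1 - x2 + rnorm(n, sd = 10) < 0, 0,
                               ifelse(50 - x1 - x2 + rnorm(n, sd = 10) < 0, 1, 2)))
head(df)
str(df)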
So, there are 3 variables: x1 and x2 with values between 0 and 100, and y taking the 3
values 1, 2 and 3. Let us plot this particular dataset. This is our default palette; however, I
would like to use a gray palette with 3 shades. In this fashion we can create as many
shades as the number of classes we require. Let us also set the ranges of x1 and x2, which
we already know.
(Refer Slide Time: 54:25)
Because we have just now created these variables, their ranges are clearly understood; so
the limits are 0 to 100 and 0 to 100, and the colouring uses the variable y converted with
as.factor. Let us create this plot.
So, this is our plot: one group is here, the second group, in a medium gray shade, is here,
and the third group, in a lighter gray colour, is here. This is a plot of x1 against x2, and the
outcome variable has been colour coded into the 3 categories.
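A small sketch of this plot, assuming the data frame df created above; the particular gray shades are an assumption:

palette(gray(seq(0.2, 0.8, length.out = 3)))    # 3-shade gray palette
plot(df$x1, df$x2, col = as.factor(df$y),       # colour by class
     xlim = c(0, 100), ylim = c(0, 100),
     xlab = "x1", ylab = "x2", pch = 19)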
Now we will use multinomial logistic regression. For this we need the nnet package; this
package is actually for neural networks, but it offers a function that can be used for
multinomial logistic regression.
multinom is the function. What we are going to do now is regress the outcome variable y
against the remaining variables, the predictors, in this particular dataset, using all the
observations. Let us run this and look at the summary.
(Refer Slide Time: 56:00)
We can see the coefficients for x1 and x2, and the standard errors; the residual deviance
and the AIC value are also indicated here. We can also look at the fitted values for the first
6 observations.
These fitted values are nothing but the estimated probability values for the 3 classes 1, 2
and 3. Now, if we had another partition, a validation or test partition, we could use the
predict function to score those partitions and obtain estimated probability values for those
new observations.
However, for demonstration purposes we are applying the predict function on the training
partition, that is, the full dataset itself. You can see another argument, type, set to probs,
which requests probability values. Let us run this; you will see that we get the same
probability values as the fitted ones. You can see the first row has the same values, so
using the predict function we have obtained the same values as the fitted values.
Now let us classify these observations. This is how we can classify: if the probability value
for class 1 is greater than the probability value for class 2 and also greater than the
probability value for class 3, then of course the observation is classified to class 1.
Otherwise we compare the probability value for class 2 with the probability value for class
3; if it is greater, then class 2 is assigned, otherwise class 3.
In this fashion we can assign all the observations to appropriate classes; this is an
implementation of the most probable class method. Once this is done you can see that
predicted classes have been created for all 1000 observations, and the observations have
been assigned to appropriate classes.
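A compact sketch of this multinomial workflow; object names such as df and mod_mn are assumptions:

library(nnet)

mod_mn <- multinom(as.factor(y) ~ ., data = df)            # multinomial logistic regression
summary(mod_mn)

probs <- predict(mod_mn, newdata = df, type = "probs")     # estimated class probabilities

# Most probable class: pick the class with the highest probability for each row
pred_class <- colnames(probs)[max.col(probs)]

# 3 x 3 classification matrix and overall accuracy
cm <- table(actual = df$y, predicted = pred_class)
cm
sum(diag(cm)) / sum(cm)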
(Refer Slide Time: 58:13)
Now let us look at the classification matrix. We can see the actual values and the predicted
values. Till now the classification matrices that we have been creating had 2 values, 0 and
1, so we had a 2 by 2 classification matrix; this time we are seeing a 3 by 3 classification
matrix.
So, there are 3 possibilities for the actual values, 1, 2 and 3, and 3 for the predicted values,
and we have the corresponding counts. You can see that the diagonal elements, these 3
elements, represent the records which have been correctly classified, and the off-diagonal
elements hold the counts of records which have been incorrectly classified.
From this we can compute the classification accuracy, which is 91 percent in this case, and
the error, which is the remaining 9 percent. In this fashion we can apply multinomial
logistic regression to a particular dataset. Now let us move to the next part, ordinal
logistic regression.
For ordinal logistic regression, as I talked about, there are 2 scenarios. In the first, where
m is large, we can apply multiple linear regression; in an exercise in a previous lecture we
had applied multiple linear regression, but that was for a categorical variable which had
just 2 classes, 0 and 1.
In the same fashion, multiple linear regression can be applied in the first scenario, so there
is not much difference in terms of applicability. What we will do is an exercise for the
other scenario, where we have to apply ordinal logistic regression, the cumulative
probabilities or cumulative logit method; for this we need the MASS package.
So, let us load this particular library; polr is the function for this. It can fit both ordered
logistic and probit regression, and it can be used for the cumulative logit method. We can
see in the help section that polr fits ordered logistic or probit regression.
What we are interested in is ordered logistic regression. You can see in the method
argument that the first option is logistic, which is the ordered logistic method; this is also
called proportional odds logistic regression, which we have discussed.
Let us go ahead and build this model; polr is the function. First, as you can see, the y
variable has been converted into a factor variable and is then regressed against all the
other variables, the predictors; the data is the full dataset. We also need the Hess argument
set to true, mainly so that we can apply the summary function on the model object later
on.
Now let us apply summary; we get the results, and you can see the coefficient values for x1
and x2, the standard errors and the t values. We also have the residual deviance and the
AIC value for this particular model. We are interested in the fitted values, which are also
returned by the model, so let us look at some of them.
You can see, for each of the classes 1, 2 and 3, these values are the estimated probability
values. Using these values we can again apply the most probable class method, so first we
need to compute the probabilities and then do the assignment as per the most probable
class method, like we did in the previous exercise.
As I talked about, the predict function can again be used to score new data; in this case we
are scoring the training partition itself again, so we expect to get the same values. You can
see the last row here has the same values; since we are scoring the training partition itself,
we get the same fitted values.
Now we classify the observations, as we did in the previous exercise: if the probability
value for class 1 is greater than the probability value for class 2 and greater than that for
class 3, then assign the observation to class 1; otherwise we do one more comparison, and
if the probability value for class 2 is greater than the probability value for class 3, then
assign it to class 2, otherwise class 3. In this way we will have the appropriate
classifications.
Now let us generate the classification matrix. Again you can see a 3 by 3 matrix: 3 possible
actual values and 3 predicted classes; the diagonal elements represent the correct
classifications and the off-diagonal elements represent the incorrect classifications.
Let us compute the classification accuracy; you can see it is eighty two percent, and the
remaining is the error.
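A compact sketch of this ordinal workflow, under the same assumed object names:

library(MASS)

# Proportional odds (cumulative logit) model; Hess = TRUE so that summary() works
mod_ord <- polr(as.factor(y) ~ ., data = df, method = "logistic", Hess = TRUE)
summary(mod_ord)

probs_ord <- predict(mod_ord, newdata = df, type = "probs")   # estimated class probabilities

pred_ord <- colnames(probs_ord)[max.col(probs_ord)]           # most probable class

cm_ord <- table(actual = df$y, predicted = pred_ord)
cm_ord
sum(diag(cm_ord)) / sum(cm_ord)                               # classification accuracy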
(Refer Slide Time: 63:49)
With this we have completed our discussion on logistic regression. Today we have also
been able to cover the scenarios where more than 2 classes are present in our categorical
variable: what happens when the classes are nominal, and how we can apply logistic
regression when the classes are ordinal; we have seen that, and we have also done an
exercise in R.
So, we stop here, and we will continue our discussion in the next lecture with a new
technique.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 53
Artificial Neural Network-Part I
Welcome to the course Business Analytics and Data Mining Modeling using R. So, in
this particular lecture we are going to start our discussion on Artificial Neural Networks.
So, let us get this started.
Artificial neural network models mimic the process of memorizing and learning in
humans. A second property is the capacity to generalize from particulars, which is also a
human capacity, and the third is the biological activity of the brain, where interconnected
neurons learn from experience.
So, the biological activity and the structure that is part of the brain, a similar kind of
structure, is used in artificial neural networks, and the learning process is also quite
similar; it is directly inspired by human brain activity. These neural network models
mimic the human brain's learning process.
So, the learning and memory properties, the ability to generalize from specifics, and the
overall network of neurons: all these things are quite similar in artificial neural networks.
Typically, neural network models are used where the relationship between the outcome
variable and the set of predictors is quite complex.
Because of that, there are applications in finance and in some engineering areas; a few
examples are given. Credit card fraud is a finance application, detecting whether there is
fraud in a particular credit card account, and in the engineering disciplines one example is
autonomous vehicle movement, where decisions such as whether the steering has to be
moved left or right can be predicted or classified using an artificial neural network.
Typically we are dealing here with complex relationships, something for which it is
difficult to use a structured or functional form, or where the actual relationship is
unknown; there an artificial neural network can be used. Applications of artificial neural
networks are also seen in areas where the main behaviour, phenomenon or relationship
being modelled is based on how humans think and therefore behave.
For example, driving: driving is essentially how a driver thinks about moving their car in
a congested traffic environment. Similarly, how people plan and execute credit card
related frauds is also about how the human brain thinks and plans things.
Therefore, those things are difficult to put in a functional form; however, they can be
modelled using a neural network, because a neural network also mimics the properties of
the human brain. As discussed, these models can capture complex relationships, so they
are flexible, data driven models.
We are not required to specify the form of the relationship. In linear regression we assume
that the relationship between the outcome variable and the set of predictors is linear, so
that is the assumption, and in logistic regression we use the logistic function as the form
of the relationship.
In the case of neural networks we are not required to specify any such form, linear,
logistic or otherwise; so it is an especially useful technique when the functional form of
the relationship is complicated or unknown.
Another important aspect of neural networks is that linear and logistic regression can be
conceptualized as special cases of a neural network, which we will see later in the lecture.
Now, let us move to the next point, neural network architectures.
There are various neural network architectures in use; however, the most popular one,
the one typically used in data mining, is the multilayer feedforward network. This is the
architecture that we will be using for our discussions and modelling exercises.
So, let us understand the architecture of multilayer feedforward networks. They are fully
connected networks comprising multiple layers of nodes; there is a one-way flow and no
cycles, and we will see this through a diagram.
Typically we have 3 types of layers in a multilayer feedforward network: the first layer is
the input layer, then we have a series of hidden layers, and finally we have the output
layer.
(Refer Slide Time: 07:21)
A typical neural network diagram is something like this: first we have the input layer. All
these circles are nodes; the whole diagram is called a network, specifically a neural
network, and the circles represent nodes. We have 3 types of layers.
As discussed, in the input layer we can see some nodes, and these nodes are connected to
the next layer of nodes, which is called a hidden layer; we might have one or more hidden
layers. The input layer nodes are connected to the hidden layer nodes in a one-directional
sense, and the last hidden layer nodes are finally connected to the nodes of the output
layer.
You can see one node in this particular output layer; so this is a typical neural network
diagram, specifically for the multilayer feedforward architecture. Let us get back to our
discussion. As I said, in the diagram we saw that all the nodes were fully connected,
comprising multiple layers of nodes: the input layer, a series of hidden layers and the
output layer, and we saw that there was a one-way flow with no cycles, no feedback loops.
The input layer is the first layer of the network. Typically, corresponding to each predictor
that is part of the modelling exercise we have a node in the input layer; so if we have p
predictors, typically we have p nodes in the input layer. Then we have the hidden layers.
There can be any number of hidden layers; however, typically a single hidden layer is
sufficient to model even highly complex relationships.
These hidden layers sit between the input and output layers, and finally we have the
output layer, which is the last layer of the network, with nodes corresponding to the
outcome variable of interest. Depending on the outcome variable we might have one or
more nodes there.
For a prediction task we typically have one node; if the outcome variable is categorical and
binary we might still have just one node, but if the categorical variable has m classes, then
we might have m nodes. So, in the output layer there is typically a single node for
prediction and for a binary categorical variable, and m nodes for a categorical variable
with m classes; that is the last layer.
So, this is the structure or architecture of multilayer feedforward networks. Let us
understand a few more details about these networks. As we saw in the diagram, nodes
receive a feed from the previous layer and forward it to the next layer after applying a
particular function.
Let us go back to the diagram. The first layer is the input layer, so there is no previous
layer in this case; the nodes just receive the input values, which are then fed to the next
layer. When we talk about the first hidden layer, you can see that its nodes receive the
feeds from the previous layer's nodes.
So, nodes receive a feed from the previous layer and forward it to the next layer after
applying a particular function. The first hidden layer nodes receive their feeds through the
connecting arrows: every node in the input layer is connected to every node in the next
layer, that is, the first hidden layer.
You can see 3 arrows emanating from this particular input node, connecting to all 3 nodes
of the first hidden layer; similarly for the second input node, again 3 arrows connecting to
the 3 hidden layer nodes, and similarly for the next node and for the last input node.
Similarly, between the first hidden layer and the second hidden layer, from each first
hidden layer node you can see 3 arrows connecting to the 3 nodes in the next, and last,
hidden layer. And from all the nodes in the last hidden layer we can see arrows connecting
to the single output node in the output layer.
So, for example, the first hidden layer receives its feed from the previous layer's nodes,
that is, the input layer nodes, and then a certain function is applied on that feed. The
output is then forwarded to the next layer's nodes, for which it becomes the input. Those
nodes again apply a certain function to the feed they receive from the previous layer,
compute the output, and forward it to the next layer, in this case the output layer.
In this fashion the feed is computed from the input values and the output is produced. The
next point is that the function used to map input values, the received feed, to output
values, the forwarded feed, at a node is typically different for each type of layer. The
function applied on the input values, that is, the predictor values, in the input layer is
different from the functions used in the hidden and output layers, and the function applied
on the received feed in a hidden layer is different from the other two types of layers. Each
type of layer has its own function, which is typically called the transfer function, as we
will see later on.
A few more details about these networks: each arrow from node i to node j has a value
w ij, the weight, indicating the strength of that connection in the neural network. So, if
node i is connected to node j, then the arrow connecting these two nodes has a weight w ij,
and it represents the weight, the strength, of that particular connection. Each node in the
hidden and output layers also has a bias value, theta j, equivalent to an intercept term.
(Refer Slide Time: 15:00)
So, these nodes, the first hidden layer nodes, the second hidden layer nodes and the output
layer node, will each have a bias value, and it is equivalent to an intercept.
You can think of the scenario in this fashion: the connected arrows bring the feed to a
particular node, and the level of the output that is going to be produced is controlled by
the bias value, which is equivalent to the intercept term in linear regression. In this
fashion these values are controlled, and the feed is forwarded to the next layer.
The next important aspect is computing the output values at the nodes of each layer type.
How are these values computed at the nodes of each type of layer? Let us first discuss the
input layer nodes. As I talked about, the number of nodes in the input layer is typically
equal to the number of predictors.
So, if there are p predictors in our modelling exercise, then typically we will have p nodes
in the input layer, one corresponding to each predictor, because all the predictor values
are fed to the input layer nodes and their output then feeds into the first hidden layer
nodes.
The next point is that each input node receives the input value from its corresponding
predictor, and its output is the same as its input, that is, the predictor value. The transfer
function used in the input layer is typically the identity or linear function; that means
whatever input is received, the same value is transferred, fed to the next layer.
So, the function is linear and no transformation is applied to the input values. Now let us
discuss the next type of layer, the hidden layer nodes, and how their values are computed.
For a hidden layer node we take the sum of the bias value and the weighted sum of the
input values received from the previous layer.
You can see here theta j, which is the equivalent of the intercept term; that is the bias
value, and to it we add the weighted sum of the input values received from the previous
layer. Let us go back to the diagram and try to understand this again: this particular node
in the first hidden layer is receiving values from the input layer nodes.
So, if there are p input nodes it will receive p values; each of these connections, each of
these arrows, has its own weight. These weights are multiplied with the values received
from the input nodes, the products are summed up, and then the intercept term, that is,
the bias value, is added to this sum; that is what you see in this particular expression.
You can see that the summation runs from i equal to 1 to p; that means for each predictor
we have a value x i, and that value is fed to the node we just saw in the first hidden layer,
and for that matter to any other node in the hidden layers or output layer.
Each value is multiplied by the weight of the connecting arrow, we take the summation,
and then the bias value is added. As you can see, the bias value in a way controls the level
of the overall value at the hidden layer node, just like an intercept.
So, there is some similarity, as we can see, to the linear equations in linear modelling:
theta j here is the equivalent of beta 0, and the weights multiplied by the predictor values
are the equivalent of beta 1 x 1, beta 2 x 2 and so on in linear regression; this formulation
is equivalent to that.
The function g, as we have been saying, is the function each layer applies on the input
values received at a particular node, and using that function we get the output value that
is forwarded to the next layer.
This function g is referred to as the transfer function and is applied on this sum. So, for a
hidden layer node j we compute theta j plus the sum over i of w ij times x i: all the
predictor values received from the p nodes are multiplied by their respective weights, the
bias value is added, and the resulting value is then passed to the transfer function g; the
result is the output of that hidden layer node.
What could the different options for the transfer function be? The transfer function could
be any monotone function; these are a few examples. The linear function g(x) = bx is the
function used in the input layer, where the input is transferred as-is to the output.
Another function that could be used is the exponential function, g(x) = e to the power bx,
and another is the logistic or sigmoid function, g(x) = 1 divided by (1 plus e to the power
minus bx).
The logistic function is typically the most used transfer function in multilayer feedforward
networks, and in neural networks in general. It is typically used for the hidden layer nodes
and, as we will discuss, for the output layer nodes as well.
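As a minimal illustration of the computation at one hidden (or output) node, a short R sketch with made-up weights and inputs:

# Logistic (sigmoid) transfer function
g <- function(v) 1 / (1 + exp(-v))

# Hypothetical values for one node j receiving p = 2 inputs
x       <- c(0.2, 0.9)        # outputs of the previous layer (here, predictor values)
w_j     <- c(0.05, -0.01)     # weights w ij of the incoming arrows
theta_j <- 0.03               # bias value of node j

output_j <- g(theta_j + sum(w_j * x))   # value forwarded to the next layer
output_j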
A few more points about these hidden layer nodes: the bias values and weights, the theta
j's and w ij's, are typically initialized to small random values in the range of minus 0.05 to
plus 0.05.
(Refer Slide Time: 22:40)
Typically, for the first iteration of the neural network training process that we will
discuss, these values are initialized in this range, and then the network learns through
successive iterations. As you can see in the next point, the network updates these values.
First they are randomly initialized with small values, and in subsequent iterations the
network updates them after learning from the data during each iteration or round of
training. The theta j's, the bias values for each hidden layer and output layer node, and the
w ij's, the weights of each connection from the input layer nodes to the hidden layer nodes
and from the hidden layer nodes to the output layer node, are initialized for the first
iteration and then continuously updated after each iteration, after each pass through the
network. So, for each observation in the training partition, or the sample being used, these
weights and bias values are updated. Let us move forward.
Now let us talk about the next type of layer, the output layer. For the output layer nodes,
as the first point says, the steps are the same as we just discussed for the hidden layer
nodes, except for the fact that the input values are received from the last hidden layer.
When we talked about the hidden layers, the values were received from the previous
layer, whether that was another hidden layer or the input layer, and then, using that
expression, the weighted value was computed and used in the transfer function. A similar
process is used for the output layer nodes as well.
However, the values received come from the last hidden layer; using the same expression
that we used for the hidden layer nodes, the weighted value is computed, passed to the
transfer function, and the output value is produced.
Typically the same transfer function that is used in the hidden layer nodes is used in the
output layer nodes as well. As I said, the most used transfer function is the logistic
function, but that is mainly for the hidden layer nodes and output layer nodes; the linear
or identity function is used for the input layer nodes.
The output values produced by the output layer nodes are used as the predictions, because
that is the final, last layer of the network. The output produced by these output layer
nodes is used as the prediction in a prediction task, and as scores, for example probability
values, that are used to classify records in a classification task. This is how the
computations are performed in a typical neural network architecture.
Now, about the neural network training process: the computation steps that we have
discussed, whether for the input layer, the hidden layers or the output layer, are repeated
for all the records in the training partition. Whenever these steps are repeated for an
observation, that is called one iteration.
So, each observation goes through the neural network, all those computations in the input
layer, then the hidden layers and then the output layer are performed, and this process
continues for each record. For each observation we therefore have a predicted value and
hence a prediction error.
These errors are later used for learning: as we will discuss in the coming lecture, the error
for each observation is used to update the neural network, to update the weights w ij and
the bias values theta j. So, each observation, and the computations performed in the
neural network during that iteration, become part of the learning process.
We will discuss more about this process through an example in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 54
Artificial Neural Network-Part II
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the
previous lecture we started our discussion on artificial neural networks, and we will
continue that. Till now we have discussed the neural network architecture, the
background, the multilayer feedforward network, and specific details about the
computations that are involved in the input layer, hidden layers and output layer.
Now we will go through some of those computations, the computations involved in the
different layers, using an example, an exercise in R. So, let us open R Studio; this
particular example is a hypothetical one.
The example is about a particular cheese composition. For a particular cheese we have
two scores, a fat score and a salt score. Each fat and salt combination has been tested by
experts, and they indicate whether that particular combination is accepted or not.
So, we have a fat score for every such experiment, the salt score, and whether it was
accepted or not. You can see we have just 6 observations, 6 observations in each of these
vectors: fat score, salt score and acceptance. This small example is what we are going to
use to understand certain computations that we discussed in the previous lecture.
Let us create the first variable, fat score; you can see fat score in the environment section
with 6 observations, a numeric vector. Then let us create the salt score, again 6
observations, a numeric vector, and then the acceptance.
You can see the acceptance has 6 values: 1, 0, 0, 0, 1, 1. So, corresponding to each
combination of fat and salt we have a value for acceptance, whether that combination was
accepted or rejected.
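A sketch of this setup; the fat and salt score values are not stated in the transcript, so the numbers below are placeholders, and only the acceptance vector is taken from the lecture:

fat_score  <- c(0.2, 0.1, 0.2, 0.2, 0.4, 0.3)   # hypothetical values
salt_score <- c(0.9, 0.1, 0.4, 0.5, 0.5, 0.8)   # hypothetical values
acceptance <- c(1, 0, 0, 0, 1, 1)               # from the lecture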
The specific example that we are going to follow is based on this neural network structure:
the input layer will have two nodes, nodes 1 and 2, because we have two predictors here,
fat score and salt score. There will be just one hidden layer with three nodes, that is, one
more than the number of predictors, p plus 1 nodes in the hidden layer; they are going to
be denoted 3, 4 and 5. Then we have an output layer with one node, because we have a
binary variable here, acceptance, which is either 1 or 0; that node is denoted node number
6.
So, this is the neural network architecture that we have decided on for this small exercise;
let us draw it.
Nodes 1 and 2: the values for node 1 come from the fat score predictor, as we saw in our R
environment, and for node 2 from the salt score; so this is the input layer. As we discussed
in the previous lecture, all the input layer nodes provide input values to the next layer,
that is, the first hidden layer. We have two predictors, so the value of p is 2.
Typically the number of nodes in the input layer equals the number of predictors, as we
discussed, and around the same number of nodes is used in the hidden layer. Here we have
just one hidden layer, and the number of nodes that we are taking is p plus 1, that is, three
in this case: nodes number 3, 4 and 5.
As we talked about, from each node in the input layer there will be an arrow providing a
feed to each hidden layer node. Similarly, the second node, corresponding to the second
predictor, salt score, will provide a feed to all three hidden layer nodes. So, you can see two
arrows, because we have two predictors, two arrows arriving at and connecting to every
node in the hidden layer.
Then, as we discussed, we will have one node in the output layer. This is our output layer,
and this is the only hidden layer that we have. All the nodes in the hidden layer are
connected to the single output layer node; they provide their feed to that single output
layer node, which is why we have just one node here. This node corresponds to the
outcome variable, which is a binary variable in this case, acceptance. So, the feeds from all
the hidden layer nodes are forwarded to this single output layer node. This is our neural
network.
We will also have bias values on each of the hidden layer nodes and the output layer node;
these are the bias values, the thetas, and we will have weights for all the connecting
arrows. This is the neural network architecture that we are going to use for this example.
Let us proceed. The first step is initialization. As you can see in the comment, we have to
perform the initialization of the thetas and the w ij's first. This particular character was
actually meant to be theta; however, it is probably not supported here and is being
depicted by some other character. What we are talking about is the initialization of the
bias values, the thetas, and the weights; that is typically the first step.
For the bias values, the function that we are using here is matrix. So, we will compute a matrix of bias values; the first argument is of course the data. We are using the runif function to generate random numbers. We are generating 4 random numbers because we have three nodes in the hidden layer and one node in the output layer, so we will have 4 bias values, and therefore I am generating 4 random numbers here.
And you can see the range of these numbers is minus 0.5 to 0.5. The number of columns is 1, so a single column holds all the bias values. The dimension names are 3, 4, 5, 6, corresponding to node numbers 3, 4, 5 and 6; each of these nodes gets a row, and we will have the 4 corresponding bias values. So, let us compute this.
So, you can see here, corresponding to node numbers 3, 4, 5 and 6, which are represented by the row names 3, 4, 5 and 6, we have randomly initialized bias values which range between minus 0.5 and 0.5.
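As a rough sketch of what this step might look like in R (the object name theta is an assumption, not the lecture's exact script):

    # 4 bias values, one each for hidden nodes 3, 4, 5 and output node 6,
    # drawn uniformly from (-0.5, 0.5) and stored in a one-column matrix
    theta <- matrix(runif(4, min = -0.5, max = 0.5), ncol = 1,
                    dimnames = list(c("3", "4", "5", "6"), NULL))
    theta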
Now, let us move to the next step, that is the weights. First, the weights of the arrows connecting the input layer nodes to the nodes of the single hidden layer that we have. Again, we have two input layer nodes and three hidden layer nodes; that means we will have 2 into 3, that is 6 connecting arrows. Therefore, we have to randomly initialize these 6 weights first, again in the same range. We will have 3 columns corresponding to the three nodes in the hidden layer and two rows corresponding to the two nodes in the input layer.
As you can see in the dimension names, the first vector in this list is for the row names and the second vector is for the column names: 1 and 2 for the two input layer nodes, and then 3, 4 and 5 for the three hidden layer nodes. In this fashion we randomly initialize the weights.
So, let us compute these values; you can see the output here. The row names correspond to the input layer nodes, 1 and 2, and the column names correspond to the hidden layer nodes, 3, 4 and 5, and you can see the randomly initialized values in the range minus 0.5 to 0.5.
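A similar hedged sketch for the input-to-hidden weights (again, the name weights_IH is assumed):

    # 2 input nodes x 3 hidden nodes = 6 connection weights in (-0.5, 0.5)
    weights_IH <- matrix(runif(6, min = -0.5, max = 0.5), nrow = 2, ncol = 3,
                         dimnames = list(c("1", "2"),        # input layer nodes
                                         c("3", "4", "5")))  # hidden layer nodes
    weights_IH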
Then, let us move to the next step, initializing the weight values for the connections between the hidden layer nodes and the output layer node. We have three nodes in the single hidden layer and just one node in the output layer, so we have 3 into 1, that is 3 connecting arrows, and therefore 3 weights to initialize. You can see in this matrix function that the 3 values will be in the same range, with one column, because we have just one node in the output layer.
You can see in the dimension names that the row names correspond to the hidden layer nodes, 3, 4 and 5, and the column name corresponds to the output layer node, which is 6 here.
So, let us compute this particular matrix. You can see in the output that the row names 3, 4 and 5 correspond to the hidden layer nodes and the column name 6 corresponds to the single output layer node that we have, and these are the randomly initialized values.
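And a sketch for the hidden-to-output weights (name assumed):

    # 3 hidden nodes x 1 output node = 3 connection weights
    weights_HO <- matrix(runif(3, min = -0.5, max = 0.5), nrow = 3, ncol = 1,
                         dimnames = list(c("3", "4", "5"),   # hidden layer nodes
                                         "6"))               # output layer node
    weights_HO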
Once the random initialization has happened for all these w's and thetas, then for a particular observation we can start certain computations. So, let us look at the first record that we have; you can have a look in the environment section as well. The first value of fat score is 0.2 and the first value of salt score is 0.9. So, for the first record we have fat score as 0.2 and salt score as 0.9.
Now, we will do certain computations. This variable, output, is where we are going to store the output values. Any output value produced after applying the transfer function is going to be stored in this particular variable. So, let us initialize this variable.
Now, because this is the first observation, k is 1; k is simply used to access the vectors of fat score and salt score, since we are taking the first value here. So, let us initialize this. Now the loop: it runs over one less than the number of bias values, that is, specifically over the three hidden layer nodes, as you can see. This is how we can compute this. And now you can see the expression, the bias expression that we saw in the slide. So, let us again have a look at that expression.
So, you can see this is what we want to compute: a weighted value. You can see the theta value, the bias value, plus a summation over all the predictors, that is the weighted sum of all the predictor values.
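For reference, the expression being described can be written as follows (a reconstruction, not the slide's exact notation): for hidden node j with p predictors x_1, ..., x_p,

    \mathrm{output}_j \;=\; g\!\Big(\theta_j + \sum_{i=1}^{p} w_{ij}\, x_i\Big),
    \qquad g(s) \;=\; \frac{1}{1 + e^{-s}}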
So, we can see that the bias value is being accessed, the bias value that we had initialized here, starting from the first value. Then we can see the weights that we have initialized, the matrix of weights between the input layer and the hidden layer; we access the first row, and in the first iteration the loop index is 1, so that is the first column, then the second column, and so on.
We can see it in the output itself: if you go above to the bias values and then the weights IH, the first column corresponds to node number 3, the second column to node number 4, and the third column to node number 5. So, the column number changes across iterations; however, we are dealing with the same row.
You can see row number 1 here and, for the next one, row number 2. So, for the first arrow we take its weight times fat score, and for the second arrow its weight, from the same column, times salt score, and this is what gets computed. This expression is essentially being computed using this particular code, and then for the output values you can see the logistic function here, 1 divided by 1 plus exponential of minus x. So, we compute the logistic function value, and this is done for all the hidden layer nodes. So, let us run this loop.
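A hedged sketch of such a loop, continuing the objects from the initialization sketches above (only the first fat and salt values, 0.2 and 0.9, come from the lecture; the remaining data values are placeholders):

    fat_score  <- c(0.2, 0.1, 0.2, 0.2, 0.4, 0.3)   # placeholder values beyond record 1
    salt_score <- c(0.9, 0.1, 0.4, 0.5, 0.5, 0.8)
    output <- numeric(4)
    k <- 1
    x <- c(fat_score[k], salt_score[k])        # inputs from the two input-layer nodes
    for (j in 1:3) {                           # hidden-layer nodes 3, 4 and 5
      s <- theta[j, 1] + sum(weights_IH[, j] * x)
      output[j] <- 1 / (1 + exp(-s))           # logistic transfer function
    }
    output[1:3]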
Now, once this is computed, we have one more output value to compute, corresponding to the output layer node. So, let us increment the i counter, and let us also initialize j. You can see this particular code is for the output layer node: the bias value here is accessed using the i counter that we have already incremented.
Therefore, the last bias value, the one corresponding to the output layer node, is used here; then the weights between the hidden and output layer. The corresponding weights are accessed from that matrix, and then the same expression that we have here is evaluated. So, let us compute this.
Now, again we are using the logistic function to compute the value for the single output layer node.
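Continuing the same sketch, the output-layer computation might look like this: the fourth bias value and the hidden-to-output weights feed the logistic function.

    # the hidden-layer outputs become the inputs of the single output node (node 6)
    s_out     <- theta[4, 1] + sum(weights_HO[, 1] * output[1:3])
    output[4] <- 1 / (1 + exp(-s_out))
    output[4]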
So, now all the output values have been computed. As you can also see, when we computed the output value for the single output layer node, the input values were the outputs of the hidden layer; you can see output 1, output 2 and output 3. In the loop, you can see that the input values were coming from the input layer nodes, fat score here and salt score here.
These were the two nodes in the input layer and these values were being used. When we look at the output layer node, you can see that the output values of the hidden layer become the input values for the output layer; output 1, output 2 and output 3 can be clearly seen here. That is how we compute the output for the output layer node. Once we have that value, let us look at it: output 4, the value that has been computed, comes out to be 0.4918236.
Now, as we know, the acceptance variable, the outcome variable that we have, is essentially a binary variable, 0 or 1, indicating whether that particular combination was accepted or not. So, this score can be compared with a cut-off; we can take 0.5 as the cut-off score and use it to classify this observation, and this is the code that we are using.

So, the ifelse: if this output value is greater than 0.5, then the record is classified as 1, that means accepted, otherwise 0. We can compute this value; it comes out to be 0, because the score 0.49 is quite close to 0.5 but less than that. So, this record has been classified as 0; however, it is quite close, and if you look at the actual acceptance value, it was 1.
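A sketch of this classification step, continuing the objects above:

    predicted_class <- ifelse(output[4] > 0.5, 1, 0)
    predicted_class   # 0 in the lecture's run, since the score 0.4918 is below the 0.5 cut-off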
(Refer Slide Time: 19:58)
So, the value that was computed here is 0.49, and the record was still not classified as class 1. Therefore, what we can understand is that more iterations are to be performed; what we have done for this neural network was just the first iteration.
Therefore, more iterations have to be performed for us to arrive at the final model, which will have higher classification accuracy. This was just the first iteration. With more iterations the network will learn more from the data, and then probably the performance will improve and the predicted results will start matching the actual values.
So, let us go back to our discussion. Whatever computations we talked about in the previous lecture, including this particular expression, we have now seen how they are performed in the R environment, in the RStudio environment.
Now, let us talk about the next important point. We mentioned in the previous lecture that linear and logistic regression can be treated as special cases of a neural network; how can that happen? Let us discuss. Consider a neural network with a single output node and no hidden layers: that would actually approximate the linear and logistic regression models.
As we talked about, the expression that we use for weighting the inputs, theta plus the summation of w times the predictor values x, is quite similar to what we have in linear regression: beta 0 plus the summation of beta times x.
Therefore, in that sense, we can approximate linear regression as a special case of a neural network. What we will do is have zero hidden layers and the single output node that we already have. So, if we want to convert the same diagram into a linear regression, this is what we can do.
So, if we remove the hidden layer, the input layer nodes are directly connected to the output layer node, and the arrows through the hidden layer go away; these arrows will also change.
So, there will be a feed to the output node from this node and a feed from that node. Let us remove some of these things. This is what we will have: this is theta, and these are going to be the weights. We have two predictors and one output variable, and if you look at it now, there is a clear similarity. If the transfer function g is the identity, then the input values received are passed through unchanged, and therefore this is what we will have.
So, this will approximate linear regression, as we talked about. As we can see in the slide as well, if a linear transfer function g(x) = x is used, that means the input values we receive from the predictors are fed directly to the output layer node and there are no hidden layer nodes, then the formulation becomes equivalent to that of linear regression; you can see these formulations.
So, this formulation becomes equivalent to the formulation that we have in linear regression. However, the estimation method is different: the betas in linear regression are typically estimated using least squares, whereas in a neural network we apply the back propagation algorithm to estimate the thetas and w's. So, theta and w values are estimated using back propagation in a neural network, while in linear regression the betas are estimated using least squares. The estimation method is different; otherwise, we can approximate linear regression using a neural network. Therefore, we can say that, in a way, linear regression is a special case of a neural network.
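In symbols (a reconstruction of the argument, not the slide's notation): with no hidden layer and an identity transfer function g(s) = s, the output-node formulation

    \hat{y} \;=\; g\!\Big(\theta + \sum_{i=1}^{p} w_i x_i\Big) \;=\; \theta + \sum_{i=1}^{p} w_i x_i

matches the linear regression equation \hat{y} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i up to the names of the parameters; only the estimation method differs.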
A similar conceptualization we can do for logistic regression as well. Suppose the transfer function is the logistic function; then a neural network with zero hidden layers would again approximate the logistic regression equation. As in the slide, we model the probability of a particular record belonging to class 1; in logistic regression also we use this logistic response function, and we will have an expression like this here.
So, this will actually approximate logistic regression. The formulation is equivalent to the logistic regression equation; however, just like with linear regression, the estimation method is different: in logistic regression the maximum likelihood method is typically used, while in a neural network, as I talked about, back propagation is typically used. So, linear and logistic regression can both be conceptualized as special cases of a neural network.
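Similarly, in symbols (again a reconstruction): with no hidden layer and a logistic transfer function, the output of the single node is

    P(Y = 1 \mid x_1, \dots, x_p) \;=\; \frac{1}{1 + e^{-\left(\theta + \sum_{i=1}^{p} w_i x_i\right)}}

which has the same form as the logistic response function, with beta_0 and beta_i in place of theta and w_i.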
Let us move forward. Some more important points with respect to artificial neural networks; one is normalization.
Typically, the number of iterations performed in a neural network depends on the number of observations, the learning rate and other things that we will discuss later on. Depending on that, the number of computations, the computational intensity of an artificial neural network, can be quite high. Therefore, to achieve convergence of the neural network and also to get better performance, it is generally recommended that all the variables should be on a 0 to 1 scale.
So, we would like to have all the predictors on this 0 to 1 scale, and for that we are required to perform normalization so that all the variables are on that scale. For numeric variables, as you can see in the slide, the normalized variable V_norm can be computed as (V minus min(V)) divided by (max(V) minus min(V)). This gives us a normalized variable whose values lie in the range 0 to 1. So, this is for numeric variables.
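A minimal sketch of this normalization in R (the function name is an assumption):

    normalize01 <- function(v) (v - min(v)) / (max(v) - min(v))
    normalize01(c(2, 5, 8, 11))   # 0.0000 0.3333 0.6667 1.0000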
If we talk about binary variables, categorical variables with two classes, typically there is not much that we need to do: if we create dummy variables, they will anyway take only the values 0 and 1 for all observations. So, binary variables will work just fine irrespective of whether they are ordinal or nominal; they are already within this range.
For nominal variables with more than two classes, we can create m minus 1 dummy variables, and because these are dummy variables they will again take the values 0 and 1. So, that is also fine.
Now, when we talk about ordinal variables with m classes, where m is greater than 2, we have to think about what can be done. Typically, the values can be mapped to this particular set of values: 0, 1/(m-1), 2/(m-1), and so on up to (m-2)/(m-1), and 1. So, this is how we map the values.
So, let us say we have an ordinal variable with 4 classes. As discussed in the slide, we would like to map the values in this fashion; for four classes the values would be 0, 1/3, 2/3 and 1.
So, these would be the four values for the ordinal classes; the values present in the variable have to be mapped to these four values, and we keep the variable type as ordinal with these values. The scale will again be 0 to 1, and since the variable is anyway ordinal, the values will be in this range and it can be used here.
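A small sketch of this mapping for m = 4 ordered classes (the codes below are hypothetical):

    m       <- 4
    ordinal <- c(1, 3, 2, 4, 4, 1)     # hypothetical ordinal codes 1..4
    (ordinal - 1) / (m - 1)            # 0, 2/3, 1/3, 1, 1, 0 -- all in [0, 1]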
So, these transformations can be performed and are actually recommended for neural networks, either to achieve convergence of the network, convergence of the model, or to improve the performance. There are a few more considerations and discussion points about artificial neural networks that we will discuss in the next lecture. So, we will stop here.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 55
Artificial Neural Network-Part III
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous few lectures, we have been discussing artificial neural networks. We have covered the background, we did a small exercise, we understood the architecture and the different layers, and we have also gone through a few more details related to the computations that we are required to perform in the input, hidden and output layers.
We also understood the expressions and the computations, the transfer function, the bias values and the connection weights; all that discussion we have completed.
We also discussed how linear regression and logistic regression can be considered as special cases of a neural network.
(Refer Slide Time: 01:13)
Now, let us move forward. In the previous lecture we also talked about normalization. We said that a 0 to 1 scale is typically recommended for neural network models for performance purposes; neural network models converge quite quickly when the 0 to 1 scale is used.
And the performance is also improved. How can this kind of normalization be achieved? We discussed this particular formula for V_norm: to bring any variable to the 0 to 1 scale, in the numerator we subtract the minimum value of that variable from its value, and then divide by the difference between the maximum and minimum values of that variable.
That will give us values on the 0 to 1 scale. We also discussed that for binary variables we are not required to do much, because they would already be 0 or 1, since we would typically use dummy variables; they anyway take the two values 0 and 1. Nominal variables also pose no such problem; we will have m minus 1 dummy variables.
So, they will also have the value 0 or 1, which is in the same range. For ordinal variables we talked about this particular mapping: 0, 1/(m-1), 2/(m-1), and so on up to (m-2)/(m-1), and then 1. So, this mapping can be used for ordinal variables.
Now, there are a few other transformations that can help. For example, if we have a right-skewed variable, then a log transform can be applied, and that can help our neural network modelling. As you can see in the first point, a transformation which spreads the values more symmetrically can be done for performance purposes.
So, just like the 0 to 1 normalization, a log transform can be applied if we have a right-skewed variable, so that the values are more evenly spread. Those kinds of transformations can be done to improve the performance of our neural network model.
Let us discuss the estimation method. We are familiar with the linear regression and logistic regression models, and we know that least squares and maximum likelihood are used as the estimation techniques in these methods: least squares is used in linear regression and the maximum likelihood method is used in logistic regression. As we discussed while covering these techniques, a global metric of error, for example SSE, the sum of squared errors, is typically used to estimate the parameters.
These techniques, least squares and maximum likelihood, use such a global metric, SSE or some other global metric, in their estimation procedure to estimate the parameters, the betas, because we model the outcome as a linear function of the predictors; those beta values are estimated using these methods, and typically a global metric of this kind, like SSE, is used there.
However, in a neural network we do not use any such global metric to compute the parameters. So, let us understand what estimation method is used in a neural network.
Neural networks use the error value of each observation to update the parameters in an iterative fashion. It is not that we use a global metric and optimize the error computed from that global metric.
That kind of minimization of a global error typically happens in linear and logistic regression, as we discussed. In the case of a neural network, for each observation the error value, computed using the typical formula actual value minus predicted value, is used to update the parameters in an iterative fashion. So, after each observation we update the parameters; this particular step is referred to as learning.
When we say, especially for data-driven techniques and models, for example classification and regression trees or neural networks, that these models learn from the data, in the particular case of an artificial neural network the learning is based on the error values. After each observation, the learning comes from the error value for that particular observation; this happens in an iterative fashion and all the observations are used in this learning process.
As you can see in a few more points in the slide, the error for the output node, that is the prediction error, is typically distributed across all the hidden layer nodes; that is what happens in a neural network. All hidden layer nodes share responsibility for part of the error. So, all these nodes, the hidden layer nodes and the output layer node, are part of the learning process, and the learning happens through error values after each observation. The part of the error that is shared by a hidden layer node is called its node-specific error.
Now, these node-specific errors are used to update the connection weights and bias values. We have connections between the input layer nodes and the hidden layer nodes, and similarly between the hidden layer nodes and the output layer node. All those connection weights, and the bias values for the hidden layer nodes and the output layer node, are updated using these node-specific errors. The actual algorithm used to perform this process is back propagation, so we can say that the estimation method used in a neural network is the back propagation algorithm.
This particular algorithm is used to update the weights and bias values of a neural network.
Error values are computed from the output layer back to the hidden layers; that is why it is called back propagation. We can understand the same thing through a diagram; for example, in the previous lecture we had used this particular neural network.
(Refer Slide Time: 09:04)
So, we had two nodes in the input layer, three nodes in the hidden layer, and one node in the output layer. The error values are computed at the output node and then propagated back to the hidden layer nodes; that is why this algorithm is called back propagation: error values are computed from the output layer back to the hidden layer nodes.
The next point, as you can see in the slide, is that all hidden layer and output layer nodes and all connection weights become part of the learning process, since the error value computed at the output node is propagated back to the previous layers of the network.
All the connections, the corresponding weights and the bias values are going to be part of the learning process. Since all the weights and bias values are updated using this error in a back propagation manner, all these parameters, all these nodes and connections, are part of the learning process.
So, how do we compute these errors? Let us talk about how the node-specific error is computed; first we will discuss the output node. In this case we have one output node. If we consider this particular network diagram, then for this output node the error is computed as follows: we have a correction factor that is multiplied by the prediction error, which is the actual value minus the predicted value. The predicted value, the score, is the output value that comes from the output node, as we talked about in the previous lecture.
Then we take the actual value and the difference, actual value minus predicted value; this is the typical definition of the prediction error, the typical definition for any error computation. This prediction error is multiplied by a correction factor and we get a new value; this value is assigned to the output layer node and is then used to update the parameters related to this node.
As we can see in the next equation, the new bias value theta_new is theta_old plus the learning rate times the error. The error value computed in the previous step is multiplied by the learning rate and then added to the old bias value; that gives us the new bias value to be used in the network. Any new observation that is run through the network will use this new bias value. So, what is the learning rate?
The learning rate controls the rate of change from the previous iteration: what portion of the computed error value that we just saw is actually used for the learning process is controlled by the learning rate. Typically, the learning rate is a constant value in the range 0 to 1. It determines how much of the error is used for the learning, the updating, process.
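A hedged sketch of one case-updating step for the output node, following the formulas just described; every number here is a hypothetical placeholder, not a value from the lecture:

    learning_rate     <- 0.1
    actual            <- 1        # actual acceptance for the record
    predicted         <- 0.49     # score produced by the output node
    correction_factor <- 0.25     # node-specific correction factor (assumed value)
    err_out   <- correction_factor * (actual - predicted)   # node-specific error
    theta_old <- 0.20                                       # current bias on the output node
    theta_new <- theta_old + learning_rate * err_out
    theta_new                                               # 0.21275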
So, these are the steps: from the prediction error that we get for a particular observation we compute the node-specific error. As I talked about, the network learns from every observation. And in the previous lecture, through an example in R, we understood that all the connection weights and bias values are initialized to random numbers. Once the first observation is run through this network, these randomly initialized weight values and bias values are used to produce the predicted value.
Once the predicted value, the score, for observation number 1 is computed, it can be subtracted from the actual value and we get the prediction error; from there we can compute the node-specific error, and as given in the slide we use the learning rate to control the amount of change and obtain the new values for the thetas and the weights w, which are then updated.
So, the bias value for the output node and the weights can be updated. This was for the output node; the steps are quite similar for the hidden layer nodes as well. At this point I would like to point out that this was for the first observation. Once the connection weights and bias values have been updated for the whole network, the second observation will pass through these updated values.
Again we will obtain the predicted value, the same process will continue, and again the updating and learning will happen. In this fashion observations 2, 3, 4 and so on keep running through the network and the network keeps on learning. From the prediction error and the correction factor we get the node-specific error, and the update is done as we talked about: theta_new equals theta_old plus the learning rate times this error value.
Similarly, weight_new equals weight_old plus the learning rate times the error value; in this fashion the new values are updated. We talked about how these values are computed for the output node. Similar steps are performed to compute the node-specific error for the hidden nodes, using the error value that we have computed for the output node.
This output-node error is used by the hidden layer nodes to perform steps similar to what we have here: in place of the prediction error we now use this error value for the hidden nodes, and the other steps are the same. We use a correction factor, and that correction factor is then applied for each of these nodes.
The correction factor is specific to the node: whether it is the output node or a hidden layer node, the correction factor is based on that node's output value, and therefore it is going to be different for each of these nodes.
For the output node the correction factor is multiplied by the prediction error, and for a hidden layer node it is multiplied by the error value computed for the output node. In this fashion all the weight and bias values are updated.
Now, let us talk about a few more things about updating the weight and bias values. There are two main approaches for this updating. What we have discussed till now is actually case updating. What is case updating? The updating is done after each case or record is run through the network, which is referred to as a trial. This is what we have discussed: after each observation the bias and weight values are updated, and this happens for all the observations in the data set.
When all the records have been run through the network, it is referred to as one epoch or one sweep through the data; if we are using the training partition, then it is one sweep through the training partition. In the training or learning process of the network we might have to run many such epochs. As you can see in the last point here, in case updating many epochs could be used to train the network.
The other approach for updating weight and bias values is called batch updating. This is different from case updating. What happens in batch updating? The updating is done after all the records have been run through the network. Till now, what we have been discussing is this: first we decide the network architecture, then we randomly initialize the connection weights and bias values, then we run the first observation through the network, then the second, then the third, and when all the observations have been run that is one epoch; we might have to execute many such epochs. That is what we have been discussing under case updating.
When we talk about batch updating, notice that in case updating these computations are done immediately after one observation has been run, and the bias and weight values are updated; however, in batch updating all the observations are first run through the network, and only then are these computations performed. So, what is the difference in these computations?
In place of the prediction error, which was specific to each observation in case updating, we use the sum of the prediction errors over all records. This sum is used, and accordingly the bias values and weight values are updated. Again, in batch updating also many epochs may have to be run to train the network.
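A contrasting sketch for batch updating (all numbers hypothetical): the update uses the sum of the prediction errors over the whole epoch rather than each record's own error.

    learning_rate     <- 0.1
    correction_factor <- 0.25
    theta_old         <- 0.20
    pred_errors <- c(0.51, -0.22, 0.10, -0.35, 0.48, -0.05)   # one per record (hypothetical)
    theta_old + learning_rate * correction_factor * sum(pred_errors)   # one update per epoch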
(Refer Slide Time: 22:12)
Now, in terms of performance and modeling, what are the advantages and disadvantages of case updating and batch updating? One important difference you can understand from the process itself: case updating is more rigorous. In case updating, every time an observation is run through the network, the updating of these values, the back propagation, takes place, and that happens for each record. However, in batch updating this happens once for all the observations in the data set.
So, case updating is more rigorous and therefore requires more runtime as well. The accuracy of the model tends to be better with case updating; however, it comes at the cost of longer runtime.
So, when should we stop? We talked about the fact that to train the network, to learn from the data, we have to execute many epochs, many sweeps through the data. So, when do we stop? What are some of the key stopping criteria for this learning process?
A few points are discussed here. The first is a small incremental change in bias and weight values from the previous iteration. The change in the bias and weight values is the learning rate times the error, the second component that is added to the previous value; if this becomes very small, there is no significant change.
When we run our first observation, this component would be quite significant; as we go through other observations this value might decrease, and as one epoch is completed and we go through a second epoch there would still be some significant component here. However, as we do more computations of this kind, the magnitude of the value being added or subtracted might decrease until only a very small incremental change is happening.
That is probably an indication that the model or network has saturated and therefore we should stop the learning process. This is what is mentioned here: a small incremental change in bias and weight values might indicate that we should stop the learning process.
Another stopping criterion could be the rate of change of the error function value reaching a required threshold. For the overall performance of the model, some error function, for example SSE, could be used. So, for the different sweeps that we execute, we can check the change in the model error and see when it reaches the required threshold.
If the rate of change has reached the required threshold, that means the rate of change in terms of model performance is not much. This error function is with respect to the model performance; it could be SSE or some other metric, and if the rate of change is not much, or has reached a specified threshold value, then probably we should stop the learning process.
This second point is quite similar to what we discussed for previous techniques as well. For example, in the CART algorithm we also talked about the misclassification rate on the training partition and on the validation partition, and the point where that value is minimized is probably where we should stop the tree growth. So, this second point is quite similar to that.
There could be another criterion, that the limit on the number of runs is reached. We can specify a maximum number of runs. This is probably the last resort: if the network is not able to converge, then we would like to set a limit at which this training or learning process is stopped.
Typically this is the last resort, when we are not able to stop the learning or updating process through the first or second criterion, whichever is being used; too many runs have been done and still no convergence has taken place, so we should stop at some point, and we can do that by specifying a limit on the number of runs. After this discussion, let us understand some of these concepts through a modeling exercise in R.
In the previous lecture, we had used this particular example with fat score and salt score. This was an experiment with cheese samples, looking at whether each combination of fat and salt score is accepted or rejected by the experts. We had a few hypothetical data values as well, and those were used to understand the computations that we do in a neural network. We are going to use this same example to train our neural network model.
So, let us execute this code. We will have fat score; as you can see, just 6 values are there.
The sample size is going to be quite small, just 6 values; however, we are going through this exercise for illustration purposes. So, let us look at the salt score.
This is the second variable, then acceptance, and then some of the computations that you see we have gone through in the previous lecture. So, let us skip through; I will stop here. This is the package that we require: neuralnet.
This is the package that we are going to use for our neural network modeling exercise. Let us load this particular library, neuralnet; it is probably not installed, so let us install this package.
There are many packages available in R that can be used for neural network modeling.
neuralnet is one of the most used packages, which is why we are using this one. This is applicable for other techniques as well: there could be more than one package available for the implementation and modeling of a particular technique.
Typically, what we have been using are the most popular, most used packages and functions for the modeling.
So, let us load this library. Once this is loaded, we will create a data frame of these three variables: fat score, salt score and acceptance. Let us create this data frame.
Let us look at the structure of this data frame. You can see fat score has 6 values and all the variables are numeric as of now. This particular problem is actually a classification task: we would like to classify whether a particular cheese sample is acceptable or not based on the fat score and salt score.
However, the package that we are going to use, neuralnet, does not require us to change the variable type; acceptance is a categorical outcome variable, but the package does not allow us to convert it to a factor, so all the computations are done internally within the functions available in this package.
Now, let us talk about the model. The neural network structure, as we talked about, is the first thing that we need to decide, and of course we can do certain experimentation with it. However, for our illustration we will use this particular neural network structure for this example: two nodes in the input layer, three nodes in the single hidden layer, and one node in the output layer, because our output variable is a binary variable.
Now we can use the neuralnet function; as you can see, this function is going to be used to build the model. You would see that linear.output is one argument there: it has to be specified as FALSE if we are building a model for a classification task, and as TRUE if we are building a model for a prediction task.
So, we will stop here and we will continue our discussion on this particular modeling exercise in R in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 56
Artificial Neural Network-Part IV
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lectures we have been discussing artificial neural networks, and in particular, in the previous lecture we started our modeling exercise using this small data set related to cheese samples and their acceptance or rejection based on two predictors, fat score and salt score. We were at this point, trying to build this model. So, let us start again.
The data frame we had already created in the previous lecture. You can see we are going to use the same data frame: 6 observations and 3 variables.
The particular neural network architecture that we are going to use is the same one that we had used in previous lectures; the architecture is this one.
(Refer Slide Time: 01:12)
So, this was the architecture that we had used in previous lectures for this particular example. These are the connections and the bias values; this is the architecture that we are going to use. As I talked about, we can certainly do some experimentation with this particular architecture; we can certainly add more hidden layers.
We can have one more hidden layer here with two nodes; this kind of experimentation we can always perform. However, what has been seen is that typically one hidden layer is sufficient to model even the most complex relationships. So, we typically start with one hidden layer, and the performance is often better for one-hidden-layer networks.
Again, in terms of how many nodes we should use in a particular hidden layer, we can experiment with that as well. We can, for example, add one more node and compare the performance of that network. We can always build different candidate network models and check which one performs better; that kind of experimentation with the network architecture can also be performed. However, for this illustration we are sticking to this particular network: 2 nodes, 3 nodes and 1 node in the input layer, hidden layer and output layer respectively.
Let us discuss this further. The package, as we talked about, is neuralnet, and the function we are going to use to build our neural network model is also neuralnet. The first argument is the formula for the model equation; as you can see, acceptance is the outcome variable and then we have two predictors, fat score and salt score. The data is coming from the data frame df. You can see the next argument is hidden, which specifies the number of nodes in the different hidden layers. We have just 1 hidden layer with 3 nodes, so we have just mentioned 3 here. However, if we have more hidden layers with different numbers of nodes, that can be specified using this vector.
Then we have startweights; this is for the initialization, like we did in our previous exercise with the same data set. We can see we need 13 values: we have 4 bias values, and every node is connected with the nodes of the next layer, so 2 into 3, that is 6, plus 3, giving 9 connections. With the 4 bias values, in total we need to initialize 13 values.
The same thing is mentioned here in startweights: you can see startweights is runif with 13 values, and as we talked about, we typically initialize these values between minus 0.5 and 0.5. These particular values are going to be used for the initialization step. Then we see another argument, rep, which is the number of repetitions, the number of times that we want to train our model.
We just want to run it once. If you want to find more detail about this particular function, you can go to the help section for neuralnet and you will get more details about it.
(Refer Slide Time: 05:31)
You can see all these arguments; for example, specifically we were discussing rep, and you can see it is the number of repetitions for the neural network's training.
(Refer Slide Time: 05:35)
In this case we just want one. If you specify more than one, then in a way you will have multiple candidate models based on the multiple runs. In every run the results might change slightly, as we have been saying, because machine learning algorithms and data-driven models are sensitive to the data. Therefore, the results of every run might change, and from those runs we can always select the best model.
However, in this particular case, for the illustration, we are building our model just once.
(Refer Slide Time: 06:29)
Now, for the algorithm we can select, there are multiple options, as you can see in the help section; this particular argument, algorithm, is here. You can see many options: we have backprop, that is the traditional back propagation algorithm, then we have rprop+, rprop- and other variants. All of these could be used to build our neural network model; however, for our exercise we are using backprop. Certain arguments in this particular function depend on the algorithm that is being used.
For example, in the case of backprop we have to specify one more argument, that is learningrate. We discussed the learning rate, the constant value that is used to control the amount of learning that happens from one iteration to the next. Here it is 0.1.
In the update formulas that we saw in the previous lecture for the thetas and weights, the addition was the learning rate times the error value. So, you can see 10 percent of that error value is being used for learning.
Next we have err.fct; this is set to sse and is used to measure the overall model error. Then we have act.fct; this is the activation function, which is nothing but what we have been calling the transfer function. As we talked about, among the different alternatives the logistic function is the most popular, so that is being used here.
Then, as we talked about in the previous lecture, linear.output is the argument that has to be specified as FALSE if we are building a classification model, and as TRUE if we are building a prediction model. So, let us run this code and we will have our model.
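A hedged sketch of the call being described (the column names and the later five fat/salt data values are assumptions; only the acceptance values and the first record come from the lecture, and the lecture's exact script may differ):

    library(neuralnet)
    df <- data.frame(fat_score  = c(0.2, 0.1, 0.2, 0.2, 0.4, 0.3),   # placeholder values beyond record 1
                     salt_score = c(0.9, 0.1, 0.4, 0.5, 0.5, 0.8),
                     acceptance = c(1, 0, 0, 0, 1, 1))
    mod <- neuralnet(acceptance ~ fat_score + salt_score,
                     data          = df,
                     hidden        = 3,                     # one hidden layer, three nodes
                     startweights  = runif(13, -0.5, 0.5),  # 4 bias values + 9 connection weights
                     rep           = 1,
                     algorithm     = "backprop",
                     learningrate  = 0.1,
                     err.fct       = "sse",
                     act.fct       = "logistic",
                     linear.output = FALSE)                 # FALSE for a classification task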
Now, in the model object many values are returned. One of them is result.matrix, which actually has the bias values, the connection weights and a few more parameters about the model.
Here we are just looking at the first 3 values of this particular matrix. The first value is the error; this is actually the SSE value, because we had used sse here. We have other options also for the model error: as you can see in the help section, sse and ce, where ce is the cross-entropy error and sse is the sum of squared errors. These are the options that could be used.
Then we have the reached threshold, which is 0.0098, about 0.01. We had not specified the threshold; however, there is a default value, and you can see the threshold is 0.01. The threshold reached this level and therefore the training process stopped. The threshold is again based on the error value that we talked about. The third value is the steps that were taken: 6 steps were required. However, as we talked about, if the model is not able to converge, then the training process would be stopped by the limit specified on the number of steps.
However, by default this limit is quite high, so there is a good chance the model will converge. You can see the default value, stepmax, in the help section: it is 1e+05, that is 100,000 steps, 1 lakh steps. That number of steps puts the limit. In this particular case only 6 steps were needed.
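A sketch of how these summary values might be inspected, assuming the model object is called mod as above:

    mod$result.matrix[1:3, ]   # error (SSE), reached threshold, number of steps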
Now, let us move forward. The next thing that we would like to understand is the inter-layer connection weights, which is nothing but the information about these values. From the final model that we have, we would like to see what these values, the thetas and the weights, are.
First we will look at one set of weights and thetas, and then the other. For every inter-layer combination we have a set of weights and bias values: from the input layer to the first hidden layer, and in this case we have just one hidden layer, so from the input layer to this hidden layer we will have some weights and bias values.
Next is from the hidden layer to the output layer; again we have some weights and a bias value. If we had more hidden layers, we would have more such combinations. So, we want to have a look at the inter-layer connection weights and bias values. First we look at the input layer to hidden layer connections. What is returned when the model is built, mod$weights, is actually a list; within this list, the first element is nothing but a matrix storing these weights.
By default these values are stored using the default row and column numbers 1, 2, 3. However, I have changed the dimension names, that is the row names and column names, for this particular matrix so that we are able to understand which value is for which connection, or which bias value is for which node.
This is the code: you can see I am using the dimnames function, which allows us to change the row names and column names of a particular matrix or data frame. In this case a list has to be supplied, where the first element is always the row names and the second element is the column names.
Let us execute this code and you can see the result: the bias value for node 3 is this one, for node 4 this one, and similarly for node 5. Then node 1, which corresponds to the predictor fat: the connection weights from node 1 to nodes 3, 4 and 5 are these. Then node 2, corresponding to the predictor salt, and its connections to nodes 3, 4 and 5; those connection weights we can see here.
So, in this fashion we can access the specific connection weights and the bias values. Similarly, from the hidden layer to the output layer we have 3 weights and 1 bias value, so 4 values in total, and those can be accessed in the same way. Again I am changing the dimension names, this time for the second element of the weights list, and again the list being supplied has the row names as its first element and the column names as its second element. So, let us execute this code.
Now you would see that the row names have changed: the hidden layer bias and the hidden layer nodes have become the row names, and the output layer node has become the column name.
So, you can see the bias value for the output node, and then the connection weights from node 3 to node 6, from node 4 to node 6 and from node 5 to node 6. These are the connection weights and bias values for the model that we have just built.
Now, the final output values, the values that we get from the output node, are also returned by the function in our model object. They can be accessed through the net.result element, that is, mod$net.result, which will give us these score values. In this particular case we had just one output node, so we have just one value per observation here, and because this was a classification task these are classification scores.
These scores can then be compared with a cut-off value. In this particular case we are taking 0.5 as the cut-off, which is equivalent to the most probable class method: when there are only two classes, a 0.5 cut-off actually implements that method. So, we will get the predicted classes from these scores, and if we want to have a look at the scores themselves we can do that as well.
If we run this, you can see it is a list with just one node and 6 observations; these are the scores. If we compare them with 0.5, we will get the predicted classes. Let us unname these dimension names, and then we create a data frame with the predicted class, the actual class, the predicted value, that is, the probability score, and the predictors.
So, this is our data frame with all the relevant information. You can see that all the observations have been predicted as belonging to class 1, and if you look at the predicted values, all of them are more than 0.5.
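A minimal sketch of this scoring step, assuming the model object is mod and the training data frame is df with the outcome acceptance and the predictors fatscore and saltscore (the object and column names are illustrative):

    # Raw output scores from the single output node (first repetition)
    scores <- as.numeric(mod$net.result[[1]])

    # Apply the 0.5 cut-off, which for two classes is the most probable class rule
    predicted_class <- ifelse(scores > 0.5, 1, 0)

    # Put predicted class, actual class, score and predictors side by side
    results <- data.frame(predicted = predicted_class,
                          actual    = df$acceptance,
                          score     = scores,
                          df[, c("fatscore", "saltscore")])
    results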
Because this was a small sample of just 6 observations, you can see that all the observations have been classified as belonging to one particular class, and you can also see the predictor values. Now we can look at the performance; however, it is already quite obvious that 3 cases have been correctly classified.
(Refer Slide Time: 17:45)
So, in this case, as you can see, all the observations have been predicted as class 1. Therefore, we need to make certain changes in the code that we have been using for the other techniques, because there will not be any predictions at level 0.
So, we need to specify the levels explicitly, so that level 0 still appears in the classification matrix that we want to generate here. You can see that I am converting modtrainc into a factor variable and giving it these two levels, so that even if there are no observations predicted as belonging to class 0, that column of the classification matrix will still be displayed; so let us run this.
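A rough sketch of this levels trick, assuming the predicted classes are in predicted_class and the actual classes in df$acceptance (names are illustrative):

    # Force both levels to appear in the classification matrix, even if one class
    # was never predicted
    pred_factor   <- factor(predicted_class, levels = c(0, 1))
    actual_factor <- factor(df$acceptance,   levels = c(0, 1))

    # Classification (confusion) matrix: rows = actual, columns = predicted
    conf_mat <- table(actual_factor, pred_factor)
    conf_mat

    # Classification accuracy and error
    accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
    error    <- 1 - accuracy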
(Refer Slide Time: 18:33)
So, you can see the column I was talking about: there are no observations in it, but it is being displayed because of this change in the code. Had we used modtrainc directly, this column would have gone missing. We can see that 3 observations, the diagonal elements, have been correctly classified and 3 observations have been incorrectly classified; all the observations belonging to class 0 have been incorrectly classified as class 1. So, the classification accuracy and error are also obvious, 0.5 each in this case.
Now we want to have a look at the network diagram using R. For this we use the plot function: we pass the neural network object mod to it and we will get the diagram.
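A minimal sketch, assuming the fitted object is mod:

    # Plot the fitted network: nodes, connection weights, bias nodes, and, at the
    # bottom, the final error (SSE) and the number of training steps
    plot(mod)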
(Refer Slide Time: 19:27)
So, you can see here the neural network diagram that has been prepared by the plot function.
You can see fatscore and saltscore, the nodes corresponding to the two predictors, and from each of these nodes you can see the connection weights; you can also see the bias values here: 3 bias values corresponding to the 3 hidden layer nodes, and then one bias node for the output node, along with the 3 connection weights into the output node. This particular output node corresponds to acceptance, our outcome variable.
Among the other details, you can see the error, that is, the SSE value, and the number of steps that were required to reach convergence and stop the network learning process. So, this is our model.
Now, in this example we used a small sample of just 6 observations, so this was probably not an appropriate sample for building a good enough neural network model. So, what we will do is use our used cars data set that we have been using with other techniques as well, and build a model with a slightly larger sample size.
Because we will be importing the data from an Excel file, let us load that package first, and then import the used cars file. You can see 79 observations of 11 variables; this data set is also small, but better than the previous example that we used. So, let us remove the NA columns and NA rows.
Let us look at the first 6 observations. These are the variables; we are already familiar with this data set about used cars. We would like to build a prediction model to predict the offered price of a used car. The variables that we have are brand, model and manufacturing year; Fuel type, which is petrol, diesel or CNG; then SR price, that is, the showroom price, the price at which that particular car was first purchased; then KM, the accumulated kilometres; and then Price, the offer price of the used car in its current condition.
Then there is Transmission, coded 0 or 1 for manual or automatic; then Owners, the number of persons who have owned this car before this sale offer; then the number of airbags; and then another variable, C underscore price. However, we will not be using this variable C underscore price here.
Now, from the manufacturing year, as we have been doing with other techniques whenever we use this particular data set, we compute the Age variable, which is more appropriate for our prediction model. So, let us compute it, add it to the data frame, and also take a backup of this data frame.
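A rough sketch of these preparation steps, assuming the file is named usedcars.xlsx with the data on the first sheet and a column named Mfg_Year for the manufacturing year (the file name, sheet and column names are assumptions):

    library(xlsx)

    # Import the used cars data from the Excel file
    df <- read.xlsx("usedcars.xlsx", sheetIndex = 1, header = TRUE)

    # Drop completely empty columns and rows that sometimes come in from Excel
    df <- df[, colSums(is.na(df)) < nrow(df)]
    df <- df[rowSums(is.na(df)) < ncol(df), ]

    # Compute the Age of each car from its manufacturing year and add it to the data
    df$Age <- as.integer(format(Sys.Date(), "%Y")) - df$Mfg_Year

    # Keep a backup copy before dropping columns
    df_backup <- df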
Now, if we look at the variables, we would like to get rid of certain columns. For example, the brand name, the model name and the manufacturing year are no longer required, and the last one, as you can see, is actually C price, which we also do not want; we have 12 variables in all and C price is variable number 11 here. We do not want this one either, because we are going to build a prediction model.
So, let us cut off these columns and then look at the variables that we have. We now have 79 observations of 8 variables, and these are the variables of interest to us.
Now, as we discussed in previous lectures, neural network models perform much better when the variables are on a 0 to 1 scale. If all the variables are on this 0 to 1 scale, the neural network converges more quickly and the performance also improves.
In this particular data frame we have two categorical variables, Fuel type and Transmission. However, as we saw in the previous exercise, where acceptance was the categorical outcome variable, we did not change it into a factor variable, because the neuralnet function does not accept factor variables; unlike other modeling functions, it does not do that conversion internally. So, we will be required to change some of these variables ourselves. This is one particular example in the R environment where we have to explicitly create dummy variables.
So, we have to do dummy coding, probably for the first time in this course. We have gone through so many techniques where we would typically convert variables into factor variables and the functions would take care of the dummy conversion and the model building; however, in this particular case we have to do it explicitly. For Fuel type we have 3 levels; let us check them: CNG, diesel and petrol. We will convert these levels into dummy variables, and this is one way to achieve it, as you can see.
For CNG, if the Fuel type value in df1 is CNG, then, because we are using a logical comparison, the value will be TRUE, otherwise FALSE. So, it becomes a logical variable; you can see in the environment section that CNG has been created as a logical vector, FALSE FALSE TRUE and so on. In the neuralnet function the variables that are used should be either logical or numerical.
So, we are converting the factor variables, the categorical variables, into logical dummy variables. Now let us similarly convert diesel and then petrol.
Then we have another factor variable, Transmission; let us convert it into logical dummy variables as well, ManualT and AutoT. However, as we understand, we will not be using all the categories; one will be taken as the reference.
So, for Fuel type we will take only two of the dummy variables into the modeling exercise, and for Transmission just one of the dummy codes. We will take diesel and petrol, the two categories of Fuel type, and the automatic transmission as the one category from the Transmission variable, as sketched below.
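A minimal sketch of this dummy coding, assuming the columns are named Fuel_Type (with levels CNG, Diesel, Petrol) and Transmission (coded 0 = manual, 1 = automatic); the column names and codings are assumptions:

    # Logical dummy variables for Fuel_Type; CNG will be the reference category
    CNG    <- df$Fuel_Type == "CNG"
    Diesel <- df$Fuel_Type == "Diesel"
    Petrol <- df$Fuel_Type == "Petrol"

    # Logical dummy variables for Transmission; manual will be the reference category
    AutoT   <- df$Transmission == 1
    ManualT <- df$Transmission == 0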
So, the max and min values have been computed, and now we can use these values in the scale function. We are going to use this scale function to normalize the numeric variables to a 0 to 1 scale: you can see that center is min df1 and scale is max df1 minus min df1, which is exactly the formulation that we discussed in the slides. So that is how it can be done using R; the scale function can be used, and then we store these values back in the same columns.
(Refer Slide Time: 28:45)
So, let us execute this code; now we have normalized all these numeric variables. Let us add in the dummy variables, diesel, petrol and automatic transmission, and create a new data frame that we are going to use in our modeling exercise.
These are the variables that we are taking into the modeling exercise; let us create the data frame and look at the structure of this final data frame.
(Refer Slide Time: 29:12)
So, this is the data frame df2 that we are going to use for our network model. You can see the variables SR price, KM, price, owners, airbag and age have all been scaled to 0 to 1, and you can see their values in the structure output. Then we have 3 dummy variables which have been taken as logical variables: diesel, petrol and automatic transmission. Once this is done, we can go ahead and do our partitioning.
So, df2 is our data set now. We will take 90 percent of the observations into the training partition and leave the remaining 10 percent for the validation or test partition. So, let us create the training partition and then the test partition, for example as sketched below.
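A minimal sketch of this 90/10 split, assuming the prepared data frame is df2 (the seed value is arbitrary, just for reproducibility):

    set.seed(1234)                      # for reproducibility (value is an assumption)

    # Randomly pick 90% of the row indices for the training partition
    train_idx <- sample(nrow(df2), round(0.9 * nrow(df2)))

    df2train <- df2[train_idx, ]        # training partition (about 71 observations)
    df2test  <- df2[-train_idx, ]       # test / validation partition (about 8 observations)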
Now, we are going to use the same package, neuralnet, here, and we have to decide the neural network structure. If we look at the number of variables that we have, there are 9 variables, one of which is the outcome variable, price. That leaves us with 8 predictors. Therefore, in the input layer, as we have been doing in previous examples, we would like to have 8 nodes corresponding to the 8 predictors; that covers nodes number 1 to 8. Then, for the hidden layer, as we talked about, typically one hidden layer is sufficient to model even complex relationships.
So, we will take just 1 hidden layer with 9 nodes, one more than the number of predictors. Of course, we can experiment with the number of nodes and even with the number of hidden layers. That covers nodes number 9 to 17, and then we will have just 1 node in the output layer, node number 18.
(Refer Slide Time: 31:24)
So, with this network structure we can go ahead, build our model and see its performance. Of course, after experimentation we can try out different candidate models as well and then finally select one.
So, we will stop here, and we will continue the model building exercise for this particular dataset in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 57
Artificial Neural Network-Part V
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous few lectures we have been discussing Artificial Neural Networks. Specifically, in the previous lecture we were doing an exercise in R using our used cars data set, and there we discussed the different steps related to variable transformation and normalization that were required, and then the formula. So, we will restart that exercise, do our modelling, and discuss some of the important issues that we face in neural networks.
Before going into R studio, let us discuss a few important issues in the modeling exercise. One of the key issues that we typically face in neural network modeling is overfitting; it is quite likely that the model will overfit the data in a neural network modeling scenario. So, how do we overcome this situation? Typically, the error on the validation and test partitions would be large in comparison to the training partition, so first we need to detect whether a neural network is overfitting.
Typically, when we train a neural network model, we can check its performance on the training partition itself and then on the validation and test partitions, and we would see that the error is quite small for the training partition in comparison to that on the validation and test partitions. That is how we will know that the neural network is probably overfitting to the data.
Therefore, we need to limit the training or learning process of the neural network so that this overfitting can be avoided, because, as we have been saying, the objective in data mining modeling, whether prediction, classification or other types of tasks, is that our model should perform well on new data.
A few things can be tried. One is to limit the number of epochs, that is, the number of sweeps through the data that are made in the neural network training process. That can be limited so that the network just fits the key information that is there in the predictors and does not start fitting to the noise.
Another approach is a plot of validation error versus the number of epochs, which can be used to find the best number of epochs for training. We can always look for the point of minimum validation error, something similar to what we have been doing in previous techniques, where we used to create this kind of plot. In this case it will be the number of epochs on the x axis and the error, for training and for validation, on the y axis. Typically, this is the kind of plot we have been generating for other techniques as well.
So, for any data mining technique, the training error curve typically goes like this: as we keep on training our model, the training error will keep on decreasing and will approach 0. However, for new data, that is, the validation partition, the error will keep on decreasing up to one point where it is minimum, and then it will start increasing again. This is the point that we are looking for.
So, this is the point of minimum validation error. We would like to stop the learning process of our neural network at this point; the network which has been trained up to this point, for these many epochs, say n epochs, would probably perform better on new data.
However, whether we take the number of epochs here or some other parameter related to the training or learning process depends on the implementation of the neural network in the software. We will see what kind of plot we require to find this point of minimum validation error in our case using R. So, let us go back to the R studio environment and to the data set that we were using in the previous lecture.
This is the used cars exercise, the last exercise that we were doing, so let us load the library xlsx and import the used cars data set.
So, this is a small data set about used cars; as you can see in the environment section, 79 observations and 11 variables. The task is to build a model for predicting the used car price. Let us remove the NA columns and NA rows.
We are already familiar with this data set; these are the same variables as used in the previous lecture.
(Refer Slide Time: 06:19)
Let us create Age and add it to the data frame; let us take a backup, and exclude the variables that we do not want to take forward for our modelling. So, these are the variables now.
(Refer Slide Time: 06:39)
This part we have discussed before in the previous lecture, so we will just quickly go through it: the scaling of the numeric variables, and then creating the data frame that will finally be used for our modeling exercise.
(Refer Slide Time: 07:01)
So, this is the data frame df2, and you can see these are the variables: price is our outcome variable and the others are predictors, and all the values for all the variables are on a 0 to 1 scale. Now we will do our partitioning, 90 percent for the training partition and 10 percent for the test partition, because this is a small data set and we would like to use a higher percentage of observations for training.
So, let us do the partitioning. Only 10 percent of the observations, that is, about 8 observations, will be in the test partition, and the remaining 71 are going to be used in the training partition.
Now, as we have discussed, neuralnet is the package that we require and have been using for our neural network models, so it has been loaded.
(Refer Slide Time: 07:49)
We have already discussed the neural network structure that we will be following for this particular exercise. We talked about the 9 variables, one being the outcome variable, so we effectively have 8 predictors and therefore 8 nodes in the input layer. Then we typically take one hidden layer, and we can always experiment with the number of hidden layers and the number of nodes in them; for our illustration we will take just 9 nodes, one more than the number of predictors. The output layer has just one node, for our outcome variable.
Now, the argument linear.output has to be TRUE for a prediction model. Let us create the formula: we can use the as.formula function, with price being the outcome variable; all the other variables are collapsed using the plus sign to get the string of all the predictors. This will be our formula, so let us create it.
Now, the number of epochs: as we have discussed, 1 epoch means one pass through all the observations. So, let us compute it from the number of observations in the training partition df2train; 1 epoch will correspond to 71 runs through the network, the 71 observations that are there in the training partition. A sketch of these two steps follows.
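A minimal sketch of these two steps, assuming the training partition is df2train and the outcome column is named Price (illustrative names):

    # Build the formula "Price ~ predictor1 + predictor2 + ..." from the column names
    predictors <- setdiff(names(df2train), "Price")
    mf <- as.formula(paste("Price ~", paste(predictors, collapse = " + ")))
    mf

    # One epoch = one sweep through all training observations (71 here)
    epoch <- nrow(df2train)
    epoch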
Now, in the modeling you would see that we are typically required to do a higher level of experimentation in neural network modelling, because there are so many things that can be changed: for example, the number of hidden layers, the number of nodes in the hidden layers, and a few other things that we will discuss, such as the threshold value. So, what is the threshold value here? It is quite similar to concepts we have used with previous techniques. If we look at the neuralnet function in the help page and scroll down, we will see that threshold is a numeric value specifying the threshold for the partial derivatives of the error function as a stopping criterion.
(Refer Slide Time: 09:58)
So, in the implementation of neural networks that we are using here, the neuralnet function, the main stopping criterion is the threshold value, that is, the rate of change of the error under the given error function. The error function that is used by default is SSE, so the rate of change of SSE is going to be used here as the stopping criterion.
So, for the plot that we talked about in terms of the number of epochs, instead of epochs we would like to create a plot of error versus threshold, because the threshold is the stopping criterion as per the implementation of this neuralnet function. However, you can also see another argument here, stepmax. This is something that we discussed in the previous lecture: the maximum number of steps that will be taken to train the neural network. You can see we are specifying 30 epochs for it here in this neuralnet implementation, the function that we are using.
So, stepmax is usually given a large number, by which the neural network should converge, and it becomes the last resort for stopping the learning of the neural network. However, if the neural network is not able to converge even within this number of steps, we will get a warning from this particular function.
So, stepmax is typically set to a large enough number, and it is the threshold value which decides convergence. Of course, if we have a smaller threshold value we will require a much higher value of stepmax, because otherwise our neural network model might not converge, might not reach the optimum; and if we have a higher threshold value, fewer steps will be needed within the stepmax argument. This experimentation we can always perform. You can see that we have written two lines of code, calling the neuralnet function twice, to build two models.
In the first one, as you can see, the first argument is mf, the formula for our neural network model; then for the algorithm we are using rprop+, the resilient backpropagation (rprop+) algorithm; and you can see that the threshold value is quite small here, 0.0009, while stepmax is 30 epochs. The threshold value is small because we are using a large stepmax, so of course we expect that the model will converge and that we can then use it.
In the second call to neuralnet the first two arguments are the same, but the threshold value is quite high, 0.04, so the model will converge quite quickly, and therefore the stepmax value is also smaller, just 1 epoch. So, when we specify a particular threshold value, we have to take care that the stepmax value is adequate enough for the convergence to take place, as in the sketch below.
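A rough sketch of these two calls, assuming the formula mf, the training partition df2train and the epoch count from above; the threshold and stepmax values follow the lecture, but the object names and exact argument layout are illustrative:

    library(neuralnet)

    # Model 1: very small threshold, generous stepmax (30 epochs), so it trains longer
    mod1 <- neuralnet(mf, data = df2train,
                      algorithm = "rprop+",      # resilient backpropagation
                      threshold = 0.0009,
                      stepmax   = 30 * epoch,
                      hidden    = 9,
                      linear.output = TRUE,      # prediction task
                      rep = 1)

    # Model 2: much larger threshold and a tight stepmax (1 epoch), so it stops early
    mod2 <- neuralnet(mf, data = df2train,
                      algorithm = "rprop+",
                      threshold = 0.04,
                      stepmax   = 1 * epoch,
                      hidden    = 9,
                      linear.output = TRUE,
                      rep = 1)

    # Error, reached threshold and steps for each model
    mod1$result.matrix[c("error", "reached.threshold", "steps"), ]
    mod2$result.matrix[c("error", "reached.threshold", "steps"), ]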
So, let us run these models. You can see the other arguments are similar: the data is the training partition df2train, there is just 1 hidden layer with 9 nodes, linear output is TRUE, which is for prediction, and there is one more argument, rep, which is 1; we will see how rep is used later in the lecture. So, let us run this code; we get mod1, and it has converged quickly. Now let us look at mod2.
(Refer Slide Time: 14:34)
So, this has also converged. Let us look at the errors of these two models. You can see that for mod1 the error is smaller than for mod2; this is expected, because the threshold value was higher for mod2, so there was less training and therefore more error. You can also clearly see the difference in the reached thresholds: 0.00079, almost 0.0008, for mod1, which is smaller than the threshold 0.0009 that we had given, and for mod2 the specified threshold was 0.04 and it stopped at 0.024.
If we look at the number of steps that were required to reach the threshold for mod2, it is just 63 steps, fewer than the number of steps that we had allowed. The way we have been initializing the stepmax value in our calls to this function has been, based on some experimentation, quite close to the number of steps that are actually required. To be on the safer side we can always allow a much larger stepmax; here, because of that experimentation, you can see that 1 epoch in this case corresponds to the 71 observations in the training partition.
(Refer Slide Time: 16:06)
So, fewer than 71 steps, that is, 63 steps, were required for mod2; but if we look at the first model, mod1, 539 steps were required, while the 30 epoch limit is much larger, 2130. So, within 30 epochs we allowed 2130 steps, but only 539 were required. If we run the model again it might take more or fewer steps; that is how we can always experiment with the threshold and stepmax.
So, now let us look at some of the details, for example the inter-layer connection weights, just like we did for the model in the previous lecture, the cheese sampling model where fat score and salt score were the predictors. Similarly, we can inspect them here, as you can see.
(Refer Slide Time: 17:18)
We are renaming the row names and column names, that is, we are changing the dimension names.
So, in this case you can see these values, the inter-layer connection weights between the input layer and the hidden layer, that is, the input layer to hidden layer connections. You can see the bias values in the first row; the second row is for the first node, corresponding to the predictor SR price, with its weight values; then for KM, then for owners, and so on, with their connections to the different nodes in the hidden layer, nodes 9, 10, 11 and up to 17, and the corresponding weights.
Then the hidden layer to output layer connections can also be seen. Again, in this particular code I rename the row and column dimensions. Since we had just one output node, price, we can see node 18, price, in the column, and on the row side we have the bias and nodes 9 to 17, the hidden layer nodes, with the connection weights and the bias value. These are the values for model 1.
(Refer Slide Time: 18:01)
Now, we are interested in looking at the results: the predicted values, the actual values and other things. We can run this particular code; I have created a data frame of the predicted values, which are captured in the net.result element of mod1, the actual values, and the remaining predictors in the training partition.
You can see the first 6 observations, with the predicted and actual values; in most of the cases the predicted value is quite close to the actual value.
So now let us look at the performance. We will use the package rminer and compute some of the metrics: SSE, RMSE and ME. So, let us compute these values.
The first set is for the training partition of the first model, as you can see here. Then we compute the values for the second model. You can see that the RMSE value is smaller for the first model than for the second, because over-training has happened in the case of the first model. Now what we will do is look at the performance of these two models on the test partition.
Until now, for the other techniques, we have been using the predict function to score the test partition; but for this particular package, neuralnet, we have the compute function to score the test partition observations. So, we will use compute, and the other arguments are quite similar: first the model object and then the test partition to be scored. Then we compute the metrics SSE, RMSE and ME. With respect to the first model, you would see that the RMSE value was about 0.01 for the training partition, but it is 0.16 on the test partition.
So, from here we can say that the model has overfitted to the data, fitted to the noise; the error on the test partition is about 10 times that on the training partition. This was for model 1; now let us look at the performance of model 2. For model 2 the value on the test partition is 0.21, compared with the 0.024 we saw earlier.
(Refer Slide Time: 21:27)
So, even in this case you would see that the performance of the model is quite poor, even though less training has happened, just 1 epoch, about 60 observations; if we look at the previous results, the number of steps to convergence was only 63. So, there are two possible scenarios: the second model has either overfitted to the data or, since only 63 steps were used, it might be under-trained.
The more likely scenario is that this particular model is under-trained, and because of that its performance on the test partition is poor; whereas the first model seems to be over-trained, and because of that its performance on the test data is poor. So, model 1 is over-trained and model 2 is under-trained, and that can be seen from the number of steps and the threshold values used in the two cases. Now let us look at the diagram: plot model 1 and look at this.
(Refer Slide Time: 23:05)
So, for model 1 this is the network diagram that we have. You can see all 8 predictors, from SR price and KM up to automatic transmission, AutoT, and the different connecting arrows from the input layer nodes to the hidden layer nodes. You can also see the bias nodes and their bias values connected to the hidden layer nodes; then from the hidden layer nodes there are connections to the output layer node, and there is another bias node with its bias value.
So, this is what we have. Now, as I said, of the two models one is overfitting and the other is underfitting. So, what else can be done? As I mentioned, a number of experiments can be performed in neural network modeling. The next one is to change the number of hidden layer nodes, so let us see what happens if we increase it.
Since the model we already built, model 1, had 9 hidden nodes in its hidden layer and was still overfitting, we expect that a model with 18 hidden nodes is also going to overfit the data, fit to the noise. So, look at the arguments of this model 3 that we are going to create: the threshold value is now even smaller. The earlier one that we had specified was 0.0009.
(Refer Slide Time: 24:50)
Now you can see it here, 0.0007, while the steps limit is the same. Since we will have more hidden layer nodes, the model should be able to converge even at a lower threshold value while still keeping the same stepmax.
So, we are passing quite a tight value for stepmax, and within this value of stepmax we are trying to get the smallest possible threshold that can be used; of course, that will lead to overfitting. Let us run this: it did not converge. Let us run it again: not converged this time either. One more time: still not converging.
(Refer Slide Time: 25:30)
So, what we will do is increase the threshold value from 0.0007 to 0.0008 and run it again, and you see that it converges immediately.
So, even after increasing the number of hidden layer nodes from 9 to 18, in terms of threshold value the earlier model converged at 0.0009 and this one at just 0.0008; only one unit lower in the fourth decimal place, even though we have doubled the number of hidden layer nodes.
Now, if we look at the error value, it is much lower: 0.0035, whereas the error value of the first model, if you go back, was 0.0048. So, the error is much less, since the threshold value at which the model converged is lower, and you can see that more steps, 1674, were required for this convergence to take place.
Now, let us look at the performance of this model on the training and test partitions; let us compute the same metrics. You can see that the RMSE value on training has decreased further, to 0.0099 in this case. Now let us look at the performance on the test partition: again the performance has become much worse. The RMSE value is now 0.46, many times higher than in the previous case. So, overfitting has increased if we compare this model 3 with model 1; the performance on new data, the validation data, is worsening, as reflected in the RMSE value.
So, what do we need to do to find the best model? What we will do now is build the model using some systematic experimentation. We are going to use the rep argument, which we had kept as 1 in the previous modeling, and make it 20; so we will run the same model 20 times and then pick the best one out of those 20 runs.
Other things will also change. You can see what I am doing in this particular for loop: stepmax is kept at 30 epochs, which is the highest value, and the threshold value is now the loop variable i; I am running a for loop using i as the counter, and we are going to build a number of models and test them on the validation partition, just like in the plot we discussed. So, we will create that kind of plot, doing some experimentation with the threshold value.
These are the threshold values that we are going to use, 19 of them, starting from 0.01, then 0.015, then 0.020, and so on up to 0.1.
So, we will create these 19 models, and for each of these 19 threshold values we will build 20 models, since the repetition is 20, and the best one among them will be picked. Essentially, for each of these threshold values we will be picking the best model based on 20 runs. Mtest is the vector in which we will store the validation error values.
You can see in the next few lines of code that the best repetition is found out of the 20 runs, that is, which one is the best model; once that is selected, it is used to score the test partition, as you can see here, and then we compute the metric values, mainly RMSE, and store them in this particular vector.
So, for all the 19 models we will store this value. The plot that we are going to use will have the validation error on the y axis and the threshold value on the x axis. So, let us initialize Mtest and run this loop; you can also notice that the number of hidden layer nodes is 9.
So, let us run the loop. It will take some time, because we will be running 20 multiplied by 19 models; and yes, all of them have been computed. A sketch of this tuning loop follows.
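A rough sketch of such a tuning loop, assuming mf, df2train, df2test, the Price outcome and the epoch count from before. Picking the best repetition by the smallest training error in result.matrix is one reasonable way to do it and is an assumption, not necessarily the exact code used in the lecture:

    library(neuralnet)
    library(rminer)

    thresholds <- seq(0.01, 0.1, by = 0.005)      # 19 candidate threshold values
    Mtest <- numeric(length(thresholds))          # validation RMSE per threshold
    pred_cols <- setdiff(names(df2test), "Price")

    for (k in seq_along(thresholds)) {
      # 20 repetitions per threshold; generous stepmax so the runs converge
      nn <- neuralnet(mf, data = df2train,
                      algorithm = "rprop+",
                      threshold = thresholds[k],
                      stepmax   = 30 * epoch,
                      hidden    = 9,
                      linear.output = TRUE,
                      rep = 20)

      # Pick the best repetition, here the one with the smallest training error
      best_rep <- which.min(nn$result.matrix["error", ])

      # Score the validation/test partition with the best repetition
      scored <- compute(nn, df2test[, pred_cols], rep = best_rep)

      # Record the validation RMSE for this threshold
      Mtest[k] <- mmetric(df2test$Price, as.numeric(scored$net.result),
                          metric = "RMSE")
    }

    # Validation error versus threshold
    plot(thresholds, Mtest, type = "b",
         xlab = "threshold", ylab = "validation RMSE")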
(Refer Slide Time: 30:56)
So, there were no convergence problems, as we saw in the earlier cases where we had to adjust the threshold value, because stepmax is large enough to allow the models for all the different threshold values to converge. Once the loop has finished, we create a data frame of the threshold values and the error values that have been computed.
You can see here the threshold values 0.01, 0.015 and so on up to 0.1, and the corresponding validation errors: 0.33, then it drops to 0.17, and it keeps on dropping, though there are swings in this particular output. So, let us find out where the value is minimum: you can see it is the 16th value, that is, threshold 0.085.
So, when the threshold value is 0.085 the validation error is minimum. A point of caution here: the data set that we are using is quite small; however, if you do experiment with this loop and run it again, the best threshold value still comes out around 0.08, 0.085 or 0.09.
So, even with the smaller sample size, this is probably the best threshold value for getting the best model, and once it is identified we can create our plot; we can identify the same thing using the plot of error versus threshold value.
(Refer Slide Time: 32:40)
So, you can see the validation error; there are many swings here. If we had a larger test partition, these points would have been smoothed out and we would have seen a clearer point of minimum validation error, just like in the conceptual plot; but because we have just 8 observations in the test partition, the plot is not that smooth. Still, we can identify 0.085 as the point of minimum validation error.
So, once this is known to us, we can again go for our best model; as you can see, the best model as per these results is at around 0.085. We will stop here at this point, and in the next lecture we will build our model using this particular threshold value, 0.085.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 58
Artificial Neural Network-Part VI
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous few lectures we have been discussing Artificial Neural Networks, and specifically we were discussing the overfitting issues that we have to deal with in neural networks.
We talked about some of the points about how overfitting can be detected; we talked about how the difference between the error on the validation and test partitions and the error on the training partition actually helps us identify whether the model is overfitting to the data.
Then, among the different ways to sort out this problem, one is to limit the number of epochs, that is, to stop the training or learning process of the neural network at the particular point where it is just fitting the information in the predictors rather than fitting to the noise.
One way that we discussed was a plot of validation error versus the number of epochs, which can be used to find the best number of epochs for training, the point of minimum validation error.
(Refer Slide Time: 01:36)
So, this is a similar exercise to what we have done for previous techniques; we discussed this part and also did some modeling to understand it better, and we will continue from there.
So, let us open R studio, load the library neuralnet, and import the data set that we are using.
So, this was the data set. Let us load the library xlsx and then import the used cars data set. Now let us remove the NA columns and NA rows, look at the first six observations, and do the few computations that we have been doing.
Let us go through them quickly, along with the other variable transformation and normalization steps; this part we have already gone through in the previous few lectures.
(Refer Slide Time: 02:28)
Let us do the partitioning. Now, for the modeling that we have been doing: this was our formula, the epoch, and the model 1 that we built in the previous lecture. We looked at the threshold value; we talked about the combination of threshold value and stepmax, and how we were trying to build a model with a tight limit on stepmax and an adequate threshold value.
So, these are the models we had built. We also looked at their performance, and we saw what happened across the two different models when we changed the amount of training: in the second one just 1 epoch, in the first one 30 epochs, and how the reduced number of steps affects the output, the model results.
We also saw that model 1 was overfitting to the data and that the second model was probably underfitting to the data. So, we looked at all those things.
Now, let us move forward. We also built a neural network model with 18 hidden nodes, and we saw that it was overfitting even further.
So, let us move on. How do we find the best model? As we talked about in the previous lecture, the validation partition can be used. We talked about running a particular loop so that we can create a plot: we do certain computations related to the threshold value, and for different threshold values we build different models.
We give a good enough stepmax so that the models converge, and then for the different threshold values we build models and test their performance on the validation data. This is the exercise we were doing, and from there we can identify the point of minimum validation error.
So, let us repeat this. In this particular loop, as you can see, we are building a model and the threshold value i changes with the loop; this part we have gone through. Then the best model is selected out of the 20 repetitions, tested on the validation partition that we have, and the performance is recorded.
(Refer Slide Time: 05:35)
So, let us run this loop. Let us load the library rminer and then rerun these steps.
You can see that for each threshold value we are running 20 models, selecting the best one, and then testing its performance on the validation partition. Now it is done.
(Refer Slide Time: 06:11)
So, let us create the data frame, find the minimum value, and then plot it. This is the plot, as you can see.
Now, if we look at this plot, the minimum error is 0.016, at threshold value 0.03. However, if we run this process again, we would see that the more stable numbers might come at a somewhat higher threshold; look at this plot.
(Refer Slide Time: 06:50)
So, in this particular run 0.03 seems to be the point of minimum validation error. But because of the smaller sample size, and to adjust for the sampling error, we might pick a slightly different point; this particular plot is also subject to change, since we have run this for just a few threshold values on a small data set. However, we can still use this information.
In the previous few runs, the few experiments that I had done, the values were observed to be minimum around the 0.085 mark, which is also reflected in this plot: you can see that near 0.085 and 0.09 the value was quite close to the minimum.
So, this is a slightly different output; however, we will run the model at that value, because in more runs this was the best threshold value, and it is also a higher threshold value. Since we are facing the problem of overfitting, the higher threshold value will help us avoid it.
So, the best model would be around this 0.085 value; let us run it, and since these were 20 runs, let us select the best one.
(Refer Slide Time: 08:20)
So, these are the values: the error, the reached threshold and the steps. Now, let us look at the performance. The training performance, as you can see, is about 0.027; then let us look at the performance on the test or validation partition, which is 0.033. So, for this particular model the RMSE value on the training partition is about 0.028 and on the validation partition about 0.033, which is quite close. If you remember, in the previous lecture, for the two models, model 1 and model 2, the error on the test partition was much higher in comparison to the training partition.
The two errors are now much closer, so it seems that this model is good enough and not overfitting to the data. Now, as I said, in the various runs of this for loop that I have done, the best value typically came around 0.085 or 0.09; there we were seeing a fairly constant validation error, whereas in this plot it was different. However, we can always do this exercise again.
So, let us do one more run and see, because the best model that we got at 0.085 seems to be a good one.
(Refer Slide Time: 10:05)
So, we do a few more runs, and of course the results also depend on the observations which are part of the training partition; that is another thing.
Here again you can see that around 0.08, and this point, I guess, is 0.085, the error is at or near its minimum. However, some other points are also competing with this; one of them also seems to be a minimum, but it is quite close to the lower threshold values, and therefore we know that it would be overfitting.
So, we are looking for a point in this part of the plot which is neither overfitting nor underfitting. If we do another partitioning, with different observations in the training partition, the results would typically be in favour of this number, 0.085, and that is also reflected if we run the model using this value.
Now, another important aspect of neural networks is that, since we had scaled all the variables to a 0 to 1 scale, how do we go back to the original units? This is how we can do it.
We have the maximum value, from maxdf1, for the outcome variable price, and also its minimum value. These values can then be used to get the scores back in the original units: for the test partition, the original-scale value is a plus (b minus a) multiplied by the predicted value, where a and b are the minimum and maximum, and the predicted values come from net.result. A sketch of this rescaling follows.
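A minimal sketch of this back-transformation, assuming the original (unscaled) data frame is df1, the outcome column is Price, the scored test-partition predictions are in scored$net.result, and train_idx is the index vector from the partitioning sketch (all names are illustrative):

    # a = minimum and b = maximum of the outcome on its original scale
    a <- min(df1$Price)
    b <- max(df1$Price)

    # Undo the min-max scaling: original = a + (b - a) * scaled prediction
    pred_original <- a + (b - a) * as.numeric(scored$net.result)

    # Compare with the actual prices of the test partition on the original scale
    data.frame(actual    = df1$Price[-train_idx],
               predicted = pred_original)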
(Refer Slide Time: 12:30)
So, let us execute this code and look at the data frame; you can see the actual value and the predicted value, now on the original scale.
You can see that in many cases the predicted values are quite close to the actual values: the first one quite close, the second one a slight difference, the third one also a slight difference, the fourth one quite close, the fifth one some difference, then another one quite close, and so on. So, this is the neural network model, the best model, that we selected at threshold value 0.085.
We would always like to pick a higher threshold value so that overfitting, which is the typical problem, can be avoided, and we can see from the results also that the selected model seems to be doing a good job in this respect.
Now, the models that we have built till now were for the prediction task. How can a neural network be used, and how useful can it be, for a classification task? Let us do this through an exercise in R. So, let us first clear out the environment variables and objects; now we will be building a classification model using the same data set, but this time for the classification task.
So, let us import the data set, remove the NA columns and NA rows, and look at the observations; we are already familiar with this data set and these variables. Let us compute Age and add it to the data frame.
Let us take a backup and eliminate some of the unwanted variables. You will see that we are eliminating the seventh column, which corresponds to Price, because this time we are not building a prediction model but a classification model.
So, we would like to keep C underscore price, the categorical version of the Price variable, and eliminate the continuous variable. Let us create this new data frame.
(Refer Slide Time: 14:44)
So, this is the structure of the new data frame; you can see Fuel type, SR price, KM, Transmission, Owners, Airbag, Age and, most importantly, C underscore price, which is now going to be used as the outcome for the classification model.
Now, for variable transformation and normalization, some steps are similar to what we did in the prediction task.
So, let us go through these steps: the creation of dummy variables for Fuel type, the same for Transmission, and then the scaling, again similar, as you can see here. We do the scaling, and then, as we did in the prediction task, we include Diesel, Petrol and AutoT, these three dummy variables, in the data frame that is to be used for the model.
You can also see that we are converting C underscore price into a logical variable, because the neuralnet function of the package allows only numeric or logical values. Since we have been creating the dummy variables as logical ones, we would like to keep that consistency and keep the outcome as a logical variable as well for the neuralnet function.
(Refer Slide Time: 16:16)
Let us create this. Now you can see that we have C underscore price, Diesel, Petrol and AutoT as logical dummy variables, along with the other numeric variables. Now let us go ahead and do our partitioning: 90 percent of the observations in the first, training partition, and the rest in the test partition.
So, let us load the neuralnet library. Now, if we look at the neural network structure, it is quite similar to what we used in the prediction task: the input layer has 8 nodes for the 8 predictors, nodes 1 to 8; then 1 hidden layer with 9 nodes, nodes 9 to 17; and then the output layer with 1 node, node 18. This time linear output is FALSE, because we are going to build a classification model.
So, let us create the formula here. This time the outcome variable in the formula is C underscore price, and the other variables in the data frame are going to be used as predictors.
Let us create the formula. The epoch is computed in the same fashion, from the number of observations in the training partition; 1 epoch, one sweep through the data, will have those many observations. Now, you will see that in the code we go ahead and directly try to build the best model: rep is 20, so we will execute 20 repetitions for each model, and the threshold values go from 0.01 to 0.1 in increments of 0.005.
So, for all these threshold values we will build classification models, test their performance on the validation partition, that is, the second partition in this case, and use that to pick the best model.
The code is quite similar: Mtest1, as in the prediction task, is going to be used to record the performance on the validation partition for the different models, so let us initialize it. As we can see in the loop, the steps are similar: we build the model with 1 hidden layer of 9 nodes, linear output is FALSE this time, stepmax is the same as before, 30 epochs, and the threshold value is i, so it changes as the loop counter changes. Then we pick the best model out of the 20 repetitions that are executed.
Once the best out of 20 is selected for each threshold value, we test the performance of that model on the validation partition; one more step is required to actually classify the observations into the appropriate class, and that is then used to compute the misclassification error, as you can see here, which is finally recorded. A sketch of this classification loop follows.
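A rough sketch of this classification version of the loop, assuming partitions dfctrain and dfctest with the logical outcome C_price and a formula mfc built on it (all names are illustrative); the choice of best repetition and the 0.5 cut-off mirror the earlier sketches and are assumptions:

    thresholds <- seq(0.01, 0.1, by = 0.005)
    Mtest1 <- numeric(length(thresholds))          # misclassification error per threshold
    pred_cols <- setdiff(names(dfctest), "C_price")

    for (k in seq_along(thresholds)) {
      nn <- neuralnet(mfc, data = dfctrain,        # mfc: formula with C_price as outcome
                      algorithm = "rprop+",
                      threshold = thresholds[k],
                      stepmax   = 30 * epoch,
                      hidden    = 9,
                      linear.output = FALSE,       # classification task
                      rep = 20)

      best_rep <- which.min(nn$result.matrix["error", ])
      scored   <- compute(nn, dfctest[, pred_cols], rep = best_rep)

      # Classify with a 0.5 cut-off and record the misclassification error
      pred_class <- as.numeric(scored$net.result) > 0.5
      Mtest1[k]  <- mean(pred_class != dfctest$C_price)
    }

    plot(thresholds, Mtest1, type = "b",
         xlab = "threshold", ylab = "validation misclassification error")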
So, let us execute this loop and look at the plot; as in the prediction task, it shows the validation error versus the threshold value.
This plot also reflects what I was discussing for the prediction task: somewhere around 0.08 we see constant error values; you can see that this time it is around 0.08 and 0.085.
(Refer Slide Time: 20:04)
So, around this point on the threshold axis we see constant errors. Because this is a small data set, it is quite complicated to pick the best point; however, many runs of the model would give us a good threshold value. So, from the different models for the different threshold values that are available, we can pick a value nearby here. Adjusting for sampling error is also crucial, but since we have many candidates to pick from, we will again pick 0.085, just like we did for the prediction task. So, let us pick that one.
So, in this particular case I have written down certain lines of code which can again be used to pick one of the best values. Let us follow these instructions. First, we build a data frame of threshold values and error values and then order it so that the threshold value is in decreasing order. So, let us order this data frame, as you can see here in the output now.
1160
(Refer Slide Time: 21:38)
So, the higher threshold values come first and then the lower threshold values. We would like to identify a model which is close to the higher threshold values. Let us compute the minimum error value first. Here we can see that it is about 0, so that is the minimum value. Even when higher threshold values are being used, for some of those models, as you can see in this output, the error value comes out to be 0. This particular code will give us the number of threshold values which have that same misclassification error on the validation partition; we can see the same error for these many threshold values.
1161
(Refer Slide Time: 22:38)
Now, out of these we would like to identify the first one, that is, the one having the highest threshold value. Picking the first one gives us 0.1 as the highest threshold value; however, to adjust for sampling error, one approach could be to take the second value. So, instead of 0.1 we can take 0.095, and that could be our best threshold. This 0.095 I am storing here in best th. In the prediction task we looked at the table and the plot and identified a best threshold value to get a good model; in this case these steps are again helping us do that.
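The selection steps described here can be sketched roughly as follows, again assuming the Mtest1 data frame from the loop above:

# Order by threshold in decreasing order, find the minimum validation error,
# and pick the second-highest threshold achieving it, to adjust for sampling error.
ord <- Mtest1[order(Mtest1$threshold, decreasing = TRUE), ]
min_err <- min(ord$error)
candidates <- ord$threshold[ord$error == min_err]   # thresholds tied at the minimum error
length(candidates)                                  # how many thresholds share that error
best_th <- candidates[2]                            # second value, e.g. 0.095 instead of 0.1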
So, let us build this model and pick the best out of the 20 repetitions. These are some specific details of that model: 91 steps were taken, the reached threshold is 0.094 and 2.83 is the error.
Now, let us check the performance of this model on the training partition itself. You can see 0.1267. Let us check the performance on the test partition.
1163
So, we can see the model is performing better on the test partition in comparison to the training partition. This seems to be a good enough model; let us plot it. You can see this is the classification model and it is doing a much better job. Here we can see the different weight values, the nodes denoting the predictors, and the nodes in the hidden layer.
So, let us move forward. Now, there is another package that is available to build neural network models in R; that is nnet. This package can also be used to build neural network models; however, because the package is different, some of the steps are going to be different. This is also to highlight the fact that there might be more than one package available for a given technique, and depending on the context, the problem and the data set, different packages could be more suitable. To familiarize you with this package, we will go through one model using it.
1164
(Refer Slide Time: 25:33)
So, let us load this package nnet. Let us take the same data from the backup we created here. These were the variables; now again we are eliminating variables 1, 2, 3 and 7. And since we are again building a classification model, we would also like to get rid of Price, so this is the structure now and these are the variables.
Now, this particular package allows us to keep the categorical variables as factors and then it internally does the dummy coding.
1165
So, let us look at this. Let us convert Transmission and C_price into factor variables; the fuel type is already a factor. Now, for the remaining numeric variables we will change their scales. After the scale change we will have these variables: 3 factors and the other numeric variables with their scales changed.
Now, let us go through the partitioning. Epoch 2 is the number of observations there; we have taken 90 percent of the observations for the training partition. nnet is the function that is used to build the neural network model; you can see nnet and C_price being regressed against the other variables.
So, in neuralnet we did not have the option of the tilde-dot shorthand; that is why we had written out the formula to get all the variable names, because that can be a long list. Other things are quite similar: size here corresponds to the number of nodes in the hidden layer, and rang sets the initialization range for the weights, from minus 0.5 to plus 0.5. Decay is a learning parameter; in neuralnet we had a corresponding argument, the learning rate for back propagation. There are a few other arguments as well: maxit, which is similar to stepmax, the number of steps in neuralnet, and the absolute tolerance, which is similar to the stopping criterion at the threshold value that we have in neuralnet.
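A hedged sketch of such an nnet call is given below; the argument values mirror what is described here, while the data frame name train2 and the specific decay and maxit values are assumptions for illustration:

library(nnet)

# size: nodes in the single hidden layer; rang: initial weights in (-0.5, 0.5);
# decay: weight decay parameter (the lecture relates it to the learning rate in neuralnet);
# maxit: maximum iterations; abstol: absolute stopping tolerance.
fit_nnet <- nnet(C_price ~ ., data = train2, size = 9,
                 rang = 0.5, decay = 5e-4, maxit = 200, abstol = 1e-4)

pred_class <- predict(fit_nnet, train2, type = "class")
mean(pred_class == train2$C_price)   # training classification accuracy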
So, let us build the model, classify the observations, look at the table and look at the classification accuracy: 98 percent.
So, as you can see, this model also seems to be doing a good job; in the other model also we got good classification performance. This is on the training partition; let us check the performance on the test partition. However, we see that the performance on the test partition is not as good as what we saw with the neuralnet package. There the performance was much better, but we created the partitions differently this time, so that could be one reason. Now, for this package we do not have a supported plot function to create a neural network diagram.
So, there is one particular piece of code that is available from different sources; different developers keep writing such functions, and this particular source code can be used to plot a neural network model built using the nnet package.
1168
(Refer Slide Time: 29:10)
So, we have to load this library, and since it is not installed, let us go through this. This is not part of the package; this source code is available separately in different repositories and blogs on the internet for R. You can always find this particular source code, and it can be used to plot the neural network model.
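If sourcing such an external script is inconvenient, one packaged alternative (an assumption on my part, not the source used in the lecture) is the NeuralNetTools package, whose plotnet function draws a diagram for an nnet object:

# install.packages("NeuralNetTools")   # if not already installed
library(NeuralNetTools)
plotnet(fit_nnet)   # fit_nnet: the nnet model object built above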
1169
(Refer Slide Time: 29:58)
So, you can see that this plot is quite different. There are some specific details which are not available, but you can see other things here; for example, you can see Fuel type Diesel and Fuel type Petrol, so these dummy variables have been created internally.
So, you can always get a plot of the model using this. Let us go back to our discussion. Now, let us discuss some of the important points related to neural network models, for example, experimenting with the neural network model. A few points that we will discuss: for example, how do we decide on a neural network architecture?
The first one is the number of hidden layers. How do we decide the number of hidden layers? For most scenarios, as we have been saying, one layer is adequate to capture even complex relationships. Next is the number of nodes in each hidden layer.
As we talked about, we can always start with p nodes, where p is the number of predictors, and then increase or decrease the number of nodes, balancing overfitting versus underfitting. We can compare model performance on the validation partition and look at what happens when we increase or decrease the nodes. Even with these steps, several trial-and-error runs would have to be executed on different candidate architectures, and domain knowledge and neural network expertise are always going to be useful to pick the best network architecture for your data set and the task.
Number of output layer nodes: for a numerical outcome variable, if you have just one, you would require a single node. For a categorical outcome variable, if the variable is binary, then typically a single node is sufficient and a cut-off value can be used to classify the observations; if you have m classes with more than 2 classes, then m nodes can be used. Number of input layer nodes: typically that is p nodes, equal to the number of predictors that we have.
So, for each predictor we will have a corresponding node in the input layer. A few other points related to experimenting with neural networks: for example, variable selection could be an important part of the exercise. Wherever possible, domain knowledge can be used to select the important variables in the modeling exercise; automated reduction techniques can be used; and data mining techniques, for example CART and regression-based models, can be used to identify the important variables and then include them in our neural network modeling exercise.
It has been seen that if variable selection is applied and we get higher quality input variables, then the performance of neural network models improves. So, typically it is recommended that variable selection approaches are applied, important variables are identified, and then neural network modeling is performed.
1172
(Refer Slide Time: 33:00)
Other parameters: the first one is the learning rate. We have talked about this; its value is typically in the range 0 to 1. A constant value can be taken, though some variation is also allowed: we can have a large value and then decrease it for successive iterations, so that can also be done.
So, when we have just started to train the network, started its learning process, the learning rate should probably be higher, and as the network has learned a bit, the learning rate can be reduced; the same idea is reflected here. We can start with l and then use l/2, l/3 and so on; in this fashion we can decrease it. That is another approach. Another parameter could be momentum. This parameter is meant to help the convergence of the weights to an optimum value; it is typically used to allow weight changes to keep moving in one direction so that more of the optimum points are traversed.
We would not like to get stuck in a local optimum; we would like to achieve the global optimum, and for that it is important that our execution does not get stuck in a local optimum. So, momentum can be used. Something similar was used in our R modelling: the rprop+ algorithm, that is, resilient backpropagation with weight backtracking, essentially implements similar ideas.
1173
Another thing we can discuss is sensitivity analysis using the validation partition. As we understand, neural networks follow a black box approach, so the interpretation of variables, the relationship between the outcome variable and the predictors, is difficult to explain with a neural network model. However, a few things can be done, for example, sensitivity analysis using the validation partition.
This can be used to sense which predictors affect predictions more and in what way, which will give us some understanding of the relationship. We can set all the predictors to their mean values and compute the prediction of the network; that will give us the average-level prediction and will serve as a benchmark for our sensitivity analysis. Then, for each predictor, we can change its value to its maximum or minimum and compute the prediction using the network. That will tell us whether the prediction goes up or down, by how much and in what fashion a particular predictor is affecting the network output, and therefore we would be able to understand something of its relationship with the outcome variable.
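A minimal sketch of this kind of sensitivity analysis for a neuralnet model is given below; the object names (nn_best for the fitted network, valid for the validation data, C_price for the outcome) are assumptions, and the predictor columns are assumed to be in the same order as used for training:

predictor_names <- setdiff(names(valid), "C_price")
X <- valid[, predictor_names]

# Baseline: all predictors held at their mean values.
baseline <- as.data.frame(t(colMeans(X)))
base_pred <- as.numeric(compute(nn_best, baseline)$net.result)

# Perturb one predictor at a time to its minimum and to its maximum.
sens <- sapply(predictor_names, function(v) {
  lo <- baseline; lo[[v]] <- min(X[[v]])
  hi <- baseline; hi[[v]] <- max(X[[v]])
  c(at_min = as.numeric(compute(nn_best, lo)$net.result) - base_pred,
    at_max = as.numeric(compute(nn_best, hi)$net.result) - base_pred)
})
sens   # change in prediction relative to the average-level benchmark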
These predictions are then compared with the average-level prediction, as I talked about, to learn about the relationship between predictors and the outcome variable. Further comments on neural networks: one advantage is high predictive performance, which is a plus for neural networks. Overfitting is one problem that we have to deal with, as we did through our modeling exercise in R. The inability to explain the structure of the relationship we also discussed; I talked about the role of sensitivity analysis in understanding something about the relationship between the outcome variable and predictors. Another problem that we might face with neural networks is prediction outside the training range, because we typically scale the variables into the 0 to 1 range.
1174
(Refer Slide Time: 37:00)
Therefore, predictions outside the training range might not be valid; however, this is quite a generic point and it is applicable to other techniques as well.
Now, as we talked about, variable selection approaches, for example CART, could be used in combination with neural networks and that will actually improve network performance.
Another issue related to neural networks is that an adequate sample and sample size would be required for training the network. That is important so that the model does not overfit the observations; an adequate sample size should be there and overfitting has to be controlled, as we have discussed.
There is also the risk of convergence of the weights to a local optimum instead of the global optimum; this part also we have discussed, and there are different algorithms available, and as we talked about, the momentum parameter can be used. Many algorithms have implemented such features to avoid getting stuck in a local optimum and achieve the global one.
Another disadvantage of neural networks is a longer runtime. Typically, because of the learning process, and especially for deeper networks depending on the number of hidden layers and nodes, it takes a much longer runtime.
1175
So, with this we will stop here; this concludes our discussion on artificial neural networks. In the next lecture, we will discuss another technique.
Thank you.
1176
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 59
Discriminant Analysis-Part 1
Welcome to the course Business Analytics and Data Mining Modeling using R. So, we
have come to our last technique that we want to discuss, that is, discriminant analysis. So, let us start. Discriminant analysis is also a statistical technique, typically used for classification or profiling tasks. Its applications and the types of tasks for which it can be used are quite similar to what we discussed for logistic regression.
This is also a model-based approach: typically we make assumptions about the structure of the relationship between the outcome variable and the set of predictors. If we look at the main idea behind discriminant analysis, we can see two points; two approaches have been used to conceptualize the idea of discriminant analysis.
1177
(Refer Slide Time: 01:54)
From this we can understand that when we discuss discriminant analysis we are essentially looking for a separating line. One approach is about finding a separating line which is equidistant from the centroids of the different classes. Let us say the centroid for this particular group is this one and the centroid for that particular group is that one; then we are looking for a line which is equidistant from the centroids of the different classes.
Sometimes this approach is used to implement discriminant analysis. The other approach is to find a separating line or hyperplane that is best at discriminating the records into the different classes. In terms of theoretical understanding there is not much of a difference; however, when you go about implementing it in software, implementing the actual steps of the algorithm, there are slight differences.
So, with the second approach we are looking for a particular line or hyperplane that is best at discriminating these clouds of observations into their respective groups. These are the two approaches; the first one is finding an equidistant line or hyperplane.
1178
Because that line is equidistant from the centroids of the groups, it would probably do a good job of classification; the second one is finding a line that is best at discriminating the records into the different classes. These two approaches are popular in discriminant analysis and have been implemented.
The classification procedure that is used here is based on distance metrics. We covered a few distance metrics when we discussed KNN, and this technique is also based on distance metrics. The main idea is again based on the distance of a record from each class. As you can understand from the two approaches that we discussed, the underlying calculation required in each of those approaches would be the calculation of the distance of a record from each class.
This calculation then becomes the basis for classifying observations. Let us move forward; a few more points on classification. For example, the best separation between items is found by measuring their distance from each class. How can we separate the different items into their respective groups? That is found, as we talked about, by measuring their distance from each class, and a particular item is classified to the closest class.
1179
We measure the distance of a particular item from each class, and depending on its closeness to a particular class, the item is classified accordingly. The next important point is what distance metrics could be used. One option is to use the Euclidean distance metric for discriminant analysis.
We are already familiar with this; this is the formula for the Euclidean distance metric. The distance of a record (x1, ..., xp), where we have p predictors, from the centroid (x1 bar, ..., xp bar) of a class is computed in this fashion: D_Eu(x, x bar) = square root of ((x1 - x1 bar)^2 + ... + (xp - xp bar)^2). Here Eu stands for Euclidean, x is the record and x bar is the centroid, which is nothing but the vector of means of the p predictors. The formula is quite familiar, and it can be used as the distance metric for those calculations: for each item we calculate its distance from each of the classes and finally classify based on those computations.
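In R this calculation can be sketched as follows; the record and the two centroids here are made-up toy values, purely for illustration:

# Euclidean distance of a record x = (x1, ..., xp) from a class centroid x_bar.
euclid_dist <- function(x, x_bar) sqrt(sum((x - x_bar)^2))

x_new  <- c(5.5, 21.0)     # a hypothetical record with p = 2 predictors
cent_A <- c(4.8, 19.5)     # centroid (vector of predictor means) of class A
cent_B <- c(8.2, 16.0)     # centroid of class B
euclid_dist(x_new, cent_A)
euclid_dist(x_new, cent_B) # assign the record to the class with the smaller distance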
The Euclidean distance metric is one that could be used, but as we discussed in KNN as well, there are a few issues with it. Let us understand them. The first one is that distance values depend on the unit of measurement, as is very clear from the formula itself. If we have two variables, one measured in square feet and the other in hundreds of square feet, the actual values are of course going to be different, and when those values are used in the Euclidean distance metric one particular variable would dominate the distance.
Therefore, the distance values that come out are going to be dependent on the unit of measurement, which is quite problematic in some data sets. That is one problem. The second one is that the metric is based on the mean and does not account for variance. If we look at the Euclidean distance metric we just saw, we are trying to compute the distance of a particular record from the centroid of each of the classes. Since the centroid is the vector of means of all predictors, the distance computation is actually accounting only for the mean values and not for the variance.
Now, variability might sometimes play an important role in determining the closeness of a record to a particular class. If the computation is based only on the mean and the variability is not accounted for, consider that one class might have a larger spread and higher variability. Because of that larger spread, a new observation is quite likely to belong to that class even though the distance between that class's mean and the new observation might be on the higher side. However, as per the Euclidean distance metric, the new observation might be assigned to the lower-variability group simply because the distance from its centroid is smaller, even though, because of the larger spread, there is a good chance that the observation actually belongs to the high-variability group.
1181
So, another argument could be that if we just look at the variability or spread, this observation could also belong to the high-spread group; as per the variability, the new observation would go to that group, but if we look only at the distance from the centroid, which accounts just for the mean, the observation will go to the other group.
Therefore, both the mean and the variance should be accounted for, which is not the case with the Euclidean distance metric. So, apart from scale dominance, variance is also an issue with the Euclidean distance metric. To eliminate these two problems, the distance can be computed using standard deviations, that is, z-scores, instead of the original units of measurement.
Instead of using the actual values, even if they are in different units of measurement, we can measure them in terms of standard deviations, that is, z-scores. That would eliminate the first problem, the scale dependence, and the second problem would also be covered because the standard deviation, which is also an indicator of the spread, would be part of this process.
That is one way we can overcome these two problems; however, there is one more issue with the Euclidean distance metric: the correlation between variables is ignored. For the variables that are going to be part of this distance calculation, the correlation is also important, and that is ignored in the Euclidean distance metric.
So, how do we solve this one? The example here is more like a 2-D space, but when we are talking about p predictors we are in a p-dimensional space. The correlation between variables might also play an important role in determining the closeness of a record to a particular group, so this issue should also be resolved.
1182
The next distance metric, called the statistical distance or Mahalanobis distance, can be used to overcome the issues we have discussed with the Euclidean distance metric. The statistical or Mahalanobis distance is typically defined as below, as you can see in the slide: D_M, the distance of a record x from the centroid x bar of a class, is computed as (x minus x bar) transpose, times S inverse, times (x minus x bar). Here (x minus x bar) transpose is the transpose of (x minus x bar), so essentially the column vector is turned into a row vector, and S inverse is the inverse matrix of S.
Here S is the covariance matrix of the p predictors. You can see that in the definition itself the covariance matrix is part of the statistical distance calculation; therefore, the correlation between predictors is accounted for. The multiplication by S inverse can also be considered a p-dimensional extension of the division operation, so in this fashion scaling is also part of the process. If we take this statistical distance formula down to a one-dimensional space, you would realize that it reduces to the z-score.
So, the two important issues, scale dependence and variance, that we could overcome using z-scores are also handled by this formula, because it is essentially the extension of the z-score into an n-dimensional, or specifically p-dimensional, space, and because S, the covariance matrix, is there.
Correlation, the other issue that was there with the Euclidean distance metric, is also overcome using this statistical distance: correlation is accounted for through the covariance matrix, and since we take its inverse and subtract the mean value, the centroid, scaling is also accounted for. So, scale dependence, variability and the other issues are all accounted for.
So, the statistical or Mahalanobis distance seems to be a much better metric for distance calculation in discriminant analysis.
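Base R already provides this computation through the mahalanobis function; a small illustration with toy data (not the lecture's data set) is sketched below:

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)          # toy class data with p = 2 predictors
x_bar <- colMeans(X)                       # class centroid
S <- cov(X)                                # covariance matrix of the predictors

x_new <- c(1.2, -0.5)                      # a hypothetical record
mahalanobis(x_new, center = x_bar, cov = S)          # (x - x_bar)' S^-1 (x - x_bar)
t(x_new - x_bar) %*% solve(S) %*% (x_new - x_bar)    # equivalent explicit matrix algebra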
The next important point is how some of these things are implemented. The discussion that we had about discriminant analysis was mainly based on computing the distance of a new observation from each class and then assigning it to the closest class. However, the implementation is done using linear classification functions, and that also brings out the similarity of discriminant analysis with multiple linear regression. So, let us discuss this aspect. Linear classification functions are used to implement the distance ideas that we talked about; they are used as the basis for separating records into classes, and they compute a classification score measuring the closeness of a record to each class.
What we discussed in terms of distance metrics is what the classification functions actually implement in a functional form; the distance idea is implemented in a functional form. You can see the first point here: compute a classification score measuring the closeness of a record to each class. The second point is that the highest classification score is the equivalent of the smallest statistical distance. So, the main idea, which was based on the distance metric, is captured here in the linear classification functions.
But the implementation is now different: instead of calculating the distance, we use these classification functions to compute a classification score. In terms of understanding and interpretation, the underlying basis is the same; instead of saying smallest statistical distance, we would be saying highest classification score.
Again, to understand what these functions do: the main idea behind them is to find linear functions of the predictors that maximize the ratio of between-class variability to within-class variability. You can see, when I said that linear classification functions bring discriminant analysis closer to what we understood in multiple linear regression, that here also we are looking for linear functions of the predictors.
These classification functions are actually linear combinations of the predictors, and in that sense there is a similarity with multiple linear regression. Later on we will also discuss that some of the application and performance aspects of discriminant analysis are quite similar to what we discussed for multiple linear regression. So, the idea is to find linear functions of the predictors that maximize the ratio of between-class variability to within-class variability; for example, in this two-class case, the between-class variability divided by the within-class variability.
That means we are trying to separate these groups. If this ratio is maximized, we are achieving maximum separation between the groups, and we would be able to discriminate between observations and classify them into their respective groups.
Through the maximization of this ratio these functions are determined. These linear functions of the predictors are determined by maximizing this ratio, and the underlying understanding is quite similar to the distance-based calculation we discussed; instead of a distance-based calculation we will have classification scores, but the idea comes from that. To understand what we have discussed till now, let us go back to RStudio and, through an exercise, try to understand a few of these points.
So, the data set that we are going to use right now is the sedan car data set that we are already familiar with. Let us load the package xlsx.
1186
(Refer Slide Time: 21:46)
So, these are the three variables that we are already familiar with: annual income, household area and ownership. This is the structure of the data frame; annual income and household area are numeric and ownership is a factor, a categorical variable, which is also our outcome variable in this case. We have two groups, as you can see: non-owner and owner.
1187
(Refer Slide Time: 22:24)
Now, the first important aspect that we need to discuss is the separation into classes. For a classification model to do well in classifying observations, the class separation should be fairly clear. If the class separation is clear, then a classification model would probably do a good job of classifying the observations. However, if the class separation is not clear, then the modelling would be much more complicated and the performance would not be as expected. So, let us understand these two cases using some plots. This plot is quite similar to the example that we have shown here on the board, so let us plot it.
1188
(Refer Slide Time: 23:23)
So, these are the observations in the scatter plot. As you can see, the top right group is the owners group and the bottom left group is the non-owners group.
There is a clear separation between these two groups and therefore it is easier for any classification model to separate the observations into their respective groups. If the class separation is clear, the classification would probably be much easier. The main idea of discriminant analysis that we talked about is finding a line or hyperplane that is either equidistant from the centroids, which was one approach, or a line or hyperplane that does the best job of discriminating the observations.
Let us understand that and try to find such a line. If we look at this scatter plot and draw a line somewhere from here to here, we would get a good enough separation where only one observation would be misclassified, as we can see here. This is the point that we want to locate: it is between 20 and 22 on the y axis and between, I guess, 5 and 6 along the x axis. So, this is the point we want to locate, and our line would come above this particular point.
So, let us find it. This is one way to do it, because we are following a manual process here to find that line. We can see 20 to 22 and 5 to 6; that is, annual income 5 to 6 and household area 20 to 22. So, we would probably be able to find this point.
1189
(Refer Slide Time: 25:35)
So, you can see this is the point: 5.3 and 21. This seems to be the point, 5.3 and 21. Now, let us find another point in this particular zone. Our discriminant line should pass below this second point; as I suggested, the line could go like this and therefore should be below it. This point seems to be between 8 and 9 along the x axis and between 16 and 18 along the y axis, that is, annual income 8 to 9 and household area 16 to 18. So, let us find this particular point. You can see 8.5 and 17; this seems to be the point, about 8.5 and 17.
Now we can manually assign some coordinates to draw our line. You can see here we are plotting a line through (5.3, 22); the located point was (5.3, 21), so we are moving up in the y direction to 22. The second point was (8.5, 17), and there we are moving down along the y direction: the x value stays the same at 8.5 and the y value goes from 17 down to 15. So, we will get a separating, discriminant line; let us plot it.
1190
(Refer Slide Time: 27:09)
We can see that, through our manual process of visual inspection, this could be one line that would be able to discriminate the observations into their respective groups. This also seems to be an equidistant line from the centroids of the two groups; with both approaches the line that we get is typically going to be the same. We can extend this line using a few lines of code here.
1191
You can see that using these two point coordinates that we have identified, we can compute the slope and use it to extend this line further; you would see the line being extended in the plot. Let us look at it: you can see the line has been extended.
This is the line that we were talking about. The main idea of discriminant analysis is to find this line, which would be able to separate these observations. From this it is also clear that we would like to have good class separation; the class separation should be quite clear for discriminant analysis to work well. Let us look at another example where the class separation is not so clear.
This is the promo offers data set; we are familiar with this one as well. Let us import it and see what the scenario is in this particular data set. Here, as we will see, there is no clear separation between the observations belonging to the different groups, and that would actually complicate the performance of the model. Let us remove the NA columns and NA rows, change the palette, and take these 200 observations.
1192
(Refer Slide Time: 29:14)
Using these observations, we are again using a log scale here because there are too many observations and they fall within one limited coordinate area; we would like to space them out. That is why we are using the log scale. So, these are the observations.
1193
(Refer Slide Time: 29:47)
In this particular case we can see that, even though we have used only 200 points, both the groups are lying around in the same area and the separation is quite difficult here. Similarly, let us plot all the observations instead of this sample; if we look at all the observations, the situation is even more difficult, as you can see here.
1194
(Refer Slide Time: 30:20)
Here you would see that in this top right half, observations belonging to both the acceptor and non-acceptor classes are clubbed together, so the class separation is not clear in this particular region. However, in the mid part and the left part, most of the observations belong to the non-acceptor group. These are low-hanging fruit, and a model would be able to correctly classify these observations because they typically belong to just one group; the top right region, though, is quite complicated and the separation is not clear.
Where the separation is not clear, it is going to be difficult for a model to give good performance. You would also see that for any model we apply on this particular data, the overall performance would typically come out to be quite good, because the majority of the observations are very easily classified; you can see so many observations here which are going to be easily classified as non-acceptor. So, the overall performance of the model would be quite good, but if we restrict ourselves to the top right part, the model would really be tested there because the class separation is not that clear.
So, with this we will stop here and in our next lecture on discriminant analysis, we will
do modelling, we will build our discriminant analysis model using the sedan car data set.
1195
Thank you.
1196
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 60
Discriminant Analysis-Part II
Welcome to the course Business Analytics and Data Mining Modeling Using R. In the previous lecture we were discussing discriminant analysis and we did a few exercises in R. So, let us get back to RStudio. We talked about class separation and used two data sets to show how important class separation is and how, in different data sets, it is going to impact the modeling process and results.
Now, in this particular lecture we will do our modeling. We are using the sedan car data set here; let us look at the structure of the data frame. It is already loaded into the environment, and you can see these are the variables: annual income, household area and ownership.
1197
(Refer Slide Time: 01:11)
So, we can go ahead with our modeling process. So, the package that we are going to use
for this particular discriminant analysis modeling is mass. So, will let us load this
package library mass.
And the function that we required to build our model is called LDA. So, this is for linear
discriminant analysis.
1198
(Refer Slide Time: 01:32)
For more information on lda you can go to the help section and find a few more details; for example, this function is for linear discriminant analysis, as you can see, and you can understand the details of the different arguments that are part of this function.
1199
(Refer Slide Time: 01:44)
1200
(Refer Slide Time: 01:56)
We are going to use it in our modeling exercise. Our outcome variable of interest here is ownership; you can see ownership tilde the remaining variables, that is, annual income and household area, and then the data frame, that is df. Let us run this model.
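A minimal sketch of this call is shown below; the data frame name df and column name ownership follow the discussion, while the exact spelling of the other column names in the session may differ:

library(MASS)

lda_fit <- lda(ownership ~ ., data = df)   # ownership against the remaining variables
lda_fit                                    # prints priors, group means and LD coefficients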
So, these are the results. You can see the call, then the prior probabilities of groups, 0.5 and 0.5, which are nothing but the actual proportions of observations in each group, and then we have the group means.
1201
These are actually the centroids: the first row is the centroid for the non-owner group and the second row is the centroid for the owner group; you can see the values for these two centroids here. Then what we have are the coefficients of the linear discriminant. This is quite similar to what we discussed for multiple linear regression, just like the betas that we compute in linear regression, but the idea there is slightly different: those are computed with respect to the outcome variable.
Here, the coefficients of the linear discriminant are with respect to the class separation that we want to achieve. These are the coefficients: you can see 0.61 for annual income and 0.3 for household area. Our LD1, the linear function of the predictors, is this one, and these are its coefficients. This is the line that we talked about in the previous lecture as well.
In this particular case the data set that we are using is quite similar to the example I am drawing here on the board, and this is the line that we are looking for; the coefficients for this line we can see in the model output, 0.61 for annual income and 0.3 for household area. These coefficients will determine the line, and it can be used to discriminate the observations into their respective groups. To understand the results of this model a bit further, we can use a stacked histogram of the LDA values for the observations that we have. So, let us set the graphical parameters first.
1202
(Refer Slide Time: 04:41)
So, after setting the graphical parameters, let us plot this and we will see the graph here. In this graph we can see the LDA values for the different groups.
1203
These bars are actually the different observations; in the data set we had just 20 observations. You can see the counts here, about 4, 5 and 9 on one side and about 4 and 8 or 9 on the other, so almost all the observations are covered. Looking at the LDA values, for the non-owner group most of the values are below 0 except one, and for the owner group most of the values are greater than 0, as you can see here. In this fashion our model is able to discriminate.
The idea that we talked about is that we are looking for a line that would be able to discriminate the observations into their respective groups. You can see that LDA values less than 0 typically correspond to the non-owner group and values greater than 0 typically to the owner group; of course, there are going to be a few misclassifications.
This is one way of understanding the model that we have just built. Let us look at a few more details, for example, the class that is going to be predicted; this is for the training partition, the observations that have been used to train the model. We can use the predict function, and it is going to return a class value for each observation, which is the predicted class.
1204
(Refer Slide Time: 06:33)
Let us look at this. For each of the about 20 observations we can see the predicted class here. Then there are also the scores, the classification scores that we talked about; the linear classification function will give us these scores as well. The predict function can again be used, and one element of the returned value can be used to display the scores.
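Roughly, the pieces referred to here can be pulled out of the object returned by predict, sketched with the assumed lda_fit object from above:

pred <- predict(lda_fit)      # scores the data used to fit the model
head(pred$class)              # predicted class for each observation
head(pred$x)                  # LD1 classification scores
head(pred$posterior)          # estimated class membership probabilities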
1205
You can see here the 20 classification scores. As we saw in the plot, for the non-owner group the values were less than 0; the same thing you can see here, values less than 0 from the first observation up to the ninth observation, all negative. Then the tenth observation has a positive value, and the later observations also have values greater than 0.
The 11th, of course, seems to be a misclassification: it is negative, and then we see positive values, 0.28, 0.77, 0.34, and we keep going in this fashion. These are the values, and there seems to be one misclassification in each group: observation 10 is misclassified in the non-owner group and observation 11 is misclassified in the owner group. These are the two misclassifications, and we can see their classification scores here.
Now, if we are interested in plotting these scores, we can plot the discriminant function values; in this case there is just one discriminant function, so we can create a scatter plot for it. Let us open a new graphics device; dev.new is the function that can be used to open a new plotting device. This is the new device, and the scatter plot of the discriminant function that we are going to create will be plotted on this device, as you can see here.
1206
These are the classification scores for all the observations. You can see the index; this index is the row number for each of the about 20 observations. For all 20 observations we have the classification scores, and you can see that about half of the observations score below 0 and about half score above 0, as we have seen in the other plots. Let us add a few more details, labels and a legend.
Let us look at the observations. Here the predicted class is displayed in gray and the actual class is displayed in black, so for each observation we can see what the actual class was and what the predicted class was. We can see that these observations are correctly classified. Coming over here, though, this is the observation whose actual class is owner but has been classified as non-owner, and there is one more misclassification; we need to find that point.
That is also here: this observation is actually a non-owner and has been classified as owner. So, these are the two observations which are misclassified; they are quite close to where our discriminant line is going to be, and the other observations are correctly classified.
1207
(Refer Slide Time: 11:00)
We can also construct a data frame of the values that we have just computed; let us look at some of these values in a data frame format: the predicted class, the actual class and the classification scores. We can see the first 9 observations correctly classified, then, as I talked about, the first misclassification followed by the second misclassification. In each group we have one misclassification, and the remaining observations are correctly classified into their group, that is, owner.
1208
(Refer Slide Time: 11:39)
Now, let us look at the performance of this model using the accuracy and misclassification error numbers; let us compute the classification matrix first. In the classification matrix you can see that 9 records belonging to the non-owner class have been correctly classified as non-owner class members, and similarly 9 owner class members have been correctly classified as owner class members; in the two off-diagonal elements we have 1 and 1, and these are the errors. Let us compute the accuracy and error numbers; before that, there is another important function that can be used to visually inspect the classification matrix, called the mosaic plot. This can also be used to see the classification matrix.
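A quick sketch of these two steps, the classification matrix and the mosaic plot, again using the assumed lda_fit and df objects:

cm <- table(Predicted = predict(lda_fit)$class, Actual = df$ownership)
cm                                   # classification (confusion) matrix
sum(diag(cm)) / sum(cm)              # accuracy; error is 1 minus this
mosaicplot(cm, main = "Classification matrix", color = TRUE)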
I think it has gone to the new device, yes, we can see it there. This is the classification matrix information shown graphically. For the non-owner class the rectangle is quite big, so most of those observations are correctly classified, and here the size of the rectangle is also quite large, so most of the observations have been correctly classified as owner. The smaller rectangles represent the incorrect classifications.
Now let us compute the accuracy and error. We can see 0.9 is the accuracy and the remaining 0.1 is the error in this case, so the model seems to be doing a good job.
1210
(Refer Slide Time: 13:23)
However, as we saw in the scatter plot itself, the class separation was quite clear, so we should have expected good performance from the model. Here is another data frame with some key information; we can see it here.
1211
(Refer Slide Time: 13:39)
It has the predicted class, the actual class and the probability of ownership, which has been computed. For the probability of ownership, the predict function's posterior element is being used to display the probability of belonging to the owner group, and 1 minus this probability is the probability of belonging to the non-owner group. The other predictor columns are also appended to this data frame, so we have the probability values, the predicted class and the actual class.
1212
So, let us move forward. Now what will we do? We will again create a scatter plot; it is again going into a new device and we will use labelling.
1213
(Refer Slide Time: 14:42)
You can see this is the scatter plot, and we are using labelling here for the different observations and how they have been classified; the text there is telling us the predicted class.
Here again we can see this particular observation: it belongs to the owner class, but it has been predicted as non-owner, so this is one misclassification. We have one observation here which is correctly classified, and here is another observation which was supposed to be non-owner but has been predicted as owner. So, in this scatter plot itself, because this is a small data set, we can see where each observation falls.
1214
(Refer Slide Time: 15:32)
We can easily spot which observations have been correctly classified and which have been incorrectly classified. Now, what we are going to do is re-plot the ad hoc line that we drew in the previous lecture and also plot the discriminant line that we have just computed using our lda model. So, let us re-plot; these computations we have already understood. This is the ad hoc line that we had plotted in the previous lecture.
1215
Now let us look at the line that we have got from the model. There are two ways; the first one is to use the contour function. We would be required to do some computations, because we will have to expand our observations: you can see here I am expanding a grid of values using the expand.grid function (more details on expand.grid you can find in the help section). Then, using the predict function and the model object, I am going to score this expanded data frame, which is then used to create a matrix and from that the contour, our discriminant line.
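The grid-and-contour idea can be sketched like this; the column names Annual_Income and Household_Area and the factor level "owner" are assumptions about the data frame, and the contour is added on top of an existing scatter plot of the data:

# Build a fine grid over the two predictors, score it with the model,
# and draw the 0.5 posterior-probability contour as the discriminant line.
grid <- expand.grid(
  Annual_Income  = seq(min(df$Annual_Income),  max(df$Annual_Income),  length.out = 200),
  Household_Area = seq(min(df$Household_Area), max(df$Household_Area), length.out = 200)
)
p_owner <- predict(lda_fit, newdata = grid)$posterior[, "owner"]
z <- matrix(p_owner, nrow = 200)
contour(unique(grid$Annual_Income), unique(grid$Household_Area), z,
        levels = 0.5, add = TRUE, col = "gray", drawlabels = FALSE)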
1216
(Refer Slide Time: 16:54).
We can see here our discriminant line shown in gray. This is from the model; this is the line that has been computed, which is actually giving us the classification.
Now, the same discriminant line can also be plotted using some matrix algebra computations: from the coefficient values and other quantities we are going to compute the slope and, finally, the intercept of the line. Some of these computations relate to this, and we will just go through them; the matrix algebra is a bit complicated here.
We will not go into the details, but this is another way to obtain the DA line, by computing its intercept and slope. You can see that matrix algebra is involved: we take the coefficient values, multiply with the transpose, take the prior values and the centroid values, and then compute our slope and intercept. Then we add the line to the plot that we had created; clip is a function that can restrict the plotting of a line or any other graphic to a limited region of the graphics device.
Let us look at it. Now you can see that the gray line has been overdrawn by this abline in blue; the same gray line has been converted to blue because the DA line is the same, it is just a different plotting mechanism that we have used.
1217
You can see this from the legend as well: earlier the gray line was the DA line, now it is blue; this is the DA line, and the black line is our ad hoc line that we had drawn on our own by looking at the observations.
Let us move to another exercise. This time we are going to use the used cars data set; let us import it and remove the NA columns and NA rows. These are the observations. Now let us look at each variable and take a backup; again, we are going to build a classification model here, so we are eliminating the Price variable and keeping just C_price, which, as you can see, is our categorical outcome variable. Let us also convert the other variables, Transmission and C_price, into factors.
1218
(Refer Slide Time: 19:50)
This is our final structure, as we can see here. Now we can actually partition; let us put 90 percent of the observations into training and the rest into testing. In the earlier exercise all the observations were part of the model; this time we have a slightly bigger sample, so we are creating separate partitions and we will judge the performance on the validation or test partition.
1219
This is another important function, nearZeroVar, which is used to identify near-zero-variance predictors. Ideally we would not like to include near-zero-variance predictors in our LDA modeling. The caret library is there, and it has this nearZeroVar function to spot these predictors.
We can see that in our case we do not have any near-zero-variance predictors, so we are safe and can go ahead with our modeling. MASS was the library that we used for the LDA exercise, and lda is the function; in this case you can see C_price being regressed against all the other predictors, and this is our training partition.
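A rough sketch of these two steps, with the assumed object name train_uc for the training partition of the used cars data:

library(caret)
nzv <- nearZeroVar(train_uc)     # column indices of near-zero-variance predictors
nzv                              # integer(0) here means none were found

library(MASS)
lda_uc <- lda(C_price ~ ., data = train_uc)   # C_price against all other predictors
lda_uc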
1220
(Refer Slide Time: 21:19)
So, let us build the model. Since we have just two groups here again, we need only one linear discriminant function, and we can see the combination, a linear function of all the predictors. You can see Diesel, Petrol, SR price, KM, Transmission, Owners, Airbag, Age; all these variables and our linear discriminant function we can see here. Now we can go ahead and compute some of these scores.
1221
So, classification then the scores classification scores and then the estimated probabilities
values and let us look at the classification matrix.
Now this is the classification matrix, this is on training partition. So, we can see the
model is doing a good job here.
Let us look at the accuracy and error numbers, 87 percent and 12 percent. Now let us
look at this particular data frame with some information on some key variables.
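A sketch of how the data frame described next might be put together; the column names are illustrative, and the first and second posterior columns are assumed to correspond to classes 0 and 1.

    results <- data.frame(
      Predicted = pred_train$class,
      Actual    = cars_train$C_Price,
      Score     = pred_train$x[, 1],            # classification score from LD1
      Prob_0    = pred_train$posterior[, 1],    # probability of the first class level
      Prob_1    = pred_train$posterior[, 2],    # probability of the second class level
      cars_train[, setdiff(names(cars_train), "C_Price")]   # all the predictors
    )
    head(results)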
(Refer Slide Time: 22:26)
Let us look at this. We have the predicted class, the actual class, and the score coming from our discriminant function, that is, the classification score; then we have the probability of belonging to class 0, the probability of belonging to class 1, and after that all the predictors. So, this particular data frame has all the important information, the output of the model as well as the input.

As you can see, if the class separation and other characteristics are quite good, then the linear discriminant function does quite a good job. So, we can compare this linear discriminant scenario to classification and regression trees.
There, we used to identify rectangular regions that were able to separate the observations. In this case, instead of looking for a set of horizontal and vertical lines that create rectangular regions, as we typically do in classification and regression trees, we are looking for a diagonal line, a single line that is able to discriminate between the observations and classify them into their respective groups. So, each technique has its own strengths: instead of creating so many rectangular regions, as we do in CART, just one discriminant line was sufficient to get a good model for this particular data set.
So, let us go back to our discussion using the slides. There are a few important things that we would like to cover here: some assumptions and other issues concerning discriminant analysis.

The first assumption is that the predictors are supposed to follow a multivariate normal distribution for each class. That is, the predictors that we have used to construct our linear discriminant function should follow a normal distribution within each of those classes. However, if there are enough sample points, the results are quite robust to violation of this assumption, provided an adequate sample size is available for each class.

The second assumption is that the correlation structure among the predictors should be the same for each class. That structure is important for us to be able to build a valid discriminant analysis model. We can always look at the correlation matrix and find out whether the correlation structure is similar for each of those classes.
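One quick, informal way to check this second assumption, continuing with the assumed objects from the earlier sketches, is to compare the correlation matrices of the numeric predictors within each class; a formal test of equality of covariance matrices (Box's M) is also available in add-on packages such as heplots.

    num_vars <- sapply(cars_train, is.numeric)      # numeric predictors only
    by(cars_train[, num_vars], cars_train$C_Price,
       function(d) round(cor(d), 2))                # one correlation matrix per class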
Now, this technique is also sensitive to outliers: as you can understand from the discussion, if there are a few outliers our linear discriminant line can change significantly, and that can impact the results. Therefore, it is important to identify those extreme points and eliminate them if possible.

A few more comments on discriminant analysis: as we talked about, its application and performance aspects are similar to those of multiple linear regression.
Just as in linear regression the coefficients are computed with respect to the outcome variable, in discriminant analysis the coefficients of the linear discriminant are optimized with respect to class separation. We want to achieve class separation, and the coefficients of the linear discriminant line are optimized in that fashion. In linear regression, on the other hand, the coefficients are optimized with respect to the outcome variable, because we want to predict that outcome variable. So, that is one difference; otherwise there are many similarities, and only the optimization objective differs. In both cases the coefficients or weights are optimized, and the estimation technique is the same: least squares is used in discriminant analysis as well as in multiple linear regression.
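An informal illustration of this least-squares connection, continuing with the assumed objects from the earlier sketches: for two classes, the direction of the linear discriminant is proportional to the least-squares coefficients obtained by regressing a 0/1 coding of the class on the predictors, so the element-wise ratios printed at the end should be roughly constant.

    ols_data         <- cars_train
    ols_data$y01     <- as.numeric(ols_data$C_Price == levels(ols_data$C_Price)[2])
    ols_data$C_Price <- NULL

    ols_fit <- lm(y01 ~ ., data = ols_data)     # least-squares fit on the 0/1 outcome

    # align by coefficient name (dropping the intercept) and compare the directions
    round(coef(ols_fit)[rownames(lda_cars$scaling)] / lda_cars$scaling[, 1], 3)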
So, with this we have completed our discussion on discriminant analysis, and with this we have also covered most of the popular supervised learning techniques in this course. We started with the introductory lectures, then we understood the data mining process, we looked at some exploratory and visualization techniques, we also went through the metrics that are used to assess the performance of classification and prediction models, and then we started our discussion on the formal techniques that are used for modelling.

We started with multiple linear regression, and then KNN and naïve Bayes; we covered neural networks, and we also covered classification and regression trees, as well as logistic regression and discriminant analysis. All these techniques that I just talked about come under the umbrella of supervised learning algorithms: there is always an outcome variable, and with respect to that outcome variable we go about building our models, either for prediction tasks or for classification tasks.
So, in this part of this particular course, we have been able to understand some of the basics of data mining modelling and analytics in general. We covered statistical techniques, mathematical techniques, and data mining and machine learning algorithms, and we have understood how these techniques can be used in the modelling process. As I said, most of these techniques are typically used for classification and prediction, that is, they are supervised learning techniques.
So, I hope that you have learned a lot going through the lectures of this course, and that you will be able to use some of the learnings you have gained here in your future career.
Thank you.