Big Data & Business Analytics
Module (04) – Data Science & Linear Regression
Module 4 Learning Objectives
Module Objectives:
Data science concepts & process
What is linear regression
What are Mean Squared Error (MSE) & Root Mean Squared Error (RMSE)
What is the Coefficient of Determination (R²)
What to Study for Exam:
Module 4 Lecture Notes (emphasis on above topics)
© 2020 Eslsca. All Rights Reserved 2
Module 4 1st Data Mining/Data Science
One of the first articles to use the phrase “data mining” was published by
the economist Michael C. Lovell in 1983. He pointed out that statistics can
lead to incorrect conclusions when not informed by knowledge.
By the 1990s, the idea of extracting value from data and identifying patterns had
become popular. Database and data warehouse vendors began using the
buzzword “business intelligence”.
In 1996, a group of companies that included Teradata and NCR led a project to
standardize and formalize the data mining process.
With the proliferation of artificial intelligence (AI) and neural networks, data
mining is now commonly treated as a subset of machine learning and AI.
Data science is a field of study that uses a
scientific approach to extract meaning and insights
from data.
Data science covers data cleansing, preparation,
analysis, visualization and evaluation.
Machine learning, on the other hand, refers to a
group of techniques used by data scientists that
allow computers to learn from data.
Deep learning is part of a broader family of machine
learning methods based on artificial neural
networks (ANNs).
https://www.javatpoint.com/data-science-vs-machine-learning
Module 4 2nd Data Mining/Data Science Process
1. Selection: Select the data in the database that is relevant to the data mining
problem.
2. Preprocessing and cleaning: Raw data is rarely clean. It may contain errors, missing
values, noise or inconsistencies, so removing or correcting such anomalies is
essential.
3. Feature selection and extraction: Reduce the data to a smaller number of attributes
than the original set, keeping those most useful for the problem.
4. Data mining/data science: Apply techniques such as regression, clustering and
classification to the data to discover interesting patterns.
5. Interpretation and evaluation: Evaluate the results and present them through
visualization, forecasting and prediction.
https://steemit.com/steemstem/@noble-noah/data-mining-and-application-big-data-rules-the-world
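The five steps above can be sketched end to end on a toy example. This is only an illustrative sketch: the records, the field names (`enginesize`, `color`, `price`) and the input value 160 are hypothetical assumptions, not data from the course, and a real project would normally use a data analysis library rather than plain Python lists.

```python
import math

# Hypothetical raw records (illustrative only); None marks a missing value.
raw = [
    {"id": 1, "enginesize": 100, "color": "red", "price": 10000},
    {"id": 2, "enginesize": 150, "color": "blue", "price": 15000},
    {"id": 3, "enginesize": None, "color": "red", "price": 12000},
    {"id": 4, "enginesize": 200, "color": "grey", "price": 20500},
    {"id": 5, "enginesize": 120, "color": "blue", "price": None},
    {"id": 6, "enginesize": 180, "color": "red", "price": 17500},
]

# 1. Selection: keep only the fields relevant to the problem.
selected = [{k: r[k] for k in ("enginesize", "color", "price")} for r in raw]

# 2. Preprocessing and cleaning: drop records with missing values.
clean = [r for r in selected if all(v is not None for v in r.values())]

# 3. Feature selection: keep a single numeric attribute as the predictor.
xs = [float(r["enginesize"]) for r in clean]
ys = [float(r["price"]) for r in clean]

# 4. Data mining: fit price = b0 + b1 * enginesize by least squares.
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

# 5. Interpretation and evaluation: measure the error and make a prediction.
predictions = [b0 + b1 * x for x in xs]
rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
forecast = b0 + b1 * 160.0
print(f"slope={b1:.2f} intercept={b0:.2f} rmse={rmse:.2f} forecast={forecast:.2f}")
```

Dropping incomplete records is only one possible cleaning strategy; filling in missing values is a common alternative when data is scarce.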
Module 4 3rd Regression
Problems in which a numeric outcome can be modelled as a mathematical equation of the inputs are referred to as regression problems.
Examples:
o Predicting the failure of mechanical parts in automobile engines
o Predicting social media share scores
o Predicting performance scores, e.g. restaurant rating, revenues
o Estimating life expectancy
o Estimating population growth
o Temperature forecast
Module 4 3rd Linear Regression
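We want to find the best line (a linear function
y = f(X)) to explain the data.
The predicted value of y is given by:
$$\hat{y} = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$$
To determine the model parameters β from some
data, we minimize the Residual Sum of
Squares:
$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - \beta x_i)^2$$
where $\beta x_i$ is shorthand for the prediction $\beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j$ for observation $i$.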
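For a single input variable (p = 1), the parameters that minimize the RSS have a well-known closed form. The sketch below assumes that simple case; the function names `fit_line` and `rss` and the sample points are illustrative choices, not part of the course material.

```python
def fit_line(xs, ys):
    """Return (b0, b1) that minimize the RSS for the line y = b0 + b1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx             # slope
    b0 = y_mean - b1 * x_mean  # intercept
    return b0, b1


def rss(b0, b1, xs, ys):
    """Residual sum of squares of the line y = b0 + b1 * x on the data."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data that lies roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = rss(b0, b1, xs, ys)
print(f"b0={b0:.3f} b1={b1:.3f} RSS={best:.4f}")

# Moving either parameter away from the fitted value cannot lower the RSS.
print(best <= rss(b0 + 0.1, b1, xs, ys), best <= rss(b0, b1 - 0.1, xs, ys))
```

With several input variables the same minimization is usually solved with matrix methods in a numerical library instead of by hand.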
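Regression Mean Squared Error (MSE)
MSE is the average of the squared differences between the observed values $y_i$ and the predicted values $\hat{y}_i$:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
RMSE is the square root of MSE, so it is expressed in the same units as y:
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$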
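As a minimal sketch, both measures can be computed directly from a list of observed values and a list of predicted values. The function names and the toy numbers below are illustrative assumptions.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))


# Illustrative values: the errors are 1, 0, -1 and -2.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

mse_value = mse(y_true, y_pred)    # (1 + 0 + 1 + 4) / 4 = 1.5
rmse_value = rmse(y_true, y_pred)  # sqrt(1.5)
print(f"MSE={mse_value:.3f} RMSE={rmse_value:.3f}")
```

Because the errors are squared, a single large error (here the last one) contributes more to the MSE than several small ones.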
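Coefficient of determination (R squared)
R squared is the proportion of the variance in y that is explained by the model:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
A high value of R squared (close to 1) is an
indicator of a close fit.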
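R squared can be sketched as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of y. The function name `r_squared` and the toy values below are illustrative assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)             # total variation
    return 1.0 - ss_res / ss_tot


# Illustrative values: SS_res = 6 and SS_tot = 20.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

r2 = r_squared(y_true, y_pred)
perfect = r_squared(y_true, y_true)             # predictions equal to the observations
baseline = r_squared(y_true, [6.0] * 4)         # always predicting the mean of y
print(f"R2={r2:.2f} perfect={perfect:.2f} baseline={baseline:.2f}")
```

The two reference cases show the scale: a perfect fit gives 1, and a model no better than always predicting the mean gives 0.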
Module 4 4th Linear Regression Examples
Car Price Prediction:
Geely Auto, a Chinese automobile company, aspires to enter the US market by setting up
a manufacturing unit there and producing cars locally to compete with its
US and European counterparts.
The company wants to know:
- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car
Based on various market surveys, a large dataset of different types of cars across the
American market was obtained.
The attributes in the dataset include:
fueltype, aspiration, doornumber, carbody, drivewheel, enginelocation, wheelbase,
carlength, carwidth, carheight, curbweight, enginetype, cylindernumber, enginesize,
fuelsystem, boreratio, stroke, compressionratio, and others
https://www.kaggle.com/goyalshalini93/car-price-prediction-linear-regression-rfe/notebook
Car price sample data (table not reproduced here; see the source notebook):
https://www.kaggle.com/goyalshalini93/car-price-prediction-linear-regression-rfe/notebook
The price was plotted against each attribute to assess visually how strongly it is
related to price (plots not reproduced here; see the source notebook):
https://www.kaggle.com/goyalshalini93/car-price-prediction-linear-regression-rfe/notebook
The regression model for this example yields the following:
Mean squared error (MSE): 8871405.16
Root mean squared error (RMSE): 2978.49042
Coefficient of determination (R squared): 0.87
An R squared of 0.87 means the model explains about 87% of the variance in price, which indicates a close linear fit.
The code and dataset are available at the following link:
https://www.kaggle.com/goyalshalini93/car-price-prediction-linear-regression-rfe/notebook
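As a quick sanity check, the reported RMSE should equal the square root of the reported MSE. The sketch below uses only the two figures quoted above; the variable names and the tolerance of 0.001 are illustrative assumptions.

```python
import math

# Figures as reported for the car price model.
reported_mse = 8871405.16
reported_rmse = 2978.49042

# RMSE is defined as the square root of MSE.
computed_rmse = math.sqrt(reported_mse)

# Allow a small tolerance for rounding in the reported figures.
consistent = math.isclose(computed_rmse, reported_rmse, abs_tol=1e-3)
print(f"computed RMSE={computed_rmse:.3f} consistent={consistent}")
```

Since RMSE is in the same units as the price, it can be read as a typical prediction error of roughly 2,978 price units per car.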
Module 4 5th Big Data Linear Regression Case Study
SAP Analytics:
Revitalizing the Shopping Center Experience with SAP Analytics. To stay
competitive in the era of e-commerce, AG Real Estate, the largest real
estate player in Belgium, is reinventing the mall experience. This requires
insight into how mall visitors shop, which helps shopping center managers
create experiences that maximize revenue by keeping shoppers coming
back.
https://www.youtube.com/watch?v=lv6ZVr5114k
Module 4 Questions
Module Completed
Module 04