Lecture 3 (DS) - Steps in Data Science Process

Uploaded by

anayabutt658

Data Science

Lecture # 3
Step 1 – Acquiring Data

• By the end of this discussion, you will be able to:

• List techniques and technologies to access and retrieve the data you need
• Describe an example scenario that accesses data from a variety of sources using different technologies

Note: All Images are taken from edx.org

Where’s the Data?
• Identify suitable data related to the problem
• Acquire all available data
• Leaving out even a small amount of data can lead to incorrect conclusions
• Data comes
• From many places, i.e. local and remote
• In many varieties, i.e. structured and unstructured
• At many different velocities, e.g. streaming data

Where’s the Data?

• A lot of data, such as structured data coming from organizations, exists in relational databases
• SQL is used to access this data
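As a rough illustration of accessing relational data with SQL, the sketch below uses Python's built-in sqlite3 module. The in-memory database, the customers table and its rows are hypothetical stand-ins for an organization's real database.

```python
import sqlite3

# Hypothetical in-memory database standing in for an organization's store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Lahore"), ("Bilal", "Karachi"), ("Chen", "Lahore")],
)

# SQL query that retrieves only the rows relevant to the problem.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Lahore",)
).fetchall()
lahore_names = [name for (name,) in rows]
print(lahore_names)
conn.close()
```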

Where’s the Data?

• Data can also exist in files such as text files and Excel spreadsheets
• Scripting languages such as Python, VBA, JavaScript, Perl, PHP, R, Octave and MATLAB are used to get data from files
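A minimal sketch of reading tabular data from a text file with Python's csv module. The file content is a made-up example held in a string so the snippet runs on its own; with a real file, open("sales.csv") (a hypothetical name) would replace io.StringIO.

```python
import csv
import io

# Hypothetical CSV content; a real script would read it from a file on disk.
raw = "day,units\nMon,12\nTue,7\nWed,15\n"

reader = csv.DictReader(io.StringIO(raw))
# Convert the text fields into typed records.
records = [{"day": row["day"], "units": int(row["units"])} for row in reader]
total_units = sum(r["units"] for r in records)
print(records, total_units)
```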

Where’s the Data?

• An increasingly popular way to get data is from websites
• Common formats are XML, JSON etc.
• Many websites host web services to access their data, e.g. REST and WebSocket

Where’s the Data?

• REST stands for Representational State Transfer; it is an approach to implementing web services with performance, scalability and maintainability in mind
• WebSocket services allow real-time notifications from websites
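Data returned by a web service is often JSON. The sketch below parses such a payload with Python's json module. The payload and its field names are invented for illustration; a real REST call would fetch the text over HTTP first.

```python
import json

# Hypothetical response body, shaped like something a REST endpoint might return.
payload = (
    '{"city": "Lahore", "readings": '
    '[{"hour": 9, "temp_c": 31.5}, {"hour": 10, "temp_c": 33.0}]}'
)

data = json.loads(payload)                       # JSON text -> Python dict
temps = [r["temp_c"] for r in data["readings"]]  # pull out one field per reading
max_temp = max(temps)
print(data["city"], max_temp)
```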

Where’s the Data?

• NoSQL storage systems are increasingly used to manage a variety of data
• Examples are Cassandra, MongoDB and HBase
• They provide APIs that allow users to access the data

A Real Example

Summary

Step 2A – Exploring Data

• By the end of this discussion, you will be able to:

• Explain the importance of exploring data

• Identify methods to perform preliminary analysis of your data

Step 2A – Exploring Data

• After getting the data, you might be tempted to immediately build models to analyze it
• We must resist this temptation
• Perform a preliminary investigation to gain a better understanding of the specific characteristics of the data
• We'll be looking for correlations, general trends and outliers

Temptation: The desire to do something, especially something wrong or unwise

Why Explore?
• Correlation graphs explore dependencies between variables
• General trends show how the data is progressing over time
• Outliers are data points that are distant from other data points

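The two checks above can be sketched in plain Python: a Pearson correlation between two variables and a simple outlier flag. The data values and the "more than 2 standard deviations from the mean" rule are assumptions chosen for illustration, not a method prescribed by the lecture.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson(hours, scores)  # close to 1: strong positive dependency

# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 11, 10, 45]
mean = statistics.mean(readings)
spread = statistics.pstdev(readings)
# Assumed rule of thumb: flag points more than 2 standard deviations away.
outliers = [v for v in readings if abs(v - mean) > 2 * spread]
print(round(r, 3), outliers)
```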
Why Explore?
• Summary statistics provide numerical values to describe the data
• Mean and median are measures of the location of a set of values
• Mode is the value that occurs most frequently
• Range and standard deviation are measures of spread in the data

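These summary statistics can be computed with Python's statistics module. The ages list is a hypothetical sample.

```python
import statistics

ages = [23, 25, 25, 31, 38, 25, 44]  # hypothetical customer ages

mean_age = statistics.mean(ages)      # location: arithmetic mean
median_age = statistics.median(ages)  # location: middle value
mode_age = statistics.mode(ages)      # most frequent value
value_range = max(ages) - min(ages)   # spread: range
std_dev = statistics.stdev(ages)      # spread: sample standard deviation
print(mean_age, median_age, mode_age, value_range, std_dev)
```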
Visualize Data
• Heat maps show hot spots
• Histograms show the distribution of the data (e.g. unusual dispersion)
• Boxplots also show the distribution of the data
• Line graphs show the change of a value over time
• Scatter plots show the correlation between two variables

Step 2B – Pre-Processing Data

• By the end of this discussion, you will be able to:

• Identify some problems with real world data

• Describe what is needed to transform raw data into data that can be used for analysis

Step 2B – Pre-Processing Data

Clean: To address data quality issues

Transform: To make it suitable for analysis

Real World Data is Messy!

• Inconsistent values: Customer with two different addresses
• Duplicate records: Customer recorded at two different locations
• Missing values: Missing customer age
• Invalid data: Invalid zip code, e.g. a 6-digit zip code
• Outliers: Values much higher or lower than expected for a period of time, e.g. due to sensor failure

Outliers: Things situated away or detached from the main system

Addressing Data Quality Issues
• Remove data with missing values
• Merge duplicate records
• Generate a best estimate for invalid values
• Remove outliers

• Domain knowledge is required to address these issues

• Keep a record of the changes you make

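A small sketch of these cleaning steps on hypothetical customer records. The field names, the 5-digit zip rule and the choice to blank out an invalid zip (rather than estimate it) are assumptions; which fix is appropriate depends on domain knowledge.

```python
# Hypothetical raw records with typical quality problems.
records = [
    {"id": 1, "age": 34, "zip": "54000"},
    {"id": 2, "age": None, "zip": "75500"},  # missing value
    {"id": 1, "age": 34, "zip": "54000"},    # duplicate record
    {"id": 3, "age": 29, "zip": "540001"},   # invalid: 6-digit zip code
]

change_log = []  # keep a record of every change made
seen_ids = set()
clean = []
for rec in records:
    if rec["age"] is None:
        change_log.append(("removed_missing_age", rec["id"]))
        continue
    if rec["id"] in seen_ids:
        change_log.append(("merged_duplicate", rec["id"]))
        continue
    # Assumed validity rule: a zip code is exactly 5 digits.
    if not (rec["zip"].isdigit() and len(rec["zip"]) == 5):
        change_log.append(("flagged_invalid_zip", rec["id"]))
        rec = {**rec, "zip": None}
    seen_ids.add(rec["id"])
    clean.append(rec)

print(clean)
print(change_log)
```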
Getting Data in Shape
• The second part is to manipulate the clean data into the format needed for analysis; this is called data manipulation, data pre-processing, data wrangling or data munging
• Some operations in data munging
• Scaling
• Transformation
• Feature selection
• Dimensionality reduction
• Data manipulation

Scaling
• Scaling involves changing the range of values, e.g. to between 0 and 1
• E.g. the magnitude of a weight value is much greater than the magnitude of a height value
• Scaling both values to between 0 and 1 equalizes their contributions

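One common way to do this is min-max scaling, sketched below. The weight and height values are hypothetical, and min-max is only one of several possible scaling methods.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

weights_kg = [50, 65, 80, 95]         # hypothetical, large magnitudes
heights_m = [1.50, 1.65, 1.80, 1.95]  # hypothetical, small magnitudes

scaled_weights = min_max_scale(weights_kg)
scaled_heights = min_max_scale(heights_m)
print(scaled_weights)
print(scaled_heights)
```

After scaling, both features lie in the same 0 to 1 range, so neither dominates purely because of its units.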
Transformation

• Transformation reduces noise and variability
• Aggregation is one type of transformation; it produces data with less variability, which is useful for long-term analysis
• E.g. daily sales figures are transformed into weekly or monthly sales figures

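A minimal sketch of aggregation: two weeks of hypothetical daily sales are summed into weekly totals, and the relative variability (standard deviation divided by mean) is compared before and after.

```python
import statistics

# Hypothetical daily sales for two consecutive weeks (Mon..Sun, Mon..Sun).
daily_sales = [120, 95, 130, 110, 150, 200, 180,
               125, 90, 135, 115, 155, 210, 170]

# Aggregate: sum each block of 7 days into one weekly figure.
weekly_sales = [sum(daily_sales[i:i + 7]) for i in range(0, len(daily_sales), 7)]

def relative_spread(values):
    """Standard deviation relative to the mean (coefficient of variation)."""
    return statistics.pstdev(values) / statistics.mean(values)

daily_cv = relative_spread(daily_sales)
weekly_cv = relative_spread(weekly_sales)
print(weekly_sales, daily_cv, weekly_cv)
```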
Feature Selection

• Feature selection involves removing redundant features, combining features and creating new features
• If two features are highly correlated, one of them can be removed
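The correlation rule can be sketched as follows. The feature table and the 0.95 threshold are assumptions for illustration; the loop keeps a feature only if it is not highly correlated with a feature already kept.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical features: height is recorded twice, in two different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "weight_kg": [70, 52, 88, 61, 75],
}

THRESHOLD = 0.95  # assumed cut-off for "highly correlated"
kept = []
for name, values in features.items():
    redundant = any(abs(pearson(values, features[k])) > THRESHOLD for k in kept)
    if not redundant:
        kept.append(name)
print(kept)
```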

Dimensionality Reduction
• It is useful when a dataset has a large number of dimensions
• It involves finding a smaller subset of dimensions that captures most of the variation in the data
• E.g. principal component analysis

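To give a feel for principal component analysis, the sketch below handles only the two-feature case, where the eigenvalues of the covariance matrix have a closed form. The data points are hypothetical, and real projects would normally use a numerical library rather than this hand-written version.

```python
import math

def pca_2d(points):
    """Return (share of variance on the first component, 1-D projections)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(dx * dx for dx, _ in centered) / (n - 1)
    c = sum(dy * dy for _, dy in centered) / (n - 1)
    b = sum(dx * dy for dx, dy in centered) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b * b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Eigenvector belonging to the larger eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    projections = [dx * vx + dy * vy for dx, dy in centered]
    return lam1 / (lam1 + lam2), projections

# Hypothetical points lying almost on a straight line.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
explained, reduced = pca_2d(points)
print(explained, reduced)
```

Here one dimension captures nearly all of the variation, so the two original features can be replaced by the single projected value.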
Data Manipulation

• Raw data often has to be manipulated into the correct format for the analysis
• E.g. it can involve creating groups and capturing the mean, range and standard deviation for each group
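A short sketch of grouping records and summarizing each group. The regions and sales amounts are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical raw records: (region, sale amount).
sales = [("north", 120), ("south", 80), ("north", 150), ("south", 100), ("north", 90)]

# Create the groups.
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Capture mean, range and standard deviation for each group.
summary = {
    region: {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),
    }
    for region, values in groups.items()
}
print(summary)
```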

Summary
• Data preparation is a very important part of the data science process
• This is where we spend most of our time
• It can be tedious but it is a crucial step
• We won't get good results if we don't put in the time and effort, no matter how sophisticated the analysis techniques we use

Step 3 – Analyze Data

• By the end of this discussion, you will be able to:

• Describe what is involved in applying an analysis technique to your data

• List three basic analysis techniques

Categories of Analysis Techniques

• There are different types of problems, so there are different types of analysis techniques. The main techniques are
• Classification
• Regression
• Clustering
• Association analysis
• Graph analytics

Classification

• The goal is to predict the category of the input data
• An example is predicting the weather as sunny, rainy, windy or cloudy
• Another example is identifying handwritten digits as one of 10 categories, i.e. 0 to 9

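As one simple illustration of classification, the sketch below uses a 1-nearest-neighbor rule: a new sample gets the category of the closest labeled sample. The weather measurements and labels are hypothetical, and nearest neighbor is just one of many classification methods.

```python
# Hypothetical labeled samples: (temperature in C, humidity in %) -> category.
training = [
    ((30, 20), "sunny"),
    ((32, 25), "sunny"),
    ((18, 90), "rainy"),
    ((20, 85), "rainy"),
]

def predict(sample):
    """Return the label of the training sample closest to `sample`."""
    def squared_distance(item):
        (t, h), _label = item
        return (t - sample[0]) ** 2 + (h - sample[1]) ** 2
    _features, label = min(training, key=squared_distance)
    return label

hot_dry = predict((29, 30))
cool_humid = predict((19, 80))
print(hot_dry, cool_humid)
```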
Regression

• When the model has to predict a numeric value, it is a regression problem
• An example is predicting the price of a stock over time
• Another example is estimating the weekly sales of a new product

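A minimal regression sketch: fitting a straight line by least squares to hypothetical weekly sales and using it to estimate the next week. A linear model is an assumption here; real sales data may need something richer.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

weeks = [1, 2, 3, 4]       # hypothetical week numbers
sales = [10, 12, 14, 16]   # hypothetical units sold per week

slope, intercept = fit_line(weeks, sales)
week5_estimate = slope * 5 + intercept  # numeric prediction for a new input
print(slope, intercept, week5_estimate)
```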
Clustering

• The goal is to organize similar items into groups
• An example is grouping a company's customers into seniors, teenagers and adults
• Another example is determining different weather groups such as rainy, cold or snowy

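Clustering can be sketched with a one-dimensional k-means on hypothetical customer ages. The starting centroids are picked by hand for repeatability, which is an assumption; k-means is only one clustering method.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Group numbers around the nearest centroid, updating centroids until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # New centroid = mean of its cluster (unchanged if the cluster is empty).
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return clusters, centroids

ages = [15, 16, 17, 35, 38, 40, 70, 72, 75]  # hypothetical customer ages
clusters, centers = kmeans_1d(ages, centroids=[15, 38, 75])
print(clusters, centers)
```

The algorithm only produces unlabeled groups; naming them "teenagers", "adults" and "seniors" is an interpretation added afterwards.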
Association Analysis

• The goal is to find rules that capture associations between items or events
• A common example is market basket analysis, used to understand customer purchasing behavior
• E.g. a banking customer with a CD (certificate of deposit) may also be interested in other investments
• Diaper-beer example

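Two standard measures for an association rule are support (how often the items appear together) and confidence (how often the rule holds when its left side appears). The sketch below computes both for the rule "diapers -> beer" on a hypothetical list of shopping baskets.

```python
# Hypothetical shopping baskets.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]

def count(itemset):
    """Number of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions)

both = count({"diapers", "beer"})
support = both / len(transactions)       # share of all baskets with both items
confidence = both / count({"diapers"})   # share of diaper baskets that also have beer
print(support, confidence)
```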
Graph Analytics
• When data has a lot of entities and connections, as in social networks, we use graph analytics
• E.g. exploring the spread of a disease by analyzing doctors' records
• Identifying security threats by monitoring social media, email etc.

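A small graph sketch: a breadth-first search over a hypothetical contact network counts how many hops each person is from a starting person, which is one basic way to reason about how something could spread.

```python
from collections import deque

# Hypothetical contact network as an adjacency list.
contacts = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def hops_from(graph, start):
    """Breadth-first search: fewest connections from `start` to each reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

hops = hops_from(contacts, "A")
print(hops)
```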
Modeling
• Modeling starts with selecting one of these techniques
• Construct the model using the prepared data
• To validate the model, apply it to new data samples
• Divide the prepared data into a set for constructing the model and a reserved set for evaluating the model
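The division of prepared data can be sketched as a shuffled split. The 80/20 ratio, the fixed seed and the sample list are assumptions made for illustration.

```python
import random

samples = list(range(10))  # hypothetical prepared samples
rng = random.Random(42)    # fixed seed so the split is repeatable

shuffled = samples[:]      # copy, so the original order is untouched
rng.shuffle(shuffled)

split = int(0.8 * len(shuffled))  # assumed 80% for constructing the model
train_set = shuffled[:split]      # used to construct the model
test_set = shuffled[split:]       # reserved for evaluating the model
print(train_set, test_set)
```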
How to Evaluate Each Model?

• For classification and regression, we have the correct output for each sample in our data
• Comparing the correct output with the output predicted by the model provides a way to evaluate the model

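Two common ways to make this comparison are accuracy for classification and mean absolute error for regression. The metrics are chosen here as examples; the lecture does not name specific ones, and the values below are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of samples where the predicted category matches the correct one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average size of the gap between correct and predicted numeric values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Classification: hypothetical correct vs. predicted weather categories.
acc = accuracy(
    ["sunny", "rainy", "sunny", "cloudy"],
    ["sunny", "rainy", "cloudy", "cloudy"],
)

# Regression: hypothetical correct vs. predicted weekly sales.
mae = mean_absolute_error([10, 12, 14], [11, 12, 16])
print(acc, mae)
```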
How to Evaluate Each Model?
• The groups from clustering should be examined to see if they make sense for our application
• E.g. do the customer segments reflect our customer base?
• Are they helpful for use in our targeted marketing campaigns?

How to Evaluate Each Model?

• For association analysis and graph analytics, some investigation is needed to see if the results are correct
• E.g. network traffic delays need to be investigated to see if what our model predicts is actually happening

Determine Next Steps

Summary

Step 4 – Reporting Insights

• By the end of this discussion, you will be able to:

• Determine what to present in reporting your findings

• Identify techniques to communicate your results

What to Present?

• Look at the results and decide what to present
• This means determining which parts of the analysis are most important to our company
• Our findings determine what the next step should be

What to Present?

• All findings must be presented so that informed decisions can be made
• If your conclusions are later found to be wrong, your credibility could be seriously damaged
• It is better to tell a complete and true story, even if it isn't very clean, than to try to finesse things and make them sound clearer than they really are

How to Present?

• Visualization is an important tool in presenting results
• Scatter plots, line graphs etc. are effective ways to represent your results visually
• Tables with details support deeper analysis

Visualization Tools

Step 5 – Turning Insights into Action

• By the end of this discussion, you will be able to:

• Explain what turning insights into action means

• Connect your results with your business question

Step 5 – Turning Insights into Action

• We bring together large datasets to find actionable insights that help answer a scientific or commercial question

Questions
• Business questions
• Is there something wrong in our process?
• Is there data that should be added to our application to make it more accurate?
• Science questions
• Were the benefits from a drug trial statistically significant?
• What is the rate of deforestation? Can you predict how much forest will remain in 15 years?

Implementation

• Now we have to figure out how to implement the actions
• How should it be automated, if it can be?
• Stakeholders need to be identified and involved in this change

Implementation

• We need to monitor and measure the impact of the action on the process
• Be sure to think about what data you should collect during and after the change to properly evaluate its impact

Determine Next Steps

• Big data and data science are only useful if the insights can be turned into actions, and those actions should be carefully defined and evaluated
