Lecture 3 (DS) - Steps in Data Science Process
Lecture # 3
Step 1 – Acquiring Data
• List techniques and technologies to access and retrieve the data you
need
• Describe an example scenario that accesses data from a variety of
sources using different technologies
Where’s the Data?
• Identify suitable data related to the problem
• Acquire all available data
• Leaving out even a small amount of data can lead to incorrect conclusions
• Data comes:
  • From many places, e.g. local and remote
  • In many varieties, e.g. structured and unstructured
  • At many velocities, e.g. the speed at which streaming data arrives
Where’s the Data?
• An increasingly popular way to get data is from websites
• Common formats are XML, JSON, etc.
• Many websites host web services that provide access to their data, e.g. via REST and WebSocket
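To make the XML and JSON formats concrete, here is a minimal sketch of parsing the same readings from each format with the Python standard library. The payloads, the field names (`station`, `hour`, `temp_c`), and the helper functions are hypothetical; they stand in for whatever a real web service would return and do not describe any particular API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads standing in for responses from a web service.
JSON_PAYLOAD = (
    '{"station": "demo", "readings": ['
    '{"hour": 9, "temp_c": 31.5}, {"hour": 10, "temp_c": 33.0}]}'
)
XML_PAYLOAD = """<weather station="demo">
  <reading hour="9" temp_c="31.5"/>
  <reading hour="10" temp_c="33.0"/>
</weather>"""


def readings_from_json(text):
    """Return (hour, temperature) pairs from the JSON payload."""
    doc = json.loads(text)
    return [(r["hour"], r["temp_c"]) for r in doc["readings"]]


def readings_from_xml(text):
    """Return (hour, temperature) pairs from the XML payload."""
    root = ET.fromstring(text)
    # XML attributes are always strings, so convert them explicitly.
    return [(int(r.get("hour")), float(r.get("temp_c")))
            for r in root.findall("reading")]


if __name__ == "__main__":
    print(readings_from_json(JSON_PAYLOAD))
    print(readings_from_xml(XML_PAYLOAD))
```

Both functions yield the same list of pairs, which shows that once the data is parsed, later steps need not care which format it arrived in.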
A Real Example
Summary
Step 2A – Exploring Data
Step 2B – Pre-Processing Data
• Pre-processing includes removing redundant features, combining features, and creating new features
• If two features are highly correlated, one of them can be removed
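The correlation rule can be sketched as follows: compute the Pearson correlation between features and keep a feature only if it is not highly correlated with one already kept. The feature names, the values, and the 0.95 threshold are illustrative assumptions, not part of the lecture.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |r| < threshold against every feature kept so far."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical data: the same height recorded in two units, plus an unrelated feature.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age_years": [34.0, 21.0, 45.0, 29.0, 38.0],
}

if __name__ == "__main__":
    print(list(drop_correlated(features)))
```

Which of two correlated features is kept depends only on their order here; in practice that choice usually rests on domain knowledge or data quality.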
Dimensionality Reduction
• It is useful when a dataset has a large number of dimensions
• It involves finding a smaller set of dimensions that captures most of the variation in the data
• E.g. principal component analysis (PCA)
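As a rough illustration of principal component analysis, the sketch below finds only the first principal component of a small dataset and reports how much of the total variance it captures. It uses power iteration on the covariance matrix instead of a full eigendecomposition, purely to stay within the standard library; a numerical library routine would normally be used. The data points are hypothetical and lie close to the line y = 2x.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_fraction) for the first principal component.

    Power iteration is a simplification: it assumes the starting vector is not
    orthogonal to the top eigenvector and that the data is not constant.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Repeated multiplication by the covariance matrix converges to the
    # eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance along v (v has unit length).
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical 2-D data lying close to the line y = 2x.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]

if __name__ == "__main__":
    direction, fraction = first_principal_component(rows)
    print(direction, fraction)
```

For this data a single dimension captures nearly all of the variation, so the two original dimensions could be replaced by one with little loss.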
Data Manipulation
Summary
• Data preparation is a very important part of the data science process
• This is where we spend most of our time
• It can be tedious, but it is a crucial step
• We will not get good results without putting in time and effort, no matter how sophisticated the techniques we use for analysis
Step 3 – Analyze Data
Categories of Analysis Techniques
Summary
Step 4 – Reporting Insights
What to Present?
How to Present?
• Visualization is an important tool for presenting results
• Scatter plots, line graphs, etc. are effective ways to represent your results visually
• Tables with details support deeper analysis
Visualization Tools
Step 5 – Turning Insights into Action
Questions
• Business questions
  • Is there something wrong in our process?
  • Is there data that should be added to our application to make it more accurate?
• Science questions
  • Were the benefits from a drug trial statistically significant?
  • What is the rate of deforestation? Can you predict how much forest will remain in 15 years?
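One simple way to approach the deforestation question is to fit a straight line to past measurements and extrapolate it. The sketch below does this with ordinary least squares. The years and forest-area figures are made up for illustration only, and a linear trend is itself an assumption that real data may not support.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical measurements: forest area (thousand square km) by year.
years = [2000, 2005, 2010, 2015, 2020]
area = [100.0, 96.0, 92.0, 88.0, 84.0]

slope, intercept = fit_line(years, area)
# Extrapolate 15 years past the last measurement.
predicted = slope * (years[-1] + 15) + intercept

if __name__ == "__main__":
    print(f"rate of change per year: {slope:.2f}")
    print(f"predicted area in {years[-1] + 15}: {predicted:.1f}")
```

The fitted slope answers the "rate" part of the question and the extrapolated value answers the "how much will remain" part, subject to the trend continuing unchanged.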
Implementation