
Lecture 3 (DS) - Steps in Data Science Process

steps in data science process

Uploaded by

anayabutt658
Data Science

Lecture # 3
Step 1 – Acquiring Data

• By the end of this discussion, you will be able to:

• List techniques and technologies to access and retrieve the data you
need
• Describe an example scenario that accesses data from a variety of
sources using different technologies

Note: All Images are taken from edx.org


Where’s the Data?
• Identify suitable data related to the problem
• Acquire all available data
• Leaving out even a small amount of data can lead to incorrect conclusions
• Data comes from many places, i.e. local and remote
• Data comes in many varieties, i.e. structured and unstructured
• Data comes at many different velocities, e.g. streaming data
Where’s the Data?

• A lot of data exists in relational databases, such as the structured data held by organizations
• SQL is used to access this data
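A minimal sketch of accessing relational data with SQL, using Python's built-in sqlite3 module. The customers table and its rows are made up for illustration; a real organization's database would be reached through its own driver and connection details.

```python
import sqlite3

# Hypothetical in-memory database standing in for an organization's relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ali", "Lahore"), ("Sara", "Karachi"), ("Omar", "Lahore")],
)

# SQL retrieves only the rows and columns the analysis needs.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Lahore",)
).fetchall()
lahore_customers = [name for (name,) in rows]
print(lahore_customers)
conn.close()
```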
Where’s the Data?

• Data can also exist in files such as text files and Excel spreadsheets
• Scripting languages such as Python, VBA, JavaScript, Perl, PHP, R, Octave and MATLAB are used to get data from files
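As a rough illustration of getting data from a text file with a scripting language, the sketch below reads comma-separated records with Python's csv module. The column names and values are invented, and the text is held in memory so the example is self-contained.

```python
import csv
import io

# Hypothetical file contents; with a real file, use open("sales.csv", newline="") instead.
raw_text = "date,store,amount\n2024-01-01,A,120.5\n2024-01-01,B,80.0\n2024-01-02,A,99.5\n"

# DictReader maps each row to the column names in the header line.
reader = csv.DictReader(io.StringIO(raw_text))
records = [
    {"date": row["date"], "store": row["store"], "amount": float(row["amount"])}
    for row in reader
]
total_amount = sum(r["amount"] for r in records)
print(len(records), total_amount)
```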
Where’s the Data?

• An increasingly popular way to get data is from websites
• Common formats are XML and JSON
• Many websites host web services, e.g. REST and WebSocket, to provide access to their data
Where’s the Data?

• REST stands for Representational State Transfer. It is an approach to implementing web services with performance, scalability and maintainability in mind

• WebSocket services allow real-time notifications from websites
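Data returned by a web service usually arrives as text in a format such as JSON. The sketch below parses a made-up response body with Python's json module. The station name and readings are assumptions for illustration, and the network request itself is left out.

```python
import json

# Hypothetical response body, as a REST endpoint might return it.
# A real client would fetch this text over HTTP before parsing it.
response_body = """
{"station": "LHE", "readings": [
    {"time": "09:00", "temp_c": 21.5},
    {"time": "10:00", "temp_c": 23.0}
]}
"""

payload = json.loads(response_body)  # JSON text becomes dicts and lists
temps = [r["temp_c"] for r in payload["readings"]]
print(payload["station"], max(temps))
```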
Where’s the Data?

• NoSQL storage systems are increasingly used to manage a variety of data
• Examples are Cassandra, MongoDB and HBase
• They provide APIs that allow users to access the data
A Real Example

Summary

Step 2A – Exploring Data

• By the end of this discussion, you will be able to:

• Explain the importance of exploring data


• Identify methods to perform preliminary analysis of your data

Step 2A – Exploring Data

• After getting data you might be tempted to immediately build models to analyze the data
• We must resist this temptation
• Perform a preliminary investigation to gain a better understanding of the specific characteristics of the data
• We’ll be looking for correlations, general trends and outliers
Temptation: The desire to do something, especially something wrong or unwise
Why Explore?
• Correlation graphs explore dependencies between variables
• General trends show how data is progressing over time
• Outliers are data points that are distant from other data points
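One possible way to look for outliers during exploration is the interquartile-range rule, sketched below. The rule and the 1.5 multiplier are a common convention chosen here for illustration, not something the lecture prescribes, and the sensor readings are made up.

```python
from statistics import quantiles

# Hypothetical sensor readings; one value is far from the rest.
readings = [20.1, 19.8, 20.4, 20.0, 95.0, 19.9, 20.2]

# Quartiles split the sorted data into four parts.
q1, _, q3 = quantiles(readings, n=4)
iqr = q3 - q1

# Points more than 1.5 * IQR outside the quartiles are flagged.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in readings if r < low_fence or r > high_fence]
print(outliers)
```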
Why Explore?
• Summary statistics provide numerical values to describe data
• Mean and median are measures of location, i.e. where the values are centered
• Mode is the value that occurs most frequently
• Range and standard deviation are measures of spread in the data
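The summary statistics above can be computed with Python's statistics module, as in this small sketch. The data values are invented, and the sample standard deviation is used here as one reasonable choice.

```python
from statistics import mean, median, mode, stdev

values = [4, 8, 8, 5, 10]  # hypothetical measurements

location_mean = mean(values)             # arithmetic average
location_median = median(values)         # middle value of the sorted data
most_frequent = mode(values)             # value that occurs most often
value_range = max(values) - min(values)  # spread: largest minus smallest
spread_std = stdev(values)               # spread: sample standard deviation

print(location_mean, location_median, most_frequent, value_range, spread_std)
```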
Visualize Data
• Heat maps show hot spots
• Histograms show the data distribution (e.g. unusual dispersion)
• Boxplots also show the data distribution
• Line graphs show the change of a value over time
• Scatter plots show the correlation between two variables
Step 2B – Pre-Processing Data

• By the end of this discussion, you will be able to:

• Identify some problems with real world data


• Describe what is needed to transform raw data to data that can be used
for analysis

Step 2B – Pre-Processing Data

Clean: To address data quality issues

Transform: To make it suitable for analysis
Real World Data is Messy!

• Inconsistent values: Customer with two different addresses
• Duplicate records: Customer recorded at two different locations
• Missing values: Missing customer age
• Invalid data: Invalid zip code, e.g. a 6-digit zip code
• Outliers: Due to sensor failure, values are much higher or lower than expected for a period of time

Outliers: Things situated away or detached from main system


Addressing Data Quality Issues
• Remove data with missing values
• Merge duplicate records
• Generate best estimate for invalid values
• Remove outliers

• Domain knowledge is required to address these issues
• Keep a record of the changes you made
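A minimal cleaning sketch under stated assumptions: the customer records are invented, a valid zip code is assumed to have 5 digits, and invalid values are dropped rather than estimated, because a good estimate needs domain knowledge. Every change is written to a log.

```python
# Hypothetical customer records showing the problems listed above.
raw_records = [
    {"id": 1, "age": 34, "zip": "54000"},
    {"id": 1, "age": 34, "zip": "54000"},    # duplicate record
    {"id": 2, "age": None, "zip": "75500"},  # missing value
    {"id": 3, "age": 41, "zip": "540001"},   # invalid: 6 digits instead of 5
]

change_log = []  # keep a record of every change made
seen_ids = set()
clean_records = []
for rec in raw_records:
    if rec["id"] in seen_ids:
        change_log.append(f"dropped duplicate id={rec['id']}")
        continue
    seen_ids.add(rec["id"])
    if rec["age"] is None:
        change_log.append(f"removed id={rec['id']}: missing age")
        continue
    if not (rec["zip"].isdigit() and len(rec["zip"]) == 5):
        change_log.append(f"removed id={rec['id']}: invalid zip")
        continue
    clean_records.append(rec)

print(clean_records)
print(change_log)
```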
Getting Data in Shape
• The second part is to manipulate the clean data into the format needed for analysis. This is called data manipulation, data pre-processing, data wrangling or data munging
• Some operations in data munging
• Scaling
• Transformation
• Feature selection
• Dimensionality reduction
• Data manipulation
Scaling
• Scaling involves changing the range of values, such as to between 0 and 1
• E.g. the magnitude of a weight value is much greater than the magnitude of a height value
• Scaling both values to between 0 and 1 will equalize their contributions
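Scaling to the range 0 to 1 can be sketched with min-max scaling, one common choice among several. The weights and heights below are made-up values.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

weights_kg = [50.0, 65.0, 80.0]  # large magnitudes
heights_m = [1.50, 1.70, 1.90]   # small magnitudes

scaled_weights = min_max_scale(weights_kg)
scaled_heights = min_max_scale(heights_m)
print(scaled_weights, scaled_heights)
```

After scaling, both features span the same range, so neither dominates a distance calculation simply because of its units.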
Transformation

• It reduces noise and variability
• Aggregation is one type of transformation; it results in data with less variability and is used in long-term analysis
• E.g. daily sales figures are transformed into weekly or monthly sales figures
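The daily-to-weekly example can be sketched as follows. The sales figures are invented, and grouping by ISO week number is an assumption about how weeks are defined.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales figures.
daily_sales = {
    date(2024, 1, 1): 100,
    date(2024, 1, 3): 150,
    date(2024, 1, 7): 50,
    date(2024, 1, 8): 200,
    date(2024, 1, 10): 100,
}

# Aggregate: sum every day's amount into its (ISO year, ISO week) bucket.
weekly_sales = defaultdict(int)
for day, amount in daily_sales.items():
    iso_year, iso_week, _ = day.isocalendar()
    weekly_sales[(iso_year, iso_week)] += amount

print(dict(weekly_sales))
```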
Feature Selection

• It involves removing redundant features, combining features and creating new features
• If two features are highly correlated, one can be removed
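A rough sketch of removing a redundant feature based on correlation. The feature names, values and the 0.95 threshold are assumptions for illustration. Pearson correlation is used as the measure.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical features; height in inches duplicates height in centimetres.
features = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30, 22, 41, 35],
}

threshold = 0.95  # assumed cut-off for "highly correlated"
kept = []
for name, values in features.items():
    # Keep a feature only if it is not highly correlated with one already kept.
    if all(abs(pearson(values, features[k])) < threshold for k in kept):
        kept.append(name)

print(kept)
```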
Dimensionality Reduction
• It is useful when a dataset has a large number of dimensions
• It involves finding a smaller subset of dimensions that captures most of the variation in the data
• E.g. principal component analysis
Data Manipulation

• Raw data often has to be manipulated to be in the correct format for the analysis
• It involves creating groups and capturing the mean, range and standard deviation for each group
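Grouping and per-group statistics can be sketched as below. The regions and readings are made up, and the population standard deviation is used as one possible choice.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (group, value) pairs.
readings = [
    ("north", 10.0), ("north", 14.0),
    ("south", 20.0), ("south", 20.0), ("south", 26.0),
]

# Create the groups.
groups = defaultdict(list)
for region, value in readings:
    groups[region].append(value)

# Capture mean, range and standard deviation for each group.
group_stats = {
    region: {
        "mean": mean(vals),
        "range": max(vals) - min(vals),
        "std": pstdev(vals),
    }
    for region, vals in groups.items()
}
print(group_stats)
```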
Summary
• Data preparation is a very important part of the data science process
• Here we spend most of our time
• It can be tedious but is a crucial step
• We won’t get good results if we don’t put in time and effort, no matter how sophisticated the techniques we use for analysis
Step 3 – Analyze Data

• By the end of this discussion, you will be able to:

• Describe what is involved in applying an analysis technique to your data


• List three basic analysis techniques

Categories of Analysis Techniques

• There are different types of problems, so there are different types of analysis techniques. The main techniques are
• Classification
• Regression
• Clustering
• Association analysis
• Graph analysis
Classification

• The goal is to predict the category of the input data
• An example is predicting the weather as sunny, rainy, windy or cloudy
• Another example is identifying handwritten digits as being one of 10 categories, i.e. 0 to 9
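To make the idea of predicting a category concrete, here is a minimal 1-nearest-neighbor classifier. This is just one simple classification method picked for illustration. The humidity and wind features and the training points are invented.

```python
def nearest_neighbor_predict(train, query):
    """Return the label of the training point closest to query (1-nearest-neighbor)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda item: sq_dist(item[0], query))
    return label

# Hypothetical training data: (humidity %, wind km/h) with a weather label.
train = [((85, 10), "rainy"), ((30, 5), "sunny"), ((50, 40), "windy")]

prediction = nearest_neighbor_predict(train, (80, 12))
print(prediction)
```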
Regression

• When our model has to predict a numeric value, it becomes a regression problem
• An example is predicting the price of a stock over time
• Another example is estimating the weekly sales of a new product
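A small regression sketch: fitting a straight line by ordinary least squares and using it to estimate the next week's sales. The weekly figures are made up, and a straight line is assumed to be an adequate model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept

weeks = [1, 2, 3, 4]
sales = [110, 120, 130, 140]  # hypothetical weekly sales

slope, intercept = fit_line(weeks, sales)
forecast_week_5 = slope * 5 + intercept  # the predicted value is numeric
print(slope, intercept, forecast_week_5)
```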
Clustering

• The goal is to organize similar items into groups
• An example is grouping a company’s customers as seniors, teenagers and adults
• Another example is determining different weather groups such as rainy, cold or snowy
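Grouping customers by age can be sketched with a one-dimensional k-means, one of several clustering methods. The ages, the three starting centers and the fixed iteration count are assumptions for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Simple k-means on numbers: assign to nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # New center is the mean of its cluster; an empty cluster keeps its old center.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

ages = [15, 16, 17, 35, 40, 45, 70, 72, 75]  # hypothetical customer ages
centers, clusters = kmeans_1d(ages, centers=[10.0, 40.0, 80.0])
print(clusters)
```

The algorithm only produces groups. Naming them teenagers, adults and seniors is an interpretation made afterwards.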
Association Analysis

• The goal is to find rules that capture associations between items or events
• A common example is market basket analysis to understand customer purchasing behavior
• E.g. a banking customer with a CD (certificate of deposit) is also interested in other investments
• Diaper-beer example
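Association rules are commonly scored by support and confidence. The sketch below computes both for the rule "diapers implies beer" on a handful of invented baskets.

```python
# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"diapers", "beer"})
rule_confidence = confidence({"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```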
Graph Analytics
• When data has a lot of entities and connections, as in social networks, we use graph analytics
• E.g. exploring the spread of disease by analyzing doctors’ records
• Identifying security threats by monitoring social media, email etc.
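As a toy illustration of graph analytics, the sketch below uses breadth-first search to find everyone within two contacts of a starting person, which is one simple way to explore how something could spread. The contact network is invented.

```python
from collections import deque

# Hypothetical contact network: person -> people they were in contact with.
contacts = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dan"],
    "Cara": ["Ann"],
    "Dan": ["Bob"],
    "Eve": [],
}

def reachable_within(graph, start, max_hops):
    """Breadth-first search returning each reachable node and its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

exposure = reachable_within(contacts, "Ann", max_hops=2)
print(exposure)
```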
Modeling
• Modeling starts with selecting one of these techniques
• Construct the model using the prepared data
• To validate the model, apply it to new data samples
• Divide the prepared data into a set for constructing the model and reserve some for evaluating the model
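Dividing the prepared data can be sketched as a simple random hold-out split. The 25% test fraction and the fixed seed are assumptions, and the samples are placeholders.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and reserve a fraction for evaluation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(20))  # placeholder for prepared data rows
train_set, test_set = train_test_split(samples)
print(len(train_set), len(test_set))
```

The model is constructed from train_set only. test_set plays the role of new data samples.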
How to Evaluate Each Model?

• For classification and regression we’ll have the correct output for each sample in our data
• Comparing the correct output with the output predicted by the model provides a way to evaluate the model
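Comparing correct and predicted outputs can be sketched with two common measures: accuracy for classification and mean absolute error for regression. The choice of measures and all values are assumptions for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of samples whose predicted category matches the correct one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average size of the difference between correct and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical correct outputs and model predictions.
class_accuracy = accuracy(
    ["sunny", "rainy", "windy", "sunny"],
    ["sunny", "rainy", "sunny", "sunny"],
)
sales_error = mean_absolute_error([100.0, 150.0, 200.0], [110.0, 140.0, 200.0])
print(class_accuracy, sales_error)
```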
How to Evaluate Each Model?
• The groups from clustering should be examined to see if they make sense for our application
• E.g. do the customer segments reflect our customer base?
• Are they helpful for use in our targeted marketing campaigns?
How to Evaluate Each Model?

• For association analysis and graph analysis, some investigation will be needed to see if the results are correct
• E.g. network traffic delays need to be investigated to see if what our model predicts is actually happening
Determine Next Steps

Summary

Step 4 – Reporting Insights

• By the end of this discussion, you will be able to:

• Determine what to present in reporting your findings


• Identify techniques to communicate your results

What to Present?

• Look at the results and decide what to present
• This means determining which part of the analysis is most important to our company
• Our findings determine what the next step should be
What to Present?

• All findings must be presented so that informed decisions can be made
• If your conclusions are later found to be wrong, your credibility could be seriously damaged
• It is better to tell a complete and true story, even if it isn’t very clean, than to try to finesse things and make them sound clearer than they really are
How to Present?

• Visualization is an important tool in presenting results
• Scatter plots, line graphs etc. are effective ways to represent your results visually
• Tables with details support deeper analysis
Visualization Tools

Step 5 – Turning Insights into Action

• By the end of this discussion, you will be able to:

• Explain what turning insights into action means


• Connect your results with your business question

Step 5 – Turning Insights into Action

• We bring together large datasets to find actionable insights that help answer a scientific or commercial question
Questions
• Business questions
• Is there something wrong in our process?
• Is there data that should be added to our application to make it more accurate?
• Science questions
• Were the benefits from a drug trial statistically significant?
• What is the rate of deforestation? Can you predict how much forest will remain in 15 years?
Implementation

• Now we have to figure out how to implement the actions
• How should it be automated, if it can be?
• Stakeholders need to be identified and involved in this change
Implementation

• We need to monitor and measure the impact of the action on the process
• Be sure to think about what data you should collect during and after the change to properly evaluate its impact
Determine Next Steps

• Big data and data science are only useful if the insights can be turned into actions, and the actions should be carefully defined and evaluated
