
Module 2

Exploratory Data Analysis and the Data Science Process

Basic tools (plots, graphs, and summary statistics) of EDA, Philosophy of EDA, The Data Science Process, Case Study: RealDirect (online real estate firm). Three Basic Machine Learning Algorithms: Linear Regression, k-Nearest Neighbours (k-NN), k-means.
Exploratory Data Analysis


“Exploratory data analysis” is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there. — John Tukey

Exploratory data analysis (EDA) is the first step toward building a model.

The basic tools of EDA are plots, graphs and
summary statistics.

Generally speaking, it’s a method of
systematically going through the data, plotting
distributions of all variables (using box plots),
plotting time series of data, transforming
variables, looking at all pairwise relationships
between variables using scatterplot matrices,
and generating summary statistics for all of
them.

At the very least that would mean computing
their mean, minimum, maximum, the upper and
lower quartiles, and identifying outliers.
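
As a minimal sketch of these basic tools, assuming pandas and matplotlib are available (the tiny DataFrame below is made up purely for illustration), you might compute summary statistics, box plots, and a scatterplot matrix like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up numeric data standing in for a real dataset.
df = pd.DataFrame({
    "num_visits": [3, 7, 4, 6, 10, 9],
    "time_spent": [43, 276, 82, 136, 417, 269],
})

# Summary statistics: mean, min, max, and the quartiles.
print(df.describe())

# Distributions of all variables using box plots.
df.plot(kind="box", subplots=True, layout=(1, 2), figsize=(8, 3))

# All pairwise relationships via a scatterplot matrix.
pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.show()
```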
Philosophy of Exploratory Data
Analysis

Long before worrying about how to convince others, you
first have to understand what’s happening yourself. —
Andrew Gelman

There are important reasons anyone working with data
should do EDA.

Namely, to gain intuition about the data; to make
comparisons between distributions; for sanity checking
(making sure the data is on the scale you expect, in the
format you thought it should be); to find out where data is
missing or if there are outliers; and to summarize the
data.

In the context of data generated from logs, EDA
also helps with debugging the logging process.

For example, “patterns” you find in the data
could actually be something wrong in the
logging process that needs to be fixed.

If you never go to the trouble of debugging,
you’ll continue to think your patterns are real.

The engineers we’ve worked with are always
grateful for help in this area.

In the end, EDA helps you make sure the product is performing as intended.
The Data Science Process

First we have the Real World.

Inside the Real World are lots of people busy at
various activities.

Some people are using Google+, others are
competing in the Olympics; there are spammers
sending spam, and there are people getting
their blood drawn. Say we have data on one of
these things.

Specifically, we’ll start with raw data—logs,
Olympics records, Enron employee emails, or
recorded genetic material (note there are lots of
aspects to these activities already lost even
when we have that raw data).

We want to process this to make it clean for
analysis.

So we build and use pipelines of data munging:
joining, scraping, wrangling, or whatever you
want to call it.

To do this we use tools such as Python, shell scripts, R, or SQL, or all of the above.

Eventually we get the data down to a nice format, like something with columns: name | event | year | gender | event time
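
A minimal munging sketch in Python/pandas, assuming a hypothetical raw CSV of Olympic results (the file name and raw column names are invented for illustration), might look like:

```python
import pandas as pd

# Hypothetical raw export; the file and column names are assumptions.
raw = pd.read_csv("olympics_raw.csv")

# Wrangle it down to the clean format described above:
# name | event | year | gender | event time
clean = (
    raw.rename(columns={"athlete": "name", "result": "event_time"})
       .loc[:, ["name", "event", "year", "gender", "event_time"]]
       .drop_duplicates()
       .dropna(subset=["event_time"])
)

clean.to_csv("olympics_clean.csv", index=False)
```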

Once we have this clean dataset, we should be
doing some kind of EDA.

In the course of doing EDA, we may realize that
it isn’t actually clean because of duplicates,
missing values, absurd outliers, and data that
wasn’t actually logged or incorrectly logged.

If that’s the case, we may have to go back to
collect more data, or spend more time cleaning
the dataset.

Next, we design our model to use some
algorithm like k-nearest neighbor (k-NN), linear
regression, Naive Bayes, or something else.

The model we choose depends on the type of
problem we’re trying to solve, of course, which
could be a classification problem, a prediction
problem, or a basic description problem.

We then can interpret, visualize, report, or
communicate our results.

This could take the form of reporting the results up to
our boss or coworkers, or publishing a paper in a
journal and going out and giving academic talks about
it.

Alternatively, our goal may be to build or prototype a
“data product”;

e.g., a spam classifier, a search ranking algorithm, or a recommendation system.
Case Study: RealDirect

Doug Perlson, the CEO of RealDirect, has a background in
real estate law, startups, and online advertising.

His goal with RealDirect is to use all the data he can access
about real estate to improve the way people sell and buy
houses.

Normally, people sell their homes about once every seven
years, and they do so with the help of professional brokers
and current data.

But there’s a problem both with the broker system and the
data quality.

RealDirect addresses both of them.

First, the brokers. They are typically “free agents”
operating on their own—think of them as home sales
consultants.

This means that they guard their data aggressively, and
the really good ones have lots of experience.

But in the grand scheme of things, that really means
they have only slightly more data than the
inexperienced brokers.

RealDirect is addressing this problem by hiring a team of licensed real estate agents who work together and pool their knowledge.

To accomplish this, it built an interface for sellers, giving them useful data-driven tips on how to sell their house.

It also uses interaction data to give real-time
recommendations on what to do next.

One problem with publicly available data is that it’s old
news—there’s a three-month lag between a sale and
when the data about that sale is available.

RealDirect is working on real-time feeds on things like
when people start searching for a home, what the
initial offer is, the time between offer and close, and
how people search for a home online.

First, it offers a subscription to sellers (about $395 a month) to access the selling tools.

Second, it allows sellers to use RealDirect’s agents at
reduced commission, typically 2% of the sale instead
of the usual 2.5% or 3%.

This is where the magic of data pooling comes in: it
allows RealDirect to take a smaller commission
because it’s more optimized, and therefore gets more
volume.
RealDirect Data Strategy

Explore its existing website, thinking about how buyers and sellers would navigate through it, and how the website is structured/organized. Try to understand the existing business model, and think about how analysis of RealDirect user-behavior data could be used to inform decision-making and product development.

Because there is no data yet for you to analyze (typical in a start-up when it's still building its product), you should get some auxiliary data to help gain intuition about this market.

Summarize your findings in a brief report aimed at the CEO.

Being the “data scientist” often involves speaking to people who aren't also data scientists, so it would be ideal to have a set of communication strategies for getting to the information you need about the data.

Most of you are not “domain experts” in real estate
or online businesses.

Doug mentioned the company didn’t necessarily
have a data strategy. There is no industry standard
for creating one.
Three Basic Machine Learning
Algorithms:

Linear Regression,

k-Nearest Neighbours (k-NN),

k-means
Linear Regression

When you use it, you are making the assumption that there is a linear relationship between an outcome variable (sometimes also called the response variable, dependent variable, or label) and a predictor (sometimes also called an independent variable, explanatory variable, or feature); or between one variable and several other variables, in which case you're modeling the relationship as having a linear structure.

It makes sense to use it when changes in one variable correlate linearly with changes in another variable.

For example, it makes sense that the more umbrellas you sell,
the more money you make.

Example 1: An overly simplistic example to start

Suppose you run a social networking site that
charges a monthly subscription fee of $25, and that
this is your only source of revenue.

Each month you collect data and count your number
of users and total revenue.

Here are the first four: S = {(x, y)} = {(1, 25), (10, 250), (100, 2500), (200, 5000)}

y = 25x

There’s a linear pattern.

The coefficient relating x and y is 25.

It seems deterministic.
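
As a quick check (a sketch using NumPy, not part of the original example), fitting a line to these four points recovers the slope of 25 and an intercept of essentially zero:

```python
import numpy as np

# The four observed (users, revenue) pairs from the example.
x = np.array([1, 10, 100, 200])
y = np.array([25, 250, 2500, 5000])

# Fit y = b1*x + b0; with deterministic data the fit is exact.
b1, b0 = np.polyfit(x, y, deg=1)
print(b1, b0)  # slope 25.0, intercept ~0.0
```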

Example 2: Looking at data at the user level

Say you have a dataset keyed by user (meaning
each row contains data for a single user), and the
columns represent user behavior on a social
networking site over a period of a week.

Let’s say you feel comfortable that the data is
clean at this stage and that you have on the order
of hundreds of thousands of users.

The names of the columns are total_num_friends, total_new_friends_this_week, num_visits, time_spent, number_apps_downloaded, number_ads_shown, gender, age, and so on.

But for now, you are simply trying to build intuition and understand your dataset. You eyeball the first few rows of two of the columns (new friends gained this week and time spent on the site) and see:

total_new_friends_this_week | time_spent
7  | 276
3  | 43
4  | 82
6  | 136
10 | 417
9  | 269

Now, your brain can't figure out what's going on by just looking at the raw numbers, so you plot them (see the sketch below). It looks like there's kind of a linear relationship here, and it makes sense: the more new friends you have, the more time you might spend on the site.
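
A minimal plotting sketch for the rows above (using matplotlib; the column pairing is taken from the discussion):

```python
import matplotlib.pyplot as plt

new_friends = [7, 3, 4, 6, 10, 9]
time_spent = [276, 43, 82, 136, 417, 269]

# Eyeball whether the relationship looks roughly linear.
plt.scatter(new_friends, time_spent)
plt.xlabel("total_new_friends_this_week")
plt.ylabel("time_spent")
plt.show()
```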

Many possible lines could describe this pattern, so how do you pick which one?

Because you're assuming a linear relationship, start your model by assuming the functional form to be:

y = β₀ + β₁x

Now your job is to find the best choices for β₀ and β₁ using the observed data to estimate them: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ).

Writing this with matrix notation results in this: y = x · β

There you go: you've written down your model. Now the rest is fitting the model.

Fitting the model

So, how do you calculate β? The intuition
behind linear regression is that you want to find
the line that minimizes the distance between all
the points and the line.

Many lines look approximately correct, but your
goal is to find the optimal one.

You want to minimize your prediction errors.
This method is called least squares estimation.

To find this line, you'll define the "residual sum of squares" (RSS), denoted RSS(β), to be:

RSS(β) = Σᵢ (yᵢ − βxᵢ)²
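
As a sketch of least squares estimation on the toy rows above (np.polyfit minimizes the residual sum of squares for us; the explicit RSS computation is shown for comparison):

```python
import numpy as np

# Toy data: new friends this week vs. time spent on the site.
x = np.array([7, 3, 4, 6, 10, 9])
y = np.array([276, 43, 82, 136, 417, 269])

# Least squares fit of y = beta0 + beta1 * x.
beta1, beta0 = np.polyfit(x, y, deg=1)
print(f"fitted model: y = {beta0:.1f} + {beta1:.1f}x")

# Residual sum of squares for the fitted line.
residuals = y - (beta0 + beta1 * x)
print("RSS:", np.sum(residuals ** 2))
```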
k-Nearest Neighbors (k-NN)

Lazy Learning

k-NN is an algorithm that can be used when you have a bunch of objects that have been classified or labeled in some way, and other similar objects that haven't been classified or labeled yet, and you want a way to automatically label them.

The intuition behind k-NN is to consider the k most similar other items (defined in terms of their attributes), look at their labels, and give the unassigned item the label held by the majority. If there's a tie, you randomly select among the labels that have tied for first.

Example with credit scores

Say you have the age, income, and a credit
category of high or low for a bunch of people
and you want to use the age and income to
predict the credit label of “high” or “low” for a
new person.

1. Decide on your similarity or distance metric.
2. Split the original labeled dataset into training
and test data.
3. Pick an evaluation metric.
4. Run k-NN a few times, changing k and
checking the evaluation measure.
5. Optimize k by picking the one with the best
evaluation measure.
6. Once you've chosen k, use the same training set and create a new test set with the ages and incomes of the people you have no labels for and want to predict (see the sketch below).
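
A minimal sketch of this workflow with scikit-learn, using a small made-up table of ages and incomes (all values and labels are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Made-up labeled data: [age, income in $1000s] -> credit label.
X = np.array([[25, 40], [35, 60], [45, 80], [20, 20],
              [50, 90], [30, 30], [60, 120], [40, 45]])
y = np.array(["low", "high", "high", "low",
              "high", "low", "high", "low"])

# Step 2: split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Steps 1 and 3-5: Euclidean distance (the default), accuracy as the
# evaluation metric; try a few values of k and keep the best one.
best_k, best_acc = 1, 0.0
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Step 6: label new, unlabeled people with the chosen k.
model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(model.predict([[28, 35], [55, 100]]))
```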

Similarity or distance metrics

Euclidean distance

Cosine Similarity

Jaccard Distance or Similarity

Mahalanobis Distance

Hamming Distance

Manhattan Distance
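
A few of these metrics, sketched with SciPy (Mahalanobis is omitted here because it also needs an inverse covariance matrix):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine, hamming, jaccard

a = np.array([25.0, 40.0])
b = np.array([35.0, 60.0])

print("Euclidean:", euclidean(a, b))
print("Manhattan:", cityblock(a, b))
print("Cosine distance:", cosine(a, b))  # 1 - cosine similarity

# Hamming and Jaccard compare equal-length (here boolean) sequences.
u = [1, 0, 1, 1]
v = [1, 1, 0, 1]
print("Hamming:", hamming(u, v))
print("Jaccard distance:", jaccard(u, v))
```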
k-means

unsupervised learning technique


k-means is the first unsupervised learning
technique we’ll look into, where the goal of the
algorithm is to determine the definition of the
right answer by finding clusters of data for you.


k = the number of clusters

1. Initially, you randomly pick k centroids (points that will be the centers of your clusters) in d-dimensional space. Try to make them near the data but different from one another.
2. Then assign each data point to the closest centroid.
3. Move each centroid to the average location of the data points assigned to it (e.g., the users, if you're clustering users).
4. Repeat the preceding two steps until the assignments don't change, or change very little.
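
A minimal sketch of these steps using scikit-learn's KMeans (the 2-D points below are made up; in practice they might be users described by age and income):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points, e.g., (age, income) for a handful of users.
X = np.array([[25, 40], [27, 42], [30, 45],
              [55, 90], [60, 95], [58, 88]])

# Steps 1-4: pick k centroids, assign points to the closest centroid,
# move centroids to the mean of their points, repeat until stable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster assignments:", kmeans.labels_)
print("centroids:\n", kmeans.cluster_centers_)
```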

k-means has some known issues:
• Choosing k is more an art than a science, although there are bounds: 1 ≤ k ≤ n, where n is the number of data points.
• There are convergence issues—the solution
can fail to exist, if the algorithm falls into a loop,
for example, and keeps going back and forth
between two possible solutions, or in other
words, there isn’t a single unique solution.
• Interpretability can be a problem—sometimes
the answer isn’t at all useful. Indeed that’s often
the biggest problem.
