
Module 2

Exploratory Data Analysis and the Data Science Process

Basic tools (plots, graphs, and summary statistics) of EDA, Philosophy of EDA, The Data Science Process, Case Study: RealDirect (online real estate firm). Three Basic Machine Learning Algorithms: Linear Regression, k-Nearest Neighbours (k-NN), k-means.
Exploratory Data Analysis


“Exploratory data analysis” is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there. — John Tukey

Exploratory data analysis (EDA) is the first step toward building a model.

The basic tools of EDA are plots, graphs and
summary statistics.

Generally speaking, it’s a method of
systematically going through the data, plotting
distributions of all variables (using box plots),
plotting time series of data, transforming
variables, looking at all pairwise relationships
between variables using scatterplot matrices,
and generating summary statistics for all of
them.

At the very least that would mean computing
their mean, minimum, maximum, the upper and
lower quartiles, and identifying outliers.
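
As a minimal sketch of these basic tools, assuming pandas and matplotlib are available (the tiny DataFrame below is made up purely for illustration), you might compute summary statistics, box plots, and a scatterplot matrix like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up numeric data standing in for a real dataset.
df = pd.DataFrame({
    "num_visits": [3, 7, 4, 6, 10, 9],
    "time_spent": [43, 276, 82, 136, 417, 269],
})

# Summary statistics: mean, min, max, and the quartiles.
print(df.describe())

# Distributions of all variables using box plots.
df.plot(kind="box", subplots=True, layout=(1, 2), figsize=(8, 3))

# All pairwise relationships via a scatterplot matrix.
pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.show()
```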
Philosophy of Exploratory Data
Analysis

Long before worrying about how to convince others, you
first have to understand what’s happening yourself. —
Andrew Gelman

There are important reasons anyone working with data
should do EDA.

Namely, to gain intuition about the data; to make
comparisons between distributions; for sanity checking
(making sure the data is on the scale you expect, in the
format you thought it should be); to find out where data is
missing or if there are outliers; and to summarize the
data.

In the context of data generated from logs, EDA
also helps with debugging the logging process.

For example, “patterns” you find in the data
could actually be something wrong in the
logging process that needs to be fixed.

If you never go to the trouble of debugging,
you’ll continue to think your patterns are real.

The engineers we’ve worked with are always
grateful for help in this area.

In the end, EDA helps you make sure the product is performing as intended.
The Data Science Process

First we have the Real World.

Inside the Real World are lots of people busy at
various activities.

Some people are using Google+, others are
competing in the Olympics; there are spammers
sending spam, and there are people getting
their blood drawn. Say we have data on one of
these things.

Specifically, we’ll start with raw data—logs,
Olympics records, Enron employee emails, or
recorded genetic material (note there are lots of
aspects to these activities already lost even
when we have that raw data).

We want to process this to make it clean for
analysis.

So we build and use pipelines of data munging:
joining, scraping, wrangling, or whatever you
want to call it.

To do this we use tools such as Python, shell scripts, R, or SQL, or all of the above.

Eventually we get the data down to a nice format, like something with columns: name | event | year | gender | event time
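
A minimal munging sketch in Python/pandas, assuming a hypothetical raw CSV of Olympic results (the file name and raw column names are invented for illustration), might look like:

```python
import pandas as pd

# Hypothetical raw export; the file and column names are assumptions.
raw = pd.read_csv("olympics_raw.csv")

# Wrangle it down to the clean format described above:
# name | event | year | gender | event time
clean = (
    raw.rename(columns={"athlete": "name", "result": "event_time"})
       .loc[:, ["name", "event", "year", "gender", "event_time"]]
       .drop_duplicates()
       .dropna(subset=["event_time"])
)

clean.to_csv("olympics_clean.csv", index=False)
```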

Once we have this clean dataset, we should be
doing some kind of EDA.

In the course of doing EDA, we may realize that
it isn’t actually clean because of duplicates,
missing values, absurd outliers, and data that
wasn’t actually logged or incorrectly logged.

If that’s the case, we may have to go back to
collect more data, or spend more time cleaning
the dataset.

Next, we design our model to use some
algorithm like k-nearest neighbor (k-NN), linear
regression, Naive Bayes, or something else.

The model we choose depends on the type of
problem we’re trying to solve, of course, which
could be a classification problem, a prediction
problem, or a basic description problem.

We then can interpret, visualize, report, or
communicate our results.

This could take the form of reporting the results up to
our boss or coworkers, or publishing a paper in a
journal and going out and giving academic talks about
it.

Alternatively, our goal may be to build or prototype a
“data product”;

e.g., a spam classifier, a search ranking algorithm, or a recommendation system.
Case Study: RealDirect

Doug Perlson, the CEO of RealDirect, has a background in
real estate law, startups, and online advertising.

His goal with RealDirect is to use all the data he can access
about real estate to improve the way people sell and buy
houses.

Normally, people sell their homes about once every seven
years, and they do so with the help of professional brokers
and current data.

But there’s a problem both with the broker system and the
data quality.

RealDirect addresses both of them.

First, the brokers. They are typically “free agents”
operating on their own—think of them as home sales
consultants.

This means that they guard their data aggressively, and
the really good ones have lots of experience.

But in the grand scheme of things, that really means
they have only slightly more data than the
inexperienced brokers.

RealDirect is addressing this problem by hiring a team of licensed real estate agents who work together and pool their knowledge.

To accomplish this, it built an interface for sellers, giving them useful data-driven tips on how to sell their house.

It also uses interaction data to give real-time
recommendations on what to do next.

One problem with publicly available data is that it’s old
news—there’s a three-month lag between a sale and
when the data about that sale is available.

RealDirect is working on real-time feeds on things like
when people start searching for a home, what the
initial offer is, the time between offer and close, and
how people search for a home online.

First, it offers a subscription to sellers (about $395 a month) to access the selling tools.

Second, it allows sellers to use RealDirect’s agents at
reduced commission, typically 2% of the sale instead
of the usual 2.5% or 3%.

This is where the magic of data pooling comes in: it
allows RealDirect to take a smaller commission
because it’s more optimized, and therefore gets more
volume.
RealDirect Data Strategy

Explore its existing website, thinking about how buyers and sellers would navigate through it, and how the website is structured/organized. Try to understand the existing business model, and think about how analysis of RealDirect user-behavior data could be used to inform decision-making and product development.

Because there is no data yet for you to analyze (typical in a start-up when it's still building its product), you should get some auxiliary data to help gain intuition about this market.

Summarize your findings in a brief report aimed at the CEO.

Being the “data scientist” often involves speaking to people who aren't also data scientists, so it would be ideal to have a set of communication strategies for getting to the information you need about the data.

Most of you are not “domain experts” in real estate
or online businesses.

Doug mentioned the company didn’t necessarily
have a data strategy. There is no industry standard
for creating one.
Three Basic Machine Learning
Algorithms:

Linear Regression,

k-Nearest Neighbours (k-NN),

k-means
Linear Regression

When you use it, you are making the assumption that there is a linear relationship between an outcome variable (sometimes also called the response variable, dependent variable, or label) and a predictor (sometimes also called an independent variable, explanatory variable, or feature); or between one variable and several other variables, in which case you're modeling the relationship as having a linear structure.

It makes sense to use it when changes in one variable correlate linearly with changes in another variable.

For example, it makes sense that the more umbrellas you sell,
the more money you make.

Example 1: An overly simplistic example to start

Suppose you run a social networking site that
charges a monthly subscription fee of $25, and that
this is your only source of revenue.

Each month you collect data and count your number
of users and total revenue.

Here are the first four: S = {(x, y)} = {(1, 25), (10, 250), (100, 2500), (200, 5000)}

y = 25x

There’s a linear pattern.

The coefficient relating x and y is 25.

It seems deterministic.
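
As a quick check (a sketch using NumPy, not part of the original example), fitting a line to these four points recovers the slope of 25 and an intercept of essentially zero:

```python
import numpy as np

# The four observed (users, revenue) pairs from the example.
x = np.array([1, 10, 100, 200])
y = np.array([25, 250, 2500, 5000])

# Fit y = b1*x + b0; with deterministic data the fit is exact.
b1, b0 = np.polyfit(x, y, deg=1)
print(b1, b0)  # slope 25.0, intercept ~0.0
```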

Example 2: Looking at data at the user level

Say you have a dataset keyed by user (meaning
each row contains data for a single user), and the
columns represent user behavior on a social
networking site over a period of a week.

Let’s say you feel comfortable that the data is
clean at this stage and that you have on the order
of hundreds of thousands of users.

The names of the columns are total_num_friends, total_new_friends_this_week, num_visits, time_spent, number_apps_downloaded, number_ads_shown, gender, age, and so on.

But for now, you are simply trying to build intuition and understand your dataset. You eyeball the first few rows of two of the columns (new friends gained this week and time spent on the site) and see:

total_new_friends_this_week | time_spent
7  | 276
3  | 43
4  | 82
6  | 136
10 | 417
9  | 269

Now, your brain can't figure out what's going on by just looking at the raw numbers, so you plot them (see the sketch below). It looks like there's kind of a linear relationship here, and it makes sense: the more new friends you have, the more time you might spend on the site.
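
A minimal plotting sketch for the rows above (using matplotlib; the column pairing is taken from the discussion):

```python
import matplotlib.pyplot as plt

new_friends = [7, 3, 4, 6, 10, 9]
time_spent = [276, 43, 82, 136, 417, 269]

# Eyeball whether the relationship looks roughly linear.
plt.scatter(new_friends, time_spent)
plt.xlabel("total_new_friends_this_week")
plt.ylabel("time_spent")
plt.show()
```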

Many possible lines could describe this pattern, so how do you pick which one?

Because you're assuming a linear relationship, start your model by assuming the functional form to be:

y = β₀ + β₁x

Now your job is to find the best choices for β₀ and β₁ using the observed data to estimate them: (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ).

Writing this with matrix notation results in this: y = x · β

There you go: you've written down your model. Now the rest is fitting the model.

Fitting the model

So, how do you calculate β? The intuition
behind linear regression is that you want to find
the line that minimizes the distance between all
the points and the line.

Many lines look approximately correct, but your
goal is to find the optimal one.

You want to minimize your prediction errors.
This method is called least squares estimation.

To find this line, you'll define the "residual sum of squares" (RSS), denoted RSS(β), to be:

RSS(β) = Σᵢ (yᵢ − βxᵢ)²
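
As a sketch of least squares estimation on the toy rows above (np.polyfit minimizes the residual sum of squares for us; the explicit RSS computation is shown for comparison):

```python
import numpy as np

# Toy data: new friends this week vs. time spent on the site.
x = np.array([7, 3, 4, 6, 10, 9])
y = np.array([276, 43, 82, 136, 417, 269])

# Least squares fit of y = beta0 + beta1 * x.
beta1, beta0 = np.polyfit(x, y, deg=1)
print(f"fitted model: y = {beta0:.1f} + {beta1:.1f}x")

# Residual sum of squares for the fitted line.
residuals = y - (beta0 + beta1 * x)
print("RSS:", np.sum(residuals ** 2))
```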
k-Nearest Neighbors (k-NN)

Lazy Learning

k-NN is an algorithm that can be used when you have a bunch of objects that have been classified or labeled in some way, and other similar objects that haven't been classified or labeled yet, and you want a way to automatically label them.

The intuition behind k-NN is to consider the k most similar other items (defined in terms of their attributes), look at their labels, and give the unassigned item the label held by the majority. If there's a tie, you randomly select among the labels that have tied for first.

Example with credit scores

Say you have the age, income, and a credit
category of high or low for a bunch of people
and you want to use the age and income to
predict the credit label of “high” or “low” for a
new person.

1. Decide on your similarity or distance metric.
2. Split the original labeled dataset into training
and test data.
3. Pick an evaluation metric.
4. Run k-NN a few times, changing k and
checking the evaluation measure.
5. Optimize k by picking the one with the best
evaluation measure.
6. Once you've chosen k, use the same training set and create a new test set with the ages and incomes of the people you have no labels for and want to predict (see the sketch below).
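
A minimal sketch of this workflow with scikit-learn, using a small made-up table of ages and incomes (all values and labels are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Made-up labeled data: [age, income in $1000s] -> credit label.
X = np.array([[25, 40], [35, 60], [45, 80], [20, 20],
              [50, 90], [30, 30], [60, 120], [40, 45]])
y = np.array(["low", "high", "high", "low",
              "high", "low", "high", "low"])

# Step 2: split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Steps 1 and 3-5: Euclidean distance (the default), accuracy as the
# evaluation metric; try a few values of k and keep the best one.
best_k, best_acc = 1, 0.0
for k in (1, 3, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Step 6: label new, unlabeled people with the chosen k.
model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(model.predict([[28, 35], [55, 100]]))
```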

Similarity or distance metrics

Euclidean distance

Cosine Similarity

Jaccard Distance or Similarity

Mahalanobis Distance

Hamming Distance

Manhattan Distance
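
A few of these metrics, sketched with SciPy (Mahalanobis is omitted here because it also needs an inverse covariance matrix):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine, hamming, jaccard

a = np.array([25.0, 40.0])
b = np.array([35.0, 60.0])

print("Euclidean:", euclidean(a, b))
print("Manhattan:", cityblock(a, b))
print("Cosine distance:", cosine(a, b))  # 1 - cosine similarity

# Hamming and Jaccard compare equal-length (here boolean) sequences.
u = [1, 0, 1, 1]
v = [1, 1, 0, 1]
print("Hamming:", hamming(u, v))
print("Jaccard distance:", jaccard(u, v))
```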
k-means

unsupervised learning technique


k-means is the first unsupervised learning
technique we’ll look into, where the goal of the
algorithm is to determine the definition of the
right answer by finding clusters of data for you.


k = the number of clusters

1. Initially, you randomly pick k centroids (points that will be the centers of your clusters) in d-dimensional space. Try to make them near the data but different from one another.
2. Then assign each data point to the closest centroid.
3. Move each centroid to the average location of the data points assigned to it (e.g., the users, if you're clustering users).
4. Repeat the preceding two steps until the assignments don't change, or change very little.
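
A minimal sketch of these steps using scikit-learn's KMeans (the 2-D points below are made up; in practice they might be users described by age and income):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points, e.g., (age, income) for a handful of users.
X = np.array([[25, 40], [27, 42], [30, 45],
              [55, 90], [60, 95], [58, 88]])

# Steps 1-4: pick k centroids, assign points to the closest centroid,
# move centroids to the mean of their points, repeat until stable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster assignments:", kmeans.labels_)
print("centroids:\n", kmeans.cluster_centers_)
```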

k-means has some known issues:
• Choosing k is more an art than a science, although there are bounds: 1 ≤ k ≤ n, where n is the number of data points.
• There are convergence issues—the solution
can fail to exist, if the algorithm falls into a loop,
for example, and keeps going back and forth
between two possible solutions, or in other
words, there isn’t a single unique solution.
• Interpretability can be a problem—sometimes
the answer isn’t at all useful. Indeed that’s often
the biggest problem.
