PYTHON FOR DATA SCIENCE : LEARN IN 3 DAYS
Deepanshu Bhalla | Data Science, Python
This tutorial helps you learn Data Science with Python through examples. Python is an open-source, high-level programming language widely used for general-purpose programming, and it has gained great popularity in the data science world. As the data science domain grows, IBM recently predicted that demand for data science professionals would rise by more than 25% by 2020. In the PyPL Popularity of Programming Language index, Python holds second rank with a 14 percent share, and it is ranked among the top 3 programming languages for advanced analytics and predictive analytics.
Data Science with Python Tutorial

Table of Contents

1. Getting Started with Python
   Python 2.7 vs. 3.6
   Python for Data Science : Introduction
   How to install Python?
   Spyder Shortcut keys
   Basic programs in Python
   Comparison, Logical and Assignment Operators

2. Data Structures and Conditional Statements
   Python Data Structures
   Python Conditional Statements

3. Python Libraries
   List of popular packages (comparison with R)
   Popular python commands
   How to import a package

4. Data Manipulation using Pandas
   Pandas Data Structures - Series and DataFrame
   Important Pandas Functions (vs. R functions)
   Examples - Data analysis with Pandas

5. Data Science with Python
   Logistic Regression
   Decision Tree
   Random Forest
   Grid Search - Hyper Parameter Tuning
   Cross Validation
   Preprocessing Steps

Python 2.7 vs 3.6

Google yields thousands of articles on this topic, with some bloggers opposed to and some in favor of 2.7. If you filter your search and look only at recent articles (late 2016 onwards), you will see that the majority of bloggers favor Python 3.6. The following reasons support Python 3.6.

1. The official end-of-support date for Python 2.7 is the year 2020. After that there will be no support from the community, so it does not make sense to start learning 2.7 today.

2. Python 3.6 supports 95% of the top 360 Python packages and almost 100% of the top packages for data science.

What's new in Python 3.6

Python 3 is cleaner and faster, and it is the language for the future. It fixed major issues found in the Python 2 series. Python 3 was first released in 2008, and robust versions of the Python 3 series have been released for the 9 years since.

Key Takeaway

You should go for Python 3.6. In terms of learning Python, there are no major differences between Python 2.7 and 3.6, and moving between them requires only a few adjustments. Your focus should be on learning Python as a language.

Python for Data Science : Introduction

Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture and back-end services. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. There are some areas, such as the number of libraries for statistical analysis, where R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of data scientists.

There are several reasons to learn Python. Some of them are as follows -

1. Python works well for automating the various steps of a predictive model.
2. Python has robust libraries for machine learning, natural language processing, deep learning, big data and artificial intelligence.
3. Python wins over R when it comes to deploying machine learning models in production.
4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
5. Python has great online community support.

Do you know these sites are developed in Python?

1. YouTube
2. Instagram
3. Reddit
4. Dropbox
5. Disqus

How to Install Python

There are two ways to download and install Python -

1. Download Anaconda. It comes with Python along with popular libraries preinstalled.
2. Download Python from its official website. You have to install libraries manually.

Recommended : Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

Coding Environments

Anaconda comes with two popular IDEs :

1. Jupyter (IPython) Notebook
2. Spyder

Spyder. It is like RStudio for Python. It provides an environment in which writing Python code is user-friendly. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs and a console to check each and every line of code. Under the 'Variable explorer', you can access the data files and functions you have created. I highly recommend Spyder!

Spyder - Python Coding Environment

Jupyter (Ipython) Notebook

Jupyter is the Python equivalent of R Markdown. It is useful when you need to present your work to others or create a step-by-step project report, as it can combine code, output, text and graphics.

Spyder Shortcut Keys

The following is a list of some useful Spyder shortcut keys which make you more productive.

1. Press F5 to run the entire script
2. Press F9 to run the selection or line
3. Press Ctrl + 1 to comment / uncomment
4. Place the cursor at a function name and press Ctrl + I to see the documentation of the function
5. Run %reset -f to clean the workspace
6. Ctrl + Left click on an object to see its source code
7. Ctrl + Enter executes the current cell
8. Shift + Enter executes the current cell and advances the cursor to the next cell
List of arithmetic operators with examples

Operator              Operation               Example
+                     Addition                10 + 2 = 12
-                     Subtraction             10 - 2 = 8
*                     Multiplication          10 * 2 = 20
/                     Division                10 / 2 = 5.0
%                     Modulus (Remainder)     10 % 3 = 1
**                    Power                   10 ** 2 = 100
//                    Floor Division          17 // 3 = 5
(x + (d-1)) // d      Ceiling                 (17 + (3-1)) // 3 = 6
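As a quick check of the floor and ceiling rows above, here is a minimal sketch (x and d are placeholder names for this illustration); math.ceil(x / d) gives the same result as the integer trick.

import math

x, d = 17, 3
print(x // d)               # floor division -> 5
print((x + (d - 1)) // d)   # ceiling via integer arithmetic -> 6
print(math.ceil(x / d))     # same result using the math module -> 6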

Basic Programs

Example 1

#Basics
x = 10
y = 3
print("10 divided by 3 is", x/y)
print("remainder after 10 divided by 3 is", x%y)

Result :
10 divided by 3 is 3.3333333333333335
remainder after 10 divided by 3 is 1

Example 2

x = 100
x > 80 and x <=95
x > 35 or x < 60

x > 80 and x <=95


Out[45]: False

x > 35 or x < 60
Out[46]: True

Comparison & Logical Operators

Operator    Description                           Example
>           Greater than                          5 > 3 returns True
<           Less than                             5 < 3 returns False
>=          Greater than or equal to              5 >= 3 returns True
<=          Less than or equal to                 5 <= 3 returns False
==          Equal to                              5 == 3 returns False
!=          Not equal to                          5 != 3 returns True
and         Checks that both conditions hold      x > 18 and x <= 35
or          At least one condition holds true     x > 35 or x < 60
not         Opposite of condition                 not(x > 7)

Assignment Operators

Assignment operators are used to assign a value to a variable. For example, x += 25 means x = x + 25.

x = 100
y = 10
x += y
print(x)

110

In this case, x += y implies x = x + y, which is x = 100 + 10. Similarly, you can use x -= y, x *= y and x /= y.

Python Data Structures

In every programming language, it is important to understand the data structures. The following are some data structures used in Python.

1. List

A list is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples of lists below. The first is an integer list containing only integers, the second is a string list containing only string values, and the third is a mixed list containing integer, string and float values.

1. x = [1, 2, 3, 4, 5]
2. y = ['A', 'O', 'G', 'M']
3. z = ['A', 4, 5.1, 'M']

Get List Item

We can extract list items using indexes. The index starts at 0 and ends at (number of elements - 1).

x = [1, 2, 3, 4, 5]
x[0]
x[1]
x[4]
x[-1]
x[-2]

x[0]
Out[68]: 1

x[1]
Out[69]: 2

x[4]
Out[70]: 5

x[-1]
Out[71]: 5

x[-2]
Out[72]: 4

x[0] picks the first element from the list. A negative index tells Python to count list items from right to left; x[-1] selects the last element of the list.
You can select multiple elements from a list using slicing, for example:

x[:3] returns [1, 2, 3]
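A few more list operations you are likely to need early on; this is a small sketch reusing the list x from above.

x = [1, 2, 3, 4, 5]
x[1:4]          # slice from position 1 up to (not including) 4 -> [2, 3, 4]
x.append(6)     # add an element at the end -> [1, 2, 3, 4, 5, 6]
x.remove(3)     # remove the first occurrence of the value 3
len(x)          # number of elements in the list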

2. Tuple

A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows -

1. A tuple cannot be changed once constructed, whereas a list can be modified.
2. A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ].

Examples

K = (1,2,3)
State = ('Delhi','Maharashtra','Karnataka')

Perform for loop on Tuple

for i in State:
    print(i)

Delhi
Maharashtra
Karnataka
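Because a tuple cannot be changed once constructed, assigning to one of its elements raises an error. A minimal sketch using the State tuple above:

State = ('Delhi', 'Maharashtra', 'Karnataka')
State[0]            # 'Delhi' - indexing works like a list
# State[0] = 'Goa'  # uncommenting this raises TypeError: 'tuple' object does not support item assignment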

Detailed Tutorial : Python Data Structures

Functions

Like print(), you can create your own custom functions. These are also called user-defined functions. They help you automate repetitive tasks and call reusable code in an easier way.

Rules to define a function

1. A function starts with the def keyword followed by the function name and ( )
2. The function body starts after a colon (:) and is indented
3. The keyword return ends a function and gives back the value of the expression that follows it

def sum_fun(a, b):
    result = a + b
    return result

z = sum_fun(10, 15)

Result : z = 25

Suppose you want Python to assume 0 as the default value if no value is specified for parameter b.

def sum_fun(a, b=0):
    result = a + b
    return result

z = sum_fun(10)

In the above function, b is set to 0 if no value is provided for parameter b. That does not mean no value other than 0 can be passed; the function can still be called as z = sum_fun(10, 15).
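You can also pass arguments by name, which often makes calls easier to read. A small sketch reusing sum_fun defined above:

z1 = sum_fun(a=10, b=5)   # keyword arguments -> 15
z2 = sum_fun(10)          # b falls back to its default of 0 -> 10
z3 = sum_fun(b=20, a=10)  # with keywords, the order does not matter -> 30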

Conditional Statements (if else)

Conditional statements are commonly used in coding. They are the IF ELSE statements, which can be read as: "if a condition holds true, then execute something; else execute something else".

Note : The if and else statements end with a colon :

Example

k = 27
if k%5 == 0:
    print('Multiple of 5')
else:
    print('Not a Multiple of 5')

Result : Not a Multiple of 5
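When there are more than two branches, conditions can be chained with elif. A minimal sketch:

k = 27
if k % 5 == 0:
    print('Multiple of 5')
elif k % 3 == 0:
    print('Multiple of 3')
else:
    print('Neither a multiple of 5 nor of 3')

Result : Multiple of 3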

Popular Python packages for Data Analysis & Visualization

Some of the leading packages in Python, along with the equivalent libraries in R, are as follows -

1. pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
2. NumPy. For numerical computing. It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly comparable to the Rcpp package in R, which overcomes the limitation of slow speed in R. Numpy Tutorial
3. SciPy. For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on NumPy.
4. Scikit-learn. A collection of machine learning algorithms. It is built on NumPy and SciPy. It can perform all the techniques that can be done in R using the glm, knn, randomForest, rpart and e1071 packages.
5. Matplotlib. For data visualization. It's a leading package for graphics in Python. It is equivalent to the ggplot2 package in R.
6. Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, impute missing values, run statistical tests and export table output to HTML format.
7. pandasql. It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.

Most of the above packages come preinstalled with Anaconda and are readily available in Spyder.

Comparison of Python and R Packages by Data Mining Task

Task                    Python Package               R Package
IDE                     Rodeo / Spyder               RStudio
Data Manipulation       pandas                       dplyr and reshape2
Machine Learning        Scikit-learn                 glm, knn, randomForest, rpart, e1071
Data Visualization      ggplot + seaborn + bokeh     ggplot2
Character Functions     Built-in functions           stringr
Reproducibility         Jupyter                      knitr
SQL Queries             pandasql                     sqldf
Working with Dates      datetime                     lubridate
Web Scraping            beautifulsoup                rvest

Popular Python Commands

The commands below help you install, uninstall and update packages. Let's say you want to install / uninstall the pandas package.

Install Package
!pip install pandas

Uninstall Package
!pip uninstall pandas

Show Information about Installed Package


!pip show pandas

List of Installed Packages


!pip list
Upgrade a package
!pip install --upgrade pandas

How to import a package

There are multiple ways to import a package in Python. It is important to understand the differences between these styles.

1. import pandas as pd
It imports the package pandas under the alias pd. The function DataFrame from the pandas package is then called as pd.DataFrame.

2. import pandas
It imports the package without an alias, so the function DataFrame has to be called with the full package name: pandas.DataFrame.

3. from pandas import *
It imports the whole package and the function DataFrame can be called simply by typing DataFrame. This sometimes creates confusion when the same function name exists in more than one package.
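A fourth, more selective style is to import only the names you need. A small sketch showing it next to the aliased style:

import pandas as pd
from pandas import DataFrame   # import a single name from the package

df1 = pd.DataFrame({'x': [1, 2, 3]})   # aliased style
df2 = DataFrame({'x': [1, 2, 3]})      # imported name used directly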

Pandas Data Structures : Series and DataFrame

The pandas package provides two data structures - Series and DataFrame. These structures are explained below in detail -

1. Series is a one-dimensional array. You can access individual elements of a series by position. It's similar to a vector in R.

In the example below, we are generating 5 random values.

import numpy as np
import pandas as pd
s1 = pd.Series(np.random.randn(5))
s1

0 -2.412015
1 -0.451752
2 1.174207
3 0.766348
4 -0.361815
dtype: float64

Extract first and second value

You can get a particular element of a series using its index value. See the examples below -

s1[0]

-2.412015

s1[1]

-0.451752

s1[:3]

0 -2.412015
1 -0.451752
2 1.174207

2. DataFrame

It is equivalent to data.frame in R. It is a 2-dimensional data structure that can store data of different types such as characters, integers, floating point values and factors. Those who are well-conversant with MS Excel can think of a data frame as an Excel spreadsheet.

Comparison of Data Types in Python and Pandas

The following table shows how Python and the pandas package store data.

Data Type                               Pandas        Standard Python
Character variable                      object        string
Categorical variable                    category      -
Numeric variable without decimals       int64         int
Numeric variable with decimals          float64       float
Date-time variables                     datetime64    -

Important Pandas Functions

The table below compares pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorize pandas functions. It's very handy information for programmers who are new to Python and covers most of the frequently used data exploration tasks.

Task                                   R                            Python (pandas package)
Installing a package                   install.packages('name')    !pip install name
Loading a package                      library(name)                import name as other_name, e.g. import pandas as pd
Checking working directory             getwd()                      import os; os.getcwd()
Setting working directory              setwd()                      os.chdir('directory_name')
List files in a directory              dir()                        os.listdir()
Remove an object                       rm('name')                   del object
Select variables                       select(df, x1, x2)           df[['x1', 'x2']]
Drop variables                         select(df, -(x1:x2))         df.drop(['x1', 'x2'], axis = 1)
Filter data                            filter(df, x1 >= 100)        df.query('x1 >= 100')
Structure of a data frame              str(df)                      df.info()
Summarize data frame                   summary(df)                  df.describe()
Get row names of data frame "df"       rownames(df)                 df.index
Get column names                       colnames(df)                 df.columns
View top N rows                        head(df, N)                  df.head(N)
View bottom N rows                     tail(df, N)                  df.tail(N)
Get dimension of data frame            dim(df)                      df.shape
Get number of rows                     nrow(df)                     df.shape[0]
Get number of columns                  ncol(df)                     df.shape[1]
Length of data frame                   length(df)                   len(df)
Get random 3 rows from data frame      sample_n(df, 3)              df.sample(n=3)
Get random 10% rows                    sample_frac(df, 0.1)         df.sample(frac=0.1)
Check missing values                   is.na(df$x)                  pd.isnull(df.x)
Sorting                                arrange(df, x1, x2)          df.sort_values(['x1', 'x2'])
Rename variables                       rename(df, newvar = x1)      df.rename(columns={'x1': 'newvar'})

Data Manipulation with pandas - Examples

1. Import Required Packages

You can import the required packages using the import statement. In the syntax below, we are asking Python to import the numpy and pandas packages. The 'as' keyword is used to alias the package name.

import numpy as np
import pandas as pd

2. Build DataFrame

We can build a dataframe using the DataFrame() function of the pandas package.

mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
          'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
          'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
df = pd.DataFrame(mydata)

In this dataframe, we have three variables - productcode, sales and cost.

Sample DataFrame

To import data from a CSV file

You can use the read_csv() function from the pandas package to get data into Python from a CSV file.

mydata = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")

Make sure you use double backslashes when specifying the path of the CSV file. Alternatively, you can use forward slashes in the file path inside the read_csv() function, as shown in the sketch below.
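Another option is to prefix the path with r so Python treats backslashes literally (a raw string). A small sketch with the same hypothetical file path:

import pandas as pd
# raw string: single backslashes are not treated as escape characters
mydata = pd.read_csv(r"C:\Users\Deepanshu\Documents\file1.csv")
# forward slashes work as well
mydata = pd.read_csv("C:/Users/Deepanshu/Documents/file1.csv")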
Detailed Tutorial : Import Data in Python

3. To see number of rows and columns

You can run the command below to find out the number of rows and columns.

df.shape

Result : (6, 3). It means 6 rows and 3 columns.

4. To view first 3 rows

The df.head(N) function can be used to check out the first N rows.

df.head(3)

cost productcode sales


0 1020.0 AA 1010.0
1 1625.2 AA 1025.2
2 1204.0 AA 1404.2

5. Select or Drop Variables

To keep a single variable, you can write it in any of the following three ways -

df.productcode
df["productcode"]
df.loc[: , "productcode"]

The above methods select a column by its name. To select a variable by column position, you can use df.iloc. In the example below, we are selecting the second column. The column index starts from 0, hence 1 refers to the second column.

df.iloc[: , 1]

We can keep multiple variables by specifying the desired variables inside [ ]. Also, we can make use of df.loc().

df[["productcode", "cost"]]
df.loc[ : , ["productcode", "cost"]]

Drop Variable

We can remove variables by using the df.drop() function. See the example below -

df2 = df.drop(['sales'], axis = 1)

6. To summarize data frame

To summarize or explore the data, you can submit the command below.

df.describe()

cost sales
count 6.000000 6.00000
mean 1166.150000 1242.65000
std 237.926793 230.46669
min 1003.700000 1010.00000
25% 1020.000000 1058.90000
50% 1072.000000 1205.85000
75% 1184.000000 1366.07500
max 1625.200000 1604.80000

To summarize all the character variables, you can use the following script.

df.describe(include=['O'])

Similarly, you can use df.describe(include=['float64']) to view a summary of all the numeric variables with decimals.

To select only a particular variable, you can write the following code -

df.productcode.describe()
OR
df["productcode"].describe()

count 6
unique 2
top BB
freq 3
Name: productcode, dtype: object

7. To calculate summary statistics

We can manually find summary statistics such as count, mean and median using the commands below.

df.sales.mean()
df.sales.median()
df.sales.count()
df.sales.min()
df.sales.max()

8. Filter Data

Suppose you are asked to apply the condition - productcode is equal to "AA" and sales greater than or equal to 1250.

df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]

It can also be written like this :

df1 = df.query('(productcode == "AA") & (sales >= 1250)')

In the second query, we do not need to specify the DataFrame along with each variable name.

9. Sort Data

In the code below, we arrange the data in ascending order by sales.

df.sort_values(['sales'])

10. Group By : Summary by Grouping Variable

Like SQL GROUP BY, you may want to summarize a continuous variable by a classification variable. In this case, we are calculating average sales and cost by product code.

df.groupby(df.productcode).mean()

cost sales
productcode
AA 1283.066667 1146.466667
BB 1049.233333 1338.833333

Instead of summarizing multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.

df["sales"].groupby(df.productcode).mean()

11. Define Categorical Variable

Let's create a classification variable, id, which contains only 3 unique values - 1/2/3.

df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})

Let's define it as a categorical variable. We can use the astype() function to make id a categorical variable.

df0.id = df0["id"].astype('category')

Summarize this classification variable to check its descriptive statistics.

df0.describe()
id
count 7
unique 3
top 2
freq 3

Frequency Distribution

You can calculate the frequency distribution of a categorical variable. It is one of the ways to explore a categorical variable.

df['productcode'].value_counts()

BB 3
AA 3

12. Generate Histogram

A histogram is one of the ways to check the distribution of a continuous variable. In the figure shown below, there are two values for the variable 'sales' in the range 1000-1100; each remaining interval holds only a single value. In this case, there are only a handful of values. If you have a large dataset, you can plot a histogram to identify outliers in a continuous variable.

df['sales'].hist()
Histogram

13. BoxPlot

A boxplot is a way to visualize a continuous or numeric variable. It shows the minimum, Q1, Q2 (median), Q3, IQR and maximum value in a single graph.

df.boxplot(column='sales')

BoxPlot

Detailed Tutorial : Data Analysis with Pandas Tutorial
Data Science using Python - Examples

In this section, we cover how to run data mining and machine learning algorithms with Python. sklearn is the most frequently used library for data mining and machine learning algorithms. We will also cover the statsmodels library for regression techniques; statsmodels generates formatted output which can be used further in project reports and presentations.

1. Import the required libraries

Import the following libraries before reading or exploring data.

#Import required libraries


import pandas as pd
import statsmodels.api as sm
import numpy as np

2. Download and import data into Python

Using a Python library, we can easily pull data from the web into Python.

# Read data from web
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

Variable    Type           Description
gre         Continuous     Graduate Record Exam score
gpa         Continuous     Grade Point Average
rank        Categorical    Prestige of the undergraduate institution
admit       Binary         Admission into graduate school
The binary variable admit is a target variable.

3. Explore Data

Let's explore the data. We'll answer the following questions -

1. How many rows and columns are in the data file?
2. What are the distributions of the variables?
3. Are there any outliers?
4. If there are outliers, treat them
5. Are there any missing values?
6. Impute missing values (if any)

# See no. of rows and columns
df.shape

Result : 400 rows and 4 columns

In the code below, we rename the variable rank to 'position', as rank is already the name of a pandas function.

# rename rank column


df = df.rename(columns={'rank': 'position'})

Summarize and plot all the columns.

# Summarize
df.describe()
# plot all of the columns
df.hist()

Categorical Variable Analysis

It is important to check the frequency distribution of a categorical variable. It helps answer the question of whether the data is skewed.

# Summarize
df.position.value_counts(ascending=True)

1 61
4 67
3 121
2 151

Generating Crosstab

By looking at the cross tabulation report, we can check whether we have enough events against each unique value of the categorical variable.

pd.crosstab(df['admit'], df['position'])

position 1 2 3 4
admit
0 28 97 93 55
1 33 54 28 12

Number of Missing Values

We can write a simple loop to figure out the number of blank values in every variable in the dataset.

for i in list(df.columns):
    k = sum(pd.isnull(df[i]))
    print(i, k)

In this case, there are no missing values in the dataset.
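The same check can be done without a loop; pandas can count the missing values in each column in one line (a minimal sketch).

# count of missing values in each column
df.isnull().sum()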

4. Logistic Regression Model

Logistic regression is a special type of regression in which the target variable is categorical in nature and the independent variables can be discrete or continuous. In this post, we will demonstrate only binary logistic regression, which takes only binary values in the target variable. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.

In Python, we can write an R-style model formula y ~ x1 + x2 + x3 using the patsy and statsmodels libraries. In the formula, we need to define the variable 'position' as a categorical variable by wrapping it inside a capital C(). You can also define the reference category using the reference= option.

#Reference Category
from patsy import dmatrices, Treatment
y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type='dataframe')

It returns two datasets - X and y. The dataset 'y' contains the variable admit, which is the target variable. The other dataset 'X' contains the Intercept (constant value), the dummy variables for Treatment, gre and gpa. Since 4 is set as the reference category, all three dummy variables are 0 for it. See the sample below -

P    P_1    P_2    P_3
3    0      0      1
3    0      0      1
1    1      0      0
4    0      0      0
4    0      0      0
2    0      1      0

Split Data into two parts

80% of the data goes to the training dataset, which is used for building the model, and 20% goes to the test dataset, which will be used for validating the model.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Build Logistic Regression Model

By default, regression without the formula style does not include an intercept. To include it, we have already added the Intercept column in X_train, which will be used as a predictor.

#Fit Logit model


logit = sm.Logit(y_train, X_train)
result = logit.fit()

#Summary of Logistic regression model


result.summary()
result.params

Logit Regression Results
==============================================================================
Dep. Variable:                  admit   No. Observations:                  320
Model:                          Logit   Df Residuals:                      315
Method:                           MLE   Df Model:                            4
Date:                Sat, 20 May 2017   Pseudo R-squ.:                 0.03399
Time:                        19:57:24   Log-Likelihood:                -193.49
converged:                       True   LL-Null:                       -200.30
                                        LLR p-value:                  0.008627
==============================================================================
                      coef    std err        z      P>|z|   [95.0% Conf. Int.]
------------------------------------------------------------------------------
C(position)[T.1]    1.4933      0.440    3.392      0.001       0.630    2.356
C(position)[T.2]    0.6771      0.373    1.813      0.070      -0.055    1.409
C(position)[T.3]    0.1071      0.410    0.261      0.794      -0.696    0.910
gre                 0.0005      0.001    0.442      0.659      -0.002    0.003
gpa                -0.4613      0.214   -2.152      0.031      -0.881   -0.041
==============================================================================

Confusion Matrix and Odds Ratio

The odds ratio is the exponential of the parameter estimates.

#Confusion Matrix
result.pred_table()
#Odds Ratio
np.exp(result.params)

Prediction on Test Data

In this step, we take the estimates of the logit model that was built on the training data and apply them to the test data.

#prediction on test data


y_pred = result.predict(X_test)

Calculate Area under Curve (ROC)

# AUC on test data
from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
auc(false_positive_rate, true_positive_rate)

Result : AUC = 0.6763

Calculate Accuracy Score

from sklearn.metrics import accuracy_score
accuracy_score([1 if p > 0.5 else 0 for p in y_pred], y_test)

Decision Tree Model

A decision tree can have a continuous or categorical target variable. When it is continuous, the tree is called a regression tree; when it is categorical, it is called a classification tree. At each step, it selects the variable that best splits the set of values. There are several criteria and algorithms for finding the best split, such as Gini, entropy, chi-square and C4.5. Decision trees have several advantages: they are simple to use and easy to understand, they require very few data preparation steps, they can handle mixed data - both categorical and continuous variables - and they are very fast.

#Drop Intercept from predictors for tree algorithms


X_train = X_train.drop(['Intercept'], axis = 1)
X_test = X_test.drop(['Intercept'], axis = 1)

#Decision Tree
from sklearn.tree import DecisionTreeClassifier
model_tree = DecisionTreeClassifier(max_depth=7)

#Fit the model:


model_tree.fit(X_train,y_train)

#Make predictions on test set


predictions_tree = model_tree.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds =
roc_curve(y_test, predictions_tree[:,1])
auc(false_positive_rate, true_positive_rate)

Result : AUC = 0.664

Important Note

Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select the best parameters by using the grid search fine-tuning technique.

Random Forest Model

A decision tree has the limitation of overfitting, which means it does not generalize patterns well and is very sensitive to small changes in the training data. To overcome this problem, random forest comes into the picture. It grows a large number of trees on randomized samples of the data and selects a random subset of variables for each tree, which makes it a more robust algorithm than a decision tree. It is one of the most popular machine learning algorithms, is commonly used in data science competitions, is routinely ranked among the top algorithms and has become a part of every data science toolkit.

#Random Forest
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators=100,
max_depth=7)

#Fit the model:


target = y_train['admit']
model_rf.fit(X_train,target)

#Make predictions on test set


predictions_rf = model_rf.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds =
roc_curve(y_test, predictions_rf[:,1])
auc(false_positive_rate, true_positive_rate)

#Variable Importance
importances = pd.Series(model_rf.feature_importances_,
index=X_train.columns).sort_values(ascending=False)
print(importances)
importances.plot.bar()

Result : AUC = 0.6974

Grid Search - Hyperparameter Tuning

The sklearn library makes hyperparameter tuning very easy. It is a strategy to select the best parameters for an algorithm. In scikit-learn, hyperparameters are passed as arguments to the constructor of the estimator classes, for example max_features in random forest or alpha for lasso.

from sklearn.model_selection import GridSearchCV


rf = RandomForestClassifier()
target = y_train['admit']

param_grid = {
'n_estimators': [100, 200, 300],
'max_features': ['sqrt', 3, 4]
}

CV_rfc = GridSearchCV(estimator=rf ,
param_grid=param_grid, cv= 5, scoring='roc_auc')
CV_rfc.fit(X_train,target)

#Parameters with Scores


CV_rfc.grid_scores_

#Best Parameters
CV_rfc.best_params_
CV_rfc.best_estimator_

#Make predictions on test set


predictions_rf = CV_rfc.predict_proba(X_test)

#AUC
false_positive_rate, true_positive_rate, thresholds =
roc_curve(y_test, predictions_rf[:,1])
auc(false_positive_rate, true_positive_rate)

Cross Validation

# Cross Validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
target = y['admit']
prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
#AUC
cross_val_score(LogisticRegression(fit_intercept=False), X, target, cv=10, scoring='roc_auc')

Data Mining : Preprocessing Steps

1. The machine learning package sklearn requires all categorical variables in numeric form. Hence, we need to convert all character/categorical variables to numeric. This can be accomplished using the script below; sklearn's LabelEncoder handles the conversion.

from sklearn.preprocessing import LabelEncoder

def ConverttoNumeric(df):
    cols = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for i in cols:
        try:
            df[i] = le.fit_transform(df[i])
        except:
            print('Error in Variable :' + i)
    return df

ConverttoNumeric(df)

Encoding

2. Create Dummy Variables

Suppose you want to convert categorical variables into dummy variables. This is different from the previous example, as it creates dummy variables instead of converting the variable to numeric form.

productcode_dummy = pd.get_dummies(df["productcode"])
df2 = pd.concat([df, productcode_dummy], axis=1)

The output looks like below -

AA BB
0 1 0
1 1 0
2 1 0
3 0 1
4 0 1
5 0 1

Create k-1 Categories

To avoid multicollinearity, you can set one of the categories as the reference category and leave it out while creating dummy variables. In the script below, we are leaving out the first category.

productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
df2 = pd.concat([df, productcode_dummy], axis=1)

3. Impute Missing Values

Imputing missing values is an important step in predictive modeling. In many algorithms, if missing values are not filled, the complete row is removed; if the data contains a lot of missing values, this can lead to huge data loss. There are multiple ways to impute missing values. Some of the common techniques are to replace a missing value with the mean, the median or zero. It makes sense to replace a missing value with 0 when 0 is itself meaningful - for example, whether a customer holds a credit card product.

Fill missing values of a particular variable

# fill missing values with 0
df['var1'] = df['var1'].fillna(0)
# fill missing values with mean
df['var1'] = df['var1'].fillna(df['var1'].mean())

Apply imputation to the whole dataset

from sklearn.preprocessing import Imputer

# Set an imputer object
mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

# Train the imputer
mean_imputer = mean_imputer.fit(df)

# Apply imputation
df_new = mean_imputer.transform(df.values)

4. Outlier Treatment

There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows -

1. Cap extreme values at the 95th / 99th percentile depending on the distribution (see the sketch after this list)
2. Apply a log transformation of the variable. See below the implementation of the log transformation in Python.

import numpy as np
df['var1'] = np.log(df['var1'])
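For the first method (capping), here is a minimal sketch using clip(): values of the hypothetical variable var1 above its 99th percentile are replaced by that percentile.

# cap extreme values at the 99th percentile
upper = df['var1'].quantile(0.99)
df['var1'] = df['var1'].clip(upper=upper)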

5. Standardization

In some algorithms, it is required to standardize variables before running the actual algorithm. Standardization refers to the process of making the mean of a variable zero and giving it unit variance (a standard deviation of one).

#load dataset
from sklearn.datasets import load_boston
dataset = load_boston()
predictors = dataset.data
target = dataset.target
df = pd.DataFrame(predictors, columns=dataset.feature_names)

#Apply Standardization
from sklearn.preprocessing import
StandardScaler
k = StandardScaler()
df2 = k.fit_transform(df)

Next Steps

Practice, practice and practice. Download free public datasets from the Kaggle / UCLA websites, play around with the data, generate insights from it with the pandas package and build statistical models using the sklearn package. I hope you found this tutorial helpful. I tried to cover all the important topics that a beginner must know about Python. After completing this tutorial, you can say you know how to program in Python and how to implement machine learning algorithms using the sklearn package.

About Author:
Deepanshu founded ListenData with a simple objective - Make analytics easy to
understand and follow. He has over 7 years of experience in data science and
predictive modeling. During his tenure, he has worked with global clients in
various domains like banking, Telecom, HR and Health Insurance.

While I love having friends who agree, I only learn from those who don't.
