Python for Data Science : Learn in 3 Days
Deepanshu Bhalla
This tutorial helps you learn Data Science with Python.
Table of Contents
1. Getting Started with Python
Python 2.7 vs. 3.6
Python for Data Science : Introduction
How to install Python?
Spyder Shortcut keys
Basic programs in Python
Comparison, Logical and Assignment Operators
3. Python Libraries
List of popular packages (comparison with R)
Popular python commands
How to import a package
1. Official support for Python 2.7 ended in 2020. Since then
there has been no support from the community, so it does not
make sense to start learning 2.7 today.
Key Takeaway
1. YouTube
2. Instagram
3. Reddit
4. Dropbox
5. Disqus
Coding Environments
Operator   Description           Example
+          Addition              10 + 2 = 12
-          Subtraction           10 - 2 = 8
*          Multiplication        10 * 2 = 20
/          Division              10 / 2 = 5.0
%          Modulus (Remainder)   10 % 3 = 1
**         Power                 10 ** 2 = 100
//         Floor division        17 // 3 = 5
Basic Programs
Example 1
#Basics
x = 10
y = 3
print("10 divided by 3 is", x/y)
print("remainder after 10 divided by 3 is", x%y)
Result :
10 divided by 3 is 3.3333333333333335
remainder after 10 divided by 3 is 1
Example 2
x = 100
x > 80 and x <= 95   # False, because 100 is not <= 95
x > 35 or x < 60     # True, because 100 > 35
x > 35 or x < 60
Out[46]: True
Operator   Description    Example
==         Equal to       5 == 3 returns False
!=         Not equal to   5 != 3 returns True
Assignment Operators
x = 100
y = 10
x += y
print(x)
print(x)
110
1. List
1. x = [1, 2, 3, 4, 5]
2. y = ['A', 'O', 'G', 'M']
3. z = ['A', 4, 5.1, 'M']
x = [1, 2, 3, 4, 5]
x[0]
x[1]
x[4]
x[-1]
x[-2]
x[0]
Out[68]: 1
x[1]
Out[69]: 2
x[4]
Out[70]: 5
x[-1]
Out[71]: 5
x[-2]
Out[72]: 4
x[0] picks the first element of the list. A negative index tells
Python to count from the right end of the list, so x[-1] selects
the last element.
You can select multiple elements from a list using the
following method
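The method referred to above is not shown in this copy of the tutorial. A minimal sketch, assuming it is standard Python slicing (written as start:stop, where the element at stop is excluded), could look like this:

```python
x = [1, 2, 3, 4, 5]

first_three = x[0:3]   # positions 0, 1 and 2; the stop index is excluded
from_third = x[2:]     # everything from position 2 to the end
last_two = x[-2:]      # the last two elements
every_other = x[::2]   # every second element, starting from the first

print(first_three)   # [1, 2, 3]
print(from_third)    # [3, 4, 5]
print(last_two)      # [4, 5]
print(every_other)   # [1, 3, 5]
```

Slicing returns a new list, so the original x is left unchanged.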
2. Tuple
Examples
K = (1,2,3)
State = ('Delhi','Maharashtra','Karnataka')
for i in State:
    print(i)
Delhi
Maharashtra
Karnataka
Functions
z = sum_fun(10, 15)
Result : z = 25
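The definition of sum_fun is not shown above. The sketch below is one hypothetical definition that is consistent with the result z = 25; the parameter names a and b are assumptions.

```python
# Hypothetical definition of sum_fun; the original definition is not shown
def sum_fun(a, b):
    """Return the sum of two numbers."""
    result = a + b
    return result

z = sum_fun(10, 15)
print(z)  # 25
```

A function is defined with the def keyword, takes its inputs as parameters, and hands back a value with return.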
Example
k = 27
if k%5 == 0:
    print('Multiple of 5')
else:
    print('Not a Multiple of 5')
Install Package
!pip install pandas
Uninstall Package
!pip uninstall pandas
1. import pandas as pd
It imports the package pandas under the alias pd. The
DataFrame function in the pandas package is then called as
pd.DataFrame.
2. import pandas
It imports the package without an alias, so the DataFrame
function is called with the full package
name: pandas.DataFrame
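As a small illustration of the two import styles, the sketch below builds the same one-column table both ways. The column name "a" and its values are made up for the example.

```python
import pandas
import pandas as pd

# Both names refer to the same package, so both calls build the same table
df_alias = pd.DataFrame({"a": [1, 2]})
df_full = pandas.DataFrame({"a": [1, 2]})

same_package = pd is pandas
print(same_package)              # True
print(df_alias.equals(df_full))  # True
```

The alias only saves typing; it does not change what is imported.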
import numpy as np
import pandas as pd
s1 = pd.Series(np.random.randn(5))
s1
0 -2.412015
1 -0.451752
2 1.174207
3 0.766348
4 -0.361815
dtype: float64
s1[0]
-2.412015
s1[1]
-0.451752
s1[:3]
0 -2.412015
1 -0.451752
2 1.174207
2. DataFrame
import numpy as np
import pandas as pd
2. Build DataFrame
Sample DataFrame
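The sample data itself is not shown in this copy. The sketch below builds a small DataFrame with the column names used in the rest of this section (productcode, cost, sales). The individual values are an assumption, chosen so that the summary statistics agree with the describe() and groupby() outputs shown later.

```python
import pandas as pd

# Assumed sample values; only the column names come from the tutorial
df = pd.DataFrame({
    "productcode": ["AA", "AA", "AA", "BB", "BB", "BB"],
    "cost": [1020.0, 1625.2, 1204.0, 1003.7, 1020.0, 1124.0],
    "sales": [1010.0, 1025.2, 1404.2, 1251.7, 1160.0, 1604.8],
})

print(df.shape)  # (6, 3)
print(df)
```

Passing a dictionary to pd.DataFrame creates one column per key, with the list under each key as the column values.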
mydata = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
You can run the command below to find the number of rows
and columns.
df.shape
df.head(3)
df.productcode
df["productcode"]
df.loc[: , "productcode"]
df.iloc[: , 1]
df[["productcode", "cost"]]
df.loc[ : , ["productcode", "cost"]]
Drop Variable
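The code for this step is missing here. A minimal sketch, assuming the step uses the pandas drop method, is shown below with made-up sample values.

```python
import pandas as pd

# Hypothetical sample data using the column names from this tutorial
df = pd.DataFrame({
    "productcode": ["AA", "BB", "BB"],
    "cost": [1020.0, 1003.7, 1124.0],
    "sales": [1010.0, 1251.7, 1604.8],
})

# drop returns a new DataFrame; the original df keeps all its columns
df_without_cost = df.drop(columns=["cost"])

# The same method accepts a list of column names together with axis=1
df_only_code = df.drop(["cost", "sales"], axis=1)

print(list(df_without_cost.columns))  # ['productcode', 'sales']
print(list(df_only_code.columns))     # ['productcode']
```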
df.describe()
              cost       sales
count     6.000000     6.00000
mean   1166.150000  1242.65000
std     237.926793   230.46669
min    1003.700000  1010.00000
25%    1020.000000  1058.90000
50%    1072.000000  1205.85000
75%    1184.000000  1366.07500
max    1625.200000  1604.80000
df.describe(include=['O'])
df.productcode.describe()
OR
df["productcode"].describe()
count 6
unique 2
top BB
freq 3
Name: productcode, dtype: object
df.sales.mean()
df.sales.median()
df.sales.count()
df.sales.min()
df.sales.max()
8. Filter Data
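The filtering code is missing in this copy. A minimal sketch of row filtering with boolean conditions, assuming the same column names as above and made-up values, might look like this:

```python
import pandas as pd

# Hypothetical sample data using the column names from this tutorial
df = pd.DataFrame({
    "productcode": ["AA", "AA", "BB", "BB"],
    "cost": [1020.0, 1204.0, 1003.7, 1124.0],
    "sales": [1010.0, 1404.2, 1251.7, 1604.8],
})

# Rows where productcode equals "AA"
only_aa = df[df["productcode"] == "AA"]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
bb_high_sales = df[(df["productcode"] == "BB") & (df["sales"] > 1500)]

# Rows whose productcode appears in a list of values
in_list = df[df["productcode"].isin(["AA", "BB"])]

print(len(only_aa), len(bb_high_sales), len(in_list))  # 2 1 4
```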
9. Sort Data
df.sort_values(['sales'])
df.groupby(df.productcode).mean()
                    cost        sales
productcode
AA           1283.066667  1146.466667
BB           1049.233333  1338.833333
df["sales"].groupby(df.productcode).mean()
df0.id = df0["id"].astype('category')
df0.describe()
id
count 7
unique 3
top 2
freq 3
Frequency Distribution
df['productcode'].value_counts()
BB 3
AA 3
df['sales'].hist()
Histogram
13. BoxPlot
df.boxplot(column='sales')
BoxPlot
With the help of a Python library, we can easily get data from
the web into Python.
3. Explore Data
# Summarize
df.describe()
# plot all of the columns
df.hist()
# Summarize
df.position.value_counts(ascending=True)
1 61
4 67
3 121
2 151
Generating Crosstab
pd.crosstab(df['admit'], df['position'])
position   1   2   3   4
admit
0         28  97  93  55
1         33  54  28  12
#Reference Category
from patsy import dmatrices, Treatment
y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))',
                 df, return_type = 'dataframe')
#Confusion Matrix
result.pred_table()
#Odds Ratio
np.exp(result.params)
Prediction on Test Data
In this step, we take the estimates of the logit model, which was
built on the training data, and apply them to the test data.
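The code for this step is not shown here. The sketch below illustrates the idea with scikit-learn's LogisticRegression rather than the model object used above, and with synthetic data from make_classification instead of the admission dataset. Both are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the real dataset
X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Hold out 30% of the rows as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Estimate the model on the training data only
model = LogisticRegression()
model.fit(X_train, y_train)

# Apply the fitted coefficients to the unseen test data
test_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1
test_auc = roc_auc_score(y_test, test_prob)
print(test_auc)
```

The model never sees the test rows during fitting, so the score on them gives a fairer picture of how it would perform on new data.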
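#Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
model_tree = DecisionTreeClassifier(max_depth=7)
#AUC
# predictions_tree is assumed to hold the class probabilities predicted
# for the test data by the fitted model_tree
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
auc(false_positive_rate, true_positive_rate)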
Important Note
#Random Forest
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(n_estimators=100,
max_depth=7)
#AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
auc(false_positive_rate, true_positive_rate)
#Variable Importance
importances = pd.Series(model_rf.feature_importances_,
index=X_train.columns).sort_values(ascending=False)
print(importances)
importances.plot.bar()
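from sklearn.model_selection import GridSearchCV
# rf is assumed to be a RandomForestClassifier instance created beforehand
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 3, 4]
}
CV_rfc = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='roc_auc')
CV_rfc.fit(X_train, target)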
#Best Parameters
CV_rfc.best_params_
CV_rfc.best_estimator_
#AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
auc(false_positive_rate, true_positive_rate)
Cross Validation
# Cross Validation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
target = y['admit']
prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
#AUC
cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')
ConverttoNumeric(df)
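The definition of ConverttoNumeric is not shown in this copy. The sketch below is a hypothetical version, assuming the helper replaces every non-numeric column with integer codes. It uses pandas category codes, which may differ from what the original helper does.

```python
import pandas as pd

# Hypothetical helper; the original definition is not shown
def ConverttoNumeric(df):
    """Return a copy of df with every non-numeric column replaced by integer codes."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Each distinct value gets an integer code, in sorted order
            out[col] = out[col].astype("category").cat.codes
    return out

# Made-up sample data for the usage example
sample = pd.DataFrame({
    "productcode": ["AA", "BB", "AA"],
    "sales": [1010.0, 1251.7, 1404.2],
})
converted = ConverttoNumeric(sample)
print(converted["productcode"].tolist())  # [0, 1, 0]
```

Integer codes imply an ordering between categories, which is why the dummy-variable encoding in the next step is often preferred for unordered categories.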
Encoding
productcode_dummy = pd.get_dummies(df["productcode"])
df2 = pd.concat([df, productcode_dummy], axis=1)
The output looks like this:
   AA  BB
0   1   0
1   1   0
2   1   0
3   0   1
4   0   1
5   0   1
productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
df2 = pd.concat([df, productcode_dummy], axis=1)
# Apply imputation
df_new = mean_imputer.transform(df.values)
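The creation of mean_imputer is not shown above. A minimal sketch, assuming it is scikit-learn's SimpleImputer with the mean strategy and using made-up sample values, could be:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value per column
df = pd.DataFrame({
    "cost": [1020.0, np.nan, 1204.0],
    "sales": [1010.0, 1025.2, np.nan],
})

# Learn each column's mean from the non-missing values
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
mean_imputer.fit(df.values)

# Apply imputation: every NaN is replaced by its column mean
df_new = mean_imputer.transform(df.values)
print(df_new)
```

The result is a NumPy array, so it has to be wrapped in pd.DataFrame again if column names are needed.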
4. Outlier Treatment
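The content of this step is missing here. One common approach, shown below purely as an assumption about what the step might cover, is to cap values at chosen percentiles. The 5th and 95th percentile cut-offs and the sample values are illustrative.

```python
import numpy as np

# Made-up values with one obvious outlier
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 250.0])

# Compute the cut-offs, then pull anything outside them back to the boundary
lower, upper = np.percentile(values, [5, 95])
capped = np.clip(values, lower, upper)

print(lower, upper)
print(capped)
```

Capping keeps every row in the data, whereas dropping outliers reduces the sample size.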
5. Standardization
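#load dataset
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
dataset = load_boston()
predictors = dataset.data
target = dataset.target
df = pd.DataFrame(predictors, columns = dataset.feature_names)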
#Apply Standardization
from sklearn.preprocessing import StandardScaler
k = StandardScaler()
df2 = k.fit_transform(df)
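To see what standardization does, the sketch below applies StandardScaler to a small made-up DataFrame (not the dataset loaded above) and checks that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data on different scales
df = pd.DataFrame({
    "cost": [1020.0, 1625.2, 1204.0, 1003.7],
    "sales": [10.1, 10.3, 14.0, 12.5],
})

k = StandardScaler()
df2 = k.fit_transform(df)   # returns a NumPy array, not a DataFrame

# Each column is rescaled as (value - column mean) / column standard deviation
col_means = df2.mean(axis=0)
col_stds = df2.std(axis=0)
print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```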
Next Steps
About Author:
Deepanshu founded ListenData with a simple objective - Make analytics easy to
understand and follow. He has over 7 years of experience in data science and
predictive modeling. During his tenure, he has worked with global clients in
various domains like banking, Telecom, HR and Health Insurance.