Introduction
It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being
a data scientist, my hunt for other useful tools was ON! Fortunately, it didn't take me long to decide: Python was my
appetizer.
I always had an inclination towards coding. This was the time to do what I really loved. Code. Turned out, coding was so
easy!
I learned the basics of Python within a week. And, since then, I've not only explored this language in depth, but also have
helped many others learn this language. Python was originally a general purpose language. But, over the years, with
strong community support, this language gained dedicated libraries for data analysis and predictive modeling.
Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python
faster. In this tutorial, we will take bite-sized information about how to use Python for data analysis, chew it till we are
comfortable, and practice it at our own end.
Table of Contents
1. Basics of Python for Data Analysis
Why learn Python for data analysis?
Python 2.7 v/s 3.4
How to install Python?
Running a few simple programs in Python
2. Python libraries and data structures
Python Data Structures
Python Iteration and Conditional Constructs
Python Libraries
3. Exploratory analysis in Python using Pandas
Introduction to series and dataframes
Analytics Vidhya dataset – Loan Prediction Problem
4. Data Munging in Python using Pandas
5. Building a Predictive Model in Python
Logistic Regression
Decision Tree
Random Forest
Why learn Python for data analysis?
Python is an interpreted language rather than a compiled language – hence it might take up more CPU time. However, given the
savings in programmer time (due to the ease of learning), it might still be a good choice.
How to install Python?
You can download Python directly from its official website (python.org) and install the individual components and libraries you want.
Alternately, you can download and install a package which comes with pre-installed libraries. I would recommend
downloading Anaconda. Other packaged distributions are also available.
The second method provides a hassle-free installation, and hence I'll recommend it to beginners. The limitation of this
approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a
single library. It should not matter unless you are doing cutting-edge statistical research.
Once you have installed Python, there are various options for choosing an environment. Here are the 3 most common
options: terminal / shell based, IDLE (the default environment) and iPython notebook.
While the right environment depends on your need, I personally prefer iPython Notebooks a lot. It provides a lot of good
features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than
line-by-line execution).
Before we deep dive into problem solving, let's take a step back and understand the basics of Python. Data
structures and iteration and conditional constructs form the crux of any language. In Python, these include lists,
strings, tuples, dictionaries, for-loops, while-loops, if-else, etc. Let's take a look at some of these.
Python Data Structures
Lists – Lists are one of the most versatile data structures in Python. A list can simply be defined by writing comma-separated
values in square brackets. Lists might contain items of different types, but usually the items all have the same
type. Python lists are mutable, and individual elements of a list can be changed.
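For example (the values here are just an illustration):

squares = [1, 4, 9, 16, 25]
squares[0] = 0           # lists are mutable: replace the first element
squares.append(36)       # grow the list in place
print(squares)           # [0, 4, 9, 16, 25, 36]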
Tuples – A tuple is represented by a number of values separated by commas. Tuples are immutable and the output is
surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable,
they can hold mutable data if needed.
Since tuples are immutable and cannot change, they are faster to process than lists. Hence, if your list
is unlikely to change, you should use tuples instead of lists.
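A minimal sketch of these properties:

point = (3, 4)             # a tuple of two values
print(point[0])            # elements can be read by index
# point[0] = 5             # uncommenting this raises a TypeError: tuples are immutable
mixed = ([1, 2], 'label')  # a tuple can hold a mutable list...
mixed[0].append(3)         # ...and that list can still be changed in place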
Dictionary – A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one
dictionary). A pair of braces creates an empty dictionary: {}.
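For example (the names and ages are made up):

ages = {'Tom': 31, 'Ria': 28}  # keys must be unique within one dictionary
ages['Sam'] = 45               # add a new key: value pair
print(ages['Ria'])             # look up a value by its key
empty = {}                     # a pair of braces creates an empty dictionary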
Python Iteration and Conditional Constructs
Like most languages, Python also has a for-loop, which is the most widely used method for iteration. It has a simple
syntax:

for i in [Python Iterable]:
    expression(i)
Here “Python Iterable” can be a list, tuple or other advanced data structures which we will explore in later sections. Let’s
take a look at a simple example, determining the factorial of a number.
fact = 1
for i in range(1, N+1):
    fact *= i
Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly
used construct is if-else, with following syntax:
if [condition]:
    __execution if true__
else:
    __execution if false__
For example, to check whether a number N is even or odd:

if N % 2 == 0:
    print('Even')
else:
    print('Odd')
Now that you are familiar with Python fundamentals, let’s take a step further. What if you have to perform the following
tasks:
1. Multiply 2 matrices
2. Find the root of a quadratic equation
3. Plot bar charts and histograms
4. Make statistical models
5. Access web-pages
If you try to write code from scratch, it's going to be a nightmare and you won't stay on Python for more than 2 days! But
let's not worry about that. Thankfully, there are many libraries with predefined functions which we can directly import into our code
and make our life easy.
For example, consider the factorial example we just saw. We can do that in a single step as:
math.factorial(N)
Of course, we need to import the math library for that. Let's explore the various libraries next.
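Putting the import and the call together (N = 5 is just an example value):

import math

N = 5
print(math.factorial(N))  # 120, the same result as the for-loop above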
Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is
obviously to learn to import them into our environment. There are several ways of doing so in Python:

import math as m

from math import *

In the first manner, we have defined an alias m to the library math. We can now use various functions from the math library (e.g.
factorial) by referencing them with the alias: m.factorial().
In the second manner, you have imported the entire namespace of math, i.e. you can directly use factorial() without
referring to math.
Tip: Google recommends the first style of importing libraries, as then you will know where the functions have come
from.
Following is a list of libraries you will need for any scientific computations and data analysis:
NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains
basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other
low-level languages like Fortran, C and C++ (see the short sketch after this list).
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science
and engineering modules like discrete Fourier transforms, linear algebra, optimization and sparse matrices.
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature
in iPython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option,
then pylab converts the iPython environment to an environment very similar to Matlab. You can also use LaTeX commands to
add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas
was added relatively recently to Python and has been instrumental in boosting Python's usage in the data scientist
community.
Scikit-learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for
machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical
models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result
statistics are available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in
Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web-browsers. It empowers the user to
generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity
over very large or streaming datasets.
Blaze for extending the capability of Numpy and Pandas to distributed and streaming datasets. It can be used to access data
from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh,
Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a
website's home URL and then dig through web-pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra,
discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the
computations as LaTeX code.
Requests for accessing the web. It works similarly to the standard Python library urllib2, but is much easier to code. You will
find subtle differences with urllib2, but for beginners Requests might be more convenient.
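As a tiny taste of NumPy, the first entry above (a minimal sketch; the matrix values are made up):

import numpy as np

a = np.array([[1, 2], [3, 4]])   # a 2-D ndarray, NumPy's core data structure
print(a.T.dot(a))                # basic linear algebra: transpose and matrix product
print(np.linalg.inv(a))          # matrix inverse from the linear algebra module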
Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving
through Python. Yes, I mean making a predictive model! In the process, we use some powerful libraries and also come
across the next level of data structures. We will take you through the 3 key phases: data exploration, data munging
and predictive modeling.
Exploratory analysis in Python using Pandas
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It
has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data
set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm
for solving this problem.
Before loading the data, let's understand the 2 key data structures in Pandas – Series and DataFrames.
A Series can be understood as a 1-dimensional labelled / indexed array; you can access individual elements of a series through these labels.
A dataframe is similar to an Excel workbook – you have column names referring to columns, and you have rows which can
be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as
column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. Data sets are first read into these dataframes
and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
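As a quick illustration (the values and the df_example name are made up; this is not the loan dataset):

import pandas as pd

s = pd.Series([710, 1809, 1475], index=['a', 'b', 'c'])  # a 1-D labelled array
print(s['b'])                                            # access an element via its label
df_example = pd.DataFrame({'LoanAmount': [130, 165], 'Loan_Status': ['Y', 'N']})
print(df_example['LoanAmount'].mean())                   # column-wise operations apply easily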
More: 10 Minutes to Pandas
Practice data set – Loan Prediction Problem
You can download the dataset from the AV DataHack platform.
You can start iPython notebook by writing the following in your terminal / cmd:

ipython notebook --pylab=inline

This opens up iPython notebook in the pylab environment, which has a few useful libraries already imported. Also, you will be
able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check
whether the environment has loaded correctly by typing the following command (and getting the output as seen in the
figure below):

plot(arange(5))
I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv
During this tutorial, we will use the following libraries: numpy, matplotlib and pandas.
Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in
the code, in case you use the code in a different environment.
After importing the libraries, you read the dataset using the read_csv() function. This is how the code looks till this stage:

import pandas as pd
import numpy as np
import matplotlib as plt

df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")
df.head(10)
This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of the numerical fields by using the describe() function:
df.describe()
The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output. (You may want
to refresh your basic statistics to understand population distributions.)
Here are a few inferences you can draw by looking at the output of the describe() function. For example, we can get an idea
of a possible skew in the data by comparing the mean to the median, i.e. the 50%
figure.
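For instance, a quick check on ApplicantIncome:

print(df['ApplicantIncome'].mean())    # the mean
print(df['ApplicantIncome'].median())  # the median, i.e. the 50% figure
# a mean well above the median suggests a right-skewed distribution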
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at frequency distributions to understand
whether they make sense or not. The frequency table can be printed by the following command:
df['Property_Area'].value_counts()
Similarly, we can look at the unique values of Credit_History. Note that dfname['column_name'] is a basic indexing
technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to
the "10 Minutes to Pandas" resource shared above.
Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with
numeric variables – namely ApplicantIncome and LoanAmount.
Let's start by plotting the histogram of ApplicantIncome using the following command:
df['ApplicantIncome'].hist(bins=50)
Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the
distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:
df.boxplot(column='ApplicantIncome')
This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in the
society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us
segregate them by Education:
df.boxplot(column='ApplicantIncome', by = 'Education')
We can see that there is no substantial difference between the mean income of graduates and non-graduates. But there are
a higher number of graduates with very high incomes, which appear to be the outliers.
Now, let's look at the histogram and boxplot of LoanAmount using the following commands:
df['LoanAmount'].hist(bins=50)
df.boxplot(column='LoanAmount')
Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data
munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values
which demand deeper understanding. We will take this up in the coming sections.
Now we will look at the steps required to generate a similar insight using Python. Please refer to the Pandas resources
shared above for getting a hang of the different data manipulation techniques in Pandas.
temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'], aggfunc=lambda x: x.map({'Y':1,'N':0}).mean())
print(temp1)
print(temp2)
Now we can observe that we get a pivot_table similar to the MS Excel one. This can be plotted as a bar chart using the
matplotlib library with the following code:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar')
ax2 = fig.add_subplot(122)
temp2.plot(kind='bar')
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")
Alternately, these two plots can also be visualized by combining them in a stacked chart, as sketched below.
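One way to build such a stacked chart is with pd.crosstab (a sketch; the colors are an arbitrary choice):

temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)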
You can also add gender into the mix (similar to the pivot table in Excel), as sketched after this paragraph.
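A sketch along the same lines, passing gender as a second key:

temp4 = pd.crosstab([df['Credit_History'], df['Gender']], df['Loan_Status'])
temp4.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)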
If you have not realized it already, we have just created two basic classification algorithms here: one based on credit history,
while the other is based on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV
Datahacks.
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would
have increased by now – given the amount of help the library can provide in analyzing datasets.
Next, let's explore the ApplicantIncome and LoanStatus variables further, and create a dataset for
applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through
an independent example before reading further.
Data Munging in Python using Pandas
While exploring the data, we found a few problems in the data set, which need to be solved before it is ready for a good model:
1. There are missing values in some variables. We should estimate those values wisely, depending on the amount of missing
values and the expected importance of the variables.
2. While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at
either end. Though they might make intuitive sense, they should be treated appropriately.
In addition to these problems with numerical fields, we should also look at the non-numerical fields, i.e. Gender,
Property_Area, Married, Education and Dependents, to see if they contain any useful information.
If you are new to Pandas, I would recommend reading the Pandas resources shared above before moving on. They detail some
useful techniques of data manipulation.
Let us now check the number of nulls / NaNs in the dataset:

df.apply(lambda x: sum(x.isnull()), axis=0)

This command should tell us the number of missing values in each column, as isnull() returns 1 if the value is null.
Though the missing values are not very high in number, many variables have them, and each one of these should be
estimated and added to the data. It is also worth getting a detailed view of the different imputation techniques available.
Note: Remember that missing values may not always be NaNs. For instance, if Loan_Amount_Term is 0, does it make
sense, or would you consider it missing? I suppose your answer is "missing", and you're right. So we should check for
values which are impractical.
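A quick sanity check along these lines:

print((df['Loan_Amount_Term'] == 0).sum())  # zero-term loans are effectively missing values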
The simplest way to fill the missing values of LoanAmount is replacement by the mean, which can be done by the following code:

df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables,
and then use those predictions to impute the missing values.
Since the purpose now is to bring out the steps in data munging, I'll rather take an approach which lies somewhere in
between these 2 extremes. A key hypothesis is that whether a person is educated or self-employed can combine to
give a good estimate of loan amount.
df.boxplot(column='LoanAmount', by=['Education','Self_Employed'])

Thus we see some variations in the median loan amount for each group, and this can be used to impute the values. But
first, we have to ensure that the Self_Employed and Education variables do not have missing values.
As we saw earlier, Self_Employed has some missing values. Let's look at the frequency table:
Since ~86% of values are "No", it is safe to impute the missing values as "No", as there is a high probability of success. This can
be done using the following code:
df['Self_Employed'].fillna('No',inplace=True)
Now, we will create a pivot table which provides us median values for all the groups of unique values of the Self_Employed
and Education features. Next, we define a function which returns the values of these cells, and apply it to fill the missing
values of loan amount:
table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc=np.median)
# Define a function that returns the median of the relevant group
def fage(x):
    return table.loc[x['Self_Employed'],x['Education']]
# Replace missing values with the group medians
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
This should provide you a good way to impute missing values of loan amount.
Now, how do we treat extreme values? Let's analyze LoanAmount first. Since extreme values are practically possible
(some people might apply for high-value loans due to specific needs), instead of treating them as outliers, let's try a log
transformation to nullify their effect:

df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)
Now the distribution looks much closer to normal, and the effect of extreme values has been significantly subsided.
Coming to ApplicantIncome, one intuition can be that some applicants have lower income but strong support from
co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.

df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)
Now we see that the distribution is much better than before. I will leave it up to you to impute the missing values
for Gender, Married, Dependents, Loan_Amount_Term and Credit_History. Also, I encourage you to think about possible
additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome
might make sense, as it gives an idea of how well the applicant is suited to pay back his loan (see the sketch below).
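A sketch of such a derived column (the column name here is my own choice, not from the original):

df['LoanAmount_by_TotalIncome'] = df['LoanAmount'] / df['TotalIncome']
df['LoanAmount_by_TotalIncome'].hist(bins=20)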
Building a Predictive Model in Python
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding
the categories. This can be done using the following code:

from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes
Next, we will import the required modules. Then we will define a generic classification function, which takes a model as
input and determines the accuracy and cross-validation scores. Since this is an introductory article, I will not go into the
details of coding. Please refer to dedicated resources for details of the algorithms with R and Python code. Also, it'll be
good to get a refresher on cross-validation, as it is a very important measure of model performance.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import KFold
from sklearn import metrics
#Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    model.fit(data[predictors],data[outcome])
    #Make predictions on training set:
    predictions = model.predict(data[predictors])
    #Print accuracy
    accuracy = metrics.accuracy_score(predictions,data[outcome])
    print('Accuracy : {0:.3%}'.format(accuracy))
    #Perform k-fold cross-validation with 5 folds
    kf = KFold(data.shape[0], n_folds=5)
    error = []
    for train, test in kf:
        train_predictors = (data[predictors].iloc[train,:])
        train_target = data[outcome].iloc[train]
        model.fit(train_predictors, train_target)
        error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
    print('Cross-Validation Score : {0:.3%}'.format(np.mean(error)))
    #Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors],data[outcome])
Logistic Regression
Let's make our first logistic regression model. One way would be to take all the variables into the model, but this might
result in overfitting (don't worry if you're unaware of this terminology yet). In simple words, taking all variables might result
in the model understanding complex relations specific to the data, and it will not generalize well. Read more about
logistic regression.
We can easily make some intuitive hypotheses to set the ball rolling. The chances of getting a loan will be higher for:
1. Applicants having a credit history (remember we observed this in exploration?)
2. Applicants with higher applicant and co-applicant incomes
3. Applicants with a higher education level
4. Properties in urban areas with high growth prospects
So let's make our first model with Credit_History.
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df,predictor_var,outcome_var)
We can try a different combination of variables:

predictor_var = ['Credit_History','Education','Married','Self_Employed','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
Generally we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and
cross-validation score are not getting impacted by less important variables. Credit_History is dominating the model. We
have two options now:
1. Feature Engineering: derive new information and try to predict those. I will leave this to your creativity.
2. Better modeling techniques. Let’s explore this next.
Decision Tree
A decision tree is another method for making a predictive model. It is known to provide higher accuracy than the logistic
regression model. Read more about decision trees.
model = DecisionTreeClassifier()
predictor_var = ['Credit_History','Gender','Married','Education']
classification_model(model, df,predictor_var,outcome_var)
Here the model based on categorical variables is unable to have an impact because Credit_History is dominating over
them. Let's try a few numerical variables:
predictor_var = ['Credit_History','Loan_Amount_Term','LoanAmount_log']
classification_model(model, df,predictor_var,outcome_var)
Here we observed that although the accuracy went up on adding variables, the cross-validation score went down. This is
the result of the model over-fitting the data. Let's try an even more sophisticated algorithm and see if it helps:
Random Forest
Random forest is another algorithm for solving the classification problem. Read more about random forests.
An advantage with random forest is that we can make it work with all the features, and it returns a feature importance
matrix which can be used to select features.
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender','Married','Dependents','Education','Self_Employed','Loan_Amount_Term','Credit_History','Property_Area','LoanAmount_log','TotalIncome_log']
classification_model(model, df,predictor_var,outcome_var)
Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in
two ways:
1. Reducing the number of predictors
2. Tuning the model parameters
Let's try both of these. First we see the feature importance matrix, from which we'll take the most important features.
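A sketch of pulling the importances out of the fitted model (classification_model above leaves the model fitted on the full data):

featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)  # higher values mean the feature mattered more to the forest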
Let's use the top 5 variables for creating a model. Also, we will modify the parameters of the random forest model a little bit:

model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log','LoanAmount_log','Credit_History','Dependents','Property_Area']
classification_model(model, df,predictor_var,outcome_var)
Notice that although the accuracy reduced, the cross-validation score is improving, showing that the model is generalizing
well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations
because of randomization. But the output should stay in the ballpark.
You would have noticed that even after some basic parameter tuning on the random forest, we have reached a cross-
validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very
interesting and unique learnings.
So are you ready to take on the challenge? Start your data science journey with the Loan Prediction practice problem.
End Notes
I hope this tutorial will help you maximize your efficiency when starting with data science in Python. I am sure this not only
gave you an idea about basic data analysis methods, but it also showed you how to implement some of the more
sophisticated techniques available today.
Python is really a great tool, and is becoming an increasingly popular language among data scientists. The reason
being, it's easy to learn, integrates well with other databases and tools like Spark and Hadoop, offers great
computational power, and has powerful data analytics libraries.
So, learn Python to perform the full life-cycle of any data science project. It includes reading, analyzing, visualizing and
finally making predictions.
If you come across any difficulty while practicing Python, or you have any thoughts / suggestions / feedback on the post,
please feel free to post them through comments below.
Note – The discussions of this article are going on at AV's Discuss portal.
Author
Kunal Jain
Kunal is a post graduate from IIT Bombay in Aerospace Engineering. He has spent more than 10 years in the field of Data
Science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he
has led teams of various sizes and has worked on various tools like SAS, SPSS, Qlikview, R, Python and Matlab.
This article is quite old now and you might not get a prompt response from the author. We would request you to post this
comment on the Analytics Vidhya discussion portal to get your queries resolved.
53 Comments
There is a very good book on Python for Data Analysis: O'Reilly — Python for Data Analysis
Moumita,
The book mentioned by Paritosh is a good place to start. You can also refer some of the books mentioned here:
http://www.analyticsvidhya.com/blog/2014/06/books-data-scientists-or-aspiring-ones/
Kunal
Pranesh says:
January 15, 2016 at 5:00 AM
Hi Kunal,
When are you planning to schedule the next data science meetup in Bangalore? I missed the previous session due to a
conflict.
Pranesh,
We will have a meetup some time in early March. We will announce the dates on DataHack platform and our meetup
group page.
Regards,
Kunal
Gianfranco says:
January 15, 2016 at 4:16 PM
Thank you so much Kunal, this is indeed a great start for any Python beginner.
Really appreciate your team's effort in bringing Data Science to a wider audience.
I strongly suggest "A Byte of Python" by Swaroop CH. It may be a bit old now, but it helped me get a good start in
Python.
HighSpirits says:
January 15, 2016 at 5:31 PM
Awesome!!! This is one area where I was looking for help and AV has provided it!!! Thanks a lot for the quick guide Kunal…
very much helpful…
Dr.D.K.Samuel says:
January 16, 2016 at 3:59 AM
Really well written. It will be nice if it is made available as a PDF for download (with all supporting references). I will print
it and refer to it till I learn it in full. Thanks
Hi Kunal ji,
Can you please guide a newbie who doesn't have any software background on how to acquire big data knowledge?
Is it necessary to learn SQL and Java? Before stepping into big data practically, how can I warm myself up
without getting in touch with the bias? Can you please suggest a good blog about big data for newbies?
Smrutiranjan,
Kindly post this question on our discussion portal – http://discuss.analyticsvidhya.com
Regards,
Kunal
Falkor says:
January 18, 2016 at 9:41 PM
This was good until the fact hit me: I am using IDLE and don't have the libraries installed. Now, how do I get
Pandas, NumPy etc. installed for IDLE on Windows!?
It's been a long, complicated browsing session. The only solution I seem to get is to ditch IDLE and move to Spyder, or move
to Python 3.5 altogether.
Digvijay says:
March 7, 2016 at 3:02 PM
I suggest installing Anaconda. It's better to start with, as it contains most of the commonly used libraries for data analysis.
Once Anaconda is up and working, you can use any IDE of your choice.
Falkor says:
January 19, 2016 at 1:26 AM
Got Pandas to finally install and work, here is how I did it. Just in case it helps somebody else.
Open Command prompt and point to the path or open the path
Restart the system, just for the heck of it. To be on safer side.
In Command prompt, set the path using this: C:\users\yourname>set PATH = %PATH%;C:\python27\scripts
Erik says:
January 19, 2016 at 9:20 AM
Thank you for a really comprehensive post. Personally, I mainly use Python for creating psychology experiments, but
I would like to start doing some analysis with Python too (right now I mainly use R). Some of the libraries (e.g., Seaborn) were
new to me.
gt_67 says:
January 19, 2016 at 12:15 PM
Hello!
I can't get this piece of code to work:
I checked the null values of the columns "LoanAmount", "Self_Employed" and "Education" and nothing wrong shows up –
614 values, the same as other full columns.
Mohamed says:
January 20, 2016 at 1:02 AM
Mr. gt_67,
I have the same error. Do you have any idea what it could be?
If Kunal can help understand and fix this piece of code, that would be great.
dignity says:
February 29, 2016 at 6:14 AM
Missing values were already replaced by the mean earlier with this line of code (the first way):
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
This is a great, great resource. Thanks Kunal. But let me ask out of curiosity: is this how data scientists work, I mean
using a command line to get insight from the data? Isn't there a GUI for Python so you can be more productive?
Kishore says:
January 19, 2016 at 12:51 PM
Hi Kunal,
Thanks for the excellent tutorial using python. It would be great if you could do a similar tutorial using R.
Regards,
Kishore
Thank you Kunal for a really comprehensive tutorial on doing data science in Python! I really appreciated the list of
libraries. Really useful. I have myself started to look more and more at doing data analysis with Python. I have tested
Pandas some, and your exploratory analysis with Pandas part was also helpful.
Venu says:
January 24, 2016 at 5:39 AM
Good One
Hemanth says:
January 24, 2016 at 3:00 PM
Is there a Python library for performing OCR on PDF files? Or for converting a raw scanned PDF to a 'searchable' PDF? To
perform text analytics…
Abhi says:
January 24, 2016 at 4:38 PM
Hey, great article. I find myself getting hiccups the moment probability and statistics start appearing. Can you suggest a
book that takes me through these easily, just like this tutorial? Both of these seem to be the lifeline of ML.
Deepak says:
January 31, 2016 at 5:51 PM
Hey Kunal
I'm trying to follow your lesson, however I am stuck at reading the CSV file. I'm using iPython and trying to read it. I am
following the syntax that you have provided but it still doesn't work.
Can you please help me if it's possible? I would really appreciate it.
Thanks
Deepak
Deepak says:
February 2, 2016 at 2:21 AM
Hello Kunal, I have started your tutorial but I am having difficulty importing pandas and opening the csv file.
Do you mind assisting me?
Thanks
Deepak,
What is the problem you are facing? Can you attach a screenshot?
Also, tell me which OS are you working on and which Python installation are you working on?
Regards,
Kunal
Deepak says:
February 3, 2016 at 4:14 PM
Hey, thanks for replying. No, I do not think I can attach a screenshot on this blog wall. I would love to email it to you, but
I do not have your email address.
The problem I am having is trying to open the .csv file (train). I have opened pylab inline.
My code is like this:
df = pd.read_csv("/Desktop/Studying_Tools/AV/train.csv")
I'm using Anaconda
iPython Notebook (Jupyter) – version 4.0.4
I'm running it on my Windows 8 laptop
Jaini says:
February 7, 2016 at 12:35 AM
Hi Kunal
Sincere apologies for a very basic question. I have installed Python per the above instructions. Unfortunately, I am unable to
launch iPython notebook. I have spent hours, but I guess I am missing something. Could you please kindly guide?
Thank you
Jaini
Jaini,
What is the error you are getting? Which OS are you on? And what happens when you type ipython notebook in shell /
terminal / cmd?
Regards,
Kunal
Nice article!
A few remarks:
1- "--pylab=inline" is not recommended any more. Use "%matplotlib inline" for each notebook.
2- You can start a jupyter server using “jupyter notebook” instead of “ipython notebook”. For me, notebooks open faster
that way.
3- For plotting, use “import matplotlib.pyplot as plt”.
Regards
woiski
Jaini says:
February 8, 2016 at 1:11 AM
Thank you. I sincerely appreciate your instant response. I just reinstalled and went through command prompt and it
worked.
ngnikhilgoyal says:
February 20, 2016 at 9:45 PM
It would be good if you explained the code as you went along the exercise. For someone unfamiliar with some of the
methods and functions, it is difficult to understand why you are doing certain things. For e.g.: while creating the pivot
table, you introduced aggfunc=lambda x: x.map({'Y':1,'N':0}).mean() without explaining it. Intuitively I know you are coding
Y as 1 and N as 0 and taking the mean of each, but you still need to explain what lambda x: x.map(...) does.
Olga says:
February 26, 2016 at 1:06 PM
There seems to be a bit of confusion when you plot the histogram. A histogram, by definition, is a plot of the occurrence
frequency of some variable. So, when you do the manipulation with ApplicantIncome, transforming it to TotalIncome by
adding CoapplicantIncome, the outcome does not affect the histogram of LoanAmount, because the outcome of this
manipulation does not change the occurrence frequency or the values of LoanAmount. If you compare both of your
plots, they will look exactly the same, for the reason mentioned above. So it would probably be better to correct this part of
the article.
Thanks
Vlad says:
February 29, 2016 at 11:13 PM
Hi Kunal – first off, thanks for this informative tutorial. Great stuff. Unfortunately I'm unable to download the dataset – I
need to be signed up on AV, and I get an invalid request on signup. Thank you again for this material.
Dorinel says:
March 10, 2016 at 1:28 PM
Hi Kunal,
Don't you give us access to the data set any more? I am reading your tutorial and want to repeat your steps for data
analysis!
Thanks,
Dorinel
Sam says:
March 11, 2016 at 10:43 AM
Any ideas?
Thanks In Advance
Marc says:
March 12, 2016 at 12:59 AM
Thanks for this. Is there a way to get access to the dataset that was used for this? It seems like it became unavailable from
March 7!
Really great and I would start following – I am a new entry to the data analysis stream.
Harneet says:
July 28, 2016 at 4:20 AM
Hi Kunal,
I have been trying to get some validations in Python for logistic regression as available in SAS, like Area Under Curve,
Concordant, Discordant and Tied pairs, Gini Value etc. But I am unable to find them through Google; whatever I was able
to find was very confusing.
Regards,
Harneet.
Hello,
Very good article. I just stumbled upon one piece of code where I am not quite sure whether I am misinterpreting the
arguments, or whether there is truly a mistake in your code. It is the following:
metrics.accuracy_score(predictions,data[outcome])
Isn't "predictions" the predicted values, which should be passed as the argument "y_pred" of the accuracy_score method,
and isn't "data[outcome]" the real values, which should be associated with the argument "y_true"?
If that is so, then I think the order of passing the arguments is wrong, because the method is defined (according to the
docs) as accuracy_score(y_true, y_pred[, ...]) – that means y_true comes as the 1st argument. You have it
the other way around.
Best regards,
Peter
Nicola says:
September 4, 2016 at 8:36 AM
Hi!
And thank you very much for your tutorial
Unfortunately there is no way to find the .csv file for the loan prediction problem in
https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
gopalankailash says:
October 28, 2016 at 3:49 PM
The amount of effort you guys put into these articles is a true inspiration for folks like me to learn!
Thanks for all this!