TWO
2.1 Ponder Over Questions: Quora
Note: Sharpening the knife longer can make it easier to hack the firewood – old Chinese proverb
There is an old Chinese proverb that says ‘sharpening the knife longer can make it easier to hack the firewood’. In other words, take extra time to get it right in the preparation phase and then the work will be easier. So it is worth taking several minutes to think about which programming language is better for you. When you Google it, you will get many useful results. Here is some valuable information from Quora:
R

Advantages:
• great for prototyping
• great for statistical analysis
• nice IDE

Disadvantages:
• syntax can be obscure
• library documentation isn’t always user friendly
• harder to integrate into a production workflow

Python

Advantages:
• great for scripting and automating your different data mining pipelines
• integrates easily into a production workflow
• can be used across different parts of your software engineering team
• the scikit-learn library is awesome for machine learning tasks
• IPython is also a powerful tool for exploratory analysis and presentations

Disadvantages:
• it isn’t as thorough for statistical analysis as R
• the learning curve is steeper than R’s, since you can do much more with Python
2.3 My Opinions
In my opinion, R and Python are both good choices, since they are open-source software (open-source is always good in my eyes) and free to download. If you are a beginner without any programming experience who only wants to do some data analysis, I would definitely suggest using R. Otherwise, I would suggest using both.
THREE
GETTING STARTED
Note: Good tools are prerequisite to the successful execution of a job – old Chinese proverb
Let’s keep sharpening our tools. A good programming platform can save you a lot of trouble and time. Herein I will only present how to install my favorite programming platforms for R and Python, and only show the easiest way I know to install them on a Linux system. If you want to install them on another operating system, you can Google it. In this section, you will learn how to install R, Python and the corresponding programming platforms and packages.
• Installing R
Go to the Ubuntu Software Center and follow these steps:
1. Open Ubuntu Software Center
2. Search for r-base
3. And click Install
Or open your terminal and use the following commands:
sudo apt-get update
sudo apt-get install r-base
• Installing Python
Go to the Ubuntu Software Center and follow these steps:
1. Open Ubuntu Software Center
2. Search for python
3. And click Install
Or open your terminal and use the following commands:
sudo apt-get install build-essential checkinstall
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev \
    libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev
My favorite programming platform for R is definitely the RStudio IDE, and for Python it is Eclipse + PyDev.
• Installing RStudio
Go to the Ubuntu Software Center and follow these steps:
1. Open Ubuntu Software Center
2. Search for RStudio
3. And click Install
• Installing Eclipse + Pydev
• Installing Eclipse
Go to the Ubuntu Software Center and follow these steps:
1. Open Ubuntu Software Center
2. Search for Eclipse
3. And click Install
• Installing Pydev
1. Open Eclipse
2. Go to Eclipse Marketplace
3. Search for Pydev
4. And click PyDev - Python IDE for Eclipse
Here is the video tutorial for installing PyDev for Eclipse on YouTube: PyDev on YouTube
• Installing numpy
pip install numpy
• Installing pandas
pip install pandas
• Installing scikit-learn
pip install scikit-learn
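To verify that the installations worked, here is a quick check from Python (a minimal sketch; the printed version numbers will differ on your machine):

# confirm the core data mining stack imports correctly
import numpy
import pandas
import sklearn

print(numpy.__version__)
print(pandas.__version__)
print(sklearn.__version__)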
Figure 3.1: Top 20 R Machine Learning and Data Science packages. From
http://www.kdnuggets.com/2015/06/top-20-r-machine-learning-packages.html
The following are the best Python modules for data mining from KDnuggets; you may also want to install all of them.
1. Basics
• numpy - numerical library, http://numpy.scipy.org/
• scipy - Advanced math, signal processing, optimization, statistics, http://www.scipy.org/
• matplotlib, python plotting - Matplotlib, http://matplotlib.org
2. Machine Learning and Data Mining
• MDP, a collection of supervised and unsupervised learning algorithms,
http://pypi.python.org/pypi/MDP/2.4
• mlpy, Machine Learning Python, http://mlpy.sourceforge.net
• NetworkX, for graph analysis, http://networkx.lanl.gov/
• Orange, Data Mining Fruitful & Fun, http://biolab.si
• pandas, Python Data Analysis Library, http://pandas.pydata.org
• pybrain, http://pybrain.org
• scikit-learn - classic machine learning algorithms - provides simple and efficient solutions to learning problems, http://scikit-learn.org/stable/
3. Natural Language
• NLTK, Natural Language Toolkit, http://nltk.org
4. For web scraping
• Scrapy, An open source web scraping framework for Python, http://scrapy.org
• urllib/urllib2
Herein I would like to add two more important packages, Theano for deep learning and textmining for text mining:
• Theano, deep learning, http://deeplearning.net/tutorial/
• textmining, text mining, https://pypi.python.org/pypi/textmining/1.0
FOUR
PRE-PROCESSING PROCEDURES
Note: Know yourself and know your enemy, and you will never be defeated – idiom, from Sunzi’s Art
of War
4.1 Procedures
Data mining is a complex process that aims to discover patterns in large data sets, starting from a collection of existing data. In my opinion, data mining contains four main steps:
1. Collecting data: This is a complex step. I will assume we have already gotten the datasets.
2. Pre-processing: In this step, we need to try to understand our data, denoise, do dimension reduction and select proper predictors, etc.
3. Feeding data mining: In this step, we need to use our data to feed our model.
4. Post-processing: In this step, we need to interpret and evaluate our model.
In this section, we will try to know our enemy – the datasets. We will learn how to load data, how to understand data with statistical methods and how to understand data with visualization. Next, we will start with Loading Datasets for the Pre-processing.
The datasets for this tutorial are available to download: Heart, Energy Efficiency. Those data are from my course materials; the copyrights belong to the original authors.
There are two main data formats, “.csv” and “.xlsx”. We will show how to load those two types of data in R and Python, respectively.
1. Loading datasets in R
• Loading “*.csv” format data (a Python sketch follows; the complete R listing appears at the end of this chapter)
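As a minimal Python sketch of the same step (the Heart.csv path and the energy_efficiency.xlsx file name are taken from the full source listing at the end of this chapter, so adjust them to your own copies):

import pandas as pd

# load the ".csv" dataset (adjust the path to your own copy)
rawdata = pd.read_csv('~/Dropbox/MachineLearningAlgorithms/python_code/data/Heart.csv')

# load the ".xlsx" dataset; sheet 0 holds the data
# (older pandas versions spelled the argument "sheetname")
rawdataEnergy = pd.read_excel('energy_efficiency.xlsx', sheet_name=0)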
After we get the data in hand, we can try to understand it. I will use the “Heart.csv” dataset as an example to demonstrate how to use those statistical methods.
1. Summary of the data
It is always good to have a glance over the summary of the data, since from the summary you will learn some statistical features of your data, and you will also know whether your data contains missing values or not.
• Summary of the data in R
summary(rawdata)
# dimensions of the data
nrow=nrow(rawdata)
ncol=ncol(rawdata)
c(nrow, ncol)
# class of each column
sapply(rawdata, class)
Column names:
[’Age’, ’Sex’, ’ChestPain’, ’RestBP’, ’Chol’, ’Fbs’, ’RestECG’, ’MaxHR’,
’ExAng’, ’Oldpeak’, ’Slope’, ’Ca’, ’Thal’, ’AHD’]
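A hedged Python counterpart of the summary step, assuming rawdata is the DataFrame loaded earlier:

# summary statistics, dimensions, column names and types in pandas
print(rawdata.describe())     # analogue of R's summary()
print(rawdata.shape)          # analogue of c(nrow, ncol)
print(list(rawdata.columns))  # the column names printed above
print(rawdata.dtypes)         # analogue of sapply(rawdata, class)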
You can use a similar way to check the last part of the data; for simplicity, I will skip it.
6. Correlation Matrix
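The matrix below was produced with pandas; a minimal sketch of the call (on newer pandas the non-numeric columns must be excluded explicitly):

# pairwise Pearson correlations of the numeric predictors
print(rawdata.corr(numeric_only=True))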
Age Sex RestBP Chol Fbs RestECG MaxHR \
Age 1.000000 -0.097542 0.284946 0.208950 0.118530 0.148868 -0.393806
Sex -0.097542 1.000000 -0.064456 -0.199915 0.047862 0.021647 -0.048663
RestBP 0.284946 -0.064456 1.000000 0.130120 0.175340 0.146560 -0.045351
Chol 0.208950 -0.199915 0.130120 1.000000 0.009841 0.171043 -0.003432
Fbs 0.118530 0.047862 0.175340 0.009841 1.000000 0.069564 -0.007854
RestECG 0.148868 0.021647 0.146560 0.171043 0.069564 1.000000 -0.083389
MaxHR -0.393806 -0.048663 -0.045351 -0.003432 -0.007854 -0.083389 1.000000
ExAng 0.091661 0.146201 0.064762 0.061310 0.025665 0.084867 -0.378103
Oldpeak 0.203805 0.102173 0.189171 0.046564 0.005747 0.114133 -0.343085
Slope 0.161770 0.037533 0.117382 -0.004062 0.059894 0.133946 -0.385601
Ca 0.362605 0.093185 0.098773 0.119000 0.145478 0.128343 -0.264246
7. Covariance Matrix
• Computing covariance matrix in R
# get numerical data and remove NaN
numdata=na.omit(rawdata[,c(1:2,4:12)])
cov(numdata)
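A hedged Python counterpart, where select_dtypes and dropna play the roles of the column slice and na.omit:

# covariance matrix of the numeric columns, dropping rows with missing values
numdata = rawdata.select_dtypes('number').dropna()
print(numdata.cov())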
A picture is worth a thousand words. You will see the powerful impact of the figures in this section.
1. Summary plot of data in figure
• Summary plot in R
# plot of the summary
plot(rawdata)
Then you will get Figure Summary plot of the data with R.
Then you will get Figure Summary plot of the data with Python.
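The Python code for this figure is not shown above; here is a minimal sketch using pandas’ scatter_matrix (an assumption, as the original may have used a different call):

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# scatter-plot matrix of the numeric predictors, analogous to R's plot(rawdata)
scatter_matrix(rawdata.select_dtypes('number'), figsize=(10, 10))
plt.show()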
2. Histogram of the quantitative predictors
• Histogram in R
Then you will get Figure Histogram with normal curve plot in R (the full plotting loop appears in the R source code at the end of this section).
• Histogram in Python
# Histogram
rawdata.hist()
plt.show()
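For a histogram with a fitted normal curve like the R loop in the source code below, a hedged single-column sketch:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# histogram of one predictor with a normal curve fitted to its mean and sd
x = rawdata['Age'].dropna()
plt.hist(x, bins=10, density=True, color='blue', alpha=0.5)
xfit = np.linspace(x.min(), x.max(), 40)
plt.plot(xfit, norm.pdf(xfit, loc=x.mean(), scale=x.std()), lw=2)
plt.xlabel('Age')
plt.show()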
• Boxplot in R
dev.off()
name=colnames(numdata)
Nvars=ncol(numdata)
# boxplot
par(mfrow =c (4,3))
for (i in 1:Nvars)
{
#boxplot(numdata[,i]~numdata[,Nvars],data=data,main=name[i])
boxplot(numdata[,i],data=numdata,main=name[i])
}
• Boxplot in Python
# boxplot
pd.DataFrame.boxplot(rawdata)
plt.show()
Then you will get Figure Correlation Matrix plot in R (the corrplot calls appear in the R source code below). More information about the visualization methods of corrplot can be found at: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
• Correlation Matrix plot in Python
# correlation Matrix plot
# corr() only returns a DataFrame; matshow renders it so plt.show() has a figure
corr = rawdata.corr(numeric_only=True)
plt.matshow(corr)
plt.show()
Then you will get Figure Correlation Matrix plot in Python.
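If you prefer labeled cells, seaborn (already used in the full Python listing below) offers a heatmap; a hedged sketch:

import seaborn as sns

# annotated correlation heatmap of the numeric predictors
sns.heatmap(rawdata.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()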
The code for this section is available for download: for R, for Python.
• R Source code
rm(list = ls())
# set the environment
path ='~/Dropbox/MachineLearningAlgorithms/python_code/data/Heart.csv'
rawdata = read.csv(path)
plot(rawdata)
dim(rawdata)
head(rawdata)
tail(rawdata)
colnames(rawdata)
attach(rawdata)
# get numerical data and remove NaN (needed before cor/cov)
numdata=na.omit(rawdata[,c(1:2,4:12)])
cor(numdata)
cov(numdata)
dev.off()
# load correlation Matrix plot lib
library(corrplot)
M <- cor(numdata)
#par(mfrow =c (1,2))
#corrplot(M, method = "square")
corrplot.mixed(M)
nrow=nrow(rawdata)
ncol=ncol(rawdata)
c(nrow, ncol)
Nvars=ncol(numdata)
# checking data format
typeof(rawdata)
install.packages("mlbench")
library(mlbench)
sapply(rawdata, class)
dev.off()
name=colnames(numdata)
Nvars=ncol(numdata)
# boxplot
par(mfrow =c (4,3))
for (i in 1:Nvars)
{
#boxplot(numdata[,i]~numdata[,Nvars],data=data,main=name[i])
boxplot(numdata[,i],data=numdata,main=name[i])
}
for (i in 1:Nvars)
{
x<- numdata[,i]
h<-hist(x, breaks=10, freq=TRUE, col="blue", xlab=name[i],main=" ",
font.lab=1)
axis(1, tck=1, col.ticks="light gray")
axis(1, tck=-0.015, col.ticks="black")
axis(2, tck=1, col.ticks="light gray", lwd.ticks=1)
axis(2, tck=-0.015)
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
}
# histograms of every variable with ggplot2, using its built-in diamonds data
library(reshape2)
library(ggplot2)
d <- melt(diamonds[,-c(2:4)])
ggplot(d,aes(x = value)) +
  facet_wrap(~variable,scales = "free_x") +
  geom_histogram()
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

if __name__ == '__main__':
    path = '~/Dropbox/MachineLearningAlgorithms/python_code/data/Heart.csv'
    rawdata = pd.read_csv(path)
    # Histogram
    rawdata.hist()
    plt.show()
    # boxplot
    pd.DataFrame.boxplot(rawdata)
    plt.show()
    path = ('/home/feng/Dropbox/MachineLearningAlgorithms/python_code/data/'
            'energy_efficiency.xlsx')
    # older pandas spelled the argument "sheetname"
    rawdataEnergy = pd.read_excel(path, sheet_name=0)
    print(rawdata[['Age', 'Ca']].corr())
    # correlation Matrix plot; matshow renders the DataFrame as a figure
    plt.matshow(rawdata.corr(numeric_only=True))
    plt.show()
    # define colors list, to be used to plot survived either red (=0) or green (=1)
    colors = ['red', 'green']
    # rawdata.info()
    # plt.savefig('attribute_correlations.png', tight_layout=True)
    # distribution plot of a single attribute
    attr = rawdata['Age']
    sns.distplot(attr)
    plt.show()