CS3362 Foundations of Data Science Lab Manual


Computer Science and Engineering (Anna University)


EX.NO.4. READING DATA FROM TEXT FILES, EXCEL AND THE WEB
DATE:

Aim:
To read data from text files, Excel files and the web using the pandas package.

ALGORITHM:
STEP 1: Start the program.
STEP 2: To read data from a CSV file using the pandas package.
STEP 3: To read data from an Excel file using the pandas package.
STEP 4: To read data from an HTML page using the pandas package.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
DATA INPUT AND OUTPUT

This notebook is the reference code for data input and output. pandas can read a variety of file
types using its pd.read_ methods. Let's take a look at the most common data types:

import numpy as np
import pandas as pd

CSV

CSV INPUT:
df = pd.read_csv('example')
df

a b c d

0 0 1 2 3

1 4 5 6 7

2 8 9 10 11

3 12 13 14 15

CSV OUTPUT:
df.to_csv('example',index=False)

EXCEL

pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or
images; a workbook that contains images or macros may cause the read_excel method to crash.

EXCEL INPUT :
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

a b c d

0 0 1 2 3

1 4 5 6 7

2 8 9 10 11

3 12 13 14 15

EXCEL OUTPUT :
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt
run:

pip install lxml


pip install html5lib==1.1
pip install BeautifulSoup4

Then restart Jupyter Notebook. (or use conda install)

pandas can read tables off of HTML pages.

For example:

HTML INPUT

Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list'

df = pd.read_html(url)

df[0]

match = "Metcalf Bank"

df_list = pd.read_html(url, match=match)

df_list[0]

HTML OUTPUT:
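No output listing is given here; as a hedged sketch, the scraped table can be written back to an
HTML file with pandas' to_html method (the file name below is only illustrative):

# Write the first scraped table back out as an HTML file
# (the file name 'failed_banks.html' is an illustrative assumption)
df_list[0].to_html('failed_banks.html', index=False)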

RESULT:
Thus the commands for reading data from CSV files, Excel files and HTML pages were
executed successfully.


EX.NO:4(a). EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE ANALYTICS ON THE IRIS DATA SET.
DATE:

AIM:
To explore various commands for doing descriptive analytics on the Iris data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To understand idea behind Descriptive Statistics.
STEP 3: Load the packages we will need and also the `iris` dataset.
STEP 4: load_iris() returns an object containing the iris dataset, which is stored in
`iris_obj`.
STEP 5: Basic statistics: count, mean, median, min, max
STEP 6: Display the output.
STEP 7: Stop the program.
PROGRAM:
import pandas as pd

from pandas import DataFrame

from sklearn.datasets import load_iris

# sklearn.datasets includes common example datasets

# A function to load in the iris dataset

iris_obj = load_iris()

# Dataset preview

iris_obj.data

iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names,
                 index=pd.Index([i for i in range(iris_obj.data.shape[0])])).join(
       DataFrame(iris_obj.target, columns=pd.Index(["species"]),
                 index=pd.Index([i for i in range(iris_obj.target.shape[0])])))

iris  # prints the iris data

Commands

iris_obj.feature_names

iris.count()

iris.mean()

iris.median()
Downloaded by Jegatheeswari ic37721 (ic37721@imail.iitm.ac.in)
lOMoARcPSD|28265006

iris.var()

iris.std()

iris.max()

iris.min()

iris.describe()
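As an optional extra step (not part of the original listing), the same statistics can be broken
down per species with groupby; 'species' is the target column joined above:

# Per-species summary statistics (assumed extra step)
iris.groupby("species").mean()
iris.groupby("species").describe()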

OUTPUT:

RESULT:
Exploring various commands for doing descriptive analytics on the Iris data set was
successfully executed.


EX.NO:5. USE THE DIABETES DATA SET FROM UCI AND PIMA INDIANS DIABETES DATA SET FOR PERFORMING THE FOLLOWING:
DATE:

A) UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE, VARIANCE, STANDARD DEVIATION, SKEWNESS AND KURTOSIS.
AIM:
To explore various commands for doing Univariate analytics on the UCI AND PIMA
INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download the UCI AND PIMA INDIANS DIABETES data set using Kaggle.
STEP 3: To read data from UCI AND PIMA INDIANS DIABETES data set.
STEP 4: To find the mean, median, mode, variance, standard deviation, skewness and
kurtosis of the given data set using pandas.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/kirub/Documents/Learning/Untitled Folder/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
df.dtypes['Outcome']
df.info()
df.describe().T

# Frequency: finding the count of each unique value


df1 = df['Outcome'].value_counts()

# displaying df1
print(df1)
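As a small assumed addition, value_counts can also report relative frequencies:

# Relative frequency (proportion) of each outcome class
df['Outcome'].value_counts(normalize=True)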
#mean
df.mean()
#median
df.median()

#mode
df.mode()
#Variance
df.var()
#standard deviation
df.std()
#kurtosis
df.kurtosis(axis=0,skipna=True)
df['Outcome'].kurtosis(axis=0,skipna=True)
#skewness
# skewness along the index axis, skipping NA values
df.skew(axis=0, skipna=True)

# find skewness in each row
df.skew(axis=1, skipna=True)

#Pregnancy variable
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(
    np.round(preg_proportion / sum(preg_proportion), 3) * 100, dtype=int)

preg = pd.DataFrame({'month': preg_month,
                     'count_of_preg_prop': preg_proportion,
                     'percentage_proportion': preg_proportion_perc})
preg.set_index(['month'],inplace=True)
preg.head(10)

sns.countplot(x='Outcome', data=df)

sns.distplot(df['Pregnancies'])

sns.boxplot(data=df['Pregnancies'])


OUTPUT:

RESULT:
Exploring various commands for doing univariate analytics on the UCI Pima Indians
Diabetes data set was successfully executed.


EX.NO:5.B) BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION MODELING
DATE:
AIM:
To explore linear and logistic regression modelling on the USA Housing and UCI Pima
Indians Diabetes data sets.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download a data set, such as the housing data set, from Kaggle.
STEP 3: To read data from the downloaded data set.
STEP 4: To fit linear and logistic regression models on the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
BIVARIATE ANALYSIS GENERAL PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')

fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.countplot('Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})

plt.tight_layout()

plot01=sns.countplot('Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10 = sns.distplot(df['Pregnancies'],ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plt.tight_layout()

plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-Diab.')
plot11_2 = df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6') # for legend title
plt.tight_layout()

plot20 = sns.boxplot(df['Pregnancies'],ax=axes[2][0],orient='v')
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()

plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})

axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()

OUTPUT:

## Blood Pressure variable

fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))


plot00=sns.distplot(df['BloodPressure'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01 = sns.distplot(df[df['Outcome']==False]['BloodPressure'], ax=axes[0][1],
                      color='green', label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'], ax=axes[0][1], color='red', label='Diab')


axes[0][1].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][1].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10=sns.boxplot(df['BloodPressure'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('BP',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='BloodPressure',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()

OUTPUT:


fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))

plot0=sns.distplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1=sns.boxplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('BloodPressure',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()

OUTPUT:


LINEAR REGRESSION MODELLING ON HOUSING DATASET

# Data manipulation libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()

USAhousing.columns
sns.pairplot(USAhousing)

sns.distplot(USAhousing['Price'])

sns.heatmap(USAhousing.corr())

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age',
                'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# print the intercept
print(lm.intercept_)

coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df

predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)

sns.distplot((y_test-predictions),bins=50);

from sklearn import metrics


print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
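As an optional extra metric (not in the original listing), the coefficient of determination can
also be reported with sklearn's r2_score:

# R-squared of the fitted model on the test data (assumed extra step)
print('R2:', metrics.r2_score(y_test, predictions))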


OUTPUT:


LOGISTIC REGRESSION MODELLING ON PIMA DIABETES

# Data manipulation libraries


import numpy as np
import pandas as pd

###scikit Learn Modules needed for Logistic Regression


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


#for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')

df=pd.read_csv('C:/Users/diabetes.csv')

df.head()

df.tail()

df.isnull().sum()

df.describe(include='all')

df.corr()

sns.heatmap(df.corr(),annot=True)
plt.show()

df.hist()
plt.show()

sns.countplot(x=df['Outcome'])

scaler=StandardScaler()
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age']
df[cols] = scaler.fit_transform(df[cols])

df_new = df

# Train & Test split


x_train, x_test, y_train, y_test = train_test_split(
    df_new[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
            'BMI', 'DiabetesPedigreeFunction', 'Age']],
    df_new['Outcome'], test_size=0.20, random_state=21)

print('Shape of Training Xs:{}'.format(x_train.shape))


print('Shape of Test Xs:{}'.format(x_test.shape))
print('Shape of Training y:{}'.format(y_train.shape))
print('Shape of Test y:{}'.format(y_test.shape))

Shape of Training Xs:(614, 8)


Shape of Test Xs:(154, 8)
Shape of Training y:(614,)
Shape of Test y:(154,)

# Build Model
model = LogisticRegression()
model.fit(x_train, y_train)
y_predicted = model.predict(x_test)

score=model.score(x_test,y_test);
print(score)

0.7337662337662337

#Confusion Matrix
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix
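As an assumed extra step, the confusion matrix can be visualised as a heatmap and the
per-class precision and recall printed with classification_report:

# Optional visualisation of the confusion matrix and per-class metrics
from sklearn.metrics import classification_report
sns.heatmap(cnf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
print(classification_report(y_test, y_predicted))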


OUTPUT:


RESULT:
Exploring various commands for doing bivariate analytics on the USA Housing and Pima
Indians Diabetes data sets was successfully executed.


EX.NO:5.C) MULTIPLE REGRESSION ANALYSIS
DATE:
AIM:
To explore various commands for doing multiple regression analysis on the USA Housing
data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download the housing data set using Kaggle.
STEP 3: To read data from the downloaded data set.
STEP 4: To perform multiple regression analysis on the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Data manipulation libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()

USAhousing.columns
sns.pairplot(USAhousing)
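The listing above stops at exploratory plots; a minimal sketch of the multiple regression fit
itself, assuming the same USA_Housing.csv columns used in experiment 5(b):

# Fit a multiple linear regression with several predictors (sketch)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age',
                'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

lm = LinearRegression()
lm.fit(X_train, y_train)
print(lm.intercept_)
print(pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient']))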


OUTPUT:


RESULT:

Thus the multiple regression analysis using the housing data set was executed successfully.


EX.NO:5.D) COMPARE THE RESULTS OF THE ABOVE ANALYSIS FOR THE TWO DATA SETS.
DATE:

AIM:
To explore various commands for comparing the results of the above analysis for the
two data sets.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download the UCI AND PIMA INDIANS DIABETES data set using Kaggle.
STEP 3: To read data from UCI AND PIMA INDIANS DIABETES data set.
STEP 4: To compare the two data sets using various commands.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Glucose Variable
df.Glucose.describe()

#sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.distplot(df['Glucose'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01 = sns.distplot(df[df['Outcome']==False]['Glucose'], ax=axes[0][1],
                      color='green', label='Non Diab.')
sns.distplot(df[df.Outcome==True]['Glucose'], ax=axes[0][1], color='red', label='Diab')
axes[0][1].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10=sns.boxplot(df['Glucose'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()

plot11=sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])

axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})


axes[1][1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()

plt.show()

fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))

plot0=sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1=sns.boxplot(df[df['Glucose']!=0]['Glucose'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
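The listing above examines only the Glucose variable; as a hedged sketch, the two data sets can
also be compared directly by placing their summary statistics side by side (file paths as in the
earlier experiments):

# Compare descriptive statistics of the two data sets (assumed step)
diab = pd.read_csv('C:/Users/diabetes.csv')
house = pd.read_csv('USA_Housing.csv')

print(diab.describe().T)    # Pima Indians diabetes summary
print(house.describe().T)   # USA housing summary

# Skewness and kurtosis of each numeric column, per data set
print(diab.skew(numeric_only=True))
print(house.skew(numeric_only=True))
print(diab.kurtosis(numeric_only=True))
print(house.kurtosis(numeric_only=True))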


OUTPUT:

RESULT:

Thus the comparison of the above analysis for the two data sets was executed successfully.


EX.NO:6. APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON UCI DATA SETS.
DATE:

AIM:
To apply and explore various plotting functions on UCI datasets.

ALGORITHM:

STEP 1: Install seaborn package and import the package.


STEP 2: Normal curves, density or contour plots, correlation and scatter plots, and
histogram plots are visualized.
STEP 3: 3D plotting is done using the plotly package.
STEP 4: Stop the program.
PROGRAM:

A. NORMAL CURVES

#seaborn package
import seaborn as sns
flights = sns.load_dataset("flights")
flights.head()
may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")
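The flights example above draws an ordinary line plot; as a hedged sketch of an actual normal
(Gaussian) curve, assuming scipy is available:

# Standard normal curve over the range [-4, 4] (assumed example)
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-4, 4, 200)
plt.plot(x, norm.pdf(x, loc=0, scale=1))
plt.title('Normal curve (mean 0, std 1)')
plt.show()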

OUTPUT:


B. DENSITY AND CONTOUR PLOTS

iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)
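The call above plots a one-dimensional density for every numeric column; for an actual contour
plot, a two-variable KDE can be drawn (column names are those of the seaborn iris dataset):

# Bivariate density contours of sepal_length vs sepal_width (assumed example)
sns.kdeplot(data=iris, x="sepal_length", y="sepal_width", fill=True)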

OUTPUT:

C. CORRELATION AND SCATTER PLOTS

#correlation visualized using the heatmap function

df = sns.load_dataset("titanic")
ax = sns.heatmap(df.select_dtypes('number').corr(), annot=True, fmt=".2f")

#scatter plot of a categorical variable


df = sns.load_dataset("titanic")
sns.catplot(data=df, x="age", y="class")

OUTPUT:


D. HISTOGRAMS

#histogram of a dataframe column

df = sns.load_dataset("titanic")
sns.histplot(data=df, x="age")

OUTPUT:

E. THREE DIMENSIONAL PLOTTING

#3d plotting using the plotly package
import plotly.express as px

df = sns.load_dataset("iris")

# column and species names below match the seaborn iris dataset
px.scatter_3d(df, x="petal_length", y="petal_width", z="sepal_width",
              size="sepal_length", color="species",
              color_discrete_map={"setosa": "blue", "versicolor": "violet",
                                  "virginica": "pink"})

OUTPUT:


RESULT:

Thus the various exploring visual plots are successfully executed.


EX.NO:7. VISUALIZING GEOGRAPHIC DATA WITH BASEMAP


DATE:

AIM:

To visualize geographic data with Basemap using Google Colab.

ALGORITHM:

STEP 1: Install the basemap package

Install the below package:


Use Google Colab (in the Anaconda prompt the conda version may need to change, which
may affect the compatibility of other packages).
pip install basemap
(or)
conda install -c https://conda.anaconda.org/anaconda basemap

STEP 2: Explore various projection options, for example ortho and lcc.


STEP 3: Mark the location using longitude and latitude

PROGRAM:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

OUTPUT:


fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting


x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

OUTPUT:

from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)

    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)

    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')

fig = plt.figure(figsize=(8, 6), edgecolor='w')



m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)

OUTPUT:

fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)

OUTPUT:

RESULT:

Thus the visualization of geographic data with Basemap was successfully executed.
