CS3361-DATA SCIENCE LAB MANUAL
NUMPY:
NumPy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object and tools for working with these arrays. It is the fundamental
package for scientific computing with Python. Besides its obvious scientific uses, NumPy can
also be used as an efficient multidimensional container of generic data.
Features:
High-performance N-dimensional array object.
It contains tools for integrating code from C/C++ and FORTRAN.
It contains a multidimensional container for generic data.
Additional linear algebra, Fourier transforms, and random number capabilities.
It provides broadcasting functions.
It has data type definition capability to work with varied databases.
Sample Program:
import numpy as np
a=np.array([1,2,3])
print(a)
OUTPUT:
[1 2 3]
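The broadcasting feature listed above can be illustrated with a short sketch. The array values are made up for illustration and are not part of the manual; the point is only that NumPy applies the smaller operand across the larger one without an explicit loop.

```python
import numpy as np

# A 2x3 matrix and a 1-D row vector (illustrative values, not from the manual).
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# Broadcasting: the row vector is added to every row of the matrix.
shifted = matrix + row

# A scalar is broadcast to every element.
doubled = matrix * 2

print(shifted)
print(doubled)
```

The same rule applies to any pair of shapes whose trailing dimensions are equal or 1.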
SCIPY:
SciPy is a Python library that is useful for solving many mathematical equations and
algorithms. It is built on top of the NumPy library and extends it with routines for
scientific computations such as matrix rank, inverse, polynomial equations and LU
decomposition. Using its high-level functions significantly reduces the complexity of
the code and helps in analyzing the data. SciPy is often used in an interactive Python session as
a data-processing library, in the role filled by tools such as MATLAB, Octave and
R-Lab. It has many user-friendly, efficient and easy-to-use functions that help to solve
problems like numerical integration, interpolation, optimization, linear algebra and statistics.
Sample Program:
from scipy import constants
print(constants.pi)
OUTPUT:
3.141592653589793
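A second sketch may help connect the description above to actual calls. It assumes a small made-up matrix and uses scipy.linalg for the inverse and LU decomposition; the rank is computed with numpy.linalg.matrix_rank, since that routine lives in NumPy.

```python
import numpy as np
from scipy import linalg

# Illustrative 2x2 matrix (values chosen for this sketch only).
A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

rank = np.linalg.matrix_rank(A)   # number of linearly independent rows
A_inv = linalg.inv(A)             # matrix inverse
P, L, U = linalg.lu(A)            # LU decomposition with A = P @ L @ U

print("Rank:", rank)
print("Inverse:\n", A_inv)
print("P @ L @ U reproduces A:", np.allclose(P @ L @ U, A))
```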
JUPYTER:
The IPython Notebook concept was expanded upon to allow for additional programming
languages and was therefore renamed "Jupyter". "Jupyter" is a loose acronym meaning Julia,
Python and R, but today, the notebook technology supports many programming languages.
An IDE normally consists of at least a source code editor, build automation tools and a
debugger. Jupyter Notebook is an IDE for Python that allows its users to create documents
containing both rich text and code. It also supports the programming languages Julia and R.
Jupyter Notebook allows users to compile all aspects of a data project in one place
making it easier to show the entire process of a project to your intended audience. Through
the web-based application, users can create data visualizations and other components of a
project to share with others via the platform.
To open jupyter-lab:
Open command prompt and type jupyter-lab.
Then after initializing all the necessary packages, it will open as follows:
Click on new notebook, then the new file will be opened with .ipynb file extension. Then type
python code and execute the code using Shift+Enter.
Sample Program and Output:
STATSMODELS:
Statsmodels is a Python module that provides classes and functions for the
estimation of many different statistical models, as well as for conducting statistical tests and
statistical data exploration. An extensive list of result statistics is available for each
estimator. The results are tested against existing statistical packages to ensure that they are
correct. Statsmodels supports specifying models using R-style formulas and pandas
DataFrames.
Statsmodels is a Python package that complements SciPy for statistical
computations, including descriptive statistics and estimation and inference for statistical
models.
Sample Program:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv(r"C:\Users\UGCS\Desktop\headbrain11.csv")
print(df.head())
# rename the columns, then fit the model
df.columns = ['Head_size', 'Brain_weight']
model = smf.ols(formula='Head_size ~ Brain_weight', data=df).fit()
# model summary
print(model.summary())
OUTPUT:
PANDAS:
Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data. Pandas allows us to analyze big data
and draw conclusions based on statistical theories. Pandas can clean messy data sets and
make them readable and relevant. Relevant data is very important in data science.
Sample Program:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
OUTPUT:
RESULT:
Thus the Python packages NumPy, SciPy, Jupyter, Statsmodels and Pandas have been
downloaded, installed and the features have been explored successfully.
EX.NO:2 WORKING WITH NUMPY ARRAYS
DATE:
AIM:
To write a python code to work with numpy arrays.
ALGORITHM:
PROGRAM:
#Create a 0-D array:
import numpy as np
arr = np.array(42)
print(arr)
OUTPUT:
42
OUTPUT:
[1 2 3 4 5]
OUTPUT:
[[[1 2 3]
[4 5 6]]
[[1 2 3]
[4 5 6]]]
OUTPUT:
0
1
2
3
OUTPUT:
2
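The programs that produced the four outputs above did not survive in this copy of the manual. The following is a minimal sketch of code that would print output of the same form, assuming the usual array-creation, ndim and indexing examples; the input values are a reconstruction, not the original listing.

```python
import numpy as np

# Create a 1-D array
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1)

# Create a 3-D array made of two 2x3 blocks
arr3 = np.array([[[1, 2, 3], [4, 5, 6]],
                 [[1, 2, 3], [4, 5, 6]]])
print(arr3)

# Check the number of dimensions of 0-D, 1-D, 2-D and 3-D arrays
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = arr3
for array in (a, b, c, d):
    print(array.ndim)

# Access the second element of a 1-D array (index 1)
arr = np.array([1, 2, 3, 4])
print(arr[1])
```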
#Accessing 2-D Arrays:
import numpy as np
arr = np.array([[1,2,3,4,5], [6,7,8,9,10]])
print('2nd element on 1st row: ', arr[0, 1])
OUTPUT:
2nd element on 1st row: 2
#Array Slicing:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7])
print(arr[4:])
OUTPUT:
[5 6 7]
OUTPUT:
[7 8 9]
OUTPUT:
<U6
OUTPUT:
1
2
3
OUTPUT:
[1 2 3 4 5 6]
OUTPUT:
[array([1, 2]), array([3, 4]), array([5, 6])]
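The listings for the five outputs above are also missing from this copy. A sketch of code that would give output of that form is shown here, assuming 2-D slicing, a dtype check on a string array, iteration, joining and splitting; the inputs are assumptions chosen to match the printed results.

```python
import numpy as np

# Slice a 2-D array: row index 1, columns 1 to 3
arr2d = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
print(arr2d[1, 1:4])

# Data type of an array of strings
fruits = np.array(['apple', 'banana', 'cherry'])
print(fruits.dtype)

# Iterate over a 1-D array
for x in np.array([1, 2, 3]):
    print(x)

# Join two arrays
joined = np.concatenate((np.array([1, 2, 3]), np.array([4, 5, 6])))
print(joined)

# Split an array into three parts
parts = np.array_split(joined, 3)
print(parts)
```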
#Searching Arrays:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
OUTPUT:
(array([3, 5, 6], dtype=int32),)
#Sorting Arrays:
import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr))
OUTPUT:
[0 1 2 3]
#Filtering Arrays:
import numpy as np
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)
OUTPUT:
[41 43]
RESULT:
Thus the python code to work with numpy arrays has been implemented and executed
successfully.
EX.NO:3 WORKING WITH PANDAS DATA FRAMES
DATE:
AIM:
To write a python program to work with pandas data frames.
ALGORITHM:
1. Pandas is a Python library used for working with data sets.
2. It has functions for analyzing, cleaning, exploring, and manipulating data.
3. Dataframes can be created using a list or a dictionary.
4. Dataframes can also be loaded from .csv or .xlsx files.
5. Null values can be replaced with other values.
6. Pandas can also perform statistical analysis of the data.
PROGRAM:
#Creating a dataframe using List:
import pandas as pd
lst = ['Anna', 'University', 'Chennai', 'Sri Ramakrishna', 'College', 'of', 'Engineering']
df = pd.DataFrame(lst)
print(df)
OUTPUT:
                 0
0             Anna
1       University
2          Chennai
3  Sri Ramakrishna
4          College
5               of
6      Engineering
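The algorithm also mentions building a dataframe from a dictionary, replacing null values and statistical analysis, for which no listing survives here. A minimal sketch follows, assuming made-up records; the names, marks and column labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical records; names and marks are made up for this sketch.
data = {
    "Name": ["Asha", "Bala", "Chitra", "Dinesh"],
    "Marks": [82.0, np.nan, 91.0, 67.0],
}
df = pd.DataFrame(data)

# Replace the missing mark with the mean of the available marks.
filled = df.fillna({"Marks": df["Marks"].mean()})

print(filled)
# Summary statistics (count, mean, std, min, quartiles, max)
print(filled["Marks"].describe())
```

Filling with the column mean is only one choice; a constant or the median may suit other data better.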
OUTPUT:
OUTPUT:
OUTPUT:
RESULT:
Thus the python program to work with pandas data frames has been implemented and
executed successfully.
EX.NO:4 READING DATA FROM TEXT FILES, EXCEL AND THE WEB
DATE:
AIM:
To read data from text files, Excel and the web, and to explore various commands
for doing descriptive analytics on the Iris data set.
PRE-REQUISITES:
pip install xlrd
pip install openpyxl
pip install requests
pip install beautifulsoup4
ALGORITHM:
1. Open the file to be written using open() function.
2. The file can be opened in read/write/append/… mode.
3. Write the file using write() or writelines() function.
4. seek(n) takes the file handle to the nth byte from the beginning.
5. Close the file using close().
6. To read the data from the excel, install pandas.
7. Create a dataframe using read_excel()
8. To read the data from the web, install requests and beautifulsoup4.
9. The content from the web can be accessed using the function requests.get(url).
10. To perform descriptive analytics on a dataset, install seaborn, matplotlib and
pandas to explore various functions.
PROGRAM:
#Reading data from text file:
# Program to show various ways to read and write data in a file.
file1 = open("myfile.txt","w")
L = ["This is Python \n","This is datascience \n","This is jupyter \n"]
file1.write("Hello \n")
file1.writelines(L)
file1.close()  # to change file access modes
file1 = open("myfile.txt","r+")
print("Output of Read function is ")
print(file1.read())
print()
# seek(n) takes the file handle to the nth byte from the beginning.
file1.seek(0)
print( "Output of Readline function is ")
print(file1.readline())
print()
file1.seek(0)
# To show difference between read and readline
print("Output of Read(9) function is ")
print(file1.read(9))
print()
file1.seek(0)
print("Output of Readline(9) function is ")
print(file1.readline(9))
file1.seek(0)
# readlines function
print("Output of Readlines function is ")
print(file1.readlines())
file1.close()
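The same read and write steps can be sketched with a with-block, which closes the file automatically. This is an alternative arrangement, not the manual's listing; it assumes a temporary directory so that no file is left behind.

```python
import os
import tempfile

lines = ["This is Python\n", "This is datascience\n", "This is jupyter\n"]

# Work in a temporary directory so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")

    # The file is closed automatically when the with-block ends.
    with open(path, "w") as f:
        f.write("Hello\n")
        f.writelines(lines)

    with open(path, "r") as f:
        first_line = f.readline()   # reads up to and including the first newline
        f.seek(0)                   # back to the beginning of the file
        first_nine = f.read(9)      # reads exactly nine characters
        f.seek(0)
        all_lines = f.readlines()   # list containing every line

print(repr(first_line))
print(repr(first_nine))
print(all_lines)
```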
OUTPUT:
OUTPUT:
Gender,Age Range,Head Size(cm^3),Brain Weight(grams)
PROGRAM:
#Reading dataset
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
df.info()
df.describe()
OUTPUT:
#Finding Frequency
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
#create frequency table for 'Glucose' variable
f1=df['Glucose'].value_counts()
print('frequency table for Glucose variable\n',f1)
OUTPUT:
#Finding Mean
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
m1=df['Pregnancies'].mean()
print('Mean of Pregnancies',m1)
m2=df['Glucose'].mean()
print('Mean of Glucose',m2)
m3=df['BloodPressure'].mean()
print('Mean of BloodPressure',m3)
m4=df['SkinThickness'].mean()
print('Mean of SkinThickness',m4)
m5=df['Insulin'].mean()
print('Mean of Insulin',m5)
m6=df['BMI'].mean()
print('Mean of BMI',m6)
m7=df['DiabetesPedigreeFunction'].mean()
print('Mean of DiabetesPedigreeFunction',m7)
m8=df['Age'].mean()
print('Mean of Age',m8)
OUTPUT:
#Finding Median
import pandas as pd
df = pd.read_csv("diabetes.csv")
m1=df['Pregnancies'].median()
print('median of Pregnancies',m1)
m2=df['Glucose'].median()
print('median of Glucose',m2)
m3=df['BloodPressure'].median()
print('median of BloodPressure',m3)
m4=df['SkinThickness'].median()
print('median of SkinThickness',m4)
m5=df['Insulin'].median()
print('median of Insulin',m5)
m6=df['BMI'].median()
print('median of BMI',m6)
m7=df['DiabetesPedigreeFunction'].median()
print('median of DiabetesPedigreeFunction',m7)
m8=df['Age'].median()
print('median of Age',m8)
OUTPUT:
#Finding Mode
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
m1=df['Pregnancies'].mode()
print('mode of Pregnancies',m1)
m2=df['Glucose'].mode()
print('mode of Glucose',m2)
m3=df['BloodPressure'].mode()
print('mode of BloodPressure',m3)
m4=df['SkinThickness'].mode()
print('mode of SkinThickness',m4)
m5=df['Insulin'].mode()
print('mode of Insulin',m5)
m6=df['BMI'].mode()
print('mode of BMI',m6)
m7=df['DiabetesPedigreeFunction'].mode()
print('mode of DiabetesPedigreeFunction',m7)
m8=df['Age'].mode()
print('mode of Age',m8)
OUTPUT:
#Finding Variance
import pandas as pd
import statistics
#create DataFrame
df = pd.read_csv("diabetes.csv")
# convert each column to plain Python numbers before calling the statistics module
print("Variance of Glucose set is % s"%(statistics.variance(df.Glucose.tolist())))
print("Variance of Pregnancies set is % s"%(statistics.variance(df.Pregnancies.tolist())))
print("Variance of Age set is % s"%(statistics.variance(df.Age.tolist())))
OUTPUT:
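The RESULT of this exercise lists Standard Deviation, but no listing for it survives in this copy. A minimal sketch follows, assuming a short made-up series in place of a column from diabetes.csv; the values are illustrative only.

```python
import statistics
import pandas as pd

# Illustrative values only; in the exercise the column would come from diabetes.csv.
glucose = pd.Series([85, 89, 116, 137, 148, 183])

# Sample standard deviation (divides by n - 1), computed two ways.
sd_stats = statistics.stdev(glucose.tolist())
sd_pandas = glucose.std()

# Population standard deviation (divides by n).
psd_stats = statistics.pstdev(glucose.tolist())
psd_pandas = glucose.std(ddof=0)

print("Sample standard deviation:", sd_stats, sd_pandas)
print("Population standard deviation:", psd_stats, psd_pandas)
```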
OUTPUT:
#Finding Skewness
import scipy.stats  # load the stats submodule explicitly
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
s1=scipy.stats.skew(df.Age, axis=0, bias=True)
print('the skewness of Age is',s1)
s2=scipy.stats.skew(df.Glucose, axis=0, bias=True)
print('the skewness of Glucose is',s2)
OUTPUT:
#Finding Kurtosis
import scipy.stats  # load the stats submodule explicitly
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
k1=scipy.stats.kurtosis(df.Age, axis=0, bias=True)
print('the kurtosis of Age is',k1)
k2=scipy.stats.kurtosis(df.Glucose, axis=0, bias=True)
print('the kurtosis of Glucose is',k2)
OUTPUT:
RESULT:
Thus the Univariate analysis such as Frequency, Mean, Median, Mode, Variance,
Standard Deviation, Skewness and Kurtosis on the diabetes dataset has been performed
successfully.
EX.NO:5B BIVARIATE ANALYSIS USING DIABETES DATASET
DATE:
AIM:
To perform Bivariate analysis such as Linear and logistic regression modeling on the
Diabetes dataset.
ALGORITHM:
1. Linear regression uses the relationship between the data points to draw a straight
line through them.
2. This line can be used to predict future values.
3. Import scipy and draw the line of linear regression.
4. Define the response and explanatory variables.
5. Add a constant to the predictor variables.
6. Create the model using sm.OLS(y, x).fit().
7. View the model using summary().
8. To construct the correlation matrix, use corr().
9. To model the logistic regression, install scikit-learn version 0.24.2.
10. Read and explore the data.
11. Split the Dataset as Train and Test dataset
12. Train the model using, LogisticRegression()
13. Visualize the performance of logistic regression model.
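Steps 1 to 3 above can be sketched with scipy.stats.linregress. The data here are synthetic (roughly y = 2x + 1 with small noise) and stand in for two columns of the dataset; the variable names are assumptions made for this sketch.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration: y is roughly 2*x + 1 with small noise.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Fit the straight line through the data points.
result = stats.linregress(x, y)

# Use the fitted line to predict a value outside the observed range.
x_new = 25.0
y_pred = result.slope * x_new + result.intercept

print("slope:", result.slope)
print("intercept:", result.intercept)
print("correlation coefficient r:", result.rvalue)
print("prediction at x = 25:", y_pred)
```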
PROGRAM:
#creating scatterplots
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("diabetes.csv")
plt.scatter(df.BMI, df.Age)
plt.title('BMI vs. Age')
plt.xlabel('BMI')
plt.ylabel('Age')
plt.show()
OUTPUT:
#simple linear regression
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
df = pd.read_csv("diabetes.csv")
#define response variable
y = df['Insulin']
#define explanatory variable
x = df[['BloodPressure']]
#add constant to predictor variables
x = sm.add_constant(x)
#fit linear regression model
model = sm.OLS(y, x).fit()
#view model summary
print(model.summary())
OUTPUT:
#creating histogram
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
df = pd.read_csv("diabetes.csv")
sns.histplot(df.Age,kde=True)
plt.show()
OUTPUT:
#Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#Read and Explore the data
dataset = pd.read_csv("diabetes.csv")
# input
x = dataset.iloc[:, [2, 3]].values
# output
y = dataset.iloc[:, 4].values
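The listing above stops before the split, training and evaluation steps named in the algorithm. A minimal sketch of those steps follows, assuming scikit-learn is installed and assuming a binary target; it uses synthetic two-feature data rather than the dataset, so the numbers it prints say nothing about diabetes.csv.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-feature data with a binary label (a stand-in for the dataset).
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

# Split the data into training and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Train the classifier and predict on the held-out data.
classifier = LogisticRegression()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)

# Summarise performance with a confusion matrix and accuracy.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print("Confusion matrix:\n", cm)
print("Accuracy:", acc)
```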
RESULT:
Thus the Bivariate analysis such as Linear and logistic regression modeling on the
diabetes dataset has been performed and analyzed successfully.
EX.NO:5C MULTIPLE REGRESSION ANALYSIS USING DIABETES DATASET
DATE:
AIM:
To perform multiple regression analysis using diabetes dataset.
ALGORITHM:
1. Multiple regression is like linear regression, but with more than one independent
variable, meaning that we try to predict a value based on two or more variables.
2. Import pandas, numpy and matplotlib packages.
3. Install and import sklearn (scikit-learn) package.
4. Import linear_model from scikit-learn.
5. Plot the graph using scatter().
6. Generate training and testing data from the dataset.
7. Model the dataset using regr.fit().
8. Analyze the coefficients and intercepts.
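As a sketch of steps 6 to 8 with two predictors, the following fits linear_model.LinearRegression on synthetic data. The data, the true coefficients (3 and -2) and the intercept (5) are assumptions chosen so that the fitted values can be checked by eye; they do not come from the diabetes dataset.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: the target depends on two predictors plus small noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

# 80/20 split by position, as in the exercise.
split = int(len(X) * 0.8)
train_X, test_X = X[:split], X[split:]
train_y, test_y = y[:split], y[split:]

# Fit one coefficient per predictor plus an intercept.
regr = linear_model.LinearRegression()
regr.fit(train_X, train_y)

print("Coefficients:", regr.coef_)
print("Intercept:", regr.intercept_)
print("R^2 on test data:", regr.score(test_X, test_y))
```

With more than one column in the training matrix, coef_ holds one slope per predictor.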
PROGRAM:
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model
np.random.seed(19680801)
data=pd.read_csv("diabetes.csv")
data.head(210)
data = data[["Glucose","Age","Pregnancies"]]
fig=plt.figure()
ax=fig.add_subplot(111,projection='3d')
n=100
ax.scatter(data["Glucose"],data["Age"],data["Pregnancies"],color="red")
ax.set_xlabel("Glucose")
ax.set_ylabel("Age")
ax.set_zlabel("Pregnancies")
plt.show()
OUTPUT:
#Generating training and testing data from our data:
train = data[:(int((len(data)*0.8)))]
test = data[(int((len(data)*0.8))):]
# Modeling:Using sklearn package to model data :
regr = linear_model.LinearRegression()
train_x = np.array(train[["Glucose"]])
train_y = np.array(train[["Age"]])
regr.fit(train_x,train_y)
ax.scatter(data["Glucose"],data["Age"],data["Pregnancies"],color="red")
plt.plot(train_x, regr.coef_*train_x + regr.intercept_, '-r')
ax.set_xlabel("Glucose")
ax.set_ylabel("Age")
ax.set_zlabel("Pregnancies")
print ("coefficients : ",regr.coef_)#Slope
print ("Intercept : ",regr.intercept_)
OUTPUT:
RESULT:
Thus the multiple regression analysis using diabetes dataset has been implemented and
executed successfully.
EX.NO:6 EXPLORING VARIOUS PLOTTING FUNCTIONS USING DATASET
DATE:
AIM:
To apply and explore various plotting functions such as Normal curves, Density and
Contour plots, Correlation and scatter plots, Histograms and three dimensional plotting on
UCI data sets.
ALGORITHM:
1. Import numpy, matplotlib, scipy and pandas.
2. Create the dataframe.
3. Find mean and standard deviation from the dataset.
4. Find the normal curve snd using stats.norm().
5. Generate 1000 evenly spaced values and plot the normal curve.
6. Install and import seaborn package.
7. Draw the density plot using distplot() (deprecated in recent seaborn releases in favour of displot() and histplot()).
8. Draw the contour plot using kdeplot().
9. Construct the correlation matrix using con.corr().
10. Display the coefficient of correlation using stats.pearsonr()
11. Plot the histogram using hist().
12. To model 3D plotting, import Axes3D.
PROGRAM:
#NORMAL CURVES:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
#create DataFrame
df = pd.read_csv("diabetes.csv")
mu=df['Pregnancies'].mean()
std=df['Pregnancies'].std()
snd = stats.norm(mu, std)
# Generate 1000 evenly spaced values between -100 and 100
x = np.linspace(-100, 100, 1000)
plt.figure(figsize=(7.5,7.5))
plt.plot(x, snd.pdf(x))
plt.xlim(-60, 60)
plt.title('Normal Distribution', fontsize='15')
plt.xlabel('Values of Random Variable X', fontsize='15')
plt.ylabel('Probability', fontsize='15')
plt.show()
OUTPUT:
OUTPUT:
#contour plot:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("diabetes.csv")
sns.set_style("white")
sns.kdeplot(x=df.Age, y=df.BloodPressure)
plt.show()
sns.kdeplot(x=df.Age, y=df.BloodPressure, cmap="Reds", fill=True, bw_adjust=.5)
plt.show()
sns.kdeplot(x=df.Age, y=df.BloodPressure, cmap="Blues", fill=True, thresh=0)
plt.show()
OUTPUT:
#CORRELATION AND SCATTER PLOTS:
import pandas as pd
import matplotlib.pyplot as plt
con = pd.read_csv('diabetes.csv')
print(con)
import seaborn as sns
sns.scatterplot(x="Age", y="Glucose", data=con);
plt.show()
sns.lmplot(x="Age", y="Glucose", hue="BMI", data=con);
plt.show()
#coefficient of correlation
from scipy import stats
cr=stats.pearsonr(con['Glucose'], con['Age'])
print(cr)
#correlation matrix
cormat = con.corr()
print(round(cormat,2))
OUTPUT:
#HISTOGRAMS:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
a = pd.read_csv('diabetes.csv')
# Creating histogram
fig, ax = plt.subplots(figsize =(10, 7))
ax.hist(a, bins = [0, 25, 50, 75, 100])
plt.show()
OUTPUT:
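The three-dimensional plotting program named in the AIM is missing from this copy. A minimal sketch follows, assuming synthetic points and a non-interactive backend so that it runs without a display; in the exercise the three coordinates would be columns of the dataset, and plt.show() would replace the in-memory save.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic points (illustrative only, not taken from the dataset).
rng = np.random.default_rng(0)
x = rng.uniform(0, 200, 50)
y = rng.uniform(20, 80, 50)
z = rng.uniform(0, 15, 50)

# Create 3-D axes and draw a scatter plot.
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z, color="red")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_zlabel("Z")

# Render the figure to an in-memory PNG instead of opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```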
OUTPUT:
RESULT:
Thus the various plotting functions such as Normal curves, Density and contour plots,
Correlation and scatter plots, Histograms and three dimensional plotting have been explored
successfully on UCI data sets.
EX.NO:7 VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
DATE:
AIM:
To implement visualization of geographic data with basemap.
PRE-REQUISITES:
Install folium.
ALGORITHM:
1. Import folium and pandas libraries.
2. Initialize the map and store it in an object m.
3. Use the function, folium.Map()
4. Save the map using save() function.
5. Open and view the file using any browser.
PROGRAM:
Installation of Folium:
# import the folium and pandas libraries
import folium
import pandas as pd
# initialize the map and store it in a m object
m = folium.Map(location = [40, -95],zoom_start = 4)
# save the map to an HTML file
m.save('my_map.html')
OUTPUT:
RESULT:
Thus the implementation of visualizing geographic data with base map has been
executed successfully.