CS3361 - Data Science Laboratory
NAME :
REGISTER NO. :
BRANCH :
ANNA UNIVERSITY
UNIVERSITY COLLEGE OF ENGINEERING – DINDIGUL
DINDIGUL – 62422
BONAFIDE CERTIFICATE
This is to certify that this is a bonafide record of work done by
Mr./Ms.________________________________________
in _____________________________________________
laboratory during the academic year 2022-2023
INDEX
EXPT NO.: 01 INSTALLATION OF FEATURES FOR PYTHON
DATE :
Aim:
To write a procedure for the installation of features for Python.
Procedure:
Python Packages
The power of Python is in the packages that are available either through the pip or conda
package managers. This page is an overview of some of the best packages for machine
learning and data science and how to install them.
We will explore the Python packages that are commonly used for data science and machine
learning. You may need to install the packages from the terminal, Anaconda Prompt,
command prompt, or from the Jupyter Notebook. If you have multiple versions of Python or
have specific dependencies, then use an environment manager such as pyenv. For most
users, a single installation is typically sufficient. The Python package manager pip has all of
the packages (such as gekko) that we need for this course. If there is an administrative
access error, install to the local profile with the --user flag:
pip install --user gekko
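After installation, you can confirm from Python that a package is available; a minimal sketch (the package list here is just an example):
import importlib
# Try importing a few packages and report their versions
for name in ["gekko", "numpy", "pandas"]:
    try:
        module = importlib.import_module(name)
        print(name, "installed, version:", getattr(module, "__version__", "unknown"))
    except ImportError:
        print(name, "is not installed; run: pip install", name)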
Gekko
Gekko provides an interface to gradient-based solvers for machine learning and
optimization of mixed-integer, differential algebraic equations, and time series models.
Gekko provides exact first and second derivatives through automatic differentiation and
discretization with simultaneous or sequential methods.
pip install gekko
Keras
Keras provides an interface for artificial neural networks. Keras acts as an interface for the
TensorFlow library. Other backend packages were supported until version 2.4. TensorFlow is
now the only backend and is installed separately with pip install tensorflow.
pip install keras
Matplotlib
The package matplotlib generates plots in Python.
pip install matplotlib
Numpy
Numpy is a numerical computing package for mathematics, science, and engineering. Many
data science packages use Numpy as a dependency.
pip install numpy
OpenCV
OpenCV (Open Source Computer Vision Library) is a package for real-time computer vision
and was developed with support from Intel Research.
pip install opencv-python
Pandas
Pandas provides tools for visualizing and manipulating data tables. There are many functions
that allow efficient manipulation for the preliminary steps of data analysis problems.
pip install pandas
Plotly
Plotly renders interactive plots with HTML and JavaScript. Plotly Express is included with
Plotly.
pip install plotly
PyTorch
PyTorch enables deep learning, computer vision, and natural language processing.
Development is led by Facebook's AI Research lab (FAIR).
pip install torch
Scikit-Learn
Scikit-Learn (or sklearn) includes a wide variety of classification, regression and clustering
algorithms including neural network, support vector machine, random forest, gradient
boosting, k-means clustering, and other supervised or unsupervised learning methods.
pip install scikit-learn
SciPy
SciPy is a general-purpose package for mathematics, science, and engineering and extends
the base capabilities of NumPy.
pip install scipy
Seaborn
Seaborn is built on matplotlib, and produces detailed plots in a few lines of code.
pip install seaborn
Statsmodels
Statsmodels is a package for exploring data, estimating statistical models, and performing
statistical tests. It includes descriptive statistics, statistical tests, plotting functions, and result
statistics.
pip install statsmodels
TensorFlow
TensorFlow is an open source machine learning platform with particular focus on training
and inference of deep neural networks. Development is led by the Google Brain team.
pip install tensorflow
Result:
Thus the procedure for the installation of features for Python was carried out
successfully.
EXPT NO.: 02 WORKING WITH NUMPY ARRAYS
DATE :
Aim:
To write a python program using the numpy library and to study the operations involved in it.
Short Notes:
NumPy is the fundamental package for scientific computing in Python. It is a Python library
that provides a multidimensional array object, various derived objects (such as masked
arrays and matrices), and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier
transforms, basic linear algebra, basic statistical operations, random simulation and much
more.
Source Code:
import random
import numpy
row= int(input("Enter the Number of Rows: "))
column=int(input("Enter the Number of Columns: "))
matrix=[]
for i in range(row):
    d = []
    for j in range(column):
        x = random.randint(1, 100)
        d.append(x)
    matrix.append(d)
matrix=numpy.array(matrix)
print("\n Matrix = \n",matrix)
print("\n Dimension of Matrix = ",matrix.ndim)
print("\n Byte-Size of Each Element in the Matrix = ",matrix.itemsize)
print("\n Data Type of Matrix = ",matrix.dtype)
print("\n Total Number of Elements in the Matrix= ",matrix.size)
print("\n Shape of the Matrix = ",matrix.shape)
print("\n Reshaped Matrix = \n",matrix.reshape(column,row),"\n")
print("Printing entire row (1,2) of Matrix= \n",matrix[0:2],"\n")
print("Printing 3rd element of 2nd row of Matrix = \n",matrix[1,1],"\n")
print("Printing 4th element of 1st and 2nd row of Matrix = \n",matrix[0:2,3],"\n")
print("\nMaximum Element in the Matrix = ",matrix.max())
print("\nMinimum Element in the Matrix = ",matrix.min())
print("\nSum of All Element in the Matrix = ",matrix.sum())
print("\nSquare root of Each Element in the Matrix = ",numpy.sqrt(matrix))
print("\nStandard Deviation of the Matrix = ",numpy.std(matrix))
Result:
Thus the python program for working with numpy library was executed successfully
and the output was verified.
EXPT NO.: 03 WORKING WITH PANDAS DATA FRAMES
DATE :
Aim:
To write a python program using the pandas library and to study the working of data
frames.
Short Notes:
Pandas is an open-source library that is made mainly for working with relational or labeled
data both easily and intuitively. It provides various data structures and operations for
manipulating numerical data and time series. This library is built on top of the NumPy
library. Pandas is fast and it has high performance & productivity for users.
Advantages
Fast and efficient for manipulating and analyzing data.
Data from different file objects can be loaded.
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.
Data set merging and joining.
Flexible reshaping and pivoting of data sets.
Provides time-series functionality.
Powerful group by functionality for performing split-apply-combine operations on data sets.
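A minimal sketch illustrating two of these advantages, missing-data handling and split-apply-combine with groupby, on a small made-up table:
import numpy as np
import pandas as pd
# Small made-up table with a missing value (NaN)
df = pd.DataFrame({'Branch': ['CSE', 'CSE', 'ECE', 'ECE'],
                   'Marks': [85, np.nan, 78, 92]})
# Easy handling of missing data: fill the NaN with the column mean
print(df['Marks'].fillna(df['Marks'].mean()))
# Split-apply-combine: mean marks per branch
print(df.groupby('Branch')['Marks'].mean())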
Creating a dataframe using List:
DataFrame can be created using a single list or a list of lists.
Source Code:
# import pandas as pd
import pandas as pd
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']
# Create DataFrame from the list
df = pd.DataFrame(lst)
print(df)
Creating a dataframe using a dict of lists:
# dictionary of lists
data = {'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Create DataFrame from the dictionary
df = pd.DataFrame(data)
print(df)
Result:
Thus the python program for working with pandas data frames library was executed
successfully and the output was verified.
EXPT NO.: 04 Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.
DATE :
Aim:
To read data from files and explore various commands for doing descriptive
analytics on the Iris data set.
Algorithm:
1. Download “Iris.csv” file from GitHub.com
2. Load the “Iris.csv” into google colab.
3. Perform descriptive analysis on the Iris file.
Program:
Iris Dataset
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# Printing top 5 rows
print(df.head())
Getting Information about the Dataset
df.shape
df.info()
df.describe()
Checking the Species Counts
df.value_counts("Species")
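The value counts above show whether the classes are balanced; to actually check for duplicate rows, a short sketch on the same DataFrame:
# Count fully duplicated rows; drop them if any are found
print(df.duplicated().sum())
df = df.drop_duplicates()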
Data Visualization
Visualizing the target column
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Species', data=df)
plt.show()
Histograms
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5)
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6)
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6)
plt.show()
Result:
Thus the python program to read and work on iris dataset was executed successfully
and the output was verified.
EXPT NO.: 05(a) Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following: Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
DATE :
Aim:
To write a python program for Univariate analysis: Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis using UCI and Pima Diabetes dataset.
Procedure
1. Download a dataset like the Pima Indian diabetes dataset. Save it in any drive and load it for processing.
2. The mean() function can be used to calculate the mean/average of a given list of numbers.
3. The median() method calculates the median (middle value) of the given data set.
4. The mode of a set of data values is the value that appears most often.
5. The var() method calculates the variance for each column.
6. Standard deviation std() is a number that describes how spread out the values are.
7. The skew() method calculates the skew for each column. Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data.
Kurtosis:
It is also a statistical term and an important characteristic of frequency distribution.
It determines whether a distribution is heavy-tailed with respect to the normal distribution.
It provides information about the shape of a frequency distribution.
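As a note on conventions: scipy's kurtosis() returns the Fisher (excess) kurtosis by default, and the Pearson value when fisher=False is passed; a short sketch on a made-up normal sample:
import numpy as np
from scipy.stats import kurtosis
data = np.random.normal(0, 1, 1000)   # made-up sample
print(kurtosis(data))                 # Fisher definition: close to 0 for a normal sample
print(kurtosis(data, fisher=False))   # Pearson definition: close to 3 for a normal sample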
Program:
import pandas as pd
from scipy.stats import kurtosis
df = pd.read_csv(r'd:\diabetes.csv')
print(df)
print(df.mean())
print(df.median())
print(df.mode())
print(df.var())
print(df.std())
print(df.skew())
print(kurtosis(df, axis=0, bias=True))
Dataset download link
https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/Pima%20Indians%20Diabetes%20Dataset.ipynb
Result:
Thus the python program for Univariate analysis: Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis was executed successfully and the
output was verified.
EXPT NO.: 05(b) Linear Regression and Logistic Regression with the Diabetes Dataset Using Python Machine Learning.
DATE :
Aim:
To write a python program for Linear Regression and Logistic Regression with the
UCI and Pima Indians Diabetes datasets using Python machine learning.
Procedure:
1. Load sklearn libraries.
2. Load data.
3. Load the diabetes dataset.
4. Split the dataset.
5. Create the Linear Regression and Logistic Regression models.
6. Make predictions using the testing set.
7. Find the coefficients and the mean squared error.
Program
# Create Linear regression object and fit it on the training split
regr = LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# Create Logistic regression object and fit it on the training split
Logistic_model = LogisticRegression(max_iter=1000)
Logistic_model.fit(diabetes_X_train, diabetes_y_train)
# The coefficients of the linear model
print('Coefficients: \n', regr.coef_)
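The lines above presuppose the loading and splitting steps (1-4 of the procedure). A minimal end-to-end sketch is given below, assuming the Pima diabetes.csv file (with an Outcome column) saved on the d: drive as in the other experiments; the path and column name are assumptions:
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load the Pima diabetes dataset (path is an assumption)
df = pd.read_csv(r'd:\diabetes.csv')
X = df.drop('Outcome', axis=1)   # independent variables
y = df['Outcome']                # 0/1 diabetes outcome
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
# Linear regression treats the 0/1 outcome as a continuous value
regr = LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)
# Logistic regression models the outcome as a class label
Logistic_model = LogisticRegression(max_iter=1000)
Logistic_model.fit(diabetes_X_train, diabetes_y_train)
# Coefficients and mean squared error (procedure step 7)
print('Coefficients: \n', regr.coef_)
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))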
Result:
Thus the python program for Linear Regression and Logistic Regression with the
Diabetes Dataset Using Python Machine Learning was executed successfully and the output
was verified.
EXPT NO.: 05(c) Multiple Regression with the Diabetes Dataset Using Python Machine Learning.
DATE :
Aim:
To write a python program for Multiple Regression using UCI and Pima Diabetes
dataset.
Procedure:
1. The Pandas module allows us to read csv files and return a DataFrame object.
2. Then make a list of the independent values and call this variable X.
3. Put the dependent values in a variable called y.
4. From the sklearn module we will use the LinearRegression() method to create a linear regression object.
5. This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship.
6. We then have a regression object that is ready to predict Age values based on a person's Glucose and BloodPressure.
Program:
import pandas as pd
from sklearn import linear_model
df = pd.read_csv(r'd:\diabetes.csv')
print (df)
X = df[['Glucose', 'BloodPressure']]
y = df['Age']
regr = linear_model.LinearRegression()
regr.fit(X, y)
predictedage = regr.predict([[150, 13]])
print(predictedage)
Result:
Thus the python program for Multiple Regression was executed successfully and the
output was verified.
EXPT NO.: 05(d) Compare the results of the above analysis for the two data sets.
DATE :
Aim:
To write a python program to compare the results of the above analysis for the two
data sets using the UCI and Pima Diabetes datasets.
Procedure:
Step 1: Prepare the datasets to be compared. Say you have one set of data stored in a CSV file called car1.csv, and a second set stored in a CSV file called car2.csv.
Step 2: Create the two DataFrames by reading the two CSV files.
Step 3: Compare the values between the two Pandas DataFrames. In this step, you'll need to import the NumPy package.
Program:
import pandas as pd
import numpy as np
data_1 = pd.read_csv(r'd:\car1.csv')
df1 = pd.DataFrame(data_1)
data_2 = pd.read_csv(r'd:\car2.csv')
df2 = pd.DataFrame(data_2)
df1['amount1'] = df2['amount1']
df1['prices_match'] = np.where(df1['amount'] == df2['amount1'], 'True', 'False')
df1['price_diff'] = np.where(df1['amount'] == df2['amount1'], 0, df1['amount'] - df2['amount1'])
print(df1)
Result:
Thus the python program to compare the results of the above analysis for the two
data sets was executed successfully and the output was verified.
EXPT NO.: 06(a) Apply and explore various plotting functions on UCI data sets: Normal Curves
DATE :
Aim:
To apply and explore Normal curves functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the normal curve for Iris data set.
Normal Curves:
It is a probability function used in statistics that tells about how the data values are
distributed. It is the most important probability distribution function used in statistics
because of its advantages in real case scenarios.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
# import dataset
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Plot between -20 and 20 with 0.01 steps.
x_axis = np.arange(-20, 20, 0.01)
# Calculating mean and standard deviation of the sepal length
mean = df["sepal.length"].mean()
sd = df["sepal.length"].std()
plt.plot(x_axis, norm.pdf(x_axis, mean, sd))
plt.show()
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Normal Curves was executed successfully and the output was verified.
EXPT NO.: 06(b) Apply and explore various plotting functions on UCI data sets: Density and Contour Plots
DATE :
Aim:
To apply and explore Density & Contour plotting functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the density and contour plotting for Iris data sets.
Density Plotting
Density Plot is a type of data visualization tool. It is a variation of the histogram that
uses 'kernel smoothing' while plotting the values. It is a continuous and smooth version of a
histogram inferred from the data.
Density plots use Kernel Density Estimation (so they are also known as Kernel
Density Estimation plots or KDE plots), which is a probability density function. The region of
the plot with a higher peak is the region with the maximum data points residing between those values.
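For the density part, a minimal sketch with seaborn's kdeplot, assuming the same iris.csv path and sepal.length column used in the previous experiment:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Kernel density estimate of the sepal length; fill shades the area under the curve
sns.kdeplot(x=df["sepal.length"], fill=True)
plt.show()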
Contour plotting
Contour plots, also called level plots, are a tool for doing multivariate analysis and
visualizing 3-D plots in 2-D space. If we consider X and Y as the variables we want to plot,
then the response Z will be plotted as slices on the X-Y plane, due to which contours are
sometimes referred to as Z-slices or iso-response values.
Contour plots are widely used to visualize density, altitudes or heights of
mountains, as well as in the meteorological department.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
# Note: this code expects a CSV laid out as a numeric grid
# (x values in the first row, y values in the first column)
px_orbital = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
x = px_orbital.iloc[0, 1:]
y = px_orbital.iloc[1:, 0]
px_values = px_orbital.iloc[1:, 1:]
mpl.rcParams['font.size'] = 14
mpl.rcParams['legend.fontsize'] = 'large'
mpl.rcParams['figure.titlesize'] = 'medium'
# Contour levels and colorbar ticks span the range of the grid values
pmin, pmax = px_values.values.min(), px_values.values.max()
levels = np.linspace(pmin, pmax, 20)
ticks = np.linspace(pmin, pmax, 6)
fig, ax = plt.subplots()
CS = ax.contourf(x, y, px_values, cmap="RdBu", levels=levels)
ax.set_aspect('equal')
ax.set_xlabel('x')
ax.set_ylabel('y')
fig.colorbar(CS, format="%.3f", ticks=ticks)
plt.show()
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Density and contour plots was executed successfully and the output was verified.
EXPT NO.: 06(c) Apply and explore various plotting functions on UCI data sets: Correlation and Scatter Plots
DATE :
Aim:
To apply and explore correlation and scatter plotting functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the correlation and scatter plotting for Iris data sets.
Source Code:
# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pimaindians-
diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
Scatter Plotting
A scatterplot shows the relationship between two variables as dots in two
dimensions, one axis for each attribute. You can create a scatterplot for each pair of
attributes in your data. Drawing all these scatterplots together is called a scatterplot matrix.
Scatter plots are useful for spotting structured relationships between variables, like
whether you could summarize the relationship between two variables with a line. Attributes
with structured relationships may also be correlated and good candidates for removal from
your dataset.
Source Code:
# Scatterplot Matrix
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pimaindians-
diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Correlation and Scatter Plots was executed successfully and the output was
verified.
EXPT NO.: 06(d) Apply and explore various plotting functions on UCI data sets: Histograms
DATE :
Aim:
To apply and explore Histograms plotting functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the Histograms for Iris data set.
Source Code:
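A minimal sketch for the histogram plot, assuming the Iris.csv file and the column names used in experiment 04:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("Iris.csv")
# Histogram of sepal length, coloured by species
sns.histplot(data=df, x="SepalLengthCm", hue="Species", bins=7)
plt.show()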
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Histogram was executed successfully and the output was verified.
EXPT NO.: 06(e) Three Dimensional Plotting
DATE :
Aim:
To apply and explore Three Dimensional plotting functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the Three Dimensional Plotting for Iris data set.
Matplotlib:
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# plt.style.use('default')
color_pallete = ['#fc5185', '#3fc1c9', '#364f6b']
sns.set_palette(color_pallete)
sns.set_style("white")
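The snippet above only sets up the colour palette and style; a minimal 3D scatter sketch continuing from it with matplotlib's mplot3d toolkit, assuming the Iris.csv file and column names from experiment 04:
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection
df = pd.read_csv("Iris.csv")
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
# One colour per species, using the palette set above
for species, colour in zip(df['Species'].unique(), color_pallete):
    subset = df[df['Species'] == species]
    ax.scatter(subset['SepalLengthCm'], subset['SepalWidthCm'],
               subset['PetalLengthCm'], c=colour, label=species)
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Petal Length')
ax.legend()
plt.show()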
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Three Dimensional Plotting was executed successfully and the output was verified.
EXPT NO.: 07 Visualizing Geographic Data With Basemap.
DATE :
Aim:
To write a python program to visualize geographic data with Basemap in data
science.
The first code block creates a figure, sets the area for the map, and adds basics like
boundaries and water colour.
A basic map
I've added notes below for more details. Essentially the process here is:
1. Create a figure
2. Create a basic map to plot things onto (see 'm = Basemap' below)
3. Add features like map boundaries, rivers, etc.
4. I created tuples to store my coords in one place
5. Run a 'for' loop to grab each set of coords and plot them on the map -
if you removed the for loop then you would just have a nice plain map.
Source Code:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig, ax = plt.subplots(figsize=(10,10))
m = Basemap(llcrnrlon=-7.5600, llcrnrlat=49.7600,
            urcrnrlon=2.7800, urcrnrlat=60.840,
            resolution='i', # Set using letters, e.g. c is a crude drawing, f is a full detailed drawing
            projection='tmerc', # The projection style is what gives us a 2D view of the world for this
            lon_0=-4.36, lat_0=54.7, # Setting the central point of the image
            epsg=27700) # Setting the coordinate system we're using
m.drawmapboundary(fill_color='#46bcec') # Map boundary with water colour
m.fillcontinents(color='#f2f2f2', lake_color='#46bcec')
m.drawcoastlines()
# df_2000 is assumed to be a DataFrame whose 'lat_lon' column holds (x, y) coordinate tuples
for i in df_2000[0:500]['lat_lon']:
    x, y = i
    m.plot(x, y, marker='o', c='r', markersize=1, alpha=0.8, latlon=False)
plt.show()
Result:
Thus the python program for Visualizing Geographic Data With Basemap was
executed successfully and the output was verified.