CS3361 - Data Science Laboratory
NAME :
REGISTER NO. :
BRANCH :
ANNA UNIVERSITY
UNIVERSITY COLLEGE OF ENGINEERING – DINDIGUL
DINDIGUL – 62422
BONAFIDE CERTIFICATE
This is to certify that this is a bonafide record of work done by
Mr./Ms.________________________________________
in _____________________________________________
laboratory during the academic year 2022-2023
INDEX
EXPT NO.: 01 INSTALLATION OF FEATURES FOR PYTHON
DATE :
Aim:
To write a procedure for the installation of features for Python.
Procedure:
Python Packages
The power of Python is in the packages that are available either through the pip or conda
package managers. This page is an overview of some of the best packages for machine
learning and data science and how to install them.
We will explore the Python packages that are commonly used for data science and machine
learning. You may need to install the packages from the terminal, Anaconda Prompt,
command prompt, or from the Jupyter Notebook. If you have multiple versions of Python or
have specific dependencies, then use an environment manager such as pyenv. For most
users, a single installation is typically sufficient. The Python package manager pip has all of
the packages (such as gekko) that we need for this course. If there is an administrative
access error, install to the local profile with the --user flag:
pip install --user gekko
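After installation, you can confirm from Python that a package is available; a minimal sketch (the package list here is just an example):
import importlib
# Try importing a few packages and report their versions
for name in ["gekko", "numpy", "pandas"]:
    try:
        module = importlib.import_module(name)
        print(name, "installed, version:", getattr(module, "__version__", "unknown"))
    except ImportError:
        print(name, "is not installed; run: pip install", name)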
Gekko
Gekko provides an interface to gradient-based solvers for machine learning and
optimization of mixed-integer, differential algebraic equations, and time series models.
Gekko provides exact first and second derivatives through automatic differentiation and
discretization with simultaneous or sequential methods.
pip install gekko
Keras
Keras provides an interface for artificial neural networks. Keras acts as an interface for the
TensorFlow library. Other backend packages were supported until version 2.4. TensorFlow is
now the only backend and is installed separately with pip install tensorflow.
pip install keras
Matplotlib
The package matplotlib generates plots in Python.
pip install matplotlib
Numpy
Numpy is a numerical computing package for mathematics, science, and engineering. Many
data science packages use Numpy as a dependency.
pip install numpy
OpenCV
OpenCV (Open Source Computer Vision Library) is a package for real-time computer vision
and was developed with support from Intel Research.
pip install opencv-python
Pandas
Pandas provides tools for visualizing and manipulating data tables. There are many functions
that allow efficient manipulation for the preliminary steps of data analysis problems.
pip install pandas
Plotly
Plotly renders interactive plots with HTML and JavaScript. Plotly Express is included with
Plotly.
pip install plotly
PyTorch
PyTorch enables deep learning, computer vision, and natural language processing.
Development is led by Facebook's AI Research lab (FAIR).
pip install torch
Scikit-Learn
Scikit-Learn (or sklearn) includes a wide variety of classification, regression and clustering
algorithms including neural network, support vector machine, random forest, gradient
boosting, k-means clustering, and other supervised or unsupervised learning methods.
pip install scikit-learn
SciPy
SciPy is a general-purpose package for mathematics, science, and engineering and extends
the base capabilities of NumPy.
pip install scipy
Seaborn
Seaborn is built on matplotlib, and produces detailed plots in a few lines of code.
pip install seaborn
Statsmodels
Statsmodels is a package for exploring data, estimating statistical models, and performing
statistical tests. It includes descriptive statistics, statistical tests, plotting functions, and result
statistics.
pip install statsmodels
TensorFlow
TensorFlow is an open source machine learning platform with particular focus on training
and inference of deep neural networks. Development is led by the Google Brain team.
pip install tensorflow
Result:
Thus the procedure for the installation of features for Python was carried out
successfully.
EXPT NO.: 02 WORKING WITH NUMPY ARRAYS
DATE :
Aim:
To write a python program using the numpy library and to study the operations involved in it.
Short Notes:
NumPy is the fundamental package for scientific computing in Python. It is a Python library
that provides a multidimensional array object, various derived objects (such as masked
arrays and matrices), and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier
transforms, basic linear algebra, basic statistical operations, random simulation and much
more.
Source Code:
import random
import numpy
row= int(input("Enter the Number of Rows: "))
column=int(input("Enter the Number of Columns: "))
matrix=[]
for i in range(row):
    d = []
    for j in range(column):
        x = random.randint(1, 100)
        d.append(x)
    matrix.append(d)
matrix=numpy.array(matrix)
print("\n Matrix = \n",matrix)
print("\n Dimension of Matrix = ",matrix.ndim)
print("\n Byte-Size of Each Element in the Matrix = ",matrix.itemsize)
print("\n Data Type of Matrix = ",matrix.dtype)
print("\n Total Number of Elements in the Matrix= ",matrix.size)
print("\n Shape of the Matrix = ",matrix.shape)
print("\n Reshaped Matrix = \n",matrix.reshape(column,row),"\n")
print("Printing entire row (1,2) of Matrix= \n",matrix[0:2],"\n")
print("Printing 3rd element of 2nd row of Matrix = \n",matrix[1,1],"\n")
print("Printing 4th element of 1st and 2nd row of Matrix = \n",matrix[0:2,3],"\n")
print("\nMaximum Element in the Matrix = ",matrix.max())
print("\nMinimum Element in the Matrix = ",matrix.min())
print("\nSum of All Element in the Matrix = ",matrix.sum())
print("\nSquare root of Each Element in the Matrix = ",numpy.sqrt(matrix))
print("\nStandard Deviation of the Matrix = ",numpy.std(matrix))
Result:
Thus the python program for working with numpy library was executed successfully
and the output was verified.
EXPT NO.: 03 WORKING WITH PANDAS DATA FRAMES
DATE :
Aim:
To write a python program using the pandas library and to study the working of data
frames.
Short Notes:
Pandas is an open-source library that is made mainly for working with relational or labeled
data both easily and intuitively. It provides various data structures and operations for
manipulating numerical data and time series. This library is built on top of the NumPy
library. Pandas is fast and it has high performance & productivity for users.
Advantages
Fast and efficient for manipulating and analyzing data.
Data from different file objects can be loaded.
Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data.
Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.
Data set merging and joining.
Flexible reshaping and pivoting of data sets.
Provides time-series functionality.
Powerful group by functionality for performing split-apply-combine operations on data sets.
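A minimal sketch illustrating two of these advantages, missing-data handling and split-apply-combine with groupby, on a small made-up table:
import numpy as np
import pandas as pd
# Small made-up table with a missing value (NaN)
df = pd.DataFrame({'Branch': ['CSE', 'CSE', 'ECE', 'ECE'],
                   'Marks': [85, np.nan, 78, 92]})
# Easy handling of missing data: fill the NaN with the column mean
print(df['Marks'].fillna(df['Marks'].mean()))
# Split-apply-combine: mean marks per branch
print(df.groupby('Branch')['Marks'].mean())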
Creating a dataframe using List:
DataFrame can be created using a single list or a list of lists.
Source Code:
# import pandas as pd
import pandas as pd
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
       'portal', 'for', 'Geeks']
# Create DataFrame from the list
df = pd.DataFrame(lst)
print(df)
Creating a dataframe using a dict of lists:
# dictionary of lists
data = {'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd']}
# Create DataFrame from the dictionary
df = pd.DataFrame(data)
print(df)
Result:
Thus the python program for working with pandas data frames library was executed
successfully and the output was verified.
EXPT NO.: 04 Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.
DATE :
Aim:
To read data from files and explore various commands for doing descriptive
analytics on the Iris data set.
Algorithm:
1. Download “Iris.csv” file from GitHub.com
2. Load the “Iris.csv” into google colab.
3. Perform descriptive analysis on the Iris file.
Program:
Iris Dataset
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# Printing top 5 rows
print(df.head())
Getting Information about the Dataset
df.shape
df.info()
df.describe()
Checking the Species Counts
df.value_counts("Species")
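The value counts above show whether the classes are balanced; to actually check for duplicate rows, a short sketch on the same DataFrame:
# Count fully duplicated rows; drop them if any are found
print(df.duplicated().sum())
df = df.drop_duplicates()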
Data Visualization
Visualizing the target column
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Species', data=df)
plt.show()
Histograms
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes[0,0].set_title("Sepal Length")
axes[0,0].hist(df['SepalLengthCm'], bins=7)
axes[0,1].set_title("Sepal Width")
axes[0,1].hist(df['SepalWidthCm'], bins=5)
axes[1,0].set_title("Petal Length")
axes[1,0].hist(df['PetalLengthCm'], bins=6)
axes[1,1].set_title("Petal Width")
axes[1,1].hist(df['PetalWidthCm'], bins=6)
plt.show()
Result:
Thus the python program to read and work on iris dataset was executed successfully
and the output was verified.
EXPT NO.: 05(a) Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following: Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
DATE :
Aim:
To write a python program for Univariate analysis: Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis using UCI and Pima Diabetes dataset.
Procedure
1. Download a dataset like the Pima Indian diabetes dataset. Save it in any drive and load it for processing.
2. The mean() function can be used to calculate the mean/average of a given list of numbers.
3. The median() method calculates the median (middle value) of the given data set.
4. The mode of a set of data values is the value that appears most often.
5. The var() method calculates the variance for each column.
6. Standard deviation std() is a number that describes how spread out the values are.
7. The skew() method calculates the skew for each column. Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data.
Kurtosis:
It is also a statistical term and an important characteristic of frequency distribution.
It determines whether a distribution is heavy-tailed with respect to the normal distribution.
It provides information about the shape of a frequency distribution.
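As a note on conventions: scipy's kurtosis() returns the Fisher (excess) kurtosis by default, and the Pearson value when fisher=False is passed; a short sketch on a made-up normal sample:
import numpy as np
from scipy.stats import kurtosis
data = np.random.normal(0, 1, 1000)   # made-up sample
print(kurtosis(data))                 # Fisher definition: close to 0 for a normal sample
print(kurtosis(data, fisher=False))   # Pearson definition: close to 3 for a normal sample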
Program:
import pandas as pd
from scipy.stats import kurtosis
df = pd.read_csv(r'd:\diabetes.csv')
print(df)
print(df.mean())
print(df.median())
print(df.mode())
print(df.var())
print(df.std())
print(df.skew())
print(kurtosis(df, axis=0, bias=True))
Dataset download link
https://github.com/npradaschnor/Pima-Indians-Diabetes-Dataset/blob/master/Pima%20Indians%20Diabetes%20Dataset.ipynb
Result:
Thus the python program for Univariate analysis: Frequency, Mean, Median, Mode,
Variance, Standard Deviation, Skewness and Kurtosis was executed successfully and the
output was verified.
EXPT NO.: 05(b) Linear Regression and Logistic Regression with the Diabetes Dataset Using Python Machine Learning.
DATE :
Aim:
To write a python program for Linear Regression and Logistic Regression with the
UCI and Pima Indians Diabetes datasets using Python machine learning.
Procedure:
1. Load sklearn libraries.
2. Load data.
3. Load the diabetes dataset.
4. Split the dataset.
5. Create the Linear Regression and Logistic Regression models.
6. Make predictions using the testing set.
7. Find the coefficients and the mean squared error.
Program
# Create Linear regression object and fit it on the training split
regr = LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
# Create Logistic regression object and fit it on the training split
Logistic_model = LogisticRegression(max_iter=1000)
Logistic_model.fit(diabetes_X_train, diabetes_y_train)
# The coefficients of the linear model
print('Coefficients: \n', regr.coef_)
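The lines above presuppose the loading and splitting steps (1-4 of the procedure). A minimal end-to-end sketch is given below, assuming the Pima diabetes.csv file (with an Outcome column) saved on the d: drive as in the other experiments; the path and column name are assumptions:
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Load the Pima diabetes dataset (path is an assumption)
df = pd.read_csv(r'd:\diabetes.csv')
X = df.drop('Outcome', axis=1)   # independent variables
y = df['Outcome']                # 0/1 diabetes outcome
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
# Linear regression treats the 0/1 outcome as a continuous value
regr = LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)
# Logistic regression models the outcome as a class label
Logistic_model = LogisticRegression(max_iter=1000)
Logistic_model.fit(diabetes_X_train, diabetes_y_train)
# Coefficients and mean squared error (procedure step 7)
print('Coefficients: \n', regr.coef_)
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))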
Result:
Thus the python program for Linear Regression and Logistic Regression with the
Diabetes Dataset Using Python Machine Learning was executed successfully and the output
was verified.
EXPT NO.: 05(c) Multiple Regression with the Diabetes Dataset Using Python Machine Learning.
DATE :
Aim:
To write a python program for Multiple Regression using UCI and Pima Diabetes
dataset.
Procedure:
1. The Pandas module allows us to read csv files and return a DataFrame object.
2. Then make a list of the independent values and call this variable X.
3. Put the dependent values in a variable called y.
4. From the sklearn module we will use the LinearRegression() method to create a linear regression object.
5. This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship.
6. We then have a regression object that is ready to predict Age values based on a person's Glucose and BloodPressure.
Program:
import pandas as pd
from sklearn import linear_model
df = pd.read_csv(r'd:\diabetes.csv')
print (df)
X = df[['Glucose', 'BloodPressure']]
y = df['Age']
regr = linear_model.LinearRegression()
regr.fit(X, y)
predictedage = regr.predict([[150, 13]])
print(predictedage)
Result:
Thus the python program for Multiple Regression was executed successfully and the
output was verified.
EXPT NO.: 05(d) Compare the results of the above analysis for the two data sets.
DATE :
Aim:
To write a python program to compare the results of the above analysis for the two
data sets using the UCI and Pima Diabetes datasets.
Procedure:
Step 1: Prepare the datasets to be compared. Say you have one set of data stored in a CSV file called car1.csv, and a second set stored in a CSV file called car2.csv.
Step 2: Create the two DataFrames by reading the two CSV files.
Step 3: Compare the values between the two Pandas DataFrames. In this step, you'll need to import the NumPy package.
Program:
import pandas as pd
import numpy as np
data_1 = pd.read_csv(r'd:\car1.csv')
df1 = pd.DataFrame(data_1)
data_2 = pd.read_csv(r'd:\car2.csv')
df2 = pd.DataFrame(data_2)
df1['amount1'] = df2['amount1']
df1['prices_match'] = np.where(df1['amount'] == df2['amount1'], 'True', 'False')
df1['price_diff'] = np.where(df1['amount'] == df2['amount1'], 0, df1['amount'] - df2['amount1'])
print(df1)
Result:
Thus the python program to compare the results of the above analysis for the two
data sets was executed successfully and the output was verified.
EXPT NO.: 06(a) Apply and explore various plotting functions on UCI data sets: Normal Curves
DATE :
Aim:
To apply and explore Normal curves functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the normal curve for Iris data set.
Normal Curves:
It is a probability function used in statistics that tells about how the data values are
distributed. It is the most important probability distribution function used in statistics
because of its advantages in real case scenarios.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
# import dataset
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Plot between -20 and 20 with 0.01 steps.
x_axis = np.arange(-20, 20, 0.01)
# Calculating mean and standard deviation of the sepal length
mean = df["sepal.length"].mean()
sd = df["sepal.length"].std()
plt.plot(x_axis, norm.pdf(x_axis, mean, sd))
plt.show()
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Normal Curves was executed successfully and the output was verified.
EXPT NO.: 06(b) Apply and explore various plotting functions on UCI data sets: Density and Contour Plots
DATE :
Aim:
To apply and explore Density & Contour plotting functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the density and contour plotting for Iris data sets.
Density Plotting
Density Plot is a type of data visualization tool. It is a variation of the histogram that
uses 'kernel smoothing' while plotting the values. It is a continuous and smooth version of a
histogram inferred from the data.
Density plots use Kernel Density Estimation (so they are also known as Kernel
Density Estimation plots or KDE plots), which is a probability density function. The region of
the plot with a higher peak is the region with the maximum data points residing between those values.
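For the density part, a minimal sketch with seaborn's kdeplot, assuming the same iris.csv path and sepal.length column used in the previous experiment:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
# Kernel density estimate of the sepal length; fill shades the area under the curve
sns.kdeplot(x=df["sepal.length"], fill=True)
plt.show()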
Contour plotting
Contour plots, also called level plots, are a tool for doing multivariate analysis and
visualizing 3-D plots in 2-D space. If we consider X and Y as the variables we want to plot,
then the response Z will be plotted as slices on the X-Y plane, due to which contours are
sometimes referred to as Z-slices or iso-response values.
Contour plots are widely used to visualize density, altitudes or heights of
mountains, as well as in the meteorological department.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
# Note: this code expects a CSV laid out as a numeric grid
# (x values in the first row, y values in the first column)
px_orbital = pd.read_csv("/content/drive/MyDrive/Data_Science/iris.csv")
x = px_orbital.iloc[0, 1:]
y = px_orbital.iloc[1:, 0]
px_values = px_orbital.iloc[1:, 1:]
mpl.rcParams['font.size'] = 14
mpl.rcParams['legend.fontsize'] = 'large'
mpl.rcParams['figure.titlesize'] = 'medium'
# Contour levels and colorbar ticks span the range of the grid values
pmin, pmax = px_values.values.min(), px_values.values.max()
levels = np.linspace(pmin, pmax, 20)
ticks = np.linspace(pmin, pmax, 6)
fig, ax = plt.subplots()
CS = ax.contourf(x, y, px_values, cmap="RdBu", levels=levels)
ax.set_aspect('equal')
ax.set_xlabel('x')
ax.set_ylabel('y')
fig.colorbar(CS, format="%.3f", ticks=ticks)
plt.show()
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Density and contour plots was executed successfully and the output was verified.
EXPT NO.: 06(c) Apply and explore various plotting functions on UCI data sets: Correlation and Scatter Plots
DATE :
Aim:
To apply and explore correlation and scatter plotting functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the correlation and scatter plotting for Iris data sets.
Source Code:
# Correlation Matrix Plot
import matplotlib.pyplot as plt
import pandas
import numpy
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pimaindians-
diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
Scatter Plotting
A scatterplot shows the relationship between two variables as dots in two
dimensions, one axis for each attribute. You can create a scatterplot for each pair of
attributes in your data. Drawing all these scatterplots together is called a scatterplot matrix.
Scatter plots are useful for spotting structured relationships between variables, like
whether you could summarize the relationship between two variables with a line. Attributes
with structured relationships may also be correlated and good candidates for removal from
your dataset.
Source Code:
# Scatterplot Matrix
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pimaindians-
diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
scatter_matrix(data)
plt.show()
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Correlation and Scatter Plots was executed successfully and the output was
verified.
EXPT NO.: 06(d) Apply and explore various plotting functions on UCI data sets: Histograms
DATE :
Aim:
To apply and explore Histograms plotting functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the Histograms for Iris data set.
Source Code:
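A minimal sketch for the histogram plot, assuming the Iris.csv file and the column names used in experiment 04:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("Iris.csv")
# Histogram of sepal length, coloured by species
sns.histplot(data=df, x="SepalLengthCm", hue="Species", bins=7)
plt.show()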
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Histogram was executed successfully and the output was verified.
EXPT NO.: 06(e) Three Dimensional Plotting
DATE :
Aim:
To apply and explore Three Dimensional plotting functions on UCI-Iris data sets.
Algorithm:
1. Download Iris data set from UCI.
2. Load the above Iris data files into google colab.
3. Plot the Three Dimensional Plotting for Iris data set.
Matplotlib:
Matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python. Matplotlib makes easy things easy and hard things possible.
Source Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# plt.style.use('default')
color_pallete = ['#fc5185', '#3fc1c9', '#364f6b']
sns.set_palette(color_pallete)
sns.set_style("white")
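The snippet above only sets up the colour palette and style; a minimal 3D scatter sketch continuing from it with matplotlib's mplot3d toolkit, assuming the Iris.csv file and column names from experiment 04:
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection
df = pd.read_csv("Iris.csv")
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
# One colour per species, using the palette set above
for species, colour in zip(df['Species'].unique(), color_pallete):
    subset = df[df['Species'] == species]
    ax.scatter(subset['SepalLengthCm'], subset['SepalWidthCm'],
               subset['PetalLengthCm'], c=colour, label=species)
ax.set_xlabel('Sepal Length')
ax.set_ylabel('Sepal Width')
ax.set_zlabel('Petal Length')
ax.legend()
plt.show()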
Result:
Thus the python program for Apply and explore various plotting functions on UCI
data sets: Three Dimensional Plotting was executed successfully and the output was verified.
EXPT NO.: 07 Visualizing Geographic Data With Basemap.
DATE :
Aim:
To write a python program to visualize geographic data with Basemap in data
science.
The first code block creates a figure, sets the area for the map, and adds basics like
boundaries and water colour.
A basic map
I've added notes below for more details. Essentially the process here is:
1. Create a figure
2. Create a basic map to plot things onto (see 'm = Basemap' below)
3. Add features like map boundaries, rivers, etc.
4. I created tuples to store my coords in one place
5. Run a 'for' loop to grab each set of coords and plot them on the map -
if you removed the for loop then you would just have a nice plain map.
Source Code:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig, ax = plt.subplots(figsize=(10,10))
m = Basemap(llcrnrlon=-7.5600, llcrnrlat=49.7600,
            urcrnrlon=2.7800, urcrnrlat=60.840,
            resolution='i', # Set using letters, e.g. c is a crude drawing, f is a full detailed drawing
            projection='tmerc', # The projection style is what gives us a 2D view of the world for this
            lon_0=-4.36, lat_0=54.7, # Setting the central point of the image
            epsg=27700) # Setting the coordinate system we're using
m.drawmapboundary(fill_color='#46bcec') # Map boundary with water colour
m.fillcontinents(color='#f2f2f2', lake_color='#46bcec')
m.drawcoastlines()
# df_2000 is assumed to be a DataFrame whose 'lat_lon' column holds (x, y) coordinate tuples
for i in df_2000[0:500]['lat_lon']:
    x, y = i
    m.plot(x, y, marker='o', c='r', markersize=1, alpha=0.8, latlon=False)
plt.show()
Result:
Thus the python program for Visualizing Geographic Data With Basemap was
executed successfully and the output was verified.