733702205-DSBDA-Mini-Project-Report

SCTR's Pune Institute of Computer Technology
Dhankawadi, Pune
A PROJECT REPORT ON
Covid Vaccine Statewise Analysis
SUBMITTED BY
Omkar Shinde (31480)
Amey Wadgaonkar (31492)
Under the guidance of

Prof. Rutuja Kulkarni
DEPARTMENT OF COMPUTER ENGINEERING

Academic Year 2023-24
Title:
Mini-Project
Exploratory data analysis of the COVID-19 vaccination data of India using the given dataset.
Problem Statement:
Use the covid_vaccine_statewise.csv dataset
(https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv)
and perform the following analytics on it:
● Describe the dataset
● Number of persons statewise vaccinated for the first dose in India
● Number of persons statewise vaccinated for the second dose in India
● Number of males vaccinated
● Number of females vaccinated
Objectives:
● To describe the dataset
● To preprocess the given dataset
Theory:
Libraries:
● Pandas - Pandas is a Python library for data manipulation and analysis that plays a
central role in machine learning (ML) workflows. Its core data structures, DataFrame and
Series, simplify the preprocessing and cleaning of datasets for ML tasks. It supports data
selection, filtering, merging, and transformation, which lets users handle missing values,
treat outliers, and engineer features. Pandas integrates with other ML libraries such as
NumPy and scikit-learn, and it reads and writes data in many file formats, which makes it
convenient for working with diverse data sources. For exploratory data analysis, data
preprocessing, and feature extraction alike, Pandas provides a versatile and efficient
toolkit that supports the development of robust machine learning models.
● NumPy - NumPy is a fundamental library for numerical computation in Python and is widely
used in machine learning (ML) applications. Its array-oriented programming model allows
efficient manipulation and processing of large multi-dimensional arrays and matrices, which
are central to many ML algorithms. NumPy's extensive collection of mathematical functions
operates on whole arrays at once (vectorization), which is much faster than looping over
elements in plain Python. ML tasks such as data preprocessing, feature extraction, and model
evaluation benefit from NumPy's handling of numerical data. NumPy arrays are also accepted
by other ML libraries such as Pandas, scikit-learn, and TensorFlow, so data can be passed
between them easily.
● Sklearn - Scikit-learn is a widely used machine learning library in Python that provides
tools for most stages of an ML workflow. Its algorithms and utilities cover classification,
regression, clustering, dimensionality reduction, and model selection. With scikit-learn, ML
practitioners can implement and experiment with different algorithms and models without
building everything from scratch. The library offers a consistent, user-friendly API for
preprocessing and transforming data, splitting datasets for training and testing, and
evaluating model performance with various metrics. Scikit-learn also includes modules for
feature extraction, feature selection, and hyperparameter tuning, which let researchers
fine-tune models for better performance.
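As a minimal sketch of how Pandas and NumPy work together, the example below converts DataFrame columns to NumPy arrays and applies one vectorized operation to them. The state labels, column names, and counts are hypothetical placeholders and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical per-state dose counts (illustrative values only).
df = pd.DataFrame({
    "State": ["A", "B", "C"],
    "First Dose": [100, 250, 400],
    "Second Dose": [40, 100, 100],
})

# Convert the Pandas columns to NumPy arrays.
first = df["First Dose"].to_numpy()
second = df["Second Dose"].to_numpy()

# Vectorized division: the whole array is processed at once, with no Python loop.
ratio = second / first

# Store the NumPy result back in the DataFrame as a new column.
df["Second/First"] = np.round(ratio, 2)
print(df)
```

The same pattern applies to any element-wise arithmetic on columns of equal length.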
Methods used:
● read_csv() - The pandas read_csv() function takes a path to a CSV file and reads the data
into a Pandas DataFrame object.
● describe() - The describe() method returns a statistical summary of the data in the
DataFrame. For numerical columns, the summary contains the following for each column:
count - the number of non-empty values; mean - the average value; std - the standard
deviation; min and max - the smallest and largest values; 25%, 50%, and 75% - the quartiles.
● groupby() and sum() - Use DataFrame.groupby().sum() to group rows by one or more columns
and compute the sum for each group. groupby() returns a DataFrameGroupBy object, which
provides the aggregate function sum() to total a given column within each group.
A groupby operation combines three steps: splitting the object into groups, applying a
function to each group, and combining the results. This makes it possible to group large
amounts of data and compute operations such as sum() on the groups.
The DataFrame.sum() function returns the sum of the values along the requested axis. Along
the index axis (the default), it adds all the values in each column and returns a Series
containing one total per column.
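The three methods above can be sketched on a tiny stand-in for the dataset. The column names ("State", "First Dose", "Second Dose") and all numbers here are hypothetical and chosen only for illustration; the real file's headers may differ. The CSV text is read from memory so the example runs without the file.

```python
import io
import pandas as pd

# Tiny stand-in for covid_vaccine_statewise.csv (hypothetical columns and values).
csv_text = """State,First Dose,Second Dose
Maharashtra,100,40
Maharashtra,150,60
Kerala,80,30
Kerala,20,10
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# describe(): count, mean, std, min, quartiles, and max for each numeric column.
summary = df.describe()
print(summary)

# groupby().sum(): one row per state, with each dose column totalled.
per_state = df.groupby("State")[["First Dose", "Second Dose"]].sum()
print(per_state)

# DataFrame.sum() along the index axis: one total per column.
totals = df[["First Dose", "Second Dose"]].sum()
print(totals)
```

If the real file stores running totals per date rather than daily counts, max() per state may be a more appropriate aggregate than sum(); that is worth checking against the data before reporting totals.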
System Architecture:
Methodology:
1. Data collection: Gather the relevant data from reliable sources.
2. Data loading: Load the data into a suitable data structure (e.g., DataFrame) using a
programming language like Python.
3. Data overview: Get an initial understanding of the dataset by examining its structure,
dimensions, and basic statistical summaries.
4. Data cleaning: Handle missing values, duplicates, and outliers in the dataset to ensure
data quality.
5. Data visualization: Create visual representations such as histograms, scatter plots, and
box plots to understand patterns, relationships, and distributions in the data.
6. Feature engineering: Extract or transform features to derive new meaningful variables
that can enhance the analysis and model performance.
7. Statistical analysis: Apply statistical techniques to uncover insights, correlations, and
associations within the data.
8. Data segmentation: Group and explore the data based on different criteria (e.g.,
demographics, time periods) to uncover patterns and differences.
9. Conclusion and reporting: Summarize findings, draw conclusions, and present the results
of the EDA process in a clear and concise manner.
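A few of the steps above (overview, cleaning, feature engineering, and segmentation) can be sketched as follows. This is an illustrative outline under stated assumptions, not the project's actual code: the records, the column names, and the choice to treat a missing count as zero are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily records; the second row duplicates the first and
# one "Male" value is missing.
df = pd.DataFrame({
    "Date": ["2021-01-16", "2021-01-16", "2021-01-17", "2021-01-16", "2021-01-17"],
    "State": ["Goa", "Goa", "Goa", "Sikkim", "Sikkim"],
    "Male": [10, 10, 20, np.nan, 5],
    "Female": [12, 12, 18, 7, 6],
})
df["Date"] = pd.to_datetime(df["Date"])

# Step 3, data overview: dimensions and missing values per column.
print(df.shape)
print(df.isna().sum())

# Step 4, data cleaning: drop exact duplicate rows, then fill the missing
# count with 0 (an assumption made here for illustration only).
clean = df.drop_duplicates().fillna({"Male": 0})

# Step 6, feature engineering: derive a combined total column.
clean = clean.assign(Total=clean["Male"] + clean["Female"])

# Step 8, data segmentation: totals per state.
by_state = clean.groupby("State")["Total"].sum()
print(by_state)
```

Whether a missing value should be filled with zero, carried forward, or dropped depends on how the source records the data, so that choice should be made after inspecting the real file.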
Results:
Conclusion: In this project, we analyzed the COVID-19 vaccination data in India using the
"covid_vaccine_statewise.csv" dataset. We explored the number of individuals vaccinated for
the first and second doses across different states. Additionally, we determined the number
of males and females vaccinated in the country. The insights gained from this analysis can
aid in understanding the progress of vaccination efforts in India and help in formulating
effective strategies to combat the COVID-19 pandemic.
