733702205-DSBDA-Mini-Project-Report
Dhankawadi, Pune
A PROJECT REPORT ON
Covid Vaccine Statewise Analysis
SUBMITTED BY
Omkar Shinde (31480)
Amey Wadgaonkar (31492)
Problem Statement:
Use the covid_vaccine_statewise.csv dataset and perform the following analytics on it:
https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv
● Describe the dataset
● Number of persons statewise vaccinated for the first dose in India
● Number of persons statewise vaccinated for the second dose in India
● Number of males vaccinated
● Number of females vaccinated
Objectives:
● To describe the dataset
● To preprocess the given dataset
Theory:
Libraries:
● Pandas - Pandas is a Python library for data manipulation and analysis that is widely used in machine learning (ML) workflows. Its core data structures, DataFrame and Series, simplify the preprocessing and cleaning of datasets for ML tasks. It supports data selection, filtering, merging, and transformation, which helps with handling missing values, treating outliers, and feature engineering. Pandas integrates with other ML libraries such as NumPy and scikit-learn, and it reads and writes many file formats (for example CSV, Excel, and JSON), so practitioners can work with diverse data sources. It is therefore suited to exploratory data analysis, data preprocessing, and feature extraction alike.
● NumPy - NumPy is a fundamental library for numerical computation in Python and is widely used in machine learning (ML) applications. Its array-oriented programming model allows efficient manipulation and processing of large multi-dimensional arrays and matrices, which are central to many ML algorithms. Its mathematical functions operate on whole arrays at once (vectorization), which is typically much faster than an equivalent Python loop. ML tasks such as data preprocessing, feature extraction, and model evaluation benefit from these capabilities, and NumPy arrays interoperate with other ML libraries such as Pandas, scikit-learn, and TensorFlow.
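As a small illustration of how the two libraries work together, the sketch below builds a DataFrame and applies vectorized operations to its columns. The column names and numbers are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "State": ["State A", "State B", "State C"],
    "First Dose": [100, 250, 400],
    "Second Dose": [40, 100, 300],
})

# Vectorized arithmetic: one expression operates on the whole column at once.
df["Second Dose Share"] = df["Second Dose"] / df["First Dose"]

# A DataFrame column converts directly to a NumPy array.
first_dose = df["First Dose"].to_numpy()
total_first_dose = int(np.sum(first_dose))
```

No explicit loop is needed in either step, which is the main practical benefit of the array-oriented style described above.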
Methods used:
● read_csv() - The pandas.read_csv() function takes a path to a CSV file and reads the data into a Pandas DataFrame object.
● sum() - The DataFrame.sum() method returns the sum of the values along the requested axis. For the index axis (axis=0, the default), it adds all the values in each column and returns a Series containing one sum per column.
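The two methods can be tried on a few rows without downloading the dataset. The sketch below reads CSV text from memory instead of a file; the column names and numbers are assumptions made up for the example, so they may not match the real file.

```python
import io

import pandas as pd

# Tiny stand-in for covid_vaccine_statewise.csv. The column names and
# the numbers are assumptions made up for this example.
csv_text = """State,First Dose Administered,Second Dose Administered
State A,1000,400
State B,600,300
State C,50,20
"""

# read_csv() accepts a file path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))

dose_cols = ["First Dose Administered", "Second Dose Administered"]

# sum() over the index axis (axis=0, the default): one total per column.
column_totals = df[dose_cols].sum(axis=0)

# sum() over the column axis (axis=1): one total per row.
row_totals = df[dose_cols].sum(axis=1)
```

With a real file, the io.StringIO object would be replaced by the path to the CSV.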
System Architecture:
Methodology:
1. Data collection: Gather the relevant data from reliable sources.
2. Data loading: Load the data into a suitable data structure (e.g., DataFrame) using a programming language like Python.
3. Data overview: Get an initial understanding of the dataset by examining its structure, dimensions, and basic statistical summaries.
4. Data cleaning: Handle missing values, duplicates, and outliers in the dataset to ensure data quality.
5. Data visualization: Create visual representations such as histograms, scatter plots, and box plots to understand patterns, relationships, and distributions in the data.
6. Feature engineering: Extract or transform features to derive new meaningful variables that can enhance the analysis and model performance.
7. Statistical analysis: Apply statistical techniques to uncover insights, correlations, and associations within the data.
8. Data segmentation: Group and explore the data based on different criteria (e.g., demographics, time periods) to uncover patterns and differences.
9. Conclusion and reporting: Summarize findings, draw conclusions, and present the results of the EDA process in a clear and concise manner.
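A minimal sketch of steps 2, 3, 4, and 8 is given below. It is not the code used to produce the results in this report. The rows are made up, and both the column names and the assumption that each row holds a cumulative count for one state on one date are hypothetical.

```python
import io

import pandas as pd

# Step 2, data loading: made-up rows standing in for the real CSV.
# Column names and the cumulative-by-date layout are assumptions.
csv_text = """Updated On,State,First Dose Administered,Second Dose Administered
16/01/2021,State A,10,0
17/01/2021,State A,25,5
16/01/2021,State B,7,0
17/01/2021,State B,,
17/01/2021,State B,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Step 3, data overview: dimensions and basic statistical summary.
n_rows, n_cols = df.shape
summary = df.describe()

# Step 4, data cleaning: drop exact duplicate rows, then drop rows
# in which every dose count is missing.
dose_cols = ["First Dose Administered", "Second Dose Administered"]
clean = df.drop_duplicates().dropna(subset=dose_cols, how="all")

# Step 8, data segmentation: group by state. Because the counts are
# assumed to be cumulative, the per-state maximum is the latest total.
per_state = clean.groupby("State")[dose_cols].max()
```

If the counts in the real file were daily rather than cumulative, sum() would replace max() in the last step.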
Results:
Conclusion: In this project, we analyzed the COVID-19 vaccination data in India using the "covid_vaccine_statewise.csv" dataset. We explored the number of individuals vaccinated for the first and second doses across different states. Additionally, we determined the number of males and females vaccinated in the country. The insights gained from this analysis can aid in understanding the progress of vaccination efforts in India and help in formulating effective strategies to combat the COVID-19 pandemic.