DATA ANALYSIS USING PYTHON
PYTHON
INTERNSHIP PROGRAMME
PROJECT NAME:
ANALYSIS OF SELLING PRICE OF USED CARS
TEAM NO:2
TEAM LEADER:
BATTULA SUDHARSANA RAO -22NE1A4204
TEAM MEMBERS:
ACHI SIVA SANKAR -22NE1A4201
CHINNAM HARISH KUMAR -22NE1A4208
MARKAPURAM SURENDRA -22NE1A4241
ANALYSIS OF SELLING PRICE OF USED CARS USING PYTHON
SYSTEM REQUIREMENTS
1. Operating System:
Device ID: 85503FD3-F8F3-4038-AFCB-D11D73BE5358
Product ID: 00342-42668-74072-AAOEM
2. Software Requirements:
ARCHITECTURE OF PROJECT
Fig 1: Price prediction of used cars
1. numpy
2. pandas
3. seaborn
4. scipy
5. matplotlib
About Libraries
1. numpy
NumPy is a widely-used library in Python, essential for many scientific, mathematical,
and engineering applications. Here are some key uses and points about NumPy:
Uses of NumPy
1. Efficient Data Storage: NumPy's ndarray objects store large datasets efficiently, using
less memory and allowing faster operations than traditional Python lists.
2. Statistical Analysis: Provides functions for computing statistical metrics such as mean,
median, variance, and standard deviation.
3. Data Preprocessing
Data Cleaning: Facilitates operations like replacing missing values, normalizing data,
and removing outliers.
Reshaping and Resizing: Allows for reshaping arrays, adding or removing dimensions,
and resizing arrays without losing data.
4. Scientific Computing and Machine Learning
Data Preparation: Often used to prepare and manipulate datasets before feeding them
into machine learning models.
Integration with Libraries: Forms the backbone of many machine learning libraries
like TensorFlow and Scikit-Learn.
5. Visualization and Image Processing
Support for Plotting Libraries: Works seamlessly with plotting libraries like
Matplotlib to create visualizations of data.
Image Processing: Can be used to manipulate and process image data efficiently.
Key Points
1. Homogeneous Data: Elements must be of the same data type, which ensures efficient
storage and computation.
2. Performance
Speed: Much faster than native Python lists due to optimized C and Fortran code under
the hood.
Flexible Operations: Allows for operations on arrays of different shapes (broadcasting)
without the need for explicit loops.
3. Integration
Interoperability: Can be used with other scientific libraries like SciPy, Pandas, and
Matplotlib.
Low-level Languages: Supports integration with C, C++, and Fortran code for high-
performance tasks.
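For instance, the scaling used later in this project (dividing a column by its maximum) is a single vectorized NumPy operation. The values below are invented for illustration; the real values come from the car dataset:

```python
import numpy as np

# Hypothetical car lengths (stand-ins for the dataset's "length" column)
lengths = np.array([168.8, 171.2, 176.6, 192.7])

# Scale every element by the column maximum in one step, no explicit loop
scaled = lengths / lengths.max()

# Basic statistics without explicit loops
print(scaled.max())        # 1.0
print(lengths.mean())
```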
2. pandas
Pandas is a powerful and flexible open-source data analysis and manipulation library
for Python. It is particularly well-suited for working with structured data (like spreadsheets or
SQL tables). Here are some key points and uses of Pandas:
Key Points
1. Data Structures: Pandas introduces two primary data structures: Series (1-
dimensional) and DataFrame (2-dimensional). These structures are built on top of
NumPy arrays and are highly optimized for performance.
2. Data Cleaning: Pandas provides extensive tools for cleaning data, including handling
missing values, filtering out unwanted data, and transforming data formats.
3. Data Manipulation: It supports a variety of operations such as merging, joining,
reshaping, and pivoting datasets, making it easy to manipulate data to fit the desired
structure.
4. Data Analysis: Pandas offers a suite of analytical tools, including groupby operations,
statistical functions, and time-series functionality.
5. Integration: It integrates seamlessly with other data analysis and machine learning
libraries such as NumPy, SciPy, Matplotlib, and scikit-learn.
6. Input and Output: Pandas can read from and write to a variety of data formats,
including CSV, Excel, SQL databases, JSON, and more.
7. Performance: Built on top of NumPy's fast array operations, Pandas is itself
optimized for performance, especially in its ability to handle large datasets
efficiently.
Uses
Pandas is excellent for time series data, with tools to handle date and time
data, perform resampling, and calculate rolling statistics.
Pandas makes it easy to reshape and transform data for analysis, including
pivot tables, melting, and stacking/unstacking data.
Pandas works well with other Python libraries for data analysis,
visualization, and machine learning.
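As a small illustration of the groupby workflow used later in this project (Step 10), here is a tiny stand-in DataFrame with invented values:

```python
import pandas as pd

# A tiny stand-in for the car dataset (values are invented for illustration)
data = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'rwd', 'rwd'],
    'body-style':   ['sedan', 'sedan', 'sedan', 'hatchback'],
    'price':        [9000, 11000, 20000, 16000],
})

# Average price per drive-wheels group, as done later in the project code
grp = data.groupby('drive-wheels', as_index=False)['price'].mean()
print(grp)
```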
3. seaborn
Seaborn is a powerful and user-friendly data visualization library built on top of
Matplotlib in Python. It provides a high-level interface for drawing attractive and informative
statistical graphics. Here are some key points and uses of Seaborn:
Uses
1. Visualizing Distributions
Seaborn provides functions such as histograms and kernel density estimate (KDE)
plots for visualizing the distribution of a dataset.
2. Visualizing Relationships
Seaborn makes it easy to visualize relationships between variables with scatter plots,
line plots, and more.
3. Visualizing Categorical Data
Seaborn provides several functions to visualize categorical data, such as bar plots, count
plots, box plots, and violin plots.
4. Statistical Estimations
Seaborn has tools for statistical estimation, such as linear regression plots and error
bars.
5. Multivariate Data
Seaborn can handle multivariate data visualization with tools like pair plots and
heatmaps.
6. Facet Grids
Facet grids are useful for visualizing data across multiple subplots, enabling the
examination of relationships conditioned on categorical variables.
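A one-call example of a grouped box plot (the same kind of call appears in Step 9 of the project code); the sample data below is invented for illustration, and the off-screen Agg backend is used so no display is required:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; no display window needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented sample data standing in for the car dataset
data = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'rwd', 'rwd', '4wd'],
    'price':        [9000, 11000, 20000, 16000, 12000],
})

# One call draws a box plot of price per drive-wheels category
ax = sns.boxplot(x='drive-wheels', y='price', data=data)
plt.savefig('boxplot.png')
```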
4. scipy
Scipy (pronounced "sigh-pie") is a powerful Python library used in scientific and technical
computing. It builds on top of NumPy and provides a wide range of functions that operate on
NumPy arrays and are useful for different scientific disciplines. Here are some key uses and
points about Scipy:
1. Integration and Optimization: Scipy offers functions for numerical integration
(scipy.integrate) and optimization (scipy.optimize). These are essential for solving
complex mathematical problems, such as finding the minimum or maximum of a
function or integrating differential equations.
2. Linear Algebra: Scipy includes modules for linear algebra (scipy.linalg) that provide
optimized routines for matrix operations, including solving linear systems, eigenvalue
problems, and matrix decompositions (e.g., LU, QR, SVD).
3. Statistics: The scipy.stats module offers a wide range of statistical functions and tests,
including probability distributions (PDF, CDF), statistical tests (t-tests, ANOVA), and
descriptive statistics (mean, median, variance).
4. Signal Processing: For signal processing tasks, Scipy (scipy.signal) provides functions
for filtering, spectral analysis, and interpolation, which are crucial in fields such as
digital signal processing and telecommunications.
5. Image Processing: Scipy's scipy.ndimage module provides functions for multi-
dimensional image processing. This includes tasks like filtering, interpolation, and
measurements on n-dimensional image data.
6. Sparse Matrix Operations: Scipy supports sparse matrices (scipy.sparse), which are
memory-efficient representations of large matrices with many zero elements. It provides
functions for sparse matrix manipulation, arithmetic, and solving linear systems.
7. Spatial and Distance Operations: The scipy.spatial module offers spatial algorithms
and data structures, including distance computations, spatial transformations, and
clustering algorithms.
8. Special Functions: Scipy includes a scipy.special module that provides access to many
special functions of mathematical physics, such as Bessel, gamma, and hypergeometric
functions.
9. File I/O and Interoperability: Scipy provides utilities (scipy.io) for reading and
writing various file formats, including MATLAB files (.mat), NetCDF, and more. This
makes it easier to work with data from different sources.
10. Integration with NumPy: Scipy seamlessly integrates with NumPy arrays, allowing
for efficient computation and manipulation of large datasets typical in scientific
computing.
Scipy is invaluable for scientific computing tasks in Python, offering a rich set of
functionalities that cover numerical integration, optimization, linear algebra, statistics, signal
processing, image processing, and more. Its extensive capabilities make it a go-to library for
researchers, engineers, and data scientists working on complex mathematical and scientific
problems.
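As a small taste of scipy.stats, here is a least-squares linear fit, the same statistical idea behind the regression plot in Step 13. The engine sizes and prices below are invented for illustration:

```python
from scipy import stats

# Invented engine sizes and prices to illustrate scipy.stats
engine_size = [97, 109, 130, 152, 181]
price = [7000, 9500, 13000, 17500, 24000]

# Least-squares fit: a positive slope indicates a positive linear relationship
result = stats.linregress(engine_size, price)
print(result.slope > 0)
print(result.rvalue)
```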
5. matplotlib
Matplotlib is a widely-used Python library for creating static, animated, and interactive
visualizations. It provides a high-level interface for drawing a variety of plots and charts,
making it essential for data analysis, exploration, and presentation. Here are some key uses
and points about Matplotlib:
1. Plotting Functions: Matplotlib's pyplot module offers a MATLAB-like interface for
creating a variety of plots such as line plots, scatter plots, bar plots, histograms, pie
charts, and more. It allows users to customize virtually every aspect of the plot,
including colors, labels, markers, and annotations.
2. Publication Quality: Matplotlib is designed to produce high-quality graphics suitable
for publication and presentations. It provides fine-grained control over every visual
aspect of a plot, ensuring that plots meet specific formatting requirements.
3. Wide Adoption and Community: Being one of the oldest and most established
plotting libraries in Python, Matplotlib has a large user base and an active community.
This results in extensive documentation, tutorials, and examples available online,
making it easier for users to learn and troubleshoot.
4. Integration with Jupyter Notebooks: Matplotlib integrates seamlessly with Jupyter
Notebooks, allowing users to create interactive plots directly within their notebooks.
This is particularly useful for data exploration and interactive data analysis tasks.
5. Support for Various Output Formats: Matplotlib can generate plots in various
formats, including PNG, PDF, SVG, and more. This flexibility allows users to integrate
plots into different types of documents and presentations.
6. Customizability: Matplotlib offers extensive customization options through its object-
oriented interface. Users can manipulate individual elements of a plot (e.g., axes, lines,
markers) directly, making it possible to create complex and tailored visualizations.
7. Multi-platform Support: Matplotlib is compatible with different operating systems
(Windows, macOS, Linux) and can be used with multiple Python environments,
including standard Python distributions and scientific computing platforms like
Anaconda.
8. Layered Architecture: Matplotlib follows a layered architecture, allowing users to
construct plots by building layers of components such as axes, labels, legends, and
annotations. This modular approach provides flexibility in creating complex
visualizations.
9. Extensibility: Matplotlib's architecture supports extensions and plugins, enabling the
creation of specialized plots and custom visualizations. Users can leverage additional
libraries built on top of Matplotlib, such as Seaborn (for statistical data visualization)
and mplfinance (for financial data visualization).
10. Matplotlib and Pandas Integration: Matplotlib integrates well with Pandas, a popular
data manipulation and analysis library in Python. Users can directly plot Pandas
DataFrame and Series objects using Matplotlib's plotting functions, simplifying the
visualization of data stored in Pandas structures.
In summary, Matplotlib is a versatile and powerful library for creating a wide range of static
and interactive visualizations in Python. Its rich feature set, customization options, and broad
community support make it an essential tool for data scientists, researchers, engineers, and
anyone involved in data visualization and presentation.
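A minimal sketch of the pyplot interface described above, using invented prices and the off-screen Agg backend (so it runs without a display); note that plt.hist returns the per-bin counts alongside the drawn bars:

```python
import matplotlib
matplotlib.use('Agg')  # off-screen backend; no window needed
import matplotlib.pyplot as plt

# Invented prices, standing in for the dataset's price column
prices = [5118, 7295, 9095, 10295, 13495, 16500, 22018, 30760]

# Counts per bin are returned alongside the drawn bars
counts, bins, patches = plt.hist(prices, bins=3)
plt.xlabel('Price')
plt.ylabel('Number of cars')
plt.title('Distribution of car prices')
plt.savefig('prices.png')  # PNG is one of several supported output formats
print(int(counts.sum()))   # every price falls in some bin, so this prints 8
```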
Steps for installing these packages:
1. If you are using Anaconda (Jupyter/Spyder) or any other third-party software to write
your Python code, make sure to add that software's "Scripts" folder to the path in
your PC's command prompt.
2. Then type: pip install package-name
Example: pip install numpy
3. Then, after the installation is done (make sure you are connected to the internet!),
open your IDE and import those packages.
Example:
import numpy
PROJECT CODE
Step 1: Import the modules needed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from scipy import stats  # ensures sp.stats is available on older SciPy versions
Step 2: Reading the dataset.
df = pd.read_csv('data.csv')
df.head(5)
Output:
Step 3: Defining headers for our dataset.
headers = ["symboling", "normalized-losses", "make",
           "fuel-type", "aspiration", "num-of-doors",
           "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size",
           "fuel-system", "bore", "stroke", "compression-ratio",
           "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df.columns=headers
df.head()
Output:
print(df.columns)
Output:
data = df
data.isna().any()
data.isnull().any()
data.price.unique()
Output:
data['length'] = data['length']/data['length'].max()
data['width'] = data['width']/data['width'].max()
data['height'] = data['height']/data['height'].max()
bins = np.linspace(min(data['price']), max(data['price']), 4)
group_names = ['Low', 'Medium', 'High']
data['price-binned'] = pd.cut(data['price'], bins,
labels = group_names,
include_lowest = True)
print(data['price-binned'])
plt.hist(data['price-binned'])
plt.show()
Output:
pd.get_dummies(data['fuel-type']).head()
data.describe()
Output:
Step 9: Plotting the price against drive wheels and engine size.
plt.boxplot(data['price'])
sns.boxplot(x='drive-wheels', y='price', data=data)
plt.scatter(data['engine-size'], data['price'])
plt.title('Scatterplot of Engine size vs Price')
plt.xlabel('Engine size')
plt.ylabel('Price')
plt.grid()
plt.show()
Output:
Step 10: Grouping the data according to wheel, body-style and price.
test = data[['drive-wheels', 'body-style', 'price']]
data_grp = test.groupby(['drive-wheels', 'body-style'],
                        as_index=False).mean()
data_grp
Output:
Step 11: Using the pivot method, and plotting a heatmap of the data obtained
from it.
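The pivot code itself is not shown in this copy of the report; a plausible sketch, assuming the grouped frame `data_grp` from Step 10 (the name `data_pivot` is taken from Step 12, which uses it), with invented values standing in for the real grouped result:

```python
import pandas as pd

# Stand-in for the grouped result of Step 10 (values invented for illustration)
data_grp = pd.DataFrame({
    'drive-wheels': ['fwd', 'fwd', 'rwd'],
    'body-style':   ['sedan', 'hatchback', 'sedan'],
    'price':        [10000.0, 8500.0, 19000.0],
})

# Reshape: drive-wheels as rows, body-style as columns, mean price as cell values
data_pivot = data_grp.pivot(index='drive-wheels', columns='body-style')
print(data_pivot)
```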
Output:
Step 12: # Heatmap for visualizing data
plt.pcolor(data_pivot, cmap ='RdBu') #RedBlue
plt.colorbar()
plt.show()
Output:
Step 13: Obtaining the final result and showing it in the form of a graph. As the
slope is increasing in a positive direction, it is a positive linear relationship.
data_anova = data[['make', 'price']]
grouped_anova = data_anova.groupby(['make'])
anova_results = sp.stats.f_oneway(
    grouped_anova.get_group('honda')['price'],
    grouped_anova.get_group('subaru')['price']
)
print(anova_results)
sns.regplot(x='engine-size', y='price', data=data)
plt.ylim(0,)
Output:
AGGREGATION FUNCTIONS
Step 17: # to get the minimum values of all columns in the dataframe
df.aggregate(['min'])
Output:
Step 18: # to get the maximum values of all columns in the dataframe
df.aggregate(['max'])
Output:
Step 19: # fill the NaN values with '0'
df['price'] = df['price'].fillna(0)
df['price']
Output:
CONCLUSION
REFERENCE LINKS:
Dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Architecture Pictures:
1.https://encryptedtbn0.gstatic.com/images?q=tbn:ANd9GcT65v78AG4OzXQ1e3k1VuOi1zbJ_CqkCAh9XA&s
2. https://miro.medium.com/v2/resize:fit:1200/1*ftnM93QhlS0A7I55QegbrA.jpeg
THANK YOU
-APPSDC