
DATA ANALYSIS USING PYTHON
INTERNSHIP PROGRAMME

PROJECT NAME:
ANALYSIS OF SELLING PRICE OF USED CARS

TEAM NO: 2

TEAM LEADER:
BATTULA SUDHARSANA RAO -22NE1A4204

TEAM MEMBERS:
ACHI SIVA SANKAR -22NE1A4201
CHINNAM HARISH KUMAR -22NE1A4208
MARKAPURAM SURENDRA -22NE1A4241
ANALYSIS OF SELLING PRICE OF USED CARS USING PYTHON

ABSTRACT OF THE PROJECT

Nowadays, with technological advancement, techniques such as machine learning are being used on a large scale in many organisations. These models usually work with a set of predefined data points available in the form of datasets, which contain past information about a specific domain. Organising these data points before they are fed to the model is very important; this is where data analysis comes in. If the data fed to a machine-learning model is not well organised, the model gives false or undesired output, which can cause major losses to the organisation. Hence, proper data analysis is very important.
The data we are going to use in this example is about cars, specifically various data points about used cars, such as their price, colour, and so on. Here we need to understand that simply collecting data is not enough: raw data is not useful by itself. Data analysis plays a vital role in unlocking the information we require and in gaining new insights from this raw data.
Consider this scenario: our friend Otis wants to sell his car, but he does not know how much he should sell it for. He wants to maximise his profit, but he also wants to set a reasonable price for someone who would want to own it. So here, being data scientists, we can help our friend Otis.
Let's think like data scientists and clearly define some of his problems. For example: is there data on the prices of other cars and their characteristics? Which features of a car affect its price? Colour? Brand? Does horsepower also affect the selling price, or perhaps something else? As data analysts or data scientists, these are some of the questions we can start thinking about. To answer them, we are going to need some data, but this data is in raw form, so we need to analyse it first. The data is available in .csv/.data format.

SYSTEM REQUIREMENTS

1. Operating system:

Device name: LAPTOP-L63LSK90
Processor: 12th Gen Intel(R) Core(TM) i7-1255U, 1.70 GHz
Device ID: 85503FD3-F8F3-4038-AFCB-D11D73BE5358
Product ID: 00342-42668-74072-AAOEM
System type: 64-bit operating system, x64-based processor

2. Software Requirements:

The analysis of the selling price of used cars is implemented in the Python language using Jupyter Notebook.
3. Hardware Requirements:

1. Processor (CPU): a modern multi-core processor (e.g., Intel i5/i7) to handle data processing efficiently.
2. Memory (RAM): at least 8 GB, though 16 GB or more is recommended for larger datasets and more complex analyses.
3. Storage: an SSD (solid-state drive) for faster data access; aim for at least 256 GB, but 512 GB or more is preferable if you are working with large datasets. External storage is optional, for backup or additional storage needs.
4. Network: a reliable internet connection if your analysis involves accessing or downloading data from online sources.

ARCHITECTURE OF PROJECT
Fig 1: Price prediction of used cars

Fig 2: Assumption of selling cars in real-time


USED LIBRARIES IN THE PROJECT

1. numpy
2. pandas
3. seaborn
4. scipy
5. matplotlib

About Libraries
1. numpy
NumPy is a widely-used library in Python, essential for many scientific, mathematical,
and engineering applications. Here are some key uses and points about NumPy:

Uses of NumPy

1. Numerical Data Analysis

Efficient Data Storage: NumPy's ndarray objects store large datasets efficiently, using
less memory and allowing faster operations than traditional Python lists.

Statistical Analysis: Provides functions for computing statistical metrics such as mean,
median, variance, and standard deviation.

2. Mathematical and Logical Operations

Element-wise Operations: Supports element-wise addition, subtraction, multiplication, division, and logical operations on arrays.

Matrix Operations: Includes functions for matrix multiplication, inversion, determinant calculation, and solving linear equations.

3. Data Preprocessing
Data Cleaning: Facilitates operations like replacing missing values, normalizing data,
and removing outliers.

Reshaping and Resizing: Allows for reshaping arrays, adding or removing dimensions,
and resizing arrays without losing data.

4. Scientific Computing

Fourier Transforms: Provides functions to compute Fast Fourier Transforms (FFT) for signal processing.

Linear Algebra: Includes advanced linear algebra functions such as eigenvalues, eigenvectors, and singular value decomposition (SVD).

5. Machine Learning and Artificial Intelligence

Data Preparation: Often used to prepare and manipulate datasets before feeding them
into machine learning models.

Integration with Libraries: Forms the backbone of many machine learning libraries
like TensorFlow and Scikit-Learn.

6. Graphics and Plotting

Support for Plotting Libraries: Works seamlessly with plotting libraries like
Matplotlib to create visualizations of data.

Image Processing: Can be used to manipulate and process image data efficiently.

Key Points about NumPy

1. Array Object (ndarray)

Multidimensional: Supports arrays of arbitrary dimensions.

Homogeneous Data: Elements must be of the same data type, which ensures efficient
storage and computation.

2. Performance

Speed: Much faster than native Python lists due to optimized C and Fortran code under
the hood.

Memory Efficiency: Uses contiguous memory blocks, leading to efficient memory usage and fast access.
3. Broadcasting

Flexible Operations: Allows for operations on arrays of different shapes without the
need for explicit loops.

Simplifies Code: Makes code more concise and easier to read.

4. Integration

Interoperability: Can be used with other scientific libraries like SciPy, Pandas, and
Matplotlib.

Low-level Languages: Supports integration with C, C++, and Fortran code for high-
performance tasks.
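To make these points concrete, here is a minimal sketch (with illustrative values, not part of the original project code) showing efficient arrays, element-wise operations, broadcasting, and basic statistics:

import numpy as np

prices = np.array([13495.0, 16500.0, 13950.0, 17450.0])  # sample car prices
discounted = prices * 0.95        # element-wise operation: 5% discount on every price
centred = prices - prices.mean()  # broadcasting: subtract a scalar from each element
print(prices.mean(), np.median(prices), prices.std())  # basic statistics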

2. pandas
Pandas is a powerful and flexible open-source data analysis and manipulation library
for Python. It is particularly well-suited for working with structured data (like spreadsheets or
SQL tables). Here are some key points and uses of Pandas:

Key Points

1. Data Structures: Pandas introduces two primary data structures: Series (1-
dimensional) and DataFrame (2-dimensional). These structures are built on top of
NumPy arrays and are highly optimized for performance.
2. Data Cleaning: Pandas provides extensive tools for cleaning data, including handling
missing values, filtering out unwanted data, and transforming data formats.
3. Data Manipulation: It supports a variety of operations such as merging, joining,
reshaping, and pivoting datasets, making it easy to manipulate data to fit the desired
structure.
4. Data Analysis: Pandas offers a suite of analytical tools, including groupby operations,
statistical functions, and time-series functionality.
5. Integration: It integrates seamlessly with other data analysis and machine learning
libraries such as NumPy, SciPy, Matplotlib, and scikit-learn.
6. Input and Output: Pandas can read from and write to a variety of data formats,
including CSV, Excel, SQL databases, JSON, and more.
7. Performance: Built on top of NumPy, which provides fast operations on large arrays, Pandas is itself optimized for performance and can handle large datasets efficiently.

Uses

1. Data Cleaning and Preparation

Pandas is widely used for cleaning and preparing data for analysis. It can handle missing values, detect and remove outliers, and perform transformations to make data suitable for analysis.

2. Data Exploration and Analysis

Pandas provides tools to explore data, calculate statistics, and visualize relationships within the data.

3. Time Series Analysis

Pandas is excellent for time series data, with tools to handle date and time
data, perform resampling, and calculate rolling statistics.

4. Data Wrangling and Transformation

Pandas makes it easy to reshape and transform data for analysis, including
pivot tables, melting, and stacking/unstacking data.

5. Integration with Other Libraries

Pandas works well with other Python libraries for data analysis,
visualization, and machine learning.
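To illustrate these uses, a minimal sketch with made-up values (not the project dataset):

import pandas as pd
import numpy as np

# A small DataFrame resembling the used-car data
cars = pd.DataFrame({'make': ['honda', 'subaru', 'honda', 'bmw'],
                     'price': [7295.0, np.nan, 7895.0, 16430.0]})
# Data cleaning: fill the missing price with the column mean
cars['price'] = cars['price'].fillna(cars['price'].mean())
# Data analysis: average price per make via groupby
print(cars.groupby('make', as_index=False)['price'].mean())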

3. seaborn
Seaborn is a powerful and user-friendly data visualization library built on top of
Matplotlib in Python. It provides a high-level interface for drawing attractive and informative
statistical graphics. Here are some key points and uses of Seaborn:

Key Points

1. High-Level Interface: Seaborn simplifies the process of creating complex visualizations with just a few lines of code, thanks to its high-level interface for drawing attractive statistical graphics.
2. Built on Matplotlib: While Seaborn builds on Matplotlib, it provides more
aesthetically pleasing default settings and simplifies many aspects of creating and
customizing plots.
3. Integration with Pandas: Seaborn works seamlessly with Pandas DataFrames, making
it easy to visualize data stored in these structures.
4. Themes and Styles: Seaborn offers a variety of themes and color palettes to make
visualizations more appealing and informative.
5. Statistical Plotting: It includes many built-in statistical plots, such as bar plots, box
plots, violin plots, and more, with tools to add statistical annotations.
6. Complex Visualizations: Seaborn simplifies the creation of complex visualizations like
heatmaps, pair plots, and facet grids, which are otherwise cumbersome with Matplotlib.

Uses

1. Visualizing Distributions

Seaborn provides several functions to visualize distributions of data, including histograms, KDE plots, and more.

2. Visualizing Relationships

Seaborn makes it easy to visualize relationships between variables with scatter plots,
line plots, and more.

3. Categorical Data Visualization

Seaborn provides several functions to visualize categorical data, such as bar plots, count
plots, box plots, and violin plots.

4. Statistical Estimations

Seaborn has tools for statistical estimation, such as linear regression plots and error
bars.

5. Multivariate Data Visualization

Seaborn can handle multivariate data visualization with tools like pair plots and
heatmaps.

6. Facet Grids

Facet grids are useful for visualizing data across multiple subplots, enabling the
examination of relationships conditioned on categorical variables.
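A minimal sketch of two of these plot types, using made-up values rather than the project dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

demo = pd.DataFrame({'fuel-type': ['gas', 'gas', 'diesel', 'diesel', 'gas'],
                     'price': [13495, 16500, 13950, 17450, 15250]})
sns.boxplot(x='fuel-type', y='price', data=demo)  # categorical: price by fuel type
plt.show()
sns.histplot(demo['price'], kde=True)  # distribution: histogram with a KDE overlay
plt.show()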
4. scipy
Scipy (pronounced "sigh-pie") is a powerful Python library used in scientific and technical
computing. It builds on top of NumPy and provides a wide range of functions that operate on
NumPy arrays and are useful for different scientific disciplines. Here are some key uses and
points about Scipy:
1. Integration and Optimization: Scipy offers functions for numerical integration
(scipy.integrate) and optimization (scipy.optimize). These are essential for solving
complex mathematical problems, such as finding the minimum or maximum of a
function or integrating differential equations.
2. Linear Algebra: Scipy includes modules for linear algebra (scipy.linalg) that provide
optimized routines for matrix operations, including solving linear systems, eigenvalue
problems, and matrix decompositions (e.g., LU, QR, SVD).
3. Statistics: The scipy.stats module offers a wide range of statistical functions and tests,
including probability distributions (PDF, CDF), statistical tests (t-tests, ANOVA), and
descriptive statistics (mean, median, variance).
4. Signal Processing: For signal processing tasks, Scipy (scipy.signal) provides functions
for filtering, spectral analysis, and interpolation, which are crucial in fields such as
digital signal processing and telecommunications.
5. Image Processing: Scipy's scipy.ndimage module provides functions for multi-
dimensional image processing. This includes tasks like filtering, interpolation, and
measurements on n-dimensional image data.
6. Sparse Matrix Operations: Scipy supports sparse matrices (scipy.sparse), which are
memory-efficient representations of large matrices with many zero elements. It provides
functions for sparse matrix manipulation, arithmetic, and solving linear systems.
7. Spatial and Distance Operations: The scipy.spatial module offers spatial algorithms
and data structures, including distance computations, spatial transformations, and
clustering algorithms.
8. Special Functions: Scipy includes a scipy.special module that provides access to many
special functions of mathematical physics, such as Bessel, gamma, and hypergeometric
functions.
9. File I/O and Interoperability: Scipy provides utilities (scipy.io) for reading and
writing various file formats, including MATLAB files (.mat), NetCDF, and more. This
makes it easier to work with data from different sources.
10. Integration with NumPy: Scipy seamlessly integrates with NumPy arrays, allowing
for efficient computation and manipulation of large datasets typical in scientific
computing.
Scipy is invaluable for scientific computing tasks in Python, offering a rich set of
functionalities that cover numerical integration, optimization, linear algebra, statistics, signal
processing, image processing, and more. Its extensive capabilities make it a go-to library for
researchers, engineers, and data scientists working on complex mathematical and scientific
problems.
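A minimal sketch of the scipy.stats module, foreshadowing the ANOVA test used later in the project (the price values here are made up):

from scipy import stats

# One-way ANOVA: do two groups of prices share the same mean?
group_a = [7295, 7895, 9095, 8845]
group_b = [16430, 16925, 20970, 21105]
f_stat, p_value = stats.f_oneway(group_a, group_b)
print(f_stat, p_value)  # a small p-value suggests the group means differ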
5. matplotlib
Matplotlib is a widely-used Python library for creating static, animated, and interactive
visualizations. It provides a high-level interface for drawing a variety of plots and charts,
making it essential for data analysis, exploration, and presentation. Here are some key uses
and points about Matplotlib:
1. Plotting Functions: Matplotlib's pyplot module offers a MATLAB-like interface for
creating a variety of plots such as line plots, scatter plots, bar plots, histograms, pie
charts, and more. It allows users to customize virtually every aspect of the plot,
including colors, labels, markers, and annotations.
2. Publication Quality: Matplotlib is designed to produce high-quality graphics suitable
for publication and presentations. It provides fine-grained control over every visual
aspect of a plot, ensuring that plots meet specific formatting requirements.
3. Wide Adoption and Community: Being one of the oldest and most established
plotting libraries in Python, Matplotlib has a large user base and an active community.
This results in extensive documentation, tutorials, and examples available online,
making it easier for users to learn and troubleshoot.
4. Integration with Jupyter Notebooks: Matplotlib integrates seamlessly with Jupyter
Notebooks, allowing users to create interactive plots directly within their notebooks.
This is particularly useful for data exploration and interactive data analysis tasks.
5. Support for Various Output Formats: Matplotlib can generate plots in various
formats, including PNG, PDF, SVG, and more. This flexibility allows users to integrate
plots into different types of documents and presentations.
6. Customizability: Matplotlib offers extensive customization options through its object-
oriented interface. Users can manipulate individual elements of a plot (e.g., axes, lines,
markers) directly, making it possible to create complex and tailored visualizations.
7. Multi-platform Support: Matplotlib is compatible with different operating systems
(Windows, macOS, Linux) and can be used with multiple Python environments,
including standard Python distributions and scientific computing platforms like
Anaconda.
8. Layered Architecture: Matplotlib follows a layered architecture, allowing users to
construct plots by building layers of components such as axes, labels, legends, and
annotations. This modular approach provides flexibility in creating complex
visualizations.
9. Extensibility: Matplotlib's architecture supports extensions and plugins, enabling the
creation of specialized plots and custom visualizations. Users can leverage additional
libraries built on top of Matplotlib, such as Seaborn (for statistical data visualization)
and mplfinance (for financial data visualization).
10. Matplotlib and Pandas Integration: Matplotlib integrates well with Pandas, a popular
data manipulation and analysis library in Python. Users can directly plot Pandas
DataFrame and Series objects using Matplotlib's plotting functions, simplifying the
visualization of data stored in Pandas structures.
In summary, Matplotlib is a versatile and powerful library for creating a wide range of static
and interactive visualizations in Python. Its rich feature set, customization options, and broad
community support make it an essential tool for data scientists, researchers, engineers, and
anyone involved in data visualization and presentation.
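A minimal sketch of the pyplot interface and output-format support described above (the data values are illustrative):

import matplotlib.pyplot as plt

engine_size = [97, 109, 130, 152, 183]
price = [7295, 9095, 13495, 16500, 23875]
plt.scatter(engine_size, price)      # scatter plot
plt.title('Engine size vs Price')    # title annotation
plt.xlabel('Engine size')
plt.ylabel('Price')
plt.savefig('engine_vs_price.png')   # save the figure as a PNG file
plt.show()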
Steps for installing these packages:
1. If you are using Anaconda (Jupyter/Spyder) or any other third-party software to write your Python code, make sure to set the path to that software's "Scripts" folder in the command prompt of your PC.
2. Then type: pip install package-name
Example:

pip install numpy

3. After the installation is done (make sure you are connected to the internet!), open your IDE and import those packages.

4. To import, type: import package-name

Example:

import numpy

DATA SET OF THE USED CARS


Reference link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

PROJECT CODE
Step 1: Import the modules needed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
import scipy.stats  # loads the stats submodule so sp.stats works in Step 13

Step 2: Let’s check the first five entries of dataset.

df = pd.read_csv('data.csv')
df.head(5)
Output:
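Note: the raw imports-85.data file has no header row, so if you read that file directly, passing header=None (an option of pd.read_csv) prevents the first record from being consumed as column names; Step 3 then assigns the proper headers:

df = pd.read_csv('imports-85.data', header=None)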

Step 3: Defining headers for our dataset.

headers = ["symboling", "normalized-losses", "make",
           "fuel-type", "aspiration", "num-of-doors",
           "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight",
           "engine-type", "num-of-cylinders", "engine-size",
           "fuel-system", "bore", "stroke", "compression-ratio",
           "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]

df.columns = headers
df.head()

Output:

Step 4: Finding the missing values, if any.
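The code for this step is not shown in the document; a minimal sketch, assuming the '?' placeholders used by the imports-85 dataset mark missing entries:

df = df.replace('?', np.nan)  # the raw dataset marks missing values with '?'
df.isnull().sum()             # count the missing values in each column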


Output:

Step 5: Print the columns.

print(df.columns)

Output:

Step 6: Check the unique values in the dataset.

data = df
data.isna().any()
data.isnull().any()

data.price.unique()
Output:

Step 7: Normalizing values using the simple feature-scaling method (shown for a few columns; do the same for the rest), and binning (grouping values).

data['length'] = data['length']/data['length'].max()
data['width'] = data['width']/data['width'].max()
data['height'] = data['height']/data['height'].max()

bins = np.linspace(min(data['price']), max(data['price']), 4)
group_names = ['Low', 'Medium', 'High']
data['price-binned'] = pd.cut(data['price'], bins,
                              labels=group_names,
                              include_lowest=True)
print(data['price-binned'])
plt.hist(data['price-binned'])
plt.show()
Output:

Step 8: Descriptive analysis of the data, and converting categorical values to numerical ones.

pd.get_dummies(data['fuel-type']).head()
data.describe()

Output:
Step 9: Plotting the data: the price distribution, price by drive wheels, and price against engine size.

plt.boxplot(data['price'])
sns.boxplot(x='drive-wheels', y='price', data=data)

plt.scatter(data['engine-size'], data['price'])
plt.title('Scatterplot of Engine size vs Price')
plt.xlabel('Engine size')
plt.ylabel('Price')
plt.grid()
plt.show()

Output:

Step 10: Grouping the data according to drive wheels, body style, and price.

test = data[['drive-wheels', 'body-style', 'price']]
data_grp = test.groupby(['drive-wheels', 'body-style'],
                        as_index=False).mean()
data_grp

Output:

Step 11: Using the pivot method; the heatmap in the next step is plotted from the data obtained by the pivot method.

data_pivot = data_grp.pivot(index='drive-wheels',
                            columns='body-style')
data_pivot

Output:
Step 12: # heatmap for visualising the data

plt.pcolor(data_pivot, cmap='RdBu')  # red-blue colormap
plt.colorbar()
plt.show()

Output:

Step 13: Obtaining the final result (a one-way ANOVA across makes) and showing the engine-size/price relationship as a graph. As the slope increases in a positive direction, there is a positive linear relationship.

data_anova = data[['make', 'price']]
grouped_anova = data_anova.groupby(['make'])
anova_results = sp.stats.f_oneway(
    grouped_anova.get_group('honda')['price'],
    grouped_anova.get_group('subaru')['price']
)
print(anova_results)

sns.regplot(x='engine-size', y='price', data=data)
plt.ylim(0,)

Output:

Step 14: Print the dataset and apply basic operations.

print(df)
Output:

Step 15: Print one column of the data frame.

df['price']
Output:

AGGREGATION FUNCTIONS

Step 16: # get the count of one column

df['price'].count()

Output:
204
Step 17: # get the minimum values in the dataframe from all columns

df.aggregate(['min'])
Output:

Step 18: # get the maximum values in the dataframe from all columns

df.aggregate(['max'])
Output:
Step 19: # fill the NaN values with 0

df['price'] = df['price'].fillna(0)
df['price']

Output:

Step 20: # drop columns containing missing values using dropna()

df = df.dropna(axis=1)
df
Output:

Step 21: # reset the index using reset_index()

df = df.reset_index(drop=True)
df
Output:

ADVANTAGES OF ANALYSIS OF SELLING PRICE OF USED CARS USING PYTHON
For Sellers:
1. Market Awareness: Understanding the market trends and the current
selling prices of similar used cars helps sellers set a competitive and realistic
price.
2. Maximizing Profits: By analyzing the data, sellers can identify the best
times to sell and the most lucrative price points, ensuring they maximize
their return on investment.
3. Identifying Value-Added Features: Sellers can identify which features or
conditions (e.g., low mileage, service history, recent upgrades) add the most
value, allowing them to highlight these aspects in their listings.
4. Strategic Pricing: Analysis can reveal patterns such as the impact of
seasonal demand fluctuations, helping sellers price their cars more
strategically.
5. Faster Sales: Setting an appropriate price based on market analysis can lead
to quicker sales, reducing the time the car spends on the market.
For Buyers:
1. Fair Pricing: Buyers can ensure they are paying a fair price by comparing
the asking price with the analyzed market data of similar vehicles.
2. Budget Planning: With insights into price trends, buyers can plan their
purchases better, knowing when prices might drop or increase.
3. Avoiding Overpayment: Analysis helps buyers avoid overpaying for a car
by providing a clear understanding of its market value.
4. Feature Prioritization: Buyers can identify which features justify higher
prices, helping them prioritize what is important to them in a used car.
5. Negotiation Power: Knowledge of market prices empowers buyers to
negotiate better deals with sellers.
General Advantages:
1. Transparency: Analyzing selling prices contributes to a more transparent
marketplace, where both buyers and sellers have access to relevant
information.
2. Informed Decisions: Data-driven decisions help in making more informed
and confident buying or selling choices.
3. Trend Analysis: Long-term analysis can reveal trends and shifts in
consumer preferences, helping dealers and manufacturers adjust their
strategies accordingly.
4. Market Efficiency: Better price analysis leads to a more efficient market
where cars are bought and sold at prices reflective of their true market value.
Tools and Methods for Analysis:
1. Data Collection: Collecting data from online marketplaces, dealerships, and
auctions to understand the price ranges.
2. Statistical Analysis: Using statistical tools to analyze the distribution of
prices and identify outliers.
3. Comparative Analysis: Comparing prices of similar cars based on make,
model, year, mileage, and condition.
4. Machine Learning: Implementing machine learning algorithms to predict prices based on historical data and market trends (a minimal sketch follows this list).
5. Visualization: Creating visualizations such as graphs and charts to easily
interpret the data and identify patterns.
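As a minimal sketch of point 4 above (machine learning is mentioned in the abstract but not implemented in the project code), a hypothetical linear-regression price predictor built with scikit-learn; the column names follow the project's headers:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumes df has been cleaned as in the steps above
model_df = df[['engine-size', 'horsepower', 'price']].apply(
    pd.to_numeric, errors='coerce').dropna()
X = model_df[['engine-size', 'horsepower']]   # features
y = model_df['price']                         # target
model = LinearRegression().fit(X, y)
# Predict the price of a hypothetical car
print(model.predict(pd.DataFrame({'engine-size': [130],
                                  'horsepower': [111]})))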

CONCLUSION

The conclusion of a project, particularly one analyzing the selling price of used cars, should summarize the key findings, reflect on the process, and suggest any implications or future work. Here's a structured approach to crafting a comprehensive conclusion:
1. Summary of Findings
1. Key Insights: Recap the primary insights gained from the analysis. For
instance, highlight any patterns or trends in the selling prices of used cars,
such as the impact of factors like make, model, age, mileage, or condition.
2. Statistical Results: Summarize any significant statistical results or models
used, such as regression coefficients or correlation values.
2. Implications
1. Market Trends: Discuss what the findings mean for the used car market.
For example, if certain brands or models consistently fetch higher prices, this
might indicate strong brand value or demand.
2. Consumer Insights: Explain how the results could inform buyers and
sellers. For example, buyers might use this information to identify
undervalued cars, while sellers could adjust their pricing strategies based on
the analysis.
3. Challenges and Limitations
1. Data Limitations: Acknowledge any limitations in the data, such as
incomplete records, potential biases, or external factors not accounted for.
2. Methodological Constraints: Discuss any limitations related to the analysis
methods used, such as assumptions made in statistical models or potential
errors.

REFERENCE LINKS:

Dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

Architecture Pictures:
1. https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT65v78AG4OzXQ1e3k1VuOi1zbJ_CqkCAh9XA&s
2. https://miro.medium.com/v2/resize:fit:1200/1*ftnM93QhlS0A7I55QegbrA.jpeg
THANK YOU
-APPSDC
