Lab 3 by Dr. Hoora Fakhrmoosavy
Exploratory Data Analysis of the US Cars Dataset
The US Cars Dataset contains data scraped from an online North
American car auction. It covers 28 car brands for sale
in the US. In this lab, we will perform exploratory data analysis on the US
Cars Dataset.
First, let’s import the Pandas and NumPy libraries (NumPy is used later for sorting, missing values, and logarithms):
import pandas as pd
import numpy as np
Next, let’s remove the default display limits for Pandas data frames:
pd.set_option('display.max_columns', None)
Now, let’s read the data into a data frame:
df = pd.read_csv("USA_cars_datasets.csv")
Let’s print the list of columns in the data:
print(list(df.columns))
Let’s find the unique values in the brand and year columns:
df.brand.unique()
years = df.year.unique()
Let's sort the years:
np.sort(years)
We can also take a look at the number of rows in the data:
print("Number of rows: ", len(df))
Next, let’s print the first five rows of data:
print(df.head())
We can also get summary statistics for the numerical columns:
df.describe()
Now, let’s look at the brands of white cars:
df_d1 = df[df['color'] =='white']
print(set(df_d1['brand']))
We can also look at the most common brands for white cars:
from collections import Counter
print(dict(Counter(df_d1['brand']).most_common(5)))
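The same count can be obtained with pandas alone. The sketch below uses a small made-up data frame (not the real dataset) to show `value_counts()`, which returns the counts already sorted from most to least frequent:

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "brand": ["ford", "ford", "dodge", "nissan", "ford", "dodge"],
    "color": ["white", "white", "white", "black", "white", "white"],
})

# Keep only the white cars, then count how often each brand appears.
white = toy[toy["color"] == "white"]
top_white_brands = white["brand"].value_counts().head(5)
print(top_white_brands.to_dict())
```

On the real data, replacing `toy` with `df` should give the same result as the `Counter` version.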
Dealing with missing values (each missing entry is replaced with the column mean):
df['mileage'] = df['mileage'].fillna(df['mileage'].mean())
df['year'] = df['year'].fillna(df['year'].mean())
df.info()
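To confirm that mean imputation did what was intended, it can help to count the missing values before and after. The following is a minimal sketch on a made-up data frame; the column names match the dataset, but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up).
toy = pd.DataFrame({
    "mileage": [10000.0, np.nan, 30000.0],
    "year": [2015.0, 2017.0, np.nan],
})

# Count missing values per column before filling.
missing_before = toy.isna().sum()

# Replace each missing entry with the mean of its own column.
for col in ["mileage", "year"]:
    toy[col] = toy[col].fillna(toy[col].mean())

# Count again; every column should now report zero.
missing_after = toy.isna().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

Note that the mean of a year column is generally not a whole number, so the median may be a reasonable alternative for that column.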
Let's begin by importing matplotlib.pyplot and seaborn.
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (8, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Let’s find popular models:
import plotly.express as px
models_df = df.dropna(subset = [ 'model'])
fig = px.treemap(models_df, path=['model'], title='Most Popular Models')
fig.show()
Relationship between Car's Release Year and Price
A better way to study this relationship is to consider the age of the car rather than the year it was released.
Let's add a column to the data frame for the car's age, computed as the current year (from Python's datetime
library) minus the release year.
import datetime
df['age'] = datetime.datetime.now().year - df['year']
sns.scatterplot(x=df.age, y=df.price, s=40);
Adding Log Price Column:
df['Log Price'] = np.log(df['price'])
sns.scatterplot(x=df.age, y=df['Log Price'], s=40);
On the logarithmic scale, the visualization becomes much clearer than before, and the inverse relationship
between age and price is more obvious.
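One caveat with the log transform: `np.log` returns `-inf` for a price of 0, which would distort the plot. Whether this dataset contains zero-priced listings is an assumption to check, not something stated here. A minimal sketch on made-up data that filters out non-positive prices first:

```python
import numpy as np
import pandas as pd

# Toy data including one zero-priced listing (values are made up).
toy = pd.DataFrame({"price": [0, 100, 1000, 10000], "age": [12, 9, 5, 1]})

# Keep only rows with a positive price so the logarithm is finite.
positive = toy[toy["price"] > 0].copy()
positive["Log Price"] = np.log(positive["price"])

print(len(toy), len(positive))
```

The `.copy()` call avoids modifying a view of the original data frame when the new column is added.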
Popularity based on Model:
models= df.groupby('model')['model'].count()
models = pd.DataFrame(models)
models.columns = ['models Counts']
models.sort_values(by=['models Counts'], inplace=True, ascending=False)
models = models.head(5)
models.plot.bar();
plt.title('Preferred models')
plt.xlabel('models')
plt.ylabel('No. of Cars');
Finding Top brands in our database:
topbrands= df.groupby('brand')['brand'].count()
topbrands = pd.DataFrame(topbrands)
topbrands.columns = ['Top Brands']
topbrands.sort_values(by=['Top Brands'], inplace=True, ascending=False)
topbrands = topbrands.head(10)
topbrands.plot.bar();
plt.title('Famous Brands')
plt.xlabel('Brands')
plt.ylabel('No. of Cars');
Most Expensive Car Brands:
expensive= df.groupby('brand')['price'].mean()
expensive = pd.DataFrame(expensive)
expensive.columns = ['Average Prices']
expensive.sort_values(by=['Average Prices'], inplace=True, ascending=False)
expensive = expensive.head(10)
expensive.plot.bar();
plt.title('Expensive Brands')
plt.xlabel('Car Brands')
plt.ylabel('Average Price');
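The group, aggregate, sort, and take-the-top-N pattern used in the last three sections can also be written as one chain with `nlargest`. This is a sketch on made-up data with placeholder brand names, offered as an alternative rather than a replacement:

```python
import pandas as pd

# Toy data with placeholder brand names (values are made up).
toy = pd.DataFrame({
    "brand": ["a", "a", "b", "c", "c"],
    "price": [10, 20, 50, 5, 15],
})

# Average price per brand, keeping the 2 highest averages in descending order.
avg = toy.groupby("brand")["price"].mean().nlargest(2)
print(avg.to_dict())
```

The result is a Series indexed by brand, so `avg.plot.bar()` would draw the same kind of chart as above.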
Let’s look at the distribution of prices between 1,000 and 5,000:
cars_price_df = df[(df.price > 1000) & (df.price < 5000)]
plt.title('Distribution of Price')
plt.hist(cars_price_df.price, bins=np.arange(1000, 5001, 500));
plt.xlabel('Price')
plt.ylabel('No. of Samples')
plt.xlim(1000, 5000);
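When building histogram bins with `np.arange`, keep in mind that the stop value is excluded, so the last bin edge depends on where the range stops. A quick check of the edges:

```python
import numpy as np

# Stop value 5000 is excluded, so the last edge is 4500.
edges_short = np.arange(1000, 5000, 500)

# Stopping just past 5000 includes 5000 as the final edge.
edges_full = np.arange(1000, 5001, 500)

print(edges_short[-1], edges_full[-1])
```

With edges ending at 4500, prices between 4500 and 5000 fall outside every bin and are not drawn.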
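Let’s plot the distribution of price over the whole dataset (histplot replaces the deprecated distplot in recent seaborn versions):
plt.figure(figsize=(10,6))
sns.histplot(df['price'], kde=True).set_title('Distribution of Car Prices')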
Finally, let’s create a boxplot of ‘price’ for the 5 most commonly occurring
‘brand’ categories:
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns

def get_boxplot_of_categories(data_frame, categorical_column, numerical_column, limit):
    # Find the `limit` most frequent values of the categorical column.
    counts = Counter(data_frame[categorical_column].values).most_common(limit)
    keys = [category for category, _ in counts]
    print(keys)
    # Keep only the rows that belong to those categories.
    df_new = data_frame[data_frame[categorical_column].isin(keys)]
    sns.set()
    sns.boxplot(x=df_new[categorical_column], y=df_new[numerical_column])
    plt.show()

get_boxplot_of_categories(df, 'brand', 'price', 5)
We can also draw the boxplot for all brands:
plt.figure(figsize=(12,8))
sns.set(style='darkgrid')
sns.boxplot(x='brand', y='price', data=df).set_title("Price Distribution of Different Brands")
Please write code to answer these questions.
Question 1: Find the price distribution of the top 3 brands in the database.
Question 2: What is the average price of Nissan, BMW, and Ford cars, or
of the 3 most popular car brands in the database?
Question 3: Among release years after 2000, which years have the cheapest
cars (on average) in the database?
Question 4: Which brand's cars have covered the most mileage on the roads?
Question 5: Which state has the most registered Mercedes cars?