
GmPrac1 - Jupyter Notebook

The document describes the loading and initial exploration of a car dataset using pandas in Python. It includes operations such as reading the CSV file, checking for null values, calculating averages for specific columns, and creating new features based on existing data. The dataset consists of 205 entries with 26 columns, detailing various attributes of cars.

Uploaded by

azaanahrmad

In [1]: import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]: df = pd.read_csv("autodata.csv")

In [3]: df.head(5)

Out[3]:
   symboling  normalized-losses         make  fuel-type  aspiration  num-of-doors   body-style  drive-wheels  engine-location  wheel-base  ...
0          3              122.0  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...
1          3              122.0  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...
2          1              122.0  alfa-romero        gas         std           two    hatchback           rwd            front        94.5  ...
3          2              164.0         audi        gas         std          four        sedan           fwd            front        99.8  ...
4          2              164.0         audi        gas         std          four        sedan           4wd            front        99.4  ...

5 rows × 26 columns
 

In [4]: df.tail(5)

Out[4]:
     symboling  normalized-losses   make  fuel-type  aspiration  num-of-doors  body-style  drive-wheels  engine-location  wheel-base  ...
200         -1               95.0  volvo        gas         std          four       sedan           rwd            front       109.1  ...
201         -1               95.0  volvo        gas       turbo          four       sedan           rwd            front       109.1  ...
202         -1               95.0  volvo        gas         std          four       sedan           rwd            front       109.1  ...
203         -1               95.0  volvo     diesel       turbo          four       sedan           rwd            front       109.1  ...
204         -1               95.0  volvo        gas       turbo          four       sedan           rwd            front       109.1  ...

5 rows × 26 columns
 
In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 205 non-null int64
1 normalized-losses 205 non-null float64
2 make 205 non-null object
3 fuel-type 205 non-null object
4 aspiration 205 non-null object
5 num-of-doors 203 non-null object
6 body-style 205 non-null object
7 drive-wheels 205 non-null object
8 engine-location 205 non-null object
9 wheel-base 205 non-null float64
10 length 205 non-null float64
11 width 205 non-null float64
12 height 205 non-null float64
13 curb-weight 205 non-null int64
14 engine-type 205 non-null object
15 num-of-cylinders 205 non-null object
16 engine-size 205 non-null int64
17 fuel-system 205 non-null object
18 bore 205 non-null float64
19 stroke 205 non-null float64
20 compression-ratio 205 non-null float64
21 horsepower 205 non-null float64
22 peak-rpm 205 non-null float64
23 city-mpg 205 non-null int64
24 highway-mpg 205 non-null int64
25 price 205 non-null float64
dtypes: float64(11), int64(5), object(10)
memory usage: 41.8+ KB

In [6]: df.describe()

Out[6]:
        symboling  normalized-losses  wheel-base      length       width      height  curb-weight
count  205.000000         205.000000  205.000000  205.000000  205.000000  205.000000   205.000000
mean     0.834146         122.000000   98.756585  174.049268   65.907805   53.724878  2555.565854
std      1.245307          31.681008    6.021776   12.337289    2.145204    2.443522   520.680204
min     -2.000000          65.000000   86.600000  141.100000   60.300000   47.800000  1488.000000
25%      0.000000         101.000000   94.500000  166.300000   64.100000   52.000000  2145.000000
50%      1.000000         122.000000   97.000000  173.200000   65.500000   54.100000  2414.000000
75%      2.000000         137.000000  102.400000  183.100000   66.900000   55.500000  2935.000000
max      3.000000         256.000000  120.900000  208.100000   72.300000   59.800000  4066.000000

 
In [7]: df.isnull()

Out[7]:
     symboling  normalized-losses   make  fuel-type  aspiration  num-of-doors  body-style  drive-wheels  engine-location  wheel-base  ...
0        False              False  False      False       False         False       False         False            False       False  ...
1        False              False  False      False       False         False       False         False            False       False  ...
2        False              False  False      False       False         False       False         False            False       False  ...
3        False              False  False      False       False         False       False         False            False       False  ...
4        False              False  False      False       False         False       False         False            False       False  ...
...        ...                ...    ...        ...         ...           ...         ...           ...              ...         ...  ...
200      False              False  False      False       False         False       False         False            False       False  ...
201      False              False  False      False       False         False       False         False            False       False  ...
202      False              False  False      False       False         False       False         False            False       False  ...
203      False              False  False      False       False         False       False         False            False       False  ...
204      False              False  False      False       False         False       False         False            False       False  ...

205 rows × 26 columns


 

In [9]: df.notnull().sum()

Out[9]: symboling 205
normalized-losses 205
make 205
fuel-type 205
aspiration 205
num-of-doors 203
body-style 205
drive-wheels 205
engine-location 205
wheel-base 205
length 205
width 205
height 205
curb-weight 205
engine-type 205
num-of-cylinders 205
engine-size 205
fuel-system 205
bore 205
stroke 205
compression-ratio 205
horsepower 205
peak-rpm 205
city-mpg 205
highway-mpg 205
price 205
dtype: int64
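
In the output above every column reports 205 non-null entries except num-of-doors, which reports 203. Counting the missing entries directly, and keeping only the columns that have any, is usually easier to read than the full boolean table from df.isnull(). The following is a minimal sketch on a toy frame; the values in it are hypothetical and are not taken from autodata.csv.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry in "num-of-doors".
toy = pd.DataFrame({
    "make": ["audi", "volvo", "mazda"],
    "num-of-doors": ["four", np.nan, "two"],
})

# isnull().sum() counts the missing entries in each column.
missing = toy.isnull().sum()

# Keep only the columns that actually contain gaps.
columns_with_gaps = missing[missing > 0]
print(columns_with_gaps)
```

Applied to the notebook's df, the same two lines would be expected to list only num-of-doors, given the counts shown by df.info().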
In [10]: # calculate the mean value of the "stroke" column
avg_stroke = df["stroke"].astype("float").mean(axis=0)
print("Average of stroke:", avg_stroke)
# replace NaN with the mean value in the "stroke" column
df["stroke"] = df["stroke"].replace(np.nan, avg_stroke)

Average of stroke: 3.2554228855721337
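
Mean imputation, as used for the stroke column, can be sketched on a toy column as follows. The values are hypothetical. The sketch assigns the result back to the column instead of calling a method with inplace=True on a selected column, because recent pandas versions warn that the latter pattern may not update the original frame.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one missing entry.
toy = pd.DataFrame({"stroke": [3.0, np.nan, 3.5]})

# mean() skips NaN by default, so this is (3.0 + 3.5) / 2.
avg = toy["stroke"].mean()

# Assign the filled column back to the frame.
toy["stroke"] = toy["stroke"].fillna(avg)
print(toy)
```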

In [11]: avg_hp = df["horsepower"].astype("float").mean(axis=0)
print("Average of horsepower:", avg_hp)

Average of horsepower: 104.25615763546797

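In [12]: # replace NaN in "peak-rpm" with that column's own mean (not the horsepower mean)
df["peak-rpm"] = df["peak-rpm"].replace(np.nan, df["peak-rpm"].astype("float").mean())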

In [13]: df['num-of-doors'].value_counts()

Out[13]: four 114
two 89
Name: num-of-doors, dtype: int64

In [14]: df['num-of-doors'].value_counts().idxmax()

Out[14]: 'four'

In [15]: # Replace missing 'num-of-doors' values with the most frequent value ('four')
df["num-of-doors"] = df["num-of-doors"].fillna(df["num-of-doors"].mode()[0])

# Drop rows with NaN values in the "horsepower" column
df.dropna(subset=["horsepower"], axis=0, inplace=True)

# Reset the index after dropping rows
df.reset_index(drop=True, inplace=True)
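
The three steps in this cell (fill a categorical column with its mode, drop rows that lack a value in a key column, rebuild the index) can be checked on a small frame. This is a sketch with hypothetical values, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one gap per column.
toy = pd.DataFrame({
    "num-of-doors": ["four", "two", np.nan, "four"],
    "horsepower": [111.0, np.nan, 154.0, 102.0],
})

# mode() returns a Series; its first entry is the most frequent value.
most_common = toy["num-of-doors"].mode()[0]
toy["num-of-doors"] = toy["num-of-doors"].fillna(most_common)

# Drop rows without horsepower, then renumber the rows from 0.
toy = toy.dropna(subset=["horsepower"]).reset_index(drop=True)
print(toy)
```

Without reset_index(drop=True), the surviving rows would keep their old labels (0, 2, 3 here), which can surprise later positional lookups.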
In [17]: df.isnull().sum()

Out[17]: symboling 0
normalized-losses 0
make 0
fuel-type 0
aspiration 0
num-of-doors 0
body-style 0
drive-wheels 0
engine-location 0
wheel-base 0
length 0
width 0
height 0
curb-weight 0
engine-type 0
num-of-cylinders 0
engine-size 0
fuel-system 0
bore 0
stroke 0
compression-ratio 0
horsepower 0
peak-rpm 0
city-mpg 0
highway-mpg 0
price 0
dtype: int64

In [18]: # convert city fuel economy from mpg to L/100km
df['city-L/100km'] = 235/df["city-mpg"]
df.head()

Out[18]:
   symboling  normalized-losses         make  fuel-type  aspiration  num-of-doors   body-style  drive-wheels  engine-location  wheel-base  ...
0          3              122.0  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...
1          3              122.0  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...
2          1              122.0  alfa-romero        gas         std           two    hatchback           rwd            front        94.5  ...
3          2              164.0         audi        gas         std          four        sedan           fwd            front        99.8  ...
4          2              164.0         audi        gas         std          four        sedan           4wd            front        99.4  ...

5 rows × 27 columns
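
The constant 235 in the conversion is a rounded unit factor. Assuming the mpg columns use US gallons (the notebook does not say so), the factor follows from 1 US gallon = 3.785411784 L and 1 mile = 1.609344 km. The helper name below is hypothetical and only illustrates the arithmetic.

```python
LITERS_PER_US_GALLON = 3.785411784  # assumption: US gallons, not imperial
KM_PER_MILE = 1.609344


def mpg_to_l_per_100km(mpg: float) -> float:
    """Convert miles per US gallon to liters per 100 km."""
    return 100.0 * LITERS_PER_US_GALLON / (KM_PER_MILE * mpg)


# At 1 mpg the result is the conversion factor itself, about 235.21.
factor = mpg_to_l_per_100km(1.0)
print(round(factor, 2))
```

Using 235 instead of the unrounded factor changes the result by roughly 0.1 percent, which is negligible for exploratory work.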
 
In [19]: df['highway-L/100km'] = 235/df["highway-mpg"]
df.head()

Out[19]:
   symboling  normalized-losses         make  fuel-type  aspiration  num-of-doors   body-style  drive-wheels  engine-location  wheel-base  ...
0          3              122.0  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...
1          3              122.0  alfa-romero        gas         std           two  convertible           rwd            front        88.6  ...
2          1              122.0  alfa-romero        gas         std           two    hatchback           rwd            front        94.5  ...
3          2              164.0         audi        gas         std          four        sedan           fwd            front        99.8  ...
4          2              164.0         audi        gas         std          four        sedan           4wd            front        99.4  ...

5 rows × 28 columns
 

In [20]: # simple feature scaling: divide each column by its maximum
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()

In [21]: df['height'] = df['height']/df['height'].max()
df[["length","width","height"]].head()

Out[21]:
length width height

0 0.811148 0.886584 0.816054

1 0.811148 0.886584 0.816054

2 0.822681 0.905947 0.876254

3 0.848630 0.915629 0.908027

4 0.848630 0.918396 0.908027
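
Dividing by the column maximum, as done for length, width and height, maps the largest value to 1 but does not map the smallest to 0. The sketch below puts that scaling next to min-max scaling for comparison. Min-max scaling is not used in the notebook and is shown only as an alternative. 208.1 is the maximum length reported by df.describe(); the other values are illustrative.

```python
import pandas as pd

# 208.1 is the maximum length from df.describe(); other values are illustrative.
toy = pd.DataFrame({"length": [168.8, 176.6, 208.1]})

# Simple feature scaling: x / max(x). The largest value becomes 1.0.
toy["length-scaled"] = toy["length"] / toy["length"].max()

# Min-max scaling: (x - min) / (max - min). The range becomes [0, 1].
span = toy["length"].max() - toy["length"].min()
toy["length-minmax"] = (toy["length"] - toy["length"].min()) / span

print(toy)
```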

In [22]: df.columns

Out[22]: Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price', 'city-L/100km', 'highway-L/100km'],
      dtype='object')

In [23]: df['aspiration'].value_counts()

Out[23]: std 168
turbo 37
Name: aspiration, dtype: int64
In [24]: dummy_variable_1 = pd.get_dummies(df["aspiration"])
dummy_variable_1.head()

Out[24]:
std turbo

0 1 0

1 1 0

2 1 0

3 1 0

4 1 0
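
The indicator columns above are named std and turbo, which no longer says which feature they came from once they are merged into a wide frame. A possible variation, sketched on toy data, passes prefix= so the source column stays visible in the names, and dtype=int so the output is 0/1 as shown above (pandas 2.0 and later return booleans by default).

```python
import pandas as pd

# Toy column (hypothetical rows) with the two categories seen in the notebook.
toy = pd.DataFrame({"aspiration": ["std", "turbo", "std"]})

# One indicator column per category, named "aspiration_<category>".
dummies = pd.get_dummies(toy["aspiration"], prefix="aspiration", dtype=int)

# Replace the original column with its indicator columns.
toy = pd.concat([toy.drop(columns="aspiration"), dummies], axis=1)
print(toy)
```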

In [25]: # merge the indicator columns into df, then drop the original column
df = pd.concat([df, dummy_variable_1], axis=1)
df.drop("aspiration", axis=1, inplace=True)

In [26]: df.head()

Out[26]:
   symboling  normalized-losses         make  fuel-type  num-of-doors   body-style  drive-wheels  engine-location  wheel-base    length  ...
0          3              122.0  alfa-romero        gas           two  convertible           rwd            front        88.6  0.811148  ...
1          3              122.0  alfa-romero        gas           two  convertible           rwd            front        88.6  0.811148  ...
2          1              122.0  alfa-romero        gas           two    hatchback           rwd            front        94.5  0.822681  ...
3          2              164.0         audi        gas          four        sedan           fwd            front        99.8  0.848630  ...
4          2              164.0         audi        gas          four        sedan           4wd            front        99.4  0.848630  ...

5 rows × 29 columns
 

In [27]: df["horsepower"]=df["horsepower"].astype(float, copy=True)


In [28]: %matplotlib inline
import matplotlib.pyplot as plt
plt.hist(df["horsepower"])
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

Out[28]: Text(0.5, 1.0, 'horsepower bins')

In [29]: bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins

Out[29]: array([ 48., 128., 208., 288.])

In [30]: group_names = ['Low', 'Medium', 'High']

In [31]: # Define example bin edges for horsepower
bins = [df["horsepower"].min(), 100, 150, df["horsepower"].max()]
group_names = ["Low", "Medium", "High"]  # Labels for bins

# Bin 'horsepower' column into categorical values
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names)

# Display first 4 rows of 'horsepower' and 'horsepower-binned' columns
df[['horsepower', 'horsepower-binned']].head(4)

Out[31]:
horsepower horsepower-binned

0 111.0 Medium

1 111.0 Medium

2 154.0 High

3 102.0 Medium
In [32]: df["horsepower-binned"].value_counts()

Out[32]: Low 110
Medium 62
High 32
Name: horsepower-binned, dtype: int64
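
The three counts above sum to 204, one fewer than the 205 rows reported by df.info(). One possible explanation, which the notebook output does not confirm, is that pd.cut builds intervals that are open on the left by default, so a value equal to the first edge (here the column minimum) falls outside every bin and becomes NaN. The sketch below reproduces that behaviour on toy values and shows the effect of include_lowest=True.

```python
import pandas as pd

# Toy values (hypothetical) that sit exactly on the bin edges.
hp = pd.Series([48.0, 100.0, 150.0, 288.0])
edges = [48, 100, 150, 288]
labels = ["Low", "Medium", "High"]

# Default: intervals are (48, 100], (100, 150], (150, 288]; 48 is excluded.
default_cut = pd.cut(hp, edges, labels=labels)

# include_lowest=True closes the first interval on the left, so 48 is "Low".
inclusive_cut = pd.cut(hp, edges, labels=labels, include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())
```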

In [33]: %matplotlib inline
import matplotlib.pyplot as plt
# reindex so each bar height lines up with its label in group_names
counts = df["horsepower-binned"].value_counts().reindex(group_names)
plt.bar(group_names, counts)
# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

Out[33]: Text(0.5, 1.0, 'horsepower bins')
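
value_counts() orders its result by frequency, not by label. Passing its values straight to a bar chart next to a fixed label list is therefore only correct when the frequency order happens to match the label order, as it does for the counts in Out[32] (Low 110, Medium 62, High 32). A minimal sketch of the mismatch and of one way to avoid it, using hypothetical labels:

```python
import pandas as pd

# Hypothetical labels where the most frequent bin is not the first label.
binned = pd.Series(["Low", "High", "High", "Medium", "High", "Low"])
order = ["Low", "Medium", "High"]

# Sorted by frequency: High 3, Low 2, Medium 1.
by_frequency = binned.value_counts()

# Reindexing by the label list restores the intended order.
by_label = by_frequency.reindex(order)

print(by_frequency.tolist(), by_label.tolist())
```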

In [34]: df["peak-rpm"]=df["peak-rpm"].astype(float, copy=True)
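
The astype(float) cast above assumes every entry already parses as a number. If a column were to contain a non-numeric placeholder, the cast would raise an error, whereas pd.to_numeric with errors="coerce" turns such entries into NaN so they can be imputed afterwards. The sketch below uses a hypothetical "n/a" placeholder; nothing in the notebook output indicates that autodata.csv contains one.

```python
import pandas as pd

# Hypothetical raw column with one entry that is not a number.
raw = pd.Series(["5000", "5500", "n/a"])

# A strict cast fails on the unparseable entry.
try:
    raw.astype(float)
    strict_cast_failed = False
except (ValueError, TypeError):
    strict_cast_failed = True

# errors="coerce" converts what it can and marks the rest as NaN.
numeric = pd.to_numeric(raw, errors="coerce")
print(strict_cast_failed, numeric.tolist())
```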


In [35]: %matplotlib inline
import matplotlib.pyplot as plt
plt.hist(df["peak-rpm"])
plt.xlabel("peak-rpm")
plt.ylabel("count")
plt.title("Peak-rpm bins")

Out[35]: Text(0.5, 1.0, 'Peak-rpm bins')

In [36]: bins = np.linspace(min(df["peak-rpm"]), max(df["peak-rpm"]), 4)
bins

Out[36]: array([4150. , 4966.66666667, 5783.33333333, 6600. ])

In [37]: group_names1 = ['Low', 'Medium', 'High']


In [39]: import numpy as np

# Ensure 'peak-rpm' is numeric
df['peak-rpm'] = pd.to_numeric(df['peak-rpm'], errors='coerce')

# Fill missing values with the mean
df['peak-rpm'] = df['peak-rpm'].fillna(df['peak-rpm'].mean())

# Define bin edges (ensuring they are sorted)
bins = sorted([df["peak-rpm"].min(), 4000, 5000, 6000, df["peak-rpm"].max()])

# Define bin labels
group_names = ["Low", "Medium", "High", "Very High"]

# Apply binning
df['peakrpm-binned'] = pd.cut(df['peak-rpm'], bins, labels=group_names, include_lowest=True)

# Display first 5 rows of 'peak-rpm' and 'peakrpm-binned'
df[['peak-rpm', 'peakrpm-binned']].head(5)

Out[39]:
peak-rpm peakrpm-binned

0 5000.0 Medium

1 5000.0 Medium

2 5000.0 Medium

3 5500.0 High

4 5500.0 High

In [40]: df["peakrpm-binned"].value_counts()

Out[40]: High 107
Medium 91
Low 5
Very High 2
Name: peakrpm-binned, dtype: int64
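
The small Low count is consistent with how the edges sort. Out[36] shows a column minimum of 4150 and a maximum of 6600, so the sorted edge list is 4000, 4150, 5000, 6000, 6600, and the Low bin spans only 4000 to 4150. The sketch below uses those edges with a few sample values to show which label each range receives. The sample values are chosen for illustration, and include_lowest=True is an assumption about the truncated pd.cut call.

```python
import pandas as pd

# Edges built from the min (4150) and max (6600) shown in Out[36].
edges = sorted([4150, 4000, 5000, 6000, 6600])
labels = ["Low", "Medium", "High", "Very High"]

# Illustrative sample values, one per bin.
sample = pd.Series([4150.0, 5000.0, 5500.0, 6600.0])
binned = pd.cut(sample, edges, labels=labels, include_lowest=True)

# Pair every label with the interval it covers.
intervals = list(zip(labels, zip(edges[:-1], edges[1:])))
print(intervals)
print(binned.tolist())
```

If a wider Low range were intended, dropping the 4000 edge (so the minimum itself is the first edge) together with one label would be one way to get three bins of more comparable width.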
