Why Data Analysis?
Data Analysis plays an important role in;
Discovering useful information
Answering questions
Predicting future or the unknown
Reading & Writing Data in Python
In [1]:
import pandas as pd
import numpy as np
In [2]:
path = "Auto85.csv"
df = pd.read_csv(path, header = None) #read_csv() assumes data has a header
Printing the Data Frame
df prints the entire data frame
df.head(n) prints the first n rows
df.tail(n) prints the last n rows
In [5]:
df.head() # Prints the top n rows
Out [5]:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 rows × 26 columns
In [6]:
headers = ["symboling","normalized-losses","make","fuel-type", "aspiration", "num-of-doors", "body-style", "dri
In [7]:
df.columns = headers
In [8]: df.head()
Out [8]:
num-
normalized- fuel- drive- engine- wheel- engine- fuel- compression-
symboling make aspiration of- body-style ... bore stroke
losses type wheels location base size system ratio
doors
alfa-
0 3 ? gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0
romero
alfa-
1 3 ? gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0
romero
alfa-
2 1 ? gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0
romero
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0
5 rows × 26 columns
In [9]:
path = "up_data.csv"
df.to_csv(path)
Analyzing Data in Python
Checking Data Types of data
df.dtypes
In [10]:
df.dtypes
Out [10]: symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
Price object
dtype: object
In [11]:
df.describe() #Does not include columns with non-numeric data types
Out [11]:
compression- highway-
symboling wheel-base length width height curb-weight engine-size city-mpg
ratio mpg
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 0.834146 98.756585 174.049268 65.907805 53.724878 2555.565854 126.907317 10.142537 25.219512 30.751220
std 1.245307 6.021776 12.337289 2.145204 2.443522 520.680204 41.642693 3.972040 6.542142 6.886443
min -2.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 7.000000 13.000000 16.000000
25% 0.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 97.000000 8.600000 19.000000 25.000000
50% 1.000000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 9.000000 24.000000 30.000000
75% 2.000000 102.400000 183.100000 66.900000 55.500000 2935.000000 141.000000 9.400000 30.000000 34.000000
max 3.000000 120.900000 208.100000 72.300000 59.800000 4066.000000 326.000000 23.000000 49.000000 54.000000
In [90]:
df.describe(include = "all") # top = most frequent value
Out [90]:
num-
normalized- fuel- body- drive- engine- fuel- com
symboling make aspiration of- wheel-base ... engine-size bore stroke
losses type style wheels location system
doors
count 205.000000 205 205 205 205 205 205 205 205 205.000000 ... 205.000000 205 205 205 205
unique NaN 52 22 2 2 3 5 3 2 NaN ... NaN 8 39 37 NaN
top NaN ? toyota gas std four sedan fwd front NaN ... NaN mpfi 3.62 3.40 NaN
freq NaN 41 32 185 168 114 96 120 202 NaN ... NaN 94 23 20 NaN
mean 0.834146 NaN NaN NaN NaN NaN NaN NaN NaN 98.756585 ... 126.907317 NaN NaN NaN 10.1
std 1.245307 NaN NaN NaN NaN NaN NaN NaN NaN 6.021776 ... 41.642693 NaN NaN NaN 3.97
min -2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 86.600000 ... 61.000000 NaN NaN NaN 7.00
25% 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 94.500000 ... 97.000000 NaN NaN NaN 8.60
50% 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN 97.000000 ... 120.000000 NaN NaN NaN 9.00
75% 2.000000 NaN NaN NaN NaN NaN NaN NaN NaN 102.400000 ... 141.000000 NaN NaN NaN 9.40
max 3.000000 NaN NaN NaN NaN NaN NaN NaN NaN 120.900000 ... 326.000000 NaN NaN NaN 23.0
11 rows × 26 columns
Data Preprocessing
Accessing columns of a data frame
In [91]: df["symboling"]
Out [91]: 0 3
1 3
2 1
3 2
4 2
..
200 -1
201 -1
202 -1
203 -1
204 -1
Name: symboling, Length: 205, dtype: int64
In [92]:
df["body-style"]
Out [92]: 0 convertible
1 convertible
2 hatchback
3 sedan
4 sedan
...
200 sedan
201 sedan
202 sedan
203 sedan
204 sedan
Name: body-style, Length: 205, dtype: object
Dealing with Missing Values
Replace ? in normalized-losses column by NAN
In [94]:
df.head()
Out [94]:
num-
normalized- fuel- drive- engine- wheel- engine- fuel- compression-
symboling make aspiration of- body-style ... bore stroke
losses type wheels location base size system ratio
doors
alfa-
0 3 ? gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0
romero
alfa-
1 3 ? gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0
romero
alfa-
2 1 ? gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0
romero
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0
5 rows × 26 columns
In [95]: df["normalized-losses"]
Out [95]: 0 ?
1 ?
2 ?
3 164
4 164
...
200 95
201 95
202 95
203 95
204 95
Name: normalized-losses, Length: 205, dtype: object
In [96]:
df["normalized-losses"].replace("?", np.nan, inplace = True)
In [97]:
df.head()
Out [97]:
num-
normalized- fuel- drive- engine- wheel- engine- fuel- compression-
symboling make aspiration of- body-style ... bore stroke
losses type wheels location base size system ratio
doors
alfa-
0 3 NaN gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0
romero
alfa-
1 3 NaN gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0
romero
alfa-
2 1 NaN gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0
romero
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0
5 rows × 26 columns
In [98]: df["normalized-losses"] = pd.to_numeric(df["normalized-losses"]) # Convert this column from object to numeric c
In [99]:
df.dtypes
Out [99]: symboling int64
normalized-losses float64
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
Price object
dtype: object
In [100]:
mean = df["normalized-losses"].mean()
mean
Out [100]: 122.0
In [101]:
df["normalized-losses"].replace(np.nan, mean, inplace=True) # replace nan with mean value # inplace=True (This
In [102]:
df["normalized-losses"]
Out [102]: 0 122.0
1 122.0
2 122.0
3 164.0
4 164.0
...
200 95.0
201 95.0
202 95.0
203 95.0
204 95.0
Name: normalized-losses, Length: 205, dtype: float64
Calculate Average Values based on Other Features
In [103]: df[["normalized-losses","make"]]
Out [103]:
normalized-losses make
0 122.0 alfa-romero
1 122.0 alfa-romero
2 122.0 alfa-romero
3 164.0 audi
4 164.0 audi
... ... ...
200 95.0 volvo
201 95.0 volvo
202 95.0 volvo
203 95.0 volvo
204 95.0 volvo
205 rows × 2 columns
In [104]:
df.groupby("make").mean() # calculate different values based on make column mean
Out [104]:
normalized- compression- highway
symboling wheel-base length width height curb-weight engine-size city-mpg
losses ratio mp
make
alfa-romero 2.333333 122.000000 90.566667 169.600000 64.566667 50.000000 2639.666667 137.333333 9.000000 20.333333 26.66666
audi 1.285714 144.285714 102.271429 183.828571 68.714286 54.428571 2800.714286 130.714286 8.400000 18.857143 24.14285
bmw 0.375000 156.000000 103.162500 184.500000 66.475000 54.825000 2929.375000 166.875000 8.575000 19.375000 25.37500
chevrolet 1.000000 100.000000 92.466667 151.933333 62.500000 52.400000 1757.000000 80.333333 9.566667 41.000000 46.33333
dodge 1.000000 133.444444 95.011111 160.988889 64.166667 51.644444 2151.333333 102.666667 8.634444 28.000000 34.11111
honda 0.615385 103.000000 94.330769 160.769231 64.384615 53.238462 2096.769231 99.307692 9.215385 30.384615 35.46153
isuzu 0.750000 122.000000 94.825000 163.775000 63.550000 52.225000 2213.500000 102.500000 9.225000 31.000000 36.00000
normalized- compression- highway
symboling wheel-base length width height curb-weight engine-size city-mpg
losses ratio mp
make
jaguar 0.000000 129.666667 109.333333 196.966667 69.933333 51.133333 4027.333333 280.666667 9.233333 14.333333 18.33333
mazda 1.117647 123.705882 97.017647 170.805882 65.588235 53.358824 2297.823529 103.000000 10.488235 25.705882 31.94117
mercedes-
0.000000 110.000000 110.925000 195.262500 71.062500 55.725000 3696.250000 226.500000 14.825000 18.500000 21.00000
benz
mercury 1.000000 122.000000 102.700000 178.400000 68.000000 54.800000 2910.000000 140.000000 8.000000 19.000000 24.00000
mitsubishi 1.846154 140.615385 95.353846 168.030769 65.253846 50.692308 2381.923077 118.307692 8.061538 24.923077 31.15384
nissan 1.000000 135.166667 95.722222 170.988889 65.088889 53.633333 2400.388889 127.888889 9.794444 27.000000 32.94444
peugot 0.000000 146.818182 110.200000 191.136364 68.390909 57.181818 3221.000000 135.818182 14.000000 22.454545 26.63636
plymouth 1.000000 128.000000 95.385714 164.900000 64.271429 51.971429 2220.857143 106.285714 8.671429 28.142857 34.14285
porsche 2.600000 134.800000 92.280000 170.260000 67.120000 51.100000 2891.200000 187.200000 9.600000 17.400000 26.00000
renault 1.000000 122.000000 96.100000 179.150000 66.550000 52.850000 2519.500000 132.000000 8.700000 23.000000 31.00000
saab 2.500000 127.000000 99.100000 186.600000 66.500000 56.100000 2745.500000 121.000000 9.201667 20.333333 27.33333
subaru 0.500000 92.250000 96.175000 168.858333 64.950000 53.750000 2316.250000 107.083333 8.816667 26.333333 30.75000
toyota 0.562500 110.656250 98.103125 171.934375 65.090625 53.721875 2441.093750 118.812500 10.340625 27.500000 32.90625
volkswagen 1.666667 121.500000 97.608333 172.533333 65.616667 55.183333 2343.166667 107.250000 13.625000 28.583333 34.91666
volvo -1.272727 91.454545 106.481818 188.800000 67.963636 56.236364 3037.909091 142.272727 10.227273 21.181818 25.81818
In [105]:
t = df.groupby("make").mean()
print(t.columns)
print(t)
Index(['symboling', 'normalized-losses', 'wheel-base', 'length', 'width',
'height', 'curb-weight', 'engine-size', 'compression-ratio', 'city-mpg',
'highway-mpg'],
dtype='object')
symboling normalized-losses wheel-base length \
make
alfa-romero 2.333333 122.000000 90.566667 169.600000
audi 1.285714 144.285714 102.271429 183.828571
bmw 0.375000 156.000000 103.162500 184.500000
chevrolet 1.000000 100.000000 92.466667 151.933333
dodge 1.000000 133.444444 95.011111 160.988889
honda 0.615385 103.000000 94.330769 160.769231
isuzu 0.750000 122.000000 94.825000 163.775000
jaguar 0.000000 129.666667 109.333333 196.966667
mazda 1.117647 123.705882 97.017647 170.805882
mercedes-benz 0.000000 110.000000 110.925000 195.262500
mercury 1.000000 122.000000 102.700000 178.400000
mitsubishi 1.846154 140.615385 95.353846 168.030769
nissan 1.000000 135.166667 95.722222 170.988889
peugot 0.000000 146.818182 110.200000 191.136364
plymouth 1.000000 128.000000 95.385714 164.900000
porsche 2.600000 134.800000 92.280000 170.260000
renault 1.000000 122.000000 96.100000 179.150000
saab 2.500000 127.000000 99.100000 186.600000
subaru 0.500000 92.250000 96.175000 168.858333
toyota 0.562500 110.656250 98.103125 171.934375
volkswagen 1.666667 121.500000 97.608333 172.533333
volvo -1.272727 91.454545 106.481818 188.800000
width height curb-weight engine-size \
make
alfa-romero 64.566667 50.000000 2639.666667 137.333333
audi 68.714286 54.428571 2800.714286 130.714286
bmw 66.475000 54.825000 2929.375000 166.875000
chevrolet 62.500000 52.400000 1757.000000 80.333333
dodge 64.166667 51.644444 2151.333333 102.666667
honda 64.384615 53.238462 2096.769231 99.307692
isuzu 63.550000 52.225000 2213.500000 102.500000
jaguar 69.933333 51.133333 4027.333333 280.666667
mazda 65.588235 53.358824 2297.823529 103.000000
mercedes-benz 71.062500 55.725000 3696.250000 226.500000
mercury 68.000000 54.800000 2910.000000 140.000000
mitsubishi 65.253846 50.692308 2381.923077 118.307692
nissan 65.088889 53.633333 2400.388889 127.888889
peugot 68.390909 57.181818 3221.000000 135.818182
plymouth 64.271429 51.971429 2220.857143 106.285714
porsche 67.120000 51.100000 2891.200000 187.200000
renault 66.550000 52.850000 2519.500000 132.000000
saab 66.500000 56.100000 2745.500000 121.000000
subaru 64.950000 53.750000 2316.250000 107.083333
toyota 65.090625 53.721875 2441.093750 118.812500
volkswagen 65.616667 55.183333 2343.166667 107.250000
volvo 67.963636 56.236364 3037.909091 142.272727
compression-ratio city-mpg highway-mpg
make
alfa-romero 9.000000 20.333333 26.666667
audi 8.400000 18.857143 24.142857
bmw 8.575000 19.375000 25.375000
chevrolet 9.566667 41.000000 46.333333
dodge 8.634444 28.000000 34.111111
honda 9.215385 30.384615 35.461538
isuzu 9.225000 31.000000 36.000000
jaguar 9.233333 14.333333 18.333333
mazda 10.488235 25.705882 31.941176
mercedes-benz 14.825000 18.500000 21.000000
mercury 8.000000 19.000000 24.000000
mitsubishi 8.061538 24.923077 31.153846
nissan 9.794444 27.000000 32.944444
peugot 14.000000 22.454545 26.636364
plymouth 8.671429 28.142857 34.142857
porsche 9.600000 17.400000 26.000000
renault 8.700000 23.000000 31.000000
saab 9.201667 20.333333 27.333333
subaru 8.816667 26.333333 30.750000
toyota 10.340625 27.500000 32.906250
volkswagen 13.625000 28.583333 34.916667
volvo 10.227273 21.181818 25.818182
How to drop rows/columns with misisng values?
In [106]:
# axis=0 (drop the entire row), #axis = 1 (drop the entire column)
df.dropna(subset=["Price"], axis=0, inplace = True) # drop all the values where prices value = nan
In [ ]: