Mall Customer Data Analysis PDF


12/6/21, 1:19 PM Mall Customer Data Analysis.ipynb - Colaboratory

Mall Customer Data Analysis

By Sarthak Aneja and Sakshi Chaudhary.

Introduction – The goal of this project is to find how a customer's age and annual income relate to their spending score. The data has the columns customer ID, gender, age, annual income, and spending score. We build a model that takes the age and annual income of a customer and predicts the spending score. First we read the data and check the records it contains. Then we check whether any null values are present; if there are, we replace or remove them. Then we rename the data frame columns and scale the raw data. Finally we compute descriptive statistics.

Project planning –

• Read client data and check records.
• Check for null values and remove/replace them if required.
• Rename data frame columns if required.
• Scale raw data as per model requirement.
• Perform descriptive statistics and calculate mean, median etc.


• Create box plots for numerical columns.
• Group data and create box plots for grouped data if required.
• Check correlation between variables and draw correlation matrix.
• Draw histogram of data and check density (KDE) if required.
• Check type of data for regression or classification.
• Perform train and test split for client data and fit into required model.
• Create model as per requirement and perform classification/regression/clustering.
• Try to apply some other models and check which is best.
• Create confusion matrix and classification report for these models.
• Write your conclusion.

1-Read Client Data and Check Records.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("Mall_Customers.csv")
data.head(10)

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
5           6  Female   22                  17                      76
6           7  Female   35                  18                       6
7           8  Female   23                  18                      94
8           9    Male   64                  19                       3
9          10  Female   30                  19                      72
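
Beyond `head`, a quick look at the shape, the column types, and duplicate IDs helps confirm that the file loaded as expected. The sketch below does not read the real file: it parses the first five records shown above from an in-memory string, used here as a stand-in for `Mall_Customers.csv`.

```python
import io
import pandas as pd

# First five records from the output above, used as a stand-in for Mall_Customers.csv.
csv_text = """CustomerID,Genre,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
"""

sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape      # number of records and columns
column_types = sample.dtypes       # one dtype per column
# Count rows whose CustomerID repeats an earlier one.
n_duplicates = int(sample.duplicated(subset="CustomerID").sum())

print(n_rows, n_cols)
print(column_types)
print(n_duplicates)
```

The same three checks can be run on `data` once the real file is loaded.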

2- Check Null Values if exist and remove/replace Null values if required.

data.isnull().sum()

CustomerID                0
Genre                     0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

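
The data here has no missing values, so nothing needs to be removed or replaced. For a file that does have gaps, two common options are sketched below on a small made-up frame. The values and the choice of median imputation are assumptions for illustration, not part of this project.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps; the real data shows no nulls.
sample = pd.DataFrame({
    "Age": [19, np.nan, 20, 23],
    "SpendingScore": [39, 81, np.nan, 77],
})

missing_per_column = sample.isnull().sum()

# Option 1: drop every row that has at least one missing value.
dropped = sample.dropna()

# Option 2: replace each gap with the median of its column.
filled = sample.fillna(sample.median())

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping rows is simplest but loses data; filling keeps every record at the cost of inventing a plausible value.
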
3- Rename Data Frame column names if required.

data.rename(columns = {'Genre':'Gender','Spending Score (1-100)':'SpendingScore','Annual I
data.head(3)

   CustomerID  Gender  Age  AnnualIncome(k$)  SpendingScore
0           1    Male   19                15             39
1           2    Male   21                15             81
2           3  Female   20                16              6
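
The `rename` call above is cut off in the source. The sketch below shows what a complete call could look like on the three records shown; the third mapping (`'Annual Income (k$)'` to `'AnnualIncome(k$)'`) is an assumption inferred from the output column name.

```python
import pandas as pd

sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Genre": ["Male", "Male", "Female"],
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

# The third entry is an assumption inferred from the output above.
new_names = {
    "Genre": "Gender",
    "Spending Score (1-100)": "SpendingScore",
    "Annual Income (k$)": "AnnualIncome(k$)",
}

# rename returns a new frame; the original is unchanged unless inplace=True is passed.
renamed = sample.rename(columns=new_names)
print(renamed.head(3))
```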

4- Scale Raw data as per Model requirement.

x = data.iloc[:, [2, 3, 4]]  # select the Age, AnnualIncome(k$) and SpendingScore columns
x

     Age  AnnualIncome(k$)  SpendingScore
0     19                15             39
1     21                15             81
2     20                16              6
3     23                16             77
4     31                17             40
...  ...               ...            ...
195   35               120             79
196   45               126             28
197   32               126             74
198   32               137             18
199   30               137             83

200 rows × 3 columns

x.dtypes

Age                 int64
AnnualIncome(k$)    int64
SpendingScore       int64
dtype: object

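
The cells in this step select the three numeric columns and check their types; no scaling is applied. If a model does need scaled inputs, standardization (z-scores) is one option. The sketch below uses the first five rows shown above, and the choice of z-scores is an assumption rather than the authors' method.

```python
import pandas as pd

# First five rows of x from the output above.
x_sample = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31],
    "AnnualIncome(k$)": [15, 15, 16, 16, 17],
    "SpendingScore": [39, 81, 6, 77, 40],
})

# Standardize each column: subtract its mean, divide by its population std (ddof=0).
x_scaled = (x_sample - x_sample.mean()) / x_sample.std(ddof=0)

print(x_scaled.round(3))
```

After this transform every column has mean 0 and standard deviation 1, so columns measured in different units contribute on a comparable scale.
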
5- Perform descriptive statistics and calculate mean, median etc.
import statistics as st

df = pd.DataFrame(x)
df.mean()

Age                 38.85
AnnualIncome(k$)    60.56
SpendingScore       50.20
dtype: float64

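st.median(x)  # iterating a DataFrame yields column labels, so this returns the middle label, not a median

'AnnualIncome(k$)'

st.mode(x)  # for the same reason this returns a column label, not a mode

'Age'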
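
`statistics.median` and `statistics.mode` iterate over whatever they are given, and iterating a DataFrame yields its column labels, which is why the two calls above return column names. Per-column values come from the DataFrame's own methods, or from passing a single column. A sketch on the first five rows shown in step 4 (so the numbers are for that sample only, not the full data):

```python
import statistics as st
import pandas as pd

x_sample = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31],
    "AnnualIncome(k$)": [15, 15, 16, 16, 17],
    "SpendingScore": [39, 81, 6, 77, 40],
})

# One median per column, computed by pandas.
medians = x_sample.median()

# The statistics module works as intended on a single column.
age_median = st.median(x_sample["Age"])

# Passing the whole frame returns a column label instead.
label_not_median = st.median(x_sample)

print(medians)
print(age_median, label_not_median)
```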

df.mode()

    Age  AnnualIncome(k$)  SpendingScore
0  32.0                54           42.0
1   NaN                78            NaN

df.std()

Age                 13.969007
AnnualIncome(k$)    26.264721
SpendingScore       25.823522
dtype: float64

6- Create boxplot for numerical columns.

sns.boxplot(data =x)
<AxesSubplot: >

7- Group data and Create boxplot for Grouped Data if required.
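
The notebook leaves this step empty. One possible way to fill it is sketched here on the first ten records shown in step 1; grouping by gender is an assumption about what the step intended.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Gender and spending score of the first ten records from step 1.
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female",
               "Female", "Female", "Female", "Male", "Female"],
    "SpendingScore": [39, 81, 6, 77, 40, 76, 6, 94, 3, 72],
})

# Summary statistics per group (count, mean, quartiles, min, max).
grouped_summary = sample.groupby("Gender")["SpendingScore"].describe()

# One box per gender for the spending score.
ax = sample.boxplot(column="SpendingScore", by="Gender")
plt.close("all")

print(grouped_summary)
```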

8- Check Correlation between variables and draw correlation matrix.

data.corr()

                  CustomerID       Age  AnnualIncome(k$)  SpendingScore
CustomerID          1.000000 -0.026763          0.977548       0.013835
Age                -0.026763  1.000000         -0.012398      -0.327227
AnnualIncome(k$)    0.977548 -0.012398          1.000000       0.009903
SpendingScore       0.013835 -0.327227          0.009903       1.000000

correlation = data.Age.corr(data.SpendingScore)
correlation

-0.32722684603909014

sns.heatmap(data.corr(),annot=True)

<AxesSubplot:>


sns.heatmap(x.corr(),annot=True)
<AxesSubplot:>
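
`data` still contains the text column `Gender`. The output above shows it was left out of the matrix automatically, but in pandas 2.0 and later `DataFrame.corr` no longer drops non-numeric columns by default and raises an error instead. Passing `numeric_only=True` keeps the behaviour seen here. A sketch on the first ten records from step 1 (so the values differ from the full-data output above):

```python
import pandas as pd

sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female",
               "Female", "Female", "Female", "Male", "Female"],
    "Age": [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
    "AnnualIncome(k$)": [15, 15, 16, 16, 17, 17, 18, 18, 19, 19],
    "SpendingScore": [39, 81, 6, 77, 40, 76, 6, 94, 3, 72],
})

# Pearson correlation over the numeric columns only; Gender is skipped.
corr_matrix = sample.corr(numeric_only=True)

# Correlation between one pair of columns.
age_vs_score = sample["Age"].corr(sample["SpendingScore"])

print(corr_matrix.round(3))
print(age_vs_score)
```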

9- Draw histogram for data and check density (KDE) if required.

colors = 'red', 'green'

plt.hist([data['SpendingScore'], data['Age']], color=colors, edgecolor='black',
         bins=int(90/10))

(array([[17., 20., 12., 30., 43., 21., 24., 16., 17.],
        [ 0., 25., 59., 47., 40., 19., 10.,  0.,  0.]]),
 array([ 1.        , 11.88888889, 22.77777778, 33.66666667, 44.55555556,
        55.44444444, 66.33333333, 77.22222222, 88.11111111, 99.        ]),
 <a list of 2 BarContainer objects>)


x.columns

Index(['Age', 'AnnualIncome(k$)', 'SpendingScore'], dtype='object')

x.columns = x.columns.str.replace(' ', '_')

x

     Age  AnnualIncome(k$)  SpendingScore
0     19                15             39
1     21                15             81
2     20                16              6
3     23                16             77
4     31                17             40
...  ...               ...            ...
195   35               120             79
196   45               126             28
197   32               126             74
198   32               137             18
199   30               137             83

200 rows × 3 columns

sns.displot(data=data, x="AnnualIncome(k$)", hue='Gender')

<seaborn.axisgrid.FacetGrid at 0x1f8ff324220>


sns.distplot((data['Age']), hist=True, kde=True,
             color='red',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4})

sns.distplot((data['SpendingScore']), hist=True, kde=True,
             color='green', hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4})

C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py:2557: FutureWa
  warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='SpendingScore', ylabel='Density'>

10- Check type of Data for regression or classification.

data.dtypes

CustomerID           int64
Gender              object
Age                  int64
AnnualIncome(k$)     int64
SpendingScore        int64
dtype: object

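
Whether this is a regression or a classification problem depends on the target. `SpendingScore` is an integer column on a 1-100 scale, which points to regression. A rough heuristic is sketched below on the ten scores from step 1; `guess_task` is a hypothetical helper and its threshold of 5 distinct values is an arbitrary assumption.

```python
import pandas as pd

# Spending scores of the first ten records from step 1.
target = pd.Series([39, 81, 6, 77, 40, 76, 6, 94, 3, 72], name="SpendingScore")

def guess_task(y, max_classes=5):
    """Hypothetical heuristic: numeric targets with many distinct values suggest regression."""
    if pd.api.types.is_numeric_dtype(y) and y.nunique() > max_classes:
        return "regression"
    return "classification"

task = guess_task(target)
gender_task = guess_task(pd.Series(["Male", "Female", "Female"]))

print(task, gender_task)
```
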

11- Perform Train and Test split for Client data and fit into required Model.

## Separate train dataset and test dataset
x_train, x_test, y_train, y_test = train_test_split(x[["Age", "AnnualIncome(k$)"]], x.Spend
x_train

     Age  AnnualIncome(k$)
88    34                58
58    27                46
113   19                64
149   34                78
36    42                34
...  ...               ...
151   39                78
67    68                48
25    29                28
196   45               126
175   30                88

140 rows × 2 columns

12- Create model as per requirement and perform classification/regression/clustering.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
lr = LinearRegression()

lr.fit(data[["Age","AnnualIncome(k$)"]],data.SpendingScore)

y_predicted = lr.predict(x_test)
y_predicted

array([40.41222091, 61.34408401, 50.75108989, 54.33957352, 47.61793098,
40.95952258, 57.40374973, 45.67135234, 32.58447792, 45.26776466,
62.2822877 , 62.73761242, 49.14367625, 52.28952082, 52.8701253 ,
57.46698388, 40.85029995, 61.60276918, 50.81432404, 43.3384317 ,
38.20002002, 52.60569158, 54.43155047, 37.97582621, 41.26994478,
54.49478462, 44.06968726, 54.71897843, 33.79405243, 57.52446947,
31.87046803, 33.79405243, 39.93390196, 60.957742 , 59.84014444,
44.62848604, 47.75014784, 56.37238055, 39.18540072, 35.61991133,
50.77408413, 38.07355171, 31.27717789, 49.55301249, 52.34125785,
34.25512571, 59.05715184, 54.58676157, 60.81977658, 45.37123872,
41.48383001, 58.94792922, 44.08693294, 59.74816749, 50.16929687,
49.32307012, 42.03113167, 54.42580191, 61.67175189, 57.36350981])
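
The split call in step 11 is cut off in the source, and the model in step 12 is fitted on all of `data`, so the test rows were also seen during training. A sketch of the more usual flow follows: split, fit on the training part only, then score on the held-out part. The data is synthetic (generated with a fixed seed, not the real customer records), and `test_size=0.3` and `random_state=0` are assumptions; 0.3 is only inferred from the 140 training rows and 60 predictions shown above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 200 customer records (not the real data).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Age": rng.integers(18, 71, size=200),
    "AnnualIncome(k$)": rng.integers(15, 138, size=200),
    "SpendingScore": rng.integers(1, 100, size=200),
})

features = sample[["Age", "AnnualIncome(k$)"]]
target = sample["SpendingScore"]

# Hold out 30% of the rows for testing.
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0
)

# Fit on the training rows only, then predict the unseen test rows.
lr = LinearRegression()
lr.fit(x_train, y_train)
y_predicted = lr.predict(x_test)

# Regression metrics on the held-out rows.
mae = mean_absolute_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)

print(x_train.shape, len(y_predicted))
print(mae, r2)
```

The confusion matrix and classification report named in the project plan apply to classifiers. For a regression model like this one, error measures such as mean absolute error or R² are the usual way to compare candidates.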
