RFM - Analysis - Ipynb - Colaboratory
RFM - Analysis - Ipynb - Colaboratory
ipynb - Colaboratory
Import Library
1 !pip install gdown
2 !pip install matplotlib
3 !pip install wget
4 import gdown
5 import pandas as pd
6 import numpy as np
7 import os
8 import math
9 import scipy.stats as stat
10 import datetime as dt
11 import seaborn as sb
12 import re
13 import matplotlib
14 import matplotlib.pyplot as plt
15 %matplotlib inline
Import Data
1 url1 = "https://raw.githubusercontent.com/hendraswntr/MiniProject-DataAnalyst-RFM_Analysis/main/data/Products.csv"
2 url2 = "https://raw.githubusercontent.com/hendraswntr/MiniProject-DataAnalyst-RFM_Analysis/main/data/OrderDetails.csv"
3 url3 = "https://raw.githubusercontent.com/hendraswntr/MiniProject-DataAnalyst-RFM_Analysis/main/data/Orders.csv"
4 url4 = "https://raw.githubusercontent.com/hendraswntr/MiniProject-DataAnalyst-RFM_Analysis/main/data/Customers.csv"
5
6 products = pd.read_csv(url1, sep=';')
7 orderdetails = pd.read_csv(url2, sep=';')
8 orders = pd.read_csv(url3, sep=';')
9 customers = pd.read_csv(url4, sep=';')
10
1 products.head()
10 boxes x 20
0 1 Chai 1 1 18
bags
1 2 Chang 1 1 24 - 12 oz bottles 19
Aniseed
2 3 1 2 12 - 550 ml bottles 10
Syrup
Chef Anton's
3 4 Cajun 2 2 48 - 6 oz jars 22
Seasoning
Chef Anton's
4 5 2 2 36 boxes 21,35
Gumbo Mix
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 1/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
1 orderdetails.head()
0 10248 11 14 12 0
1 10248 42 9,8 10 0
2 10248 72 34,8 5 0
3 10249 14 18,6 9 0
4 10249 51 42,4 40 0
1 orders.head()
1996-07-
1996-08-01 1996-07-16
0 10248 VINET 5 04 3
00:00:00 00:00:00
00:00:00
1996-07-
1996-08-16 1996-07-10
1 10249 TOMSP 6 05 1
00:00:00 00:00:00
00:00:00
1996-07-
1996-08-05 1996-07-12
2 10250 HANAR 4 08 2
00:00:00 00:00:00
00:00:00
1996-07-
1996-08-05 1996-07-15
3 10251 VICTE 3 08 1
00:00:00 00:00:00
00:00:00
1996-07-
1996-08-06 1996-07-11
4 10252 SUPRD 4 09 2
00:00:00 00:00:00
00:00:00
1 customers.head()
Alfreds Sales
0 ALFKI Maria Anders Obere Str. 57 Berlin Na
Futterkiste Representative
Antonio
Antonio Mataderos Moxico
2 ANTON Moreno Owner Na
Moreno 2312 D.F.
Taqueroa
Data Merge
1 data1 = pd.merge(customers, orders, on=['CustomerID'])
2 data2 = pd.merge(data1, orderdetails, on=['OrderID'])
3 data = pd.merge(data2, products, on=['ProductID'])
1 data.head()
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 2/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
Alfreds Sales
0 ALFKI Maria Anders Obere Str. 57 Berlin N
Futterkiste Representative
Alfreds Sales
1 ALFKI Maria Anders Obere Str. 57 Berlin N
Futterkiste Representative
5 rows × 'pandas.core.frame.DataFrame'>
<class 37 columns
Int64Index: 2155 entries, 0 to 2154
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 2155 non-null object
1 CompanyName 2155 non-null object
2 ContactName 2155 non-null object
3 ContactTitle 2155 non-null object
4 Address 2155 non-null object
5 City 2155 non-null object
6 Region 826 non-null object
7 PostalCode 2100 non-null object
8 Country 2155 non-null object
9 Phone 2155 non-null object
10 Fax 1506 non-null object
11 OrderID 2155 non-null int64
12 EmployeeID 2155 non-null int64
13 OrderDate 2155 non-null object
14 RequiredDate 2155 non-null object
15 ShippedDate 2082 non-null object
16 ShipVia 2155 non-null int64
17 Freight 2155 non-null object
18 ShipName 2155 non-null object
19 ShipAddress 2155 non-null object
20 ShipCity 2155 non-null object
21 ShipRegion 856 non-null object
22 ShipPostalCode 2100 non-null object
23 ShipCountry 2155 non-null object
24 ProductID 2155 non-null int64
25 UnitPrice_x 2155 non-null object
26 Quantity 2155 non-null int64
27 Discount 2155 non-null object
28 ProductName 2155 non-null object
29 SupplierID 2155 non-null int64
30 CategoryID 2155 non-null int64
31 QuantityPerUnit 2155 non-null object
32 UnitPrice_y 2155 non-null object
33 UnitsInStock 2155 non-null int64
34 UnitsOnOrder 2155 non-null int64
35 ReorderLevel 2155 non-null int64
36 Discontinued 2155 non-null bool
dtypes: bool(1), int64(10), object(26)
memory usage: 625.0+ KB
1 data.isnull().sum()
CustomerID 0
CompanyName 0
ContactName 0
ContactTitle 0
Address 0
City 0
Region 1329
PostalCode 55
Country 0
Phone 0
Fax 649
OrderID 0
EmployeeID 0
OrderDate 0
RequiredDate 0
ShippedDate 73
ShipVia 0
Freight 0
ShipName 0
ShipAddress 0
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 3/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
ShipCity 0
ShipRegion 1299
ShipPostalCode 55
ShipCountry 0
ProductID 0
UnitPrice_x 0
Quantity 0
Discount 0
ProductName 0
SupplierID 0
CategoryID 0
QuantityPerUnit 0
UnitPrice_y 0
UnitsInStock 0
UnitsOnOrder 0
ReorderLevel 0
Discontinued 0
dtype: int64
1 del data['Region']
2 del data['PostalCode']
3 del data['Fax']
4 del data['ShippedDate']
5 del data['ShipRegion']
6 del data['ShipPostalCode']
1 data.duplicated().sum()
1 data.nunique()
CustomerID 89
CompanyName 89
ContactName 89
ContactTitle 12
Address 89
City 69
Country 21
Phone 89
OrderID 830
EmployeeID 9
OrderDate 480
RequiredDate 454
ShipVia 3
Freight 799
ShipName 90
ShipAddress 89
ShipCity 70
ShipCountry 21
ProductID 77
UnitPrice_x 116
Quantity 55
Discount 11
ProductName 77
SupplierID 29
CategoryID 8
QuantityPerUnit 70
UnitPrice_y 62
UnitsInStock 51
UnitsOnOrder 10
ReorderLevel 7
Discontinued 2
dtype: int64
1 data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2155 entries, 0 to 2154
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 2155 non-null object
1 CompanyName 2155 non-null object
2 ContactName 2155 non-null object
3 ContactTitle 2155 non-null object
4 Address 2155 non-null object
5 City 2155 non-null object
6 Country 2155 non-null object
7 Phone 2155 non-null object
8 OrderID 2155 non-null int64
9 EmployeeID 2155 non-null int64
10 OrderDate 2155 non-null object
11 RequiredDate 2155 non-null object
12 ShipVia 2155 non-null int64
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 4/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
13 Freight 2155 non-null object
14 ShipName 2155 non-null object
15 ShipAddress 2155 non-null object
16 ShipCity 2155 non-null object
17 ShipCountry 2155 non-null object
18 ProductID 2155 non-null int64
19 UnitPrice_x 2155 non-null object
20 Quantity 2155 non-null int64
21 Discount 2155 non-null object
22 ProductName 2155 non-null object
23 SupplierID 2155 non-null int64
24 CategoryID 2155 non-null int64
25 QuantityPerUnit 2155 non-null object
26 UnitPrice_y 2155 non-null object
27 UnitsInStock 2155 non-null int64
28 UnitsOnOrder 2155 non-null int64
29 ReorderLevel 2155 non-null int64
30 Discontinued 2155 non-null bool
dtypes: bool(1), int64(10), object(20)
memory usage: 524.0+ KB
1 data['UnitPrice_x'] = pd.to_numeric(data['UnitPrice_x'].str.replace(',','.'), errors='coerce', downcast='float')
2 data['UnitPrice_y'] = pd.to_numeric(data['UnitPrice_y'].str.replace(',','.'), errors='coerce', downcast='float')
3 data['Discount'] = pd.to_numeric(data['Discount'].str.replace(',','.'), errors='coerce', downcast='float')
Data Analysis
1 data['OrderDate'].head()
0 1997-08-25 00:00:00
1 1998-03-16 00:00:00
2 1997-11-07 00:00:00
3 1998-03-04 00:00:00
4 1997-04-17 00:00:00
Name: OrderDate, dtype: object
1 data['recentpurchasedate'] = pd.DatetimeIndex(data['OrderDate']).date
1 data['recentpurchasedate'].head()
0 1997-08-25
1 1998-03-16
2 1997-11-07
3 1998-03-04
4 1997-04-17
Name: recentpurchasedate, dtype: object
1 recency_data = data.groupby(by='CustomerID', as_index=False)['recentpurchasedate'].max()
2 recency_data.columns = ['CustomerID','lastpurshacedate']
3 recency_data.head()
CustomerID lastpurshacedate
0 ALFKI 1998-04-09
1 ANATR 1998-03-04
2 ANTON 1998-01-28
3 AROUT 1998-04-10
4 BERGS 1998-03-04
1 data['OrderDate'].max()
'1998-05-06 00:00:00'
1 now = dt.date(1998,5,6)
2 print(now)
1998-05-06
1 recency_data['Recency'] = recency_data['lastpurshacedate'].apply(lambda x: (now - x).days)
2 recency_data.head()
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 5/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
0 ALFKI 1998-04-09 27
1 ANATR 1998-03-04 63
2 ANTON 1998-01-28 98
3 AROUT 1998-04-10 26
1 recency_data.drop('lastpurshacedate',axis=1,inplace=True)
2 recency_data.head()
4 BERGS 1998-03-04 63
CustomerID Recency
0 ALFKI 27
1 ANATR 63
2 ANTON 98
3 AROUT 26
4 BERGS 63
1 frequency_data = data.groupby(by=['CustomerID'], as_index=False)['OrderID'].count()
2 frequency_data.columns = ['CustomerID','Frequency']
3 frequency_data.head()
CustomerID Frequency
0 ALFKI 12
1 ANATR 10
2 ANTON 17
3 AROUT 30
4 BERGS 52
1 data['TotalSales'] = (data['UnitPrice_x'] - data['Discount'])*data['Quantity']
2 data['TotalSales']
0 680.249977
1 91.199997
2 911.999969
3 1365.000000
4 136.799995
...
2150 185.399998
2151 386.249995
2152 155.000000
2153 247.999992
2154 775.000000
Name: TotalSales, Length: 2155, dtype: float64
1 monetary_data = data.groupby(by='CustomerID',as_index=False).agg({'TotalSales': 'sum'})
2 monetary_data.columns = ['CustomerID','Monetary']
3 monetary_data.head()
CustomerID Monetary
0 ALFKI 4583.500010
1 ANATR 1402.949990
2 ANTON 7497.149937
3 AROUT 13788.899798
4 BERGS 26900.049850
1 temp_data = recency_data.merge(frequency_data,on='CustomerID')
2 rfm_data = temp_data.merge(monetary_data,on='CustomerID')
3 rfm_data.set_index('CustomerID',inplace=True)
4 rfm_data.head()
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 6/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
CustomerID
ALFKI 27 12 4583.500010
1 rfm_data.shape
ANATR 63 10 1402.949990
(89, 3)
ANTON 98 17 7497.149937
AROUT 26 30 13788.899798
1 quantiles = rfm_data.quantile(q=[0.25,0.5,0.75])
2 quantiles
BERGS 63 52 26900.049850
1 quantiles.to_dict()
1 def RScore(x,p,d):
2 if x <= d[p][0.25]:
3 return 4
4 elif x <= d[p][0.50]:
5 return 3
6 elif x <= d[p][0.75]:
7 return 2
8 else:
9 return 1
10
11 def FMScore(x,p,d):
12 if x <= d[p][0.25]:
13 return 1
14 elif x <= d[p][0.50]:
15 return 2
16 elif x <= d[p][0.75]:
17 return 3
18 else:
19 return 4
1 rfm_segmentation = rfm_data
2 rfm_segmentation['R_Score'] = rfm_segmentation['Recency'].apply(RScore, args=('Recency',quantiles,))
3 rfm_segmentation['F_Score'] = rfm_segmentation['Frequency'].apply(FMScore, args=('Frequency',quantiles,))
4 rfm_segmentation['M_Score'] = rfm_segmentation['Monetary'].apply(FMScore, args=('Monetary',quantiles,))
5 rfm_segmentation.head()
CustomerID
ALFKI 27 12 4583.500010 2 2 2
ANATR 63 10 1402.949990 1 1 1
ANTON 98 17 7497.149937 1 2 2
AROUT 26 30 13788.899798 2 3 3
BERGS 63 52 26900.049850 1 4 4
1 rfm_segmentation['RFMScore'] = rfm_segmentation.R_Score.map(str) \
2 + rfm_segmentation.F_Score.map(str) \
3 + rfm_segmentation.M_Score.map(str)
4 rfm_segmentation.head()
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 7/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
CustomerID
1 rfm_segmentation['Customer_Segmentation'] = np.where(rfm_segmentation['RFMScore'] ==444, "Best Customers",
2 (np.where(rfm_segmentation['F_Score'] == 4,"Loyal Customers",
3 (np.where(rfm_segmentation['M_Score'] == 4, "Big Spenders",
4 (np.where(rfm_segmentation['F_Score'] == 3, "Gold Customers",
5 (np.where(rfm_segmentation['F_Score'] == 2, "Silver Customers",
6 (np.where(rfm_segmentation['RFMScore'] == 244,"Almost Lost",
7 (np.where(rfm_segmentation['RFMScore'] == 144, "Lost Customers",
8 np.where(rfm_segmentation['RFMScore'] == 111,'Lost Cheap Customers', 'Low Value Custom
9 ))))))))))))))
10 rfm_segmentation
CustomerID
89 rows × 8 columns
1 rfsegment = pd.merge(rfm_segmentation, data, on=['CustomerID'])
2 data = rfsegment.sort_values(by=['CustomerID'])
3 data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2155 entries, 0 to 2154
Data columns (total 41 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 2155 non-null object
1 Recency 2155 non-null int64
2 Frequency 2155 non-null int64
3 Monetary 2155 non-null float64
4 R_Score 2155 non-null int64
5 F_Score 2155 non-null int64
6 M_Score 2155 non-null int64
7 RFMScore 2155 non-null object
8 Customer_Segmentation 2155 non-null object
9 CompanyName 2155 non-null object
10 ContactName 2155 non-null object
11 ContactTitle 2155 non-null object
12 Address 2155 non-null object
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 8/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
13 City 2155 non-null object
14 Country 2155 non-null object
15 Phone 2155 non-null object
16 OrderID 2155 non-null int64
17 EmployeeID 2155 non-null int64
18 OrderDate 2155 non-null object
19 RequiredDate 2155 non-null object
20 ShipVia 2155 non-null int64
21 Freight 2155 non-null object
22 ShipName 2155 non-null object
23 ShipAddress 2155 non-null object
24 ShipCity 2155 non-null object
25 ShipCountry 2155 non-null object
26 ProductID 2155 non-null int64
27 UnitPrice_x 2155 non-null float32
28 Quantity 2155 non-null int64
29 Discount 2155 non-null float32
30 ProductName 2155 non-null object
31 SupplierID 2155 non-null int64
32 CategoryID 2155 non-null int64
33 QuantityPerUnit 2155 non-null object
34 UnitPrice_y 2155 non-null float32
35 UnitsInStock 2155 non-null int64
36 UnitsOnOrder 2155 non-null int64
37 ReorderLevel 2155 non-null int64
38 Discontinued 2155 non-null bool
39 recentpurchasedate 2155 non-null object
40 TotalSales 2155 non-null float64
dtypes: bool(1), float32(3), float64(2), int64(15), object(20)
memory usage: 667.1+ KB
1 data.head()
5 rows × 41 columns
1 data['Customer_Segmentation'].value_counts()
1 plt.scatter(data['Recency'], data['Monetary'])
2 plt.title('Recency/Avg. TotalSales')
3 plt.savefig("RecedncyAvgTotalSales.png")
4 plt.show()
1 plt.scatter(data['Frequency'], data['Monetary'])
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 9/10
2/10/23, 3:05 PM RFM_Analysis.ipynb - Colaboratory
1 plt.scatter(data[ Frequency ], data[ Monetary ])
2 plt.title('Frequency/Avg. TotalSales')
3 plt.savefig("FrequencyAvgTotalSales.png")
4 plt.show()
1 data.to_csv('./RFM_Analysis.csv', index=False)
https://colab.research.google.com/drive/1oMZl8cfYL0lYQLQQ3RYsYmqpdD50l-ZR#scrollTo=SfKlyh_hi9nZ&printMode=true 10/10