
Banking Dataset - Marketing Targets


Project Title: Banking Dataset - Marketing Targets

Tools: Jupyter Notebook and VS Code

Technologies: Banking / Finance Analytics

Project Difficulty Level: Intermediate

Dataset: The dataset is available at the link below; you can download it at your convenience.

Click here to download data set

About Dataset
Context
Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a
financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or
term. The bank has various outreach plans to sell term deposits to their customers such as email
marketing, advertisements, telephonic marketing, and digital marketing.

Telephonic marketing campaigns remain one of the most effective ways to reach out to people. However, they require a huge investment, as large call centres are hired to execute them. Hence, it is crucial to identify beforehand the customers most likely to convert, so that they can be specifically targeted by phone.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The
classification goal is to predict if the client will subscribe to a term deposit (variable y).

Content
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the client would subscribe to the product (bank term deposit) ('yes') or not ('no'). The data folder contains two datasets:

● train.csv: 45,211 rows and 17 columns, ordered by date (from May 2008 to November 2010)
● test.csv: 4,521 rows and 17 columns, a 10% sample (4,521 examples) randomly selected from train.csv (see the quick sanity check below)
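A quick sanity check of these shapes, assuming the two semicolon-separated CSV files have been downloaded into the working directory:

import pandas as pd

# Load both files and confirm the row/column counts listed above.
train = pd.read_csv('train.csv', sep=';')
test = pd.read_csv('test.csv', sep=';')
print(train.shape)  # expected: (45211, 17)
print(test.shape)   # expected: (4521, 17)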

Detailed Column Descriptions

bank client data:

1 - age (numeric)
2 - job : type of job (categorical:
"admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")
3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or
widowed)
4 - education (categorical: "unknown","secondary","primary","tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes
last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign
(numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical:
"unknown","other","failure","success")

Output variable (desired target):


17 - y : has the client subscribed to a term deposit? (binary: "yes","no")
Missing Attribute Values: None

Citation

This dataset is publicly available for research. It has been taken from the UCI Machine Learning Repository, with random sampling and a few additional columns.

Please add this citation if you use this dataset for any further analysis.

S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing.
Decision Support Systems, Elsevier, 62:22-31, June 2014

Past Usage

The full dataset was described and analyzed in:

● S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October 2011. EUROSIS.

Acknowledgement

Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012. Thanks to Berkin Kaplanoğlu
for helping with the proper column descriptions.

Banking Dataset - Marketing Targets Machine Learning Project

This project involves building a machine learning model to identify potential marketing targets for a
bank based on customer data. The goal is to predict whether a customer will respond positively to a
marketing campaign. Here's a step-by-step guide:

1. Problem Definition

Objective: Develop a machine learning model to predict customer response to a marketing campaign
and identify potential marketing targets.

2. Data Collection
For this example, we will use the UCI Bank Marketing dataset, which can be downloaded from the UCI Machine Learning Repository.
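As a rough sketch of fetching the data programmatically (the archive URL below is an assumption based on the repository's usual layout; adjust it if the files have moved):

import io
import zipfile
import urllib.request

# Assumed location of the UCI "Bank Marketing" archive containing bank-full.csv.
UCI_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"

# Download the zip archive into memory and extract bank-full.csv into the working directory.
with urllib.request.urlopen(UCI_URL) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extract("bank-full.csv")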

3. Data Preprocessing

import pandas as pd

# Load the dataset


data = pd.read_csv('bank-full.csv', sep=';')

# Display basic info and check for missing values


print(data.info())
print(data.isnull().sum())
print(data.head())

# Encode categorical variables


data = pd.get_dummies(data, drop_first=True)

# Separate features and target variable


X = data.drop('y_yes', axis=1)
y = data['y_yes']

4. Exploratory Data Analysis (EDA)

import seaborn as sns


import matplotlib.pyplot as plt

# Basic statistics
print(data.describe())

# Histograms for numeric features


data.hist(bins=30, figsize=(20, 15))
plt.show()

# Correlation matrix
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

5. Feature Engineering

# Feature engineering example: combining related features

data['contact_duration_rate'] = data['duration'] / (data['campaign'] + 1)

# Drop original features if necessary

data = data.drop(['duration', 'campaign'], axis=1)

# Redefine features and target variable after feature engineering

X = data.drop('y_yes', axis=1)

y = data['y_yes']

6. Model Selection

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Split the data


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize models
log_reg = LogisticRegression()
rf_clf = RandomForestClassifier()

# Train models
log_reg.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)

# Make predictions
log_reg_pred = log_reg.predict(X_test)
rf_clf_pred = rf_clf.predict(X_test)

# Evaluate models
print("Logistic Regression Metrics")
print(f"Accuracy: {accuracy_score(y_test, log_reg_pred)}")
print(f"Precision: {precision_score(y_test, log_reg_pred)}")
print(f"Recall: {recall_score(y_test, log_reg_pred)}")
print(f"F1 Score: {f1_score(y_test, log_reg_pred)}")
print(f"ROC AUC: {roc_auc_score(y_test, log_reg_pred)}")

print("\nRandom Forest Classifier Metrics")


print(f"Accuracy: {accuracy_score(y_test, rf_clf_pred)}")
print(f"Precision: {precision_score(y_test, rf_clf_pred)}")
print(f"Recall: {recall_score(y_test, rf_clf_pred)}")
print(f"F1 Score: {f1_score(y_test, rf_clf_pred)}")
print(f"ROC AUC: {roc_auc_score(y_test, rf_clf_pred)}")

7. Model Interpretation

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Feature importance for Random Forest
importances = rf_clf.feature_importances_
features = X.columns
indices = np.argsort(importances)[::-1]

# Plot feature importances


plt.figure(figsize=(10, 6))
sns.barplot(x=importances[indices], y=features[indices])
plt.title('Feature Importances')
plt.show()

8. Deployment
To deploy the model, you can create a web application using Flask.

import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Build a single-row feature vector in the same column order used for training
    data = request.get_json(force=True)
    input_data = np.array([data[feature] for feature in X.columns])
    input_data = scaler.transform([input_data])
    prediction = rf_clf.predict(input_data)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)

9. Monitoring and Maintenance

Set up logging and monitoring to track the performance of your deployed model, and
schedule regular retraining with new data to keep the model accurate.
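As a minimal sketch of the idea, assuming the Flask app above and a hypothetical predictions.log file, each request and prediction can be logged so that model behaviour can be reviewed and compared with actual outcomes over time:

import logging

# Hypothetical log file; in production this would feed a central logging/monitoring system.
logging.basicConfig(filename='predictions.log', level=logging.INFO,
                    format='%(asctime)s %(message)s')

def log_prediction(features, prediction):
    # Record the model inputs and output so drift can be detected and retraining scheduled.
    logging.info("features=%s prediction=%s", features, prediction)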

10. Documentation and Reporting

Maintain comprehensive documentation of the project, including data sources, preprocessing steps, model selection, and evaluation results. Create detailed reports and visualizations to communicate findings and insights to stakeholders.

Tools and Technologies

● Programming Language: Python
● Libraries: pandas, numpy, seaborn, matplotlib, scikit-learn, Flask
● Visualization Tools: Tableau, Power BI, or any dashboarding tool for advanced
visualizations

This is a basic outline of a banking dataset marketing targets project. Depending on your
specific goals and data, you may need to adjust the steps accordingly.

Sample Project Report

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
df1 = pd.read_csv("/kaggle/input/banking-dataset-marketing-targets/train.csv",sep=";")
df2 = pd.read_csv("/kaggle/input/banking-dataset-marketing-targets/test.csv",sep=";")

In [3]:
# Combine the train and test datasets
df = pd.concat([df1,df2],ignore_index=True)

In [4]:
df.head()

Out[4]:
   age           job  marital  education default  balance housing loan  contact  day month  duration  campaign  pdays  previous poutcome   y
0   58    management  married   tertiary      no     2143     yes   no  unknown    5   may       261         1     -1         0  unknown  no
1   44    technician   single  secondary      no       29     yes   no  unknown    5   may       151         1     -1         0  unknown  no
2   33  entrepreneur  married  secondary      no        2     yes  yes  unknown    5   may        76         1     -1         0  unknown  no
3   47   blue-collar  married    unknown      no     1506     yes   no  unknown    5   may        92         1     -1         0  unknown  no
4   33       unknown   single    unknown      no        1      no   no  unknown    5   may       198         1     -1         0  unknown  no

In [5]:
df.shape

Out[5]:

(49732, 17)

In [6]:
# Find null values in dataset
df.isnull().sum()

Out[6]:
age 0
job 0
marital 0
education 0
default 0
balance 0
housing 0
loan 0
contact 0
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0

dtype: int64

In [7]:
df.describe()

Out[7]:

age balance day duration campaign pdays previous

count 49732.000000 49732.000000 49732.000000 49732.000000 49732.000000 49732.000000 49732.000000


mean 40.957472 1367.761562 15.816315 258.690179 2.766549 40.158630 0.576892

std 10.615008 3041.608766 8.315680 257.743149 3.099075 100.127123 2.254838

min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000

25% 33.000000 72.000000 8.000000 103.000000 1.000000 -1.000000 0.000000

50% 39.000000 448.000000 16.000000 180.000000 2.000000 -1.000000 0.000000

75% 48.000000 1431.000000 21.000000 320.000000 3.000000 -1.000000 0.000000

max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

In [8]:
# Checking data types
df.dtypes

Out[8]:
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object

dtype: object

In [9]:
x = df.drop(['y'], axis=1)
y = df.y

In [10]:
y.head()

Out[10]:
0 no
1 no
2 no
3 no
4 no

Name: y, dtype: object

In [11]:
# Store the names of all categorical (text) columns
categorical_columns = df.select_dtypes(include=['object']).columns

In [12]:
# Import LabelEncoder to convert strings to numbers.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [13]:
# Convert categorical columns to numeric for training the ML model
for col in categorical_columns:
    df[col] = le.fit_transform(df[col])

In [14]:
df.head()

Out[14]:

   age  job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome  y
0   58    4        1          2        0     2143        1     0        2    5      8       261         1     -1         0         3  0
1   44    9        2          1        0       29        1     0        2    5      8       151         1     -1         0         3  0
2   33    2        1          1        0        2        1     1        2    5      8        76         1     -1         0         3  0
3   47    1        1          3        0     1506        1     0        2    5      8        92         1     -1         0         3  0
4   33   11        2          3        0        1        0     0        2    5      8       198         1     -1         0         3  0

In [15]:
# Define the independent variables as x1 and the dependent variable as y1.

# Independent variables
x1 = df.drop(['y'], axis=1)
x1.head()

Out[15]:

   age  job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome
0   58    4        1          2        0     2143        1     0        2    5      8       261         1     -1         0         3
1   44    9        2          1        0       29        1     0        2    5      8       151         1     -1         0         3
2   33    2        1          1        0        2        1     1        2    5      8        76         1     -1         0         3
3   47    1        1          3        0     1506        1     0        2    5      8        92         1     -1         0         3
4   33   11        2          3        0        1        0     0        2    5      8       198         1     -1         0         3

In [16]:
#Dependent variable
y1=df.y
y1.head()

Out[16]:
0 0
1 0
2 0
3 0
4 0

Name: y, dtype: int64

In [17]:
# Find the best parameters using hyperparameter tuning

In [18]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [19]:
# Find the best parameters.
model_params = {
    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            # n_estimators must be >= 1, so 0 is dropped from the grid
            'n_estimators': [1, 5, 10]
        }
    }
}

In [20]:
scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(x1, y1)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

In [21]:
df1 = pd.DataFrame(scores)
df1

Out[21]:

model best_score best_params

0 random_forest 0.837789 {'n_estimators': 10}


In [22]:
# Create a Pipeline to Encode Categorical Features Numerically and Train a Model

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Define the pipeline
clf = Pipeline([
    ('encoder', OneHotEncoder()),  # Encode categorical features
    ('mod', RandomForestRegressor(n_estimators=10))  # Random Forest regressor; its output is read as a probability-like score below
])

In [23]:
clf.fit(x,y1)

Out[23]:

Pipeline

OneHotEncoder

RandomForestRegressor

In [24]:
clf.score(x,y1)

Out[24]:

0.8761215575318506

The pipeline scores about 0.876 on the training data. Note that this is the regressor's R² on the same data it was trained on, not classification accuracy on held-out data, so it should be read as an optimistic sanity check rather than a final evaluation.
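For a more honest estimate, here is a sketch of a held-out evaluation in the same pipeline style, using a classifier instead of a regressor (train_test_split, the 20% split size, and the clf_eval name are assumptions, not part of the original notebook):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

# Hold out 20% of the data for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y1, test_size=0.2, random_state=42)

# Same encode-then-model pipeline, but with a classifier and evaluation on unseen data.
clf_eval = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore')),  # ignore categories unseen during training
    ('mod', RandomForestClassifier(n_estimators=10, random_state=42))
])
clf_eval.fit(x_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, clf_eval.predict(x_test)))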

In [25]:
columns = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan',
'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']

new_data_points = [

[59, 'admin.', 'married', 'secondary', 'no', 2343, 'yes', 'no', 'unknown', 5, 'may', 1042, 1, -1, 0, 'unknown']
]

input = pd.DataFrame(new_data_points, columns=columns)

In [26]:
# Test the model on the above input.

prediction = clf.predict(input)[0]

In [27]:

probability_percentage = prediction * 100
print("The probability of this lead converting into a customer is :",probability_percentage,'%')

The probability of this lead converting into a customer is : 70.0 %

Reference link
