
Banking Dataset - Marketing Targets


Project Title: Banking Dataset - Marketing Targets

Tools: Jupyter Notebook and VS Code

Technologies: Banking / Finance Analytics

Project Difficulty Level: Intermediate

Dataset: The dataset is available at the link below; you can download it at your convenience.

Click here to download data set

About Dataset
Context
Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a
financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or
term. The bank has various outreach plans to sell term deposits to their customers such as email
marketing, advertisements, telephonic marketing, and digital marketing.

Telephonic marketing campaigns remain one of the most effective ways to reach out to people. However, they require a huge investment, as large call centres are hired to execute them. Hence, it is crucial to identify beforehand the customers most likely to convert, so that they can be specifically targeted by phone.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The
classification goal is to predict if the client will subscribe to a term deposit (variable y).

Content
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the client would subscribe to the product (bank term deposit) ('yes') or not ('no'). The data folder contains two datasets:

● train.csv: 45,211 rows and 17 columns, ordered by date (from May 2008 to November 2010)
● test.csv: 4,521 rows and 17 columns, a 10% sample (4,521 examples) randomly selected from train.csv (see the quick sanity check below)
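A quick sanity check of these shapes, assuming the two semicolon-separated CSV files have been downloaded into the working directory:

import pandas as pd

# Load both files and confirm the row/column counts listed above.
train = pd.read_csv('train.csv', sep=';')
test = pd.read_csv('test.csv', sep=';')
print(train.shape)  # expected: (45211, 17)
print(test.shape)   # expected: (4521, 17)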

Detailed Column Descriptions

bank client data:

1 - age (numeric)
2 - job : type of job (categorical:
"admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")
3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or
widowed)
4 - education (categorical: "unknown","secondary","primary","tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes
last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign
(numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical:
"unknown","other","failure","success")

Output variable (desired target):


17 - y : has the client subscribed to a term deposit? (binary: "yes","no")
Missing Attribute Values: None

Citation

This dataset is publicly available for research. It has been taken from the UCI Machine Learning Repository, with random sampling and a few additional columns.

Please add this citation if you use this dataset for any further analysis.

S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing.
Decision Support Systems, Elsevier, 62:22-31, June 2014

Past Usage

The full dataset was described and analyzed in:

● S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October 2011. EUROSIS.

Acknowledgement

Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012. Thanks to Berkin Kaplanoğlu
for helping with the proper column descriptions.

Banking Dataset - Marketing Targets Machine Learning Project

This project involves building a machine learning model to identify potential marketing targets for a
bank based on customer data. The goal is to predict whether a customer will respond positively to a
marketing campaign. Here's a step-by-step guide:

1. Problem Definition

Objective: Develop a machine learning model to predict customer response to a marketing campaign
and identify potential marketing targets.

2. Data Collection
For this example, we will use the UCI Bank Marketing dataset, which can be downloaded from the UCI Machine Learning Repository.
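As a rough sketch of fetching the data programmatically (the archive URL below is an assumption based on the repository's usual layout; adjust it if the files have moved):

import io
import zipfile
import urllib.request

# Assumed location of the UCI "Bank Marketing" archive containing bank-full.csv.
UCI_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"

# Download the zip archive into memory and extract bank-full.csv into the working directory.
with urllib.request.urlopen(UCI_URL) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extract("bank-full.csv")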

3. Data Preprocessing

import pandas as pd

# Load the dataset


data = pd.read_csv('bank-full.csv', sep=';')

# Display basic info and check for missing values


print(data.info())
print(data.isnull().sum())
print(data.head())

# Encode categorical variables


data = pd.get_dummies(data, drop_first=True)

# Separate features and target variable


X = data.drop('y_yes', axis=1)
y = data['y_yes']

4. Exploratory Data Analysis (EDA)

import seaborn as sns


import matplotlib.pyplot as plt

# Basic statistics
print(data.describe())

# Histograms for numeric features


data.hist(bins=30, figsize=(20, 15))
plt.show()

# Correlation matrix
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

5. Feature Engineering

# Feature engineering example: combining related features

data['contact_duration_rate'] = data['duration'] / (data['campaign'] + 1)

# Drop original features if necessary

data = data.drop(['duration', 'campaign'], axis=1)

# Redefine features and target variable after feature engineering

X = data.drop('y_yes', axis=1)

y = data['y_yes']

6. Model Selection

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Split the data


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize models
log_reg = LogisticRegression()
rf_clf = RandomForestClassifier()

# Train models
log_reg.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)

# Make predictions
log_reg_pred = log_reg.predict(X_test)
rf_clf_pred = rf_clf.predict(X_test)

# Evaluate models
print("Logistic Regression Metrics")
print(f"Accuracy: {accuracy_score(y_test, log_reg_pred)}")
print(f"Precision: {precision_score(y_test, log_reg_pred)}")
print(f"Recall: {recall_score(y_test, log_reg_pred)}")
print(f"F1 Score: {f1_score(y_test, log_reg_pred)}")
print(f"ROC AUC: {roc_auc_score(y_test, log_reg_pred)}")

print("\nRandom Forest Classifier Metrics")


print(f"Accuracy: {accuracy_score(y_test, rf_clf_pred)}")
print(f"Precision: {precision_score(y_test, rf_clf_pred)}")
print(f"Recall: {recall_score(y_test, rf_clf_pred)}")
print(f"F1 Score: {f1_score(y_test, rf_clf_pred)}")
print(f"ROC AUC: {roc_auc_score(y_test, rf_clf_pred)}")

7. Model Interpretation

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Feature importance for Random Forest
importances = rf_clf.feature_importances_
features = X.columns
indices = np.argsort(importances)[::-1]

# Plot feature importances


plt.figure(figsize=(10, 6))
sns.barplot(x=importances[indices], y=features[indices])
plt.title('Feature Importances')
plt.show()

8. Deployment
To deploy the model, you can create a web application using Flask.

import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Build a single-row feature vector in the same column order used for training
    data = request.get_json(force=True)
    input_data = np.array([data[feature] for feature in X.columns])
    input_data = scaler.transform([input_data])
    prediction = rf_clf.predict(input_data)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)

9. Monitoring and Maintenance

Set up logging and monitoring to track the performance of your deployed model, and
schedule regular retraining with new data to keep the model accurate.
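As a minimal sketch of the idea, assuming the Flask app above and a hypothetical predictions.log file, each request and prediction can be logged so that model behaviour can be reviewed and compared with actual outcomes over time:

import logging

# Hypothetical log file; in production this would feed a central logging/monitoring system.
logging.basicConfig(filename='predictions.log', level=logging.INFO,
                    format='%(asctime)s %(message)s')

def log_prediction(features, prediction):
    # Record the model inputs and output so drift can be detected and retraining scheduled.
    logging.info("features=%s prediction=%s", features, prediction)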

10. Documentation and Reporting

Maintain comprehensive documentation of the project, including data sources, preprocessing steps, model selection, and evaluation results. Create detailed reports and visualizations to communicate findings and insights to stakeholders.

Tools and Technologies

● Programming Language: Python
● Libraries: pandas, numpy, seaborn, matplotlib, scikit-learn, Flask
● Visualization Tools: Tableau, Power BI, or any dashboarding tool for advanced
visualizations

This is a basic outline of a banking dataset marketing targets project. Depending on your
specific goals and data, you may need to adjust the steps accordingly.

Sample Project Report

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
df1 = pd.read_csv("/kaggle/input/banking-dataset-marketing-targets/train.csv",sep=";")
df2 = pd.read_csv("/kaggle/input/banking-dataset-marketing-targets/test.csv",sep=";")

In [3]:
# Combine the train and test datasets
df = pd.concat([df1,df2],ignore_index=True)

In [4]:
df.head()

Out[4]:
   age           job  marital  education default  balance housing loan  contact  day month  duration  campaign  pdays  previous poutcome   y
0   58    management  married   tertiary      no     2143     yes   no  unknown    5   may       261         1     -1         0  unknown  no
1   44    technician   single  secondary      no       29     yes   no  unknown    5   may       151         1     -1         0  unknown  no
2   33  entrepreneur  married  secondary      no        2     yes  yes  unknown    5   may        76         1     -1         0  unknown  no
3   47   blue-collar  married    unknown      no     1506     yes   no  unknown    5   may        92         1     -1         0  unknown  no
4   33       unknown   single    unknown      no        1      no   no  unknown    5   may       198         1     -1         0  unknown  no

In [5]:
df.shape

Out[5]:

(49732, 17)

In [6]:
# Find null values in dataset
df.isnull().sum()

Out[6]:
age 0
job 0
marital 0
education 0
default 0
balance 0
housing 0
loan 0
contact 0
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0

dtype: int64

In [7]:
df.describe()

Out[7]:

age balance day duration campaign pdays previous

count 49732.000000 49732.000000 49732.000000 49732.000000 49732.000000 49732.000000 49732.000000


mean 40.957472 1367.761562 15.816315 258.690179 2.766549 40.158630 0.576892

std 10.615008 3041.608766 8.315680 257.743149 3.099075 100.127123 2.254838

min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000

25% 33.000000 72.000000 8.000000 103.000000 1.000000 -1.000000 0.000000

50% 39.000000 448.000000 16.000000 180.000000 2.000000 -1.000000 0.000000

75% 48.000000 1431.000000 21.000000 320.000000 3.000000 -1.000000 0.000000

max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

In [8]:
# Checking data types
df.dtypes

Out[8]:
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object

dtype: object

In [9]:
x = df.drop(['y'], axis=1)
y = df.y

In [10]:
y.head()

Out[10]:
0 no
1 no
2 no
3 no
4 no

Name: y, dtype: object

In [11]:
# Store the names of all categorical (text) columns
categorical_columns = df.select_dtypes(include=['object']).columns

In [12]:
# Import LabelEncoder to convert strings to numbers.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [13]:
# Convert categorical columns to numeric for training the ML model
for col in categorical_columns:
    df[col] = le.fit_transform(df[col])

In [14]:
df.head()

Out[14]:

   age  job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome  y
0   58    4        1          2        0     2143        1     0        2    5      8       261         1     -1         0         3  0
1   44    9        2          1        0       29        1     0        2    5      8       151         1     -1         0         3  0
2   33    2        1          1        0        2        1     1        2    5      8        76         1     -1         0         3  0
3   47    1        1          3        0     1506        1     0        2    5      8        92         1     -1         0         3  0
4   33   11        2          3        0        1        0     0        2    5      8       198         1     -1         0         3  0

In [15]:
# Define the independent variables as x1 and the dependent variable as y1.

# Independent variables
x1 = df.drop(['y'], axis=1)
x1.head()

Out[15]:

   age  job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome
0   58    4        1          2        0     2143        1     0        2    5      8       261         1     -1         0         3
1   44    9        2          1        0       29        1     0        2    5      8       151         1     -1         0         3
2   33    2        1          1        0        2        1     1        2    5      8        76         1     -1         0         3
3   47    1        1          3        0     1506        1     0        2    5      8        92         1     -1         0         3
4   33   11        2          3        0        1        0     0        2    5      8       198         1     -1         0         3

In [16]:
#Dependent variable
y1=df.y
y1.head()

Out[16]:
0 0
1 0
2 0
3 0
4 0

Name: y, dtype: int64

In [17]:
# Find the best parameters using hyperparameter tuning

In [18]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [19]:
# Find the best parameters.
model_params = {
    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            # n_estimators must be >= 1, so 0 is dropped from the grid
            'n_estimators': [1, 5, 10]
        }
    }
}

In [20]:
scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(x1, y1)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

In [21]:
df1 = pd.DataFrame(scores)
df1

Out[21]:

model best_score best_params

0 random_forest 0.837789 {'n_estimators': 10}


In [22]:
# Create a Pipeline to Encode Categorical Features Numerically and Train a Model

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Define the pipeline
clf = Pipeline([
    ('encoder', OneHotEncoder()),  # Encode categorical features
    ('mod', RandomForestRegressor(n_estimators=10))  # Random Forest regressor; its output is read as a probability-like score below
])

In [23]:
clf.fit(x,y1)

Out[23]:

Pipeline

OneHotEncoder

RandomForestRegressor

In [24]:
clf.score(x,y1)

Out[24]:

0.8761215575318506

The pipeline scores about 0.876 on the training data. Note that this is the regressor's R² on the same data it was trained on, not classification accuracy on held-out data, so it should be read as an optimistic sanity check rather than a final evaluation.
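For a more honest estimate, here is a sketch of a held-out evaluation in the same pipeline style, using a classifier instead of a regressor (train_test_split, the 20% split size, and the clf_eval name are assumptions, not part of the original notebook):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

# Hold out 20% of the data for testing.
x_train, x_test, y_train, y_test = train_test_split(x, y1, test_size=0.2, random_state=42)

# Same encode-then-model pipeline, but with a classifier and evaluation on unseen data.
clf_eval = Pipeline([
    ('encoder', OneHotEncoder(handle_unknown='ignore')),  # ignore categories unseen during training
    ('mod', RandomForestClassifier(n_estimators=10, random_state=42))
])
clf_eval.fit(x_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, clf_eval.predict(x_test)))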

In [25]:
columns = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan',
'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']

new_data_points = [

[59, 'admin.', 'married', 'secondary', 'no', 2343, 'yes', 'no', 'unknown', 5, 'may', 1042, 1, -1, 0, 'unknown']
]

input = pd.DataFrame(new_data_points, columns=columns)

In [26]:
# Test the model on the above input.

prediction = clf.predict(input)[0]

In [27]:

probability_percentage = prediction * 100
print("The probability of this lead converting into a customer is :",probability_percentage,'%')

The probability of this lead converting into a customer is : 70.0 %

Reference link
