Banking Dataset - Marketing Targets
Dataset: The dataset is available at the given link. You can download it at your convenience.
About Dataset
Context
Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a
financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or
term. The bank has various outreach plans to sell term deposits to their customers such as email
marketing, advertisements, telephonic marketing, and digital marketing.
Telephonic marketing campaigns remain one of the most effective ways to reach out to people. However,
they require a large investment, as large call centers are hired to execute these campaigns. Hence, it
is crucial to identify in advance the customers most likely to convert so that they can be specifically
targeted via call.
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The
classification goal is to predict if the client will subscribe to a term deposit (variable y).
Content
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing
campaigns were based on phone calls. Often, more than one contact with the same client was required in
order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no') by the
customer. The data folder contains two datasets:
● train.csv: 45,211 rows and 18 columns ordered by date (from May 2008 to November 2010)
● test.csv: 4,521 rows and 18 columns, a random 10% sample of the examples in train.csv
1 - age (numeric)
2 - job : type of job (categorical:
"admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")
3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or
widowed)
4 - education (categorical: "unknown","secondary","primary","tertiary")
5 - default: has credit in default? (binary: "yes","no")
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: "yes","no")
8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes
last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign
(numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical:
"unknown","other","failure","success")
Citation
This dataset is publicly available for research. It has been taken from the UCI Machine Learning
Repository with random sampling and a few additional columns.
Please add this citation if you use this dataset for any further analysis.
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing.
Decision Support Systems, Elsevier, 62:22-31, June 2014
Past Usage
● S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of
the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and
Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October 2011. EUROSIS.
Acknowledgement
Created by: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL) @ 2012. Thanks to Berkin Kaplanoğlu
for helping with the proper column descriptions.
This project involves building a machine learning model to identify potential marketing targets for a
bank based on customer data. The goal is to predict whether a customer will respond positively to a
marketing campaign. Here's a step-by-step guide:
1. Problem Definition
Objective: Develop a machine learning model to predict customer response to a marketing campaign
and identify potential marketing targets.
2. Data Collection
For this example, we will use the UCI Bank Marketing dataset, which can be downloaded from UCI
Machine Learning Repository.
3. Data Preprocessing
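import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Assumed preprocessing: load the semicolon-separated file and one-hot encode the
# categorical columns; drop_first=True turns the target 'y' into the single column 'y_yes'
data = pd.read_csv('train.csv', sep=';')
data = pd.get_dummies(data, drop_first=True, dtype=int)
4. Exploratory Data Analysis
# Basic statistics
print(data.describe())
# Correlation matrix
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()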
5. Feature Engineering
X = data.drop('y_yes', axis=1)
y = data['y_yes']
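The later steps rely on X_train, X_test, y_train, y_test and a scaler object that this outline never defines. A minimal sketch of one way to create them, assuming a stratified 80/20 split and a StandardScaler (both choices are assumptions), with a small synthetic frame standing in for the encoded bank data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded bank data (assumption: 'y_yes' is the 0/1 target).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(18, 95, size=200),
    "balance": rng.integers(-500, 5000, size=200),
    "y_yes": rng.integers(0, 2, size=200),
})

X = data.drop("y_yes", axis=1)
y = data["y_yes"]

# Hold out 20% of the rows; stratify keeps the yes/no ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training rows alone avoids leaking information from the held-out rows into the model.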
6. Model Selection
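from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)
# Initialize models
log_reg = LogisticRegression()
rf_clf = RandomForestClassifier()
# Train models
log_reg.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)
# Make predictions
log_reg_pred = log_reg.predict(X_test)
rf_clf_pred = rf_clf.predict(X_test)
# Evaluate models
print("Logistic Regression Metrics")
print(f"Accuracy: {accuracy_score(y_test, log_reg_pred)}")
print(f"Precision: {precision_score(y_test, log_reg_pred)}")
print(f"Recall: {recall_score(y_test, log_reg_pred)}")
print(f"F1 Score: {f1_score(y_test, log_reg_pred)}")
# ROC AUC is computed from predicted probabilities rather than hard labels
log_reg_proba = log_reg.predict_proba(X_test)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_test, log_reg_proba)}")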
7. Model Interpretation
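This step has no content in the outline. One common starting point for interpreting a random forest is its impurity-based feature importances. The sketch below uses synthetic data and placeholder feature names (both are assumptions), not the bank columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are placeholders, not dataset columns.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances: one non-negative weight per feature, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

With the real data, a feature such as duration would be expected to rank highly, but that should be checked rather than assumed.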
8. Deployment
To deploy the model, you can create a web application using Flask.
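from flask import Flask, request, jsonify
import numpy as np
app = Flask(__name__)
@app.route('/predict', methods=['POST'])
def predict():
    # Read the JSON body and order the values like the training columns
    data = request.get_json(force=True)
    input_data = np.array([data[feature] for feature in X.columns])
    # Apply the same scaler that was fitted on the training data
    input_data = scaler.transform([input_data])
    prediction = rf_clf.predict(input_data)
    return jsonify({'prediction': int(prediction[0])})
if __name__ == '__main__':
    app.run(debug=True)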
Set up logging and monitoring to track the performance of your deployed model, and
schedule regular retraining with new data to keep the model accurate.
This is a basic outline of a banking dataset marketing targets project. Depending on your
specific goals and data, you may need to adjust the steps accordingly.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
In [2]:
df1 = pd.read_csv("/kaggle/input/banking-dataset-marketing-targets/train.csv",sep=";")
df2 = pd.read_csv("/kaggle/input/banking-dataset-marketing-targets/test.csv",sep=";")
In [3]:
# Combine the train and test datasets
df = pd.concat([df1,df2],ignore_index=True)
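Since test.csv is described above as a random sample of train.csv, the combined frame likely contains repeated rows. A minimal sketch of how such duplicates could be counted and removed, using toy frames in place of df1 and df2 (the toy values are made up):

```python
import pandas as pd

# Toy stand-ins: the second frame repeats one row of the first, as a sampled test set would.
train_toy = pd.DataFrame({"age": [58, 44, 33], "y": ["no", "no", "yes"]})
test_toy = train_toy.iloc[[1]]

combined = pd.concat([train_toy, test_toy], ignore_index=True)

# duplicated() marks every repeat of an earlier row.
n_duplicates = int(combined.duplicated().sum())
deduplicated = combined.drop_duplicates(ignore_index=True)

print(n_duplicates, combined.shape, deduplicated.shape)
```

Whether to drop the repeats depends on the goal. Keeping them means any model evaluated on the combined frame has already seen those rows.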
In [4]:
df.head()
Out[4]:
   age           job  marital  education default  balance housing loan  contact  day month  duration  campaign  pdays  previous poutcome   y
2   33  entrepreneur  married  secondary      no        2     yes  yes  unknown    5   may        76         1     -1         0  unknown  no
In [5]:
df.shape
Out[5]:
(49732, 17)
In [6]:
# Find null values in dataset
df.isnull().sum()
Out[6]:
age 0
job 0
marital 0
education 0
default 0
balance 0
housing 0
loan 0
contact 0
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0
dtype: int64
In [7]:
df.describe()
Out[7]:
           age        balance        day     duration   campaign       pdays    previous
max  95.000000  102127.000000  31.000000  4918.000000  63.000000  871.000000  275.000000
In [8]:
# Checking data types
df.dtypes
Out[8]:
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object
In [9]:
x = df.drop(['y'], axis=1)
y = df.y
In [10]:
y.head()
Out[10]:
0 no
1 no
2 no
3 no
4 no
In [11]:
# Collect the names of all categorical (text) columns
categorical_columns = df.select_dtypes(include=['object']).columns
In [12]:
#Import labelencoder for converting string to number.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
In [13]:
# Convert categorical columns to integer codes for model training
for col in categorical_columns:
    df[col] = le.fit_transform(df[col])
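Because the loop above refits one shared LabelEncoder for every column, only the mapping of the last column survives, so the integer codes cannot be decoded later. A minimal sketch that keeps one encoder per column instead; the encoders dictionary and the toy frame are illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the categorical columns of the bank data.
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "y": ["no", "yes", "no", "no"],
})

# Keep a separate fitted encoder for every column.
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder numbers the classes in sorted order.
marital_classes = list(encoders["marital"].classes_)
print(dict(zip(marital_classes, range(len(marital_classes)))))

# The stored encoder can map the integers back to the original labels.
decoded = encoders["marital"].inverse_transform(toy["marital"])
print(list(decoded))
```

The sorted-order numbering matches the encoded output in this notebook, where "married" becomes 1 and "single" becomes 2.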
In [14]:
df.head()
Out[14]:
   age  job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome  y
0   58    4        1          2        0     2143        1     0        2    5      8       261         1     -1         0         3  0
1   44    9        2          1        0       29        1     0        2    5      8       151         1     -1         0         3  0
2   33    2        1          1        0        2        1     1        2    5      8        76         1     -1         0         3  0
3   47    1        1          3        0     1506        1     0        2    5      8        92         1     -1         0         3  0
4   33   11        2          3        0        1        0     0        2    5      8       198         1     -1         0         3  0
In [15]:
# Split the encoded data into independent variables (x1) and the dependent variable (y1).
# Independent variables
x1 = df.drop(['y'], axis=1)
x1.head()
Out[15]:
   age  job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome
0   58    4        1          2        0     2143        1     0        2    5      8       261         1     -1         0         3
1   44    9        2          1        0       29        1     0        2    5      8       151         1     -1         0         3
2   33    2        1          1        0        2        1     1        2    5      8        76         1     -1         0         3
3   47    1        1          3        0     1506        1     0        2    5      8        92         1     -1         0         3
4   33   11        2          3        0        1        0     0        2    5      8       198         1     -1         0         3
In [16]:
#Dependent variable
y1=df.y
y1.head()
Out[16]:
0 0
1 0
2 0
3 0
4 0
In [17]:
# Find the best parameters using hyperparameter tuning
In [18]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
In [19]:
# Find the best parameters.
model_params = {
    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            # n_estimators must be at least 1, so 0 is not a valid candidate
            'n_estimators': [1, 5, 10]
        }
    }
}
In [20]:
scores = []
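The code that fills scores is not included in this export. A minimal sketch of the usual GridSearchCV pattern for a parameter dictionary of this shape, assuming 5-fold cross-validation and a synthetic dataset in place of x1 and y1; the loop body is an assumption, not the author's code:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded features and target.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

model_params = {
    "random_forest": {
        "model": RandomForestClassifier(random_state=0),
        "params": {"n_estimators": [1, 5, 10]},
    }
}

scores = []
for model_name, mp in model_params.items():
    # Try every parameter combination with 5-fold cross-validation.
    search = GridSearchCV(mp["model"], mp["params"], cv=5, return_train_score=False)
    search.fit(X_demo, y_demo)
    scores.append({
        "model": model_name,
        "best_score": search.best_score_,
        "best_params": search.best_params_,
    })

results = pd.DataFrame(scores, columns=["model", "best_score", "best_params"])
print(results)
```

Each entry in scores records the best mean cross-validated accuracy and the parameters that produced it, which is the shape the following DataFrame call expects.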
In [21]:
df1 = pd.DataFrame(scores)
df1
Out[21]:
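Cell In [22], which defines clf, is missing from this export. Out[23] indicates a Pipeline made of a OneHotEncoder followed by a RandomForestRegressor, so one plausible reconstruction is sketched here on a toy frame. The handle_unknown setting, the hyperparameters and the toy rows are all assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows standing in for the raw (string and numeric) feature columns.
x_toy = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "retired"] * 5,
    "housing": ["yes", "no", "yes", "no"] * 5,
    "age": [30, 45, 30, 67] * 5,
})
y_toy = pd.Series([0, 1, 0, 1] * 5)

# One-hot encode every column, then fit a random forest regressor on the 0/1 target.
clf_sketch = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestRegressor(n_estimators=20, random_state=0),
)
clf_sketch.fit(x_toy, y_toy)

# A regressor's score() is R^2; its predictions are averages of 0/1 targets.
r2_on_training_rows = clf_sketch.score(x_toy, y_toy)
preds = clf_sketch.predict(x_toy.head(2))
print(r2_on_training_rows, preds)
```

A pipeline of this kind accepts the raw string columns directly, which would explain why the notebook fits clf on x rather than on the label-encoded x1.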
In [23]:
clf.fit(x,y1)
Out[23]:
Pipeline
OneHotEncoder
RandomForestRegressor
In [24]:
# clf ends in a RandomForestRegressor, so score() returns R^2, here measured on the training data
clf.score(x, y1)
Out[24]:
0.8761215575318506
In [25]:
columns = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan',
'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']
new_data_points = [
[59, 'admin.', 'married', 'secondary', 'no', 2343, 'yes', 'no', 'unknown', 5, 'may', 1042, 1, -1, 0, 'unknown']
]
In [26]:
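# Build a DataFrame from the new data point and predict with the fitted pipeline.
input_df = pd.DataFrame(new_data_points, columns=columns)
prediction = clf.predict(input_df)[0]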
In [27]:
probability_percentage = prediction * 100
print("The probability of this lead converting into a customer is :",probability_percentage,'%')
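Multiplying a regressor's output by 100 only approximates a probability. If a classifier is acceptable in place of the regressor (an assumption about the intended design), predict_proba returns class probabilities directly. A sketch on toy data with made-up rows:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows standing in for the raw feature columns and the yes/no target.
x_toy = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "retired"] * 5,
    "contact": ["cellular", "unknown", "cellular", "telephone"] * 5,
})
y_toy = pd.Series(["no", "yes", "no", "yes"] * 5)

clf_proba = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(n_estimators=20, random_state=0),
)
clf_proba.fit(x_toy, y_toy)

new_lead = pd.DataFrame([["retired", "telephone"]], columns=["job", "contact"])

# Columns of predict_proba follow clf_proba.classes_; pick the "yes" column.
yes_index = list(clf_proba.classes_).index("yes")
probability = clf_proba.predict_proba(new_lead)[0][yes_index]
print(f"Estimated probability of conversion: {probability * 100:.1f}%")
```

Looking up the "yes" column through classes_ avoids relying on the order in which the labels happen to be sorted.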