FORM NO.
F/ TL / 024
Rev.00 Date 20.03.2020
Customer Segmentation using
K-Means Algorithm
MINI PROJECT REPORT
submitted in partial fulfilment of the requirements
for the award of the degree in
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
AKASH S 211191101006
AKASH V 211191101007
GURUCHANDRAN R 211191101044
DEPARTMENT
OF
COMPUTER SCIENCE AND ENGINEERING
MAY 2024
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of Mr. AKASH S,
Reg. No 211191101006, Mr. AKASH V, Reg. No 211191101007, and Mr. GURUCHANDRAN R,
Reg. No 211191101044, who carried out the project entitled
Customer Segmentation using K-Means Algorithm under our supervision from January
2024 to May 2024.
PROJECT COORDINATOR 1
Mrs. NALINI POORNIMA
Assistant Professor
Department of CSE
Dr. MGR Educational & Research Institute
PROJECT COORDINATOR 2
Dr. USHA
Professor
Department of CSE
Dr. MGR Educational & Research Institute
HOD
Dr. S. GEETHA
Professor & Head of Dept
Department of CSE
Dr. MGR Educational & Research Institute
Submitted for Viva Voce Examination held on -------------------------
Internal Examiner External Examiner
DECLARATION
We, Mr. AKASH S, Reg. No 211191101006, Mr. AKASH V,
Reg. No 211191101007, and Mr. GURUCHANDRAN R, Reg. No 211191101044, hereby
declare that the Mini Project Report entitled “Customer Segmentation using K-Means
Algorithm” was done by us under the guidance of Mrs. NALINI POORNIMA and
Dr. USHA and is submitted in partial fulfilment of the requirements for the award of
the degree in BACHELOR OF TECHNOLOGY IN COMPUTER SCIENCE AND
ENGINEERING.
DATE:
PLACE:
1)
2)
3)
SIGNATURE OF THE CANDIDATES
ACKNOWLEDGEMENT
We would like to thank our beloved Chancellor Thiru. Dr. A.C. SHANMUGAM, B.A., B.L.,
President Er. A.C.S. ARUN KUMAR, B.TECH., and Secretary Thiru A. RAVIKUMAR
for all the encouragement and support extended to us during the tenure of this project and
our years of study at this wonderful University.
We express our heartfelt thanks to our Vice Chancellor Dr. GEETHALAKSHMI for
providing all the support for our project.
We express our heartfelt thanks to our Head of the Department, Prof. Dr. S. GEETHA,
who has been actively involved and very influential from the start to the completion of our
project.
Our sincere thanks to our Project Coordinators Mrs. NALINI POORNIMA &
Dr. USHA for their continuous guidance and encouragement throughout this work, which
has made the project a success.
We would also like to thank all the teaching and non-teaching staff of the Computer Science
and Engineering department for their constant support and the encouragement given to us
while we worked towards our project goals.
CONTENTS
CHAPTER TITLE
1 ABSTRACT
2 INTRODUCTION
3 PROBLEM DEFINITION
4 OBJECTIVE OF THE PROJECT
5 LITERATURE SURVEY
6 REQUIREMENT ANALYSIS
  Data Requirements
  Algorithmic Requirements
  Domain Knowledge
  SOFTWARE/HARDWARE
7 DESIGN
  UML DIAGRAM
  FLOW CHART
8 IMPLEMENTATION
  MODULES DESCRIPTION
9 SAMPLE CODE AND OUTPUT
10 CONCLUSION
11 REFERENCES/BIBLIOGRAPHY
CHAPTER 1- ABSTRACT
Customer segmentation is a critical task in marketing and business strategy, aiming to divide a
heterogeneous customer base into more homogeneous groups based on similar characteristics and
behaviors. Among various clustering techniques, the K-Means algorithm stands out for its simplicity
and efficiency in handling large datasets.
This study explores the application of the K-Means algorithm for customer segmentation, utilizing a
dataset containing diverse customer attributes such as demographics, purchasing history, and
behavioral patterns. The objective is to identify distinct customer segments that can aid in targeted
marketing campaigns, personalized product recommendations, and improved customer satisfaction.
The K-Means algorithm iteratively partitions the customer data into K clusters, where each cluster
represents a group of customers with similar traits. Through an iterative process of assigning data
points to clusters and updating cluster centroids, K-Means minimizes the within-cluster variance,
effectively grouping customers into clusters that maximize homogeneity within each group.
The effectiveness of the K-Means algorithm in customer segmentation is evaluated through metrics
such as silhouette score, within-cluster sum of squares, and visual inspection of cluster centroids.
The resulting segmentation enables businesses to gain insights into customer preferences, identify
high-value customer segments, and tailor marketing strategies to meet the specific needs of each
segment.
CHAPTER 2- INTRODUCTION
In the dynamic landscape of modern business, understanding and catering to the diverse needs and
preferences of customers is paramount for success. Customer segmentation, the process of dividing a
customer base into distinct groups based on common characteristics, behaviors, or other attributes, has
emerged as a fundamental strategy for businesses seeking to enhance marketing effectiveness, drive
customer engagement, and maximize profitability.
Traditionally, businesses have relied on demographic data such as age, gender, and income to segment
their customer base. However, with the advent of big data and advanced analytics techniques,
businesses now have access to a wealth of customer information, ranging from purchasing history and
online behavior to social media interactions and geographic location. Leveraging this data to gain
deeper insights into customer behavior and preferences has become increasingly important in today's
competitive market landscape.
One powerful technique for customer segmentation is the K-Means algorithm, a popular clustering
method in machine learning and data mining. K-Means is a simple yet effective unsupervised learning
algorithm that partitions a dataset into K clusters, with each cluster represented by a centroid that
minimizes the within-cluster sum of squares.
The application of the K-Means algorithm in customer segmentation offers several advantages. It
allows businesses to identify distinct groups of customers with similar traits or behaviors, enabling
targeted marketing campaigns, personalized product recommendations, and tailored customer
experiences.
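As a rough illustration of the partitioning described above, the following sketch implements the two alternating steps of K-Means (assignment and centroid update) with NumPy. It is a minimal teaching version under stated assumptions, not the code used in this project: the function name `kmeans`, its parameters, and the synthetic two-group data are all hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Recompute the assignment for the final centroids and sum the squared
    # distances, which gives the within-cluster sum of squares (WCSS).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    wcss = dists.min(axis=1).sum()
    return labels, centroids, wcss


# Synthetic data: two well-separated groups of 50 points each.
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])
labels, centroids, wcss = kmeans(X_demo, k=2)
```

In practice a library implementation such as scikit-learn's `KMeans` is preferable, since it adds better initialisation and multiple restarts; the sketch only makes the iteration visible.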
CHAPTER 3- PROJECT DEFINITION
In the realm of marketing and business strategy, customer segmentation plays a pivotal role in
understanding and targeting diverse customer groups effectively. By dividing a customer base into
distinct segments based on shared characteristics or behaviors, businesses can tailor their marketing
efforts and product offerings to meet the specific needs and preferences of each segment. One
powerful technique for customer segmentation is the K-Means algorithm, which partitions data into
clusters based on similarity.
OUTCOME OF THE PROJECT
Segmented Customer Groups:
The primary outcome of the project will be the segmentation of the customer dataset into
distinct clusters using the K-Means algorithm. Each cluster will represent a group of customers
with similar characteristics, behaviors, or preferences. The segmentation will provide
businesses with valuable insights into the diversity of their customer base and help identify key
customer segments.
Customer Profiles and Characteristics:
Alongside the segmented clusters, the project will generate customer profiles for each
identified segment. These profiles will include key demographic information, purchasing
behavior, interaction patterns, and any other relevant attributes that define each customer
segment. Understanding the unique characteristics of each segment will enable businesses to
tailor their marketing strategies and product offerings accordingly.
Insights and Recommendations:
The segmented customer groups and their corresponding profiles will yield actionable insights and
recommendations for businesses. These insights may include:
Identification of high-value customer segments: Highlighting segments with the highest
potential for profitability or growth.
Targeted marketing strategies: Recommending specific marketing channels, messaging,
and promotions tailored to each customer segment.
Product customization and innovation: Suggesting product features or enhancements to
better meet the needs of different customer segments.
Customer retention and loyalty initiatives: Proposing strategies to enhance customer
satisfaction and loyalty within each segment.
Evaluation Metrics:
The effectiveness of the K-Means algorithm in customer segmentation will be assessed using
various evaluation metrics, including within-cluster sum of squares, silhouette score, and visual
inspection of cluster centroids. These metrics will help validate the quality and coherence of
the segmented clusters and ensure that they accurately represent distinct customer groups.
Visualization and Reporting:
The project outcomes will be communicated through visualizations and reports, providing clear
and concise summaries of the segmented customer groups, their characteristics, and the
associated insights and recommendations. Visualizations such as cluster plots, demographic
distributions, and dendrograms (where hierarchical clustering is also applied) will aid in
conveying the results effectively to business stakeholders.
Business Impact:
Ultimately, the outcome of the project will have a tangible impact on business decision-
making and strategy formulation. By leveraging the insights gained from customer
segmentation using the K-Means algorithm, businesses will be better equipped to optimize
resource allocation, enhance customer engagement, and drive overall growth and profitability.
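The segment profiles described above can be derived with a simple aggregation once every customer carries a cluster label. The sketch below is illustrative only: the table is synthetic, the column names (`age`, `annual_income`, `spending_score`) and the choice of four clusters are assumptions, and the "high-value" rule is a placeholder rather than a business definition.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a customer table; the columns are hypothetical.
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "annual_income": rng.normal(60_000, 15_000, size=200).round(),
    "spending_score": rng.integers(1, 101, size=200),
})

features = ["age", "annual_income", "spending_score"]
X = StandardScaler().fit_transform(customers[features])
customers["segment"] = KMeans(
    n_clusters=4, n_init=10, random_state=0
).fit_predict(X)

# One row per segment: the mean of each attribute plus the segment size.
profile = customers.groupby("segment")[features].mean().round(1)
profile["n_customers"] = customers.groupby("segment").size()

# Flag the segment with the highest mean spending score as a candidate
# "high-value" group (an illustrative rule only).
high_value_segment = profile["spending_score"].idxmax()
print(profile)
print("Candidate high-value segment:", high_value_segment)
```

A table of this kind gives stakeholders one readable row per segment, which is usually easier to act on than raw cluster labels.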
CHAPTER 4- OBJECTIVE OF THE PROJECT
Identifying Homogeneous Customer Segments:
The primary objective is to partition the customer dataset into distinct clusters based
on similarities in customer attributes, behaviors, or preferences. By doing so, the
project aims to uncover homogeneous customer segments that exhibit similar
characteristics within each cluster.
Understanding Customer Diversity:
Through customer segmentation, the project seeks to gain a deeper understanding of
the diverse customer base. By identifying and delineating different customer
segments, the project aims to capture the heterogeneity of the customer population
and explore the unique characteristics of each segment.
Optimizing Marketing Strategies:
Another objective of the project is to provide actionable insights for businesses to
optimize their marketing strategies. By segmenting customers based on common traits
or behaviors, businesses can tailor their marketing campaigns, messaging, and product
offerings to better resonate with each customer segment, thereby improving overall
marketing effectiveness and ROI.
Enhancing Customer Engagement and Satisfaction:
The project aims to help businesses enhance customer engagement and satisfaction
by delivering more personalized and targeted experiences. By understanding the
preferences and needs of different customer segments, businesses can deliver tailored
communications, recommendations, and services that align with the specific interests
of each segment, ultimately fostering stronger customer relationships.
Driving Business Growth and Profitability:
Ultimately, the overarching objective of the project is to contribute to business growth
and profitability. By leveraging customer segmentation insights to refine marketing
strategies, optimize resource allocation, and improve customer satisfaction, businesses
can enhance customer acquisition, retention, and lifetime value, leading to sustainable
growth and increased profitability over time.
CHAPTER 5- LITERATURE SURVEY
1. "Customer Segmentation in E-Commerce Using K-Means Clustering Algorithm" by R.
Ranjitha and P. G. Dasthagiri:
This study explores the application of the K-Means algorithm for customer segmentation in e-
commerce settings. The authors demonstrate how K-Means clustering can effectively group
customers based on their purchasing behavior and preferences, enabling targeted marketing
campaigns and personalized recommendations.
2. "Customer Segmentation and Profiling based on Online Shopping Behavior using K-
Means Clustering" by S. K. Mishra and A. Gupta:
Mishra and Gupta investigate customer segmentation based on online shopping behavior using
the K-Means algorithm. They analyze various features such as browsing history, purchase
frequency, and product category preferences to divide customers into distinct segments. The
study highlights the importance of personalized marketing strategies derived from K-Means
clustering results.
3. "Application of K-Means Clustering Algorithm in Market Segmentation: A Literature
Review" by V. Rajesh et al.:
This literature review provides a comprehensive overview of the application of the K-Means
algorithm in market segmentation across various industries. The authors summarize key studies
and methodologies used for customer segmentation using K-Means, emphasizing its
effectiveness in identifying homogeneous customer groups and improving marketing
outcomes.
4. "Customer Segmentation for Strategic Marketing Using K-Means Clustering Technique"
by V. Kumar and A. K. Saini:
Kumar and Saini present a case study on customer segmentation for strategic marketing using
the K-Means clustering technique. They demonstrate how K-Means clustering can assist
businesses in identifying market segments with distinct needs and preferences, allowing for
targeted marketing strategies tailored to each segment's requirements.
CHAPTER 6- REQUIREMENT ANALYSIS
1. Data Requirements:
• Availability of a comprehensive customer dataset containing relevant attributes such as
demographics, purchasing behavior, browsing history, and other pertinent variables.
• Sufficient data volume to ensure representative sampling and meaningful segmentation.
• Data quality assurance measures to address issues such as missing values, outliers, and
data inconsistencies.
2. Algorithmic Requirements:
• Implementation of the K-Means clustering algorithm to partition the customer dataset
into distinct segments.
• Selection of an appropriate distance metric (e.g., Euclidean distance) for measuring
similarity between data points.
• Determination of the optimal number of clusters (K) using techniques such as the elbow
method, silhouette analysis, or domain knowledge.
3. Software and Tools:
• Utilization of programming languages and libraries suitable for data analysis and
machine learning, such as Python with scikit-learn, pandas, and NumPy.
• Availability of computational resources for processing large datasets and running
iterative clustering algorithms.
• Visualization tools for generating cluster plots and other visualizations (including
dendrograms where hierarchical clustering is also applied) to interpret segmentation
results effectively.
4. Domain Knowledge:
• Understanding of the business domain and domain-specific factors that may influence
customer behavior and segmentation.
• Collaboration with domain experts and stakeholders to identify relevant customer
attributes and segmentation criteria.
• Incorporation of domain knowledge into the interpretation and validation of
segmentation results to ensure actionable insights.
5. Evaluation Metrics:
• Selection of appropriate evaluation metrics to assess the quality of the segmented
clusters, such as within-cluster sum of squares, silhouette score, or Davies–Bouldin
index.
• Interpretation of evaluation metrics to validate the coherence and separation of the
identified customer segments.
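The requirements above on choosing K and on evaluation metrics can be sketched together: fit K-Means for a range of K values and record the within-cluster sum of squares (for the elbow method), the silhouette score, and the Davies–Bouldin index. This is a hedged example on synthetic data with four known groups; the candidate range 2 to 8 and the rule of picking the K with the highest silhouette score are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data with four well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.6,
    random_state=1,
)

results = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        # Within-cluster sum of squares: plot against k to look for an elbow.
        "wcss": model.inertia_,
        # Silhouette score lies in [-1, 1]; higher means better separation.
        "silhouette": silhouette_score(X, model.labels_),
        # Davies-Bouldin index: lower means more compact, distinct clusters.
        "davies_bouldin": davies_bouldin_score(X, model.labels_),
    }

# One simple selection rule: take the k with the highest silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])

for k, scores in results.items():
    print(k, {name: round(value, 3) for name, value in scores.items()})
print("Selected K:", best_k)
```

The three metrics do not always agree on real customer data, so domain knowledge remains a legitimate tie-breaker when selecting K.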
SOFTWARE/HARDWARE REQUIREMENT:
PROJECT CATEGORY:
Data Science / Machine Learning
LANGUAGES:
PYTHON
LIBRARIES USED:
1. scikit-learn: This is a powerful machine learning library in Python that provides various tools
for data mining and analysis. It includes efficient implementations of the K-Means algorithm
and other clustering techniques, along with functions for data preprocessing, evaluation
metrics, and model evaluation.
2. pandas: Pandas is a popular library for data manipulation and analysis in Python. It provides
data structures like DataFrame and Series, which are essential for handling structured data such
as customer datasets. Pandas offers functionalities for data cleaning, transformation, and
aggregation, which are often required in customer segmentation projects.
3. NumPy: NumPy is a fundamental library for numerical computing in Python. It provides
support for large, multi-dimensional arrays and matrices, along with a collection of
mathematical functions to operate on these arrays efficiently. NumPy is frequently used
alongside pandas for data manipulation tasks and in the implementation of machine learning
algorithms.
4. matplotlib: Matplotlib is a versatile library for creating static, interactive, and animated
visualizations in Python. It offers a wide range of plotting functions and styles for generating
plots, charts, histograms, and more. Matplotlib is commonly used to visualize the results of
customer segmentation, such as cluster plots and distribution plots.
DEVELOPMENT PLATFORM:
Google Colab
Local System
DATASET USED:
Offline Global Datas
TOOLS USED:
Editor Used:
• Google Colab
• VS Code
Operating System:
• Windows 11
Hardware Used:
• Local system:
• Processor: AMD Ryzen 5 5600H 3.30 GHz
• RAM : 16GB
• Storage: 512GB SSD
• Google Colab:
• Tesla T4 GPU
CHAPTER 7- DESIGN
UML DIAGRAM
FLOW CHART
CHAPTER 8- IMPLEMENTATION
1. Data Preprocessing:
• Load the customer dataset into memory using a suitable data manipulation library like
pandas.
• Perform data cleaning to handle missing values, outliers, and inconsistencies.
• Encode categorical variables if necessary using techniques like one-hot encoding.
• Scale numerical features to ensure that all variables contribute equally to the clustering
process.
2. Model Development:
• Split the preprocessed data into training and testing sets if evaluation on unseen data is
required.
• Implement the K-Means algorithm using a machine learning library like scikit-learn.
• Determine the optimal number of clusters (K) using techniques such as the elbow
method or silhouette analysis.
• Train the K-Means model on the preprocessed data to partition customers into K
clusters.
3. Evaluation:
• Evaluate the quality of the clustering results using appropriate evaluation metrics.
Common metrics for K-Means clustering include the silhouette score, within-cluster
sum of squares, and Davies–Bouldin index.
• Assess the cohesion and separation of the clusters to ensure that they are meaningful and
distinct.
4. Interpretation and Visualization:
• Interpret the segmented clusters by analyzing the characteristics and behaviors of
customers within each cluster.
• Generate visualizations such as scatter plots and cluster plots (and dendrograms where
hierarchical clustering is also applied) to visualize the segmented clusters and their
relationships.
• Identify key customer segments based on their attributes and behaviors, such as
high-value customers, loyal customers, or churn-risk customers.
5. Business Insights and Recommendations:
• Derive actionable insights from the segmented clusters to inform strategic
decision-making and marketing initiatives.
• Develop targeted marketing strategies, product recommendations, and customer
engagement initiatives tailored to each customer segment.
• Monitor the performance of implemented strategies and iterate as necessary based on
feedback and changing market dynamics.
6. Documentation and Reporting:
• Document the implementation process, including data preprocessing steps, model
parameters, evaluation results, and insights derived from the segmented clusters.
• Prepare a summary report or presentation to communicate the findings and
recommendations to relevant stakeholders within the organization.
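The preprocessing and model development steps listed above can be combined into a single scikit-learn pipeline. The following is a minimal sketch under stated assumptions, not the project's actual code: the table is synthetic (a real run would load the dataset with `pd.read_csv`), the column names are hypothetical, and median imputation and three clusters are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic customer table with hypothetical columns.
rng = np.random.default_rng(7)
n = 150
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "annual_income": rng.normal(55_000, 12_000, size=n),
    "gender": rng.choice(["F", "M"], size=n),
})
df.loc[::25, "annual_income"] = np.nan  # simulate missing values

numeric_cols = ["age", "annual_income"]
categorical_cols = ["gender"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardise.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical columns: one-hot encode.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

segmenter = Pipeline([
    ("preprocess", preprocess),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# Fit the whole pipeline and attach a cluster label to every customer.
df["segment"] = segmenter.fit_predict(df[numeric_cols + categorical_cols])
print(df["segment"].value_counts().sort_index())
```

Keeping the preprocessing inside the pipeline means new customers can later be assigned to a segment with `segmenter.predict(...)` using exactly the same imputation, encoding, and scaling.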
MODULES DESCRIPTION:
1. Data Preprocessing Module:
Description: This module is responsible for preparing the customer data for clustering. It
includes tasks such as data cleaning, handling missing values, encoding categorical
variables, and scaling numerical features.
Functions:
Data cleaning functions
Missing value imputation
Encoding categorical variables
Feature scaling
2. Model Training Module:
Description: This module encompasses the implementation and training of the K-Means
clustering algorithm on the preprocessed customer data.
Functions:
K-Means algorithm implementation
Determining the optimal number of clusters (K)
Training the K-Means model on the preprocessed data
3. Evaluation Module:
Description: This module evaluates the quality of the clustering results obtained from
the K-Means model. It computes evaluation metrics to assess the cohesion and
separation of the clusters.
Functions:
Computing evaluation metrics such as silhouette score, within-cluster sum of
squares, and Davies–Bouldin index
Visualizing evaluation metrics to aid in model selection and interpretation
4. Visualization Module:
Description: This module generates visualizations to interpret and communicate the
segmented customer groups and their characteristics.
Functions:
Scatter plots to visualize clusters in two or three dimensions
Cluster plots to display the centroids and boundaries of the clusters
Dendrogram diagrams for hierarchical clustering (if applicable)
5. Insights and Recommendations Module:
Description: This module derives actionable insights from the segmented clusters and
provides recommendations for marketing strategies, product recommendations, and
customer engagement initiatives.
Functions:
Analyzing cluster characteristics and customer behaviors
Identifying key customer segments and their needs
Developing targeted marketing strategies and recommendations based on cluster
insights
6. Reporting Module:
Description: This module is responsible for documenting the implementation process,
including data preprocessing steps, model parameters, evaluation results, insights
derived from the segmented clusters, and recommendations.
Functions:
Generating summary reports or presentations for communicating findings to
stakeholders
Documenting code, methodology, and results for reproducibility and future
reference
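As one possible shape for the Visualization Module described above, the sketch below draws a two-dimensional scatter plot of the clusters and overlays the centroids with matplotlib. The data are synthetic, and the axis labels, output file name `segments.png`, and the choice of three clusters are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-feature data with three groups (illustration only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [0, 5]],
    cluster_std=0.8,
    random_state=3,
)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

fig, ax = plt.subplots(figsize=(6, 5))
# Customers coloured by their assigned cluster.
ax.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap="viridis", s=15)
# Cluster centroids drawn on top as large red crosses.
ax.scatter(
    model.cluster_centers_[:, 0],
    model.cluster_centers_[:, 1],
    c="red", marker="X", s=150, label="Centroids",
)
ax.set_xlabel("Feature 1 (scaled)")
ax.set_ylabel("Feature 2 (scaled)")
ax.set_title("K-Means customer segments")
ax.legend()
fig.savefig("segments.png", dpi=150)
```

With more than two features, a common approach is to plot two selected attributes at a time or to project the data to two dimensions (for example with PCA) before plotting.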
CHAPTER 9- SAMPLE CODE AND OUTPUT
CHAPTER 10- CONCLUSION
In conclusion, customer segmentation using the K-Means algorithm is a powerful technique for
businesses to gain insights into their customer base, tailor marketing strategies, and drive growth.
Through this project, we have successfully leveraged the K-Means algorithm to partition the customer
dataset into distinct clusters based on similarities in customer attributes, behaviors, or preferences.
The segmented clusters provide valuable insights into the diverse nature of the customer population,
allowing businesses to identify key customer segments and understand their unique characteristics and
needs. By analyzing these segments, businesses can develop targeted marketing strategies, product
recommendations, and customer engagement initiatives to enhance customer satisfaction and drive
business growth.
Through the implementation of the project, we have demonstrated the effectiveness of the K-Means
algorithm in customer segmentation and its potential to unlock actionable insights for strategic
decision-making. By combining data preprocessing, model development, evaluation, and visualization
techniques, we have successfully derived meaningful segmentation results and communicated them
effectively to stakeholders.
Moving forward, businesses can use the insights gained from customer segmentation to refine their
marketing strategies, optimize resource allocation, and foster stronger customer relationships.
Continuous monitoring and iteration based on feedback and changing market dynamics will ensure
that businesses remain responsive to evolving customer needs and preferences, driving sustained
growth and competitiveness in the marketplace.
CHAPTER 11- REFERENCES/BIBLIOGRAPHY
I. Jain, A.K., Murty, M.N., & Flynn, P.J. (1999). Data clustering: A review. ACM Computing
Surveys (CSUR), 31(3), 264-323. DOI: 10.1145/331499.331504
II. MacQueen, J. (1967). Some methods for classification and analysis of multivariate
observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability, Volume 1: Statistics (pp. 281-297). University of California Press.
III. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering
method for very large databases. In Proceedings of the 1996 ACM SIGMOD International
Conference on Management of Data (pp. 103-114).
IV. Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques.
Journal of Intelligent Information Systems, 17(2-3), 107-145. DOI: 10.1023/A:1012801612483
V. Gan, G., Ma, C., & Wu, J. (2007). Data clustering: Theory, algorithms, and applications. SIAM.