Data Science Process Alliance CRISP DM For Data Science
CRISP-DM FOR
DATA SCIENCE
Executive Summary
What is CRISP-DM?
Published in 1999, the CRoss Industry
Standard Process for Data Mining (CRISP-DM)
is the most popular framework for executing
data science projects. It provides a natural
description of a data science life cycle (the
workflow in data-focused projects).
Six Phases
1. Business understanding
What does the business need?
2. Data understanding
What data do we have / need? Is it clean?
3. Data preparation
How do we organize the data for modeling?
4. Modeling
What modeling techniques should we apply?
5. Evaluation
What best meets the business objectives?
6. Deployment
How do stakeholders access the results?
Reviewing CRISP-DM
Diving into the CRISP-DM Phases
I. Business Understanding
The Business Understanding phase focuses on understanding the objectives and requirements of the project.
While many teams hurry through this phase, establishing a strong business understanding is like building the
foundation of a house – absolutely essential. Aside from the third task, the three other tasks in this phase are
foundational project management activities that are universal to most projects:
1. Determine business objectives: Understand what the customer / client is trying to
achieve, including the business success criteria.
2. Assess situation: Determine resource availability and project requirements, assess
risks and contingencies, and conduct a cost-benefit analysis.
3. Determine project goals: In addition to defining the business objectives, you should
also define what success looks like from a technical data mining perspective.
4. Produce project plan: Select technologies and tools and define detailed plans for
each project phase.
III. Data Preparation
1. Select data: Determine which data sets will be used and document reasons
for inclusion/exclusion.
2. Clean data: Often this is the lengthiest task. Without it, you’ll likely fall victim to
garbage-in, garbage-out. A common practice during this task is to correct,
impute, or remove erroneous values.
3. Construct data: Derive new attributes that will be helpful. For example, derive
someone’s body mass index from height and weight fields.
4. Integrate data: Create new data sets by combining data from multiple
sources.
5. Format data: Re-format data as necessary. For example, you might convert
string values that store numbers to numeric values so that you can perform
mathematical operations.
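The clean, construct, and format tasks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed CRISP-DM implementation: the records, the field names (height_cm, weight_kg), the helper names (to_float, prepare), and the choice of mean imputation are all assumptions made for the example. The integrate task is left out because it depends on the sources involved.

```python
from typing import Optional

# Hypothetical raw records: numbers arrive as strings and some values are bad.
raw_records = [
    {"id": "1", "height_cm": "180", "weight_kg": "81"},
    {"id": "2", "height_cm": "165", "weight_kg": ""},     # missing weight
    {"id": "3", "height_cm": "-170", "weight_kg": "70"},  # impossible height
    {"id": "4", "height_cm": "150", "weight_kg": "45"},
]


def to_float(text: str) -> Optional[float]:
    """Format data: convert a numeric string to float, or None if unparseable."""
    try:
        return float(text)
    except ValueError:
        return None


def prepare(records):
    # Format: string fields become numbers (or None when they cannot be parsed).
    rows = [
        {
            "id": int(r["id"]),
            "height_cm": to_float(r["height_cm"]),
            "weight_kg": to_float(r["weight_kg"]),
        }
        for r in records
    ]
    # Clean (remove): drop rows whose height is missing or not positive.
    rows = [r for r in rows if r["height_cm"] is not None and r["height_cm"] > 0]
    # Clean (impute): fill missing weights with the mean of the known weights.
    known = [r["weight_kg"] for r in rows if r["weight_kg"] is not None]
    mean_weight = sum(known) / len(known)
    for r in rows:
        if r["weight_kg"] is None:
            r["weight_kg"] = mean_weight
    # Construct: derive body mass index = weight (kg) / height (m) squared.
    for r in rows:
        r["bmi"] = r["weight_kg"] / (r["height_cm"] / 100) ** 2
    return rows


prepared = prepare(raw_records)
```

Whether to impute or remove an erroneous value is a project decision. Here the impossible height is removed and the missing weight is imputed only to show both options side by side.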
IV. Modeling
Modeling is often regarded as data science’s most exciting work. In this phase, the team builds and assesses
various models, often using several different modeling techniques. Although the CRISP-DM guide suggests
that teams “iterate model building and assessment until you strongly believe that you have found the best model(s)”, in
practice teams might iterate only until they have a “good enough” model. This phase has four tasks:
1. Select modeling techniques: Determine which algorithms to try (e.g. regression,
neural net).
2. Generate test design: Depending on your modeling approach, you might need to split the
data into training, test, and validation sets.
3. Build model: As glamorous as this might sound, this might just be executing a few
lines of code like “reg = LinearRegression().fit(X, y)”.
4. Assess model: Generally, multiple models are competing against each other, and
the data scientist needs to interpret the model results based on domain knowledge,
the pre-defined success criteria, and the test design.
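The four tasks above can be walked through end to end with a small standard-library sketch. Everything specific here is an assumption for illustration: the synthetic data, the 80/20 split (a validation set is omitted for brevity), the two candidate techniques, and mean squared error as the assessment metric. A real project would typically use a library such as scikit-learn and its own success criteria.

```python
import random


def fit_line(xs, ys):
    """Build model: ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def mse(model, xs, ys):
    """Assess model: mean squared error on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (illustrative assumption): y = 3x + 2 plus Gaussian noise.
rng = random.Random(0)
data = [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in range(50)]

# Generate test design: shuffle, then hold out 20% as a test set.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

# Select modeling techniques: two competing candidates.
candidates = {"linear": fit_line, "mean_baseline": fit_mean}

# Build each model on the training set and assess it on the held-out test set.
scores = {
    name: mse(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best = min(scores, key=scores.get)
```

Including a trivial baseline is a design choice: it gives the Assess model task a reference point, so a "good enough" model can be judged against something concrete.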
V. Evaluation
Whereas the Assess Model task of the Modeling phase focuses on technical model assessment, the Evaluation
phase looks more broadly at which model best meets the business needs and what to do next. This phase has three
tasks:
1. Evaluate results: Do the models meet the business success criteria?
Which one(s) should we approve for the business?
2. Review process: Review the work accomplished. Was anything
overlooked? Were all steps properly executed? Summarize findings
and correct anything if needed.
3. Determine next steps: Based on the previous two tasks, determine
whether to proceed to deployment, iterate further, or initiate new
projects.
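The difference between technical assessment and business evaluation can be made concrete with a short sketch. The criteria below (an error ceiling and a monthly cost ceiling), the Candidate fields, and the function names are hypothetical. In practice they would come from the business success criteria agreed during Business Understanding.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    test_error: float        # technical result from the Assess model task
    monthly_cost_usd: float  # hypothetical running cost of the model


# Hypothetical business success criteria.
MAX_ERROR = 5.0
MAX_MONTHLY_COST_USD = 1000.0


def meets_business_criteria(c: Candidate) -> bool:
    """Evaluate results: does this model satisfy every business criterion?"""
    return c.test_error <= MAX_ERROR and c.monthly_cost_usd <= MAX_MONTHLY_COST_USD


def next_step(candidates):
    """Determine next steps: deploy the best approved model, or iterate."""
    approved = [c for c in candidates if meets_business_criteria(c)]
    if not approved:
        return "iterate further", None
    return "proceed to deployment", min(approved, key=lambda c: c.test_error)


candidates = [
    Candidate("neural_net", test_error=2.1, monthly_cost_usd=4000.0),
    Candidate("regression", test_error=3.4, monthly_cost_usd=200.0),
    Candidate("baseline", test_error=9.8, monthly_cost_usd=50.0),
]
decision, chosen = next_step(candidates)
```

In this example the most accurate model is not the one approved: the neural net exceeds the cost ceiling, so the regression model is selected for deployment.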
VI. Deployment
A model is not particularly useful unless the customer can access its results. So, think of deployment in terms
of what it takes to actually use the results of the project. Depending on the project, this can be as
simple as sharing a report or as complex as implementing a live real-time predictive model. This final phase has
four tasks:
1. Plan deployment: Develop and document a plan for deploying the model.
2. Plan monitoring and maintenance: Develop a thorough monitoring and
maintenance plan to avoid issues during the operational phase (or post-project
phase) of a model.
3. Produce final report: The project team documents a summary of the project
which might include a final presentation of data mining results.
4. Review project: Conduct a project retrospective about what went well, what
could have been better, and how to improve in the future.
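One small piece of a monitoring and maintenance plan could look like the check below. It is only a sketch: the function name, the use of mean absolute error, and the 1.5x tolerance are assumptions for illustration, and a real plan would also cover data quality, alerting, and retraining.

```python
from statistics import mean


def needs_attention(recent_errors, baseline_error, tolerance=1.5):
    """Flag the model when recent mean absolute error exceeds tolerance * baseline.

    recent_errors: prediction errors (predicted - actual) observed in production.
    baseline_error: mean absolute error measured before deployment.
    """
    return mean(abs(e) for e in recent_errors) > tolerance * baseline_error


# Errors close to the pre-deployment baseline: no action needed.
stable = needs_attention([1.0, -1.2, 0.8], baseline_error=1.0)

# Errors well above the baseline: revisit the model (loop back to earlier phases).
degraded = needs_attention([3.0, -2.5, 2.0], baseline_error=1.0)
```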
Analyzing CRISP-DM
Strengths and Weaknesses
Strengths & Benefits
Key Strengths:
Common sense steps
Easy to understand
Defines a shared vocabulary for the steps in a project
Common Sense: Data scientists naturally follow a CRISP-DM-like
process. When people are asked to do a data science project without
project management direction, they tend toward a CRISP-like
methodology and can easily identify with the CRISP-DM phases and
doing iterations.
Cyclical: CRISP-DM can support the iterative nature of data science
(but how to actually do iterations is not defined).
Adopt-able: CRISP-DM can be implemented without much training,
organizational role changes, or controversy.
Right Start: The initial focus on Business Understanding, an often-
overlooked step, is helpful to align technical work with business
needs and to steer data scientists away from jumping into a problem
without properly understanding business objectives.
Flexible: A loose CRISP-DM implementation can be flexible to
provide many of the benefits of agile principles and practices. By
accepting that a project starts with significant unknowns, the user can
cycle through steps, each time gaining a deeper understanding of the
data and the problem. The empirical knowledge learned from
previous cycles can then feed into the following cycles.
Going Forward
Key Actions to Consider
1. Combine with a team coordination process
There needs to be a mechanism for the team to communicate and prioritize work.
The team process should define how the team communicates, prioritizes tasks and
“loops back” to previous project phases.
Teams can leverage the CRISP-DM phases, and then use a framework such as
Scrum, Kanban or Data Driven Scrum to prioritize potential tasks.
5. Add phases (if needed) and define the subitems within each phase
Add steps or phases for practices like Git version control and MLOps.
Be clear how tasks (within a phase) are defined.
Some tasks that should be explicitly discussed include: bias checks, accuracy
assessments, business validation, and dev discussions.
But there is a better way, which is why the Data Science Team™ teaches leaders, teams, and
organizations to apply effective agile principles to data science projects so that they can deliver better data
science outcomes.
www.datascience-pm.com
© Data Science Process Alliance 2022
info@datascience-pm.com