Data Analytics Lifecycle
Annemudya Yolanda
Most of these roles are not new, but the last two roles, data engineer and data scientist, have become popular and in high demand as interest in Big Data has grown.
Key Roles for a Successful Analytics Project
Each plays a critical part in a successful analytics project. Although seven roles are listed, fewer or more people can accomplish the work depending on the scope of the project, the organizational structure, and the skills of the participants. For example, on a small, versatile team, these seven roles may be fulfilled by only three people, but a very large project may require 20 or more people.
Business User: Someone who understands the domain area and usually benefits from the results. This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized. Usually a business analyst, line manager, or deep subject matter expert in the project domain fulfills this role.

Project Sponsor: Responsible for the genesis of the project. Provides the impetus and requirements for the project and defines the core business problem. Generally provides the funding and gauges the degree of value from the final outputs of the working team. This person sets the priorities for the project and clarifies the desired outputs.

Business Intelligence Analyst: Provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective. Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds and sources.

Project Manager: Ensures that key milestones and objectives are met on time and at the expected quality.
Database Administrator (DBA): Provisions and configures the database environment to support the analytics needs of the working team. These responsibilities may include providing access to key databases or tables and ensuring the appropriate security levels are in place related to the data repositories.

Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox. Whereas the DBA sets up and configures the databases to be used, the data engineer executes the actual data extractions and performs substantial data manipulation to facilitate the analytics. The data engineer works closely with the data scientist to help shape data in the right ways for analyses.

Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems. Ensures overall analytics objectives are met. Designs and executes analytical methods and approaches with the data available to the project.
Overview of
Data Analytics Lifecycle
The Data Analytics Lifecycle defines analytics process best practices spanning
discovery to project completion. The lifecycle draws from established methods in the realm
of data analytics and decision science. This synthesis was developed after gathering input
from data scientists and consulting established approaches that provided input on pieces
of the process.
The Data Analytics Lifecycle has six phases. Teams commonly learn new things in a phase that
cause them to go back and refine the work done in prior phases based on new insights and
information that have been uncovered.
Note that these phases do not represent
formal stage gates; rather, they serve as
criteria to help test whether it makes
sense to stay in the current phase or
move to the next.
Phase 1: Discovery
The team learns the business domain, including
relevant history such as whether the organization
or business unit has attempted similar projects in
the past from which they can learn. The team
assesses the resources available to support the
project in terms of people, technology, time, and
data. Important activities in this phase include
framing the business problem as an analytics
challenge that can be addressed in subsequent
phases and formulating initial hypotheses (IHs) to
test and begin learning the data.
Phase 2: Data Preparation
Phase 2 requires the presence of an analytic
sandbox, in which the team can work with data and
perform analytics for the duration of the project.
The team needs to execute extract, load, and
transform (ELT) or extract, transform and load
(ETL) to get data into the sandbox. ELT and ETL are sometimes abbreviated together as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
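As a minimal sketch of the ETL step into a sandbox, the following plain-Python example uses an in-memory SQLite database as a stand-in for the analytic sandbox; the table name, source rows, and unit conversion are illustrative assumptions, not part of any specific project:

```python
import sqlite3

# Hypothetical raw extract: (sensor_id, temp_f) rows pulled from a source system.
raw_rows = [("s1", 68.0), ("s2", 77.0), ("s3", 86.0)]

# Transform step: convert Fahrenheit to Celsius before loading.
def to_celsius(f):
    return round((f - 32) * 5 / 9, 2)

transformed = [(sid, to_celsius(t)) for sid, t in raw_rows]

# Load into an in-memory SQLite database standing in for the analytic sandbox.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sandbox_readings (sensor_id TEXT, temp_c REAL)")
conn.executemany("INSERT INTO sandbox_readings VALUES (?, ?)", transformed)
conn.commit()

rows = conn.execute(
    "SELECT sensor_id, temp_c FROM sandbox_readings ORDER BY sensor_id"
).fetchall()
```

The same shape applies whether the sandbox is SQLite, a Greenplum cluster, or a data lake: extract, reshape into analyzable form, then load where the team can query it freely.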
Phase 3: Model Planning
Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
Phase 4: Model Building
In Phase 4, the team develops datasets for testing, training, and production purposes. In addition,
in this phase the team builds and executes models based on the work done in the model planning
phase. The team also considers whether its existing tools will suffice for running the models, or if
it will need a more robust environment for executing models and workflows (for example, fast
hardware and parallel processing, if applicable).
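A toy version of this workflow in plain Python: split the data into training and test sets, fit a simple model (ordinary least squares for a line, used here as an illustrative stand-in for whatever model was planned), and evaluate it on the held-out data. The dataset is fabricated:

```python
# Hypothetical dataset: (x, y) pairs following y = 3x + 1, for illustration only.
data = [(x, 3 * x + 1) for x in range(20)]

# Deterministic 75/25 split into training and test sets for clarity.
train, test = data[:15], data[15:]

# "Model building": fit slope and intercept by ordinary least squares on training data.
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

# Evaluate on the held-out test set.
max_err = max(abs((slope * x + intercept) - y) for x, y in test)
```

Real projects would add a validation set, randomized splits, and richer models, but the train/fit/evaluate loop stays the same.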
Phase 5: Communicate Results
In Phase 5, the team, in collaboration
with major stakeholders, determines if
the results of the project are a success
or a failure based on the criteria
developed in Phase 1. The team should
identify key findings, quantify the
business value, and develop a narrative
to summarize and convey findings to
stakeholders.
Phase 6: Operationalize
In Phase 6, the team delivers final
reports, briefings, code, and technical
documents. In addition, the team may
run a pilot project to implement the
models in a production environment.
The main stakeholders of an analytics project and what they usually expect at the conclusion of a project:
Business User typically tries to determine the benefits and implications of the findings
to the business.
Project Sponsor typically asks questions related to the business impact of the project,
the risks and return on investment (ROI), and the way the project can be evangelized
within the organization (and beyond).
Project Manager needs to determine if the project was completed on time and within
budget and how well the goals were met.
Business Intelligence Analyst needs to know if the reports and dashboards he
manages will be impacted and need to change.
Data Engineer and Database Administrator (DBA) typically need to share their code
from the analytics project and create a technical document on how to implement it.
Data Scientist needs to share the code and explain the model to her peers, managers, and other stakeholders.
Although these seven roles represent many interests within
a project, these interests usually overlap, and most of them
can be met with four main deliverables.
Presentation for project sponsors: This contains high-level takeaways for executive
level stakeholders, with a few key messages to aid their decision-making process.
Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp.
Presentation for analysts, which describes business process changes and reporting
changes. Fellow data scientists will want the details and are comfortable with
technical graphs (such as Receiver Operating Characteristic [ROC] curves, density
plots, and histograms).
Code for technical people.
Technical specifications for implementing the code.
Using Big Data to Get Results:
Basic analytics
Basic analytics can be used to explore your data if you're not sure what you have but think something is of value. This might include simple visualizations or simple statistics. Basic analysis is often used when you have large amounts of disparate data.
For example, you might have a scientific data set of water column data from many different locations
that contains numerous variables captured from multiple sensors. Attributes might include
temperature, pressure, transparency, dissolved oxygen, pH, salinity, and so on, collected over time.
You might want some simple graphs or plots that let you explore your data across different dimensions,
such as temperature versus pH or transparency versus salinity. You might want some basic statistics
such as average or range for each attribute, from each height, for the time period.
The point is that you might use this basic type of exploration of the variables to ask specific questions in
your problem space. The difference between this kind of analysis and what happens in a basic business
intelligence system is that you’re dealing with huge volumes of data where you might not know how
much query space you’ll need to examine it and you’re probably going to want to run computations in
real time.
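This kind of basic exploration can be sketched with the standard library alone; the attributes and readings below are invented for illustration:

```python
from statistics import mean

# Hypothetical water-column readings keyed by attribute (illustrative values).
readings = {
    "temperature": [10.2, 10.8, 11.0, 10.5],
    "ph":          [7.9, 8.0, 8.1, 8.0],
    "salinity":    [35.1, 35.0, 34.9, 35.2],
}

# Basic statistics per attribute: average and range (max - min).
summary = {
    attr: {"mean": round(mean(vals), 3), "range": round(max(vals) - min(vals), 3)}
    for attr, vals in readings.items()
}
```

At Big Data scale the same per-attribute summaries would be pushed down into the database or a distributed engine rather than computed in local memory.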
Basic analytics: Basic monitoring
You might also want to monitor large volumes of data in real time.
For example, you might want to monitor the water column attributes in the preceding
example every second for an extended period of time from hundreds of locations and at
varying heights in the water column.
This would produce a huge data set. Or, you might be interested in monitoring the buzz
associated with your product every minute when you launch an ad campaign. Whereas the
water column data set might produce a large amount of relatively structured time-
sensitive data, the social media campaign is going to produce large amounts of disparate
kinds of data from multiple sources across the Internet.
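A minimal sketch of such monitoring: a generator that flags any reading falling outside an expected band as the stream arrives. The attribute, thresholds, and values are illustrative assumptions:

```python
# Hypothetical stream of pH readings arriving once per second (illustrative values).
stream = [7.9, 8.0, 9.5, 8.1, 6.2]

def monitor(readings, low, high):
    """Yield (index, value) for each reading outside the expected band [low, high]."""
    for i, value in enumerate(readings):
        if not (low <= value <= high):
            yield i, value

alerts = list(monitor(stream, low=7.5, high=8.5))
```

Because it is a generator, the same check works on an unbounded live feed, emitting alerts as violations occur instead of waiting for the full data set.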
Basic analytics: Anomaly identification
You might want to identify anomalies, such as an event where the actual observation
differs from what you expected, in your data because that may clue you in that something
is going wrong with your organization, manufacturing process, and so on.
For example, you might want to analyze the records for your manufacturing operation to
determine whether one kind of machine, or one operator, has a higher incidence of a certain
kind of problem. This might involve some simple statistics, like moving averages that trigger an alert from the problematic machine.
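The moving-average idea can be sketched as follows; the window size, threshold, and sample values are illustrative assumptions:

```python
from statistics import mean

# Hypothetical machine metric sampled over time; values are illustrative.
samples = [5.0, 5.1, 4.9, 5.0, 5.1, 9.0, 5.0]

def moving_average_alerts(values, window=3, threshold=2.0):
    """Flag indices where a value deviates from the trailing moving average
    of the previous `window` readings by more than `threshold`."""
    alerts = []
    for i in range(window, len(values)):
        avg = mean(values[i - window:i])
        if abs(values[i] - avg) > threshold:
            alerts.append(i)
    return alerts

flagged = moving_average_alerts(samples)
```

Here the spike at index 5 stands well clear of the trailing average and would be flagged, while ordinary jitter passes silently.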
Using Big Data to Get Results:
Advanced analytics
Advanced analytics provides algorithms for complex analysis of
either structured or unstructured data. It includes sophisticated
statistical models, machine learning, neural networks, text analytics,
and other advanced data-mining techniques.
The analysis and extraction processes used in text analytics take advantage of
techniques that originated in computational linguistics, statistics, and other
computer science disciplines. Text analytics is being used in all sorts of analysis,
from predicting churn, to fraud, and to social media analytics.
Advanced analytics: Data mining
Data mining involves exploring and analyzing large amounts of data to find
patterns in that data. The techniques came out of the fields of statistics and
artificial intelligence (AI), with a bit of database management thrown into the
mix. Generally, the goal of data mining is either classification or prediction.
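As a toy illustration of the classification goal, the sketch below computes one centroid per class from labeled data and assigns new points to the nearest centroid (a simplified nearest-centroid classifier; the data and labels are invented):

```python
from statistics import mean

# Hypothetical labeled training data: (feature, label) pairs for a toy task.
train = [(1.0, "low"), (1.2, "low"), (0.9, "low"),
         (5.0, "high"), (5.2, "high"), (4.8, "high")]

# "Model": the mean feature value (centroid) of each class.
centroids = {}
for label in {lbl for _, lbl in train}:
    centroids[label] = mean(x for x, lbl in train if lbl == label)

def classify(x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda lbl: abs(x - centroids[lbl]))

prediction = classify(1.1)
```

Production data mining uses far richer features and algorithms, but the pattern is the same: learn structure from historical data, then apply it to classify or predict new cases.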
REMEMBER:
This finding suggests that at least part of the initial hypothesis is correct;
the data can identify innovators who span different geographies and
business units. The team used Tableau software for data visualization and
exploration and used the Pivotal Greenplum database as the main data
repository and analytics engine.
PHASE 5: COMMUNICATE RESULTS
This project was considered successful in identifying boundary spanners and
hidden innovators. As a result, the CTO office launched longitudinal studies to
begin data collection efforts and track innovation results over longer periods of
time. The GINA project promoted knowledge sharing related to innovation and
researchers spanning multiple areas within the company and outside of it. GINA
also enabled EMC to cultivate additional intellectual property that led to additional
research topics and provided opportunities to forge relationships with universities
for joint academic research in the fields of Data Science and Big Data. In addition,
the project was accomplished with a limited budget, leveraging a volunteer force
of highly skilled and distinguished engineers and data scientists.
One of the key findings from the project is that there was a disproportionately
high density of innovators in Cork, Ireland. After further research, it was learned
that the COE in Cork, Ireland had received focused training in innovation from an
external consultant, which was proving effective. The Cork COE came up with
more innovation ideas, and better ones, than it had in the past, and it was making
larger contributions to innovation at EMC. Applying social network analysis
enabled the team to find a pocket of people within EMC who were making
disproportionately strong contributions. These findings were shared internally
through presentations and conferences and promoted through social media and
blogs.
Phase 6: Operationalize
The CTO office and GINA team need more data in the future, which will require a marketing initiative to convince people to inform the global community about their innovation and research activities.
Some of the data is sensitive, and the team needs to
consider security and privacy related to the data, such as
who can run the models and see the results.
In addition to running models, a parallel initiative needs to
be created to improve basic Business Intelligence activities,
such as dashboards, reporting, and queries on research
activities worldwide.
A mechanism is needed to continually reevaluate the model
after deployment. Assessing the benefits is one of the main
goals of this stage, as is defining a process to retrain the
model as needed.
Analytic Plan from the EMC GINA Project
Questions? Comments? Feel free to share your feedback.
Email: ANNEMUDYAYOLANDA@LECTURER.UNRI.AC.ID
Office: Epsilon 12, Statistika FMIPA UNRI