Data Analytics Lifecycle


Data Analytics Lifecycle


MEETING 4: BIG DATA


PROGRAM STUDI STATISTIKA JURUSAN MATEMATIKA FMIPA UNIVERSITAS RIAU
ANNE MUDYA YOLANDA , S.STAT., M.SI.
Data Analytics
Lifecycle
DEFINITION
METHODS
EXAMPLE
Data Analytics: Big Data
More and more conferences are held annually focusing on innovation in the areas of Data Science and topics dealing with Big Data. Despite this strong focus on the emerging role of the data scientist specifically, there are actually seven key roles that need to be fulfilled for a high-functioning data science team to execute analytic projects successfully.
Key Roles for a Successful Analytics Project

Most of these roles are not new, but the last two, data engineer and data scientist, have become popular and in high demand as interest in Big Data has grown.
Key Roles for a Successful Analytics Project
Each plays a critical part in a
successful analytics project.
Although seven roles are listed,
fewer or more people can
accomplish the work depending
on the scope of the project, the
organizational structure, and the
skills of the participants. For
example, on a small, versatile
team, these seven roles may be
fulfilled by only 3 people, but a
very large project may require
20 or more people.
Key Roles for a Successful Analytics Project
Business User: Someone who understands the domain area and usually benefits from the results. This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized. Usually a business analyst, line manager, or deep subject matter expert in the project domain fulfills this role.

Project Sponsor: Responsible for the genesis of the project. Provides the impetus and requirements for the project and defines the core business problem. Generally provides the funding and gauges the degree of value from the final outputs of the working team. This person sets the priorities for the project and clarifies the desired outputs.

Business Intelligence Analyst: Provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective. Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds and sources.

Project Manager: Ensures that key milestones and objectives are met on time and at the expected quality.
Key Roles for a Successful Analytics Project
Database Administrator (DBA): Provisions and configures the database environment to support the analytics needs of the working team. These responsibilities may include providing access to key databases or tables and ensuring the appropriate security levels are in place related to the data repositories.

Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox. Whereas the DBA sets up and configures the databases to be used, the data engineer executes the actual data extractions and performs substantial data manipulation to facilitate the analytics. The data engineer works closely with the data scientist to help shape data in the right ways for analyses.

Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems. Ensures overall analytics objectives are met. Designs and executes analytical methods and approaches with the data available to the project.
Overview of
Data Analytics Lifecycle

The Data Analytics Lifecycle defines analytics process best practices spanning discovery to project completion. The lifecycle draws from established methods in the realm of data analytics and decision science. This synthesis was developed after gathering input from data scientists and consulting established approaches that informed pieces of the process.

The Data Analytics Lifecycle has six phases. Teams commonly learn new things in a phase that cause them to go back and refine the work done in prior phases based on new insights and information that have been uncovered.
Data Analytics Lifecycle
Note that these phases do not represent
formal stage gates; rather, they serve as
criteria to help test whether it makes
sense to stay in the current phase or
move to the next phase.
Phase 1: Discovery
The team learns the business domain, including
relevant history such as whether the organization
or business unit has attempted similar projects in
the past from which they can learn. The team
assesses the resources available to support the
project in terms of people, technology, time, and
data. Important activities in this phase include
framing the business problem as an analytics
challenge that can be addressed in subsequent
phases and formulating initial hypotheses (IHs) to
test and begin learning the data
Phase 2: Data Preparation
Phase 2 requires the presence of an analytic
sandbox, in which the team can work with data and
perform analytics for the duration of the project.
The team needs to execute extract, load, and
transform (ELT) or extract, transform and load
(ETL) to get data into the sandbox. The ELT and ETL
are sometimes abbreviated as ETLT. Data should
be transformed in the ETLT process so the team
can work with it and analyze it. In this phase, the
team also needs to familiarize itself with the data
thoroughly and take steps to condition the data.
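To make the ETLT step concrete, below is a minimal sketch of moving data into an analytic sandbox, using Python's pandas and SQLite as stand-ins for whatever extraction tools and sandbox database a team actually uses; the file, column, and table names are invented for illustration.

```python
# Minimal ETL sketch: extract a source file, condition the data,
# and load the result into a sandbox database table.
import sqlite3

import pandas as pd

# Extract: "raw_ideas.csv" is a hypothetical source export.
raw = pd.read_csv("raw_ideas.csv")

# Transform: condition the data so the team can work with it.
raw["submitter"] = raw["submitter"].str.strip()            # trim stray spaces
raw["submitted_on"] = pd.to_datetime(raw["submitted_on"])  # normalize dates
clean = raw.dropna(subset=["submitter", "submitted_on"])   # drop unusable rows

# Load: write the conditioned table into the analytic sandbox.
with sqlite3.connect("sandbox.db") as conn:
    clean.to_sql("ideas", conn, if_exists="replace", index=False)
```

The same extract-transform-load pattern applies whether the sandbox is a local database, as here, or a dedicated analytics cluster.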
Phase 3: Model Planning
Phase 3 is model planning, where the team
determines the methods, techniques, and workflow
it intends to follow for the subsequent model
building phase. The team explores the data to learn
about the relationships between variables and
subsequently selects key variables and the most
suitable models
Phase 4: Model Building
In Phase 4, the team develops datasets for testing, training, and production purposes. In addition,
in this phase the team builds and executes models based on the work done in the model planning
phase. The team also considers whether its existing tools will suffice for running the models, or if
it will need a more robust environment for executing models and workflows (for example, fast
hardware and parallel processing, if applicable).
Phase 5: Communicate Results
In Phase 5, the team, in collaboration
with major stakeholders, determines if
the results of the project are a success
or a failure based on the criteria
developed in Phase 1. The team should
identify key findings, quantify the
business value, and develop a narrative
to summarize and convey findings to
stakeholders
Phase 6: Operationalize
In Phase 6, the team delivers final
reports, briefings, code, and technical
documents. In addition, the team may
run a pilot project to implement the
models in a production environment.
The main stakeholders of an analytics project and what they usually expect at the conclusion of a project:
Business User typically tries to determine the benefits and implications of the findings
to the business.
Project Sponsor typically asks questions related to the business impact of the project,
the risks and return on investment (ROI), and the way the project can be evangelized
within the organization (and beyond).
Project Manager needs to determine if the project was completed on time and within
budget and how well the goals were met.
Business Intelligence Analyst needs to know if the reports and dashboards he
manages will be impacted and need to change.
Data Engineer and Database Administrator (DBA) typically need to share their code
from the analytics project and create a technical document on how to implement it.
Data Scientist needs to share the code and explain the model to her peers, managers, and other stakeholders.
Although these seven roles represent many interests within
a project, these interests usually overlap, and most of them
can be met with four main deliverables.

Presentation for project sponsors: This contains high-level takeaways for executive-level stakeholders, with a few key messages to aid their decision-making process. Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp.
Presentation for analysts: Describes business process changes and reporting changes. Fellow data scientists will want the details and are comfortable with technical graphs (such as Receiver Operating Characteristic [ROC] curves, density plots, and histograms).
Code for technical people.
Technical specifications for implementing the code.
Using Big Data to Get Results: Basic Analytics
Basic analytics can be used to explore your data when you are not sure what you have but think something is of value. This might include simple visualizations or simple statistics. Basic analysis is often used when you have large amounts of disparate data.
✓ Slicing and dicing
✓ Basic monitoring
✓ Anomaly identification
Basic analytics: Slicing and dicing
Slicing and dicing refers to breaking down your data into smaller sets of data that are easier to explore.

For example, you might have a scientific data set of water column data from many different locations
that contains numerous variables captured from multiple sensors. Attributes might include
temperature, pressure, transparency, dissolved oxygen, pH, salinity, and so on, collected over time.

You might want some simple graphs or plots that let you explore your data across different dimensions,
such as temperature versus pH or transparency versus salinity. You might want some basic statistics
such as average or range for each attribute, from each height, for the time period.

The point is that you might use this basic type of exploration of the variables to ask specific questions in your problem space. The difference between this kind of analysis and what happens in a basic business intelligence system is that you're dealing with huge volumes of data, you might not know how much query space you'll need to examine it, and you're probably going to want to run computations in real time.
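As an illustration of slicing and dicing, the sketch below uses pandas to cut the hypothetical water column data down to one depth band and compute per-location statistics; the file and column names are assumptions, not part of the original example.

```python
# Slicing and dicing: reduce a large sensor table to smaller,
# easier-to-explore summaries.
import pandas as pd

# Hypothetical table with one row per sensor reading.
df = pd.read_csv("water_column.csv")
# assumed columns: location, depth, temperature, pH, salinity

# Slice: keep near-surface readings only.
surface = df[df["depth"] < 10]

# Dice: per-location averages and ranges for each attribute.
stats = (surface.groupby("location")[["temperature", "pH", "salinity"]]
                .agg(["mean", "min", "max"]))
print(stats)

# A simple exploratory plot across two dimensions (needs matplotlib).
surface.plot.scatter(x="temperature", y="pH")
```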
Basic analytics: Basic monitoring
You might also want to monitor large volumes of data in real time.

For example, you might want to monitor the water column attributes in the preceding
example every second for an extended period of time from hundreds of locations and at
varying heights in the water column.

This would produce a huge data set. Or you might be interested in monitoring the buzz associated with your product every minute when you launch an ad campaign. Whereas the water column data set might produce a large amount of relatively structured, time-sensitive data, the social media campaign is going to produce large amounts of disparate kinds of data from multiple sources across the Internet.
Basic analytics: Anomaly identification
You might want to identify anomalies, such as an event where the actual observation
differs from what you expected, in your data because that may clue you in that something
is going wrong with your organization, manufacturing process, and so on.

For example, you might want to analyze the records for your manufacturing operation to
determine whether one kind of machine, or one operator, has a higher incidence of a certain
kind of problem. This might involve some simple statistics like moving averages triggered
by an alert from the problematic machine.
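As a sketch of the moving-average idea in the example above, the following Python fragment flags readings that drift far from the recent trend; the column names, window size, and threshold are illustrative assumptions.

```python
# Anomaly identification with a moving average: flag readings that
# deviate sharply from each machine's recent behavior.
import pandas as pd

readings = pd.read_csv("machine_log.csv")   # assumed: machine_id, value
readings["moving_avg"] = (
    readings.groupby("machine_id")["value"]
            .transform(lambda s: s.rolling(20, min_periods=5).mean())
)
readings["deviation"] = (readings["value"] - readings["moving_avg"]).abs()

# Alert when a reading deviates by more than 3 standard deviations.
threshold = 3 * readings["value"].std()
alerts = readings[readings["deviation"] > threshold]
print(alerts[["machine_id", "value", "moving_avg"]])
```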
Using Big Data to Get Results: Advanced Analytics
Advanced analytics provides algorithms for complex analysis of either structured or unstructured data. It includes sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data-mining techniques.

Today, advanced analytics is becoming more mainstream. With increases in computational power, improved data infrastructure, new algorithm development, and the need to obtain better insight from increasingly vast amounts of data, companies are pushing toward utilizing advanced analytics as part of their decision-making process. Businesses realize that better insights can provide a superior competitive position.
ADVANCED analytics: Predictive modeling
Predictive modeling is one of the most popular big data advanced analytics use cases.

A predictive model is a statistical or data-mining solution consisting of algorithms and techniques that can be used on both structured and unstructured data (together or individually) to determine future outcomes.

For example, a telecommunications company might use a predictive model to predict customers who might drop its service. In the big data world, you might have large numbers of predictive attributes across huge amounts of observations. Whereas in the past it might have taken hours (or longer) to run a predictive model with a large amount of data on your desktop, you might now be able to run it iteratively hundreds of times if you have a big data infrastructure in place.
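A minimal sketch of such a churn model, assuming scikit-learn and an invented customer table, might look like the following; in a real big data setting the same logic would run on a distributed engine rather than a desktop.

```python
# Predictive-modeling sketch: train a churn classifier on historical
# customers and measure how well it ranks at-risk customers.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("customers.csv")          # hypothetical file
features = ["tenure_months", "monthly_spend", "support_calls"]

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["churned"], test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # probability of churn
print("AUC:", roc_auc_score(y_test, scores))
```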
ADVANCED analytics: Text analytics
Because unstructured data is such a big part of big data, text analytics (the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways) has become an important component of the big data ecosystem.

The analysis and extraction processes used in text analytics take advantage of techniques that originated in computational linguistics, statistics, and other computer science disciplines. Text analytics is being used in all sorts of analysis, from predicting churn to detecting fraud to social media analytics.
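To show what "transforming text into structured information" can mean in practice, here is a small sketch using scikit-learn's TfidfVectorizer to turn free-text documents into a numeric matrix; the example documents are invented.

```python
# Text-analytics sketch: convert unstructured text into structured
# numeric features that downstream models can consume.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Customer reports dropped calls and slow data",
    "Billing dispute over roaming data charges",
    "Praise for fast resolution of a dropped-call issue",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)           # rows: documents, cols: terms

# Each document is now a structured vector keyed by vocabulary terms.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```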
ADVANCED analytics: Data Mining
Data mining involves exploring and analyzing large amounts of data to find patterns in that data. The techniques came out of the fields of statistics and artificial intelligence (AI), with a bit of database management thrown into the mix. Generally, the goal of data mining is either classification or prediction.

Typical algorithms used in data mining include the following:
✓ Classification trees
✓ Logistic regression
✓ Neural networks
✓ Clustering techniques such as K-means
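As one concrete instance from the list above, the sketch below fits a classification tree with scikit-learn on its bundled iris dataset, so it runs without any external data.

```python
# Classification-tree sketch: a typical data-mining algorithm applied
# to a small labeled dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```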
ADVANCED analytics: Others
Other statistical and data-mining algorithms: This may include advanced forecasting, optimization, cluster analysis for segmentation or even microsegmentation, or affinity analysis.

REMEMBER: Advanced analytics doesn't always require big data. However, being able to apply advanced analytics with big data can provide some important results.
Using Big Data to Get Results: Operationalized Analytics
When you operationalize analytics, you make them part of a
business process. For example, statisticians at an insurance
company might build a model that predicts the likelihood of a claim
being fraudulent. The model, along with some decision rules, could
be included in the company’s claims-processing system to flag
claims with a high probability of fraud. These claims would be sent
to an investigation unit for further review. In other cases, the model
itself might not be as apparent to the end user. For example, a
model could be built to predict customers who are good targets for
upselling when they call into a call center. The call center agent,
while on the phone with the customer, would receive a message on
specific additional products to sell to this customer. The agent might not even know that a predictive model was working behind the scenes to make this recommendation.
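The fraud-flagging flow described above can be sketched as a scored model wrapped in decision rules; the scoring function below is a stand-in for a trained model, and the threshold and fields are illustrative, not an actual insurer's logic.

```python
# Operationalized analytics sketch: a model score plus decision rules
# embedded in a claims-processing step.
def fraud_score(claim: dict) -> float:
    """Stand-in for a trained model's predicted fraud probability."""
    score = 0.1
    if claim["amount"] > 10_000:
        score += 0.4
    if claim["prior_claims"] > 3:
        score += 0.3
    return min(score, 1.0)

def route_claim(claim: dict) -> str:
    """Decision rule: flag high-probability claims for investigation."""
    if fraud_score(claim) >= 0.7:
        return "send to investigation unit"
    return "process normally"

print(route_claim({"amount": 15_000, "prior_claims": 5}))
```

The end user of such a system sees only the routing decision; the model itself stays behind the scenes, as in the call-center example.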
Using Big Data to Get Results: Monetizing Analytics
Analytics can be used to optimize your business to create better decisions and drive bottom- and top-line revenue. However, big data analytics can also be used to derive revenue above and beyond the insights it provides just for your own department or company. You might be able to assemble a unique data set that is valuable to other companies as well. For example, credit card providers take the data they assemble to offer value-added analytics products, and financial institutions do likewise. Telecommunications companies are beginning to sell location-based insights to retailers. The idea is that various sources of data, such as billing data, location data, text-messaging data, or web-browsing data, can be used together or separately to make inferences about customer behavior patterns that retailers would find useful. Because telecommunications is a regulated industry, these companies must do so in compliance with legislation and privacy policies.
Case Study: Global Innovation Network and Analytics (GINA)
EMC's Global Innovation Network and Analytics (GINA) team is a group of senior technologists located in centers of excellence (COEs) around the world. This team's charter is to engage employees across global COEs to drive innovation, research, and university partnerships. In 2012, a newly hired director wanted to improve these activities and provide a mechanism to track and analyze the related information. In addition, this team wanted to create more robust mechanisms for capturing the results of its informal conversations with other thought leaders within EMC, in academia, or in other organizations, which could later be mined for insights.

The GINA team thought its approach would provide a means to share ideas globally and increase knowledge sharing among GINA members who may be separated geographically. It planned to create a data repository containing both structured and unstructured data to accomplish three main goals:
✓ Store formal and informal data.
✓ Track research from global technologists.
✓ Mine the data for patterns and insights to improve the team's operations and strategy.
The GINA case study provides an
example of how a team applied
the Data Analytics Lifecycle to
analyze innovation data at EMC.
Innovation is typically a difficult
concept to measure, and this
team wanted to look for ways to
use advanced analytical methods
to identify key innovators within
the company.
Phase 1: Discovery
The project sponsor’s approach was to leverage social media and blogging to
accelerate the collection of innovation and research data worldwide and to
motivate teams of “volunteer” data scientists at worldwide locations. Given
that he lacked a formal team, he needed to be resourceful about finding
people who were both capable and willing to volunteer their time to work on
interesting problems. Data scientists tend to be passionate about data, and the
project sponsor was able to tap into this passion of highly talented people to
accomplish challenging work in a creative way.
Phase 1: Discovery
The data for the project fell into two main categories. The first category
represented five years of idea submissions from EMC’s internal innovation
contests, known as the Innovation Roadmap (formerly called the Innovation
Showcase). The Innovation Roadmap is a formal, organic innovation process
whereby employees from around the globe submit ideas that are then vetted
and judged. The best ideas are selected for further incubation. As a result, the
data is a mix of structured data, such as idea counts, submission dates, inventor
names, and unstructured content, such as the textual descriptions of the ideas
themselves.
The second category of data encompassed minutes and notes representing
innovation and research activity from around the world. This also represented a
mix of structured and unstructured data. The structured data included attributes
such as dates, names, and geographic locations. The unstructured documents
contained the “who, what, when, and where” information that represents rich data
about knowledge growth and transfer within the company. This type of
information is often stored in business silos that have little to no visibility across
disparate research teams.
Phase 1: Discovery
The 10 main Initial Hypotheses (IHs) that the GINA team developed were as follows:
IH1: Innovation activity in different geographic regions can be mapped to corporate
strategic directions.
IH2: The length of time it takes to deliver ideas decreases when global knowledge
transfer occurs as part of the idea delivery process.
IH3: Innovators who participate in global knowledge transfer deliver ideas more quickly
than those who do not.
IH4: An idea submission can be analyzed and evaluated for the likelihood of receiving
funding.
IH5: Knowledge discovery and growth for a particular topic can be measured and
compared across geographic regions.
IH6: Knowledge transfer activity can identify research-specific boundary spanners in
disparate regions.
IH7: Strategic corporate themes can be mapped to geographic regions.
IH8: Frequent knowledge expansion and transfer events reduce the time it takes to
generate a corporate asset from an idea.
IH9: Lineage maps can reveal when knowledge expansion and transfer did not (or have not) resulted in a corporate asset.
IH10: Emerging research topics can be classified and mapped to specific ideators,
innovators, boundary spanners, and assets.
The GINA IHs can be grouped into two categories:
1. Descriptive analytics of
what is currently
happening to spark further
creativity, collaboration,
and asset generation
2. Predictive analytics to
advise executive
management of where it
should be investing in the
future
Phase 2: Data Preparation
The team partnered with its IT
department to set up a new analytics
sandbox to store and experiment on the
data. During the data exploration
exercise, the data scientists and data
engineers began to notice that certain
data needed conditioning and
normalization. In addition, the team
realized that several missing datasets
were critical to testing some of the
analytic hypotheses.
Phase 2: Data Preparation
As the team explored the data, it quickly realized that if it did
not have data of sufficient quality or could not get good
quality data, it would not be able to perform the subsequent
steps in the lifecycle process. As a result, it was important to
determine what level of data quality and cleanliness was
sufficient for the project being undertaken. In the case of GINA, the team discovered that many of the names of the
researchers and people interacting with the universities
were misspelled or had leading and trailing spaces in the
datastore. Seemingly small problems such as these in the
data had to be addressed in this phase to enable better
analysis and data aggregation in subsequent phases.
Phase 3: Model Planning
The parameters related to the scope of the study included the following
considerations:
Identify the right milestones to achieve this goal
Trace how people move ideas from each milestone toward the goal.
Once this is done, trace ideas that die, and trace others that reach the goal.
Compare the journeys of ideas that make it and those that do not.
Compare the times and the outcomes using a few different methods (depending on how the data is collected and assembled). These could be as simple as t-tests or perhaps involve different types of classification algorithms; a minimal t-test sketch follows this list.
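As referenced in the last item, here is a minimal t-test sketch using SciPy; the delivery-time numbers are invented purely to show the mechanics of comparing the two groups of ideas.

```python
# Model-planning sketch: two-sample t-test comparing idea delivery
# times with vs. without global knowledge transfer (invented data).
from scipy import stats

with_transfer = [34, 29, 41, 30, 27, 35]       # days to deliver an idea
without_transfer = [52, 47, 60, 44, 58, 49]

t_stat, p_value = stats.ttest_ind(with_transfer, without_transfer)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p would support IH2/IH3
```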
Phase 4: Model Building
The GINA team employed several
analytical methods. This included work by
the data scientist using Natural Language
Processing (NLP) techniques on the textual
descriptions of the Innovation Roadmap
ideas. In addition, he conducted social
network analysis using R and RStudio, and
then he developed social graphs and
visualizations of the network of
communications related to innovation
using R’s ggplot2 package.
Phase 4: Model Building
The first social graph portrays the relationships between idea submitters within GINA. Each color represents an innovator from a different country. The large dots with red circles around them represent hubs. A hub represents a person with high connectivity and a high "betweenness" score.
Social graph visualization of idea submitters and finalists
Phase 4: Model Building
The cluster contains geographic
variety, which is critical to prove
the hypothesis about geographic
boundary spanners. One person in
this graph has an unusually high
score when compared to the rest
of the nodes in the graph. The data
scientist identified this person and
ran a query against his name within
the analytic sandbox.
Social graph visualization of top innovation influencers
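The GINA team did this analysis in R; purely as an illustrative equivalent, the Python sketch below computes betweenness centrality with networkx on a made-up edge list to show how hub scores like the one described above are obtained.

```python
# Social-network sketch: betweenness centrality identifies hubs that
# sit on many shortest paths between other innovators.
import networkx as nx

edges = [("amy", "ben"), ("ben", "carla"), ("carla", "dev"),
         ("ben", "dev"), ("dev", "elena"), ("elena", "farid")]
G = nx.Graph(edges)

betweenness = nx.betweenness_centrality(G)
hubs = sorted(betweenness.items(), key=lambda kv: kv[1], reverse=True)
print(hubs[:3])   # highest scores are candidate boundary spanners
```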
Phase 4: Model Building
The social graph illustrated how influential he was within his business unit and across many other areas of the company worldwide:
In 2011, he attended the ACM SIGMOD conference, which is a top-tier conference on large-scale data management problems and databases.
He visited employees in France who are part of the business unit for
EMC’s content management teams within Documentum (now part of
the Information Intelligence Group, or IIG).
He presented his thoughts on the SIGMOD conference at a virtual
brownbag session attended by three employees in Russia, one
employee in Cairo, one employee in Ireland, one employee in India,
three employees in the United States, and one employee in Israel.
In 2012, he attended the SDM 2012 conference in California.
Phase 4: Model Building
On the same trip he visited innovators and researchers at EMC
federated companies, Pivotal and VMware.
Later on that trip he stood before an internal council of technology
leaders and introduced two of his researchers to dozens of corporate
innovators and researchers.

This finding suggests that at least part of the initial hypothesis is correct;
the data can identify innovators who span different geographies and
business units. The team used Tableau software for data visualization and
exploration and used the Pivotal Greenplum database as the main data
repository and analytics engine.
PHASE 5: COMMUNICATE RESULTS
This project was considered successful in identifying boundary spanners and
hidden innovators. As a result, the CTO office launched longitudinal studies to
begin data collection efforts and track innovation results over longer periods of
time. The GINA project promoted knowledge sharing related to innovation and
researchers spanning multiple areas within the company and outside of it. GINA
also enabled EMC to cultivate additional intellectual property that led to additional
research topics and provided opportunities to forge relationships with universities
for joint academic research in the fields of Data Science and Big Data. In addition,
the project was accomplished with a limited budget, leveraging a volunteer force
of highly skilled and distinguished engineers and data scientists.
PHASE 5: COMMUNICATE RESULTS
One of the key findings from the project is that there was a disproportionately
high density of innovators in Cork, Ireland. After further research, it was learned
that the COE in Cork, Ireland had received focused training in innovation from an
external consultant, which was proving effective. The Cork COE came up with
more innovation ideas, and better ones, than it had in the past, and it was making
larger contributions to innovation at EMC. Applying social network analysis
enabled the team to find a pocket of people within EMC who were making
disproportionately strong contributions. These findings were shared internally
through presentations and conferences and promoted through social media and
blogs.
Phase 6: Operationalize
The CTO office and GINA need more data in the future,
including a marketing initiative to convince people to inform
the global community on their innovation/research
activities.
Some of the data is sensitive, and the team needs to
consider security and privacy related to the data, such as
who can run the models and see the results.
In addition to running models, a parallel initiative needs to
be created to improve basic Business Intelligence activities,
such as dashboards, reporting, and queries on research
activities worldwide.
A mechanism is needed to continually reevaluate the model
after deployment. Assessing the benefits is one of the main
goals of this stage, as is defining a process to retrain the
model as needed.
Analytic Plan from the EMC GINA Project
Questions? Comments?
Feel free to share your feedback.

EMAIL ADDRESS
ANNEMUDYAYOLANDA@LECTURER.UNRI.AC.ID

OFFICE
EPSILON 12
STATISTIKA FMIPA UNRI