big_data_and_data_science

The CDMP Study Group session on September 2, 2020, led by Nupur Gandhi, focused on Chapter 14: Big Data and Data Science, discussing the differences between Big Data and Data Science, their applications, and the challenges in data storage and processing. Key topics included the 6 V's of Big Data, the importance of data governance, and the steps involved in a Big Data strategy. The session aimed to prepare members for the CDMP exam by reviewing relevant content from the DMBOK2.

Uploaded by

Rafikul Islam

CDMP Study Group

SESSION 15
September 2, 2020
Nupur Gandhi, DAMA New England, VP Online Services
Email: nupurgandhi@gmail.com
AGENDA
• Facilitator
• Introductory Note
• Chapter 14: Big Data and Data Science
• Overview
• Q&A
• Next Session

New England Data Management Community


Facilitator
Nupur Gandhi
• The Hartford, Senior Consultant - Reference Data Management
• DAMA NE Chapter, VP, Online Services

CONTACT INFO:
EMAIL: nupurgandhi@gmail.com
PHONE: 860-712-6097
: /IN/nupur gandhi



INTRODUCTORY NOTE

This study group is offered as a service of DAMA New England for DAMA New England
members. It is not an official, DAMA International authorized training course because DAMA-I has
not yet created an authorized trainer program.

The purpose of this group is to help prepare members to take the CDMP. We will do so by
reviewing the content of chapters of the DMBOK2.

The chapter makes no claims for the effectiveness of the sessions or the ability of participants to
pass the CDMP exam after having attended. In fact, you should plan on doing a lot of individual
study to pass the exam.



On a Fun Note…
Big Data & Data Science Can Be Linked To Teenage Relationships

Everyone Talks About It,
Few Really Know How To Do It,
It’s Been The Source Of Many Rumors,
Everyone Thinks That Everyone Else Is Doing It,
So Everyone Claims They Are Doing It!
HOMEWORK – Big Data & Data Science
Is there a difference between Big Data and Data Science?

Big Data – Collection of Data

Data Science – Analysis of the big data / Applied Statistics

Big Data and Data Science have helped to generate, store and analyze larger amounts of data
Chapter 14: Big Data Introduction
Big Data and Data Science Vs BI/DW

Business Intelligence (BI): Rear view mirror reporting
Think analysis to describe past trends

Data Science: Forward looking (windshield) view of the organization
Think analysis to describe future trends



Data Processing - Traditional DW Vs Big Data

                        Traditional DW                       Big Data
How is data organized   Relational model                     Non-relational model
Concept                 ETL (Extract, Transform and Load)    ELT (Extract, Load and Transform)



Big Data Overview
Definition: The collection (Big Data) and analysis (Data Science, Analytics and Visualization) of many
different types of data to find answers and insights for questions that are not known at the start of the
analysis.

Goals:
1. Discover relationships between data and the business
2. Support the iterative integration of data source(s) from the enterprise
3. Discover and analyze new factors that might affect the business
4. Publish data using visualization techniques in an appropriate, trusted, and efficient manner



Big Data Overview
Biggest Business Driver for using Big Data:

Find and act on business opportunities that may be discovered through data sets generated
through a diversified range of products



Real life examples using Data Science



Big Data Features
What are the 6 V’s in Big Data?

• Volume (Amount of Data)

• Velocity (Speed at which data is produced)

• Variety/Variability (Various Forms, formats, data structures)

• Viscosity (How difficult the data is to use or integrate)

• Volatility (How data changes occur and therefore how long the data is useful)

• Veracity (How trustworthy the data is)



Visualization of Big Data



Big Data Storage Challenges
Every 2 days we create as much data as we did from the beginning of time until 2003

Sources of Big Data:

• Social Sites Audio/Video


• Sensors
• Blogs
• Advertising Data
• Web Logs
• Phone
• POS devices
• Online Orders
• Online Video Games



Conceptual DW/BI and Big Data Architecture
Sr. No. / Big Data / Data Warehouse

1. Big Data: Big data is the data which is in enormous form on which technologies can be applied.
   Data Warehouse: Data warehouse is the collection of historical data from different operations in an enterprise.
2. Big Data: Big data is a technology to store and manage large amount of data.
   Data Warehouse: Data warehouse is an architecture used to organize the data.
3. Big Data: It takes structured, non-structured or semi-structured data as an input.
   Data Warehouse: It only takes structured data as an input.
4. Big Data: Big data does processing by using distributed file system.
   Data Warehouse: Data warehouse doesn’t use distributed file system for processing.
5. Big Data: Big data doesn’t follow any SQL queries to fetch data from database.
   Data Warehouse: In data warehouse we use SQL queries to fetch data from relational databases.
6. Big Data: Apache Hadoop can be used to handle enormous amount of data.
   Data Warehouse: Data warehouse cannot be used to handle enormous amount of data.
7. Big Data: When new data is added, the changes in data are stored in the form of a file which is represented by a table.
   Data Warehouse: When new data is added, the changes in data do not directly impact the data warehouse.
8. Big Data: Big data doesn’t require efficient management techniques as compared to data warehouse.
   Data Warehouse: Data warehouse requires more efficient management techniques as the data is collected from different departments of the enterprise.
Analytics Progression



Conceptual DW/BI and Big Data Architecture

- Selection, installation and configuration of a Big Data environment requires specialized expertise

- Develop and rationalize end to end architecture using:
• Data exploratory tools
• New acquisitions

In a Big Data environment, data is ingested and loaded before it is integrated (extract, LOAD, transform) Vs
In a data warehouse, data is integrated as it is brought into the warehouse (extract, TRANSFORM, load)
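
The ordering difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the record layout, the `transform` rule, and the names `etl` and `elt` are assumptions, not something taken from the DMBOK2.

```python
# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"id": "1", "amount": " 10.50 "},
    {"id": "2", "amount": "7"},
    {"id": "3", "amount": "n/a"},  # does not conform to the target schema
]

def transform(record):
    """Cast fields to the target schema; return None if the record is invalid."""
    try:
        return {"id": int(record["id"]), "amount": float(record["amount"])}
    except ValueError:
        return None

def etl(raw):
    """Warehouse style: transform first, then load only conforming rows."""
    warehouse = []
    for record in raw:
        row = transform(record)
        if row is not None:
            warehouse.append(row)
    return warehouse

def elt(raw):
    """Big Data style: load everything as-is, transform later when it is read."""
    lake = list(raw)  # load: the raw data is kept untouched
    view = [row for row in map(transform, lake) if row is not None]
    return lake, view

warehouse = etl(RAW)
lake, view = elt(RAW)
```

In this sketch the ETL path never stores the non-conforming record, while the ELT path keeps it in the raw store and only filters it out of the derived view.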
Game: Match the following?
1. What is a Data Lake?
2. What is Machine Learning?
3. What is Sentiment Analysis?
4. What is Data Mining?
5. What is Text Mining?
6. What is Predictive Analytics?
7. What is Prescriptive Analytics?

a. Anticipates what will happen, when it will happen and implies why it will happen
b. Analysis that reveals patterns in data using various algorithms
c. Analyzes documents with text analysis and data mining technologies to classify content automatically
d. Development of probability models based on variables using historical data
e. Uses NLP (Natural Language Processing) to detect sentiment and reveal changes in sentiment to predict possible scenarios
f. Environment where a vast amount of data of various types and structures can be ingested, stored, assessed and analyzed
g. Explores construction and study of learning algorithms

Answers: 1f, 2g, 3e, 4b, 5c, 6d, 7a
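
As a rough illustration of item 3, sentiment analysis can be approximated by scoring words against a lexicon. This is a deliberately simplified sketch: the `LEXICON` below is a made-up assumption, and real NLP tooling handles negation, context and much larger vocabularies.

```python
import re

# Hypothetical sentiment lexicon; real systems use far larger, curated ones.
LEXICON = {"great": 1, "love": 1, "good": 1, "bad": -1, "slow": -1, "hate": -1}

def sentiment_score(text):
    """Sum the lexicon weights of the words found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Tracking how these labels shift over time for a product or brand is the "changes in sentiment" part of the definition.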



Game: What do I stand for?
Services-based architecture (SBA) is a way to provide immediate data as well as update a
historical data set.
The speed layer is often referred to as an ODS; all transactions are updates (or inserts only if required)

Speed Layer?

Batch Layer?

Serving Layer?

Batch layer: every transaction is an insert
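
A minimal sketch of the three SBA layers, assuming an in-memory list for the batch layer and a dictionary for the speed layer. The class and method names are hypothetical and only illustrate the insert-only versus update behaviour.

```python
class BatchLayer:
    """Append-only history: every transaction is an insert."""
    def __init__(self):
        self.history = []

    def write(self, key, value, ts):
        self.history.append((ts, key, value))

class SpeedLayer:
    """ODS-like current state: each write overwrites the value for its key."""
    def __init__(self):
        self.current = {}

    def write(self, key, value, ts):
        self.current[key] = (ts, value)

class ServingLayer:
    """One interface over the data held in the batch and speed layers."""
    def __init__(self, batch, speed):
        self.batch = batch
        self.speed = speed

    def latest(self, key):
        entry = self.speed.current.get(key)
        return None if entry is None else entry[1]

    def history(self, key):
        return [(ts, v) for ts, k, v in self.batch.history if k == key]

def ingest(batch, speed, key, value, ts):
    """Data is loaded into both the batch and the speed layer."""
    batch.write(key, value, ts)
    speed.write(key, value, ts)

batch, speed = BatchLayer(), SpeedLayer()
serving = ServingLayer(batch, speed)
ingest(batch, speed, "acct-1", 100, ts=1)
ingest(batch, speed, "acct-1", 80, ts=2)
```

After the two writes, `serving.latest("acct-1")` reflects only the newest value, while `serving.history("acct-1")` still holds both transactions.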



Game: What do I stand for?
Machine learning explores the construction and study of learning algorithms.
These algorithms fall into 3 types. What are those?

Supervised Learning, Unsupervised Learning, Reinforcement Learning

Supervised Learning: Based on generalized rules, e.g. separating SPAM from non-SPAM email

Unsupervised Learning: Based on identifying hidden patterns (Data Mining)

Reinforcement Learning: Based on achieving a goal (Beating an opponent at chess)
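
The spam example can be sketched as a tiny supervised learner: it generalizes from labeled examples and applies what it learned to new text. The training messages and the word-count scoring rule are assumptions chosen for brevity, not a production spam filter.

```python
from collections import Counter

# Hypothetical labeled training examples: (label, text).
TRAINING = [
    ("spam", "win cash prize now"),
    ("spam", "free prize claim now"),
    ("ham", "meeting agenda for monday"),
    ("ham", "notes from the data governance meeting"),
]

def train(examples):
    """Count how often each word appears under each label."""
    counts = {}
    for label, text in examples:
        counts.setdefault(label, Counter()).update(text.split())
    return counts

def classify(counts, text):
    """Pick the label whose training words overlap most with the text."""
    scores = {
        label: sum(words[w] for w in text.split())
        for label, words in counts.items()
    }
    return max(scores, key=scores.get)

model = train(TRAINING)
```

The labels in `TRAINING` are what make this supervised; an unsupervised method would have to find the two groups without them.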



Big Data Process
1. Define Big Data Strategy & Business Need(s):
Define requirements that identify desired outcomes with
measurable tangible benefits
2. Choose Data Source(s):
Identify gaps in the current data asset and find data sources to fill
those gaps
3. Acquire & Ingest Data Source(s):
Obtain data sources and onboard them
4. Develop Data Science Hypothesis(es) & Methods:
Explore data sources, refine requirements, define model inputs,
types or model hypotheses
5. Integrate/Align Data For Analysis:
Model feasibility depends on the quality of the data source,
leverage trusted and credible sources
6. Explore Data Using Models:
Apply statistical analysis and Machine Learning algorithms against the
integrated data. Validate, train and over time, evolve the model
7. Deploy and Monitor:
Deploy models to production for ongoing monitoring of value and
effectiveness



Big Data Science Activities
1. Define Big Data Strategy & Business Need(s)
• “Define business requirements that identify desired outcomes with measurable tangible benefits”
• Start With A “Problem Statement” And / Or A Hypothesis “Why are diaper sales down and in what markets or areas?”

2. Choose Data Source(s)


• “Identify gaps in the current data asset base and find data sources or sets to fill the gaps.”
• Determine What Data Would Be Needed To Understand Problem
External: US Census (Demographics), US Natality (birth rates), IRS, …
Internal: Sales (POS - RSI), Distribution (SAP), Trade (Promax), …

3. Acquire & Ingest Data Source(s)


• “Secure (purchase or obtain) data sets and onboard them into your environment”

4. Develop Data Science Hypothesis(es) & Methods


• “Explore data sources using profiling, visualization, mining, or other methods to understand this data and then refine
Theory(ies) and define model algorithm inputs, types, or test methods of analysis”
• Review the data to understand its value, composition and key relationships
“What data sets have what time intervals and how do they sync up together”
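
Profiling in step 4 can start very simply, for example by counting missing and distinct values per column. The sketch below assumes rows arrive as dictionaries; the `SALES` extract and its column names are invented for illustration.

```python
def profile(rows):
    """Return row count, missing count and distinct count for each column."""
    columns = sorted({name for row in rows for name in row})
    report = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in (None, "")]
        report[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report

# Hypothetical weekly sales extract.
SALES = [
    {"market": "Boston", "week": "2020-W01", "units": 120},
    {"market": "Boston", "week": "2020-W02", "units": None},
    {"market": "Hartford", "week": "2020-W01", "units": 95},
    {"market": "", "week": "2020-W02", "units": 95},
]

report = profile(SALES)
```

A report like this shows which columns have gaps and which time intervals each data set covers before any model is attempted.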



Big Data Science Activities
5. Integrate/Align Data For Analysis
• Model feasibility depends on the quality and matching of the source data sets: the higher quality/match of the data, the
more likely the model will succeed. Leveraging trusted and credible sources and applying appropriate data integration and
cleansing techniques can increase usefulness of data sets.

6. Explore Data Using Models


• Exploration involves applying statistical analysis (models) and machine learning (AI) algorithms against the integrated data.
The model is constructed, evaluated against training sets, and validated. Training the model is also critical to its effectiveness.
Training entails repeated runs of the model against actual data to verify assumptions and make adjustments, such as identifying
outliers and selecting different variables.
• Through this process, hypotheses will be refined. Initial feasibility / viability metrics can guide evolution of the model.
New hypotheses may be introduced that require additional data sets and results of this exploration will shape the future
modeling and outputs.

7. Deploy and Monitor


• Those models that produce useful information can be deployed to production for ongoing monitoring of value and
effectiveness. Often times data science projects turn into data warehousing projects where more vigorous development
processes are put in place (ETL, DQ, Master Data, etc.).
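
Steps 6 and 7 can be sketched with a one-variable linear model: fit on a training set, validate on held-out data, then keep checking the error after deployment. The data points, the 4/2 split and the `threshold` value are assumptions for illustration only.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, points):
    """Average absolute difference between predictions and actual values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

def needs_retraining(model, new_points, threshold=1.0):
    """Monitoring check: flag the model when its error on new data grows."""
    return mean_abs_error(model, new_points) > threshold

# Hypothetical observations (x, y).
DATA = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8), (5, 11.1), (6, 13.0)]
train_set, holdout = DATA[:4], DATA[4:]

model = fit_line(train_set)
validation_error = mean_abs_error(model, holdout)
```

If `needs_retraining` starts returning `True` on fresh data, that is the signal to revisit the hypotheses and data sets from the earlier steps.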



Tools and Techniques



Tools
• MPP (Massively Parallel Processing): Data is partitioned across multiple processing servers

• Distributed File Based Databases
  E.g.: Open Source Hadoop

• In-database algorithms
  E.g.: K-means Clustering, Linear regression, Conjugate Gradient, Cohort Analysis

• Big data Cloud Solutions

• Statistical Computing and Graphical Languages
  E.g.: R

• Data Visualization Tools
  E.g.: Radar Charts, Coordinate plots, Tag Charts, Heat Map
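
Of the algorithms named above, K-means clustering is easy to sketch. The version below is plain Python rather than an in-database implementation, it seeds the centroids with the first `k` points for repeatability, and the sample points are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, iterations=10):
    """Plain k-means on 2-D points, seeded with the first k points."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(
                range(k), key=lambda i: squared_distance(point, centroids[i])
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters

# Hypothetical 2-D observations forming two obvious groups.
POINTS = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = kmeans(POINTS, k=2)
```

An in-database version would run the same assignment and update steps inside the database engine, next to the data, instead of pulling the points into application code.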



Techniques
Analytic Modelling
• Learn by example through training the model
• Types of analysis associated with Analytic models:
  Descriptive Modelling
  Explanatory Modelling

Big Data Modelling
• Technical challenge but critical if organization wants to describe and govern the data
• Data Modelling while accounting for the variety of sources



Techniques
Descriptive Modelling: Represents data structures in a compact manner

Does not validate a causal hypothesis or predict outcomes; uses algorithms to define or refine
relationships across variables

Explanatory Modelling: Application of statistical models to data for testing causal hypotheses about
theoretical constructs

Does not predict outcomes; matches model results only with existing data
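
One simple way to describe a relationship across variables without claiming cause or predicting anything is a correlation coefficient. The sketch below computes Pearson's r in plain Python; treating it as an example of descriptive modelling is an assumption made here for illustration, and the sample series are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical weekly figures.
promo_spend = [1, 2, 3, 4, 5]
units_sold = [2, 4, 6, 8, 10]
returns = [5, 4, 3, 2, 1]

r_sales = pearson(promo_spend, units_sold)
r_returns = pearson(promo_spend, returns)
```

A value near 1 or -1 summarizes how tightly two variables move together; it says nothing about why, which is where explanatory modelling comes in.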



Implementation Best Practices & Guidelines
Aligning the business to an implementation plan is the key to success

a. Strategy Alignment
• Strategically aligned with organizational objectives

b. Readiness Assessment / Risk Assessment


• Business Relevance (Align initiatives with company’s business)
• Business Readiness (Commitment to knowledge centre, Skillset Gap)
• Economic Viability (Ownership Costs, Benefits: Tangible and Intangible)
• Prototype (Prototype for a subset of the end user community for a finite timeframe)

c. Organization & Cultural Change


• New roles and responsibilities are required to implement
• These roles are in addition to the existing data management/BI roles
  • Big Data Platform Architect
  • Ingestion Architect
  • Metadata Specialist
  • Analytic Design Lead
  • Data Scientist



Big Data Governance
Big Data, like other data, requires governance.
Need to consider business and technical controls addressing below questions:

• Sourcing: What to source, when to source, what is the best source of data for a particular study
• Sharing: What data sharing agreements and contracts to enter into, terms and conditions both inside and
outside the organization
• Metadata: What the data means on the source side, how to interpret the results on the output side
• Enrichment: Whether to enrich the data, how to enrich the data, and what the benefits of doing so will be
• Access: What to publish and to whom, how and when

Think Visualization Management, Visualization Standards, Data Security, Metadata, Data Quality, Metrics



STUDY GROUP MATERIALS
Study group presentations will be posted on the CDMP Study Group page of the DAMA New England website, in the
Schedule & Agenda section.



Q&A
