big_data_and_data_science

The CDMP Study Group session on September 2, 2020, led by Nupur Gandhi, focused on Chapter 14: Big Data and Data Science, discussing the differences between Big Data and Data Science, their applications, and the challenges in data storage and processing. Key topics included the 6 V's of Big Data, the importance of data governance, and the steps involved in a Big Data strategy. The session aimed to prepare members for the CDMP exam by reviewing relevant content from the DMBOK2.

CDMP Study Group

SESSION 15
September 2, 2020
Nupur Gandhi, DAMA New England, VP Online Services
Email: nupurgandhi@gmail.com
AGENDA
• Facilitator
• Introductory Note
• Chapter 14: Big Data and Data Science
• Overview
• Q&A
• Next Session

New England Data Management Community

Facilitator
Nupur Gandhi
• The Hartford, Senior Consultant - Reference Data Management
• DAMA NE Chapter, VP, Online Services

CONTACT INFO:
EMAIL: nupurgandhi@gmail.com
PHONE: 860-712-6097
: /IN/nupur gandhi


INTRODUCTORY NOTE

This study group is offered as a service of DAMA New England for DAMA New England
members. It is not an official, DAMA International authorized training course because DAMA-I has
not yet created an authorized trainer program.

The purpose of this group is to help prepare members to take the CDMP. We will do so by
reviewing the content of chapters of the DMBOK2.

The chapter makes no claims for the effectiveness of the sessions or the ability of participants to
pass the CDMP exam after having attended. In fact, you should plan on doing a lot of individual
study to pass the exam.


On a Fun Note….
Big Data & Data Science Can Be Likened To Teenage Relationships

Everyone Talks About It,

Few Really Know How To Do It,
It’s Been The Source Of Many Rumors,
Everyone Thinks That Everyone Else Is Doing It,
So Everyone Claims They Are Doing It !
HOMEWORK – Big Data & Data Science
Is there a difference between Big Data and Data
Science?

Big Data – Collection of Data

Data Science – Analysis of the big data/ Applied
Statistics

Big Data and Data Science have helped to generate, store and analyze larger amounts of data
Chapter 14: Big Data Introduction
Big Data and Data Science Vs BI/DW

Business Intelligence (BI): Rear-view mirror reporting
Think analysis to describe past trends

Data Science: Forward-looking (windshield) view of the organization
Think analysis to describe future trends

Data Processing - Traditional DW vs Big Data

                        Traditional DW                       Big Data
How is data organized   Relational model                     Non-relational model
Concept                 ETL (Extract, Transform and Load)    ELT (Extract, Load and Transform)

Big Data Overview
Definition: The collection (Big Data) and analysis (Data Science, Analytics and Visualization) of many
different types of data to find answers and insights for questions that are not known at the start of the
analysis.

Goals:
1. Discover relationships between data and the business
2. Support the iterative integration of data source(s) from the enterprise
3. Discover and analyze new factors that might affect the business
4. Publish data using visualization techniques in an appropriate, trusted, efficient manner

New England Data Management Community

Big Data Overview
Biggest business driver for using Big Data:
Find and act on business opportunities that may be discovered through data sets generated through a diversified range of products

Real life examples using Data Science


Big Data Features
What are the 6 V’s in Big Data?

• Volume (Amount of Data)

• Velocity (Speed at which data is produced)

• Variety/Variability (Various Forms, formats, data structures)

• Viscosity (How difficult the data is to use or integrate)

• Volatility (How data changes occur and therefore how long the data is useful)

• Veracity (How trustworthy the data is)


Visualization of Big Data


Big Data Storage Challenges
Every 2 days we create as much data as we did from the beginning of time until 2003

Sources of Big Data:


Conceptual DW/BI and Big Data Architecture
Big Data vs Data Warehouse

1. Big Data: Big data is the data which is in enormous form on which technologies can be applied.
   Data Warehouse: Data warehouse is the collection of historical data from different operations in an enterprise.
2. Big Data: Big data is a technology to store and manage large amount of data.
   Data Warehouse: Data warehouse is an architecture used to organize the data.
3. Big Data: It takes structured, non-structured or semi-structured data as an input.
   Data Warehouse: It only takes structured data as an input.
4. Big Data: Big data does processing by using distributed file system.
   Data Warehouse: Data warehouse doesn’t use distributed file system for processing.
5. Big Data: Big data doesn’t follow any SQL queries to fetch data from database.
   Data Warehouse: In data warehouse we use SQL queries to fetch data from relational databases.
6. Big Data: Apache Hadoop can be used to handle enormous amount of data.
   Data Warehouse: Data warehouse cannot be used to handle enormous amount of data.
7. Big Data: When new data is added, the changes in data are stored in the form of a file which is represented by a table.
   Data Warehouse: When new data is added, the changes in data do not directly impact the data warehouse.
8. Big Data: Big data doesn’t require efficient management techniques as compared to data warehouse.
   Data Warehouse: Data warehouse requires more efficient management techniques as the data is collected from different departments of the enterprise.

Analytics Progression


Conceptual DW/BI and Big Data Architecture

- Selection, installation and configuration of a Big Data environment requires specialized expertise

- Develop and rationalize end-to-end architecture using:
• Data exploratory tools
• New acquisitions

In a Big Data environment, data is ingested and loaded before it is integrated (extract, LOAD, transform).
By contrast, in a data warehouse, data is integrated as it is brought into the warehouse (extract, TRANSFORM, load).
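The difference in ordering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample records, the field names, and the helper names `transform`, `etl` and `elt` are hypothetical and do not come from the DMBOK2.

```python
# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"id": "1", "amount": " 10.50 "},
    {"id": "2", "amount": "n/a"},
    {"id": "3", "amount": "7"},
]

def transform(record):
    """Clean one record; return None if a field cannot be parsed."""
    try:
        return {"id": int(record["id"]), "amount": float(record["amount"].strip())}
    except ValueError:
        return None

def etl(source):
    """Warehouse style: transform first, load only conformed rows."""
    warehouse = []
    for record in source:            # extract
        clean = transform(record)    # transform
        if clean is not None:
            warehouse.append(clean)  # load
    return warehouse

def elt(source):
    """Big Data style: load everything as-is, transform later when reading."""
    lake = list(source)              # extract + load (raw copy)

    def read_view():
        # transform happens at read time, against the raw copy
        return [c for c in map(transform, lake) if c is not None]

    return lake, read_view

warehouse = etl(RAW)
lake, read_view = elt(RAW)
```

Here `warehouse` holds only the two rows that parsed, while `lake` keeps all three raw rows and applies the same cleaning only when `read_view()` is called.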
Game: Match the following?
1. What is a Data Lake?
2. What is Machine Learning?
3. What is Sentiment Analysis?
4. What is Data Mining?
5. What is Text Mining?
6. What is Predictive Analytics?
7. What is Prescriptive Analytics?

a. Anticipates what will happen, when it will happen and implies why it will happen
b. Analysis that reveals patterns in data using various algorithms
c. Analyzes documents with text analysis and data mining technologies to classify content automatically
d. Development of probability models based on variables using historical data
e. Uses NLP (Natural Language Processing) to detect sentiment and reveal changes in sentiment to predict possible scenarios
f. Environment where a vast amount of data of various types and structures can be ingested, stored, assessed and analyzed
g. Explores construction and study of learning algorithms

Answers: 1f, 2g, 3e, 4b, 5c, 6d, 7a


Game: What do I stand for?
Services-based architecture (SBA) is a way to provide immediate data as well as update a
historical data set
The speed layer is referred to as an ODS; all transactions are updates (or inserts only if required)

Speed Layer?

Batch Layer?

Serving Layer?

Every transaction is an insert
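The three layers can be sketched as a toy example. This is an illustration, not DMBOK2 text: the class names, the `write` and `total` methods, and the store key are hypothetical. It assumes the append-only clue ("every transaction is an insert") describes the batch layer, that the speed layer holds only data that arrived since the last batch load and updates it in place, and that the serving layer joins the two.

```python
class BatchLayer:
    """Append-only history: every transaction is an insert."""

    def __init__(self):
        self.rows = []

    def write(self, key, value):
        self.rows.append((key, value))

    def total(self, key):
        return sum(v for k, v in self.rows if k == key)


class SpeedLayer:
    """Recent data only: one running value per key, updated in place."""

    def __init__(self):
        self.current = {}

    def write(self, key, value):
        self.current[key] = self.current.get(key, 0) + value

    def total(self, key):
        return self.current.get(key, 0)


class ServingLayer:
    """Joins batch and speed results into one answer for consumers."""

    def __init__(self, batch, speed):
        self.batch = batch
        self.speed = speed

    def total(self, key):
        return self.batch.total(key) + self.speed.total(key)


batch, speed = BatchLayer(), SpeedLayer()
serving = ServingLayer(batch, speed)

batch.write("store-1", 100)   # historical sales, already batch-loaded
batch.write("store-1", 50)
speed.write("store-1", 5)     # sales that arrived after the last batch load
speed.write("store-1", 7)

combined = serving.total("store-1")
```

The batch layer ends up with two rows for the store, while the speed layer keeps a single updated value; the serving layer answers with the sum of both.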


Game: What do I stand for?
Machine learning explores the construction and study of learning algorithms.
These algorithms fall into 3 types. What are they?

Supervised Learning, Unsupervised Learning, Reinforcement Learning

Supervised Learning: Based on generalized rules, e.g., separating SPAM from non-SPAM email

Unsupervised Learning: Based on identifying hidden patterns (Data Mining)

Reinforcement Learning: Based on achieving a goal (beating an opponent at chess)
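As one concrete case of unsupervised learning, K-means clustering (listed later in this deck among the in-database algorithms) finds groups in unlabeled data. The sketch below is a simplified one-dimensional version under stated assumptions: the function name `kmeans_1d`, the sample purchase amounts, and the caller-supplied starting centroids (used to keep the run deterministic) are all illustrative choices.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up purchase amounts with no labels attached.
amounts = [5, 6, 7, 50, 52, 55]
centroids, clusters = kmeans_1d(amounts, centroids=[0, 60])
```

No one tells the algorithm which purchases are "small" or "large"; the two groups emerge from the data alone, which is the hidden-pattern idea named on the slide.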


Big Data Process
1. Define Big Data Strategy & Business Need(s):
Define requirements that identify desired outcomes with measurable tangible benefits
2. Choose Data Source(s):
Identify gaps in the current data asset and find data sources to fill those gaps
3. Acquire & Ingest Data Source(s):
Obtain data sources and onboard them
4. Develop Data Science Hypothesis(es) & Methods:
Explore data sources, refine requirements, define model inputs, types or model hypotheses
5. Integrate/Align Data For Analysis:
Model feasibility depends on the quality of the data source; leverage trusted and credible sources
6. Explore Data Using Models:
Apply statistical analysis and Machine Learning algorithms against the integrated data. Validate, train and, over time, evolve the model
7. Deploy and Monitor:
Deploy models to production for ongoing monitoring of value and effectiveness


Big Data Science Activities
1. Define Big Data Strategy & Business Need(s)
• “Define business requirements that identify desired outcomes with measurable tangible benefits”
• Start with a “problem statement” and/or a hypothesis: “Why are diaper sales down, and in what markets or areas?”

2. Choose Data Source(s)

• “Identify gaps in the current data asset base and find data sources or sets to fill the gaps.”
• Determine What Data Would Be Needed To Understand Problem
External: US Census (Demographics), US Natality (birth rates), IRS, …
Internal: Sales (POS - RSI), Distribution (SAP), Trade (Promax), …

3. Acquire & Ingest Data Source(s)

• “Secure (purchase or obtain) data sets and onboard them into your environment”

4. Develop Data Science Hypothesis(es) & Methods

• “Explore data sources using profiling, visualization, mining, or other methods to understand this data and then refine
Theory(ies) and define model algorithm inputs, types, or test methods of analysis”
• Review the data to understand its value, composition and key relationships
“What data sets have what time intervals and how do they sync up together”


Big Data Science Activities
5. Integrate/Align Data For Analysis
• Model feasibility depends on the quality and matching of the source data sets: the higher quality/match of the data, the
more likely the model will succeed. Leveraging trusted and credible sources and applying appropriate data integration and
cleansing techniques can increase usefulness of data sets.

6. Explore Data Using Models

• Exploration involves applying statistical analysis (models) and machine learning (AI) algorithms against the integrated data.
The model is constructed, evaluated against training sets, and validated. Training the model is also critical to its effectiveness.
Training entails repeated runs of the model against actual data to verify assumptions and make adjustments, such as identifying
outliers and selecting different variables.
• Through this process, hypotheses will be refined. Initial feasibility / viability metrics can guide evolution of the model.
New hypotheses may be introduced that require additional data sets and results of this exploration will shape the future
modeling and outputs.

7. Deploy and Monitor

• Those models that produce useful information can be deployed to production for ongoing monitoring of value and
effectiveness. Oftentimes data science projects turn into data warehousing projects where more vigorous development
processes are put in place (ETL, DQ, Master Data, etc.).
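Step 6 above (construct the model, run it against actual data, then adjust for outliers) can be illustrated with a small sketch. Everything here is an assumption for illustration: the data is made up, the model is a simple least-squares line, and the residual threshold of 10 is arbitrary.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute gap between predictions and actual values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Made-up pairs of (promotion spend, units sold); the last pair is a deliberate outlier.
data = [(1, 12), (2, 14), (3, 16), (4, 18), (5, 20), (6, 22), (7, 24), (8, 90)]

# Construct the model on one part of the data, validate on the held-out rest.
train, holdout = data[:5], data[5:]
model = fit_line(*zip(*train))
error_before = mean_abs_error(model, *zip(*holdout))

# Adjustment: flag held-out points whose residual exceeds the chosen threshold.
slope, intercept = model
outliers = [(x, y) for x, y in holdout if abs(y - (slope * x + intercept)) > 10]
kept = [p for p in holdout if p not in outliers]
error_after = mean_abs_error(model, *zip(*kept))
```

Comparing `error_before` with `error_after` shows how a single outlier can dominate a validation metric, which is why the slide lists identifying outliers among the adjustments made during training.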


Tools and Techniques


Tools
• MPP (Massively Parallel Processing): Data is partitioned across multiple processing servers

• Distributed File-Based Databases
  E.g.: Open-source Hadoop

• In-database algorithms
  E.g.: K-means Clustering, Linear Regression, Conjugate Gradient, Cohort Analysis

• Big Data Cloud Solutions

• Statistical Computing and Graphical Languages
  E.g.: R

• Data Visualization Tools
  E.g.: Radar Charts, Coordinate Plots, Tag Charts, Heat Maps
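The MPP idea from the first bullet, partitioning data across servers so each works on its own share, can be sketched on a single machine. This is a toy illustration under stated assumptions: threads stand in for separate processing servers, and the function names `partition`, `local_sum` and `mpp_sum` and the sample rows are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def partition(rows, n_nodes):
    """Assign each row to a node by hashing its key (stable across runs)."""
    nodes = [[] for _ in range(n_nodes)]
    for key, value in rows:
        nodes[zlib.crc32(key.encode()) % n_nodes].append((key, value))
    return nodes


def local_sum(rows):
    """Work done independently on one node: totals for the keys it owns."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def mpp_sum(rows, n_nodes=3):
    """Partition the rows, aggregate each partition in parallel, merge results."""
    nodes = partition(rows, n_nodes)
    # Threads stand in for separate servers in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = {}
    for partial in partials:
        merged.update(partial)  # a key lives on exactly one node, so no overlap
    return merged


rows = [("east", 10), ("west", 5), ("east", 7), ("north", 1), ("west", 3)]
totals = mpp_sum(rows)
```

Because rows are partitioned by key, each node can finish its totals without talking to the others, and the final merge is a simple union.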


Techniques
Analytic Modelling
• Learn by example through training the model
• Types of analysis associated with analytic models:
  - Descriptive Modelling
  - Explanatory Modelling

Big Data Modelling
• Technical challenge, but critical if the organization wants to describe and govern the data
• Data modelling while accounting for the variety of sources

Techniques
Descriptive Modelling: Represents data structures in a compact manner

Does not validate a causal hypothesis or predict outcomes; uses algorithms to define or refine
relationships across variables

Explanatory Modelling: Application of statistical models to data for testing causal hypotheses about
theoretical constructs

Does not predict outcomes; matches model results only with existing data


Implementation Best Practices & Guidelines
Aligning the business to an implementation plan is the key to success

a. Strategy Alignment
• Strategically aligned with organizational objectives

b. Readiness Assessment / Risk Assessment

• Business Relevance (Align initiatives with company’s business)
• Business Readiness (Commitment to knowledge centre, Skillset Gap)
• Economic Viability (Ownership Costs, Benefits: Tangible and Intangible)
• Prototype (Prototype for a subset of the end-user community for a finite timeframe)

c. Organization & Cultural Change

• New roles and responsibilities are required to implement
• These roles are in addition to the existing data management/BI roles
   - Big Data Platform Architect
   - Ingestion Architect
   - Metadata Specialist
   - Analytic Design Lead
   - Data Scientist

Big Data Governance
Big Data, like other data, requires governance.
Need to consider business and technical controls addressing below questions:

• Sourcing: What to source, when to source, what is the best source of data for a particular study.
• Sharing: What data sharing agreements and contracts to enter into, terms and conditions both inside and
outside the organization
• Metadata: What the data means on the source side, how to interpret the results on the output side
• Enrichment: Whether to enrich the data, how to enrich the data, and what the benefits will be to do so
• Access: What to publish and to whom, how and when

Think Visualization Management, Visualization Standards, Data Security, Metadata, Data Quality, Metrics


STUDY GROUP MATERIALS
Study group presentations will be posted on CDMP Study Group page, on DAMA New England website, in the Schedule &
Agenda section.


Q&A
