big_data_and_data_science
SESSION 15
September 2, 2020
Nupur Gandhi, DAMA New England, VP Online Services
Email: nupurgandhi@gmail.com
AGENDA
• Facilitator
• Introductory Note
• Chapter 14: Big Data and Data Science
• Overview
• Q&A
• Next Session
CONTACT INFO:
EMAIL: nupurgandhi@gmail.com
PHONE: 860-712-6097
LINKEDIN: /IN/nupur gandhi
This study group is offered as a service of DAMA New England for DAMA New England
members. It is not an official, DAMA International authorized training course because DAMA-I has
not yet created an authorized trainer program.
The purpose of this group is to help prepare members to take the CDMP. We will do so by
reviewing the content of chapters of the DMBOK2.
The chapter makes no claims for the effectiveness of the sessions or the ability of participants to
pass the CDMP exam after having attended. In fact, you should plan on doing a lot of individual
study to pass the exam.
Goals:
1. Discover relationships between data and the business
2. Support the iterative integration of data source(s) from the enterprise
3. Discover and analyze new factors that might affect the business
4. Publish data using visualization techniques in an appropriate, trusted, and efficient manner
• Volatility (How often data changes occur and therefore how long the data is useful)
- Selection, installation and configuration of Big Data environment requires specialized expertise
- Develop and rationalize end to end architecture using:
• Data exploratory tools
• New acquisitions
In a Big Data environment, data is ingested and loaded before it is integrated (extract, LOAD, transform: ELT), vs.
In a data warehouse, data is integrated as it is brought into the warehouse (extract, TRANSFORM, load: ETL)
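The difference in ordering can be sketched in a few lines of Python. This is only an illustration, not a real pipeline: the records, the field names, and the helper functions `transform`, `etl`, and `elt` are all hypothetical.

```python
# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"id": "1", "amount": " 10.50 "},
    {"id": "2", "amount": "7"},
]

def transform(record):
    """Integrate/clean one record: cast types and strip whitespace."""
    return {"id": int(record["id"]), "amount": float(record["amount"].strip())}

def etl(source):
    """Data warehouse style: transform on the way in, store only clean rows."""
    return [transform(r) for r in source]

def elt(source):
    """Big Data style: load the raw data untouched, transform later when needed."""
    lake = list(source)                    # extract + load
    view = [transform(r) for r in lake]    # transform after loading
    return lake, view

warehouse = etl(RAW)
lake, view = elt(RAW)
```

Both paths end with the same cleaned rows, but only the ELT path still holds the raw records in `lake` for later, different transformations.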
New England Data Management Community
Game: Match the following
1. What is a Data Lake?
2. What is Machine Learning?
3. What is Sentiment Analysis?
4. What is Data Mining?
a. Anticipates what will happen, when it will happen and implies why it will happen
b. Analysis that reveals patterns in data using various algorithms
c. Analyzes documents with text analysis and data mining technologies to classify content automatically
Speed Layer?
Batch Layer?
Serving Layer?
Supervised Learning: Based on generalized rules, e.g. separating spam from non-spam email
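As a rough illustration of the spam example, the sketch below learns word counts from a handful of labeled messages and uses them to label new text. The training data and the `train` and `classify` helpers are made up for this example; a real spam filter would use a proper statistical model and far more data.

```python
from collections import Counter

# Hypothetical labeled training data: (label, text).
TRAINING = [
    ("spam", "win free money now"),
    ("spam", "free prize claim now"),
    ("ham", "meeting agenda for monday"),
    ("ham", "please review the agenda"),
]

def train(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for label, text in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Label text by which class has seen its words more often (ties go to ham)."""
    words = text.lower().split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

model = train(TRAINING)
```

The key point is that the labels ("spam" / "ham") are supplied up front, and the rule is generalized from them.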
• In-database algorithms
E.g.: K-means clustering, linear regression, conjugate gradient, cohort analysis
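To make one of these concrete, here is a minimal sketch of K-means clustering on one-dimensional points. It runs in plain Python rather than inside a database, and the function name `kmeans_1d`, the sample points, and the choice of starting centroids are assumptions for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: assignments no longer change the centroids
        centroids = updated
    return centroids, clusters

# Hypothetical sample data with two obvious groups.
POINTS = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = kmeans_1d(POINTS, centroids=POINTS[:2])
```

No labels are supplied here; the algorithm finds the two groups from the data alone.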
Does not validate a causal hypothesis or predict outcomes; uses algorithms to define or refine
relationships across variables
Explanatory Modelling: Application of statistical models to data for testing causal hypotheses about
theoretical constructs
Does not predict outcomes; matches model results only with existing data
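A small sketch of what matching model results with existing data can look like: fit a straight line to observations already in hand, then measure how well the line explains them. The data values and the `fit_line` helper are hypothetical, and ordinary least squares is used here only as one simple example of a statistical model.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x, with R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Compare the fitted values with the existing observations only.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical existing observations.
XS = [1, 2, 3, 4, 5]
YS = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_line(XS, YS)
```

An R-squared close to 1 says the hypothesized linear relationship is consistent with the data already collected; nothing here is used to forecast new values.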
a. Strategy Alignment
• Strategically aligned with organizational objectives
• Sourcing: What to source, when to source, what is the best source of data for a particular study.
• Sharing: What data sharing agreements and contracts to enter into, terms and conditions both inside and
outside the organization
• Metadata: What the data means on the source side, how to interpret the results on the output side
• Enrichment: Whether to enrich the data, how to enrich the data, and what the benefits will be to do so
• Access: What to publish and to whom, how and when
Think Visualization Management, Visualization Standards, Data Security, Metadata, Data Quality, Metrics