
Big Data and Model Management
Knowledge objectives
1. Explain the data and model management challenges in a data science pipeline
2. Justify the need for a Data Lake for data management
3. Identify the difficulties of a Data Lake
4. Explain different model selection techniques
5. Justify the need for model management
6. Explain different model management techniques
Application Objectives
1. Given a use case, define its software architecture
2. Given a data science pipeline, manage the models and their associated data
Data Science
Overview and Challenges
Data science (I)
Collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data¹

• Traditionally, descriptive analysis
  • Data: focused on gathering data that are obviously valuable
    • E.g., business objects representing customers, items in catalogs, purchases, contracts, etc.
  • Common tasks: ETL from multiple data sources, creating joined views, followed by filtering, aggregation, or cube materialization
  • Feed report generators, front-end dashboards, and other visualization tools to support common "roll-up" and "drill-down" operations on multi-dimensional data
[Architecture diagram: data sources -> ETL -> data warehouse and data marts (storage layer) -> BI and reporting tools (presentation layer)]

¹ Source: Techopedia
Descriptive analysis (Use case)
Use case: an e-commerce company selling a wide range of products
• Data: customer data, their demographics, purchasing history
• Examine the frequency and distribution of purchases across different product categories
  • Discover: a significant portion of the customers frequently purchase electronics (e.g., smartphones and laptops)
• Analyze their demographic information
  • Discover: a large number of them belong to a young age group (18-35) and reside in urban areas
• Action: tailor marketing campaigns and product offerings to better cater to this specific group

¹ Source: Techopedia
Data science (II)
Collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data¹

• More recently, predictive analysis
  • Data: in addition, gather behavioral data from users (not obviously important)
    • E.g., web pages that users visit, links that they click on, etc.
  • Common tasks: use machine learning techniques to train predictive models
    • E.g., whether a piece of content is spam, whether two users should become "friends", etc.

¹ Source: Techopedia
Predictive analysis (Use case)
Use case: an e-commerce company selling a wide range of products
• Data: customer data, their demographics, purchasing history + browsing behavior
• Product recommendation
  • By analyzing customer purchasing history, browsing behavior, and similarities with other customers, the e-commerce company can create a recommendation engine that suggests relevant products to individual customers (e.g., based on the likelihood that a user will complete a purchase or be interested in a related product)
  • This personalized recommendation system can enhance the customer experience, increase customer engagement, and drive cross-selling or upselling opportunities

¹ Source: Techopedia
A typical data science workflow¹

[Workflow diagram:]
• Preparation (data-centric; 1. data management): acquire data; reformat and clean data
• Analysis (model-centric; 2. model management): edit analysis scripts; execute scripts; inspect outputs; debug
• Reflection: explore alternatives; make comparisons; take notes; hold meetings
• Dissemination: write reports; deploy online; archive experiment; share experiment

¹ Adapted from: Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, 2012.
A typical data science workflow¹: challenges
1. How to assign names to data files that are created or downloaded? How to organize those files into directories?
2. How to keep track of provenance?
3. Where does data come from and is it still up-to-date?
4. How to store data that cannot fit on a single hard drive?
5. How to fix semantic errors, missing entries, and inconsistent formatting? How to do it efficiently for large amounts of data?
6. How to integrate data?
7. …

¹ Adapted from: Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, 2012.
A typical data science experiment directory¹
A file listing from a computational biologist's experiment directory.

Often metadata such as version numbers, script parameter values, and even short notes are encoded in output filenames:
☺ Metadata stays attached to the file and remains highly visible
☹ Leads to data management problems

¹ Source: Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, 2012.
Therefore, in a real-world ML system¹ …

[Diagram: a small "ML Code" box dwarfed by the surrounding infrastructure boxes: Configuration, Data Collection, Feature Extraction, Data Verification, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, Monitoring]

Only a tiny fraction of the code is devoted to learning or prediction; much of the remainder may be described as plumbing.

¹ Source: D. Sculley et al. Hidden Technical Debt in Machine Learning Systems. NIPS 2015.
Data Management Backbone
From Warehouses to Lakehouses

Michael Armbrust, Ali Ghodsi, Reynold Xin, Matei Zaharia. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. CIDR 2021.
Data Management
Data management refers to the functionalities a DBMS must provide:
§ Ingestion: means provided to insert/upload data
  § E.g., ORACLE SQL*Loader
§ Storage: formats/structures used to persist data
  § E.g., hash, B-tree, heap file
§ Modelling: arrangement of data within the available structures
  § E.g., normalization, partitioning
§ Processing: means provided to manipulate data
  § E.g., PL/SQL
§ Querying/fetching: means provided to allow users to retrieve data
  § E.g., SQL, Relational Algebra

In Big Data settings, these are the same concepts, but assuming NoSQL underneath:
1. Typically, a distributed system
2. Possibly with an alternative data model to the relational one
3. Implementing ad-hoc architectural solutions
Data Management (cont.)
The majority of NoSQL databases were inspired by the necessity to be executed on clusters, which led to "aggregate" data models: data that are accessed together are stored together. NoSQL databases differ in how they structure the aggregate and how they allow it to be accessed (see the sketch below).

Source: https://k21academy.com/dba-to-cloud-dba/nosql-database-service-in-oracle-cloud
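As a sketch of what an aggregate looks like, consider a hypothetical e-commerce order stored as a single document in a document store (all names and values are illustrative):

```python
# A hypothetical "order" aggregate for a document store (MongoDB-style):
# everything typically accessed together (customer info, line items, total)
# is embedded in one document instead of being normalized across tables.
order = {
    "order_id": "o-1001",
    "customer": {"id": "c-42", "name": "Jane Doe", "city": "Barcelona"},
    "items": [
        {"product_id": "p-7", "name": "Laptop", "qty": 1, "price": 899.0},
        {"product_id": "p-9", "name": "Mouse", "qty": 2, "price": 19.5},
    ],
    "total": 938.0,
}
```

A key-value store would treat this whole document as an opaque value under the key "o-1001", while a document store can additionally query inside it (e.g., by customer city).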
1st gen.: Data Warehouses
• Aid the business via analytical insights
  • Extract data from operational databases
  • Transform and load them into centralized DWs (schema-on-write)
  • The data model is optimized for BI operations
• Some challenges
  • Compute and storage in on-premise appliances
    • Requires provisioning and paying for peak workloads / large datasets
  • Unstructured data
    • Video, audio, text documents
    • DWs cannot deal with these formats
Model-First (Load-Later)

[Diagram: a fixed target schema relating Product, Feature and User (Feature is part of Product; User assesses Feature and is interested in Product), with derived measures such as popularity, top/bottom feature, avg. rating, avg. sentiment, and keenness = avg(landing time)/#visits. It is fed by permanent transformations: Sentiment Analysis (e.g., text mining) over USER FEEDBACK from the Twitter API (JSON: user, tweet, date, location); Product homogenization (e.g., duplicate detection) over PRODUCT INFO from an in-house DB (PostgreSQL: product, product features); Log Analysis (e.g., process mining) over USER WEB BEHAVIOUR from web logs (user, product, landing time, visit timestamps)]
Drawbacks of Model-First (Load-Later)
• Costly: on-premise appliance, provisioned and paid for the peak of user load
• Unstructured data: more and more datasets are completely unstructured; DWs cannot cope well

[Same diagram as before, annotated with the pain points: a fixed target schema, permanent transformations, and high entry barriers]
2nd gen.: Data Lakes
• Idea: Load-First, Model-Later
  • Modelling at load time restricts the potential analyses that can be done later (Big Analytics)
• Characteristics:
  a) Store raw data
  b) Low-cost storage
  c) Create on-demand selection/processing to handle precise analysis needs
  d) Initiated by the Apache Hadoop movement
Load-First (Model-Later)

[Diagram: all sources, USER FEEDBACK from the Twitter API (JSON), PRODUCT INFO from an in-house DB (relational), and USER WEB BEHAVIOUR from web logs, are loaded raw into a data repository; analysts then derive data views on demand (e.g., Analyst 1 and Analyst 2 each build their own Product-Feature-User view with the derived measures)]
Drawbacks of Load-First (Model-Later)

[Same diagram, annotated with the pain points: without governance the repository degenerates into a data swamp, and every data view requires complex transformations; Stonebraker (2014)]
Towards semantic-awareness

[Diagram: the data repository becomes a semantic-aware data repository by adding a metadata catalog describing each source file: File 1 (Twitter API, JSON): user, tweet, date, location; File 2 (in-house DB, PostgreSQL): product, product features; File 3 (web logs): user, product, landing time, visit timestamps. The analysts' data views are derived from the catalog]

• Metadata provides semantics
  • Source schemata, mappings to views, parsing information, …
• Automation of the integration processes
From IT-Centered to User-Centered

[Same diagram: the metadata catalog and the derivation of data views are now driven by AUTOMATED DATA GOVERNANCE]
A possible implementation and instantiation

Sergi Nadal, Petar Jovanovic, Besim Bilalli, and Oscar Romero. Operationalizing and Automating Data Governance. Journal of Big Data, 2022.
Heterogeneity of data formats
• General-purpose formats
  • CSV (comma-separated values), JSON (JavaScript Object Notation), XML, Protobuf
  • CLI/API access to DBs, KV-stores, doc-stores, time series DBs, etc.
• Sparse matrix formats (see the reading sketch after this list)
  • Matrix Market: text IJV (row, col, value)
  • LIBSVM: text, compressed sparse rows
  • Scientific formats: NetCDF, HDF5
• Large-scale data formats
  • Parquet (columnar file format)
  • Arrow (cross-platform columnar in-memory data)
• Domain-specific formats
  • Health care: DICOM images, HL7 messages (Health Level Seven, XML)
  • Automotive: MDF (measurements), CDF (calibrations), ADF (auto-lead XML)
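As a small illustration of two of the sparse formats above, a sketch that reads a Matrix Market file with SciPy and a LIBSVM file with scikit-learn (the file names are hypothetical):

```python
from scipy.io import mmread                      # Matrix Market reader
from sklearn.datasets import load_svmlight_file  # LIBSVM reader

# Matrix Market: text IJV triples (row, col, value) -> sparse COO matrix
A = mmread("matrix.mtx")          # hypothetical file name
print(A.shape, A.nnz)             # dimensions and number of non-zeros

# LIBSVM: "label idx:val idx:val ..." lines -> sparse CSR matrix + labels
X, y = load_svmlight_file("train.libsvm")  # hypothetical file name
print(X.shape, y[:5])
```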
Difference between Parquet and CSV
• Columnar (hybrid) storage brings efficiency compared to row-based files like CSV
  • When querying, you can quickly skip over non-relevant data, and aggregation queries are less time-consuming -> hardware savings and minimized latency
• Built from the ground up to support advanced data structures
  • The layout is optimized for queries that process large volumes of data
• Supports flexible compression options and efficient encoding schemes

Dataset                              | Size on Amazon S3 | Query run time | Data scanned | Cost
Data stored as CSV files             | 1 TB              | 236 seconds    | 1.15 TB      | $5.75
Data stored in Apache Parquet format | 130 GB            | 6.78 seconds   | 2.51 GB      | $0.01
Savings                              | 87% less          | 34x faster     | 99% less     | 99.7% savings

Source: https://www.databricks.com/glossary/what-is-parquet
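To make the column-pruning and predicate-pushdown point concrete, a minimal sketch with pandas (assuming the pyarrow engine; the dataset and file names are illustrative):

```python
import pandas as pd

# Toy table standing in for a large dataset (illustrative only)
df = pd.DataFrame({
    "user": range(1_000),
    "category": ["electronics" if i % 4 == 0 else "other" for i in range(1_000)],
    "amount": [i * 0.5 for i in range(1_000)],
})
df.to_csv("sales.csv", index=False)
df.to_parquet("sales.parquet")  # requires the pyarrow (or fastparquet) package

# CSV: the whole file is parsed even if only one column is needed
csv_amounts = pd.read_csv("sales.csv")["amount"]

# Parquet: read only the needed column (column pruning) and push the
# row filter down into the file scan (predicate pushdown)
pq_amounts = pd.read_parquet(
    "sales.parquet",
    columns=["amount"],
    filters=[("category", "==", "electronics")],
)
```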
Parquet
• Row groups (RGs): horizontal partitions
• Data vertically partitioned (by column) within RGs
• Statistics kept per row group aid filtering
  • E.g., min-max
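These per-row-group statistics can be inspected directly; a sketch with pyarrow, reusing the file written above (a reader can skip any row group whose [min, max] range cannot satisfy a query predicate):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")  # file from the previous sketch
meta = pf.metadata
print(f"{meta.num_rows} rows in {meta.num_row_groups} row group(s)")

# Per-row-group, per-column min/max statistics
for rg in range(meta.num_row_groups):
    for c in range(meta.num_columns):
        col = meta.row_group(rg).column(c)
        if col.statistics is not None:
            print(rg, col.path_in_schema, col.statistics.min, col.statistics.max)
```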
3rd gen.: Cloud Data Lakes
• Cloud data lakes
  • Superior availability
  • Geo-replication
  • Low cost (pay-as-you-go and elasticity)
  • AWS S3, AWS Glacier, Azure Data Lake Storage (ADLS), Google Cloud Storage (GCS)
• Two-tier data architectures
  • Same as the 2nd gen., but the DW also resides on the cloud
  • AWS Redshift, Snowflake
Drawbacks of 3rd gen. Data Analytics Platforms
• Reliability
  • Keeping the DL and the DW consistent is difficult and costly
• Data staleness
  • Data in the DW is stale compared to that of the DL
  • This is a step backwards w.r.t. the 1st gen., where operational data was quickly available
• Limited support for advanced analytics
  • None of the leading ML systems (TensorFlow, PyTorch, XGBoost) work well on top of DWs
  • These systems need to process large datasets using complex non-SQL code
  • Accessing the DL instead, all the nice features of the DW are lost (transactions, data versioning, indexing, …)
• Total cost of ownership
  • You pay double for the stored data (in the DL and in the DW)
A 4th gen.: Lakehouses
• Question: "is it possible to turn data lakes based on standard open data formats, such as Parquet and ORC, into high-performance systems that can provide both the performance and management features of data warehouses and fast, direct I/O from advanced analytics workloads?"
• The Lakehouse
  • Reliable data management on data lakes
    • Not just a "bunch of files"
  • Support for ML and data science
    • Declarative DataFrame APIs
  • SQL performance
    • Optimize open data layouts to compete with DWs
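Lakehouse table formats such as Delta Lake implement this by adding a transaction log on top of open Parquet files. A minimal sketch with PySpark and open-source Delta Lake (session config and paths follow the Delta quickstart; treat them as illustrative):

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed (configs from the Delta quickstart)
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "laptop"), (2, "mouse")], ["id", "product"])

# Writing "delta" produces Parquet files plus a _delta_log transaction log,
# adding ACID transactions, schema enforcement, and versioning to the lake
df.write.format("delta").mode("overwrite").save("/tmp/products_delta")

# Time travel: read the table as of an earlier version
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/products_delta")
```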
Data Analysis Backbone
The need for Model Management

*Some of the following slides are borrowed from the 'Architectures of ML Systems' course by Matthias Boehm, TU Berlin
Traditional Software vs AI/Data Science¹

        | Software                 | AI (Software + Data)
Goal    | Functional correctness   | Optimization of a metric, e.g., minimize loss
Quality | Depends on code          | Depends on data, code, model architecture, hyperparameters, random seeds, …
Outcome | Works deterministically  | Changes due to data drift
People  | Software Engineers       | Software Engineers, Data Scientists, Data Engineers, Research Scientists, ML Engineers
Tooling | Usually standardized within a dev team; established/hardened over decades | Often heterogeneous even within teams; few established standards and in constant change due to open-source innovation

AI depends on code AND data, and requires many different roles to get involved.

¹ Clemens Mewald: Announcing Databricks Machine Learning, Feature Store, AutoML. Keynote, Data+AI Summit 2021.
The Data Analysis Backbone¹

[Same workflow diagram as before, annotated with personas: the Data/SW Engineer owns Preparation (1. data management); the Data Scientist owns Analysis (2. model management) and Reflection; the DevOps Engineer owns Dissemination]

¹ Adapted from: Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, 2012.
[Slide: the ML tooling landscape, a thriving ecosystem of innovation, but a procurement and DevOps nightmare!]
ML Lifecycle Management¹
• Data versioning
• Code versioning
• Model lifecycle management

MLOps = DataOps + DevOps + ModelOps

¹ Clemens Mewald: Announcing Databricks Machine Learning, Feature Store, AutoML. Keynote, Data+AI Summit 2021.
AutoML Overview¹
• #1 Model selection
  • Given a dataset and an ML task (e.g., classification or regression)
  • Select the model (type) that performs best (e.g., LogReg, Naïve Bayes, SVM, Decision Tree, Random Forest, DNN); see the sketch after this list
• #2 Hyperparameter tuning
  • Given a model and dataset, find the best hyperparameter values (e.g., learning rate, regularization, kernels, kernel params, tree params)
• Validation: generalization error
  • Goodness of fit to held-out data (e.g., an 80-20 train/test split)
  • Cross-validation (e.g., 10-fold cross-validation, leave-one-out)
• AutoML systems/services
  • Often provide both model selection and hyperparameter search
  • Integrated ML systems, often in distributed/cloud environments

¹ Chris Thornton, Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown: Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. KDD 2013.
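A minimal sketch of model selection by cross-validation with scikit-learn (the synthetic dataset is a stand-in for real data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

candidates = {
    "LogReg": LogisticRegression(max_iter=1_000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Estimate the generalization error with 10-fold cross-validation
# and keep the model type with the best mean accuracy
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```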
Basic Grid Search
• Basic approach
  • Given n hyperparameters λ1, …, λn with domains Λ1, …, Λn
  • Enumerate and evaluate the parameter space Λ ⊆ Λ1 × … × Λn (often a strict subset due to the dependency structure of the params)
  • Continuous hyperparameters -> discretization
    • Equi-width
    • Exponential (e.g., regularization 0.1, 0.01, 0.001, etc.)
• Problem: only applicable with small domains
• Heuristic: Monte Carlo (random search, anytime); see the sketch below
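A sketch of both approaches with scikit-learn, using an exponentially discretized grid and a random-search alternative (X, y from the model-selection sketch above):

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Exponential discretization of the continuous regularization parameter C
gs = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    {"C": [10.0, 1.0, 0.1, 0.01, 0.001]},
    cv=5,
)
gs.fit(X, y)
print("grid search:", gs.best_params_, gs.best_score_)

# Monte-Carlo heuristic: sample the log-scaled space at random; anytime
# behavior, since any budget of n_iter evaluations yields a usable answer
rs = RandomizedSearchCV(
    LogisticRegression(max_iter=1_000),
    {"C": loguniform(1e-4, 1e2)},
    n_iter=20,
    cv=5,
    random_state=0,
)
rs.fit(X, y)
print("random search:", rs.best_params_, rs.best_score_)
```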
Basic Grid Search, cont.
• Example: Adult dataset (train: 32,561 x 14)
  • Binary classification (income > 50K)
  • #1 MLogReg defaults with one-hot categoricals: accuracy (%) 82.35
  • #2 MLogReg defaults with one-hot + binning: accuracy (%) 84.73
  • #3 Grid search MLogReg: accuracy (%) 90.07

Grid-searched parameters and ranges:
params = list("icpt", "reg", "numBins")
paramRanges = list(seq(0,2), 10^seq(3,-6), 10^seq(1,4))
More sophisticated methods
• Simulated annealing
• Recursive random search
• Bayesian optimization¹
  • Sequential model-based optimization
  • Fit a probabilistic model based on the first n-1 evaluated hyperparameter configurations
  • Use the model to select the next candidate
  • Gaussian process (GP) models, or tree-based Bayesian optimization

[Figure: example 1D problem, showing a Gaussian process over 4 iterations]

¹ Eric Brochu, Vlad M. Cora, Nando de Freitas: A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR 2010.
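A minimal sketch of GP-based sequential model-based optimization with an expected-improvement acquisition, on a toy 1D objective (the objective and all settings are illustrative):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive evaluation, e.g., validation accuracy
    # as a function of a single hyperparameter x in [0, 1]
    return -(x - 0.6) ** 2

rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, 3).reshape(-1, 1)   # initial random evaluations
y_obs = objective(X_obs).ravel()
cand = np.linspace(0, 1, 200).reshape(-1, 1)  # candidate points

for _ in range(4):  # 4 iterations, as in the slide's figure
    # Fit a probabilistic (GP) model on the evaluations so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5)).fit(X_obs, y_obs)
    mu, sigma = gp.predict(cand, return_std=True)
    # Expected improvement over the best observation selects the next candidate
    best = y_obs.max()
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]
    X_obs = np.vstack([X_obs, [x_next]])
    y_obs = np.append(y_obs, objective(x_next))

print("best x:", float(X_obs[y_obs.argmax()][0]), "best value:", y_obs.max())
```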
Selected AutoML Systems
• Auto-WEKA
  • Bayesian optimization with 28 learners, 11 ensemble/meta methods
• auto-sklearn
  • Bayesian optimization with 15 classifiers, 14 feature preprocessors, 4 data preprocessors
• TuPAQ
  • Multi-armed bandits, large-scale
• TPOT
  • Genetic programming
• Other services
  • Azure ML, Amazon ML
  • Google AutoML, H2O AutoML
Model Management and Provenance

Overview: Model Management
• Motivation
  • Exploratory data science process -> trial and error (preparation, feature engineering, model selection)
  • Different personas (data engineer, ML expert, DevOps)
• Problems
  • No record of experiments, insights lost along the way
  • Difficult to reproduce results
  • Cannot search or query models
  • Difficult to collaborate
• Overview
  • Experiment tracking and visualization
  • Coarse-grained ML pipeline provenance and versioning
  • Fine-grained data provenance (data-/ops-oriented)
Model Management Systems (MLOps)
• ModelHub [Hui Miao, Ang Li, Larry S. Davis, Amol Deshpande: ModelHub: Deep Learning Lifecycle Management. ICDE 2017]
  • Versioning system for DNN models, including provenance tracking
  • DSL for model exploration and enumeration queries (model selection + hyperparameters)
  • Model versions stored as deltas
• ModelDB -> Verta.ai [Manasi Vartak, Samuel Madden: MODELDB: Opportunities and Challenges in Managing Machine Learning Models. IEEE Data Eng. Bull. 2018]
  • Model and provenance logging for ML pipelines via programmatic APIs
  • Support for different ML systems (e.g., spark.ml, scikit-learn, others)
  • GUIs for capturing metadata and metric visualization
Model Management Systems (MLOps), cont.
• MLflow [https://mlflow.org]
  • An open-source platform for the machine learning lifecycle
  • Works with existing ML systems and offers various language bindings
  • MLflow Tracking: logging and querying experiments
  • MLflow Projects: packaging/reproduction of ML pipeline results
  • MLflow Models: deployment of models in various services/tools
  • MLflow Model Registry: cataloging models and managing them in deployment

[Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Fen Xie, Corey Zumar: Accelerating the Machine Learning Lifecycle with MLflow. IEEE Data Eng. Bull. 41(4), 2018]
MLflow Example
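The original slide shows a tracking snippet; below is a minimal sketch of MLflow Tracking in the same spirit (the experiment name and hyperparameter value are illustrative; X and y come from the earlier sketches):

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

mlflow.set_experiment("adult-income")          # illustrative experiment name
with mlflow.start_run():
    C = 0.1
    model = LogisticRegression(C=C, max_iter=1_000).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    mlflow.log_param("C", C)                   # hyperparameter
    mlflow.log_metric("accuracy", acc)         # evaluation metric
    mlflow.sklearn.log_model(model, "model")   # serialized model artifact
```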
MLflow UI
• Run mlflow ui on the command line in the folder containing mlruns, then open http://localhost:5000
• Load a registered model (see the sketch below)

More details in https://docs.databricks.com/en/mlflow/
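A sketch of loading a registered model through the Model Registry URI scheme (the model name and version are hypothetical; the model must have been registered first, e.g., via mlflow.register_model or the UI):

```python
import mlflow.pyfunc

# "models:/<name>/<version>" resolves against the MLflow Model Registry
model = mlflow.pyfunc.load_model("models:/adult-income-clf/1")  # hypothetical name/version
preds = model.predict(X_te)  # X_te from the previous tracking sketch
```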


Closing