BDA: Big Data Architectures and Model Management
Knowledge objectives
1. Explain the data and model management challenges in a data science
pipeline
2. Justify the need for a Data Lake for data management
3. Identify the difficulties of a Data Lake
4. Explain different model selection techniques
5. Justify the need for model management
6. Explain different model management techniques
Application Objectives
1. Given a use case, define its software architecture
2. Given a data science pipeline, manage the models and their associated
data
Data Science
Overview and Challenges
Data science (I)
Collective processes, theories, concepts, tools and technologies that enable the review,
analysis and extraction of valuable knowledge and information from raw data1
1 Source: Techopedia
Descriptive analysis (Use case)
Use case: e-commerce company selling a wide range of products
• Data: Customer data, their demographics, purchasing history
• Traditionally, focused on gathering data that are obviously valuable
• E.g., business objects representing customers, items in catalogs, purchases, contracts, etc.
• Descriptive analysis:
• Examine the frequency and distribution of purchases across different product categories
• Discover: a significant portion of the customers frequently purchase electronics (e.g., smartphones and laptops)
• Analyze their demographic information
• Discover: a large number of them belong to a young age group (18-35) and reside in urban areas
• Common tasks: ETL from multiple data sources, creating joined views, followed by filtering, aggregation, or cube materialization
• Feed report generators, front-end dashboards, and other visualization tools to support common “roll-up” and “drill-down” operations on multi-dimensional data
• Action: tailor marketing campaigns and product offerings to better cater to this specific group
Tooling: Data warehouse, BI tools
Predictive analysis (Use case)
A typical data science workflow1
[Diagram: 1. Data management: acquire data, reformat and clean data (data-centric preparation); 2. Model management: edit analysis scripts, execute scripts, inspect outputs, debug, explore alternatives (model-centric analysis); then dissemination]
1 Adapted from: Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, 2012.
A typical data science workflow1: Challenges
1. How to assign names to data files that are created or downloaded? How to
organize those files into directories?
2. How to keep track of provenance?
3. Where does data come from and is it still up-to-date?
4. How to store data that cannot fit on a single hard drive?
5. How to fix semantic errors, missing entries, inconsistent formatting? How to do it
efficiently for large amounts of data?
6. How to integrate data?
7. …
[Diagram: dissemination steps include make comparisons, take notes, write reports, deploy online]
1 Adapted from: Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, 2012.
A typical data science experiment directory1
1 Source: Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, 2012.
Therefore, in a real world ML system1 …
[Diagram: the ML code is only a small fraction of the overall system; it is surrounded by configuration, data collection, data verification, feature extraction, machine resource management, analysis tools, process management tools, serving infrastructure, and monitoring]
1 Source: D. Sculley et al. Hidden Technical Debt in Machine Learning Systems. NeurIPS, 2015.
Data Management Backbone
From Warehouses to Lakehouses
Michael Armbrust, Ali Ghodsi, Reynold Xin, Matei Zaharia. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics. CIDR 2021
Data Management
Data management refers to the functionalities a DBMS must provide:
§ Ingestion: means provided to insert/upload data
§ E.g., ORACLE SQL*Loader
§ Storage: format/structures used to persist data
§ E.g., hash, B-tree, heap file
§ Modelling: arrangement of data within the available structures
§ E.g., normalization, partitioning
§ Processing: means provided to manipulate data
§ E.g., PL/SQL
§ Querying/fetching: means provided to allow users to retrieve data
§ E.g., SQL, Relational Algebra
In Big Data settings, they are the same concepts but assuming NoSQL underneath
1. Typically, a distributed system
2. Possibly with an alternative data model to the Relational one
3. Implementing ad-hoc architectural solutions
Data Management (cont.)
The majority of NoSQL databases were inspired by the necessity to be executed in clusters,
which led to “aggregate” data models: data that are accessed together.
They differ in how they structure the “aggregate” and how they allow it to be accessed.
source: https://k21academy.com/dba-to-cloud-dba/nosql-database-service-in-oracle-cloud
1st gen.: Data Warehouses
• Aid the business via analytical insights
• Extract data from operational databases
• Transform and load them into centralized DWs
• Schema-on-write
• The data model is optimized for BI operations
• Some challenges
• Compute and storage in on-premise appliances
• Requires provisioning and paying for peak workloads / large datasets
• Unstructured data
• Video, audio, text documents
• DWs cannot deal with these formats
Model-First (Load-Later)
[Diagram: three sources are ETLed into a fixed target schema. Sources: Twitter API (JSON), user feedback (user, tweet, date, location); Web Logs, user web behaviour (user, product, landing time, visits ts); In-house DB (PostgreSQL), product info (product, product features). Target schema: User, Interested In (avg(sentiment); keen: avg(landing time)/#visits), Product (popularity, top feature, bottom feature), Is part of]
Drawbacks of Model-First (Load-Later)
- Costly: on-premise appliance, provisioned and paid for the peak of user load
- Unstructured data: more and more datasets are completely unstructured; DWs cannot cope well
[Diagram: the same fixed target schema as before (Product, Interested In, User), highlighting that modelling at load time fixes the target schema up front]
2nd gen.: Data Lakes
• Idea: Load-First, Model-Later
• Modelling at load time restricts the
potential analysis that can be done
later (Big Analytics)
• Characteristics:
a) Store raw data
b) Low cost storage
c) Create on-demand
selection/processing to handle
precise analysis needs
d) Initiated by the Apache Hadoop
movement
Load-First (Model-Later)
[Diagram: the raw sources (Twitter API (JSON), user feedback; Web Logs, user web behaviour; In-house DB (relational), product info) are loaded as-is into a data repository. Analysts then build data views on demand: Analyst 1 derives Product (popularity, top feature, bottom feature), Interested In (avg(sentiment); keen: avg(landing time)/#visits), User (avg rating, list of preferences); Analyst 2 derives User (avg rating, list of preferences), Assesses (avg(sentiment)), Product, Is part of, Feature]
Drawbacks of Load-First (Model-Later)
[Diagram: the same data repository and analyst views as before, but without semantics the repository degenerates into a Data Swamp: each analyst must write complex transformations over the raw sources to build their data views]
Stonebraker (2014)
Towards semantic-awareness
[Diagram: the data repository becomes a semantic-aware data repository by adding a metadata catalog describing each file (File 1: user, tweet, date, location; File 2: user, product, landing time, visits ts; File 3: product, product features). Analysts still build their data views (Interested In, Assesses, Is part of), now guided by the catalog]
From IT-Centered to User-Centered
[Diagram: the same semantic-aware data repository with its metadata catalog; the catalog shifts the effort of building data views from IT to the analysts themselves]
A possible implementation
Sergi Nadal, Petar Jovanovic, Besim Bilalli, and Oscar Romero. Operationalizing and automating Data Governance. Journal of Big Data 2022.
A possible instantiation
Sergi Nadal, Petar Jovanovic, Besim Bilalli, and Oscar Romero. Operationalizing and automating Data Governance. Journal of Big Data 2022.
Heterogeneity of data formats
• General-Purpose Formats
• CSV (comma-separated values), JSON (JavaScript Object Notation), XML, Protobuf
• CLI/API access to DBs, KV-stores, Doc-stores, Time series DBs, etc.
• Sparse Matrix Formats
• Matrix Market: text IJV (row, col, value)
• LibSVM: text compressed sparse rows
• Scientific formats: NetCDF, HDF5
• Large-Scale Data Formats
• Parquet (columnar file format)
• Arrow (cross-platform columnar in-memory data)
• Domain-Specific Formats
• Health care: DICOM images, HL7 messages (Health Level Seven XML)
• Automotive: MDF (measurements), CDF (calibrations), ADF (auto-lead XML)
Difference between Parquet and CSV
• Columnar (hybrid) storage that brings efficiency compared to row-based files like CSV
• When querying, you can quickly skip over the non-relevant data, and aggregation queries are less time-consuming -> hardware savings and minimized latency
• Built from the ground up to support advanced data structures
• The layout is optimized for queries that process large volumes of data
• Supports flexible compression options and efficient encoding schemes
source: https://www.databricks.com/glossary/what-is-parquet
Parquet
• Row groups (RG) - horizontal partitions
• Data vertically partitioned within RGs
• Statistics per row group (aid filtering)
• E.g., min-max
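The min-max statistics above enable predicate pushdown: a reader can prove that a whole row group cannot match a filter and skip it entirely. A minimal sketch of that idea in pure Python (toy data and function names, not the actual Parquet reader):

```python
# Sketch (pure Python, hypothetical data): how per-row-group min-max
# statistics let a reader skip row groups that cannot match a filter.

ROW_GROUP_SIZE = 4

def build_row_groups(values, size=ROW_GROUP_SIZE):
    """Partition a column horizontally and record min-max stats per group."""
    groups = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        groups.append({"rows": chunk, "min": min(chunk), "max": max(chunk)})
    return groups

def scan_greater_than(groups, threshold):
    """Scan only row groups whose stats overlap the predicate `value > threshold`."""
    result, groups_read = [], 0
    for g in groups:
        if g["max"] <= threshold:      # stats prove no row qualifies -> skip I/O
            continue
        groups_read += 1
        result.extend(v for v in g["rows"] if v > threshold)
    return result, groups_read

groups = build_row_groups([1, 2, 3, 4, 10, 11, 12, 13, 90, 95, 99, 97])
matches, read = scan_greater_than(groups, 50)
# Only the last row group is read; the first two are skipped via min-max stats.
```

In real Parquet files the statistics live in the file footer per column chunk, so the skip decision needs only footer I/O.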
3rd gen.: Cloud Data Lakes
• Cloud Data Lakes
• Superior availability
• Geo-replication
• Low cost (pay as you go and elasticity)
• AWS S3, AWS Glacier, Azure Data Lake Storage (ADLS), Google Cloud Storage
(GCS)
• Two-tier data architectures
• Same as 2nd gen, but the DW also resides on the cloud
• AWS Redshift, Snowflake
Drawbacks of 3rd gen. Data Analytics Platforms
• Reliability
• Keeping the DL and the DW consistent is difficult and costly
• Data staleness
• Data in DW is stale compared to that of the DL
• This is a step backwards w.r.t. 1st gen where operational data was quickly available
• Limited support for advanced analytics
• None of the leading ML systems (TensorFlow, PyTorch or XGBoost) work well on
top of DWs
• These systems need to process large datasets using complex non-SQL code
• Accessing the DL, all nice features from the DW are lost (transactions, data versioning,
indexing,…)
• Total cost of ownership
• You pay double for the stored data (in the DL and in the DW)
A 4th gen.: Lakehouses
• Question: “is it possible to turn data lakes based on standard open data formats, such as Parquet and ORC, into high-performance systems that can provide both the performance and management features of data warehouses and fast, direct I/O from advanced analytics workloads?”
• The Lakehouse
• Reliable data management on data lakes
• Not just a “bunch of files”
• Support for ML and data science
• Declarative DataFrame APIs
• SQL performance
• Optimize open data layouts to compete with DWs
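“Not just a bunch of files” means a table snapshot is defined by a transaction log rather than by directory listing. A toy sketch of that idea in pure Python, inspired by Delta Lake's `_delta_log` design (all names here are hypothetical):

```python
# Toy sketch of reliable data management on a data lake: the current
# table state is computed by replaying an ordered log of committed
# add/remove file actions, not by listing whatever files exist.

log = []  # ordered list of commits (in real systems: numbered JSON files)

def commit(actions):
    """Atomically append one commit (a list of (op, path) actions)."""
    log.append(actions)

def snapshot():
    """Replay the log to compute the current set of live data files."""
    live = set()
    for actions in log:
        for op, path in actions:
            if op == "add":
                live.add(path)
            elif op == "remove":
                live.discard(path)
    return live

commit([("add", "part-0.parquet"), ("add", "part-1.parquet")])
commit([("remove", "part-0.parquet"), ("add", "part-2.parquet")])  # compaction

# snapshot() == {"part-1.parquet", "part-2.parquet"}
```

Because readers see only files referenced by a committed log entry, half-written files are invisible, which is what gives the lake DW-style transactional behaviour.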
Data Analysis Backbone
The need for Model Management
*Some of the following slides are borrowed from the ‘Architectures of ML Systems’ course of Matthias Boehm, TU Berlin
Traditional Software vs AI/Data Science1
          Software                          AI (Software + Data)
Goal      Functional correctness           Optimization of a metric, e.g., minimize loss
Quality   Depends on code                  Depends on data, code, model architecture, hyperparameters, random seeds, …
Outcome   Works deterministically          Changes due to data drift
People    Software Engineers               Software Engineers, Data Scientists, Data Engineers, Research Scientists, ML engineers
Tooling   Usually standardized within a    Often heterogeneous even within teams; few established
          dev team; established/hardened   standards and in constant change due to open source
          over decades                     innovation
1 Clemens Mewald: Announcing Databricks Machine Learning, Feature Store, AutoML, Keynote Data+AI Summit 2021.
The Data Analysis Backbone
[Diagram: the same data science workflow as before; the Data/SW Engineer owns 1. Data management (acquire data, reformat and clean data; preparation), while the Data Scientist owns 2. Model management (edit analysis scripts, execute scripts, inspect outputs, debug, explore alternatives) and dissemination]
1 Adapted from: Philip J. Guo. Software Tools to Facilitate Research Programming. Ph.D. dissertation, 2012.
[Slide: the ML tooling landscape is a thriving ecosystem of innovation, and a procurement and DevOps nightmare]
ML Lifecycle Management
[Clemens Mewald: Announcing Databricks Machine Learning, Feature Store, AutoML, Keynote Data+AI Summit 2021]
AutoML Overview
[Chris Thornton, Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. KDD 2013]
• #1 Model Selection
• Given a dataset and ML task (e.g., classification or regression)
• Select the model (type) that performs best
(e.g.: LogReg, Naïve Bayes, SVM, Decision Tree, Random Forest, DNN)
• #2 Hyperparameter Tuning
• Given a model and dataset, find the best hyperparameter values
(e.g., learning rate, regularization, kernels, kernel params, tree params)
• Validation: Generalization Error
• Goodness of fit to held-out data (e.g., 80-20 train/test)
• Cross validation (e.g., 10-fold cross validation, leave-one-out)
• AutoML Systems/Services
• Often providing both model selection and hyperparameter search
• Integrated ML system, often in distributed/cloud environments
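The cross-validation step above can be sketched in a few lines of pure Python. The learner here is deliberately trivial (a mean predictor) so the fold bookkeeping stays visible; the data and function names are illustrative:

```python
# Sketch (pure Python, toy data): estimating generalization error with
# k-fold cross validation. Each fold is held out once; the model is fit
# on the remaining folds and scored on the held-out fold.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    fold_size, folds, start = n // k, [], 0
    for i in range(k):
        extra = 1 if i < n % k else 0
        folds.append(list(range(start, start + fold_size + extra)))
        start += fold_size + extra
    return folds

def cross_val_error(ys, k=5):
    """Average held-out squared error of a mean predictor over k folds."""
    folds, errors = k_fold_indices(len(ys), k), []
    for held_out in folds:
        train = [ys[i] for i in range(len(ys)) if i not in held_out]
        prediction = sum(train) / len(train)          # "fit" on training folds
        fold_err = sum((ys[i] - prediction) ** 2 for i in held_out) / len(held_out)
        errors.append(fold_err)
    return sum(errors) / k

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
err = cross_val_error(ys, k=5)   # average held-out error across 5 folds
```

Leave-one-out is the special case k = n; in practice shuffled (or stratified) folds are used instead of contiguous ones.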
Basic Grid Search
• Basic approach
• Given n hyperparameters λ1, …, λn with domains Λ1, …, Λn
• Enumerate and evaluate the parameter space Λ ⊆ Λ1 × … × Λn
(often a strict subset due to the dependency structure of the params)
• Continuous hyperparameters -> discretization
• Equi-width
• Exponential (e.g., regularization 0.1, 0.01, 0.001, etc.)
• Problem: only applicable with small domains
• Heuristic: Monte-Carlo (random search, anytime)
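The enumeration above can be sketched directly with `itertools.product`. The objective function below is a made-up stand-in for a validation score, and the grid values are illustrative:

```python
# Sketch (pure Python, toy objective): enumerating a discretized
# hyperparameter grid Λ1 × Λ2 and keeping the best configuration.
from itertools import product

grid = {
    "reg": [0.1, 0.01, 0.001],   # exponential discretization
    "lr": [0.05, 0.10, 0.15],    # equi-width discretization
}

def validation_score(reg, lr):
    """Hypothetical validation accuracy; peaks at reg=0.01, lr=0.10."""
    return 1.0 - abs(reg - 0.01) - abs(lr - 0.10)

best_cfg, best_score = None, float("-inf")
for reg, lr in product(grid["reg"], grid["lr"]):   # enumerate Λ1 × Λ2
    score = validation_score(reg, lr)              # train + validate in practice
    if score > best_score:
        best_cfg, best_score = (reg, lr), score

# best_cfg == (0.01, 0.10)
```

The cost is the product of the domain sizes (here 3 × 3 = 9 evaluations), which is exactly why grid search only works for small domains and random search is the common anytime fallback.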
Basic Grid Search, cont.
• Example: Adult dataset (train 32,561 x 14)
• Binary classification (>50K)
• #1 MLogReg defaults with one-hot categoricals: Accuracy (%): 82.35
• #2 MLogReg defaults with one-hot + binning: Accuracy (%): 84.73
• #3 GridSearch MLogReg: Accuracy (%): 90.07
More sophisticated methods
• Simulated Annealing
• Recursive Random Search
• Bayesian Optimization
• Sequential Model-Based Optimization
• Fit a probabilistic model based on the first n-1 evaluated hyperparameters
• Use the model to select the next candidate
• Gaussian process (GP) models, or tree-based Bayesian Optimization
[Figure: example 1D problem, Gaussian process after 4 iterations]
[Eric Brochu, Vlad M. Cora, Nando de Freitas: A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR 2010]
Selected AutoML Systems
• Auto-WEKA
• Bayesian optimization with 28 learners, 11 ensemble/meta methods
• auto-sklearn
• Bayesian optimization with 15 classifiers, 14 feature prep, 4 data prep
• TuPAQ
• Multi-armed bandit and large-scale
• TPOT
• Genetic programming
• Other Services
• Azure ML, Amazon ML
• Google AutoML, H2O AutoML
Model Management and
Provenance
Overview Model Management
• Motivation
• Exploratory data science process -> trial and error
(preparation, feature engineering, model selection)
• Different personas (data engineer, ML expert, devops)
• Problems
• No record of experiments, insights lost along the way
• Difficult to reproduce results
• Cannot search or query models
• Difficult to collaborate
• Overview
• Experiment tracking and visualization
• Coarse-grained ML pipeline provenance and versioning
• Fine-grained data provenance (data-/ops-oriented)
Model Management Systems (MLOps)
[Hui Miao, Ang Li, Larry S. Davis, Amol Deshpande: ModelHub: Deep Learning Lifecycle Management. ICDE 2017]
ModelHub
• Versioning system for DNN models, including provenance tracking
• DSL for model exploration and enumeration queries (model selection + hyperparameters)
[Manasi Vartak, Samuel Madden:
Model Management Systems (MLOps), cont.
• MLflow [https://mlflow.org]
• An open source platform for the machine learning lifecycle
• Use of existing ML systems and various language bindings
• MLflow Tracking: logging and querying experiments
• MLflow Projects: packaging/reproduction of ML pipeline results
• MLflow Models: deployment of models in various services/tools
• MLflow Model Registry: cataloging models and managing in deployment
[Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski,
Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, Fen Xie, Corey Zumar:
Accelerating the Machine Learning Lifecycle with MLflow. IEEE Data Eng. Bull. 41(4) 2018]
MLflow Example
MLflow UI
• Run mlflow ui in the command line on top of the folder mlruns
• Open http://localhost:5000