LECTURE 01
Welcome to Data
Engineering!
April 10, 2025
Why Learn Data Engineering?
Data engineering is an essential ingredient of real-world data science projects.

The backbone, plumbing, or infrastructure that supports data science.

A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.

Data engineering is as essential as plumbing!
● When it works well, you don't realize it exists.
● When it doesn't, you'll really know.
Data Science vs. Data Engineering

Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean "rectangular" shape and fitting in main-memory, employing various statistical and ML algorithms on predefined objectives.
● From Data 100
● Also the view reinforced by "popular" Machine Learning, e.g., leaderboards, Kaggle, …

A lot of data engineering must happen to support the conventional view!

Nowadays, Data Science also involves Data Engineering:
A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
● Messy (often non-rectangular), dynamic, and large datasets
● One team generates the data, another team consumes it
● Unclear and ill-defined objectives
● Necessary precursor to real-world data science & ML
● etc.
[1/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
● Most of the time spent in real-world data science projects involves data engineering:
○ cleaning, moving, restructuring, processing, …
● Often underappreciated compared to other activities, e.g., ML.
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.

"… 70% more open roles at companies in data engineering as compared to data science."
Mihail Eric, Jan 2021 [blog].

A new specialized job category:
● Data scientist: Uses various techniques in statistics & ML to process & analyze data.
● Data engineer: Develops a robust and scalable set of data processing tools/platforms.
(We won't be dogmatic about these industry-level role distinctions…)
[3/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.

"ML code" is not only a small fraction of the system; it is also often simple: calls to standard libraries (sklearn, pytorch, etc.)
Sculley et al., SE4ML 2014 [google research].
Data Engineering is Essential in ML/AI
"Under the strong influence of the current AI hype, people try to plug in data that's dirty & full of gaps, that spans years while changing in format and meaning, that's not understood yet, that's structured in ways that don't make sense, and expect those tools to magically handle it."
Monica Rogati, 2017 [blog].
NEW: Machine Learning Engineer
"ML Engineer": a specialization of data engineer focused on operationalizing ML.

"A need for a person that would reunite two warring parties… One being fluent just enough in both fields [Data Science and Software Engineering] to get the product up and running… ...taking data scientists' code and making it more effective and scalable. …"
Tomasz Dudek, 2018 [blog].

Scalability will be an important focus for us!
[4/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
4. Balance your data techniques with a systems perspective.

As a Data Science major, you are likely familiar with techniques: statistics/ML concepts & algorithms… but you are likely less familiar with systems.

● You will learn systems and the infrastructure that enables these techniques.
● You'll start thinking about efficiency and scalability, esp. on large datasets.
● Various "plumbing analogies": data pipelines, data flows, …
All these Data Systems!!!
2024 MAD (ML/AI/Data) Landscape
So… what is Data 101 about?
Data systems is a difficult subject! There are many, many data systems – too many for us to cover.
● In this class, we will try to cover the key categories and underlying principles.
● This way, you can make informed decisions about when to use what type of system.

2023 MAD (ML/AI/Data) Landscape: blog, interactive
Demystifying Industry Jargon
Data systems are tools that support data engineering.
(the same VC who made the MAD Landscape diagram)
What you mostly learn at Berkeley (e.g., DATA 100)
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Preparation → Use-Case-Specific: fit for purpose, self-service]

Data preparation example: research experiments. "Experts are close to the data and should be the ones extracting / analyzing."
Alternative picture, but more traditional enterprise
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Preparation → Use-Case-Specific (fit for purpose, self-service), and → Data Integration → Source of Truth (governed, secure, audited, managed)]

Data Integration example: UC Berkeley data (contracts, student info, grants, etc.) – must be centrally managed. "Compute is expensive, data is precious."
How does this actually happen? E, T, and L
Extract: Scrape raw data from all the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams, …
Transform: Apply a series of rules or functions to wrangle data into schema(s)/format(s).
Load: Load data into a data storage solution.
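The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the log lines and file names are hypothetical, and a real extract would hit an API, file share, or bytestream.

```python
import csv

# Extract: scrape raw records from a source system
# (here, a hypothetical in-memory log standing in for a real source).
raw_logs = [
    "2025-04-10T12:00:00Z,click,user42",
    "2025-04-10T12:00:05Z,view,user17",
]

# Transform: apply rules to wrangle each record into a target schema.
def transform(line):
    ts, event, user = line.split(",")
    return {"timestamp": ts, "event": event, "user_id": user}

rows = [transform(line) for line in raw_logs]

# Load: write the conformed rows into a storage solution
# (a CSV file standing in for a warehouse table).
with open("events.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "event", "user_id"])
    writer.writeheader()
    writer.writerows(rows)
```

Everything in ETL is a variation on this shape; the engineering challenge is doing it reliably, at scale, for many sources at once.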
Traditional Single Source of Truth: Data Warehouses – through ETL
Entire organizations are centered around this ETL process!
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Transform → Load → Data Warehouse / Source of Truth (governed, secure, audited, managed)]

Extract or scrape from an API or log file, transform into a common schema/format, load in parallel to the "data warehouse."
ELT for Data Warehouses: A Newer Picture (e.g., Original Snowflake)
Load without doing a lot of transformation, with transformations done in SQL.
Faster to get going, and more scalable, but requires more data warehousing knowledge (& may be more expensive).
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Load → Data Warehouse / Source of Truth (governed, secure, audited, managed), with Transform happening inside the warehouse]
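The "transform in SQL, after loading" idea can be sketched with an in-memory SQLite database standing in for the warehouse (table and column names are hypothetical; a real setup would use Snowflake, BigQuery, etc.):

```python
import sqlite3

# A stand-in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")

# Load first: dump raw, barely-cleaned records into a staging table.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o1", " 19.99", "us"), ("o2", "5.00 ", "DE")],
)

# Transform afterwards, in SQL, inside the warehouse:
# trim whitespace, cast amounts to numbers, normalize country codes.
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(TRIM(country))       AS country
    FROM raw_orders
""")

print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

Note the contrast with ETL: the raw table lands in the warehouse untouched, and the cleanup happens where the data already lives.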
From Warehouses → … Lakes??? 💦
Data Warehouses are expensive:
● Warehouses expect some degree of structure.
● Transformation is costly – not necessarily just compute, but engineering time.

What about skipping the "data warehouse" entirely? No loading! Just "dump" the data in.
Let's be sloppy! Let's be …agile… Enter the data lake.

[Editorial note: Understand these terms, but do not try to make too much sense of why they came to be. Often just marketing…]
ET? For Data Lakes?
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Data Lake → Transform → Data Preparation → Use-Case-Specific (fit for purpose, self-service)]

No need to "load/manage" data. Data is dumped in cheaply and massaged as needed for various use-cases. Usually code-centric (Spark).
Why go through all this trouble?
Once data is "lost" (i.e., not saved, deleted, etc.) it cannot be recovered. So record everything.
Recreating history is exceptionally hard. You can't predict when a particular measurement will be crucial to understanding some situation.
How do you know something improved if you cannot measure change?
The Two Extremes

Data Warehouse, 1990s
● "Single source of truth": A central, organized repository of data used for analytics throughout an enterprise.
● Design the uber-schema up-front of all of the rectangular tables you'd ever want.
● Extract from trusted sources.
● Transform to warehouse schema using custom tools.
● Load data warehouse.
● Old-school ETL solution: Informatica.

Data Lake, 2010s
● Emerged during the Hadoop/Spark revolution.
● "Landing zone": unconstrained storage for any and all data.
● Data is then analyzed on demand.
● Extract into files/storage.
● Load into storage (easy!).
● Transform on demand for any use.
○ Create new files in the lake, cataloging files as they go for reuse.
○ Often code-centric.
Modern solution is likely Many-to-Many: ETLT
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Transform → Data Lake → Transform → Data Preparation / Integration → Use-Case-Specific (fit for purpose, self-service), and → Load → Transform → Data Warehouse / Source of Truth (governed, secure, audited, managed)]

Some datasets may be loaded directly into a data warehouse. Sometimes start with a data lake: empower data scientists to work on ad-hoc use cases, and allow for datasets that "graduate" to a carefully managed warehouse.

This class will focus a lot on T: Transform.
A Modern Buzzword for the Modern Solution: A Data Lakehouse (2020)
…but that was just the beginning…
As we move away from a "managed" data warehouse, there are other considerations we need to worry about…
Really, really important considerations
[Diagram: the same pipeline (Raw Data → Data Lake → Data Preparation / Integration → Use-Case-Specific and Data Warehouse / Source of Truth), now annotated with Data Discovery & Assessment and Data Quality & Integrity]
Important considerations

Data Discovery, Data Assessment
● Ad-Hoc: End-users land data, explore it, label it.
● Systematic: Crawl/index the data lake for files.
○ E.g., for CSV/JSON.
● Very content-centric: really a form of analytics/prediction.
○ Try to figure out what type of data you have.
● AI people!

Data Quality & Integrity
● Boolean integrity checks.
● Often specified by people, also "mined" by AI.
● Data changes ALL the time, especially from clients.
● Enforced: can "reject" or "sequester" data that violates checks.
○ E.g., no two products with the same product ID!
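The "no two products with the same product ID" constraint above can be sketched as an enforced check that sequesters violating rows instead of loading them (data and field names are hypothetical):

```python
# A Boolean integrity check, enforced at load time: rows that violate the
# uniqueness constraint are sequestered for inspection rather than loaded.
products = [
    {"product_id": "p1", "name": "widget"},
    {"product_id": "p2", "name": "gadget"},
    {"product_id": "p1", "name": "widget (dup)"},  # violates uniqueness
]

accepted, sequestered = [], []
seen_ids = set()
for row in products:
    if row["product_id"] in seen_ids:   # constraint: product IDs are unique
        sequestered.append(row)         # "sequester", don't silently drop
    else:
        seen_ids.add(row["product_id"])
        accepted.append(row)

print(len(accepted), len(sequestered))  # 2 accepted, 1 sequestered
```

Real systems express such checks declaratively (e.g., as SQL constraints or data-quality rules), but the reject-or-sequester decision is the same.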
Don't forget about Metadata!
Data alone is not enough; we also need context. Capture metadata, too!

Application Metadata:
● Data entities (e.g., students, courses, employees for a university)
● Relationships between data
● Constraints

Behavioral Metadata:
● Data Lineage – where did it come from?
● Audit Trails of Usage – who ran this job, and what did it do?

Change Metadata:
● Version info for all the above
● Timestamps
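One way to picture the three kinds of metadata is as a record attached to each dataset. This is a sketch with hypothetical field names, not any particular metadata store's schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Application metadata: what the data is about.
    entity: str                                       # e.g., "students"
    constraints: list = field(default_factory=list)   # e.g., "unique(student_id)"
    # Behavioral metadata: where it came from, who touched it.
    lineage: list = field(default_factory=list)       # upstream datasets/jobs
    audit_trail: list = field(default_factory=list)   # (who, job, when) entries
    # Change metadata: versions and timestamps.
    version: int = 1
    updated_at: str = ""

meta = DatasetMetadata(
    entity="students",
    constraints=["unique(student_id)"],
    lineage=["raw/registrar_export"],
    version=3,
    updated_at="2025-04-10T12:00:00Z",
)
```

A metadata store is, in essence, a database of such records, one per dataset, kept in sync as the data changes.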
Modern solutions
[Diagram: the same pipeline, now with a Metadata Store alongside Data Discovery & Assessment and Data Quality & Integrity]
Making Things Dynamic: Operationalization and Feedback
Operationalization: Everything is an ongoing feed!
● When do jobs kick off, and what do they do?
● How are tests registered, exceptions handled, people alerted?
● How do experiments "graduate" into processes?

Feedback: Every data "product" is of interest!
● Some are datasets in their own right. If you produce a table, that's also data!
● Many are new processes that generate new data feeds!
○ ML models: constantly yielding predictions.
■ Compare old predictions to new predictions?
Real life is messy
[Diagram: the full pipeline again: Raw Data → Data Lake → Data Preparation / Integration → Use-Case-Specific and Data Warehouse / Source of Truth, with Data Discovery & Assessment, Data Quality & Integrity, and a Metadata Store]