LECTURE 01
Welcome to Data
Engineering!
April 10, 2025
Why Learn Data Engineering?
Data engineering is an essential ingredient of real-world data science projects.

The backbone, plumbing, or infrastructure that supports data science.

A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.

Data engineering is as essential as plumbing!
● When it works well, you don't realize it exists.
● When it doesn't, you'll really know.
Data Science vs. Data Engineering

Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean "rectangular" shape and fitting in main-memory, employing various statistical and ML algorithms on predefined objectives.
● From Data 100
● Also the view reinforced by "popular" Machine Learning, e.g., leaderboards, Kaggle, …

A lot of data engineering must happen to support the conventional view!

Nowadays, Data Science also involves Data Engineering:
A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating, organizing, representing, storing, and processing data.
● Messy (often non-rectangular), dynamic, and large datasets
● One team generates the data, another team consumes it
● Unclear and ill-defined objectives
● Necessary precursor to real-world data science & ML
● etc.
[1/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
● Most of the time spent in real-world data science projects involves data engineering:
○ cleaning, moving, restructuring, processing, …
● Often underappreciated compared to other activities, e.g., ML.
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.

"… 70% more open roles at companies in data engineering as compared to data science."
Mihail Eric, Jan 2021 [blog].

A new specialized job category:
● Data scientist: Uses various techniques in statistics & ML to process & analyze data.
● Data engineer: Develops a robust and scalable set of data processing tools/platforms.
(We won't be dogmatic about these industry-level role distinctions…)
[3/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.

"ML code" is not only a small fraction of the system; it is also often simple: calls to standard libraries (sklearn, pytorch, etc.)
Sculley et al., SE4ML 2014 [google research].
Data Engineering is Essential in ML/AI
"Under the strong influence of the current AI hype, people try to plug in data that's dirty & full of gaps, that spans years while changing in format and meaning, that's not understood yet, that's structured in ways that don't make sense, and expect those tools to magically handle it."
Monica Rogati, 2017 [blog].
NEW: Machine Learning Engineer
"ML Engineer": a specialization of data engineer focused on operationalizing ML.

"A need for a person that would reunite two warring parties… One being fluent just enough in both fields [Data Science and Software Engineering] to get the product up and running… ...taking data scientists' code and making it more effective and scalable. …"
Tomasz Dudek, 2018 [blog].

Scalability will be an important focus for us!
[4/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
4. Balance your data techniques with a systems perspective.

As a Data Science major, you are likely familiar with techniques: statistics/ML concepts & algorithms… but you are likely less familiar with systems.

● You will learn systems and the infrastructure that enables these techniques.
● You'll start thinking about efficiency and scalability, esp. on large datasets.
● Various "plumbing analogies": data pipelines, data flows, …
All these Data Systems!!!
2024 MAD (ML/AI/Data) Landscape
So… what is Data 101 about?
Data systems is a difficult subject! There are many, many data systems – too many for us to cover.
● In this class, we will try to cover the key categories and underlying principles.
● This way, you can make informed decisions about when to use what type of system.

2023 MAD (ML/AI/Data) Landscape: blog, interactive
Demystifying Industry Jargon
Data systems are tools that support data engineering.
(the same VC who made the MAD Landscape diagram)
What you mostly learn at Berkeley (e.g., DATA 100)
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Preparation → Use-Case-Specific: fit for purpose, self-service]

Data preparation example: research experiments. "Experts are close to the data and should be the ones extracting / analyzing."
Alternative picture, but more traditional enterprise
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Data Preparation → Use-Case-Specific (fit for purpose, self-service), and → Data Integration → Source of Truth (governed, secure, audited, managed)]

Data Integration example: UC Berkeley data (contracts, student info, grants, etc.) – must be centrally managed. "Compute is expensive, data is precious."
How does this actually happen? E, T, and L
Extract: Scrape raw data from all the source systems, e.g., transactions, sensors, log files, experiments, tables, bytestreams, …
Transform: Apply a series of rules or functions to wrangle data into schema(s)/format(s).
Load: Load data into a data storage solution.
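The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the log lines and file names are hypothetical, and a real extract would hit an API, file share, or bytestream.

```python
import csv

# Extract: scrape raw records from a source system
# (here, a hypothetical in-memory log standing in for a real source).
raw_logs = [
    "2025-04-10T12:00:00Z,click,user42",
    "2025-04-10T12:00:05Z,view,user17",
]

# Transform: apply rules to wrangle each record into a target schema.
def transform(line):
    ts, event, user = line.split(",")
    return {"timestamp": ts, "event": event, "user_id": user}

rows = [transform(line) for line in raw_logs]

# Load: write the conformed rows into a storage solution
# (a CSV file standing in for a warehouse table).
with open("events.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["timestamp", "event", "user_id"])
    writer.writeheader()
    writer.writerows(rows)
```

Everything in ETL is a variation on this shape; the engineering challenge is doing it reliably, at scale, for many sources at once.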
Traditional Single Source of Truth: Data Warehouses – through ETL
Entire organizations are centered around this ETL process!
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Transform → Load → Data Warehouse / Source of Truth (governed, secure, audited, managed)]

Extract or scrape from an API or log file, transform into a common schema/format, load in parallel to the "data warehouse."
ELT for Data Warehouses: A Newer Picture (e.g., Original Snowflake)
Load without doing a lot of transformation, with transformations done in SQL.
Faster to get going, and more scalable, but requires more data warehousing knowledge (& may be more expensive).
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Load → Data Warehouse / Source of Truth (governed, secure, audited, managed), with Transform happening inside the warehouse]
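The "transform in SQL, after loading" idea can be sketched with an in-memory SQLite database standing in for the warehouse (table and column names are hypothetical; a real setup would use Snowflake, BigQuery, etc.):

```python
import sqlite3

# A stand-in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")

# Load first: dump raw, barely-cleaned records into a staging table.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o1", " 19.99", "us"), ("o2", "5.00 ", "DE")],
)

# Transform afterwards, in SQL, inside the warehouse:
# trim whitespace, cast amounts to numbers, normalize country codes.
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(TRIM(country))       AS country
    FROM raw_orders
""")

print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

Note the contrast with ETL: the raw table lands in the warehouse untouched, and the cleanup happens where the data already lives.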
From Warehouses → … Lakes??? 💦
Data Warehouses are expensive:
● Warehouses expect some degree of structure.
● Transformation is costly – not necessarily just compute, but engineering time.

What about skipping the "data warehouse" entirely? No loading! Just "dump" the data in.
Let's be sloppy! Let's be …agile… Enter the data lake.

[Editorial note: Understand these terms, but do not try to make too much sense of why they came to be. Often just marketing…]
ET? For Data Lakes?
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Data Lake → Transform → Data Preparation → Use-Case-Specific (fit for purpose, self-service)]

No need to "load/manage" data. Data is dumped in cheaply and massaged as needed for various use-cases. Usually code-centric (Spark).
Why go through all this trouble?
Once data is "lost" (i.e., not saved, deleted, etc.) it cannot be recovered. So record everything.
Recreating history is exceptionally hard. You can't predict when a particular measurement will be crucial to understanding some situation.
How do you know something improved if you cannot measure change?
The Two Extremes

Data Warehouse, 1990s
● "Single source of truth": A central, organized repository of data used for analytics throughout an enterprise.
● Design the uber-schema up-front of all of the rectangular tables you'd ever want.
● Extract from trusted sources.
● Transform to warehouse schema using custom tools.
● Load data warehouse.
● Old-school ETL solution: Informatica.

Data Lake, 2010s
● Emerged during the Hadoop/Spark revolution.
● "Landing zone": unconstrained storage for any and all data.
● Data is then analyzed on demand.
● Extract into files/storage.
● Load into storage (easy!).
● Transform on demand for any use.
○ Create new files in the lake, cataloging files as they go for reuse.
○ Often code-centric.
Modern solution is likely Many-to-Many: ETLT
[Diagram: Raw Data (Transactions, Sensors, Log Files, Experiments) → Extract → Transform → Data Lake → Transform → Data Preparation / Integration → Use-Case-Specific (fit for purpose, self-service), and → Load → Transform → Data Warehouse / Source of Truth (governed, secure, audited, managed)]

Some datasets may be loaded directly into a data warehouse. Sometimes start with a data lake: empower data scientists to work on ad-hoc use cases, and allow for datasets that "graduate" to a carefully managed warehouse.

This class will focus a lot on T: Transform.
A Modern Buzzword for the Modern Solution: A Data Lakehouse (2020)
…but that was just the beginning…
As we move away from a "managed" data warehouse, there are other considerations we need to worry about…
Really, really important considerations
[Diagram: the same pipeline (Raw Data → Data Lake → Data Preparation / Integration → Use-Case-Specific and Data Warehouse / Source of Truth), now annotated with Data Discovery & Assessment and Data Quality & Integrity]
Important considerations

Data Discovery, Data Assessment
● Ad-Hoc: End-users land data, explore it, label it.
● Systematic: Crawl/index the data lake for files.
○ E.g., for CSV/JSON.
● Very content-centric: really a form of analytics/prediction.
○ Try to figure out what type of data you have.
● AI people!

Data Quality & Integrity
● Boolean integrity checks.
● Often specified by people, also "mined" by AI.
● Data changes ALL the time, especially from clients.
● Enforced: can "reject" or "sequester" data that violates checks.
○ E.g., no two products with the same product ID!
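The "no two products with the same product ID" constraint above can be sketched as an enforced check that sequesters violating rows instead of loading them (data and field names are hypothetical):

```python
# A Boolean integrity check, enforced at load time: rows that violate the
# uniqueness constraint are sequestered for inspection rather than loaded.
products = [
    {"product_id": "p1", "name": "widget"},
    {"product_id": "p2", "name": "gadget"},
    {"product_id": "p1", "name": "widget (dup)"},  # violates uniqueness
]

accepted, sequestered = [], []
seen_ids = set()
for row in products:
    if row["product_id"] in seen_ids:   # constraint: product IDs are unique
        sequestered.append(row)         # "sequester", don't silently drop
    else:
        seen_ids.add(row["product_id"])
        accepted.append(row)

print(len(accepted), len(sequestered))  # 2 accepted, 1 sequestered
```

Real systems express such checks declaratively (e.g., as SQL constraints or data-quality rules), but the reject-or-sequester decision is the same.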
Don't forget about Metadata!
Data alone is not enough; we also need context. Capture metadata, too!

Application Metadata:
● Data entities (e.g., students, courses, employees for a university)
● Relationships between data
● Constraints

Behavioral Metadata:
● Data Lineage – where did it come from?
● Audit Trails of Usage – who ran this job, and what did it do?

Change Metadata:
● Version info for all the above
● Timestamps
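One way to picture the three kinds of metadata is as a record attached to each dataset. This is a sketch with hypothetical field names, not any particular metadata store's schema:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Application metadata: what the data is about.
    entity: str                                       # e.g., "students"
    constraints: list = field(default_factory=list)   # e.g., "unique(student_id)"
    # Behavioral metadata: where it came from, who touched it.
    lineage: list = field(default_factory=list)       # upstream datasets/jobs
    audit_trail: list = field(default_factory=list)   # (who, job, when) entries
    # Change metadata: versions and timestamps.
    version: int = 1
    updated_at: str = ""

meta = DatasetMetadata(
    entity="students",
    constraints=["unique(student_id)"],
    lineage=["raw/registrar_export"],
    version=3,
    updated_at="2025-04-10T12:00:00Z",
)
```

A metadata store is, in essence, a database of such records, one per dataset, kept in sync as the data changes.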
Modern solutions
[Diagram: the same pipeline, now with a Metadata Store alongside Data Discovery & Assessment and Data Quality & Integrity]
Making Things Dynamic: Operationalization and Feedback
Operationalization: Everything is an ongoing feed!
● When do jobs kick off, and what do they do?
● How are tests registered, exceptions handled, people alerted?
● How do experiments "graduate" into processes?

Feedback: Every data "product" is of interest!
● Some are datasets in their own right. If you produce a table, that's also data!
● Many are new processes that generate new data feeds!
○ ML models: constantly yielding predictions.
■ Compare old predictions to new predictions?
Real life is messy
[Diagram: the full pipeline again: Raw Data → Data Lake → Data Preparation / Integration → Use-Case-Specific and Data Warehouse / Source of Truth, with Data Discovery & Assessment, Data Quality & Integrity, and a Metadata Store]