Data Pipelines From Zero To Solid
1
Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
Independent data engineering consultant
2
Presentation goals
● Overview of data pipelines for analytics / data products
● Target audience: Big data starters
○ Have seen word count; need everything around it
● Overview of necessary components & wiring
● Base recipe
○ In vicinity of state-of-practice
○ Baseline for comparing design proposals
● Subjective best practices - not a single truth
● Technology suggestions, (alternatives in parentheses)
3
Presentation non-goals
● Stream processing
○ High complexity in practice
○ Batch processing yields > 90% of value
● Technology enumeration or (fair) comparison
● Writing data processing code
○ Already covered en masse
4
Data product anatomy
[Diagram: ingress (services, DBs) feeds a unified log and data lake in cluster storage; the pipeline runs jobs over datasets; egress exports to DBs, services, and business intelligence - Ingress, ETL, Egress]
5
Computer program anatomy
[Diagram: input data (files, HID) flows along an execution path of functions, with variable data and lookup structures in RAM, to outputs (files, window) - Input, Process, Output]
6
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, lint tools
● Avoid anti-patterns: Global state, hard-coding location,
duplication, ...
In data engineering, slipping is the norm... :-(
Solved by mixing strong software engineers with data
engineers/scientists. Mutual respect is crucial.
7
Unified log
Immutable events, append-only, source of truth
[Diagram: services write to a replicated bus with history (the unified log); event collection is reliable, simple, and write-available for important services, and may be unreliable for unimportant ones; Secor/Camus dump the bus to cluster storage (HDFS, NFS, S3, Google CS, C*) in hourly buckets, e.g. clicks/2016/02/08/14, clicks/2016/02/08/15; bus-to-bus WAN mirroring between sites - expect delays]
[Diagram: how does service DB data get into cluster storage (HDFS, NFS, S3, Google CS, C*)?]
12
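As a sketch of the write-available event collection path, a service could append immutable events to the unified log roughly like this. The topic name, event fields, and broker address are illustrative, and the kafka-python client is assumed:

import json
import time

from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",          # hypothetical broker address
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
    acks="all",                              # wait for replication; the log is the source of truth
)

event = {                                    # an immutable fact; never updated or deleted
    "type": "click",
    "user_id": "34ac",
    "url": "http://site_a/",
    "timestamp_ms": int(time.time() * 1000),
}
producer.send("clicks", value=event)         # append-only
producer.flush()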
Anti-pattern: Send the oliphants!
● Sqoop (dump with MapReduce) from the production DB
● MapReduce from the production API
Hadoop / Spark == internal DDoS service
[Diagram: batch jobs hitting production services and DBs directly to fill cluster storage (HDFS, NFS, S3, Google CS, C*)]
Our preciousss
13
Deterministic slaves
[Diagram: production service DB → backup → restore on an offline slave → deterministic DB snapshot]
15
Event sourcing
[Diagram: event sourcing - the API emits events and the DB (DB') is derived from them; alternatively, stream DB transactions out, e.g. Postgres, MySQL -> Kafka]
17
DB snapshot lessons learnt
● Put fences between online and offline components
○ The latter can kill the former
● Team that owns a database/service must own exporting
data to offline
○ Protect online stability
○ Affects choice of DB technology
18
The data lake
Unified log + snapshots
● Immutable datasets
● Raw, unprocessed
● Source of truth from batch processing perspective
● Kept as long as permitted
● Technically homogeneous
[Diagram: the data lake lives in cluster storage]
19
Datasets
● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
○ Compatible schema
○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
○ Immutable
○ E.g. MobileAdImpressions(hour=”2016-02-06T13”)
20
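A minimal sketch of how the class/instance distinction could be modelled in Python; the class and field names are illustrative, not from the talk:

from dataclasses import dataclass

@dataclass(frozen=True)                      # immutable, like the dataset it names
class DatasetInstance:
    """Dataset instance = dataset class + parameters."""
    dataset_class: str                       # e.g. "MobileAdImpressions"
    parameters: tuple                        # e.g. (("hour", "2016-02-06T13"),)

impressions = DatasetInstance(
    dataset_class="MobileAdImpressions",
    parameters=(("hour", "2016-02-06T13"),),
)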
Representation - data lake & pipes
● Directory with multiple files
○ Parallel processing
○ Sealed with _SUCCESS (Hadoop convention)
○ Bundled schema format
■ JSON lines, Avro, Parquet
○ Avoid old, inadequate formats
■ CSV, XML
○ RPC formats lack bundled schema
■ Protobuf, Thrift
21
Directory datasets
Path components: privacy level / dataset class / schema version / instance parameters (partitions, Hive convention) / seal
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/_SUCCESS
part-00000.json
part-00001.json
[Diagram: data product anatomy again, annotated - change agility is important on the ingress side (services, DBs) and towards business intelligence]
25
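A sketch of building such a directory dataset path and checking its seal, following the privacy/class/version/partition convention above; the helper names are illustrative:

import os

def dataset_path(privacy, dataset_class, version, **partitions):
    """Build a directory dataset path: privacy level / dataset class /
    schema version / instance parameters in Hive convention."""
    parts = [f"{k}={v}" for k, v in partitions.items()]
    return "/".join([f"hdfs://{privacy}", dataset_class, version] + parts)

path = dataset_path("red", "pageviews", "v1",
                    country="se", year=2015, month=11, day=4)
# -> hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4

def is_sealed(local_dir):
    """A dataset is complete only when the _SUCCESS seal exists
    (Hadoop convention). Shown here for a locally mounted path."""
    return os.path.exists(os.path.join(local_dir, "_SUCCESS"))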
Batch processing
[Diagram: the pipeline produces an artifact of business value, e.g. a service index]
27
Language choice
● People and community thing, not a technical thing
● Need for simple & quick experiments
○ Java - too much ceremony and boilerplate
● Stable and static enough for production
○ Python/R - too dynamic
● Scala connects both worlds
○ Current home of data innovation
● Beware of complexity - keep it sane and simple
○ Avoid spaceships: <|*|> |@| <**>
28
Batch job
Job == function([input datasets]): [output datasets]
● No orthogonal concerns
○ Invocation
○ Scheduling
○ Input / output location
● Testable
● No other input factors
● No side-effects
● Ideally: atomic, deterministic, idempotent
29
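A sketch of the job-as-function idea: all inputs and outputs are passed explicitly, so the job is testable, deterministic, and free of side effects. The function and record fields are illustrative:

import json

def count_impressions(input_paths, output_path):
    """Job == function([input datasets]): [output datasets].
    No hidden inputs, no side effects beyond writing the output dataset."""
    counts = {}
    for path in input_paths:
        with open(path) as f:
            for line in f:                          # JSON lines records
                record = json.loads(line)
                key = record["campaign"]
                counts[key] = counts.get(key, 0) + 1
    with open(output_path, "w") as f:
        for campaign, n in sorted(counts.items()):  # sorted -> deterministic output
            f.write(json.dumps({"campaign": campaign, "count": n}) + "\n")
    return [output_path]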
Batch job class & instance
● Pipeline equivalent of Command pattern
● Parameterised
○ Higher order, c.f. dataset class & instance
○ Job instance == job class + parameters
○ Inputs & outputs are dataset classes
● Instances are ideally executed when input appears
○ Not on cron schedule
30
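A sketch of a parameterised job class in Luigi, the workflow manager used later in the deck; the task names and paths are illustrative:

import luigi


class MobileAdImpressions(luigi.ExternalTask):
    """Input dataset class, produced upstream; an instance = class + parameters."""
    hour = luigi.Parameter()                 # e.g. "2016-02-06T13"

    def output(self):
        # Illustrative local path; in practice a sealed directory dataset in cluster storage
        return luigi.LocalTarget(f"data/mobile_ad_impressions/hour={self.hour}/_SUCCESS")


class CountImpressions(luigi.Task):
    """Job class; a job instance == job class + parameters."""
    hour = luigi.Parameter()

    def requires(self):                      # inputs are dataset (task) classes
        return MobileAdImpressions(hour=self.hour)

    def output(self):
        return luigi.LocalTarget(f"data/impression_counts/hour={self.hour}/part-00000.json")

    def run(self):
        with self.output().open("w") as out:
            ...                              # pure transformation of the input dataset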
Pipelines
● Things will break
○ Input will be missing
○ Jobs will fail
○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
[Diagram: data lake → intermediate datasets → egress, all in cluster storage]
Testing options:
A: Run the job functions (f(), p()) on local test datasets with Python
file://test_input/ → file://test_output/
B: Customised workflow manager setup for testability
+ Tests workflow logic
+ More authentic
- Workflow mgr setup
- Difficult to debug
- Dataset handling
● Both can be extended with Kafka, egress DBs
37
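A sketch of option A - exercising the job function directly on tiny local datasets, e.g. with pytest. The module path is hypothetical and the job function is the count_impressions sketch from earlier:

import json

from my_pipe.jobs import count_impressions    # hypothetical module; the earlier sketch


def test_count_impressions(tmp_path):         # pytest's tmp_path fixture
    input_file = tmp_path / "part-00000.json"
    input_file.write_text(json.dumps({"campaign": "a"}) + "\n"
                          + json.dumps({"campaign": "a"}) + "\n")
    output_file = tmp_path / "out.json"

    count_impressions([str(input_file)], [str(output_file)][0])  # pure function: paths in, paths out

    records = [json.loads(l) for l in output_file.read_text().splitlines()]
    assert records == [{"campaign": "a", "count": 2}]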
Deployment
Hg/git repo → my-pipe-7.tar.gz (Luigi DSL, jars, config) → HDFS
All that a pipeline needs, installed atomically:
> pip install my-pipe-7.tar.gz
Redundant cron schedule, higher frequency + backfill (Luigi range tools):
* 10 * * * bin/my_pipe_daily --backfill 14
[Diagram: many workers run the pipeline; the Luigi daemon ensures each task instance runs only once]
38
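A minimal sketch of what the wrapper behind that cron line might do: re-request the last N days so that missed or failed runs self-heal. SalesReport is a hypothetical daily job class; Luigi's range tools can replace the manual loop:

#!/usr/bin/env python
# bin/my_pipe_daily (sketch): invoked redundantly by cron with --backfill N.
import argparse
import datetime

import luigi

from my_pipe.jobs import SalesReport         # hypothetical daily job class


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--backfill", type=int, default=14)
    args = parser.parse_args()

    today = datetime.date.today()
    tasks = [SalesReport(date=today - datetime.timedelta(days=n))
             for n in range(args.backfill)]
    luigi.build(tasks, workers=4)            # completed instances are skipped (idempotent)


if __name__ == "__main__":
    main()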
Continuous deployment
Hg/git repo → my-pipe-7.tar.gz (Luigi DSL, jars, config) → HDFS
○ One virtualenv per package/version
> virtualenv my_pipe/7
> pip install my-pipe-7.tar.gz
■ No need to sync
○ Cron invokes package/latest/bin/*
■ Old versions run pipelines to completion, then exit
39
Start lean: assess needs
Your data & your jobs:
A. Fit in one machine, and will continue to do so
B. Fit in one machine, but grow faster than Moore’s law
C. Do not fit in one machine
43
Data retention
● Remove old, promote derived datasets to lake
[Diagram: before and after - old raw datasets removed, derived datasets promoted into the data lake]
44
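A minimal sketch of such a retention sweep, assuming day-partitioned directories and a policy-driven window; paths are illustrative and a real job would run against cluster storage rather than a local mount:

import datetime
import shutil
from pathlib import Path

RETENTION_DAYS = 365                          # "kept as long as permitted" - policy, not disk space

def sweep(dataset_root):
    """Remove raw dataset instances older than the retention window.
    Assumes derived datasets have already been promoted to the lake."""
    cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)
    for day_dir in Path(dataset_root).glob("year=*/month=*/day=*"):
        year, month, day = (int(p.split("=")[1]) for p in day_dir.parts[-3:])
        if datetime.date(year, month, day) < cutoff:
            shutil.rmtree(day_dir)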
PII removal
● Split out PII, wash on user deletion
● Key on PII => difficult to wash
34ac,bobwhite
56bd,null
bobwhite,http://site_a/,2015-01-03T,Bath,uk
bobwhite,http://site_b/,2015-01-03T,Bath,uk
joeblack,http://site_c/,2015-01-03T,Bristol,uk
null,http://site_c/,2015-01-03T,Bristol,uk
48
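A sketch of the split: events are keyed on an opaque id and the id-to-PII mapping lives in a separate dataset that can be washed on user deletion without rewriting the large event dataset. Field order follows the example rows above; the surrogate ids are illustrative:

import uuid

def split_pii(event_rows):
    """Split the PII key (username) out of event rows.
    Returns (pii_mapping, keyed_events)."""
    pii_mapping = {}                          # surrogate id -> username; washable
    keyed_events = []
    user_to_id = {}
    for user, url, timestamp, city, country in event_rows:
        if user not in user_to_id:
            user_to_id[user] = uuid.uuid4().hex[:4]   # e.g. "34ac"
            pii_mapping[user_to_id[user]] = user
        keyed_events.append((user_to_id[user], url, timestamp, city, country))
    return pii_mapping, keyed_events

rows = [
    ("bobwhite", "http://site_a/", "2015-01-03T", "Bath", "uk"),
    ("bobwhite", "http://site_b/", "2015-01-03T", "Bath", "uk"),
    ("joeblack", "http://site_c/", "2015-01-03T", "Bristol", "uk"),
]
mapping, events = split_pii(rows)
# On deletion of joeblack: null out his entry in mapping; events stay untouched.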
Cloud or not?
+ Operations
+ Security
+ Responsive scaling
- Development workflows
- Privacy
- Vendor lock-in
Security?
● Afterthought add-on for big data components
○ E.g. Kerberos support
○ Always trailing - difficult to choose global paradigm
● Container security simpler
○ Easy with cloud
○ Immature with on-premise solutions?
50
Data pipelines example
[Diagram: raw datasets (users, views, sales) → derived datasets (views with demographics, sales reports, conversion analytics)]
51
Data pipelines team organisation
Manager:
- Explain what and why
- Facilitate process to determine how
- Enable, enable, enable
Devops:
- Always increase automation
- Enable, don't control
Protect production servers
[Diagram: service DB exported offline via commit log / backup / snapshot replicas]
56
PII privacy control
● Simplify with coarse classification (red/yellow/green)
○ Datasets, potentially fields
○ Separate production areas
● Log batch jobs
○ Code checksum -> commit id -> source code
○ Tag job class with classification
■ Aids PII consideration in code review
■ Enables ad-hoc verification
57
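A sketch of tagging job classes with a coarse classification so it is visible in code review and checkable ad hoc; the decorator and levels are illustrative:

RED, YELLOW, GREEN = "red", "yellow", "green"   # coarse PII classification

def pii_classification(inputs, output):
    """Tag a job class with the PII level of its inputs and output."""
    def tag(job_class):
        job_class.pii_inputs = inputs
        job_class.pii_output = output
        return job_class
    return tag

@pii_classification(inputs=[RED], output=YELLOW)
class PageviewsWithDemographics:
    """Reads raw (red) pageviews, emits pseudonymised (yellow) output."""
    ...

# Ad-hoc verification, e.g. no job may emit red data outside the red area:
assert PageviewsWithDemographics.pii_output != RED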
Audit
● Audit manual access
● Wrap all functionality in gateway tool
○ Log datasets, output, code used
○ Disallow download to laptop
○ Wrapper tool happens to be great for enabling data
scientists, too - shields them from operations.
58
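A sketch of such a gateway wrapper: every manual query goes through it, and who read which datasets, with which code, writing where, is logged before the work runs. The function and log sink are illustrative:

import getpass
import json
import logging
import subprocess
import time

audit_log = logging.getLogger("audit")        # shipped to a central, append-only sink


def run_audited(job_args, input_datasets, output_dataset):
    """Run a manual job via the gateway, logging who read what with which code."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    audit_log.info(json.dumps({
        "user": getpass.getuser(),
        "time": int(time.time()),
        "inputs": input_datasets,
        "output": output_dataset,             # must stay in cluster storage, not on a laptop
        "commit": commit,                     # checksum -> commit id -> source code
    }))
    return subprocess.call(job_args)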