Data Pipelines From Zero to Solid


Who’s talking?

• Swedish Institute of Computer Science (test tools)
• Sun Microsystems (very large machines)
• Google (Hangouts, productivity)
• Recorded Future (NLP startup)
• Cinnober Financial Tech. (trading systems)
• Spotify (data processing & modelling)
• Schibsted (data processing & modelling)
• Independent data engineering consultant

Presentation Goals

• Overview of data pipelines for analytics / data products


• Target audience: Big data starters
o Have seen word count examples, need the infrastructure around them
• Overview of necessary components & wiring
• Base recipe
o In vicinity of state-of-practice
o Baseline for comparing design proposals
• Subjective best practices - not single truth
• Technology suggestions, (alternatives)

Presentation Non-Goals

• Stream processing
o High complexity in practice
o Batch processing yields > 90% of value
• Technology enumeration or (fair) comparison
• Writing data processing code
o Already covered en masse

Data product anatomy


Computer program anatomy

Data pipeline = yet another program


Don’t veer from best practices

• Regression testing
• Design: Separation of concerns, modularity, etc.
• Process: CI/CD, code review, lint tools
• Avoid anti-patterns: Global state, hard-coded locations, duplication, ...

In data engineering, slipping on these is the norm... :-(
Solved by mixing strong software engineers with data engineers/scientists.
Mutual respect is crucial.

Event Registration

Immediate handoff to append-only replicated log. Once in the log, events eventually arrive in storage.

Asynchronous fire-and-forget handoff for unimportant data. Synchronous, replicated, with ack for important data.

Event Transportation

Log has long history (months+) => robustness end to end. Avoid risk of processing & decoration, except timestamps.

Event Arrival

Bundle incoming events into datasets

• Sealed quickly, thereafter immutable
• Bucket on arrival / wall-clock time
• Predictable bucketing, e.g. hour
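As a rough sketch of predictable wall-clock bucketing on arrival (the directory layout and function names below are illustrative assumptions, not from the slides):

    import datetime
    import json
    import os

    def bucket_path(base_dir, arrival_time):
        # Predictable hourly bucket keyed on arrival (wall-clock) time,
        # e.g. <base_dir>/2016-03-14T15/
        return os.path.join(base_dir, arrival_time.strftime("%Y-%m-%dT%H"))

    def append_event(base_dir, event):
        # Events are bucketed as they arrive; once the hour has passed,
        # the directory is sealed and treated as immutable.
        path = bucket_path(base_dir, datetime.datetime.utcnow())
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "part-00000.jsonl"), "a") as f:
            f.write(json.dumps(event) + "\n")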

Database State Collection


Source of truth sometimes lives in a database. Snapshot it to cluster storage. Easy on the surface.

Anti-pattern: Send the oliphants!

• Parallel reads from Hadoop / Spark == internal DDoS service against the database
Deterministic slaves

Restore backup to offline slave

+ Standard procedure
- Serial or resource consuming

Using snapshots

• join(event, snapshot) => always time mismatch
• Usually acceptable
• Some behaviour difficult to catch with snapshots
o E.g. user creates, then deletes account

Event sourcing

• Every change to unified log == source of truth
• snapshot(t + 1) = sum(snapshot(t), events(t, t+1))
• Allows view & join at any point in time
• Application services still need DB for current state lookup
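A minimal sketch of the snapshot recurrence above, assuming events are dicts with an id, a type, and a payload (these field names are invented for illustration):

    def next_snapshot(snapshot, events):
        # snapshot(t + 1) = sum(snapshot(t), events(t, t+1)):
        # fold one interval's events into the previous snapshot.
        state = dict(snapshot)   # leave the input snapshot immutable
        for event in events:     # events must be ordered by time
            if event["type"] == "delete":
                state.pop(event["id"], None)
            else:
                state[event["id"]] = event["payload"]
        return state

Replaying from an empty snapshot reproduces the state at any point in time, which is what makes views and joins at time t possible.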
Event sourcing, synced database

• Service interface generates events and DB transactions
• Generate stream from commit log
o Postgres, MySQL -> Kafka
• Build DB with stream processing

DB snapshot lessons learnt

• Put fences between online and offline components
o The latter can kill the former
• Team that owns a database/service must own exporting
o Data to offline
o Protect online stability
o Affects choice of DB technology

The data lake

Unified log + snapshots

• Immutable datasets
• Raw, unprocessed
• Source of truth from batch processing perspective
• Kept as long as permitted
• Technically homogeneous

Representation - data lake & pipes

• Directory with multiple files
o Parallel processing
o Sealed with _SUCCESS (Hadoop convention)
o Bundled schema format: JSON lines, Avro, Parquet
o Avoid old, inadequate formats
o RPC formats lack bundled schema

Directory datasets

Some tools, e.g. Spark, understand Hive naming conventions
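For illustration, a Hive-style partitioned layout and how Spark picks it up (a sketch in PySpark; the paths and partition columns are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("partition_example").getOrCreate()

    # Hive-style layout, e.g.:
    #   datalake/events/clean/year=2016/month=03/day=14/hour=15/part-00000.parquet
    #   datalake/events/clean/year=2016/month=03/day=14/hour=15/_SUCCESS
    # Spark infers year/month/day/hour as partition columns from the directory names.
    df = spark.read.parquet("datalake/events/clean")
    df.where("year = 2016 AND month = 3 AND hour = 15").count()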

Ingress / egress representation

Larger variation:

• Single file
• Relational database table
• Cassandra column family, other NoSQL
• BI tool storage
• BigQuery, Redshift, ...

Egress datasets are also atomic and immutable.

E.g. write full DB table / CF, switch service to use it, never change it.

Schemas

• There is always a schema
o Plan your evolution
• New field, same semantics == compatible change
• Incompatible schema change => new dataset class
• Schema on read - assumptions in code
o Dynamic typing
o Quick schema changes possible
• Schema on write - enumerated fields
o Static typing & code generation possible
o Changes must propagate down pipeline code
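A rough contrast of the two approaches (a sketch; the field names are invented). With schema on read, the assumptions live in job code; with schema on write, e.g. Avro, the fields are enumerated once and enforced when the dataset is produced:

    import json

    # Schema on read: the job code assumes which fields exist and what they mean.
    def parse_line(line):
        event = json.loads(line)
        return event["user_id"], event.get("country", "unknown")

    # Schema on write: an Avro schema enumerates the fields up front, so an
    # incompatible change is caught where the dataset is written, not three
    # jobs downstream.
    EVENT_SCHEMA = {
        "type": "record",
        "name": "PageView",
        "fields": [
            {"name": "user_id", "type": "string"},
            {"name": "country", "type": ["null", "string"], "default": None},
        ],
    }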

Schema on read or write?

Batch processing

Gradual refinement
1. Wash
o time shuffle, dedup
2. Decorate
o geo, demographic
3. Domain model
o similarity, clusters
4. Application model
o Recommendations
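As an example of the wash step, a sketch of dedup plus time shuffle in plain Python (the event_id and event_time fields are assumptions):

    def wash(events):
        # Dedup on a client-generated event id, then shuffle into event-time
        # order; arrival buckets use wall-clock time and may contain late or
        # duplicated events.
        seen = set()
        unique = []
        for event in events:
            if event["event_id"] not in seen:
                seen.add(event["event_id"])
                unique.append(event)
        return sorted(unique, key=lambda e: e["event_time"])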

Batch job code

• Components should scale up
o Spark, (Scalding, Crunch)
• And scale down
o More important!
o Component should support local mode
▪ Integration tests
▪ Small jobs - less risk, easier debugging
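One way to honour "scale down" with Spark: let the job take a master URL so integration tests run the same code in local mode (a sketch; the job name, paths, and dedup column are assumptions):

    from pyspark.sql import SparkSession

    def run_clean_events(input_path, output_path, master="local[*]"):
        # Same job code on a cluster (e.g. master="yarn") or entirely locally
        # in an integration test; only the master URL and paths differ.
        spark = SparkSession.builder.master(master).appName("clean_events").getOrCreate()
        try:
            df = spark.read.json(input_path)
            df.dropDuplicates(["event_id"]).write.parquet(output_path)
        finally:
            spark.stop()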

Language choice

• People and community thing, not a technical thing
• Need for simple & quick experiments
o Java - too much ceremony and boilerplate
• Stable and static enough for production
o Python/R - too dynamic
• Scala connects both worlds
o Current home of data innovation
• Beware of complexity - keep it sane and simple
o Avoid spaceships: <|*|> |@| <**>

Batch job

Job == function([input datasets]): [output datasets]

• No orthogonal concerns
• Testable
• No other input factors
• No side-effects
• Ideally: atomic, deterministic, idempotent
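A sketch of that contract in code: the job reads only its declared inputs, writes a fresh dataset instance, and publishes it atomically (paths and helper names are illustrative):

    import os
    import shutil

    def run_job(input_paths, output_path, transform):
        # Pure function of its input datasets: no other input factors and no
        # side-effects outside the output dataset.
        if os.path.exists(output_path):
            return  # idempotent: this dataset instance already exists
        tmp = output_path + ".tmp"
        os.makedirs(tmp, exist_ok=True)
        transform(input_paths, tmp)       # deterministic processing step
        open(os.path.join(tmp, "_SUCCESS"), "w").close()
        shutil.move(tmp, output_path)     # atomic publish via rename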

Batch job class & instance

• Pipeline equivalent of the Command pattern
• Parameterized
o Higher order, c.f. dataset class & instance
o Job instance == job class + parameters
o Inputs & outputs are dataset classes
• Instances are ideally executed when input appears
o Not on a cron schedule

Pipelines

• Things will break
o Input will be missing
o Jobs will fail
o Jobs will have bugs
• Datasets must be rebuilt
• Determinism, idempotency
• Backfill missing / failed
• Eventual correctness

Workflow manager

• Dataset “build tool”
• Run job instance when
o input is available
o output missing
o resources are available
• Backfill for previous failures
• DSL describes DAG
• Includes ingress & egress

Luigi, (Airflow, Pinball)

DSL DAG example (Luigi)

Expressive, embedded DSL - a must for ingress, egress

• Avoid weak DSL tools
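A minimal Luigi sketch of one node in such a DAG (the tasks, paths, and trivial copy body are placeholders for illustration, not the deck's original example):

    import luigi

    class IncomingEvents(luigi.ExternalTask):
        # Ingress dataset class: produced outside the pipeline (event arrival).
        hour = luigi.DateHourParameter()

        def output(self):
            return luigi.LocalTarget(self.hour.strftime("datalake/incoming/%Y-%m-%dT%H.jsonl"))

    class CleanEvents(luigi.Task):
        # Job class; an instance == job class + parameters.
        hour = luigi.DateHourParameter()

        def requires(self):
            # Upstream dataset class; Luigi builds/checks it first.
            return IncomingEvents(hour=self.hour)

        def output(self):
            return luigi.LocalTarget(self.hour.strftime("datalake/clean/%Y-%m-%dT%H.jsonl"))

        def run(self):
            with self.input().open("r") as fin, self.output().open("w") as fout:
                for line in fin:
                    fout.write(line)  # a real job would wash/dedup here

Because the DSL is plain Python, ingress and egress steps (DB exports, index loads) can be expressed as tasks in the same DAG.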


Egress datasets

• Serving
o Precomputed user query answers
o Denormalised
o Cassandra, (many)
• Export & Analytics
o SQL (single node / Hive, Presto)
o Workbenches (Zeppelin)
o (Elasticsearch, proprietary OLAP)
• BI / analytics tool needs change frequently
o Prepare to redirect pipelines

Test strategy considerations

• Developer productivity is the primary value of test automation
• Test at stable interface
o Minimal maintenance
o No barrier to refactorings
• Focus: single job + end to end
o Jobs & pipelines are pure functions - easy to test
• Component, unit - only if necessary
o Avoid dependency injection ceremony

Testing single job

Testing pipelines - two options

Both can be extended with Kafka, egress DBs.
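As an illustration of testing a single job at its stable interface (datasets in, dataset out), a pytest-style sketch; the job function clean_events_job and the file layout are hypothetical:

    import json

    def test_clean_events_job(tmp_path):
        # Arrange: a tiny input dataset instance on local disk.
        raw = tmp_path / "raw"
        raw.mkdir()
        (raw / "part-00000.jsonl").write_text(
            json.dumps({"event_id": "1", "event_time": 10}) + "\n")
        out = tmp_path / "clean"

        # Act: run the job as a pure function of its input dataset.
        clean_events_job(str(raw), str(out))

        # Assert on the output dataset, not on job internals.
        assert (out / "_SUCCESS").exists()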


Deployment

Continuous deployment

• Poll and pull latest on worker nodes
o virtualenv package/version
• No need to sync environment & versions
o Cron package/latest/bin/*
• Old versions run pipelines to completion, then exit

Start lean: assess needs

Your data & your jobs:

A. Fit in one machine, and will continue to do so
B. Fit in one machine, but grow faster than Moore’s law
C. Do not fit in one machine

• Most datasets / jobs: A
o Even at large companies with millions of users
• cost(C) >> cost(A)
• Running A jobs on C infrastructure is expensive

Lean MVP

• Start simple, lean, end-to-end
o No parallel cluster computations necessary?
o Custom jobs or local Spark/Scalding/Crunch
• Shrink data
o Downsample
o Approximate algorithms (e.g. Count-min sketch)
• Get workflows running
o Serial jobs on one/few machines
o Simple job control (Luigi only / simple work queue)
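For downsampling, a sketch of deterministic sampling by hashing the user id, so the same users are kept across datasets and across runs (the field name and percentage are assumptions):

    import hashlib

    def keep_user(user_id, percent=1):
        # Deterministic sample: hashing the id keeps the same users in every
        # dataset and every run, unlike random sampling.
        digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 100 < percent

    # usage: sampled = [e for e in events if keep_user(e["user_id"])]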

Scale carefully

• Get end-to-end workflows in production for evaluation
o Improvements driven by business value, not tech
• Keep focus small
o Business value
o Privacy needs attention early
• Keep iterations swift
o Integration test end-to-end
o Efficient code/test/deploy cycle
• Parallelise jobs only when forced

Protecting privacy in practice

• Removing old personally identifiable information (PII)
• Right to be forgotten
• Access control to PII data
• Audit of access and processing
• PII content definition is application-specific
• PII handling subject to business priorities
o But you should have a plan from day one

Data retention

• Remove old data, promote derived datasets to the lake


PII removal

• Must rebuild downstream datasets regularly
o In order for PII to be washed out within x days

Simple PII audit

• Classify PII level
o Name, address, messages, ...
o IP, city, ...
o Total # page views, ...
• Tag datasets and jobs in code
• Manual access through a gateway tool
o Verify permission, log the access
o Dedicated machines only
• Log batch jobs
• Deploy with CD only, log hg/git commit hash
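One way to tag datasets and jobs with a PII level in code (a sketch; the level names and class are illustrative, not from the slides):

    import enum

    class PIILevel(enum.Enum):
        SENSITIVE = 3   # name, address, messages, ...
        PERSONAL = 2    # IP, city, ...
        AGGREGATE = 1   # total # page views, ...

    class CleanEvents(object):
        # Tagged in code so audit tooling can list which jobs and datasets
        # touch which PII level, and gateway tools can enforce access.
        pii_level = PIILevel.SENSITIVE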

Parting words + sales plug

Keep things simple; batch, homogeneity & little state

Focus on developer code, test, debug cycle - end to end

Harmony with technical ecosystems

Little technology overlap with yesterday - follow leaders

Plan early: Privacy, retention, audit, schema evolution

Please give feedback -- mapflat.com/feedback

I help companies plan and build these things
