Data Pipelines From Zero to Solid
Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
Independent data engineering consultant
Presentation Goals
Not covered:
Stream processing
o High complexity in practice
o Batch processing yields > 90% of value
Technology enumeration or (fair) comparison
Writing data processing code
o Already covered en masse
Event Registration
Immediate handoff to an append-only replicated log. Once in the log, events eventually arrive in storage.
Event Transportation
Asynchronous fire-and-forget handoff for unimportant data. Synchronous, replicated, with ack for important data.
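As a minimal sketch of those two handoff modes, assuming Kafka as the append-only replicated log and the kafka-python client; topic names and payloads are illustrative:

# Sketch: handing events off to an append-only, replicated log (Kafka assumed).
import json
from kafka import KafkaProducer

# Fire-and-forget producer for unimportant data: no acknowledgement requested.
casual_producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    acks=0,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

# Replicated, acknowledged producer for important data: wait for in-sync replicas.
important_producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    acks="all",
    retries=5,
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def register_pageview(event):
    # Losing an occasional pageview is acceptable.
    casual_producer.send("pageviews", event)

def register_payment(event):
    # Block until the log has acknowledged the write; raise on failure.
    future = important_producer.send("payments", event)
    future.get(timeout=10)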
Event Arrival
The log has a long history (months+) => robustness end to end. Avoid the risk of processing & decoration in transit, except for timestamps.
Using snapshots
+ Standard procedure
- Serial or resource consuming

Event sourcing
Directory datasets
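A minimal sketch of the directory dataset pattern, using only the standard library: build the dataset under a temporary name, then publish it atomically with a success flag and a rename. The paths and the _SUCCESS convention are illustrative (borrowed from Hadoop-style output), not prescribed here:

# Sketch: write a dataset as an immutable directory, published atomically.
import json
import os

def write_directory_dataset(records, final_path):
    os.makedirs(os.path.dirname(final_path), exist_ok=True)
    tmp_path = final_path + ".tmp"
    os.makedirs(tmp_path)
    # Write one or more part files inside the temporary directory.
    with open(os.path.join(tmp_path, "part-00000.json"), "w") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
    # Mark the dataset as complete, then publish it with an atomic rename.
    open(os.path.join(tmp_path, "_SUCCESS"), "w").close()
    os.rename(tmp_path, final_path)
    # The directory is never modified after this point.

write_directory_dataset(
    [{"user": "u1", "event": "click"}],
    "/data/clean/pageviews/2016-01-01T00",
)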
Egress datasets show larger variation in storage:
o Single file
o Relational database table
o Cassandra column family, other NoSQL
o BI tool storage
o BigQuery, Redshift, ...
Egress datasets are also atomic and immutable.
E.g. write a full DB table / column family, switch the service to use it, and never change it.
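A sketch of that switch against Cassandra, assuming the DataStax Python driver; the keyspace, the versioned table names and the pointer table the service consults are illustrative choices:

# Sketch: egress to Cassandra by writing a complete new table, then switching
# a pointer that the serving side reads. All names are illustrative.
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("recs")
session.execute(
    "CREATE TABLE IF NOT EXISTS active_dataset (name text PRIMARY KEY, table_name text)"
)

def publish_recommendations(version, rows):
    table = f"recommendations_{version}"
    session.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (user_id text PRIMARY KEY, items list<text>)"
    )
    insert = session.prepare(f"INSERT INTO {table} (user_id, items) VALUES (?, ?)")
    for user_id, items in rows:
        session.execute(insert, (user_id, items))
    # Flip the pointer; the service looks up 'recommendations' here before querying.
    session.execute(
        "INSERT INTO active_dataset (name, table_name) VALUES (%s, %s)",
        ("recommendations", table),
    )
    # The previous versioned table is left untouched and can be dropped later.

publish_recommendations("2016_01_01", [("u1", ["item_a", "item_b"])])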
Schemas
Batch processing
Gradual refinement
1. Wash
o time shuffle, dedup
2. Decorate
o geo, demographic
3. Domain model
o similarity, clusters
4. Application model
o Recommendations
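As a concrete example of the first step, a minimal wash job in plain Python that drops duplicate events by id and shuffles them into hour buckets by event time; field names and the hour granularity are assumptions:

# Sketch of a wash step: deduplicate raw events and shuffle them into buckets
# keyed by event time. Field names and paths are illustrative.
import json
from collections import defaultdict

def wash(raw_lines):
    seen_ids = set()
    buckets = defaultdict(list)
    for line in raw_lines:
        event = json.loads(line)
        # Dedup: the collection path may deliver the same event more than once.
        if event["event_id"] in seen_ids:
            continue
        seen_ids.add(event["event_id"])
        # Time shuffle: group by the hour the event happened, not when it arrived.
        hour = event["timestamp"][:13]  # e.g. "2016-01-01T12"
        buckets[hour].append(event)
    return buckets

raw = [
    '{"event_id": "a", "timestamp": "2016-01-01T12:03:04", "user": "u1"}',
    '{"event_id": "a", "timestamp": "2016-01-01T12:03:04", "user": "u1"}',
    '{"event_id": "b", "timestamp": "2016-01-01T13:59:59", "user": "u2"}',
]
for hour, events in wash(raw).items():
    print(hour, len(events))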
Language choice
Batch job
Workflow manager
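A minimal sketch of chaining the refinement steps in a workflow manager, assuming Luigi; task bodies, parameters and paths are placeholders:

# Sketch: chaining batch jobs in a workflow manager (Luigi assumed).
import luigi

class Wash(luigi.Task):
    hour = luigi.Parameter()  # e.g. "2016-01-01T12"

    def output(self):
        return luigi.LocalTarget(f"/data/clean/pageviews/{self.hour}/_SUCCESS")

    def run(self):
        # Dedup and time-shuffle the raw events for this hour, write the output
        # dataset, and finally its success flag.
        with self.output().open("w") as flag:
            flag.write("")

class Decorate(luigi.Task):
    hour = luigi.Parameter()

    def requires(self):
        return Wash(hour=self.hour)

    def output(self):
        return luigi.LocalTarget(f"/data/decorated/pageviews/{self.hour}/_SUCCESS")

    def run(self):
        # Join in geo / demographic data, write a new immutable dataset.
        with self.output().open("w") as flag:
            flag.write("")

if __name__ == "__main__":
    luigi.build([Decorate(hour="2016-01-01T12")], local_scheduler=True)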
Serving
o Precomputed user query answers
o Denormalised
o Cassandra (one of many options)
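Serving a precomputed answer then becomes a single key lookup with no computation at request time; this sketch assumes the pointer-and-versioned-table layout from the egress sketch above:

# Sketch: serving precomputed, denormalised answers from Cassandra.
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("recs")

def recommendations_for(user_id):
    pointer = session.execute(
        "SELECT table_name FROM active_dataset WHERE name = %s",
        ("recommendations",),
    ).one()
    result = session.execute(
        f"SELECT items FROM {pointer.table_name} WHERE user_id = %s", (user_id,)
    ).one()
    return result.items if result else []

print(recommendations_for("u1"))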
Export & Analytics
o SQL (single node / Hive, Presto)
o Workbenches (Zeppelin)
o (Elasticsearch, proprietary OLAP)
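For ad hoc SQL over the pipeline datasets, one option is Presto through the PyHive client; host, catalog and table names below are placeholders:

# Sketch: ad hoc SQL over pipeline datasets (Presto + PyHive assumed).
from pyhive import presto

cursor = presto.connect(host="presto-coordinator", port=8080).cursor()
cursor.execute(
    """
    SELECT country, count(*) AS pageviews
    FROM hive.default.decorated_pageviews
    WHERE hour = '2016-01-01T12'
    GROUP BY country
    ORDER BY pageviews DESC
    LIMIT 10
    """
)
for country, pageviews in cursor.fetchall():
    print(country, pageviews)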
BI / analytics tool needs change frequently
o Prepare to redirect pipelines
Continuous deployment
Lean MVP
Scale carefully
PII removal
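PII removal can itself be a batch job that rewrites old immutable datasets into scrubbed copies; a small sketch in plain Python, where the field list and the hashing policy are illustrative decisions:

# Sketch: rewrite an old dataset with PII fields removed or pseudonymised.
import hashlib
import json

PII_FIELDS = {"name", "email", "ip_address"}

def scrub(event):
    clean = {k: v for k, v in event.items() if k not in PII_FIELDS}
    # Keep a stable pseudonym so per-user analysis still works after scrubbing.
    if "user_id" in event:
        clean["user_id"] = hashlib.sha256(event["user_id"].encode()).hexdigest()
    return clean

def scrub_dataset(in_lines):
    # In a real pipeline this would read one immutable dataset and write a new one.
    return [json.dumps(scrub(json.loads(line))) for line in in_lines]

print(scrub_dataset(['{"user_id": "u1", "email": "a@b.se", "event": "click"}']))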