Large Scale Data Pipelines
Ryan Knight
@TODO
Architect at Starbucks
• Distributed Systems guru
• Scala, Akka, Cassandra Expert & Trainer
• Skis with his 5 boys in Park City, UT
• First time at jFokus

James Ward
@_JamesWard
Developer at Salesforce
• Back-end Developer
• Creator of WebJars
• Blog: www.jamesward.com
• Not a JavaScript Fan
Agenda
• Play Framework
• Kafka
• Flink
• Cassandra
• Spark Streaming
Modern Data Pipelines
Real-Time, Distributed, Decoupled
Why Streaming Pipelines
Real-Time Value
• Allow the business to react to data in real time instead of in batches
Real-Time Intelligence
• Provide real-time information so that apps can use it to adapt their user interactions
Distributed data processing that is both scalable and resilient
• Clickstream analysis
• Real-time anomaly detection
• Instant (< 10 s) feedback, e.g. real-time concurrent video viewers / page views
Data Pipeline Requirements
Backpressure
http://ferd.ca/queues-don-t-fix-overload.html
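
Backpressure means a slow downstream stage limits how fast upstream stages may produce, instead of letting a queue grow without bound. A minimal sketch with Akka Streams (which underpins Play; the deliberately slow sink below is purely illustrative, not from the talk):

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

object BackpressureSketch extends App {
  implicit val system = ActorSystem("backpressure-sketch")
  implicit val materializer = ActorMaterializer()

  // Demand flows upstream: the source emits only as fast as this
  // slow sink consumes, so no unbounded queue builds up between stages.
  Source(1 to 1000000)
    .runWith(Sink.foreach { n =>
      Thread.sleep(100) // simulate a slow consumer
      println(n)
    })
}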
Data Hub / Stream Processing
Pipeline Architecture
[Diagram: Web Client → Play App → Kafka → Flink and Spark Streaming (Spark: core, streaming, graphx, mllib, ...) → Cassandra, with a Spark Notebook over the cold data]
Koober
github.com/jamesward/koober
Kafka
Distributed Commit Logs
What is Kafka?
Scalability of a filesystem
• Hundreds of MB/sec/server throughput
• Many TB per server
Durable - Guarantees of a database
• Messages strictly ordered
• All data persistent
Distributed by default
• Replication
• Partitioning model
Kafka is about logs
The Event Log
• Append-Only Logging
• Database of Facts
• Disks are cheap; why delete data anymore?
• Replay Events
Logs: pub/sub done right
Kafka Overview
• Producers write data to brokers.
• Consumers read data from brokers.
• Brokers: each server running Kafka is called a broker.
• All this is distributed.
• Data
  – Data is stored in topics.
  – Topics are split into partitions, which are replicated.
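
To make the producer side concrete, a minimal Scala sketch using the standard Kafka client (the broker address and the "clicks" topic are illustrative assumptions, not from the talk):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // assumed broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // The record key picks the partition within the "clicks" topic,
  // so all events for one key stay strictly ordered
  producer.send(new ProducerRecord("clicks", "user-42", """{"page": "/home"}"""))
  producer.close()
}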
Play Framework
Declarative Routing:
GET /foo controllers.Foo.foo

Controller Action:
Action {
  Ok(views.html.blah("bar"))
}

Template (app/views/blah.scala.html):
@(foo: String)
<html>
  <body>
    @foo
  </body>
</html>

Rendered output:
<html>
  <body>
    bar
  </body>
</html>
Demo & Code
Flink
Real-time Data Analytics
• Flexible Windowing
Apache Flink
Continuous Processing for Unbounded Datasets
[Diagram: a running count() over an unbounded stream]
Windowing
Bounding with Time, Count, Session, or Data
[Diagram: count() applied per 1 s window, emitting a result for each window]
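
A minimal sketch of time windowing in Flink's Scala DataStream API (the socket source on localhost:9999 is an illustrative assumption):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowedCounts extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // Assumed source: one event per line from a local socket
  val events: DataStream[String] = env.socketTextStream("localhost", 9999)

  // Tumbling 1-second windows: emit one count per key per window
  val counts = events
    .map(line => (line, 1))
    .keyBy(0)
    .timeWindow(Time.seconds(1))
    .sum(1)

  counts.print()
  env.execute("windowed-counts")
}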
Batch Processing
Stream Processing on Finite Streams
[Diagram: count() over a bounded stream, emitting a single total]
Cassandra
Data Processing: what can we do?
✓ Works in a cluster
✓ Nothing is shared
Cassandra Cluster
Multi-Data Center Design
Why Cassandra?
It has a flexible data model
Tables, wide rows, partitioned and distributed
✓ Data
✓ Blobs (documents, files, images)
✓ Collections (Sets, Lists, Maps)
✓ UDTs (user-defined types)
Access it with CQL, whose syntax is familiar from SQL
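
As a sketch of that data model, a wide-row table with a collection column created through the DataStax Java driver from Scala (the koober keyspace and rides table are hypothetical names for illustration):

import com.datastax.driver.core.Cluster

object DataModelSketch extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("koober") // hypothetical keyspace

  // Wide row: one partition per driver_id, rows clustered by ride_time
  session.execute(
    """CREATE TABLE IF NOT EXISTS rides (
      |  driver_id text,
      |  ride_time timestamp,
      |  pickup text,
      |  tags set<text>,  -- a CQL collection column
      |  PRIMARY KEY (driver_id, ride_time)
      |)""".stripMargin)

  session.close()
  cluster.close()
}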
Two knobs control Cassandra fault tolerance
Replication Factor (server side)
How many copies of the data should exist?
[Diagram: RF=3 on a four-node ring; the client's write of A is stored on three replicas]
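
In CQL the replication factor is set when a keyspace is created; continuing the hypothetical driver sketch above:

// Replication is configured per keyspace, here RF=3 with SimpleStrategy
session.execute(
  """CREATE KEYSPACE IF NOT EXISTS koober WITH replication =
    |  {'class': 'SimpleStrategy', 'replication_factor': 3}""".stripMargin)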
Two knobs control Cassandra fault tolerance
Consistency Level (client side)
[Diagram: the same write of A at CL=ONE (one replica must acknowledge) versus CL=QUORUM (a majority of replicas must acknowledge)]
Consistency Levels
Applies to both Reads and Writes (i.e. is set on each query)
[Diagram: replicas acknowledge the coordinator within microseconds]
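
Since the consistency level is set per query, a minimal driver sketch (reusing the hypothetical rides table; "driver-42" is an illustrative key):

import com.datastax.driver.core.{ConsistencyLevel, SimpleStatement}

// QUORUM: a majority of the RF=3 replicas must respond to this read
val stmt = new SimpleStatement(
  "SELECT * FROM rides WHERE driver_id = ?", "driver-42")
  .setConsistencyLevel(ConsistencyLevel.QUORUM)
session.execute(stmt)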
Consistency Level and Availability
Consistency Level choice affects availability
With RF=3, CL=ONE stays available with two replicas down, while CL=QUORUM tolerates only one
Reads in the cluster
As with writes, reads are coordinated, and any node can be the Coordinator Node
[Diagram: the client sends Read A at CL=QUORUM to a coordinator node, which reads from a quorum of replicas]
Spark Cassandra Connector
[Diagram: Spark workers running alongside the Cassandra cluster]
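
A minimal read through the connector, reusing the hypothetical keyspace and table from earlier (the master and contact point are assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object ConnectorSketch extends App {
  val conf = new SparkConf()
    .setAppName("connector-sketch")
    .setMaster("local[2]") // assumption: local mode for the sketch
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)

  // The connector maps Spark partitions to Cassandra token ranges,
  // so each worker reads the data local to its node where possible
  val rides = sc.cassandraTable("koober", "rides")
  println(s"total rides: ${rides.count()}")

  sc.stop()
}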
Spark Streaming
• Checkpointing and replication are mandatory, since streams have no source data files from which to reconstruct lost RDD partitions (except in the directory-ingest case)
• Write Ahead Logging to prevent data loss
Direct Kafka Streaming with the Kafka Direct API
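
A minimal sketch of the direct API from the Spark 1.x-era spark-streaming-kafka module (the topic, broker address, and checkpoint directory are assumptions): there is no receiver, and Spark tracks the Kafka offsets itself, so lost partitions can be recomputed from Kafka.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch extends App {
  val conf = new SparkConf().setAppName("direct-kafka-sketch").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/checkpoints") // assumed checkpoint directory

  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  // Direct API: no receiver; offsets are managed by Spark, not ZooKeeper
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("clicks"))

  // Count events per batch (stream elements are (key, value) pairs)
  stream.map(_._2).count().print()

  ssc.start()
  ssc.awaitTermination()
}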