Getting Started with Apache Kafka
WHY APACHE KAFKA
Ryan Plant
COURSE AUTHOR
@ryan_plant blog.ryanplant.com
What Is Apache Kafka?
Microsoft SQL Server
Elasticsearch
MongoDB
Oracle
MySQL
Apache Kafka
Hadoop
“A high-throughput distributed messaging system.”
What a Typical Enterprise Looks Like
Source systems: RDBMS, logs, NoSQL, queues, blobs
Destination systems: DW, Hadoop, search, analysis
Database replication
Log shipping
Extract, Transform, and Load (ETL)
Messaging
Custom middleware magic
Database Replication and Log Shipping
RDBMS to RDBMS only
Database-specific
Tight coupling (schema)
Performance challenges (log shipping)
Cumbersome (subscriptions)
Extract, Transform, and Load (ETL)
Typically proprietary and costly
Lots of custom development
Scalability challenged
Performance challenged
Often requires multiple instances
Messaging
Limited scalability
Smaller messages
Requires rapid consumption
Not fault-tolerant (application)
Perils of Messaging Under High Volume
Publishers: High volume? Message size? No throttle?
BROKER: Single host? Local storage? Message buildup?
Consumers: No consumption? Slow consumption?
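The buildup problem can be made concrete with a small simulation. This is a minimal sketch, not a model of any real broker: the rates, the capacity, and the function name `simulate_backlog` are all hypothetical, and it assumes a single host that drops messages once its local storage is full.

```python
from collections import deque

def simulate_backlog(publish_rate, consume_rate, capacity, ticks):
    """Return (backlog_per_tick, dropped) for a single-host broker.

    All parameters are hypothetical: messages published and consumed per
    tick, and how many messages the broker's local storage can hold.
    """
    queue = deque()
    backlog = []
    dropped = 0
    for tick in range(ticks):
        # Publishers push without any throttle.
        for msg in range(publish_rate):
            if len(queue) < capacity:
                queue.append((tick, msg))
            else:
                dropped += 1  # local storage is full, the message is lost
        # Consumers drain as much as they can in this tick.
        for _ in range(min(consume_rate, len(queue))):
            queue.popleft()
        backlog.append(len(queue))
    return backlog, dropped

backlog, dropped = simulate_backlog(publish_rate=5, consume_rate=3,
                                    capacity=10, ticks=6)
print(backlog, dropped)
```

With publishers only slightly faster than consumers, the backlog grows every tick until the storage limit is hit, after which messages start being dropped.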
Perils of Messaging With Application Faults
Publishers
BROKER
Consumers: message processing bug
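A rough illustration of why a consumer bug is dangerous when the broker removes each message on delivery. This is a sketch under assumptions: the queue, the `drain` helper, and the bug itself are hypothetical, and the `lost` list exists only so the effect is visible. In a real system those messages would simply be gone.

```python
from collections import deque

def drain(queue, handler):
    """Deliver every queued message to handler, removing it on delivery."""
    processed, lost = [], []
    while queue:
        message = queue.popleft()  # the broker forgets the message here
        try:
            processed.append(handler(message))
        except Exception:
            lost.append(message)  # nothing is left on the broker to retry
    return processed, lost

def buggy_handler(message):
    # Hypothetical bug: the consumer cannot handle negative amounts.
    if message["amount"] < 0:
        raise ValueError("unexpected negative amount")
    return message["amount"] * 2

queue = deque({"amount": a} for a in (10, -5, 7))
processed, lost = drain(queue, buggy_handler)
print(processed, lost, len(queue))
```

Even after the bug is fixed, the broker has nothing left to redeliver: the queue is empty and the failed message never produced a result.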
Middleware Magic
Increasingly complex
Deceiving
Consistency concerns
Potentially expensive
Middleware Challenges
Multi-write pattern: atomic transaction, coordination logic
Message broker pattern: competing consumers, non-consuming consumer
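The consistency risk in the multi-write pattern can be sketched as follows. The `Store` class, the store names, and the failure flag are hypothetical stand-ins. The point is only that without an atomic transaction across both systems, a failure on the second write leaves them disagreeing.

```python
class Store:
    """Hypothetical data store that can be configured to reject writes."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.rows = []

    def write(self, record):
        if self.fail:
            raise ConnectionError(f"{self.name} unavailable")
        self.rows.append(record)

def multi_write(record, stores):
    """Write the record to each store in turn, with no shared transaction."""
    written = []
    for store in stores:
        try:
            store.write(record)
            written.append(store.name)
        except ConnectionError:
            break  # earlier writes are already committed and stay in place
    return written

database = Store("database")
search_index = Store("search", fail=True)
written = multi_write({"id": 1}, [database, search_index])
print(written, database.rows, search_index.rows)
```

The record now exists in the database but not in the search index, and the application has to add its own coordination logic to detect and repair the gap.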
Isn’t There a Better Way?
To move data around:
- Cleanly
- Reliably
- Quickly
- Autonomously
That’s What LinkedIn Asked in 2010…
High Volume:
- Over 1.4 trillion messages per day
- 175 terabytes per day
- 650 terabytes of messages consumed per day
- Over 433 million users
High Velocity:
- Peak 13 million messages per second
- 2.75 gigabytes per second
High Variety:
- Multiple RDBMS (Oracle, MySQL, etc.)
- Multiple NoSQL (Espresso, Voldemort)
- Hadoop, Spark, etc.
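A quick back-of-the-envelope check of what the volume and velocity figures imply. This is a sketch under assumptions: it treats terabytes and gigabytes as decimal units, reads the 175 terabytes per day as data written, and treats "over 1.4 trillion" as exactly 1.4 trillion, so the results are rough estimates only.

```python
TB = 10**12  # assumption: decimal terabytes
GB = 10**9   # assumption: decimal gigabytes

messages_per_day = 1.4e12
bytes_written_per_day = 175 * TB    # assumption: this figure is data written
bytes_consumed_per_day = 650 * TB
peak_messages_per_second = 13e6
peak_bytes_per_second = 2.75 * GB

# Average message size over a day, and at peak load.
avg_message_bytes = bytes_written_per_day / messages_per_day
peak_message_bytes = peak_bytes_per_second / peak_messages_per_second

# How many times each written byte is read by consumers, on average.
read_fan_out = bytes_consumed_per_day / bytes_written_per_day

print(round(avg_message_bytes), round(peak_message_bytes), round(read_fan_out, 1))
```

Under these assumptions the average message is on the order of 125 bytes (about 212 bytes at peak), and each message is consumed roughly 3.7 times, which suggests many small messages read by several independent consumers.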
Pre-2010 LinkedIn Data Architecture
[Diagram of LinkedIn services: skills, recommendations, comments, jobs, network updates, ads, mail, search, groups, people you may know, profile, news, stats, …]
kaf ka esque /’káf, kə, ɛsk/ | adjective
Basically it describes a nightmarish situation which most
people can somehow relate to, although strongly surreal.
synonyms: surreal, lucid, spoilsbury toast boy
Usage: “Whoa! This flick is way kafkaesque…”
Franz Kafka
Source: Urban Dictionary
Next-generation Messaging Goals
High throughput
Horizontally scalable
Reliable and durable
Loosely coupled Producers and Consumers
Flexible publish-subscribe semantics
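One way to picture loosely coupled producers and consumers is a broker that keeps messages per topic and lets each consumer track its own read position. The class below is a toy in-memory sketch, not Kafka's client API. `ToyBroker`, the topic name, and the consumer names are all hypothetical.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a publish-subscribe broker (not Kafka's API)."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only list
        self.offsets = defaultdict(int)   # (consumer, topic) -> next index

    def publish(self, topic, message):
        # Producers only know the topic name, never the consumers.
        self.topics[topic].append(message)

    def poll(self, consumer, topic):
        """Return the messages this consumer has not yet seen on the topic."""
        start = self.offsets[(consumer, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(consumer, topic)] = start + len(messages)
        return messages

broker = ToyBroker()
broker.publish("page-views", "home")
broker.publish("page-views", "jobs")
first = broker.poll("search-indexer", "page-views")
broker.publish("page-views", "profile")
second = broker.poll("search-indexer", "page-views")
late = broker.poll("data-warehouse", "page-views")
print(first, second, late)
```

Because messages are retained rather than removed on delivery, a consumer that subscribes late still sees everything, and one consumer's pace does not affect another's.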
Post-2010 LinkedIn Data Architecture
[Diagram: LOB apps, apps, DBs, and logs publish to and consume from a central set of topics (topic, topic, topic, …); DW, marts, HDP, search, ops, … likewise consume from and publish to those topics.]
Timeline of Events
2003: LinkedIn launch
2009: Kafka inception, development begins
2010: Initial Kafka deployment @ LinkedIn
2011: Kafka open sourced (Apache Software Foundation)
Today: 1.1 trillion messages per day @ LinkedIn
Apache Kafka Adoption
7X since 2015
Yahoo Uber Square Airbnb
Etsy Oracle Coursera Spotify
Microsoft Goldman Sachs IBM Ancestry
Bing Netflix Pinterest LinkedIn
Mailchimp PayPal Twitter Hotels.com
Summary
Kafka is a distributed messaging system
Designed to move data at high volumes
Addresses shortcomings of traditional data movement tools and approaches
Invented by LinkedIn to address data growth issues common to many enterprises
Open-sourced under the Apache Software Foundation in 2011
First-choice adoption for data movement for hundreds of enterprise and internet-scale companies