Kafka Notes
These are the most relevant notes that I collected from several books and
courses. I organized and edited them to help deliver the concepts of
Apache Kafka in the most comprehensible way. Anyone who wants to learn Apache
Kafka can use these notes without having to go through too many resources on the
Internet. This may be the only material you need to gain a basic
understanding of Kafka, as well as of how to configure your applications to use Kafka
properly in production.
If you want a deeper understanding of Kafka, or of how to use Kafka Streams for
big data processing, I highly recommend the books and courses linked in the
reference section.
Table of contents:
● Kafka notes
○ Kafka introduction
■ The data problem
■ Why use Kafka?
■ Why is Kafka fast?
■ Compared to other message queue systems
■ Use Cases
○ Kafka architecture
■ Log
■ Topics
■ Partitions
■ Difference between Partition and Log?
■ Partitions group data by key
■ Important characteristics of Kafka
○ Producers and Consumers
■ Producer
■ Consumer
■ Consumer group
■ Flow of sending a message
○ Broker and Clusters
■ Broker
■ Cluster membership management with Zookeeper
■ Cluster controller
■ Replica
■ Leader replica
■ Follower replica
○ Configurations
■ Hardware selection
■ Disk Throughput
■ Disk capacity
■ Memory
■ Partitions count, replication factor
■ Partitions
■ Replication: should be at least 2, usually 3, maximum 4
■ Configure topic
■ Retention and clean-up policies (Compaction)
■ Partitions and Segments
■ Configure producer
■ Kafka broker discovery
■ Options for producer configuration
■ Message compression for high-throughput producers
■ Producer batching
■ Idempotent producer
■ Configure consumer
■ How does Kafka handle consumers exiting/entering groups?
■ Controlling consumer liveness?
■ Consumer offset
■ Consumer offset reset behaviours
■ Delivery semantics for consumers
■ Offset management
■ Consumer offset commits strategies
■ Schema registry
○ Case study
■ Video analytics - MovieFlix
■ GetTaxi
■ Campaign compare
■ Mysocial media
■ Finance application - MyBank
■ Big data ingestion
○ Kafka internal
■ Request processing
■ Physical storage
■ Partition Allocation
■ File Management
○ References
Kafka introduction
Use Cases
● Activity tracking: The original use case for Kafka, designed at LinkedIn, is that
of user activity tracking.
● Messaging: when applications need to send notifications to users, they can
produce messages without needing to be concerned about formatting. Then
another application can read all the messages and handle them consistently.
● Metrics and logging
● Commit log: Database changes can be published to Kafka and applications
can easily monitor this stream to receive live updates as they happen.
● Stream processing: Kafka is extremely good for streaming and processing
huge datasets.
Kafka architecture
Log
A log is the file that Kafka appends incoming records to: an append-only, totally
ordered sequence of records, ordered by time.
The configuration setting log.dir specifies where Kafka stores log data on disk.
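The append-only, totally ordered structure described above can be sketched as a toy in-memory log (illustrative only; Kafka's real log is a set of segment files on disk):

```python
class Log:
    """Toy append-only log: each record gets the next offset, in arrival order,
    and records are never mutated once written."""

    def __init__(self):
        self._records = []

    def append(self, record):
        offset = len(self._records)  # next offset = current length
        self._records.append(record)
        return offset

    def read(self, offset):
        return self._records[offset]


log = Log()
assert log.append("first") == 0
assert log.append("second") == 1
assert log.read(0) == "first"
```

Consumers later use these offsets to remember their position in the log.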
Topics
Topics are logs separated by topic name. Think of topics as labeled logs. The
closest analogies for a topic are a database table or a folder in a filesystem.
● orders
● customers
● payments
To help manage the load of messages coming into a topic, Kafka uses partitions.
Partitions are the way that Kafka provides redundancy and scalability. Each
partition can be hosted on a different server, which means that a single topic can be
scaled horizontally across multiple servers.
Partitions
● Help increase throughput
● Allow topic messages to be spread across several machines so that the
capacity of a given topic isn't limited to the available disk space on one server
When a message is sent to Kafka, you can specify a key for that message.
If the key (keys are explained in the next section) isn't null, Kafka hashes the key
modulo the number of partitions to calculate which partition the message will be sent to.
Records with the same key will always be sent to the same partition and in order.
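In the Java client, the default partitioner hashes a non-null key with murmur2 and takes the result modulo the number of partitions. A minimal sketch of the same idea, using crc32 as a stand-in hash (so the computed partition numbers will differ from the real client's, but the same-key-same-partition property holds):

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    # crc32 stands in for the Java client's murmur2 hash here;
    # any deterministic hash gives the key property we care about:
    # a given key always maps to the same partition.
    return zlib.crc32(key) % num_partitions


p1 = choose_partition(b"order-42", 6)
p2 = choose_partition(b"order-42", 6)
assert p1 == p2          # same key -> same partition
assert 0 <= p1 < 6       # result is a valid partition index
```

This is why records with the same key keep their relative order: they all land in the same partition, and each partition is totally ordered.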
Producer
Producers create new messages. In other publish/subscribe systems, these may be
called publishers or writers. A message will be produced to a specific topic.
● By default, the producer does not care which partition a specific message is
written to and will balance messages over all partitions of a topic evenly
● In some cases, the producer will direct messages to specific partitions using
the message key. Messages with the same message key are guaranteed to
arrive in order within a partition.
Consumer
Consumers read messages. In other publish/subscribe systems, these may be called
subscribers or readers.
● The consumer subscribes to one or more topics and reads the messages in
the order in which they were produced.
● The consumer keeps track of which messages it has already consumed by
tracking the offset of messages.
The offset is a simple integer that Kafka uses to maintain the current
position of a consumer.
Consumer group
Consumers work as part of a consumer group: one or more consumers that
work together to consume a topic. The group assures that each partition is only
consumed by one member. If a single consumer fails, the remaining members of the
group rebalance the partitions being consumed to take over for the missing member.
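The one-partition-per-group-member rule can be illustrated with a toy round-robin assignment (a simplification; real assignment is negotiated through Kafka's group coordinator and its pluggable assignors):

```python
def assign_partitions(partitions, consumers):
    """Spread partitions round-robin over group members;
    each partition is consumed by exactly one member."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


# Three partitions shared by two consumers:
before = assign_partitions([0, 1, 2], ["c1", "c2"])
assert before == {"c1": [0, 2], "c2": [1]}

# If c2 fails, a rebalance hands its partitions to the remaining member:
after = assign_partitions([0, 1, 2], ["c1"])
assert after == {"c1": [0, 1, 2]}
```

Note that a group with more consumers than partitions leaves some consumers idle, which is why partition count caps a group's parallelism.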
Flow of sending a message
● Create a ProducerRecord, which must include the topic we want to send the
record to and a value. Optionally, we can also specify a key and/or a partition.
● The key and value objects are then serialized to byte arrays so they can be
sent over the network.
● The data is passed to a partitioner. The partitioner checks whether the
ProducerRecord specifies a partition. If it does, the partitioner simply returns
that partition. If not, the partitioner chooses a partition for us.
● Once a partition is selected, the producer adds the record to a batch of
records that will be sent to the same topic and partition.
● When the broker receives the messages, it sends back a response.
○ If the messages were successfully written to Kafka, it returns a
RecordMetadata object containing <topic, partition, offset>
○ If the write failed, the broker returns an error. The producer may retry
sending the message a few more times before giving up and returning an error.
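The steps above can be simulated in plain Python (a toy model of the produce path, not the real client; the batch here is just an in-memory dict keyed by topic and partition):

```python
import json
import zlib


def send(record, batches, num_partitions=3):
    """Simulate the produce path: serialize -> partition -> add to a batch."""
    # 1. Serialize the key and value to bytes
    key = record["key"].encode() if record.get("key") else None
    value = json.dumps(record["value"]).encode()

    # 2. Partitioner: honor an explicit partition, otherwise hash the key
    if record.get("partition") is not None:
        partition = record["partition"]
    elif key is not None:
        partition = zlib.crc32(key) % num_partitions
    else:
        partition = 0  # real clients pick a sticky/random partition for null keys

    # 3. Add the record to the batch for that (topic, partition)
    batches.setdefault((record["topic"], partition), []).append((key, value))
    return partition


batches = {}
p = send({"topic": "orders", "key": "user-1", "value": {"qty": 2}}, batches)
send({"topic": "orders", "key": "user-1", "value": {"qty": 5}}, batches)

# Same key -> same partition -> same batch, with order preserved
assert len(batches[("orders", p)]) == 2
```

The broker's side of the exchange (writing the batch and answering with RecordMetadata or an error) is omitted here.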
Broker and Clusters
Broker
A single Kafka server is called a broker. The broker receives messages from
producers, assigns offsets to them, and commits the messages to storage on disk.
Cluster controller
In a cluster, one broker will also function as the cluster controller
A cluster controller is one of the kafka brokers that in addition to the usual broker
functionality:
Each broker holds a number of partitions and each of these partitions can be either a
leader or a replica for a topic
Follower replica
● All replicas for a partition that are not leaders are called followers
● Followers don't serve client requests
● They only replicate messages from the leader and stay up to date with the
most recent message the leader has
● When a leader crashes, one of the follower replicas is promoted to become
the new leader
● A follower replica that has caught up with the most recent messages of the
leader is called an in-sync replica
● Only in-sync replicas are eligible to be elected partition leader if the
existing leader fails
When the controller notices that a broker has left the cluster, all the partitions that
had a leader on that broker need a new leader, so the controller chooses a new
leader for each of these partitions.
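The failover rule above can be sketched with a hypothetical helper (not Kafka's actual controller code; broker IDs and the ISR set are passed in explicitly):

```python
def elect_leader(isr, failed_leader):
    """Promote the first in-sync replica that is not the failed leader.
    isr: list of broker IDs currently in the in-sync replica set."""
    candidates = [broker for broker in isr if broker != failed_leader]
    if not candidates:
        # No eligible replica: the partition is unavailable until one recovers
        raise RuntimeError("no in-sync replica available for election")
    return candidates[0]


# Partition replicated on brokers 1, 2, 3; only 1 and 3 are in sync.
# Broker 1 (the current leader) dies, so broker 3 is promoted:
assert elect_leader(isr=[1, 3], failed_leader=1) == 3
```

Restricting the election to the ISR is what guarantees the new leader has every message the old leader acknowledged.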