# Kafka

![rw-book-cover](https://m.media-amazon.com/images/I/813T2iWhVVL._SY160.jpg)

## Metadata

- Author: [[Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty]]
- Full Title: Kafka
- Category: #big-data #data-engineering

## Highlights

- Here are the big three differences: first, it works as a modern distributed system that runs as a cluster and can scale to handle all the applications in even the most massive of companies. Rather than running dozens of individual messaging brokers, hand wired to different apps, this lets you have a central platform that can scale elastically to handle all the streams of data in a company. Second, Kafka is a true storage system built to store data for as long as you might like. This has huge advantages in using it as a connecting layer as it provides real delivery guarantees—its data is replicated, persistent, and can be kept around as long as you like. Finally, the world of stream processing raises the level of abstraction quite significantly. Messaging systems mostly just hand out messages. The stream processing capabilities in Kafka let you compute derived streams and datasets dynamically off of your streams with far less code. These differences make Kafka enough of its own thing that it doesn’t really make sense to think of it as “yet another queue.” ([Location 103](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=103))
- Any time scientists disagree, it’s because we have insufficient data. Then we can agree on what kind of data to get; we get the data; and the data solves the problem. Either I’m right, or you’re right, or we’re both wrong. And we move on. Neil deGrasse Tyson ([Location 232](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=232))
- Publish/subscribe (pub/sub) messaging is a pattern that is characterized by the sender (publisher) of a piece of data (message) not specifically directing it to a receiver.
Instead, the publisher classifies the message somehow, and the receiver (subscriber) subscribes to receive certain classes of messages. Pub/sub systems often have a broker, a central point where messages are published, to facilitate this pattern. ([Location 238](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=238))
- You also know that there will be more use cases for messaging coming soon. What you would like to have is a single centralized system that allows for publishing generic types of data, which will grow as your business grows. ([Location 273](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=273))
- Apache Kafka was developed as a publish/subscribe messaging system designed to solve this problem. It is often described as a “distributed commit log” or more recently as a “distributed streaming platform.” A filesystem or database commit log is designed to provide a durable record of all transactions so that they can be replayed to consistently build the state of a system. Similarly, data within Kafka is stored durably, in order, and can be read deterministically. ([Location 276](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=276))
- The unit of data within Kafka is called a message. If you are approaching Kafka from a database background, you can think of this as similar to a row or a record. A message is simply an array of bytes as far as Kafka is concerned, so the data contained within it does not have a specific format or meaning to Kafka. ([Location 282](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=282))
- Keys are used when messages are to be written to partitions in a more controlled manner. ([Location 287](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=287))
- Messages are written into Kafka in batches. A batch is just a collection of messages, all of which are being produced to the same topic and partition.
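- The key-to-partition idea above can be sketched in a few lines of Python. This is only an illustration of the hash-modulo principle: Kafka’s actual default partitioner uses murmur2 hashing, not MD5, and the key and partition counts below are made up.

```python
import hashlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.

    Sketch only: Kafka's default partitioner uses murmur2 rather
    than MD5, but the hash-then-modulo principle is the same.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Messages with the same key always land in the same partition,
# which is what preserves ordering for that key.
assert pick_partition(b"user-42", 6) == pick_partition(b"user-42", 6)
```

  Because the mapping depends only on the key and the partition count, note that changing the number of partitions on a topic changes where keys land — one reason partition counts are usually chosen up front.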
([Location 290](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=290))
- Batches are also typically compressed, providing more efficient data transfer and storage at the cost of some processing power. ([Location 294](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=294))
- There are many options available for message schema, depending on your application’s individual needs. Simplistic systems, such as JavaScript Object Notation (JSON) and Extensible Markup Language (XML), are easy to use and human readable. However, they lack features such as robust type handling and compatibility between schema versions. Many Kafka developers favor the use of Apache Avro, ([Location 299](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=299))
- Avro provides a compact serialization format, schemas that are separate from the message payloads and that do not require code to be generated when they change, and strong data typing and schema evolution, with both backward and forward compatibility. ([Location 303](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=303))
- Messages in Kafka are categorized into topics. The closest analogies for a topic are a database table or a folder in a filesystem. Topics are additionally broken down into a number of partitions. Going back to the “commit log” description, a partition is a single log. Messages are written to it in an append-only fashion and are read in order from beginning to end. Note that as a topic typically has multiple partitions, there is no guarantee of message ordering across the entire topic, just within a single partition. ([Location 311](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=311))
- The term stream is often used when discussing data within systems like Kafka. Most often, a stream is considered to be a single topic of data, regardless of the number of partitions.
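- The point that Avro schemas are “separate from the message payloads” is concrete because an Avro schema is itself a plain JSON document. The sketch below uses a hypothetical `User` record to show one schema-evolution rule I am confident of: in Avro, adding a field is backward compatible as long as the new field carries a `default`, so readers on the new schema can still decode old records.

```python
import json

# A hypothetical Avro schema, version 1. Avro schemas are plain JSON.
schema_v1 = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "email", "type": "string"}
  ]
}
""")

# Version 2 adds a field WITH a default value, which keeps the change
# backward compatible: a reader using v2 fills in the default when it
# decodes a record written under v1.
schema_v2 = json.loads(json.dumps(schema_v1))  # deep copy of v1
schema_v2["fields"].append(
    {"name": "country", "type": "string", "default": "unknown"}
)
```

  In practice these schema documents are usually stored in a schema registry rather than shipped with every message, which is what keeps the payloads compact.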
([Location 323](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=323))
- Kafka clients are users of the system, and there are two basic types: producers and consumers. ([Location 331](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=331))
- Producers create new messages. In other publish/subscribe systems, these may be called publishers or writers. ([Location 335](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=335))
- Consumers read messages. In other publish/subscribe systems, these clients may be called subscribers or readers. ([Location 344](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=344))
- The consumer keeps track of which messages it has already consumed by tracking the offset of messages. The offset—an integer value that continually increases—is another piece of metadata that Kafka adds to each message as it is produced. ([Location 348](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=348))
- Consumers work as part of a consumer group, which is one or more consumers that work together to consume a topic. The group ensures that each partition is only consumed by one member. ([Location 352](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=352))
- The mapping of a consumer to a partition is often called ownership of the partition by the consumer. ([Location 356](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=356))
- A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and writes the messages to storage on disk. ([Location 364](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=364))
- Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker will also function as the cluster controller (elected automatically from the live members of the cluster).
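- The consumer-group invariant above — every partition owned by exactly one member of the group — can be sketched with a toy round-robin assignment. This illustrates the invariant only; it is not Kafka’s actual assignor logic (Kafka ships several strategies, including range and round-robin), and the consumer names are made up.

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin partitions across consumers so that every partition
    is owned by exactly one consumer — a sketch of the invariant a
    group coordinator enforces, not Kafka's real assignor."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Four partitions, two consumers: each consumer owns two partitions,
# and no partition has more than one owner.
ownership = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
# → {"c1": [0, 2], "c2": [1, 3]}
```

  A consequence worth noting: running more consumers in a group than there are partitions leaves the extra consumers idle, since a partition cannot be split between two group members.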
The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is owned by a single broker in the cluster, and that broker is called the leader of the partition. A replicated partition (as seen in Figure 1-7) is assigned to additional brokers, called followers of the partition. ([Location 369](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=369))
- All producers must connect to the leader in order to publish messages, but consumers may fetch from either the leader or one of the followers. ([Location 379](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=379))
- A key feature of Apache Kafka is that of retention, which is the durable storage of messages for some period of time. ([Location 383](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=383))
- Topics can also be configured as log compacted, which means that Kafka will retain only the last message produced with a specific key. This can be useful for changelog-type data, where only the last update is interesting. ([Location 390](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=390))
- The Kafka project includes a tool called MirrorMaker, used for replicating data to other clusters. ([Location 403](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=403))
- Kafka is able to seamlessly handle multiple producers, whether those clients are using many topics or the same topic. This makes the system ideal for aggregating data from many frontend systems and making it consistent. ([Location 415](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=415))
- Kafka is designed for multiple consumers to read any single stream of messages without interfering with each other. This is in contrast to many queuing systems where once a message is consumed by one client, it is not available to any other.
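- The log-compaction rule above — keep only the last message produced with each key — can be sketched in a few lines. This is a hedged illustration of the outcome, not the broker’s actual background compaction process, which works segment by segment on disk; the keys and values are invented.

```python
def compact(log: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only the latest value per key, ordered by each surviving
    record's (newest) position in the log — the end state log
    compaction converges toward for changelog-style topics."""
    latest: dict[str, str] = {}
    for key, value in log:
        latest.pop(key, None)   # drop any older record for this key
        latest[key] = value     # re-insert so order tracks the newest write
    return list(latest.items())

# A changelog where user-1's email is updated: only the last update survives.
log = [("user-1", "a@x"), ("user-2", "b@y"), ("user-1", "a@z")]
compact(log)  # → [("user-2", "b@y"), ("user-1", "a@z")]
```

  This is why compacted topics suit changelog-type data: replaying the compacted log rebuilds the current state without replaying every intermediate update.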
([Location 422](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=422))
- Not only can Kafka handle multiple consumers, but durable message retention means that consumers do not always need to work in real time. ([Location 427](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=427))
- Durable retention is useful here for providing a buffer for the changelog, meaning it can be replayed in the event of a failure of the consuming applications. ([Location 502](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=502))
- ZooKeeper is designed to work as a cluster, called an ensemble, to ensure high availability. Due to the balancing algorithm used, it is recommended that ensembles contain an odd number of servers (e.g., 3, 5, and so on) as a majority of ensemble members (a quorum) must be working in order for ZooKeeper to respond to requests. ([Location 658](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=658))
- Clients only need to be able to connect to the ensemble over the clientPort, but the members of the ensemble must be able to communicate with one another over all three ports. ([Location 689](https://readwise.io/to_kindle?action=open&asin=B09L6KLWDG&location=689))
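- The odd-ensemble-size recommendation follows from simple majority arithmetic, sketched below: a quorum is a strict majority of the ensemble, so an even-sized ensemble tolerates no more failures than the odd size just below it.

```python
def quorum(ensemble_size: int) -> int:
    """Minimum number of live members (a strict majority) needed
    for a ZooKeeper ensemble to respond to requests."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    """How many members can fail while a quorum remains."""
    return ensemble_size - quorum(ensemble_size)

# 3 servers tolerate 1 failure; 4 servers still tolerate only 1,
# which is why odd ensemble sizes (3, 5, ...) are recommended.
assert tolerated_failures(3) == 1
assert tolerated_failures(4) == 1
assert tolerated_failures(5) == 2
```

  The fourth server adds hardware and coordination cost without adding fault tolerance, so the next useful step up from 3 is 5.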