Designing Data-Intensive Applications

# Designing Data-Intensive Applications ![rw-book-cover](https://images-na.ssl-images-amazon.com/images/I/514xvNk9rTL._SL200_.jpg) ## Metadata - Author: Martin Kleppmann - Full Title: Designing Data-Intensive Applications - Category: #books ## Highlights - We call an application data-intensive if data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to compute-intensive, where CPU cycles are the bottleneck. ([Location 58](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=58)) - Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing. ([Location 206](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=206)) - If that sounds painfully obvious, that’s just because these data systems are such a successful abstraction: we use them all the time without thinking too much. ([Location 216](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=216)) - reliable, scalable, and maintainable data systems. ([Location 225](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=225)) - When you combine several tools in order to provide a service, the service’s interface or application programming interface (API) usually hides those implementation details from clients. ([Location 251](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=251)) - The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error). See “Reliability” ([Location 263](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=263)) - As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth. See “Scalability” ([Location 266](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=266)) - Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively. See “Maintainability”. ([Location 268](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=268)) - reliability as meaning, roughly, “continuing to work correctly, even when things go wrong.” ([Location 284](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=284)) - The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. ([Location 285](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=285)) - So it only makes sense to talk about tolerating certain types of faults. ([Location 290](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=290)) - Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. ([Location 292](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=292)) - Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. ([Location 299](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=299)) - Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. ([Location 301](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=301)) - one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages ([Location 379](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=379)) - Design systems in a way that minimizes opportunities for error. ([Location 383](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=383)) - Decouple the places where people make the most mistakes from the places where they can cause failures. ([Location 385](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=385)) - Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [3]. ([Location 388](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=388)) - Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure. ([Location 391](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=391)) - Set up detailed and clear monitoring, such as performance metrics and error rates. In other engineering disciplines this is referred to as telemetry. ([Location 396](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=396)) - Implement good management practices and training—a complex and important aspect, and beyond the scope of this book. ([Location 400](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=400)) - One common reason for degradation is increased load: ([Location 418](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=418)) - Scalability is the term we use to describe a system’s ability to cope with increased load. ([Location 420](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=420)) - Load can be described with a few numbers which we call load parameters. ([Location 427](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=427)) - Twitter’s scaling challenge is not primarily due to tweet volume, but due to fan-outii—each user follows many people, and each user is followed by many people. ([Location 439](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=439)) - In a batch processing system such as Hadoop, we usually care about throughput—the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size. ([Location 488](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=488)) - In online systems, what’s usually more important is the service’s response time—that is, the time between a client sending a request and receiving a response. ([Location 491](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=491)) - Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled—during which it is latent, awaiting service [17 ([Location 495](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=495)) - We therefore need to think of response time not as a single number, but as a distribution of values that you can measure. ([Location 501](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=501)) - However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay. Usually it is better to use percentiles. ([Location 519](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=519)) - If you take your list of response times and sort it from fastest to slowest, then the median is the halfway point: ([Location 522](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=522)) - The median is also known as the 50th percentile, and sometimes abbreviated as p50. ([Location 526](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=526)) - High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service. ([Location 535](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=535)) - percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. ([Location 551](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=551)) - it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking. ([Location 558](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=558)) - a higher proportion of end-user requests end up being slow (an effect known as tail latency amplification [24]). ([Location 573](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=573)) - scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture. ([Location 604](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=604)) - elastic, meaning that they can automatically add computing resources when they detect a load increase, ([Location 610](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=610)) - there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as magic scaling sauce). ([Location 621](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=621)) - An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. ([Location 626](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=626)) - It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features. ([Location 635](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=635)) - legacy systems—perhaps it involves fixing other people’s mistakes, or working with platforms that are now outdated, or systems that were forced to do things they were never intended for. ([Location 639](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=639)) - Operability Make it easy for operations teams to keep the system running smoothly. ([Location 644](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=644)) - Simplicity Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. ([Location 645](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=645)) - Evolvability Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity. ([Location 647](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=647)) - A software project mired in complexity is sometimes described as a big ball of mud ([Location 683](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=683)) - One of the best tools we have for removing accidental complexity is abstraction. ([Location 699](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=699)) - functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), ([Location 733](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=733)) - nonfunctional requirements (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). ([Location 734](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=734)) - Reliability means making systems work correctly, even when faults occur. ([Location 737](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=737)) - Scalability means having strategies for keeping performance good, even when load increases. ([Location 740](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=740)) - Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. ([Location 744](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=744)) - data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL). ([Location 898](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=898)) - The roots of relational databases lie in business data processing, ([Location 905](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=905)) - transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting). ([Location 907](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=907)) - network model and the hierarchical model were the main alternatives, but the relational model came to dominate them. ([Location 911](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=911)) - A number of interesting database systems are now associated with the #NoSQL hashtag, and it has been retroactively reinterpreted as Not Only SQL [4 ([Location 925](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=925)) - There are several driving forces behind the adoption of NoSQL databases, including: A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput A widespread preference for free and open source software over commercial database products Specialized query operations that are not well supported by the relational model Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model [5] ([Location 928](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=928)) - relational databases will continue to be used alongside a broad variety of nonrelational datastores—an idea that is sometimes called polyglot persistence ([Location 936](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=936)) - The disconnect between the models is sometimes called an impedance mismatch. ([Location 943](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=943)) - Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they can’t completely hide the differences between the two models. ([Location 948](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=948)) - Some developers feel that the JSON model reduces the impedance mismatch between the application code and the storage layer. ([Location 1018](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1018)) - The JSON representation has better locality than the multi-table schema ([Location 1023](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1023)) - Removing such duplication is the key idea behind normalization in databases. ([Location 1053](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1053)) - many-to-one relationships (many people live in one particular region, many people work in one particular industry), ([Location 1064](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1064)) - The design of IMS used a fairly simple data model called the hierarchical model, which has some remarkable similarities to the JSON model used by document databases ([Location 1101](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1101)) - relational model (which became SQL, and took over the world) and the network model (which initially had a large following but eventually faded into obscurity). ([Location 1109](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1109)) - The CODASYL model was a generalization of the hierarchical model. ([Location 1118](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1118)) - accessing a record was to follow a path from a root record along these chains of links. This was called an access path. ([Location 1123](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1123)) - Even CODASYL committee members admitted that this was like navigating around an n-dimensional data space ([Location 1129](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1129)) - relational model did, by contrast, was to lay out all the data in the open: a relation (table) is simply a collection of tuples (rows), ([Location 1137](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1137)) - In a relational database, the query optimizer automatically decides which parts of the query to execute in which order, and which indexes to use. ([Location 1143](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1143)) - Query optimizers for relational databases are complicated beasts, and they have consumed many years of research and development effort ([Location 1149](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1149)) - the related item is referenced by a unique identifier, which is called a foreign key in the relational model and a document reference in the document model [9 ([Location 1160](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1160)) - The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the data structures used by the application. The relational model counters by providing better support for joins, and many-to-one and many-to-many relationships. ([Location 1169](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1169)) - The relational technique of shredding—splitting a document-like structure into multiple tables (like positions, education, and contact_info in Figure 2-1)—can lead to cumbersome schemas and unnecessarily complicated application code. ([Location 1175](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1175)) - However, if your application does use many-to-many relationships, the document model becomes less appealing. ([Location 1187](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1187)) - For highly interconnected data, the document model is awkward, the relational model is acceptable, and graph models (see “Graph-Like Data Models”) are the most natural. ([Location 1193](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1193)) - Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database ([Location 1202](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1202)) - schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it) ([Location 1205](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1205)) - Schema changes have a bad reputation of being slow and requiring downtime. ([Location 1243](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1243)) - The schema-on-read approach is advantageous if the items in the collection don’t all have the same structure for some reason ([Location 1253](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1253)) - needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this storage locality. ([Location 1266](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1266)) - The locality advantage only applies if you need large parts of the document at the same time. ([Location 1269](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1269)) - schema to declare that a table’s rows should be interleaved (nested) within a parent table [27]. Oracle allows the same, using a feature called multi-table index cluster tables [28]. The column-family concept in the Bigtable data model (used in Cassandra and HBase) has a similar purpose of managing locality ([Location 1286](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1286)) - Most relational database systems (other than MySQL) have supported XML since the mid-2000s. ([Location 1297](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1297)) - SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code. ([Location 1327](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1327)) - An imperative language tells the computer to perform certain operations in a certain order. ([Location 1355](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1355)) - In a declarative query language, like SQL or relational algebra, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal. ([Location 1357](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1357)) - A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API. But more importantly, it also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries. ([Location 1360](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1360)) - MapReduce is a programming model for processing large amounts of data in bulk across many machines, popularized by Google ([Location 1473](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1473)) - MapReduce is neither a declarative query language nor a fully imperative query API, but somewhere in between: the logic of the query is expressed with snippets of code, which are called repeatedly by the processing framework. It is based on the map (also known as collect) and reduce (also known as fold or inject) functions that exist in many functional programming languages. ([Location 1478](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1478)) - They must be pure functions, which means they only use the data that is passed to them as input, they cannot perform additional database queries, and they must not have any side effects. ([Location 1571](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1571)) - MapReduce is a fairly low-level programming model for distributed execution on a cluster of machines. Higher-level query languages like SQL can be implemented as a pipeline of MapReduce operations (see Chapter 10), but there are also many distributed implementations of SQL that don’t use MapReduce. ([Location 1575](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1575)) - The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph. ([Location 1616](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1616)) - A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs). Many kinds of data can be modeled as a graph. ([Location 1622](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1622)) - However, graphs are not limited to such homogeneous data: an equally powerful use of graphs is to provide a consistent way of storing completely different types of objects in a single datastore. ([Location 1633](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1633)) - property graph model (implemented by Neo4j, Titan, and InfiniteGraph) and the triple-store model (implemented by Datomic, AllegroGraph, and others). ([Location 1648](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1648)) - In the property graph model, each vertex consists of: A unique identifier A set of outgoing edges A set of incoming edges A collection of properties (key-value pairs) Each edge consists of: A unique identifier The vertex at which the edge starts (the tail vertex) The vertex at which the edge ends (the head vertex) A label to describe the kind of relationship between the two vertices A collection of properties (key-value pairs) ([Location 1657](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1657)) - Cypher is a declarative query language for property graphs, created for the Neo4j graph database [37]. ([Location 1713](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1713)) - In a relational database, you usually know in advance which joins you need in your query. In a graph query, you may need to traverse a variable number of edges before you find the vertex you’re looking for—that is, the number of joins is not fixed in advance. ([Location 1772](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=1772)) - On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you. ([Location 2343](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=2343)) - there is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics. ([Location 2351](https://readwise.io/to_kindle?action=open&asin=B06XPJML5D&location=2351))