# Database Internals ![rw-book-cover](https://images-na.ssl-images-amazon.com/images/I/51Id4KLgbGL._SL200_.jpg) ## Metadata - Author: Alex Petrov - Full Title: Database Internals - Category: #books ## Highlights - Databases are modular systems and consist of multiple parts: a transport layer accepting requests, a query processor determining the most efficient way to run queries, an execution engine carrying out the operations, and a storage engine (see “DBMS Architecture”). ([Location 156](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=156)) - The storage engine (or database engine) is a software component of a database management system responsible for storing, retrieving, and managing data in memory and on disk, designed to capture a persistent, long-term memory of each node [REED78] ([Location 159](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=159)) - database management systems are applications built on top of storage engines, offering a schema, a query language, indexing, transactions, and many other useful features. ([Location 163](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=163)) - it’s usually much better to use a database that slowly saves the data than one that quickly loses it. ([Location 200](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=200)) - The main performance indicator is throughput: the number of transactions the database system is able to process per minute. ([Location 226](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=226)) - Online transaction processing (OLTP) databases These handle a large number of user-facing requests and transactions. Queries are often predefined and short-lived. ([Location 281](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=281)) - Online analytical processing (OLAP) databases These handle complex aggregations. OLAP databases are often used for analytics and data warehousing, and are capable of handling complex, long-running ad hoc queries. ([Location 284](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=284)) - Hybrid transactional and analytical processing (HTAP) These databases combine properties of both OLTP and OLAP stores. ([Location 286](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=286)) - query processor, which parses, interprets, and validates it. ([Location 310](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=310)) - query optimizer, which first eliminates impossible and redundant parts of the query, and then attempts to find the most efficient way to execute it based on internal statistics (index cardinality, approximate intersection size, etc.) and data placement (which nodes in the cluster hold the data and the costs associated with its transfer). ([Location 312](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=312)) - The query is usually presented in the form of an execution plan (or query plan): a sequence of operations that have to be carried out for its results to be considered complete. ([Location 316](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=316)) - The execution plan is carried out by the execution engine, which aggregates the results of local and remote operations. ([Location 319](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=319)) - Local queries (coming directly from clients or from other nodes) are executed by the storage engine. ([Location 321](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=321)) - Transaction manager This manager schedules transactions and ensures they cannot leave the database in a logically inconsistent state. ([Location 323](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=323)) - Lock manager This manager locks on the database objects for the running transactions, ensuring that concurrent operations do not violate physical data integrity. ([Location 326](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=326)) - Access methods (storage structures) These manage access and organizing data on disk. Access methods include heap files and storage structures such as B-Trees (see “Ubiquitous B-Trees”) or LSM Trees ([Location 328](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=328)) - Buffer manager This manager caches data pages in memory (see “Buffer Management”). ([Location 331](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=331)) - Recovery manager This manager maintains the operation log and restoring the system state in case of a failure ([Location 333](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=333)) - In-memory database management systems (sometimes called main memory DBMS) store data primarily in memory and use the disk for recovery and logging. ([Location 343](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=343)) - Disk-based DBMS hold most of the data on disk and use memory for caching disk contents or as a temporary storage. ([Location 344](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=344)) - Databases using memory as a primary data store do this mainly because of performance, comparatively low access costs, and access granularity. ([Location 350](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=350)) - Programming for main memory is also significantly simpler than doing so for the disk. Operating systems abstract memory management and allow us to think in terms of allocating and freeing arbitrarily sized memory chunks. On disk, we have to manage data references, serialization formats, freed memory, and fragmentation manually. ([Location 351](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=351)) - Non-Volatile Memory (NVM) [ARULRAJ17] technologies grow. NVM storage reduces or completely eliminates (depending on the exact technology) asymmetry between read and write latencies, further improves read and write performance, and allows byte-addressable access. ([Location 358](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=358)) - Log records are usually applied to backup in batches. After the batch of log records is processed, backup holds a database snapshot for a specific point in time, and log contents up to this point can be discarded. This process is called checkpointing. ([Location 369](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=369)) - Most database systems store a set of data records, consisting of columns and rows in tables. ([Location 386](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=386)) - Field is an intersection of a column and a row: a single value of some type. ([Location 390](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=390)) - One of the ways to classify databases is by how the data is stored on disk: row- or column-wise. ([Location 393](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=393)) - Row-oriented database management systems store data in records or rows. Their layout is quite close to the tabular data representation, where every row has the same set of fields. ([Location 403](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=403)) - Column-oriented database management systems partition data vertically (i.e., by column) instead of storing it in rows. ([Location 416](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=416)) - Column-oriented stores are a good fit for analytical workloads that compute aggregates, such as finding trends, computing average values, etc. ([Location 420](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=420)) - To reconstruct data tuples, which might be useful for joins, filtering, and multirow aggregates, we need to preserve some metadata on the column level to identify which data points from other columns it is associated with. ([Location 429](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=429)) - Some column stores use implicit identifiers (virtual IDs) instead and use the position of the value (in other words, its offset) to map it back to the related values ([Location 431](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=431)) - Storing values that have the same data type together (e.g., numbers with other numbers, strings with other strings) offers a better compression ratio. We can use different compression algorithms depending on the data type and pick the most effective compression method for each case. ([Location 444](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=444)) - wide column stores, such as BigTable or HBase, where data is represented as a multidimensional map, columns are grouped into column families (usually storing data of the same type), and inside each column family, data is stored row-wise. ([Location 453](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=453)) - This layout is best for storing data retrieved by a key or a sequence of keys. ([Location 456](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=456)) - Related columns are grouped together in column families — contents and anchor in this example — which are stored on disk separately. ([Location 464](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=464)) - The main reasons to use specialized file organization over flat files are: Storage efficiency Files are organized in a way that minimizes storage overhead per stored data record. Access efficiency Records can be located in the smallest possible number of steps. Update efficiency Record updates are performed in a way that minimizes the number of changes on disk. ([Location 479](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=479)) - A database system usually separates data files and index files: data files store data records, while index files store record metadata and use it to locate records in data files. ([Location 488](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=488)) - Files are partitioned into pages, which typically have the size of a single or multiple disk blocks. Pages can be organized as sequences of records or as a slotted pages ([Location 490](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=490)) - Most modern storage systems do not delete data from pages explicitly. Instead, they use deletion markers (also called tombstones), which contain deletion metadata, such as a key and a timestamp. ([Location 493](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=493)) - Data files (sometimes called primary files) can be implemented as index-organized tables (IOT), heap-organized tables (heap files), or hash-organized tables (hashed files). ([Location 499](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=499)) - Records in heap files are not required to follow any particular order, and most of the time they are placed in a write order. ([Location 504](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=504)) - In hashed files, records are stored in buckets, and the hash value of the key determines which bucket a record belongs ([Location 506](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=506)) - Index-organized tables (IOTs) store data records in the index itself. ([Location 508](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=508)) - Storing data records in the index allows us to reduce the number of disk seeks by at least one, since after traversing the index and locating the searched key, we do not have to address a separate file to find the associated data record. ([Location 509](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=509)) - An index is a structure that organizes data records on disk in a way that facilitates efficient retrieval operations. ([Location 517](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=517)) - An index on a primary (data) file is called the primary index. ([Location 520](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=520)) - Secondary indexes can point directly to the data record, or simply store its primary key. ([Location 523](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=523)) - referencing data directly, we can reduce the number of disk seeks, but have to pay a cost of updating the pointers whenever the record is updated or relocated during a maintenance process. Using indirection in the form of a primary index allows us to reduce the cost of pointer updates, but has a higher cost on a read path. ([Location 552](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=552)) - To reduce the costs of pointer updates, instead of payload offsets, some implementations use primary keys for indirection. ([Location 555](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=555)) - Storage structures have three common variables: they use buffering (or avoid using it), use immutable (or mutable) files, and store values in order (or out of order). ([Location 573](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=573)) - Buffering This defines whether or not the storage structure chooses to collect a certain amount of data in memory before putting it on disk. ([Location 575](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=575)) - the smallest unit of data transfer to and from the disk is a block, and it is desirable to write full blocks. ([Location 577](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=577)) - Mutability (or immutability) This defines whether or not the storage structure reads parts of the file, updates them, and writes the updated results at the same location in the file. ([Location 583](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=583)) - Immutable structures are append-only: once written, file contents are not modified. ([Location 585](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=585)) - copy-on-write (see “Copy-on-Write”), where the modified page, holding the updated version of the record, is written to the new location in the file, instead of its original location. ([Location 587](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=587)) - Ordering This is defined as whether or not the data records are stored in the key order in the pages on disk. ([Location 591](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=591)) - Ordering often defines whether or not we can efficiently scan the range of records, not only locate the individual data records. ([Location 593](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=593)) - Most of the mutable storage structures use an in-place update mechanism. ([Location 625](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=625)) - One of the most popular storage structures is a B-Tree. Many open source database systems are B-Tree based, and over the years they’ve proven to cover the majority of use cases. ([Location 630](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=630)) - A binary search tree (BST) is a sorted in-memory data structure, used for efficient key-value lookups. BSTs consist of multiple nodes. Each tree node is represented by a key, a value associated with this key, and two child pointers (hence the name binary). ([Location 636](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=636)) - The balanced tree is defined as one that has a height of log2 N, where N is the total number of items in the tree, and the difference in height between the two subtrees is not greater than one1 ([Location 659](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=659)) - during rotation the middle node (3), known as a rotation pivot, is promoted one level higher, and its parent becomes its right child. ([Location 670](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=670)) - At the same time, due to low fanout (fanout is the maximum allowed number of children per node), we have to perform balancing, relocate nodes, and update pointers rather frequently. Increased maintenance costs make BSTs impractical as on-disk data structures ([Location 676](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=676)) - A naive on-disk BST implementation would require as many disk seeks as comparisons, since there’s no built-in concept of locality. ([Location 686](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=686)) - version of the tree that would be better suited for disk implementation has to exhibit the following properties: High fanout to improve locality of the neighboring keys. Low height to reduce the number of seeks during traversal. ([Location 688](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=688)) - Only a fraction of the data can be cached in memory at any time, and the rest has to be stored on disk in a manner that allows efficiently accessing it. ([Location 699](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=699)) - On spinning disks, seeks increase costs of random reads because they require disk rotation and mechanical head movements to position the read/write head to the desired location. However, once the expensive part is done, reading or writing contiguous bytes (i.e., sequential operations) is relatively cheap. ([Location 707](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=707)) - The smallest transfer unit of a spinning drive is a sector, so when some operation is performed, at least an entire sector can be read or written. Sector sizes typically range from 512 bytes to 4 Kb. ([Location 710](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=710)) - Head positioning is the most expensive part of an operation on the HDD. This is one of the reasons we often hear about the positive effects of sequential I/O: reading and writing contiguous memory segments from disk. ([Location 712](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=712)) - A typical SSD is built of memory cells, connected into strings (typically 32 to 64 cells per string), strings are combined into arrays, arrays are combined into pages, and pages are combined into blocks ([Location 719](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=719)) - Pages vary in size between devices, but typically their sizes range from 2 to 16 Kb. Blocks typically contain 64 to 512 pages. Blocks are organized into planes and, finally, planes are placed on a die. SSDs can have one or more dies. ([Location 721](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=721)) - The smallest unit that can be written (programmed) or read is a page. However, we can only make changes to the empty memory cells (i.e., to ones that have been erased before the write). The smallest erase entity is not a page, but a block that holds multiple pages, which is why it is often called an erase block. Pages in an empty block have to be written sequentially. ([Location 725](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=725)) - Flash Translation Layer (FTL) (see “Flash Translation Layer” for more about FTL). It is also responsible for garbage collection, during which FTL finds blocks it can safely erase. ([Location 730](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=730)) - SSDs, we don’t have a strong emphasis on random versus sequential I/O, as in HDDs, because the difference in latencies between random and sequential reads is not as large. ([Location 738](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=738)) - Writing only full blocks, and combining subsequent writes to the same block, can help to reduce the number of required I/O operations. We discuss buffering and immutability as ways to achieve that in later chapters. ([Location 742](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=742)) - Besides the cost of disk access itself, the main limitation and design condition for building efficient on-disk structures is the fact that the smallest unit of disk operation is a block. ([Location 744](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=744)) - What sets B-Trees apart is that, rather than being built from top to bottom (as binary search trees), they’re constructed the other way around — from bottom to top. ([Location 837](https://readwise.io/to_kindle?action=open&asin=B07XW76VHZ&location=837))