# Delta Lake

![rw-book-cover](https://m.media-amazon.com/images/I/813ZHZyE1eL._SY160.jpg)

## Metadata

- Author: [[Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu]]
- Full Title: Delta Lake
- Category: #apache-spark #python #big-data #data-engineering

## Highlights

- It provides ACID (atomicity, consistency, isolation, and durability) transactions and scalable metadata handling and unifies various data analytics tasks, such as batch and streaming workloads, machine learning, and SQL, on a single platform. ([Location 249](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=249))
- Unlike traditional databases, data lakes are designed to handle an internet-scale volume, velocity, and variety of data (e.g., structured, semistructured, and unstructured data). ([Location 278](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=278))
- Instead of providing ACID protections, these systems follow the BASE model—basically available, soft-state, and eventually consistent. The lack of ACID guarantees means that processing failures can leave your storage in an inconsistent state with orphaned files. ([Location 289](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=289))
- Delta Lake is an open source storage layer that supports ACID transactions, scalable metadata handling, and unification of streaming and batch data processing. It was initially designed to work with Apache Spark and large-scale data lake workloads. ([Location 353](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=353))
- Delta Lake ensures that data modifications are performed atomically, consistently, in isolation, and durably, i.e., with ACID transaction protections. ([Location 395](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=395))
- The metadata of a Delta Lake table is the transaction log, which provides transactional consistency per the aforementioned ACID transactions. ([Location 399](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=399))
- The Delta Lake time travel feature allows you to query previous versions of a table to access historical data. ([Location 404](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=404))
- Delta Lake was designed hand in hand with Apache Spark Structured Streaming to simplify the logic around streaming. Instead of having different APIs for batch and streaming, Structured Streaming uses the same in-memory Datasets/DataFrame API for both scenarios. ([Location 408](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=408))
- Delta Lake’s schema evolution and schema enforcement ensure data consistency and quality by enforcing a schema on write operations and allowing users to modify the schema without breaking existing queries. ([Location 415](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=415))
- This feature provides detailed logs of all changes made to the data, including information about who made each change, what the change was, and when it was made. ([Location 419](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=419))
- Delta Lake was one of the first lakehouse formats to provide data manipulation language (DML) operations. ([Location 422](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=422))
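A minimal sketch of the DML and time travel features highlighted above, using the open source `delta-spark` Python package with PySpark. The table path and session setup are hypothetical, not from the book:

```python
from delta import DeltaTable, configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Hypothetical local session configured for Delta Lake.
builder = (
    SparkSession.builder.appName("delta-dml-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write an initial version of the table (version 0) as Parquet files plus a Delta log.
spark.range(5).write.format("delta").save("/tmp/delta/example")  # hypothetical path

# DML: delete rows; Delta records this as a new version in the transaction log.
tbl = DeltaTable.forPath(spark, "/tmp/delta/example")
tbl.delete("id >= 3")

# Time travel: read the table as of the earlier version.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/example").show()
```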
- The roots of Delta Lake were built within the foundation of Databricks, which has extensive experience in open source (the founders of Databricks were the original creators of Apache Spark). ([Location 427](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=427))
- While Delta Lake is a lakehouse storage format, it is optimally designed to improve the speed of your queries and processing for both ingestion and querying using the default configuration. While you can continually tweak the performance of Delta Lake, most of the time the defaults will work for your scenarios. ([Location 432](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=432))
- Delta Lake was built with simplicity in mind right from the beginning. ([Location 435](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=435))
- Delta Lake tables store data in Parquet file format. ([Location 449](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=449))
- The transaction log, also known as the Delta log, is a critical component of Delta Lake. It is an ordered record of every transaction performed on a Delta Lake table. The transaction log ensures ACID properties by recording all changes to the table in a series of JSON files. ([Location 454](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=454))
- Metadata in Delta Lake includes information about the table’s schema, partitioning, and configuration settings. ([Location 461](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=461))
- A Delta Lake table’s schema defines the data’s structure, including its columns, data types, and so on. The schema is enforced on write, ensuring that all data written to the table adheres to the defined structure. ([Location 465](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=465))
- Checkpoints are periodic snapshots of the transaction log that help speed up the recovery process. Delta Lake consolidates the state of the transaction log by default every 10 transactions. This allows client readers to quickly catch up from the most recent checkpoint rather than replaying the entire transaction log from the beginning. ([Location 468](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=468))
- The Delta transaction log protocol is the specification defining how clients interact with the table in a consistent manner. ([Location 480](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=480))
- Serializable ACID writes: Multiple writers can modify a Delta table concurrently while maintaining ACID semantics. ([Location 491](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=491))
- Snapshot isolation for reads: Readers can read a consistent snapshot of a Delta table, even in the face of concurrent writes. ([Location 492](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=492))
- Scalability to billions of partitions or files: Queries against a Delta table can be planned on a single machine or in parallel. ([Location 494](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=494))
- Self-describing: All metadata for a Delta table is stored alongside the data. This design eliminates the need to maintain a separate metastore to read the data and allows static tables to be copied or moved using standard filesystem tools. ([Location 495](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=495))
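The transaction log, checkpoint, and self-describing highlights above can be verified directly on disk. A minimal sketch, assuming the hypothetical local table from the earlier snippet, that lists the `_delta_log` directory and parses one commit file:

```python
import json
import os

table_path = "/tmp/delta/example"  # hypothetical local table path
log_path = os.path.join(table_path, "_delta_log")

# Commit files are zero-padded JSON files, one per transaction; roughly every
# 10 commits Delta also writes a Parquet checkpoint consolidating the log state.
for name in sorted(os.listdir(log_path)):
    print(name)  # e.g. 00000000000000000000.json, 00000000000000000010.checkpoint.parquet

# Each line of a commit file is one action: add, remove, metaData, commitInfo, ...
with open(os.path.join(log_path, "00000000000000000000.json")) as f:
    for line in f:
        print(list(json.loads(line).keys()))
```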
- Support for incremental processing: Readers can tail the Delta log to determine what data has been added in a given period of time, allowing for efficient streaming. ([Location 497](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=497))
- To show users correct views of the data at all times, the Delta log is the single source of truth. ([Location 520](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=520))
- As the Delta transaction log is the single source of truth, any client who wants to read or write to your Delta table must first query the transaction log. ([Location 522](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=522))
- For deletes on object stores, it is faster to create a new file or files comprising the unaffected rows rather than modifying the existing Parquet file(s). This approach also provides the advantage of multiversion concurrency control (MVCC). MVCC is a database optimization technique that creates copies of the data, thus allowing data to be safely read and updated concurrently. This technique also allows Delta Lake to provide time travel. Therefore, Delta Lake creates multiple files for these actions, providing atomicity, MVCC, and speed. ([Location 537](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=537))
- If a user were to read the Parquet files without reading the Delta transaction log, they would read duplicates because of the replicated rows in all the files. ([Location 549](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=549))
- Note that the remove operation is a soft delete or tombstone where the physical removal of the files (1.parquet, 2.parquet) has yet to happen. The physical removal of files will happen when executing the VACUUM command. ([Location 553](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=553))
- Table Features replaces table protocol versions to represent the features a table uses, so connectors can know which features are required to read or write a table. ([Location 615](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=615))
- As of this writing, every time new features are added to Delta Lake, the connector must be rewritten entirely, because there is a tight coupling between the metadata and data processing. Delta Kernel simplifies the development of connectors by abstracting out all the protocol details so the connectors do not need to understand them. Kernel itself implements the Delta transaction log specification (per the previous section). ([Location 640](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=640))
- Creating Delta Kernel allows for more easily maintained parity between Delta Lake Rust and Scala/JVM, enabling both to be first-class citizens. All metadata (i.e., transaction log) logic is coordinated and executed through the Kernel library. This way, the connectors need only focus on how to perform operations in their respective frameworks/services/languages. ([Location 646](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=646))
- Delta Kernel decouples the logic for the metadata (i.e., transaction log) from the data. This allows Delta Lake to be modular, extensible, and highly portable (for example, you can copy the entire table with its transaction log to a new location for your AI workloads). This also extends (pun intended) to Delta Lake’s extensibility, as a connector is now, for example, provided the list of files to read instead of needing to query the transaction log directly. ([Location 654](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=654))
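Tying together the tombstone and VACUUM highlights above, a minimal sketch reusing the hypothetical `spark` session and table path from the earlier snippets. The earlier delete only tombstoned the old files in the log; physical removal happens on VACUUM:

```python
from delta import DeltaTable

tbl = DeltaTable.forPath(spark, "/tmp/delta/example")  # hypothetical path from above

# The earlier delete wrote new Parquet files with the surviving rows and recorded
# remove actions (tombstones) for the old files; nothing was removed physically yet,
# so the commit history (and time travel) still covers the pre-delete versions.
tbl.history().select("version", "operation").show()

# Physical removal of tombstoned files older than the retention period
# (168 hours here, the default) happens only when VACUUM runs.
tbl.vacuum(retentionHours=168)
```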
- Delta Universal Format, or UniForm, is designed to simplify the interoperability among Delta Lake, Apache Iceberg, and Apache Hudi. Fundamentally, lakehouse formats are composed of metadata and data (typically in Parquet file format). ([Location 689](https://readwise.io/to_kindle?action=open&asin=B0DFJ2W1MZ&location=689))
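A minimal sketch of enabling UniForm on a new table so that Iceberg metadata is generated alongside the Delta log. The table name is hypothetical, and the property names reflect the open source Delta UniForm documentation; verify them against your Delta release:

```python
# Create a Delta table with UniForm enabled for Iceberg readers
# (assumed property names; check the docs for the release you run).
spark.sql("""
    CREATE TABLE uniform_example (id BIGINT, ts TIMESTAMP)
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```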