# Delta Lake

## Metadata
- Author: [[Bennie Haelen and Dan Davis]]
- Full Title: Delta Lake
- Category: #apache-spark #big-data #python #data-engineering
## Highlights
- Delta Lake brings capabilities such as transactional reliability and support for UPSERTs and MERGEs to data lakes while maintaining the dynamic horizontal scalability and separation of storage and compute of data lakes. Delta Lake is one solution for building data lakehouses, an open data architecture combining the best of data warehouses and data lakes. ([Location 139](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=139))
- A database transaction is a sequence of operations on a database that satisfies four properties: atomicity, consistency, isolation, and durability, commonly referred to by their acronym ACID. ([Location 155](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=155))
- Atomicity ensures that all changes made to the database are executed as a single operation. ([Location 156](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=156))
- The consistency property guarantees that the database transitions from one consistent state at the beginning of the transaction to another consistent state at the end of the transaction. ([Location 162](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=162))
- Isolation ensures that concurrent operations happening within the database are not affecting each other. This property ensures that when multiple transactions are executed concurrently, their operations do not interfere with each other. ([Location 166](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=166))
- Durability refers to the persistence of committed transactions. ([Location 168](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=168))
- Metadata: Contextual information about the data. ([Location 231](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=231))
- Raw data: Maintained in its original format without any processing. ([Location 234](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=234))
- Summary data: Automatically created by the underlying data management system. ([Location 238](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=238))
- Data warehouses introduced the need for a comprehensive data model that spans the different subject areas in a corporate enterprise. The technique used to create these models became known as dimensional modeling. ([Location 268](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=268))
- A fact table, which is the primary, or central table for the schema. The fact table captures the primary measurements, metrics, or “facts” of the business process. Staying with our sales business process example, a sales fact table would include units sold and sales amount. ([Location 277](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=277))
- Multiple dimension tables that are related to the fact table. A dimension provides the context surrounding the selected business process. ([Location 287](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=287))
- Volume: The volume of data created, captured, copied, and consumed globally is increasing rapidly. ([Location 318](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=318))
- Velocity: In today’s modern business climate, timely decisions are critical. ([Location 321](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=321))
- Variety: The number of different “types” of data that are now available. ([Location 324](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=324))
- Veracity: The trustworthiness of the data. ([Location 329](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=329))
- Traditional data warehouse architectures do not use in-memory and parallel processing techniques, preventing them from scaling the data warehouse vertically. ([Location 338](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=338))
- While data warehouses are very good at storing structured data, they are not well suited to store and query the variety of semi-structured or unstructured data. ([Location 341](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=341))
- The limitations of the traditional data warehouse architecture gave rise to a more modern architecture, based upon the concept of a data lake. ([Location 348](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=348))
- A data lake is a cost-effective central repository to store structured, semi-structured, or unstructured data at any scale, in the form of files and blobs. ([Location 350](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=350))
- Instead of using one large computer to store and process the data, Hadoop leveraged the clustering of multiple commodity compute nodes to analyze large volumes of datasets in parallel more quickly. ([Location 360](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=360))
- Traditional data lakes have poor query latency, so they cannot be used for interactive queries. ([Location 415](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=415))
- Data lakes typically use a “schema on read” strategy, where data can be ingested in any format without schema enforcement. ([Location 419](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=419))
- Data lakes do not offer any kind of transactional guarantees. Data files can only be appended to, leading to expensive rewrites of previously written data to make a simple update. ([Location 423](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=423))
- A lakehouse is defined as “a data management system based upon low-cost and directly accessible storage that also provides analytics DBMS management and performance features such as ACID transactions, data versioning, auditing, indexing, caching and query optimization.” ([Location 432](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=432))
- The goal of a lakehouse is to use existing high-performance data formats, such as Parquet, while also enabling ACID transactions (and other features). To add these capabilities, lakehouses use an open-table format, which adds features like ACID transactions, record-level operations, indexing, and key metadata to those existing data formats. ([Location 438](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=438))
- Lakehouses are an especially good match for most, if not all, cloud environments with separate compute and storage resources. ([Location 443](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=443))
- The final tool that can speed up analytic workloads is the development of a standard DataFrame API. ([Location 478](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=478))
- DataFrames were first introduced by R and pandas and provide a simple table abstraction of the data with a multitude of transformation operations, most of which originate from relational algebra. ([Location 480](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=480))
- In Spark, the DataFrame API is declarative, and lazy evaluation is used to build an execution DAG (directed acyclic graph). ([Location 481](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=481))
- Delta Lake is an open-table format that combines metadata, caching, and indexing with a data lake storage format. Together these provide an abstraction level to serve ACID transactions and other management features. ([Location 491](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=491))
- Transactional ACID guarantees: Delta Lake will make sure that all data lake transactions using Spark, or any other processing engine, are committed for durability and exposed to other readers in an atomic fashion. This is made possible through the Delta transaction log. ([Location 499](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=499))
- Full DML support: Traditional data lakes do not support transactional, atomic updates to the data. Delta Lake fully supports all DML operations, including deletes and updates, and complex data merge or upsert scenarios. ([Location 503](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=503))
- Audit history: The Delta Lake transaction log records every change made to the data, in the order that these changes were made. ([Location 508](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=508))
- Unification of batch and streaming into one processing model: Delta Lake can work with batch and streaming sinks or sources. ([Location 513](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=513))
- Schema enforcement and evolution: Delta Lake enforces a schema when writing or reading data from the lake. ([Location 519](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=519))
- Rich metadata support and scaling: Having the ability to support large volumes of data is great, but if the metadata cannot scale accordingly, the solution will fall short. ([Location 524](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=524))
- This architectural pattern with Bronze, Silver, and Gold layers is often referred to as the medallion architecture. While it is only one of many lakehouse architecture patterns, it is a great fit for modern data warehouses, data marts, and other analytical solutions. ([Location 544](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=544))
- Data in the Bronze zone is typically landed “as is,” but can be augmented with additional metadata, such as the loading date and time, processing identifiers, etc. ([Location 556](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=556))
- In the Silver layer, the data from the Bronze layer is cleansed, normalized, merged, and conformed. This is where the enterprise view of the data across the different subject areas is gradually coming together. ([Location 558](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=558))
- The data in the Gold layer is “consumption-ready” data. This data can be in the format of a classic star schema, containing dimensions and fact tables, or it could be in any data model that is befitting to the consuming use case. ([Location 559](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=559))
- All components of Delta Lake 2.0 have been open sourced by Databricks. ([Location 579](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=579))
- The Delta Lake storage format is an open source storage layer that runs on top of cloud-based data lakes. ([Location 582](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=582))
- Delta Sharing is an open source protocol for securely sharing large datasets of Delta Lake data. It allows users to securely share data stored in Amazon S3, ADLS, or GCS. With Delta Sharing, users can directly connect to the shared data, using their favorite toolsets like Spark, Rust, Power BI, etc., without having to deploy any additional components. ([Location 592](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=592))
- The main goal of Delta Connectors is to bring Delta Lake to other big data engines outside of Apache Spark. Delta Connectors are open source connectors that directly connect to Delta Lake. ([Location 599](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=599))
- When we save a file using this format, we are just writing a standard Parquet file with additional metadata. This additional metadata is the foundation for enabling the core features of Delta Lake, and even just performing DML operations typically seen in traditional RDBMSs such as INSERT, UPDATE, and DELETE, among a vast array of other operations. ([Location 766](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=766))
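To make this concrete, here is a minimal sketch of writing a DataFrame in the Delta format; the session setup follows the delta-spark quickstart pattern, and the path, table, and column names used here (and reused in the later sketches in these notes) are made up for illustration. The result on disk is ordinary Parquet data files plus a `_delta_log` directory holding the table metadata.

```python
import pyspark
from delta import configure_spark_with_delta_pip

# Build a Delta-enabled Spark session (not needed on Databricks, where
# a configured `spark` session already exists).
builder = (pyspark.sql.SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writing with format("delta") produces standard Parquet files plus _delta_log.
df = spark.createDataFrame(
    [(1, "Standard rate"), (2, "JFK")], ["RateCodeId", "RateCodeDesc"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/rate_code")
```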
- The metadata in a Parquet file not only contains information about these row groups, but also information about columns (e.g., min/max values, number of values) and the data schema, which makes Parquet a self-describing file with additional metadata to enable better data skipping. ([Location 797](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=797))
- High performance: Because Parquet files are a column-oriented format, they enable better compression and encoding since these algorithms can take advantage of the similar values and data types stored in each column. For I/O-intensive operations, this compressed data can improve performance significantly. ([Location 808](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=808))
- Cost-effective: Since Parquet files are able to leverage better compression and encoding, this inherently makes the data more cost-effective to store. ([Location 817](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=817))
- Interoperability: Since Parquet files have been very popular for many years, especially for legacy big data processing frameworks and tools (e.g., Hadoop), they are very widely supported across different tools and engines, and offer great interoperability. ([Location 819](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=819))
- The .crc files you see in the output are cyclic redundancy check files. Spark uses them to ensure that data hasn’t been corrupted. These files are always very small, so their overhead is very minimal compared to the utility that they provide. While there is a way to turn off the generation of these files, we would not recommend doing so since their benefit far outweighs their overhead. ([Location 853](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=853))
- The _delta_log file contains a transaction log with every single operation performed on your data. ([Location 882](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=882))
- With UniForm enabled, Delta tables can be read as if they were other open-table formats, such as Iceberg. This enables you to use a broader range of tools without worrying about table format compatibility. ([Location 883](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=883))
- UniForm automatically generates Apache Iceberg metadata alongside Delta metadata, atop one copy of the underlying Parquet data. ([Location 887](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=887))
- The Delta Lake transaction log (also known as DeltaLog) is a sequential record of every transaction performed on a Delta Lake table since its creation. ([Location 893](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=893))
- The main goal of the transaction log is to enable multiple readers and writers to operate on a given version of a dataset file simultaneously, and to provide additional information, like data skipping indexes, to the execution engine for more performant operations. ([Location 897](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=897))
- Without atomicity, any hardware failure or software bug can cause a data file to be written partially, resulting in corrupted or, at a minimum, invalid data. ([Location 909](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=909))
- The transaction is only considered complete when the transaction log entry is written successfully. ([Location 982](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=982))
- When the system reads a Delta table, it will iterate through the transaction log to “compile” the current state of the table. The sequence of events when reading a file is as follows: The transaction log files are read first. The data files are read based on the log files. ([Location 984](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=984))
- Once the Delta Lake writer has made the commits to the transaction log, it will save a checkpoint file in Parquet format in the _delta_log folder. The Delta Lake writer will continue to generate a new checkpoint every 10 commits. ([Location 1052](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1052))
- A checkpoint file saves the entire state of the table at a given point in time. Note that “state” refers to the different actions, not the file’s actual content. So, it will contain the add file, remove file, update metadata, commit info, etc., actions, with all the context information. It will save this list in native Parquet format. This will allow Spark to read the checkpoint quickly. This gives the Spark reader a “shortcut” to fully reproduce a table’s state and avoid reprocessing thousands of small JSON files, which could be inefficient. ([Location 1054](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1054))
- Delta tables can be read by standard ANSI SQL or using the popular PySpark DataFrameReader API. You can write to a Delta table by using the classic SQL INSERT statement, or you can append a DataFrame to the table. Finally, leveraging the SQL COPY INTO option is a great way to append large amounts of data quickly. ([Location 1132](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1132))
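A small sketch of these read and append paths, reusing the `spark` session from the earlier example; the `taxidb.rate_code` table name and the path are hypothetical.

```python
# Read with SQL or with the DataFrameReader.
df_sql = spark.sql("SELECT * FROM taxidb.rate_code")
df_api = spark.read.format("delta").load("/tmp/delta/rate_code")

# Append with a classic SQL INSERT, or by appending a DataFrame.
spark.sql("INSERT INTO taxidb.rate_code VALUES (3, 'Newark')")
new_rows = spark.createDataFrame([(4, "Nassau or Westchester")],
                                 ["RateCodeId", "RateCodeDesc"])
new_rows.write.format("delta").mode("append").saveAsTable("taxidb.rate_code")
```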
- Partitioning a Delta table based upon your frequently used query pattern can dramatically improve your query and DML performance. ([Location 1134](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1134))
- SQL Data Definition Language (DDL) commands ([Location 1141](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1141))
- PySpark DataFrameWriter API ([Location 1145](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1145))
- DeltaTableBuilder API ([Location 1150](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1150))
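As a rough sketch, the SQL DDL route looks like the following (database, table, and path names are illustrative); the DataFrameWriter and DeltaTableBuilder routes are sketched after the next few highlights.

```python
spark.sql("CREATE DATABASE IF NOT EXISTS taxidb")
spark.sql("""
    CREATE TABLE IF NOT EXISTS taxidb.rate_code (
        RateCodeId   INT,
        RateCodeDesc STRING
    )
    USING DELTA
    LOCATION '/tmp/delta/rate_code'
""")
```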
- One of those actions is the metadata action, which describes the schema of the table, the partitioning columns (if applicable), and other information. This metadata action is always written to the first transaction log entry created for our new table. ([Location 1192](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1192))
- The SQL DESCRIBE command can be used to return the basic metadata for a Parquet file or Delta table. The metadata returned for a table includes one line for each column with the following information: the column name, the column data type, and any comments that were applied to the column. ([Location 1216](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1216))
- Spark DataFrames resemble relational database tables or Excel spreadsheets with headers. The data resides in rows and columns of different data types. The collection of functions that lets us read, write, and manipulate DataFrames is collectively known as the Spark DataFrameWriter API. ([Location 1264](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1264))
- A Delta table that is created with a location is known as an unmanaged table. For these tables, Spark only manages the metadata, and requires you to specify the exact location where you wish to save the underlying data for the table, or alternatively, the source directory from which data will be pulled to create the table (if you are using the DataFrameWriter API). ([Location 1272](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1272))
- A Delta table that is created without a location is referred to as a managed table. Spark manages both the metadata and the actual data for managed tables. The data is stored under the /spark-warehouse subfolder (in the Apache Spark scenario) or the /user/hive/warehouse folder (when running on Databricks), which is the default for managed tables. ([Location 1275](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1275))
- One of the benefits of the DataFrameWriter API is that you can simultaneously create a table and insert data into it from a Spark DataFrame, ([Location 1279](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1279))
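A sketch of both variants with the DataFrameWriter API, reusing the `df` DataFrame from the first sketch (names remain illustrative): supplying a path yields an unmanaged table, omitting it yields a managed table, and in both cases the DataFrame's rows are written as part of table creation.

```python
# Unmanaged (external) table: Spark tracks only the metadata; you own the path.
(df.write.format("delta")
   .mode("overwrite")
   .option("path", "/tmp/delta/rate_code_external")
   .saveAsTable("taxidb.rate_code_external"))

# Managed table: no path given, so data lands under the warehouse directory.
df.write.format("delta").mode("overwrite").saveAsTable("taxidb.rate_code_managed")
```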
- The last way to create a Delta table is by using the DeltaTableBuilder API. Since it is designed to work with Delta tables, it offers a higher degree of fine-grained control versus the traditional DataFrameWriter API. ([Location 1351](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1351))
- The Builder design pattern is popular in software languages. The Builder pattern aims to “separate the construction of a complex object from its representation so that the same construction process can create different representations.” It is used to construct a complex object step-by-step, where the final step will return the object. ([Location 1356](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1356))
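A minimal DeltaTableBuilder sketch, following the builder pattern described above; the table name, columns, and comment are made up.

```python
from delta.tables import DeltaTable

# Chain builder methods step-by-step; execute() is the final step that
# actually creates the table.
(DeltaTable.createIfNotExists(spark)
    .tableName("taxidb.rate_code_builder")
    .addColumn("RateCodeId", "INT", comment="Primary key of the rate code")
    .addColumn("RateCodeDesc", "STRING")
    .property("description", "created with the DeltaTableBuilder API")
    .execute())
```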
- Delta Lake supports generated columns, which are a special type of column, the values of which are automatically generated based on a user-specified function over other columns in the Delta table. When you write to a Delta table with generated columns and don’t explicitly provide values for them, Delta Lake automatically computes the values. ([Location 1388](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1388))
- The expression you use in GENERATED ALWAYS AS can be any Spark SQL function that always returns the same result when given the same argument values, with a few exceptions we will touch on soon. ([Location 1414](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1414))
- GENERATED ALWAYS AS columns using the functions listed are valid, and can be very useful in several scenarios, like calculating a standard deviation of a given sample of records. ([Location 1425](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1425))
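A hedged example of a generated column: a date column derived from a timestamp column with a deterministic expression. Table and column names are illustrative.

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS taxidb.yellow_taxis (
        RideId     INT,
        PickupTime TIMESTAMP,
        -- Computed automatically on write when no value is supplied
        PickupDate DATE GENERATED ALWAYS AS (CAST(PickupTime AS DATE))
    )
    USING DELTA
""")
```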
- Often, a developer new to PySpark might assume that groupBy() returns a DataFrame, but it returns a GroupedData instance, so you have to use GroupedData methods such as agg() and filter() instead of DataFrame functions such as avg() and where(). ([Location 1550](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1550))
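A quick illustration of the distinction (the path and column names, such as TotalAmount, are hypothetical): groupBy() hands back a GroupedData object, and agg() turns it back into a DataFrame.

```python
from pyspark.sql import functions as F

trips = spark.read.format("delta").load("/tmp/delta/yellow_taxis")

grouped = trips.groupBy("RateCodeId")                 # GroupedData, not a DataFrame
per_code = grouped.agg(F.count("*").alias("rides"),
                       F.avg("TotalAmount").alias("avg_amount"))  # DataFrame again
per_code.where("rides > 100").show()
```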
- We can use the SQL COPY INTO command to append data to our table. This command is especially useful when we need to quickly append very large amounts of data. ([Location 1657](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1657))
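A sketch of COPY INTO on an engine that supports it (for example, Databricks SQL); the source path and file format are illustrative.

```python
spark.sql("""
    COPY INTO taxidb.yellow_taxis
    FROM '/mnt/landing/yellow_taxis/'
    FILEFORMAT = PARQUET
""")
```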
- These use cases lend themselves well to partitioning. Partitioning your data to align with your query patterns can dramatically speed up query performance, especially when combined with other performance optimizations, such as Z-ordering. A Delta table partition is composed of a folder with a subset of data rows that share the same value for one or more column(s). ([Location 1700](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1700))
- Note that this type of on-disk partitioning should not be confused with the partitioning that Spark applies when processing a DataFrame. Spark applies in-memory partitioning to enable tasks to run in parallel and independently on a large number of nodes in a Spark cluster. ([Location 1704](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1704))
- Once the table is partitioned, all queries with predicates that include the partition columns will run much faster, since Spark can immediately select the folder with the correct partition. You can partition data when you create a Delta table by specifying a PARTITIONED BY clause. ([Location 1711](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1711))
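A sketch of both ways to create a partitioned table (the partition column choice is illustrative, and the second example reuses the `trips` DataFrame loaded above): the SQL PARTITIONED BY clause and the DataFrameWriter's partitionBy().

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS taxidb.yellow_taxis_by_vendor (
        RideId     INT,
        VendorId   INT,
        PickupTime TIMESTAMP
    )
    USING DELTA
    PARTITIONED BY (VendorId)
""")

# Equivalent with the DataFrameWriter: one subfolder per VendorId value.
(trips.write.format("delta")
    .partitionBy("VendorId")
    .mode("overwrite")
    .save("/tmp/delta/yellow_taxis_by_vendor"))
```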
- Since Delta Lake adds a transactional layer to classic data lakes, we can perform classic DML operations, such as updates, deletes, and merges. ([Location 1966](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=1966))
- This process is called “tombstoning.” However, it is important to note that this old data file is not deleted, because you might still need it to time travel back to an earlier version of the table. You can use the VACUUM command to delete files older than a certain time period. ([Location 2087](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2087))
- The most important message to take away from this is that the delete transaction occurs at the data file level. Delta Lake will create new partitions and insert new add file and remove file actions in the transaction log, as needed. Chapter 6 on performance tuning will cover the VACUUM command and other ways to clean up tombstoned data files that are no longer required. ([Location 2097](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2097))
- The main way to improve the performance of a DELETE operation on Delta Lake is to add more predicates to narrow the search spectrum. ([Location 2104](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2104))
- Delta Lake also provides a number of other optimization conditions, such as data skipping and z-order optimization. Z-ordering reorganizes the layout of each data file so that similar column values are strategically colocated near one another for maximum efficiency. ([Location 2109](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2109))
- You can use the UPDATE operation to selectively update any rows matching a filtering condition, also known as a predicate. ([Location 2116](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2116))
- Once Delta Lake has executed the UPDATE successfully, it adds an add file action for the new data file. ([Location 2174](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2174))
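A small sketch of DELETE and UPDATE through the Python DeltaTable API (predicates and column names are made up); narrow predicates keep the rewrite limited to the matching data files.

```python
from delta.tables import DeltaTable

taxis = DeltaTable.forName(spark, "taxidb.yellow_taxis")

# DELETE: only files containing matching rows are rewritten.
taxis.delete("VendorId = 2 AND PickupTime < '2020-01-01'")

# UPDATE: selectively change rows matching a predicate.
taxis.update(
    condition="RideId = 12345",
    set={"PickupTime": "current_timestamp()"})
```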
- Here is how an upsert actually works: When a record from the source table matches a preexisting record in the target table, Delta Lake updates the record. When there is no such match, Delta Lake inserts the new record. ([Location 2196](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2196))
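A minimal upsert sketch with the DeltaTable merge API; `updates_df` is a hypothetical DataFrame of incoming records keyed by RateCodeId.

```python
from delta.tables import DeltaTable

rate_code = DeltaTable.forName(spark, "taxidb.rate_code")

(rate_code.alias("t")
    .merge(updates_df.alias("s"), "t.RateCodeId = s.RateCodeId")
    .whenMatchedUpdateAll()       # match in target -> update the record
    .whenNotMatchedInsertAll()    # no match in target -> insert the record
    .execute())
```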
- An important addition to the Delta Lake MERGE operation is the recently released WHEN NOT MATCHED BY SOURCE clause. This clause can be used to UPDATE or DELETE records in the target table that do not have corresponding records in the source table. ([Location 2321](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2321))
- To delete records that exist in the source tables and not in the target table (i.e., hard delete), use the WHEN NOT MATCHED BY SOURCE clause, ([Location 2330](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2330))
- It is best practice to add an optional MERGE condition when you add the WHEN NOT MATCHED BY SOURCE clause to UPDATE or DELETE target rows. This is because when there is no specified MERGE condition, this can lead to a large number of target rows being modified. ([Location 2348](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2348))
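A hedged SQL sketch that combines the classic upsert clauses with WHEN NOT MATCHED BY SOURCE, including an extra condition on that clause as recommended above; `rate_code_updates` is a hypothetical view over the incoming data, and the RateCodeId range is illustrative.

```python
spark.sql("""
    MERGE INTO taxidb.rate_code AS t
    USING rate_code_updates AS s
      ON t.RateCodeId = s.RateCodeId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
    -- Hard-delete target rows missing from the source, but only within a
    -- bounded range so a large number of target rows is not modified.
    WHEN NOT MATCHED BY SOURCE AND t.RateCodeId BETWEEN 1 AND 100 THEN DELETE
""")
```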
- Internally, Delta Lake completes a MERGE operation like this in two steps: It first performs an inner join between the target table and the source table to select all data files containing matches. This prevents the operation from unnecessarily shuffling data that can be safely ignored. Next, it performs an outer join between the selected files in the target and source tables, and applies the appropriate INSERT, DELETE, or UPDATE clause as specified by the user. ([Location 2380](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2380))
- Any time you are storing and retrieving data, whether with a traditional RDBMS or with Delta tables, how you organize the data in the underlying storage format can significantly affect the time it takes to perform table operations and queries. In general, performance tuning refers to the process of optimizing the performance of a system, and in the context of Delta tables this involves optimizing how the data is stored and retrieved. ([Location 2409](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2409))
- Delta Lake provides a number of different techniques that can be combined to accelerate data retrieval by efficiently reducing the amount of files and data that needs to be read during operations. ([Location 2413](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2413))
- Skipping nonrelevant data is ultimately the foundation for most performance tuning features, as it aims to reduce the amount of data that needs to be read. ([Location 2422](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2422))
- Delta Lake automatically maintains the minimum and maximum values for up to 32 fields of each data file, and stores those values as part of the metadata. Delta Lake uses these minimum and maximum ranges to skip the files that are out of the range of the querying field values. This is a key aspect that enables data skipping through what is called data skipping statistics. ([Location 2426](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2426))
- Typically, you want to collect data skipping statistics on columns that are commonly used in filters, WHERE clauses, joins, and columns that you tend to perform aggregations on. Conversely, avoid collecting data skipping statistics on long strings as they are far less efficient for data skipping purposes. ([Location 2442](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2442))
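Which leading columns get min/max statistics can be tuned with a table property; the sketch below uses the delta.dataSkippingNumIndexedCols property from the Delta Lake documentation, with an illustrative value.

```python
# Collect data skipping statistics on the first 5 columns only; keep long
# string columns out of those leading positions when designing the schema.
spark.sql("""
    ALTER TABLE taxidb.yellow_taxis
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5')
""")
```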
- Delta Lake partitioning allows you to organize a Delta table by dividing the data into smaller chunks called partitions. ([Location 2483](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2483))
- At the time of writing, partitions are the recommended approach to enable data skipping in regard to data layout. A new feature in Delta Lake called liquid clustering, which you will learn about in the last section of this chapter, is currently in preview and is not compatible with partitions. This will replace partitions as the recommended approach to optimize query performance in regard to data layout. ([Location 2503](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2503))
- Select your partition column(s) carefully. If the cardinality of a column is very high, do not use that column for partitioning. ([Location 2646](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2646))
- You can partition by a column if you expect data in that partition to be at least 1 GB. Tables with fewer, larger partitions tend to outperform tables with many smaller partitions; otherwise, you run into the small file problem. ([Location 2651](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2651))
- To avoid this issue, you should rewrite a large number of small files into a small number of larger files greater than 16 MB. Delta Lake supports the ability to optimize this layout of data in storage with various ways to coalesce small files into larger ones. ([Location 2663](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2663))
- The consolidation of files is called compaction, or bin-packing. ([Location 2666](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2666))
- To perform compaction using your own specifications, for example, specifying the number of files to compact the Delta table into, you can use a DataFrame writer with dataChange = false. This indicates that the operation does not change the data; it simply rearranges the data layout. ([Location 2669](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2669))
- When data is compacted, Delta Lake sets dataChange = true by default. ([Location 2673](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2673))
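A sketch of manual compaction with the DataFrameWriter: read the table, repartition to a chosen number of files, and overwrite in place with dataChange set to false so the rewrite is treated as a layout change rather than new data. The path and file count are illustrative.

```python
path = "/tmp/delta/yellow_taxis"

(spark.read.format("delta").load(path)
    .repartition(16)                       # target number of larger files
    .write.format("delta")
    .mode("overwrite")
    .option("dataChange", "false")         # layout change only, not new data
    .save(path))
```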
- Compaction allows you to specify how to consolidate small files into larger ones. In Delta Lake, a more optimal way to trigger this compaction and let Delta Lake determine the optimal number of large files you want is with the OPTIMIZE command. ([Location 2689](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2689))
- The OPTIMIZE command aims to remove unnecessary files from the transaction log while also producing evenly balanced data files in terms of file size. ([Location 2694](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2694))
- Unlike compaction achieved through the repartition method, there is no need to specify the dataChange option. OPTIMIZE uses snapshot isolation when performing the command so concurrent operations and downstream streaming consumers remain uninterrupted. ([Location 2706](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2706))
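The equivalent with OPTIMIZE, where Delta Lake chooses the target file sizes; the second command assumes the partitioned table sketched earlier and restricts compaction to one partition.

```python
# Compact the whole table...
spark.sql("OPTIMIZE taxidb.yellow_taxis")

# ...or only a slice of it, by filtering on a partition column.
spark.sql("OPTIMIZE taxidb.yellow_taxis_by_vendor WHERE VendorId = 1")
```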
- The NYC Taxi dataset we are using is mainly integers, so you see little compression from organizing the data. If your data contains many string values, you will see much better compression after running OPTIMIZE. ([Location 2726](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2726))
- It is important to note that the 1,000 files that were removed were not physically removed from the underlying storage; rather, they were only logically removed from the transaction log. ([Location 2728](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2728))
- Optimization using OPTIMIZE is also idempotent, meaning that if it is run twice on the same table or subset of data, the second run has no effect. ([Location 2731](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2731))
- The OPTIMIZE command is effective for tables, or table partitions, that you write data continuously to and thus contain large amounts of small files. ([Location 2756](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2756))
- The OPTIMIZE command is not effective for tables with static data or tables where data is rarely updated because there are few small files to coalesce into larger files. ([Location 2758](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2758))
- The OPTIMIZE command can be a resource-intensive operation that takes time to execute. You can incur costs from your cloud provider while running your compute engine to perform the operation. Balancing these resource-intensive operations with the ideal query performance for your tables is important. ([Location 2759](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2759))
- While OPTIMIZE aims to consolidate files, Z-ordering allows us to read the data in those files more efficiently by optimizing the data layout. ([Location 2768](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2768))
- ZORDER BY is a parameter of the command and refers to the way that data is arranged in files based on their values. Specifically, this technique clusters and colocates related information in the same set of files to allow for faster data retrieval. ([Location 2773](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2773))
- Z-order indexes can improve the performance of queries that filter on the specified Z-order columns. Performance is improved because it allows queries to more efficiently locate the relevant rows, and it also allows joins to more efficiently locate rows with matching values. ([Location 2776](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2776))
- Similar to partitions, Z-order indexes will soon be replaced by the new Delta Lake feature, liquid clustering, as the preferred technique to simplify data layout and optimize query performance. ([Location 2781](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2781))
- This is generally most effective when used on a high-cardinality column and a column used frequently in query predicates, which means that the column that we apply Z-ordering to impacts how well the data is retrieved: ([Location 2805](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2805))
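A sketch of Z-ordering on a high-cardinality column that appears in query predicates (the column choice is illustrative), in both SQL and Python forms:

```python
from delta.tables import DeltaTable

# SQL form
spark.sql("OPTIMIZE taxidb.yellow_taxis ZORDER BY (RideId)")

# Equivalent Python form
DeltaTable.forName(spark, "taxidb.yellow_taxis").optimize().executeZOrderBy("RideId")
```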
- Now that we added Z-ordering, we can see the detailed zOrderStats highlighted in the output, which includes the strategy name, input cube files, and other statistics about the ZORDER BY operation. ([Location 2808](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2808))
- You can specify multiple columns for ZORDER BY as a comma-separated list in the command. However, the effectiveness of the locality drops with each additional column: ([Location 2839](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2839))
- You cannot use ZORDER BY on fields used for partitioning. ([Location 2845](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2845))
- If you expect a column to be commonly used in query predicates, and if that column has high cardinality (that is, a large number of distinct values), then use ZORDER BY. ([Location 2848](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2848))
- Unlike OPTIMIZE, Z-ordering is not idempotent but aims to be an incremental operation. ([Location 2850](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2850))
- While Delta Lake supports partitioning, there are challenges with partition evolution, as partitioning is considered a fixed data layout. ([Location 2874](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2874))
- Anytime data is inserted, updated, or deleted on a table, OPTIMIZE ZORDER BY must be run again for optimization. And when ZORDER BY is applied again, the user must remember the columns used in the expression. This is because the columns used in ZORDER BY are not persisted and can cause errors or challenges when attempting to apply it again. Since OPTIMIZE ZORDER BY is not idempotent, this will result in reclustering data when it is run. ([Location 2876](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2876))
- The following scenarios for Delta tables benefit greatly from liquid clustering: tables often filtered by high-cardinality columns; tables with substantial skew in data distribution; tables that require large amounts of tuning and maintenance; tables with concurrent write requirements; and tables with partition patterns that change over time. ([Location 2881](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2881))
- To enable liquid clustering on a table, you can specify the CLUSTER BY command when creating a table. You must specify liquid clustering using the CLUSTER BY command when you create the table; you cannot add clustering to an existing table (e.g., using ALTER TABLE) that does not have liquid clustering enabled. ([Location 2888](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2888))
- Since only these specific operations support clustering data on write, you should trigger clustering on a regular basis by running OPTIMIZE. Running this command frequently will ensure that data is properly clustered. ([Location 2935](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2935))
- While you must specify how a table is clustered when it is initially created, you can still change the columns used for clustering on the table using ALTER TABLE and CLUSTER BY. ([Location 2949](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=2949))
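A hedged sketch of liquid clustering (a preview feature at the time the book was written, so syntax support depends on your Delta Lake version); clustering is declared with CLUSTER BY at creation time, and the clustering columns can be changed later. Table and column names are illustrative.

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS taxidb.trips_clustered (
        RideId     INT,
        VendorId   INT,
        PickupTime TIMESTAMP
    )
    USING DELTA
    CLUSTER BY (VendorId)
""")

# Change the clustering columns later, then re-cluster newly written data.
spark.sql("ALTER TABLE taxidb.trips_clustered CLUSTER BY (VendorId, PickupTime)")
spark.sql("OPTIMIZE taxidb.trips_clustered")
```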
- You must enable Delta Lake liquid clustering when first creating a table. You cannot alter an existing table to add clustering without clustering being enabled when the table is first created. ([Location 3005](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3005))
- Run OPTIMIZE frequently to ensure new data is clustered. ([Location 3009](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3009))
- The ability to easily traverse through different versions of data at specific points in time is a key feature in Delta Lake called Delta Lake time travel. ([Location 3061](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3061))
- Delta Lake time travel allows you to access and revert to previous versions of data stored in Delta Lake, easily providing a powerful mechanism for version control, auditing, and data management. You can then track changes over time and roll back to previous versions if needed. ([Location 3065](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3065))
- Version history can be kept on a Delta table because the transaction log keeps track of which files should or should not be read when performing operations on a table. ([Location 3168](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3168))
- The transaction log commits checkpoint files, and the fact that data files are only logically removed as opposed to being physically removed is the foundation for how Delta Lake easily enables time travel on your Delta table. ([Location 3180](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3180))
- It is important to note that RESTORE is a data-changing operation, meaning dataChange = true. This means it can potentially affect downstream jobs, such as Structured Streaming jobs, ([Location 3213](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3213))
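A sketch of rolling a table back; the version number and timestamp are illustrative.

```python
# Restore to an earlier version number...
spark.sql("RESTORE TABLE taxidb.yellow_taxis TO VERSION AS OF 3")

# ...or to the version current as of a timestamp. Both are data-changing commits.
spark.sql("RESTORE TABLE taxidb.yellow_taxis TO TIMESTAMP AS OF '2024-01-01'")
```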
- Remember, the data itself is not physically deleted from the underlying storage, it is just logically removed. Logical removal rather than physical deletion means that time travel not only allows you to restore a table to a specific point in time, but you can easily query previous versions of a table directly, without restoring. ([Location 3253](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3253))
- Querying by timestamp makes it easy to perform time-series analysis because we can compare the data of the same table to itself at two different points in time. ([Location 3275](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3275))
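Querying older versions directly, without restoring, looks roughly like this (the path, version, and timestamp values are illustrative):

```python
# DataFrameReader time travel by version.
v2 = (spark.read.format("delta")
      .option("versionAsOf", 2)
      .load("/tmp/delta/yellow_taxis"))

# SQL time travel by timestamp, e.g., comparing the table to itself over time.
past = spark.sql(
    "SELECT COUNT(*) AS rides FROM taxidb.yellow_taxis TIMESTAMP AS OF '2024-01-01'")
now = spark.sql("SELECT COUNT(*) AS rides FROM taxidb.yellow_taxis")
```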
- The data files backing a Delta table are never deleted automatically, but log files are automatically cleaned up after checkpoints are written. Ultimately, what enables time travel to a specific version of a table is the retention of both the data and log files for that version of a table. By default, Delta tables retain the commit history, or log files, for 30 days. So you can access time travel on Delta tables for up to 30 days unless you have modified the data or log files. ([Location 3294](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3294))
- The retention threshold refers to the interval (e.g., days) a file must be kept before it is a candidate to be physically removed from storage. ([Location 3301](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3301))
- The default retention is seven days for files that are candidates to be removed by VACUUM, a command used for physically deleting data files. In brief, VACUUM removes data files no longer referenced by the Delta table and older than the retention period. ([Location 3308](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3308))
- There is minimal downside to retaining log files, as log files do not affect performance on read/writes on the table; they only affect performance on operations that leverage table history. ([Location 3363](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3363))
- This is a reminder: these data files are never automatically physically deleted from storage unless the VACUUM command is run. VACUUM is designed to allow users to physically delete old versions of data files and directories that are no longer needed, while also taking into account the retention threshold of the table. ([Location 3410](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3410))
- When vacuuming a table, you can specify VACUUM without any parameters to vacuum files that are not required by versions older than the default retention period. You can also use the RETAIN num HOURS parameter to vacuum files that are no longer required by any table version created within the specified number of hours. ([Location 3432](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3432))
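A sketch of the main VACUUM variants; the retention value is illustrative and, as noted below, should not normally be set under seven days.

```python
# Use the default retention threshold (7 days).
spark.sql("VACUUM taxidb.yellow_taxis")

# Or state the horizon explicitly, in hours (168 hours = 7 days).
spark.sql("VACUUM taxidb.yellow_taxis RETAIN 168 HOURS")

# DRY RUN lists the files that would be deleted without removing anything.
spark.sql("VACUUM taxidb.yellow_taxis DRY RUN")
```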
- VACUUM also commits to the Delta transaction log, which means that you can also view previous VACUUM commits and operationMetrics using DESCRIBE HISTORY: ([Location 3452](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3452))
- This maintenance job, which can also include other file cleanup operations such as OPTIMIZE, should be run as a separate workflow outside of your main ETL workflow for several reasons: ([Location 3480](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3480))
- Resource utilization File cleanup operations can be resource intensive and can compete for resources with your main workflow, leading to a decline in overall performance. ([Location 3482](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3482))
- Isolation It is best to isolate processes that perform file cleanup and consolidation so that they have exclusive access to the Delta table to avoid any potential conflicts. ([Location 3486](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3486))
- Monitoring By isolating these processes, it is much easier to monitor performance so that you can track progress and resource consumption for tuning. ([Location 3487](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3487))
- By scheduling separate workflows for maintenance tasks such as VACUUM, you can have greater resource management, isolation, monitoring, and overall control of your jobs and workflows. ([Location 3490](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3490))
- It is not recommended that you set a retention interval shorter than seven days. ([Location 3504](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3504))
- Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. ([Location 3508](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3508))
- When you run VACUUM on a Delta table, it removes the following files from the underlying filesystem: Any data files that are not maintained by Delta Lake, ignoring directories beginning with an underscore, like _delta_log. ([Location 3520](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3520))
- Stale data files (files that are no longer referenced by a Delta table) that are older than the retention period. ([Location 3525](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3525))
- Since vacuuming removes files, it is important to note that the process can take some time, depending on the size of the table and the number of files to be removed. ([Location 3526](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3526))
- Run OPTIMIZE regularly to eliminate small files and reduce the number of files that need to be removed. ([Location 3527](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3527))
- But time travel does not track row-level changes, or rather, how row-level data is inserted, updated, or deleted across different versions. ([Location 3549](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3549))
- This efficient tracking of row-level changes across versions is called the Change Data Feed (CDF). ([Location 3551](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3551))
- When the CDF is enabled on a Delta table, Delta Lake records “change events” for all the data written into the table. This includes the row data and metadata indicating whether the specified row was inserted, deleted, or updated. ([Location 3552](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3552))
- With the CDF, you can capture changes to the data without having to process every single record in your Delta table file or query an entire version of a table. So, if just one record changed, you no longer have to read all records in the file or a table. The CDF is stored in a separate directory called _change_data that sits alongside _delta_log and maintains the changes to the Delta table file. ([Location 3559](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3559))
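A sketch of enabling the CDF on an existing table and reading the row-level changes back with the DataFrameReader; the version range is illustrative.

```python
spark.sql("""
    ALTER TABLE taxidb.yellow_taxis
    SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true')
""")

# Read only the changes between two versions; each row carries _change_type,
# _commit_version, and _commit_timestamp metadata columns.
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .option("endingVersion", 10)
    .table("taxidb.yellow_taxis"))
changes.show()
```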
- If you have an event-driven application, an event-streaming platform such as Kafka could read the change feed and trigger actions for a downstream application or platform. ([Location 3577](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3577))
- The CDF is an efficient, powerful feature that can capture changes to data over time. This can be used with other ETL operations to easily build type 2 slowly changing dimensions, or you can process only row-level changes following MERGE, UPDATE, or DELETE operations to accelerate ETL operations and incrementally load data downstream. ([Location 3706](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3706))
- Schema enforcement: This is the process of ensuring that all data being added to a table conforms to that specific schema, where the schema defines a table structure by a list of column names, their data types, and any optional constraints. ([Location 3761](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3761))
- Schema evolution: This allows the data stored in the data lake to be flexible and adaptable to address the changing business requirements and data landscape. ([Location 3766](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3766))
- All source data is written when the transaction succeeds, and no source data is written when the validation fails. When schema validation fails, Delta Lake will raise an exception to let the user know about the mismatch. ([Location 3821](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3821))
- schema evolution allows us to add, remove, or modify columns in an existing Delta table without losing any data or breaking any downstream jobs that depend on the table. ([Location 3902](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3902))
- If a column exists in the Delta table but not in the DataFrame being written, the column is not changed and retains its existing values. The new records will have a null value for the missing columns in the source DataFrame. ([Location 3974](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=3974))
- If a column with the same name but a different data type exists in the Delta table, Delta Lake attempts to convert the data to the new data type. If the conversion fails, an error is thrown. ([Location 4027](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4027))
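A sketch of schema evolution on write: mergeSchema lets columns that exist only in the incoming DataFrame be added to the table instead of failing schema enforcement. `new_columns_df` is a hypothetical DataFrame carrying an extra column.

```python
(new_columns_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # evolve the table schema on this write
    .saveAsTable("taxidb.yellow_taxis"))
```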
- The reason for this error is that NullType columns in Delta Lake do not have a defined schema, so Spark cannot infer the data type of the column. Therefore, when we try to run a SELECT * query, Spark is unable to map the NullType column to a specific data type, and the query fails. If we want to query the table, we can list the columns we need without the NullType column, for example: SELECT RateCodeId, RateCodeDesc FROM delta.`/mnt/datalake/book/chapter07/TaxiRateCode`. This will succeed without any issues. ([Location 4100](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4100))
- Note that we used the AFTER keyword, so the column will be added after the RateCodeId field, and not at the end of the column list, as is the standard practice without the AFTER keyword. Similarly, we can use the FIRST keyword to add the new column at the first position in the column list. ([Location 4125](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4125))
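Roughly, the DDL looks like this (the column name and comment are made up):

```python
spark.sql("""
    ALTER TABLE taxidb.rate_code
    ADD COLUMNS (RateCodeTax INT COMMENT 'Tax percentage' AFTER RateCodeId)
""")
# Use FIRST instead of AFTER <column> to place the new column at the front.
```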
- Column mapping allows Delta Lake tables and the underlying Parquet file columns to use different names. This enables Delta Lake schema evolution such as RENAME COLUMN and DROP COLUMN on a Delta Lake table without the need to rewrite the underlying Parquet files. ([Location 4226](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4226))
- Delta Lake supports column mapping for Delta Lake tables, which enables metadata-only changes to mark columns as deleted or renamed without rewriting data files. It also allows users to name Delta table columns using characters that are not allowed by Parquet, such as spaces, so that users can directly ingest CSV or JSON data into Delta Lake without the need to rename columns due to previous character constraints. ([Location 4237](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4237))
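Column mapping is enabled through table properties (the protocol version bump shown below follows the Delta Lake documentation); once it is on, renames become metadata-only operations. Names remain illustrative.

```python
spark.sql("""
    ALTER TABLE taxidb.rate_code SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion'   = '2',
        'delta.minWriterVersion'   = '5'
    )
""")

# Rename without rewriting any Parquet files.
spark.sql("ALTER TABLE taxidb.rate_code RENAME COLUMN RateCodeDesc TO RateCodeDescription")
```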
- The REPLACE COLUMNS operation sets all the column values to null because the new schema might have different data types or a different order of columns than the old schema. ([Location 4377](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4377))
- Delta Lake now supports dropping a column as a metadata-only operation without rewriting any data files. Note that column mapping must be enabled for this operation. ([Location 4389](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4389))
- The REORG TABLE command reorganizes a Delta Lake table by rewriting files to purge soft-deleted data, which we created in the previous section, where we dropped a column with the ALTER TABLE DROP COLUMN command. ([Location 4457](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4457))
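With column mapping enabled, a drop plus purge looks roughly like this (the column name is illustrative):

```python
# Metadata-only: the column disappears from the schema, data stays on disk.
spark.sql("ALTER TABLE taxidb.rate_code DROP COLUMN RateCodeTax")

# Rewrite the files to physically purge the soft-deleted column data.
spark.sql("REORG TABLE taxidb.rate_code APPLY (PURGE)")
```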
- We can change a column’s data type or name or drop a column by manually rewriting the table. To do this, we can use the overwriteSchema option. ([Location 4492](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4492))
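A sketch of the manual-rewrite approach with overwriteSchema (the rename shown is illustrative):

```python
(spark.read.table("taxidb.rate_code")
    .withColumnRenamed("RateCodeDescription", "Description")
    .write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")   # allow the table schema to be replaced
    .saveAsTable("taxidb.rate_code"))
```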
- Delta Lake is integrated with Spark Structured Streaming through its two major operators: readStream and writeStream. ([Location 4573](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4573))
- Delta tables can be used as both streaming sources and streaming sinks. ([Location 4575](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4575))
- Delta Lake overcomes many limitations typically associated with streaming systems, including: coalescing small files produced by low-latency ingestion; maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs); and leveraging the Delta transaction log for efficient discovery of which files are new when using files for a source stream. ([Location 4575](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4575))
- Incremental processing of data has become a popular ETL model. The AvailableNow stream triggering mode enables developers to build incremental pipelines without needing to maintain their own state variables, resulting in simpler and more robust pipelines. ([Location 4585](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4585))
- You can enable a Change Data Feed (CDF) on a Delta table. Clients can consume this CDF feed with SQL queries, or they can stream these changes into their application, enabling use cases such as creating audit trails, streaming analytics, compliance analysis, etc. ([Location 4589](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4589))
- With Delta Lake, you can now use Delta tables as both streaming sources and sinks, enabling a continuous processing model that processes your data through the Raw, Bronze, Silver, and Gold data lake layers in a streaming fashion, eliminating the need for batch jobs, resulting in a simplified solution architecture. ([Location 4612](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4612))
- This checkpoint file will maintain the metadata and state of the streaming query. The checkpoint file is necessary to ensure fault tolerance and enable the query’s recovery in case of failure. Among many other pieces of information, it will maintain which transaction log entries of the streaming source were already processed, so it can identify the new entries that have not yet been processed. ([Location 4702](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4702))
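A minimal streaming sketch, with a Delta table as both source and sink; the table names and checkpoint path are hypothetical. The checkpoint directory is where Spark records which transaction-log entries have already been processed.

```python
bronze_stream = (spark.readStream
    .format("delta")
    .table("taxidb.yellow_taxis_bronze"))

query = (bronze_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/yellow_taxis_silver")
    .toTable("taxidb.yellow_taxis_silver"))
```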
- When we start the streaming query we see the stream initializing, and a query progress log (QPL) is displayed. The QPL is a JSON log generated by every single micro-batch, and provides execution details on the micro-batch. ([Location 4714](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4714))
- The query log also contains the batchId, which is the micro-batch ID. For every stream, this ID will start with zero and increment by one for every processed micro-batch. ([Location 4732](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4732))
- The source’s startOffset and endOffset indicate where each batch started and ended. They include the following subfields: ([Location 4736](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4736))
  - The reservoirVersion is the version of the Delta table on which the current micro-batch is operating.
  - The index is used to keep track of which part file to start processing from.
  - The isStartingVersion boolean field is set to true if the reservoirVersion is set to the version of the Delta table at which the current stream was started.
  - Separately, the sink field contains the location of the streaming sink.
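These QPL fields can also be inspected programmatically; a sketch that assumes `query` is the StreamingQuery handle returned by writeStream.start() in the earlier sketch:

```python
# Look at the most recent micro-batch's progress for a running StreamingQuery.
progress = query.lastProgress                      # latest progress as a dict, or None
if progress:
    print(progress["batchId"])                     # starts at 0, increments per micro-batch
    print(progress["sources"][0]["startOffset"])   # reservoirVersion, index, isStartingVersion
    print(progress["sink"])                        # description of the streaming sink
```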
- However, when we started the .writeStream, the streaming query started running and immediately processed the first 10 rows as part of batch 0. We can also see that our batchId is currently 1, and since batchIds start with 0, the first batch was already processed. ([Location 4756](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4756))
- We have one file (metadata), and two directories (offsets and commits). Let’s take a look at each one. The metadata file simply contains the stream identifier in JSON format: ([Location 4804](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4804))
- The AvailableNow trigger option consumes all available records as an incremental batch with the ability to configure batch sizes with options such as maxBytesPerTrigger. ([Location 4846](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4846))
- This AvailableNow model means that we could now run the streaming query as shown in the “02 - AvailableNow Streaming” notebook just once a day, or once an hour, or in whatever time interval the use case demands. Delta Lake will always pick up all changes that happened to the source table since the last run, thanks to the state saved in the checkpoint file. ([Location 4891](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4891))
- In addition to the AvailableNow trigger, there is also a RunOnce trigger, which behaves very similarly. Both triggers will process all available data. However, the RunOnce trigger will consume all records in a single batch, while the AvailableNow trigger will process the data in multiple batches when appropriate, typically resulting in better scalability. ([Location 4897](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=4897))
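A sketch contrasting the two triggers; the paths are placeholders, an active SparkSession `spark` is assumed, and in practice you would keep only one of the two writeStream definitions:

```python
stream_df = (spark.readStream
    .format("delta")
    .option("maxBytesPerTrigger", "1g")    # rate limit respected when batching with availableNow
    .load("/mnt/datalake/source_table"))

# AvailableNow: processes all available data, split into multiple micro-batches.
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/datalake/_checkpoints/available_now")
    .trigger(availableNow=True)
    .start("/mnt/datalake/target_table"))

# RunOnce: processes all available data in a single batch.
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/datalake/_checkpoints/run_once")
    .trigger(once=True)
    .start("/mnt/datalake/target_table"))
```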
- The CDF captures change events for all the data written into the table. These changes can be transmitted to downstream consumers, who can read the change events captured in the CDF using streaming queries with .readStream(). ([Location 5009](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5009))
- Setting readChangeFeed to true in conjunction with .readStream() will allow us to efficiently stream changes from a source table to a downstream target table. We can also use startingVersion or startingTimestamp to specify the starting point of the Delta table streaming source without processing the entire table: ([Location 5014](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5014))
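A sketch of such a CDF streaming read, assuming CDF is enabled on the (placeholder) source table and an active SparkSession `spark`; the version number is arbitrary:

```python
# Stream only the change events from a source table, starting at a chosen version.
cdf_stream = (spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)          # or: .option("startingTimestamp", "2024-01-01")
    .table("taxidb.tripData"))
```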
- When reading change data, there are other options that we can specify, specifically around data changes and rate limits (how much data is processed in each micro-batch). ([Location 5047](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5047))
- This will read the latest version of the table (since no version or timestamp is specified) and give us better control over the size of micro-batches and processing resources to reduce potential interruptions to the streaming query. ([Location 5072](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5072))
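For example, a rate-limited change feed read of the latest version might look like this sketch (table name and option values are placeholders):

```python
# Read the change feed from the latest version, capping each micro-batch's size.
cdf_stream = (spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("maxFilesPerTrigger", 50)      # at most 50 files per micro-batch
    .option("maxBytesPerTrigger", "1g")    # soft cap on bytes per micro-batch
    .table("taxidb.tripData"))
```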
- One of Delta Lake’s key features is the unification of batch and streaming data into a single table. ([Location 5077](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5077))
- Delta Sharing is an open source protocol designed with the following objectives: ([Location 5178](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5178))
  - Open cross-platform data sharing: Delta Sharing provides an open source, cross-platform solution that avoids vendor lock-in. It allows data sharing in Delta Lake and Apache Parquet formats with any platform, whether on premises or another cloud.
  - Share live data without data movement: Data recipients can directly connect to Delta Sharing without replicating the data. This feature enables the easy and real-time sharing of existing data without unnecessary data duplication or movement.
  - Support a wide range of clients: Delta Sharing supports a diverse range of clients, including popular tools like Power BI, Tableau, Apache Spark, pandas, and Java. It offers flexibility for consuming data using the tools of choice for various use cases, such as business intelligence, machine learning, and AI. Implementing a Delta Sharing connector is quick and straightforward.
  - Centralized governance: Delta Sharing provides robust security, auditing, and governance capabilities. Data providers have granular control over data access, allowing them to share an entire table or specific versions or partitions of a table. Access to shared data is managed and audited from a single enforcement point, ensuring centralized control and compliance.
  - Scalability for massive datasets: Delta Sharing is designed to handle massive structured datasets, and supports sharing unstructured data and future data derivatives such as machine learning models, dashboards, notebooks, and tabular data. Delta Sharing enables the economical and reliable sharing of large-scale datasets by leveraging the cost-effectiveness and scalability of cloud storage systems.
- Delta Sharing is an open protocol that defines REST API endpoints that enable secure access to specific portions of a cloud dataset. It leverages the capabilities of modern cloud storage systems like Amazon S3, ADLS, or GCS to ensure the reliable transfer of large datasets. The process involves two key parties: data providers and recipients. ([Location 5201](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5201))
- The data provider decides what data they want to share and runs a sharing server in front of it that implements the Delta Sharing protocol and manages access for recipients. Open source Delta Lake includes a reference sharing server, and Databricks provides one for its platform; other vendors are expected to soon follow. ([Location 5212](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5212))
- The recipient’s client authenticates to the sharing server (via a bearer token or other method) and asks to query a specific table. ([Location 5218](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5218))
- The server verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back. ([Location 5220](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5220))
- To transfer the data, the server generates short-lived pre-signed URLs that allow the client to read these Parquet files directly from the cloud provider, so that the transfer can happen in parallel with massive bandwidth, without streaming through the sharing server. ([Location 5222](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5222))
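On the recipient side, a minimal sketch using the open source delta-sharing Python client; the profile file path and the share/schema/table names are placeholders issued by the data provider:

```python
import delta_sharing

# Profile file containing the sharing server endpoint and bearer token (placeholder path).
profile_file = "/path/to/provider.share"

# Table coordinates follow the pattern <profile>#<share>.<schema>.<table>.
table_url = profile_file + "#share_name.schema_name.table_name"

# Fetch the shared table into a pandas DataFrame via the Delta Sharing REST protocol.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```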
- Among the key benefits of using Delta Sharing for data providers and data recipients, the following are the most important: ([Location 5321](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5321))
  - Scalability is critical for data teams working with ever-growing datasets and high-demand use cases.
  - Interoperability is another significant benefit. Delta Sharing, as an open source technology, is designed to work in harmony with other components of the data ecosystem, facilitating seamless integration.
  - Transparency and security are improved compared to proprietary solutions, as the Delta Sharing source code is available for review, which allows for stronger security measures and the ability to respond to and proactively address identified vulnerabilities.
- Throughout this book you have learned about the five key capabilities that help enable the lakehouse architecture: the storage layer, data management, SQL analytics, data science and machine learning, and the medallion architecture. ([Location 5336](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5336))
- In the late 2010s, the concept of the lakehouse emerged. This introduced a modernized version of a data warehouse that provides all of the benefits and features without compromising the flexibility of a data lake. The lakehouse leverages a low-cost, flexible cloud storage layer, a data lake, combined with data reliability and consistency guarantees through technologies that feature open-table formats with support for ACID transactions. ([Location 5351](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5351))
- The first step, or layer, in any well-designed architecture is deciding where to store your data. ([Location 5360](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5360))
- One of the key elements of a data lake is that a cloud object store provides limitless scalability to store any type of data. ([Location 5373](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5373))
- In structured data all the data has a predefined structure, or schema. This is most commonly relational data coming from a database in the form of tables with rows and columns. ([Location 5377](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5377))
- Semi-structured data does not conform to a typical relational format like structured data. Rather, it is loosely structured with patterns or tags that separate elements of the data, such as key/value pairs. Examples of semi-structured data are Parquet, JSON, XML, and CSV files, and even emails or social feeds. ([Location 5384](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5384))
- Unstructured data does not contain an organized structure; it is not arranged in any type of schema or pattern. It is often delivered as media files, such as photo files (e.g., JPEG) or video files (e.g., MP4). ([Location 5391](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5391))
- One of the most important features of a lakehouse is unifying platforms, and a cloud data lake helps eliminate and consolidate data silos and different types of data into a single object store. ([Location 5409](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5409))
- A single storage layer reduces the need for multiple security policies that cover different systems and helps resolve difficulties with collaboration across systems. It also offers data consumers a single place to look for all sources of data. ([Location 5416](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5416))
- cloud data lakes allow for the ultimate flexibility to store data. ([Location 5419](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5419))
- Storage is generally inexpensive, whereas compute is not. The cloud data lake allows you to decouple this and independently scale your storage and store vast amounts of data at very little cost. ([Location 5429](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5429))
- Although a cloud data lake allows you to elastically store data at scale in its native format, among other benefits, the next piece of the lakehouse foundation is facilitating data management. ([Location 5462](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5462))
- Data management on a data lake begins with a structured transactional layer for reliability. This reliability comes from a transaction layer that supports ACID transactions, open-table formats, integration between batch and streaming data processing, and scalable metadata management. ([Location 5472](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5472))
- At the core of Delta Lake is the metadata layer. This layer provides extensive and scalable metadata tracking that enables most of the core features of Delta Lake. It provides a level of abstraction to implement ACID transactions and a variety of other management features. These metadata layers are also a natural place to begin enabling governance features, such as access control and audit logging. ([Location 5483](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5483))
- Delta Lake will ensure that data is kept intact when concurrent transactions are active on a single table. Support for ACID transactions brings consistency and reliability to your data lake, which is made possible through the transaction log. ([Location 5488](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5488))
- Since Delta Lake stores information about which files are part of a table as a transaction log, it allows users to query old versions of data and perform rollbacks, also referred to as time traveling. ([Location 5493](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5493))
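A sketch of time travel against a Delta table; the path and version number are placeholders and an active SparkSession `spark` is assumed:

```python
# Query an older snapshot of the table by version (or by timestamp).
old_df = (spark.read
    .format("delta")
    .option("versionAsOf", 3)              # or: .option("timestampAsOf", "2024-01-01")
    .load("/mnt/datalake/taxidb/tripData"))
old_df.show()
```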
- Delta Lake is an open source framework that is cloud agnostic, and since it interacts seamlessly with Apache Spark by providing a set of APIs and extensions, there are a vast number of different integration capabilities across projects, other APIs, and platforms. ([Location 5510](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5510))
- The curation of data through ETL/ELT has traditionally been thought of and presented in the context of a data warehouse, but through the data management features offered by Delta Lake, you can bring those processes to a single place. ([Location 5524](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5524))
- machine learning operations (MLOps) are the set of practices and principles involved in the end-to-end machine learning lifecycle. ([Location 5620](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5620))
- Data silos often start to develop as the gap between data engineering activities and data science activities begins to grow. ([Location 5625](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5625))
- Machine learning uses historical data to train models in order to make accurate predictions on streaming data. ([Location 5631](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5631))
- And if data scientists wish to consume data via ODBC/JDBC interfaces from tables created through data engineering pipelines, these interfaces can create a very inefficient process, largely because they are designed to work with SQL and offer limited support for non-SQL code logic. The result is inefficient non-SQL queries, driven by data volume, data conversions, and the complex data structures that non-SQL code logic often involves. ([Location 5638](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5638))
- Raw data from the data sources is ingested into the Bronze layer without any transformations or business rule enforcement. ([Location 5752](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5752))
- In the Silver layer we first cleanse and normalize the data. ([Location 5768](https://readwise.io/to_kindle?action=open&asin=B0CL5NCKQ7&location=5768))