# Data Pipelines Pocket Reference

## Metadata
- Author: [[James Densmore]]
- Full Title: Data Pipelines Pocket Reference
- Category: #python #data-engineering
## Highlights
- Data pipelines are sets of processes that move and transform data from various sources to a destination where new value can be derived. ([Location 90](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=90))
- In practice, however, pipelines typically consist of multiple steps including data extraction, data preprocessing, data validation, and at times training or running a machine learning model before delivering data to its final destination. ([Location 94](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=94))
- Data engineers specialize in building and maintaining the data pipelines that underpin the analytics ecosystem. ([Location 104](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=104))
- Data engineers work closely with data scientists and analysts to understand what will be done with the data and help bring their needs into a scalable production state. ([Location 106](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=106))
- Apache Spark is another popular distributed processing framework, which is quickly surpassing Hadoop in popularity. ([Location 127](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=127))
- The term data ingestion refers to extracting data from one source and loading it into another. ([Location 183](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=183))
- Well-structured data is often easiest to work with, but it’s usually structured in the interest of an application or website. ([Location 204](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=204))
- Semistructured data such as JSON is increasingly common and has the advantage of the structure of attribute-value pairs and nesting of objects. ([Location 206](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=206))
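A quick illustration of that structure, using Python's standard json module; the record shape and field names here are hypothetical:

```python
import json

# Hypothetical order record: attribute-value pairs plus a nested object and a list
raw = '{"order_id": 101, "customer": {"id": 7, "country": "US"}, "items": [{"sku": "A1", "qty": 2}]}'

order = json.loads(raw)

# Nested attributes are reached by key rather than by a fixed column position
print(order["customer"]["country"])           # US
print(sum(i["qty"] for i in order["items"]))  # 2
```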
- Unstructured data is common for some analytics endeavors. For example, Natural Language Processing (NLP) models require vast amounts of free text data to train and validate. Computer Vision (CV) projects require images and video content. Even less daunting projects such as scraping data from web pages have a need for free text data from the web in addition to the semistructured HTML markup of a web page. ([Location 209](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=209))
- As the old saying goes, “garbage in, garbage out.” It’s important to understand the limitations and deficiencies of source data and address them in the appropriate sections of your pipelines. ([Location 226](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=226))
- Assume the worst, expect the best: Pristine datasets only exist in academic literature. ([Location 233](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=233))
- Clean and validate data in the system best suited to do so: There are times when it’s better to wait to clean data until later in a pipeline. For example, modern pipelines tend to follow an extract-load-transform (ELT) rather than extract-transform-load (ETL) approach for data warehousing ([Location 235](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=235))
- Validate often: Even if you don’t clean up data early in a pipeline, don’t wait until the end of the pipeline to validate it. ([Location 240](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=240))
- The ease of building and deploying data pipelines, data lakes, warehouses, and analytics processing in the cloud. ([Location 257](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=257))
- A data warehouse is a database where data from different systems is stored and modeled to support analysis and other activities related to answering questions with it. Data in a data warehouse is structured and optimized for reporting and analysis queries. ([Location 263](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=263))
- A data lake is where data is stored, but without the structure or query optimization of a data warehouse. ([Location 265](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=265))
- Data transformation: Transforming data is a broad term that is signified by the T in an ETL or ELT process. A transformation can be something as simple as converting a timestamp stored in a table from one time zone to another. It can also be a more complex operation that creates a new metric from multiple source columns that are aggregated and filtered through some business logic. ([Location 293](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=293))
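As a sketch of both kinds of transformation mentioned above, here is a hypothetical pandas example: a simple time zone conversion plus a derived metric built from two source columns (the column names and business logic are invented for illustration):

```python
import pandas as pd

# Hypothetical source data with a UTC timestamp and two measure columns
df = pd.DataFrame({
    "ordered_at": pd.to_datetime(
        ["2021-03-01 15:00:00", "2021-03-02 02:30:00"], utc=True
    ),
    "revenue": [120.0, 80.0],
    "cost": [70.0, 65.0],
})

# Simple transformation: convert the timestamp from UTC to another time zone
df["ordered_at_local"] = df["ordered_at"].dt.tz_convert("America/New_York")

# More involved transformation: derive a new metric from multiple source
# columns, then aggregate it by day
df["margin"] = (df["revenue"] - df["cost"]) / df["revenue"]
daily_margin = df.groupby(df["ordered_at_local"].dt.date)["margin"].mean()
print(daily_margin)
```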
- Data modeling: Data modeling is a more specific type of data transformation. A data model structures and defines data in a format that is understood and optimized for data analysis. A data model is usually represented as one or more tables in a data warehouse. ([Location 298](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=298))
- As the complexity and number of data pipelines in an organization grows, it’s important to introduce a workflow orchestration platform to your data infrastructure. ([Location 317](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=317))
- Workflow orchestration platforms are also referred to as workflow management systems (WMSs), orchestration platforms, or orchestration frameworks. I use these terms interchangeably in this text. ([Location 323](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=323))
- directed, meaning they start with a general task or multiple tasks and end with a specific task or tasks. ([Location 332](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=332))
- acyclic, meaning that a task cannot point back to a previously completed task. In other words, it cannot cycle back. ([Location 335](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=335))
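The book's orchestration examples use Apache Airflow, where a pipeline is declared as exactly this kind of graph. A minimal sketch; the task names, scripts, and schedule are invented for illustration:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG: three tasks that run in a directed, acyclic order
with DAG(
    dag_id="elt_pipeline_sample",
    schedule_interval=timedelta(days=1),
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract_orders", bash_command="python extract_orders.py")
    load = BashOperator(task_id="load_orders", bash_command="python load_orders.py")
    transform = BashOperator(task_id="transform_orders", bash_command="python transform_orders.py")

    # Dependencies point one way; no task can cycle back to an earlier one
    extract >> load >> transform
```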
- There is perhaps no pattern more well known than ETL and its more modern sibling, ELT. ([Location 367](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=367))
- The extract step gathers data from various sources in preparation for loading and transforming. ([Location 377](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=377))
- The load step brings either the raw data (in the case of ELT) or the fully transformed data (in the case of ETL) into the final destination. ([Location 379](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=379))
- The transform step is where the raw data from each source system is combined and formatted in such a way that it’s useful to analysts, visualization tools, or whatever use case your pipeline is serving. ([Location 382](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=382))
- The combination of the extraction and loading steps is often referred to as data ingestion. Especially in ELT and the EtLT subpattern (note the lowercase t), which is defined later in this chapter, extraction and loading capabilities are often tightly coupled and packaged together in software frameworks. When designing pipelines, however, it is still best to consider the two steps as separate due to the complexity of coordinating extracts and loads across different systems and infrastructure. ([Location 386](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=386))
- It’s now better to focus on extracting data and loading it into a data warehouse where you can then perform the necessary transformations to complete the pipeline. ([Location 409](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=409))
- row-based storage is optimal since the data the application needs is stored in close proximity on disk, and the amount of data queried at one time is small. ([Location 418](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=418))
- columnar database, such as Snowflake or Amazon Redshift, stores data in disk blocks by column rather than row. ([Location 430](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=430))
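A toy illustration of the difference between the two layouts, using three hypothetical order records: a row store keeps each record's values together, while a column store keeps each column's values together, so a query touching one column reads only that column's blocks:

```python
# Row-oriented layout: values for one record stored together
rows = [
    {"order_id": 1, "customer": "a", "total": 10.0},
    {"order_id": 2, "customer": "b", "total": 25.0},
    {"order_id": 3, "customer": "c", "total": 15.0},
]

# Column-oriented layout: values for one column stored together
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "c"],
    "total": [10.0, 25.0, 15.0],
}

# Summing one column only needs the "total" block in the columnar layout,
# whereas the row layout touches every full record
print(sum(columns["total"]))          # 50.0
print(sum(r["total"] for r in rows))  # 50.0
```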
- However, instead of transformation involving business logic or data modeling, this type of transformation is more limited in scope. I refer to this as lowercase t transformation, or EtLT. ([Location 447](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=447))
- ELT has become the most common and, in my opinion, most optimal pattern for pipelines built for data analysis. ([Location 457](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=457))
- As already discussed, columnar databases are well suited to handling high volumes of data. They are also designed to handle wide tables, meaning tables with many columns, thanks to the fact that only data in columns used in a given query are scanned on disk and loaded into memory. ([Location 460](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=460))
- With the emergence of ELT, data analysts have become more autonomous and empowered to deliver value from data without being “blocked” by data engineers. Data engineers can focus on data ingestion and supporting infrastructure that enables analysts to write and deploy their own transform code written as SQL. With that empowerment have come new job titles such as the analytics engineer. ([Location 473](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=473))
- If you’re building pipelines to support data scientists, you’ll find that the extract and load steps of the ELT pattern will remain pretty much the same as they will for supporting analytics. ([Location 485](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=485))
- Data is used for more than analysis, reporting, and predictive models. It’s also used for powering data products. Some common examples of data products include the following: a content recommendation engine that powers a video streaming home screen; a personalized search engine on an e-commerce website; an application that performs sentiment analysis on user-generated restaurant reviews. ([Location 491](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=491))
- Data ingestion: This step ([Location 507](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=507))
- Data preprocessing: The data that’s ingested is unlikely to be ready to use in ML development. Preprocessing is where data is cleaned and otherwise prepared for models. For example, this is the step in a pipeline where text is tokenized, features are converted to numerical values, and input values are normalized. ([Location 513](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=513))
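A minimal sketch of those preprocessing steps using scikit-learn; the review text and price feature are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw inputs: free text plus a numeric feature on a wide scale
reviews = ["great food and friendly service", "slow service, food was cold"]
prices = [[12.0], [48.0]]

# Tokenize the text and convert it to numerical features
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(reviews)

# Normalize the numeric input values to a 0-1 range
scaler = MinMaxScaler()
scaled_prices = scaler.fit_transform(prices)

print(text_features.shape)    # (2, vocabulary size)
print(scaled_prices.ravel())  # [0. 1.]
```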
- Model training: After new data is ingested and preprocessed, ML models need to be retrained. ([Location 517](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=517))
- Model deployment: Deploying models to production can be the most challenging part of going from research-oriented machine learning to a true data product. Here, not only is versioning of datasets necessary, but versioning of trained models is also needed. ([Location 519](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=519))
- The first two steps in the ELT pattern, extract and load, are collectively referred to as data ingestion. ([Location 557](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=557))
- Because you’ll be storing credentials and connection information in the configuration file, make sure you don’t add it to your Git repo. ([Location 602](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=602))
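The book reads connection details from a pipeline.conf file with Python's configparser; a sketch along those lines, where the section and key names are assumptions about how the file is laid out:

```python
import configparser

# pipeline.conf holds credentials and connection info; adding a line such as
# "pipeline.conf" to .gitignore keeps it out of the repo
parser = configparser.ConfigParser()
parser.read("pipeline.conf")

# Section and key names are assumptions for this sketch
hostname = parser.get("mysql_config", "hostname")
username = parser.get("mysql_config", "username")
password = parser.get("mysql_config", "password")
```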
- Binlog replication is also a path to creating a streaming data ingestion. ([Location 659](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=659))
- In a full extraction, every record in the table is extracted on each run of the extraction job. This is the least complex approach, but for high-volume tables it can take a long time to run. ([Location 689](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=689))
- In an incremental extraction, only records from the source table that have changed or been added since the last run of the job are extracted. ([Location 693](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=693))
- For tables containing immutable data (meaning records can be inserted, but not updated), you can make use of the timestamp for when the record was created instead of a LastUpdated column. ([Location 701](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=701))
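A sketch of an incremental extraction query, assuming a hypothetical Orders table with a LastUpdated column and a high-water mark saved from the previous run (connection details would normally come from pipeline.conf):

```python
import pymysql

# Hypothetical connection details
conn = pymysql.connect(
    host="localhost",
    user="pipeline_user",
    password="...",
    database="shop",
)

# High-water mark recorded at the end of the previous run of the job
last_run_max = "2021-06-01 00:00:00"

# Only rows changed or added since the last run are pulled
sql = "SELECT * FROM Orders WHERE LastUpdated > %s;"
with conn.cursor() as cursor:
    cursor.execute(sql, (last_run_max,))
    changed_rows = cursor.fetchall()

# For immutable tables, filter on the record-creation timestamp instead
conn.close()
```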
- Beware of large extraction jobs — whether full or incremental — putting strain on the source MySQL database, and even blocking production queries from executing. ([Location 903](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=903))
- Binlog replication is a form of change data capture (CDC). Many source data stores have some form of CDC that you can use. ([Location 912](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=912))
- The STATEMENT format logs every SQL statement that inserts or modifies a row in the binlog. ([Location 946](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=946))
- With the ROW format, every change to a row in a table is represented on a line of the binlog not as a SQL statement but rather the data in the row itself. ([Location 951](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=951))
- The MIXED format logs both STATEMENT- and ROW-formatted records in the binlog. ([Location 952](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=952))
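Which format a MySQL instance is using can be checked with ordinary SQL before setting up binlog-based ingestion; a small sketch, with connection details again hypothetical:

```python
import pymysql

# Hypothetical connection; credentials would normally come from pipeline.conf
conn = pymysql.connect(host="localhost", user="pipeline_user", password="...", database="shop")

with conn.cursor() as cursor:
    # Reports STATEMENT, ROW, or MIXED
    cursor.execute("SHOW VARIABLES LIKE 'binlog_format';")
    print(cursor.fetchone())

conn.close()
```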
- ingesting data from a PostgreSQL (commonly known as Postgres) database can be done in one of two ways: either with full or incremental extractions using SQL or by leveraging features of the database meant to support replication to other nodes. In the case of Postgres, there are a few ways to do this, but this chapter will focus on one method: turning the Postgres write-ahead log (WAL) into a data stream. ([Location 1104](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=1104))
- the Postgres WAL can be used as a method of CDC for extraction. ([Location 1181](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=1181))
- Though you can take a similar, simplified approach to the one used as an example with the MySQL binlog, I suggest using an open source distributed platform called Debezium to stream the contents of the Postgres WAL to an S3 bucket or data warehouse. ([Location 1184](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=1184))
- Debezium is a distributed system made up of several open source services that capture row-level changes from common CDC systems and then stream them as events that are consumable by other systems. There are three primary components of a Debezium installation: Apache Zookeeper manages the distributed environment and handles configuration across each service. Apache Kafka is a distributed streaming platform that is commonly used to build highly scalable data pipelines. Apache Kafka Connect is a tool to connect Kafka with other systems so that the data can be easily streamed via Kafka. Connectors are built for systems like MySQL and Postgres and turn data from their CDC systems (binlogs and WAL) into Kafka topics. ([Location 1439](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=1439))
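Debezium connectors are typically registered by POSTing a JSON configuration to the Kafka Connect REST API (port 8083 by default). A sketch for a Postgres connector; the host names, credentials, and server name are assumptions, and exact property names vary by Debezium version:

```python
import json

import requests

# Hypothetical Debezium Postgres connector configuration
connector = {
    "name": "shop-postgres-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "localhost",
        "database.port": "5432",
        "database.user": "pipeline_user",
        "database.password": "...",
        "database.dbname": "shop",
        "database.server.name": "shop",  # used as the Kafka topic prefix
    },
}

# Register the connector with Kafka Connect
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
print(resp.status_code)
```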
- Kafka exchanges messages that are organized by topic. ([Location 1448](https://readwise.io/to_kindle?action=open&asin=B08WGSM9CJ&location=1448))
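Downstream systems then subscribe to those topics. A minimal sketch with the kafka-python client; the topic name and broker address are assumptions (Debezium names topics after the configured server name/prefix and the table):

```python
import json

from kafka import KafkaConsumer

# Hypothetical topic and broker; e.g. "shop.public.orders" for a Postgres table
consumer = KafkaConsumer(
    "shop.public.orders",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    # Each message is one row-level change event captured from the WAL
    print(message.value)
```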