# Fundamentals of Data Engineering

![rw-book-cover](https://m.media-amazon.com/images/I/81+oMD7Lm7L._SY160.jpg)

## Metadata

- Author: Joe Reis and Matt Housley
- Full Title: Fundamentals of Data Engineering
- Category: #books

## Highlights

- Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. ([Location 264](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=264))
- Data engineering is all about the movement, manipulation, and management of data. ([Location 282](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=282))
- Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning. ([Location 291](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=291))
- The stages of the data engineering lifecycle are as follows:
  - Generation
  - Storage
  - Ingestion
  - Transformation
  - Serving ([Location 304](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=304))
- The data engineering lifecycle also has a notion of undercurrents—critical ideas across the entire lifecycle. These include security, data management, DataOps, data architecture, orchestration, and software engineering. ([Location 306](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=306))
- Data warehousing ushered in the first age of scalable analytics, with new massively parallel processing (MPP) databases that use multiple processors to crunch large amounts of data coming on the market and supporting unprecedented volumes of data. ([Location 328](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=328))
- Coinciding with the explosion of data, commodity hardware—such as servers, RAM, disks, and flash drives—also became cheap and ubiquitous. Several innovations allowed distributed computation and storage on massive computing clusters at a vast scale. These innovations started decentralizing and breaking apart traditionally monolithic services. ([Location 343](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=343))
- Another famous and succinct description of big data is the three Vs of data: velocity, variety, and volume. ([Location 349](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=349))
- Traditional enterprise-oriented and GUI-based data tools suddenly felt outmoded, and code-first engineering was in vogue with the ascendance of MapReduce. ([Location 375](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=375))
- Despite the term’s popularity, big data has lost steam. What happened? One word: simplification. Despite the power and sophistication of open source big data tools, managing them was a lot of work and required constant attention. Often, companies employed entire teams of big data engineers, costing millions of dollars a year, to babysit these platforms. Big data engineers often spent excessive time maintaining complicated tooling and arguably not as much time delivering the business’s insights and value. ([Location 393](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=393))
- Whereas data engineers historically tended to the low-level details of monolithic frameworks such as Hadoop, Spark, or Informatica, the trend is moving toward decentralized, modularized, managed, and highly abstracted tools. ([Location 404](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=404))
- Popular trends in the early 2020s include the modern data stack, representing a collection of off-the-shelf open source and third-party products assembled to make analysts’ lives easier. ([Location 407](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=407))
- With greater abstraction and simplification, a data lifecycle engineer is no longer encumbered by the gory details of yesterday’s big data frameworks. While data engineers maintain skills in low-level data programming and use these as required, they increasingly find their role focused on things higher in the value chain: security, data management, DataOps, data architecture, orchestration, and general data lifecycle management. ([Location 417](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=417))
- Instead of focusing on who has the “biggest data,” open source projects and services are increasingly concerned with managing and governing data, making it easier to use and discover, and improving its quality. ([Location 422](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=422))
- Data engineers managing the data engineering lifecycle have better tools and techniques than ever before. ([Location 431](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=431))
- Data scientists aren’t typically trained to engineer production-grade data systems, and they end up doing this work haphazardly because they lack the support and resources of a data engineer. ([Location 451](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=451))
- The skill set of a data engineer encompasses the “undercurrents” of data engineering: security, data management, DataOps, data architecture, and software engineering. ([Location 461](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=461))
- Finally, a data engineer juggles a lot of complex moving parts and must constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and interoperability ([Location 467](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=467))
- The data engineer is also expected to create agile data architectures that evolve as new trends emerge. ([Location 477](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=477))
- Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization, but data maturity does not simply depend on the age or revenue of a company. An early-stage startup can have greater data maturity than a 100-year-old company with annual revenues in the billions. What matters is the way data is leveraged as a competitive advantage. ([Location 484](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=484))
- A data engineer should focus on the following in organizations getting started with data:
  - Get buy-in from key stakeholders, including executive management. Ideally, the data engineer should have a sponsor for critical initiatives to design and build a data architecture to support the company’s goals.
  - Define the right data architecture (usually solo, since a data architect likely isn’t available). This means determining business goals and the competitive advantage you’re aiming to achieve with your data initiative. Work toward a data architecture that supports these goals. See Chapter 3 for our advice on “good” data architecture.
  - Identify and audit data that will support key initiatives and operate within the data architecture you designed.
  - Build a solid data foundation for future data analysts and data scientists to generate reports and models that provide competitive value. In the meantime, you may also have to generate these reports and models until this team is hired. ([Location 508](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=508))
- Just keep in mind that quick wins will likely create technical debt. Have a plan to reduce this debt, as it will otherwise add friction for future delivery. ([Location 519](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=519))
- Build custom solutions and code only where this creates a competitive advantage. ([Location 524](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=524))
- In organizations that are in stage 2 of data maturity, a data engineer’s goals are to do the following:
  - Establish formal data practices
  - Create scalable and robust data architectures
  - Adopt DevOps and DataOps practices
  - Build systems that support ML
  - Continue to avoid undifferentiated heavy lifting and customize only when a competitive advantage results ([Location 528](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=528))
- Issues to watch out for include the following:
  - As we grow more sophisticated with data, there’s a temptation to adopt bleeding-edge technologies based on social proof from Silicon Valley companies. This is rarely a good use of your time and energy. Any technology decisions should be driven by the value they’ll deliver to your customers.
  - The main bottleneck for scaling is not cluster nodes, storage, or technology but the data engineering team. Focus on solutions that are simple to deploy and manage to expand your team’s throughput.
  - You’ll be tempted to frame yourself as a technologist, a data genius who can deliver magical products. Shift your focus instead to pragmatic leadership and begin transitioning to the next maturity stage; communicate with other teams about the practical utility of data. Teach the organization how to consume and leverage data. ([Location 533](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=533))
- In organizations in stage 3 of data maturity, a data engineer will continue building on prior stages, plus they will do the following:
  - Create automation for the seamless introduction and usage of new data
  - Focus on building custom tools and systems that leverage data as a competitive advantage
  - Focus on the “enterprisey” aspects of data, such as data management (including data governance and quality) and DataOps
  - Deploy tools that expose and disseminate data throughout the organization, including data catalogs, data lineage tools, and metadata management systems
  - Collaborate efficiently with software engineers, ML engineers, analysts, and others
  - Create a community and environment where people can collaborate and speak openly, no matter their role or position ([Location 545](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=545))
- Data engineering is a fast-growing field, and a lot of questions remain about how to become a data engineer. Because data engineering is a relatively new discipline, little formal training is available to enter the field. ([Location 561](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=561))
- By definition, a data engineer must understand both data and technology. ([Location 572](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=572))
- Know how to communicate with nontechnical and technical people. Communication is key, and you need to be able to establish rapport and trust with people across the organization. We suggest paying close attention to organizational hierarchies, who reports to whom, how people interact, and which silos exist. These observations will be invaluable to your success. Understand how to scope and gather business and product requirements. You need to know what to build and ensure that your stakeholders agree with your assessment. In addition, develop a sense of how data and technology decisions impact the business. Understand the cultural foundations of Agile, DevOps, and DataOps. Many technologists mistakenly believe these practices are solved through technology. We feel this is dangerously wrong. Agile, DevOps, and DataOps are fundamentally cultural, requiring buy-in across the organization. Control costs. You’ll be successful when you can keep costs low while providing outsized value. Know how to optimize for time to value, the total cost of ownership, and opportunity cost. Learn to monitor costs to avoid surprises. Learn continuously. The data field feels like it’s changing at light speed. People who succeed in it are great at picking up new things while sharpening their fundamental knowledge. They’re also good at filtering, determining which new developments are most relevant to their work, which are still immature, and which are just fads. Stay abreast of the field and learn how to learn. ([Location 581](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=581))
- A successful data engineer always zooms out to understand the big picture and how to achieve outsized value for the business. ([Location 593](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=593))
- People often ask, should a data engineer know how to code? Short answer: yes. A data engineer should have production-grade software engineering chops. ([Location 610](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=610))
- Even in a more abstract world, software engineering best practices provide a competitive advantage, and data engineers who can dive into the deep architectural details of a codebase give their companies an edge when specific technical needs arise. ([Location 615](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=615))
- SQL: The most common interface for databases and data lakes. After briefly being sidelined by the need to write custom MapReduce code for big data processing, SQL (in various forms) has reemerged as the lingua franca of data.
- Python: The bridge language between data engineering and data science. A growing number of data engineering tools are written in Python or have Python APIs. It’s known as “the second-best language at everything.” Python underlies popular data tools such as pandas, NumPy, Airflow, scikit-learn, TensorFlow, PyTorch, and PySpark. Python is the glue between underlying components and is frequently a first-class API language for interfacing with a framework.
- JVM languages such as Java and Scala: Prevalent for Apache open source projects such as Spark, Hive, and Druid. The JVM is generally more performant than Python and may provide access to lower-level features than a Python API (for example, this is the case for Apache Spark and Beam). Understanding Java or Scala will be beneficial if you’re using a popular open source data framework.
- bash: The command-line interface for Linux operating systems. Knowing bash commands and being comfortable using CLIs will significantly improve your productivity and workflow when you need to script or perform OS operations. Even today, data engineers frequently use command-line tools like awk or sed to process files in a data pipeline or call bash commands from orchestration frameworks. If you’re using Windows, feel free to substitute PowerShell for bash. ([Location 621](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=621))
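A note on the “glue” role above: it is easy to demonstrate with nothing but the standard library and pandas. A minimal, self-contained sketch (the table and column names are invented for illustration, and an in-memory SQLite database stands in for a real warehouse):

```python
import sqlite3

import pandas as pd

# Python as glue: land a raw extract in a SQL engine, then analyze it in SQL.
raw = pd.DataFrame({
    "order_date": ["2024-05-01", "2024-05-01", "2024-05-02"],
    "amount": [120.0, 80.0, 50.0],
})

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
raw.to_sql("orders", conn, index=False)

# SQL remains the lingua franca for the analysis itself.
daily = pd.read_sql(
    "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date",
    conn,
)
print(daily)
```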
- Data engineers may also need to develop proficiency in secondary programming languages, including R, JavaScript, Go, Rust, C/C++, C#, and Julia. ([Location 652](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=652))
- focus on the fundamentals to understand what’s not going to change; pay attention to ongoing developments to know where the field is going. ([Location 662](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=662))
- Data maturity is a helpful guide to understanding the types of data challenges a company will face as it grows its data capability. ([Location 668](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=668))
- Type A data engineers: A stands for abstraction. In this case, the data engineer avoids undifferentiated heavy lifting, keeping data architecture as abstract and straightforward as possible and not reinventing the wheel. Type A data engineers manage the data engineering lifecycle mainly by using entirely off-the-shelf products, managed services, and tools. Type A data engineers work at companies across industries and at all levels of data maturity.
- Type B data engineers: B stands for build. Type B data engineers build data tools and systems that scale and leverage a company’s core competency and competitive advantage. In the data maturity range, a type B data engineer is more commonly found at companies in stage 2 and 3 (scaling and leading with data), or when an initial data use case is so unique and mission-critical that custom data tools are required to get started. ([Location 677](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=677))
- An external-facing data engineer typically aligns with the users of external-facing applications, such as social media apps, Internet of Things (IoT) devices, and ecommerce platforms. ([Location 699](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=699))
- External-facing query engines often handle much larger concurrency loads than internal-facing systems. ([Location 706](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=706))
- An internal-facing data engineer typically focuses on activities crucial to the needs of the business and internal stakeholders ([Location 710](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=710))
- External-facing and internal-facing responsibilities are often blended. In practice, internal-facing data is usually a prerequisite to external-facing data. ([Location 715](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=715))
- The data engineer is a hub between data producers, such as software engineers, data architects, and DevOps or site-reliability engineers (SREs), and data consumers, such as data analysts, data scientists, and ML engineers. ([Location 724](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=724))
- To be successful as a data engineer, you need to understand the data architecture you’re using or designing and the source systems producing the data you’ll need. ([Location 729](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=729))
- Data architects function at a level of abstraction one step removed from data engineers. Data architects design the blueprint for organizational data management, mapping out processes and overall data architecture and systems. ([Location 734](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=734))
- Data architects implement policies for managing data across silos and business units, steer global strategies such as data management and data governance, and guide significant initiatives. ([Location 739](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=739))
- Software engineers build the software and systems that run a business; they are largely responsible for generating the internal data that data engineers will consume and process. ([Location 751](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=751))
- DevOps and SREs often produce data through operational monitoring. ([Location 764](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=764))
- Data engineering exists to serve downstream data consumers and use cases. ([Location 769](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=769))
- Data scientists build forward-looking models to make predictions and recommendations. ([Location 774](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=774))
- According to common industry folklore, data scientists spend 70% to 80% of their time collecting, cleaning, and preparing data. In our experience, these numbers often reflect immature data science and data engineering practices. ([Location 778](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=778))
- Data analysts (or business analysts) seek to understand business performance and trends. Whereas data scientists are forward-looking, a data analyst typically focuses on the past or present. ([Location 789](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=789))
- Machine learning engineers (ML engineers) overlap with data engineers and data scientists. ML engineers develop advanced ML techniques, train models, and design and maintain the infrastructure running ML processes in a scaled production environment. ML engineers often have advanced working knowledge of ML and deep learning techniques and frameworks such as PyTorch or TensorFlow. ([Location 798](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=798))
- the boundaries between ML engineering, data engineering, and data science are blurry. ([Location 805](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=805))
- The world of ML engineering is snowballing and parallels a lot of the same developments occurring in data engineering. ([Location 806](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=806))
- Because of increased technical abstraction, data engineers will increasingly become data lifecycle engineers, thinking and operating in terms of the principles of data lifecycle management. ([Location 982](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=982))
- The data engineering lifecycle is our framework describing “cradle to grave” data engineering. ([Location 986](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=986))
- We divide the data engineering lifecycle into five stages (Figure 2-1, top):
  - Generation
  - Storage
  - Ingestion
  - Transformation
  - Serving data ([Location 992](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=992))
- We begin the data engineering lifecycle by getting data from source systems and storing it. Next, we transform the data and then proceed to our central goal, serving data to analysts, data scientists, ML engineers, and others. ([Location 998](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=998))
- In general, the middle stages—storage, ingestion, transformation—can get a bit jumbled. ([Location 1001](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1001))
- The data engineering lifecycle is a subset of the whole data lifecycle (Figure 2-2). Whereas the full data lifecycle encompasses data across its entire lifespan, the data engineering lifecycle focuses on the stages a data engineer controls. ([Location 1013](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1013))
- A source system is the origin of the data used in the data engineering lifecycle. ([Location 1018](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1018))
- A data engineer consumes data from a source system but doesn’t typically own or control the source system itself. The data engineer needs to have a working understanding of the way source systems work, the way they generate data, the frequency and velocity of the data, and the variety of data they generate. ([Location 1023](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1023))
- There are many things to consider when assessing source systems, including how the system handles ingestion, state, and data generation. ([Location 1047](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1047))
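Several of the assessment questions that follow concern state and change tracking (periodic snapshots versus change data capture). A minimal sketch of the simpler polling pattern, using an in-memory SQLite table with a hypothetical `orders` schema and an `updated_at` watermark:

```python
import sqlite3

# Toy source table (hypothetical schema) so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "shipped", "2024-01-02T10:00:00"), (2, "new", "2023-12-30T09:00:00")],
)

def pull_incremental(conn, watermark):
    """Pull only rows changed since the last successful run (pull model).

    A real CDC feed would read the database's change log instead; polling an
    updated_at column is the simpler, snapshot-style alternative.
    """
    cur = conn.execute(
        "SELECT id, status, updated_at FROM orders WHERE updated_at > ?",
        (watermark,),
    )
    return cur.fetchall()

last_watermark = "2024-01-01T00:00:00"         # persisted between runs in practice
rows = pull_incremental(conn, last_watermark)  # only order 1 qualifies
new_watermark = max((r[2] for r in rows), default=last_watermark)
```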
- What are the essential characteristics of the data source? Is it an application? A swarm of IoT devices?
  - How is data persisted in the source system? Is data persisted long term, or is it temporary and quickly deleted?
  - At what rate is data generated? How many events per second? How many gigabytes per hour?
  - What level of consistency can data engineers expect from the output data? If you’re running data-quality checks against the output data, how often do data inconsistencies occur—nulls where they aren’t expected, lousy formatting, etc.?
  - How often do errors occur?
  - Will the data contain duplicates?
  - Will some data values arrive late, possibly much later than other messages produced simultaneously?
  - What is the schema of the ingested data? Will data engineers need to join across several tables or even several systems to get a complete picture of the data?
  - If schema changes (say, a new column is added), how is this dealt with and communicated to downstream stakeholders?
  - How frequently should data be pulled from the source system?
  - For stateful systems (e.g., a database tracking customer account information), is data provided as periodic snapshots or update events from change data capture (CDC)? What’s the logic for how changes are performed, and how are these tracked in the source database?
  - Who/what is the data provider that will transmit the data for downstream consumption?
  - Will reading from a data source impact its performance?
  - Does the source system have upstream data dependencies? What are the characteristics of these upstream systems?
  - Are data-quality checks in place to check for late or missing data? ([Location 1050](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1050))
- A data engineer should know how the source generates data, including relevant quirks or nuances. ([Location 1069](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1069))
- The schema defines the hierarchical organization of data. ([Location 1072](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1072))
- Schemaless doesn’t mean the absence of schema. Rather, it means that the application defines the schema as data is written, whether to a message queue, a flat file, a blob, or a document database such as MongoDB. A more traditional model built on relational database storage uses a fixed schema enforced in the database, to which application writes must conform. ([Location 1078](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1078))
- First, data architectures in the cloud often leverage several storage solutions. Second, few data storage solutions function purely as storage, with many supporting complex transformation queries; even object storage solutions may support powerful query capabilities—e.g., Amazon S3 Select. Third, while storage is a stage of the data engineering lifecycle, it frequently touches on other stages, such as ingestion, transformation, and serving. ([Location 1093](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1093))
- Storage runs across the entire data engineering lifecycle, often occurring in multiple places in a data pipeline, with storage systems crossing over with source systems, ingestion, transformation, and serving. In many ways, the way data is stored impacts how it is used in all of the stages of the data engineering lifecycle. ([Location 1097](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1097))
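On the point above that even object storage may support queries: Amazon S3 Select, for example, accepts SQL against a single object so filtering happens server-side rather than after a full download. A boto3 sketch, assuming credentials are configured and that the bucket, key, and column names (placeholders here) exist:

```python
import boto3

s3 = boto3.client("s3")

# Push the filter down to object storage rather than downloading the object.
resp = s3.select_object_content(
    Bucket="my-data-lake",  # hypothetical bucket
    Key="raw/orders.csv",   # hypothetical object
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.amount FROM s3object s WHERE s.status = 'active'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the matching bytes.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```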
- Here are a few key engineering questions to ask when choosing a storage system for a data warehouse, data lakehouse, database, or object storage:
  - Is this storage solution compatible with the architecture’s required write and read speeds?
  - Will storage create a bottleneck for downstream processes?
  - Do you understand how this storage technology works? Are you utilizing the storage system optimally or committing unnatural acts? For instance, are you applying a high rate of random access updates in an object storage system? (This is an antipattern with significant performance overhead.)
  - Will this storage system handle anticipated future scale? You should consider all capacity limits on the storage system: total available storage, read operation rate, write volume, etc.
  - Will downstream users and processes be able to retrieve data in the required service-level agreement (SLA)?
  - Are you capturing metadata about schema evolution, data flows, data lineage, and so forth? Metadata has a significant impact on the utility of data. Metadata represents an investment in the future, dramatically enhancing discoverability and institutional knowledge to streamline future projects and architecture changes.
  - Is this a pure storage solution (object storage), or does it support complex query patterns (i.e., a cloud data warehouse)?
  - Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data warehouse)?
  - How are you tracking master data, golden records, data quality, and data lineage for data governance? (We have more to say on these in “Data Management”.)
  - How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical locations but not others? ([Location 1102](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1102))
- Data that is most frequently accessed is called hot data. Hot data is commonly retrieved many times per day, perhaps even several times per second—for example, in systems that serve user requests. This data should be stored for fast retrieval, where “fast” is relative to the use case. Lukewarm data might be accessed every so often—say, every week or month. Cold data is seldom queried and is appropriate for storing in an archival system. Cold data is often retained for compliance purposes or in case of a catastrophic failure in another system. ([Location 1124](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1124))
- This depends on your use cases, data volumes, frequency of ingestion, format, and size of the data being ingested—essentially, the key considerations listed in the preceding bulleted questions. There is no one-size-fits-all universal storage recommendation. ([Location 1137](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1137))
- source systems and ingestion represent the most significant bottlenecks of the data engineering lifecycle. ([Location 1148](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1148))
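The hot, lukewarm, and cold tiers described above are often enforced mechanically with lifecycle rules on object storage, so data ages into cheaper storage classes without pipeline code. A boto3 sketch (the bucket name, prefix, storage classes, and day thresholds are illustrative assumptions, not recommendations):

```python
import boto3

s3 = boto3.client("s3")

# Age data down the temperature tiers: hot (standard) -> lukewarm
# (infrequent access) -> cold (archive), then expire for retention.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "cool-down-events",
            "Status": "Enabled",
            "Filter": {"Prefix": "events/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 730},
        }]
    },
)
```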
- When preparing to architect or build a system, here are some primary questions about the ingestion stage:
  - What are the use cases for the data I’m ingesting? Can I reuse this data rather than create multiple versions of the same dataset?
  - Are the systems generating and ingesting this data reliably, and is the data available when I need it?
  - What is the data destination after ingestion?
  - How frequently will I need to access the data?
  - In what volume will the data typically arrive?
  - What format is the data in? Can my downstream storage and transformation systems handle this format?
  - Is the source data in good shape for immediate downstream use? If so, for how long, and what may cause it to be unusable?
  - If the data is from a streaming source, does it need to be transformed before reaching its destination? Would an in-flight transformation be appropriate, where the data is transformed within the stream itself? ([Location 1154](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1154))
- Virtually all data we deal with is inherently streaming. Data is nearly always produced and updated continually at its source. Batch ingestion is simply a specialized and convenient way of processing this stream in large chunks—for example, handling a full day’s worth of data in a single batch. ([Location 1168](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1168))
- real-time (or near real-time) means that the data is available to a downstream system a short time after it is produced (e.g., less than one second later). ([Location 1176](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1176))
- Batch data is ingested either on a predetermined time interval or as data reaches a preset size threshold. Batch ingestion is a one-way door: once data is broken into batches, the latency for downstream consumers is inherently constrained. ([Location 1178](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1178))
- The following are some questions to ask yourself when determining whether streaming ingestion is an appropriate choice over batch ingestion:
  - If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
  - Do I need millisecond real-time data ingestion? Or would a micro-batch approach work, accumulating and ingesting data, say, every minute?
  - What are my use cases for streaming ingestion? What specific benefits do I realize by implementing streaming? If I get data in real time, what actions can I take on that data that would be an improvement upon batch?
  - Will my streaming-first approach cost more in terms of time, money, maintenance, downtime, and opportunity cost than simply doing batch?
  - Are my streaming pipeline and system reliable and redundant if infrastructure fails?
  - What tools are most appropriate for the use case? Should I use a managed service (Amazon Kinesis, Google Cloud Pub/Sub, Google Cloud Dataflow) or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? If I do the latter, who will manage it? What are the costs and trade-offs?
  - If I’m deploying an ML model, what benefits do I have with online predictions and possibly continuous training?
  - Am I getting data from a live production instance? If so, what’s the impact of my ingestion process on this source system? ([Location 1186](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1186))
- Adopt true real-time streaming only after identifying a business use case that justifies the trade-offs against using batch. ([Location 1200](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1200))
- In the push model of data ingestion, a source system writes data out to a target, whether a database, object store, or filesystem. In the pull model, data is retrieved from the source system. ([Location 1202](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1202))
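As noted above, batch ingestion triggers “either on a predetermined time interval or as data reaches a preset size threshold,” which reduces to a small amount of buffering logic; micro-batching is the same idea with a short interval. A minimal in-memory sketch with an abstract flush target (the thresholds are arbitrary examples):

```python
import time

class MicroBatcher:
    """Accumulate events and flush on a size or an age threshold."""

    def __init__(self, flush, max_size=500, max_age_s=60.0):
        self.flush, self.max_size, self.max_age_s = flush, max_size, max_age_s
        self.buf, self.opened = [], time.monotonic()

    def add(self, event):
        self.buf.append(event)
        too_big = len(self.buf) >= self.max_size
        too_old = time.monotonic() - self.opened >= self.max_age_s
        if too_big or too_old:  # age is only checked on arrival in this sketch
            self.flush(self.buf)
            self.buf, self.opened = [], time.monotonic()

batcher = MicroBatcher(flush=lambda batch: print(f"ingesting {len(batch)} events"))
for i in range(1200):
    batcher.add({"event_id": i})  # flushes twice at the 500-event threshold
```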
- After you’ve ingested and stored data, you need to do something with it. The next stage of the data engineering lifecycle is transformation, meaning data needs to be changed from its original form into something useful for downstream use cases. Without proper transformations, data will sit inert, and not be in a useful form for reports, analysis, or ML. ([Location 1226](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1226))
- When considering data transformations within the data engineering lifecycle, it helps to consider the following:
  - What’s the cost and return on investment (ROI) of the transformation? What is the associated business value?
  - Is the transformation as simple and self-isolated as possible?
  - What business rules do the transformations support? ([Location 1235](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1235))
- Logically, we treat transformation as a standalone area of the data engineering lifecycle, but the realities of the lifecycle can be much more complicated in practice. Transformation is often entangled in other phases of the lifecycle. ([Location 1244](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1244))
- Business logic is a major driver of data transformation, often in data modeling. ([Location 1249](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1249))
- Data has value when it’s used for practical purposes. Data that is not consumed or queried is simply inert. Data vanity projects are a major risk for companies. Many companies pursued vanity projects in the big data era, gathering massive datasets in data lakes that were never consumed in any useful way. ([Location 1268](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1268))
- Analytics is the core of most data endeavors. Once your data is stored and transformed, you’re ready to generate reports or dashboards and do ad hoc analysis on the data. ([Location 1276](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1276))
- BI marshals collected data to describe a business’s past and current state. BI requires using business logic to process raw data. Note that data serving for analytics is yet another area where the stages of the data engineering lifecycle can get tangled. As we mentioned earlier, business logic is often applied to data in the transformation stage of the data engineering lifecycle, but a logic-on-read approach has become increasingly popular. Data is stored in a clean but fairly raw form, with minimal postprocessing business logic. A BI system maintains a repository of business logic and definitions. ([Location 1284](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1284))
- Although self-service analytics is simple in theory, it’s tough to pull off in practice. The main reason is that poor data quality, organizational silos, and a lack of adequate data skills often get in the way of allowing widespread use of analytics. ([Location 1294](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1294))
- Operational analytics focuses on the fine-grained details of operations, promoting actions that a user of the reports can act upon immediately. ([Location 1296](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1296))
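To make the transformation and business-logic points above concrete: a minimal pandas sketch in which an assumed business rule (refunded orders do not count toward revenue; purely an illustration) shapes raw records into something a BI tool can serve:

```python
import pandas as pd

raw = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_date": ["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-02"],
    "amount": [120.0, 80.0, 50.0, 200.0],
    "status": ["complete", "refunded", "complete", "complete"],
})

# Business rule (assumed for illustration): refunds are excluded from revenue.
revenue = (
    raw[raw["status"] != "refunded"]
    .groupby("order_date", as_index=False)["amount"].sum()
    .rename(columns={"amount": "revenue"})
)
print(revenue)
```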
- With embedded analytics, the request rate for reports, and the corresponding burden on analytics systems, goes up dramatically; access control is significantly more complicated and critical. Businesses may be serving separate analytics and data to thousands or more customers. Each customer must see their data and only their data. ([Location 1308](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1308))
- The responsibilities of data engineers overlap significantly in analytics and ML, and the boundaries between data engineering, ML engineering, and analytics engineering can be fuzzy. For example, a data engineer may need to support Spark clusters that facilitate analytics pipelines and ML model training. They may also need to provide a system that orchestrates tasks across teams and support metadata and cataloging systems that track data history and lineage. Setting these domains of responsibility and the relevant reporting structures is a critical organizational decision. ([Location 1325](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1325))
- The feature store is a recently developed tool that combines data engineering and ML engineering. Feature stores are designed to reduce the operational burden for ML engineers by maintaining feature history and versions, supporting feature sharing among teams, and providing basic operational and orchestration capabilities, such as backfilling. ([Location 1330](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1330))
- The following are some considerations for the serving data phase specific to ML:
  - Is the data of sufficient quality to perform reliable feature engineering? Quality requirements and assessments are developed in close collaboration with teams consuming the data.
  - Is the data discoverable? Can data scientists and ML engineers easily find valuable data?
  - Where are the technical and organizational boundaries between data engineering and ML engineering? This organizational question has significant architectural implications.
  - Does the dataset properly represent ground truth? Is it unfairly biased? ([Location 1340](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1340))
- Reverse ETL takes processed data from the output side of the data engineering lifecycle and feeds it back into source systems, as shown in Figure 2-6. In reality, this flow is beneficial and often necessary; reverse ETL allows us to take analytics, scored models, etc., and feed these back into production systems or SaaS platforms. ([Location 1352](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1352))
- Data engineering now encompasses far more than tools and technology. The field is now moving up the value chain, incorporating traditional enterprise practices such as data management and cost optimization and newer practices like DataOps. ([Location 1372](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1372))
- Data engineers must understand both data and access security, exercising the principle of least privilege. ([Location 1385](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1385))
- People and organizational structure are always the biggest security vulnerabilities in any company. ([Location 1392](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1392))
- Data security is also about timing—providing data access to exactly the people and systems that need to access it and only for the duration necessary to perform their work. ([Location 1396](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1396))
- Data management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycle. ([Location 1417](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1417))
- Data management has quite a few facets, including the following:
  - Data governance, including discoverability and accountability
  - Data modeling and design
  - Data lineage
  - Storage and operations
  - Data integration and interoperability
  - Data lifecycle management
  - Data systems for advanced analytics and ML
  - Ethics and privacy ([Location 1425](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1425))
- “Data governance is, first and foremost, a data management function to ensure the quality, integrity, security, and usability of the data collected by an organization.” ([Location 1436](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1436))
- In a data-driven company, data must be available and discoverable. ([Location 1453](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1453))
- Metadata is “data about data,” and it underpins every section of the data engineering lifecycle. Metadata is exactly the data needed to make data discoverable and governable. ([Location 1459](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1459))
- DMBOK identifies four main categories of metadata that are useful to data engineers:
  - Business metadata
  - Technical metadata
  - Operational metadata
  - Reference metadata ([Location 1481](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1481))
- Business metadata relates to the way data is used in the business, including business and data definitions, data rules and logic, how and where data is used, and the data owner(s). ([Location 1484](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1484))
- A data engineer uses business metadata to answer nontechnical questions about who, what, where, and how. For example, a data engineer may be tasked with creating a data pipeline for customer sales analysis. ([Location 1487](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1487))
- Technical metadata describes the data created and used by systems across the data engineering lifecycle. ([Location 1491](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1491))
- Here are some common types of technical metadata that a data engineer will use:
  - Pipeline metadata (often produced in orchestration systems)
  - Data lineage
  - Schema ([Location 1494](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1494))
- Pipeline metadata captured in orchestration systems provides details of the workflow schedule, system and data dependencies, configurations, connection details, and much more. ([Location 1497](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1497))
- Data-lineage metadata tracks the origin and changes to data, and its dependencies, over time. ([Location 1500](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1500))
- Schema metadata describes the structure of data stored in a system such as a database, a data warehouse, a data lake, or a filesystem; it is one of the key differentiators across different storage systems. ([Location 1503](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1503))
- Operational metadata describes the operational results of various systems and includes statistics about processes, job IDs, application runtime logs, data used in a process, and error logs. A data engineer uses operational metadata to determine whether a process succeeded or failed and the data involved in the process. ([Location 1508](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1508))
- Reference metadata is data used to classify other data. This is also referred to as lookup data. Standard examples of reference data are internal codes, geographic codes, units of measurement, and internal calendar standards. ([Location 1514](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1514))
- Data accountability means assigning an individual to govern a portion of data. The responsible person then coordinates the governance activities of other stakeholders. Managing data quality is tough if no one is accountable for the data in question. ([Location 1522](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1522))
- Data quality is the optimization of data toward the desired state and orbits the question, “What do you get compared with what you expect?” Data should conform to the expectations in the business metadata. Does the data match the definition agreed upon by the business? ([Location 1535](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1535))
- According to Data Governance: The Definitive Guide, data quality is defined by three main characteristics:
  - Accuracy: Is the collected data factually correct? Are there duplicate values? Are the numeric values accurate?
  - Completeness: Are the records complete? Do all required fields contain valid values?
  - Timeliness: Are records available in a timely fashion? ([Location 1542](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1542))
- Master data is data about business entities such as employees, customers, products, and locations. ([Location 1560](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1560))
- Master data management (MDM) is the practice of building consistent entity definitions known as golden records. Golden records harmonize entity data across an organization and with its partners. ([Location 1565](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1565))
- To derive business insights from data, through business analytics and data science, the data must be in a usable form. The process for converting data into a usable form is known as data modeling and design. ([Location 1576](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1576))
- With the wide variety of data that engineers must cope with, there is a temptation to throw up our hands and give up on data modeling. This is a terrible idea with harrowing consequences, made evident when people murmur of the write once, read never (WORN) access pattern or refer to a data swamp. ([Location 1589](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1589))
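The three data-quality characteristics above (accuracy, completeness, timeliness) map naturally onto programmatic checks. A toy pandas sketch (the column names, sample data, and freshness threshold are all assumptions for illustration):

```python
import pandas as pd

def quality_report(df, now, max_age):
    """Toy checks mapping to accuracy, completeness, and timeliness."""
    return {
        # Accuracy: duplicate keys and impossible values are red flags.
        "duplicate_ids": int(df["order_id"].duplicated().sum()),
        "negative_amounts": int((df["amount"] < 0).sum()),
        # Completeness: required fields must contain valid values.
        "incomplete_rows": int(df[["order_id", "amount"]].isna().any(axis=1).sum()),
        # Timeliness: the newest record should be fresh enough.
        "stale": bool(now - df["ingested_at"].max() > max_age),
    }

df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, None, 5.0, -3.0],
    "ingested_at": pd.to_datetime(
        ["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-02"]
    ),
})
print(quality_report(df, now=pd.Timestamp("2024-05-10"), max_age=pd.Timedelta(days=7)))
```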
- Data lineage describes the recording of an audit trail of data through its lifecycle, tracking both the systems that process the data and the upstream data it depends on. ([Location 1597](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1597))
- Data integration and interoperability is the process of integrating data across tools and processes. ([Location 1608](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1608))
- While the complexity of interacting with data systems has decreased, the number of systems and the complexity of pipelines has dramatically increased. Engineers starting from scratch quickly outgrow the capabilities of bespoke scripting and stumble into the need for orchestration. ([Location 1618](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1618))
- Two changes have encouraged engineers to pay more attention to what happens at the end of the data engineering lifecycle. First, data is increasingly stored in the cloud. ([Location 1626](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1626))
- Second, privacy and data retention laws such as the GDPR and the CCPA require data engineers to actively manage data destruction to respect users’ “right to be forgotten.” ([Location 1632](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1632))
- The last several years of data breaches, misinformation, and mishandling of data make one thing clear: data impacts people. ([Location 1639](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1639))
- Data engineers need to ensure that datasets mask personally identifiable information (PII) and other sensitive information; bias can be identified and tracked in datasets as they are transformed. Regulatory requirements and compliance penalties are only growing. Ensure that your data assets are compliant with a growing number of data regulations, such as GDPR and CCPA. ([Location 1648](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1648))
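On masking PII as mentioned above: one common minimal technique is keyed (salted) hashing, which removes the raw identifier while preserving a stable key for joins and deduplication. A sketch only; a real deployment would pull the key from a secrets manager and follow an organization-wide tokenization policy:

```python
import hashlib
import hmac

SECRET_KEY = b"load-from-a-secrets-manager"  # never hardcode in practice

def mask_pii(value):
    """Replace a raw identifier with a stable, non-reversible token."""
    digest = hmac.new(SECRET_KEY, value.strip().lower().encode(), hashlib.sha256)
    return digest.hexdigest()

record = {"email": "jane@example.com", "amount": 42.0}
record["email"] = mask_pii(record["email"])  # same input -> same token
print(record)
```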
- DataOps maps the best practices of Agile methodology, DevOps, and statistical process control (SPC) to data. Whereas DevOps aims to improve the release and quality of software products, DataOps does the same thing for data products. ([Location 1653](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1653))
- a data product is built around sound business logic and metrics, whose users make decisions or build models that perform automated actions. ([Location 1658](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1658))
- Like DevOps, DataOps borrows much from lean manufacturing and supply chain management, mixing people, processes, and technology to reduce time to value. As DataKitchen (experts in DataOps) describes it: DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable:
  - Rapid innovation and experimentation delivering new insights to customers with increasing velocity
  - Extremely high data quality and very low error rates
  - Collaboration across complex arrays of people, technology, and environments
  - Clear measurement, monitoring, and transparency of results ([Location 1660](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1660))
- DataOps is a set of cultural habits; the data engineering team needs to adopt a cycle of communicating and collaborating with the business, breaking down silos, continuously learning from successes and mistakes, and rapid iteration. ([Location 1672](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1672))
- We suggest first starting with observability and monitoring to get a window into the performance of a system, then adding in automation and incident response. ([Location 1678](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1678))
- DataOps has three core technical elements: automation, monitoring and observability, and incident response ([Location 1680](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1680))
- Automation enables reliability and consistency in the DataOps process and allows data engineers to quickly deploy new product features and improvements to existing workflows. ([Location 1686](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1686))
- Like DevOps, DataOps practices monitor and maintain the reliability of technology and systems (data pipelines, orchestration, etc.), with the added dimension of checking for data quality, data/model drift, metadata integrity, and more. ([Location 1690](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1690))
- One of the tenets of the DataOps Manifesto is “Embrace change.” This does not mean change for the sake of change but rather goal-oriented change. ([Location 1706](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1706))
- Observability, monitoring, logging, alerting, and tracing are all critical to getting ahead of any problems along the data engineering lifecycle. We recommend you incorporate SPC to understand whether events being monitored are out of line and which incidents are worth responding to. Petrella’s DODD method mentioned previously in this chapter provides an excellent framework for thinking about data observability. DODD is much like test-driven development (TDD) in software engineering: ([Location 1723](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1723))
- The purpose of DODD is to give everyone involved in the data chain visibility into the data and data applications so that everyone involved in the data value chain has the ability to identify changes to the data or data applications at every step—from ingestion to transformation to analysis—to help troubleshoot or prevent data issues. DODD focuses on making data observability a first-class consideration in the data engineering lifecycle. ([Location 1731](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1731))
- A high-functioning data team using DataOps will be able to ship new data products quickly. But mistakes will inevitably happen. ([Location 1737](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1737))
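The SPC suggestion above can start very small: alert only when a monitored metric falls outside control limits derived from its own history, rather than on every fluctuation. A sketch using the conventional three-sigma rule (the metric, history window, and threshold are illustrative defaults, not prescriptions from the book):

```python
import statistics

def out_of_control(history, latest, sigmas=3.0):
    """Statistical process control: flag values beyond mean +/- N sigma."""
    mean = statistics.fmean(history)
    sigma = statistics.pstdev(history)
    return abs(latest - mean) > sigmas * sigma

daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_890]  # recent loads
print(out_of_control(daily_row_counts, latest=4_300))      # True: investigate
print(out_of_control(daily_row_counts, latest=10_100))     # False: normal noise
```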
- Incident response is about using the automation and observability capabilities mentioned previously to rapidly identify root causes of an incident and resolve it as reliably and quickly as possible. ([Location 1741](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1741))
- Data engineers would do well to make DataOps practices a high priority in all of their work. ([Location 1753](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1753))
- A data architecture reflects the current and future state of data systems that support an organization’s long-term data needs and strategy. Because an organization’s data requirements will likely change rapidly, and new tools and practices seem to arrive on a near-daily basis, data engineers must understand good data architecture. ([Location 1761](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1761))
- A data engineer should first understand the needs of the business and gather requirements for new use cases. Next, a data engineer needs to translate those requirements to design new ways to capture and serve data, balanced for cost and operational simplicity. ([Location 1768](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1768))
- Orchestration is the process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence. For instance, people often refer to orchestration tools like Apache Airflow as schedulers. ([Location 1782](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1782))
- Software engineering has always been a central skill for data engineers. ([Location 1812](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1812))
- It’s also imperative that a data engineer understand proper code-testing methodologies, such as unit, regression, integration, end-to-end, and smoke. ([Location 1825](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1825))
- Many data engineers are heavily involved in developing open source frameworks. They adopt these frameworks to solve specific problems in the data engineering lifecycle, and then continue developing the framework code to improve the tools for their use cases and contribute back to the community. ([Location 1828](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1828))
- Keep an eye on the total cost of ownership (TCO) and opportunity cost associated with implementing a tool. ([Location 1838](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1838))
- Streaming data processing is inherently more complicated than batch, and the tools and paradigms are arguably less mature. ([Location 1841](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1841))
- Engineers must also write code to apply a variety of windowing methods. Windowing allows real-time systems to calculate valuable metrics such as trailing statistics. ([Location 1846](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1846))
- Infrastructure as code (IaC) applies software engineering practices to the configuration and management of infrastructure. ([Location 1851](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1851))
- Pipelines as code is the core concept of present-day orchestration systems, which touch every stage of the data engineering lifecycle. Data engineers use code (typically Python) to declare data tasks and dependencies among them. The orchestration engine interprets these instructions to run steps using available resources. ([Location 1863](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1863))
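A minimal pipelines-as-code sketch in the style of Apache Airflow’s TaskFlow API (Airflow 2.x assumed; the task bodies and daily schedule are placeholders, and a real pipeline would read from and write to actual systems):

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def extract():
        return [{"order_id": 1, "amount": 120.0}]  # stand-in for a source pull

    @task
    def transform(rows):
        return [r for r in rows if r["amount"] > 0]

    @task
    def load(rows):
        print(f"loading {len(rows)} rows")  # stand-in for a warehouse write

    load(transform(extract()))  # dependencies declared as ordinary Python

orders_pipeline()
```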
- Good data architecture provides seamless capabilities across every step of the data lifecycle and undercurrent. ([Location 1931](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1931))
- Successful data engineering is built upon rock-solid data architecture. ([Location 1937](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1937))
- researching data architecture yields many inconsistent and often outdated definitions. ([Location 1942](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1942))
- Enterprise architecture has many subsets, including business, technical, application, and data ([Location 1947](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1947))
- Enterprise architecture is the design of systems to support change in the enterprise, achieved by flexible and reversible decisions reached through careful evaluation of trade-offs. ([Location 1991](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1991))
- Flexible and reversible decisions are essential for two reasons. First, the world is constantly changing, and predicting the future is impossible. Reversible decisions allow you to adjust course as the world changes and you gather new information. Second, there is a natural tendency toward enterprise ossification as organizations grow. Adopting a culture of reversible decisions helps overcome this tendency by reducing the risk attached to a decision. ([Location 1996](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1996))
- Jeff Bezos is credited with the idea of one-way and two-way doors. A one-way door is a decision that is almost impossible to reverse. For example, Amazon could have decided to sell AWS or shut it down. It would be nearly impossible for Amazon to rebuild a public cloud with the same market position after such an action. On the other hand, a two-way door is an easily reversible decision: you walk through and proceed if you like what you see in the room or step back through the door if you don’t. ([Location 1999](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1999))
- Change management is closely related to reversible decisions and is a central theme of enterprise architecture frameworks. Even with an emphasis on reversible decisions, enterprises often need to undertake large initiatives. These are ideally broken into smaller changes, each one a reversible decision in itself. ([Location 2008](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2008))
- Architects identify problems in the current state (poor data quality, scalability limits, money-losing lines of business), define desired future states (agile data-quality improvement, scalable cloud data solutions, improved business processes), and realize initiatives through execution of small, concrete steps. It bears repeating: Technical solutions exist not for their own sake but in support of business goals. ([Location 2014](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2014))
- definition: enterprise architecture balances flexibility and trade-offs. ([Location 2026](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2026))
- Good data architecture provides seamless capabilities across every step of the data lifecycle and undercurrent. ([Location 1931](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1931))
- Successful data engineering is built upon rock-solid data architecture. ([Location 1937](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1937))
- researching data architecture yields many inconsistent and often outdated definitions. ([Location 1942](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1942))
- Enterprise architecture has many subsets, including business, technical, application, and data ([Location 1947](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1947))
- Enterprise architecture is the design of systems to support change in the enterprise, achieved by flexible and reversible decisions reached through careful evaluation of trade-offs. ([Location 1991](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1991))
- Flexible and reversible decisions are essential for two reasons. First, the world is constantly changing, and predicting the future is impossible. Reversible decisions allow you to adjust course as the world changes and you gather new information. Second, there is a natural tendency toward enterprise ossification as organizations grow. Adopting a culture of reversible decisions helps overcome this tendency by reducing the risk attached to a decision. ([Location 1996](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1996))
- Jeff Bezos is credited with the idea of one-way and two-way doors. A one-way door is a decision that is almost impossible to reverse. For example, Amazon could have decided to sell AWS or shut it down. It would be nearly impossible for Amazon to rebuild a public cloud with the same market position after such an action. On the other hand, a two-way door is an easily reversible decision: you walk through and proceed if you like what you see in the room or step back through the door if you don’t. ([Location 1999](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=1999))
- Change management is closely related to reversible decisions and is a central theme of enterprise architecture frameworks. Even with an emphasis on reversible decisions, enterprises often need to undertake large initiatives. These are ideally broken into smaller changes, each one a reversible decision in itself. ([Location 2008](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2008))
- Architects identify problems in the current state (poor data quality, scalability limits, money-losing lines of business), define desired future states (agile data-quality improvement, scalable cloud data solutions, improved business processes), and realize initiatives through execution of small, concrete steps. It bears repeating: Technical solutions exist not for their own sake but in support of business goals. ([Location 2014](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2014))
- definition: enterprise architecture balances flexibility and trade-offs. ([Location 2026](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2026))
- Data architecture is the design of systems to support the evolving data needs of an enterprise, achieved by flexible and reversible decisions reached through a careful evaluation of trade-offs. ([Location 2050](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2050))
- Just as the data engineering lifecycle is a subset of the data lifecycle, data engineering architecture is a subset of general data architecture. Data engineering architecture is the systems and frameworks that make up the key sections of the data engineering lifecycle. We’ll use data architecture interchangeably with data engineering architecture throughout this book. ([Location 2052](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2052))
- Operational architecture encompasses the functional requirements of what needs to happen related to people, processes, and technology. ([Location 2058](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2058))
- Technical architecture outlines how data is ingested, stored, transformed, and served along the data engineering lifecycle. ([Location 2061](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2061))
- Good data architecture serves business requirements with a common, widely reusable set of building blocks while maintaining flexibility and making appropriate trade-offs. Bad architecture is authoritarian and tries to cram a bunch of one-size-fits-all decisions into a big ball of mud. ([Location 2076](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2076))
- Agility is the foundation for good data architecture; it acknowledges that the world is fluid. Good data architecture is flexible and easily maintainable. It evolves in response to changes within the business and new technologies and practices that may unlock even more value in the future. ([Location 2078](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2078))
- Bad data architecture is tightly coupled, rigid, overly centralized, or uses the wrong tools for the job, hampering development and change management. ([Location 2083](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2083))
- The AWS Well-Architected Framework consists of six pillars: Operational excellence, Security, Reliability, Performance efficiency, Cost optimization, Sustainability ([Location 2094](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2094))
- Google Cloud’s Five Principles for Cloud-Native Architecture are as follows: Design for automation. Be smart with state. Favor managed services. Practice defense in depth. Always be architecting. ([Location 2100](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2100))
- We’d like to expand or elaborate on these pillars with these principles of data engineering architecture: Choose common components wisely. Plan for failure. Architect for scalability. Architecture is leadership. Always be architecting. Build loosely coupled systems. Make reversible decisions. Prioritize security. Embrace FinOps. ([Location 2108](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2108))
- When architects choose well and lead effectively, common components become a fabric facilitating team collaboration and breaking down silos. Common components enable agility within and across teams in conjunction with shared knowledge and skills.
([Location 2118](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2118)) - Common components can be anything that has broad applicability within an organization. Common components include object storage, version-control systems, observability, monitoring and orchestration systems, and processing engines. ([Location 2120](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2120)) - Cloud platforms are an ideal place to adopt common components. ([Location 2129](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2129)) - Here are a few key terms for evaluating failure scenarios; we describe these in greater detail in this chapter and throughout the book: Availability The percentage of time an IT service or component is in an operable state. Reliability The system’s probability of meeting defined standards in performing its intended function during a specified interval. Recovery time objective The maximum acceptable time for a service or system outage. The recovery time objective (RTO) is generally set by determining the business impact of an outage. An RTO of one day might be fine for an internal reporting system. A website outage of just five minutes could have a significant adverse business impact on an online retailer. Recovery point objective The acceptable state after recovery. In data systems, data is often lost during an outage. In this setting, the recovery point objective (RPO) refers to the maximum acceptable data loss. ([Location 2145](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2145)) - Scalability in data systems encompasses two main capabilities. First, scalable systems can scale up to handle significant quantities of data. ([Location 2161](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2161)) - Second, scalable systems can scale down. Once the load spike ebbs, we should automatically remove capacity to cut costs. (This is related to principle 9.) An elastic system can scale dynamically in response to load, ideally in an automated fashion. ([Location 2166](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2166)) - Data architects are responsible for technology decisions and architecture descriptions and disseminating these choices through effective leadership and training. Data architects should be highly technically competent but delegate most individual contributor work to others. ([Location 2176](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2176)) - In many ways, the most important activity of Architectus Oryzus is to mentor the development team, to raise their level so they can take on more complex issues. ([Location 2187](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2187)) - An ideal data architect manifests similar characteristics. They possess the technical skills of a data engineer but no longer practice data engineering day to day; they mentor current data engineers, make careful technology choices in consultation with their organization, and disseminate expertise through training and leadership. They train engineers in best practices and bring the company’s engineering resources together to pursue common goals in both technology and business. 
([Location 2190](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2190)) - an architect’s job is to develop deep knowledge of the baseline architecture (current state), develop a target architecture, and map out a sequencing plan to determine priorities and the order of architecture changes. ([Location 2202](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2202)) - When the architecture of the system is designed to enable teams to test, deploy, and change systems without dependencies on other teams, teams require little communication to get work done. In other words, both the architecture and the teams are loosely coupled. ([Location 2215](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2215)) - For software architecture, a loosely coupled system has the following properties: Systems are broken into many small components. These systems interface with other services through abstraction layers, such as a messaging bus or an API. These abstraction layers hide and protect internal details of the service, such as a database backend or internal classes and method calls. As a consequence of property 2, internal changes to a system component don’t require changes in other parts. Details of code updates are hidden behind stable APIs. Each piece can evolve and improve separately. As a consequence of property 3, there is no waterfall, global release cycle for the whole system. Instead, each component is updated separately as changes and improvements are made. ([Location 2233](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2233)) - Given the pace of change—and the decoupling/modularization of technologies across your data architecture—always strive to pick the best-of-breed solutions that work for today. Also, be prepared to upgrade or adopt better practices as the landscape evolves. ([Location 2268](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2268)) - Every data engineer must assume responsibility for the security of the systems they build and maintain. We focus now on two main ideas: zero-trust security and the shared responsibility security model. These align closely to a cloud-native architecture. ([Location 2273](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2273)) - FinOps is an evolving cloud financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology, and business teams to collaborate on data-driven spending decisions. ([Location 2329](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2329)) - In the past, data engineers thought in terms of performance engineering—maximizing the performance for data processes on a fixed set of resources and buying adequate resources for future needs. With FinOps, engineers need to learn to think about the cost structures of cloud systems. For example, what is the appropriate mix of AWS spot instances when running a distributed cluster? What is the most appropriate approach for running a sizable daily job in terms of cost-effectiveness and performance? When should the company switch from a pay-per-query model to reserved capacity? ([Location 2349](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2349)) - we must not lose sight of the main goal of all of these architectures: to take data and transform it into something useful for downstream consumption. 
([Location 2373](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2373)) - A domain is the real-world subject area for which you’re architecting. A service is a set of functionality whose goal is to accomplish a task. ([Location 2381](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2381)) - A domain can contain multiple services. For example, you might have a sales domain with three services: orders, invoicing, and products. Each service has particular tasks that support the sales domain. Other domains may also share services (Figure 3-3). ([Location 2386](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2386)) - When thinking about what constitutes a domain, focus on what the domain represents in the real world and work backward. ([Location 2394](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2394)) - As data engineers, we’re interested in four closely related characteristics of data systems (availability and reliability were mentioned previously, but we reiterate them here for completeness): Scalability Allows us to increase the capacity of a system to improve performance and handle the demand. For example, we might want to scale a system to handle a high rate of queries or process a huge data set. Elasticity The ability of a scalable system to scale dynamically; a highly elastic system can automatically scale up and down based on the current workload. Scaling up is critical as demand increases, while scaling down saves money in a cloud environment. Modern systems sometimes scale to zero, meaning they can automatically shut down when idle. Availability The percentage of time an IT service or component is in an operable state. Reliability The system’s probability of meeting defined standards in performing its intended function during a specified interval. ([Location 2405](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2405)) - Distributed systems are widespread in the various data technologies you’ll use across your architecture. Almost every cloud data warehouse object storage system you use has some notion of distribution under the hood. ([Location 2432](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2432)) - On one end of the spectrum, you can choose to have extremely centralized dependencies and workflows. Every part of a domain and service is vitally dependent upon every other domain and service. This pattern is known as tightly coupled. ([Location 2445](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2445)) - On the other end of the spectrum, you have decentralized domains and services that do not have strict dependence on each other, in a pattern known as loose coupling. In a loosely coupled scenario, it’s easy for decentralized teams to build systems whose data may not be usable by their peers. ([Location 2447](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2447)) - As you develop your architecture, it helps to be aware of architecture tiers. Your architecture has layers—data, application, business logic, presentation, and so forth—and you need to know how to decouple these layers. Because tight coupling of modalities presents obvious vulnerabilities, keep in mind how you structure the layers of your architecture to achieve maximum reliability and flexibility. 
([Location 2455](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2455))
- In a single-tier architecture, your database and application are tightly coupled, residing on a single server ([Location 2460](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2460))
- A multitier (also known as n-tier) architecture is composed of separate layers: data, application, business logic, presentation, etc. These layers are bottom-up and hierarchical, meaning the lower layer isn’t necessarily dependent on the upper layers; the upper layers depend on the lower layers. The notion is to separate data from the application, and application from the presentation. ([Location 2476](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2476))
- A common multitier architecture is a three-tier architecture, a widely used client-server design. A three-tier architecture consists of data, application logic, and presentation tiers ([Location 2479](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2479))
- Coupling within monoliths can be viewed in two ways: technical coupling and domain coupling. Technical coupling refers to architectural tiers, while domain coupling refers to the way domains are coupled together. A monolith has varying degrees of coupling among technologies and domains. You could have an application with various layers decoupled in a multitier architecture but still share multiple domains. ([Location 2502](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2502))
- Microservices architecture comprises separate, decentralized, and loosely coupled services. Each service has a specific function and is decoupled from other services operating within its domain. If one service temporarily goes down, it won’t affect the ability of other services to continue functioning. ([Location 2517](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2517))
- Rather than dogmatically preach microservices over monoliths (among other arguments), we suggest you pragmatically use loose coupling as an ideal, while recognizing the state and limitations of the data technologies you’re using within your data architecture. Incorporate reversible technology choices that allow for modularity and loose coupling whenever possible. ([Location 2539](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2539))
- One approach to this problem is centralization: a single team is responsible for gathering data from all domains and reconciling it for consumption across the organization. (This is a common approach in traditional data warehousing.) Another approach is the data mesh. With the data mesh, each software team is responsible for preparing its data for consumption across the rest of the organization. ([Location 2546](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2546))
- We have two factors to consider in multitenancy: performance and security. ([Location 2562](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2562))
- With multiple large tenants within a cloud system, will the system support consistent performance for all tenants, or will there be a noisy neighbor problem? (That is, will high usage from one tenant degrade performance for other tenants?) Regarding security, data from different tenants must be properly isolated. ([Location 2564](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2564))
- Your business is rarely static.
Things often happen in your business, such as getting a new customer, a new order from a customer, or an order for a product or service. These are all examples of events that are broadly defined as something that happened, typically a change in the state of something. ([Location 2571](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2571))
- An event-driven workflow (Figure 3-8) encompasses the ability to create, update, and asynchronously move events across various parts of the data engineering lifecycle. This workflow boils down to three main areas: event production, routing, and consumption. ([Location 2576](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2576))
- An event-driven architecture (Figure 3-9) embraces the event-driven workflow and uses this to communicate across various services. The advantage of an event-driven architecture is that it distributes the state of an event across multiple services. ([Location 2582](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2582))
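To ground the production/routing/consumption split, here is a minimal sketch using a standard-library queue as the router; the event shapes and names are invented, and a production system would use a broker such as Kafka or a cloud pub/sub service.

```python
# Event-driven workflow in miniature: produce -> route -> consume.
import queue

router = queue.Queue()  # stands in for the event router / message bus

def produce(event):
    router.put(event)  # event production: record that something happened

def consume():
    while not router.empty():
        event = router.get()  # event consumption: react to the state change
        print(f"handling {event['type']} for {event['entity']}")

produce({"type": "order_created", "entity": "order 1001"})
produce({"type": "customer_created", "entity": "customer 42"})
consume()
```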
- Before you design your data architecture project, you need to know whether you’re starting with a clean slate or redesigning an existing architecture. ([Location 2591](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2591))
- Brownfield projects often involve refactoring and reorganizing an existing architecture and are constrained by the choices of the present and past. Because a key part of architecture is change management, you must figure out a way around these limitations and design a path forward to achieve your new business and technical objectives. Brownfield projects require a thorough understanding of the legacy architecture and the interplay of various old and new technologies. ([Location 2596](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2596))
- On the opposite end of the spectrum, a greenfield project allows you to pioneer a fresh start, unconstrained by the history or legacy of a prior architecture. Greenfield projects tend to be easier than brownfield projects, and many data architects and engineers find them more fun! ([Location 2616](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2616))
- Because data architecture is an abstract discipline, it helps to reason by example. In this section, we outline prominent examples and types of data architecture that are popular today. ([Location 2631](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2631))
- A data warehouse is a central data hub used for reporting and analysis. Data in a data warehouse is typically highly formatted and structured for analytics use cases. It’s among the oldest and most well-established data architectures. ([Location 2636](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2636))
- It’s worth noting two types of data warehouse architecture: organizational and technical. The organizational data warehouse architecture organizes data associated with certain business team structures and processes. The technical data warehouse architecture reflects the technical nature of the data warehouse, such as MPP. ([Location 2647](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2647))
- The organizational data warehouse architecture has two main characteristics: Separates online analytical processing (OLAP) from production databases (online transaction processing) This separation is critical as businesses grow. Moving data into a separate physical system directs load away from production systems and improves analytics performance. Centralizes and organizes data Traditionally, a data warehouse pulls data from application systems by using ETL. The extract phase pulls data from source systems. ([Location 2653](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2653))
- One variation on ETL is ELT. With the ELT data warehouse architecture, data gets moved more or less directly from production systems into a staging area in the data warehouse. Staging in this setting indicates that the data is in a raw form. Rather than using an external system, transformations are handled directly in the data warehouse. The intention is to take advantage of the massive computational power of cloud data warehouses and data processing tools. ([Location 2673](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2673))
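A minimal sketch of the ELT pattern, with SQLite standing in for a cloud data warehouse; the table names and columns are invented. The point is the ordering: land the raw data first, then let the warehouse engine do the transformation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Load": copy source rows into a raw staging table, untransformed
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                 [(1, 1250), (2, 3400), (3, 990)])
# "Transform": run the modeling step inside the warehouse itself
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id, amount_cents / 100.0 AS amount_usd
    FROM stg_orders
""")
print(conn.execute("SELECT * FROM orders").fetchall())
# [(1, 12.5), (2, 34.0), (3, 9.9)]
```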
- Cloud data warehouses represent a significant evolution of the on-premises data warehouse architecture and have thus led to significant changes to the organizational architecture. Amazon Redshift kicked off the cloud data warehouse revolution. ([Location 2687](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2687))
- A data mart is a more refined subset of a warehouse designed to serve analytics and reporting, focused on a single suborganization, department, or line of business; every department has its own data mart, specific to its needs. ([Location 2701](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2701))
- Data marts exist for two reasons. First, a data mart makes data more easily accessible to analysts and report developers. Second, data marts provide an additional stage of transformation beyond that provided by the initial ETL or ELT pipelines. ([Location 2705](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2705))
- Among the most popular architectures that appeared during the big data era is the data lake. Instead of imposing tight structural limitations on data, why not simply dump all of your data—structured and unstructured—into a central location? The data lake promised to be a democratizing force, liberating the business to drink from a fountain of limitless data. ([Location 2714](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2714))
- The data lake became a dumping ground; terms such as data swamp, dark data, and WORN were coined as once-promising data projects failed. Data grew to unmanageable sizes, with little in the way of schema management, data cataloging, and discovery tools. ([Location 2724](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2724))
- Many organizations found significant value in data lakes—especially huge, heavily data-focused Silicon Valley tech companies like Netflix and Facebook. ([Location 2742](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2742))
- In response to the limitations of first-generation data lakes, various players have sought to enhance the concept to fully realize its promise. For example, Databricks introduced the notion of a data lakehouse. The lakehouse incorporates the controls, data management, and data structures found in a data warehouse while still housing data in object storage and supporting a variety of query and transformation engines. In particular, the data lakehouse supports atomicity, consistency, isolation, and durability (ACID) transactions, a big departure from the original data lake, where you simply pour in data and never update or delete it. The term data lakehouse suggests a convergence between data lakes and data warehouses. ([Location 2746](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2746))
- The modern data stack (Figure 3-13) is currently a trendy analytics architecture that highlights the type of abstraction we expect to see more widely used over the next several years. Whereas past data stacks relied on expensive, monolithic toolsets, the main objective of the modern data stack is to use cloud-based, plug-and-play, easy-to-use, off-the-shelf components to create a modular and cost-effective data architecture. These components include data pipelines, storage, transformation, data management/governance, monitoring, visualization, and exploration. ([Location 2767](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2767))
- In a Lambda architecture (Figure 3-14), you have systems operating independently of each other—batch, streaming, and serving. The source system is ideally immutable and append-only, sending data to two destinations for processing: stream and batch. In-stream processing intends to serve the data with the lowest possible latency in a “speed” layer, usually a NoSQL database. In the batch layer, data is processed and transformed in a system such as a data warehouse, creating precomputed and aggregated views of the data. The serving layer provides a combined view by aggregating query results from the two layers. ([Location 2795](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2795))
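The serving layer’s job of combining the two layers can be sketched in a few lines; the view names and numbers below are invented. Each query merges a precomputed batch aggregate with the fresh delta accumulated by the speed layer since the last batch run.

```python
# Lambda-style serving layer: batch view + speed view at query time.
batch_view = {"page_a": 10_000, "page_b": 7_500}  # recomputed nightly
speed_view = {"page_a": 42, "page_c": 5}          # streamed since last batch

def serve(page):
    return batch_view.get(page, 0) + speed_view.get(page, 0)

for page in ("page_a", "page_b", "page_c"):
    print(page, serve(page))  # page_a 10042, page_b 7500, page_c 5
```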
- As a response to the shortcomings of Lambda architecture, Jay Kreps proposed an alternative called Kappa architecture (Figure 3-15). The central thesis is this: why not just use a stream-processing platform as the backbone for all data handling—ingestion, storage, and serving? This facilitates a true event-based architecture. Real-time and batch processing can be applied seamlessly to the same data by reading the live event stream directly and replaying large chunks of data for batch processing. ([Location 2807](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2807))
- Though the original Kappa architecture article came out in 2014, we haven’t seen it widely adopted. There may be a couple of reasons for this. First, streaming itself is still a bit of a mystery for many companies; it’s easy to talk about, but harder than expected to execute. Second, Kappa architecture turns out to be complicated and expensive in practice. ([Location 2816](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2816))
- The core idea in the Dataflow model is to view all data as events, as the aggregation is performed over various types of windows. Ongoing real-time event streams are unbounded data. Data batches are simply bounded event streams, and the boundaries provide a natural window. Engineers can choose from various windows for real-time aggregation, such as sliding or tumbling. Real-time and batch processing happens in the same system using nearly identical code. ([Location 2831](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2831))
- The Internet of Things (IoT) is the distributed collection of devices, aka things—computers, sensors, mobile devices, smart home devices, and anything else with an internet connection. Rather than generating data from direct human input (think data entry from a keyboard), IoT data is generated from devices that collect data periodically or continuously from the surrounding environment and transmit it to a destination. ([Location 2842](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2842))
- Devices (also known as things) are the physical hardware connected to the internet, sensing the environment around them and collecting and transmitting data to a downstream destination. ([Location 2854](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2854))
- An IoT gateway is a hub for connecting devices and securely routing devices to the appropriate destinations on the internet. While you can connect a device directly to the internet without an IoT gateway, the gateway allows devices to connect using extremely little power. ([Location 2867](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2867))
- Ingestion begins with an IoT gateway, as discussed previously. From there, events and measurements can flow into an event ingestion architecture. ([Location 2876](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2876))
- Storage requirements will depend a great deal on the latency requirement for the IoT devices in the system. ([Location 2883](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2883))
- Serving patterns are incredibly diverse. In a batch scientific application, data might be analyzed using a cloud data warehouse and then served in a report. Data will be presented and served in numerous ways in a home-monitoring application. ([Location 2889](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2889))
- The data mesh is a recent response to sprawling monolithic data platforms, such as centralized data lakes and data warehouses, and “the great divide of data,” wherein the landscape is divided between operational data and analytical data. The data mesh attempts to invert the challenges of centralized data architecture, taking the concepts of domain-driven design (commonly used in software architectures) and applying them to data architecture. Because the data mesh has captured much recent attention, you should be aware of it. ([Location 2906](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2906))
- Dehghani later identified four key components of the data mesh: Domain-oriented decentralized data ownership and architecture; Data as a product; Self-serve data infrastructure as a platform; Federated computational governance. ([Location 2917](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2917))
- Data architectures have countless other variations, such as data fabric, data hub, scaled architecture, metadata-first architecture, event-driven architecture, live data stack (Chapter 11), and many more. ([Location 2928](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2928))
- Bigger companies may still employ data architects, but those architects will need to be heavily in tune and current with the state of technology and data. Gone are the days of ivory tower data architecture. In the past, architecture was largely orthogonal to engineering. We expect this distinction will disappear as data engineering, and engineering in general, quickly evolves, becoming more agile, with less separation between engineering and architecture.
([Location 2943](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2943)) - When designing architecture, you’ll work alongside business stakeholders to evaluate trade-offs. What are the trade-offs inherent in adopting a cloud data warehouse versus a data lake? What are the trade-offs of various cloud platforms? When might a unified batch/streaming framework (Beam, Flink) be an appropriate choice? Studying these choices in the abstract will prepare you to make concrete, valuable decisions. ([Location 2949](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=2949)) - Data engineering nowadays suffers from an embarrassment of riches. We have no shortage of technologies to solve various types of data problems. ([Location 3112](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3112)) - However, it’s easy to get caught up in chasing bleeding-edge technology while losing sight of the core purpose of data engineering: designing robust and reliable systems to carry data through the full lifecycle and serve it according to the needs of end users. ([Location 3115](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3115)) - We feel the criteria to choose a good data technology is simple: does it add value to a data product and the broader business? ([Location 3121](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3121)) - Architecture is strategic; tools are tactical. ([Location 3126](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3126)) - Architecture is the high-level design, roadmap, and blueprint of data systems that satisfy the strategic aims for the business. ([Location 3127](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3127)) - Architecture is the what, why, and when. Tools are used to make the architecture a reality; tools are the how. ([Location 3128](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3128)) - Architecture first, technology second. ([Location 3136](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3136)) - The following are some considerations for choosing data technologies across the data engineering lifecycle: Team size and capabilities Speed to market Interoperability Cost optimization and business value Today versus the future: immutable versus transitory technologies Location (cloud, on prem, hybrid cloud, multicloud) Build versus buy Monolith versus modular Serverless versus servers Optimization, performance, and the benchmark wars The undercurrents of the data engineering lifecycle ([Location 3138](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3138)) - The first thing you need to assess is your team’s size and its capabilities with technology. ([Location 3146](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3146)) - We sometimes see small data teams read blog posts about a new cutting-edge technology at a giant tech company and then try to emulate these same extremely complex technologies and practices. We call this cargo-cult engineering, and it’s generally a big mistake that consumes a lot of valuable time and money, often with little to nothing to show in return. ([Location 3152](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3152)) - we suggest sticking with technologies and workflows with which the team is familiar. 
([Location 3160](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3160))
- Learning new technologies, languages, and tools is a considerable time investment, so make these investments wisely. ([Location 3161](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3161))
- In technology, speed to market wins. ([Location 3163](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3163))
- Perfect is the enemy of good. Some data teams will deliberate on technology choices for months or years without reaching any decisions. ([Location 3167](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3167))
- Deliver value early and often. As we’ve mentioned, use what works. Your team members will likely get better leverage with tools they already know. ([Location 3170](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3170))
- When choosing a technology or system, you’ll need to ensure that it interacts and operates with other technologies. ([Location 3176](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3176))
- Interoperability describes how various technologies or systems connect, exchange information, and interact. ([Location 3176](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3176))
- Always be aware of how simple it will be to connect your various technologies across the data engineering lifecycle. ([Location 3187](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3187))
- We look at costs through three main lenses: total cost of ownership, opportunity cost, and FinOps. ([Location 3197](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3197))
- Total cost of ownership (TCO) is the total estimated cost of an initiative, including the direct and indirect costs of products and services utilized. ([Location 3199](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3199))
- Direct costs can be directly attributed to an initiative. ([Location 3204](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3204))
- Indirect costs, also known as overhead, are independent of the initiative and must be paid regardless of where they’re attributed. ([Location 3207](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3207))
- Apart from direct and indirect costs, how something is purchased impacts the way costs are accounted for. Expenses fall into two big groups: capital expenses and operational expenses. ([Location 3211](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3211))
- Capital expenses, also known as capex, require an up-front investment. Payment is required today. ([Location 3212](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3212))
- Operational expenses, also known as opex, are the opposite of capex in certain respects. Opex is gradual and spread out over time. Whereas capex is long-term focused, opex is short-term. Opex can be pay-as-you-go or similar and allows a lot of flexibility. ([Location 3219](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3219))
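The capex/opex distinction is easy to see with a toy comparison over a planning horizon; every number here is made up for illustration. Capex buys a fixed asset up front, while opex accumulates month by month, so the crossover point is what matters.

```python
def tco_capex(hardware_cost, annual_upkeep, years):
    return hardware_cost + annual_upkeep * years  # pay up front, then maintain

def tco_opex(monthly_rate, years):
    return monthly_rate * 12 * years  # pay as you go, no up-front outlay

for years in (1, 3, 5):
    print(f"{years} yr: capex {tco_capex(120_000, 10_000, years):,} "
          f"vs opex {tco_opex(4_000, years):,}")
# 1 yr: capex 130,000 vs opex 48,000
# 3 yr: capex 150,000 vs opex 144,000
# 5 yr: capex 170,000 vs opex 240,000
```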
- Total opportunity cost of ownership (TOCO) is the cost of lost opportunities that we incur in choosing a technology, an architecture, or a process. ([Location 3234](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3234))
- Data engineers often fail to evaluate TOCO when undertaking a new project; in our opinion, this is a massive blind spot. ([Location 3238](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3238))
- The first step to minimizing opportunity cost is evaluating it with eyes wide open. We’ve seen countless data teams get stuck with technologies that seemed good at the time and are either not flexible for future growth or simply obsolete. Inflexible data technologies are a lot like bear traps. They’re easy to get into and extremely painful to escape. ([Location 3245](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3245))
- The goal of FinOps is to fully operationalize financial accountability and business value by applying the DevOps-like practices of monitoring and dynamically adjusting systems. ([Location 3252](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3252))
- The intention to build a better future is noble but often leads to overarchitecting and overengineering. Tooling chosen for the future may be stale and out-of-date when this future arrives; the future frequently looks little like what we envisioned years before. ([Location 3267](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3267))
- You should choose the best technology for the moment and near future, but in a way that supports future unknowns and evolution. ([Location 3270](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3270))
- We have two classes of tools to consider: immutable and transitory. ([Location 3273](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3273))
- Immutable technologies might be components that underpin the cloud or languages and paradigms that have stood the test of time. In the cloud, examples of immutable technologies are object storage, networking, servers, and security. ([Location 3274](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3274))
- Transitory technologies are those that come and go. The typical trajectory begins with a lot of hype, followed by meteoric growth in popularity, then a slow descent into obscurity. ([Location 3282](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3282))
- Even relatively successful technologies often fade into obscurity quickly, after a few years of rapid adoption, a victim of their success. ([Location 3294](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3294))
- Given the rapid pace of tooling and best-practice changes, we suggest evaluating tools every two years ([Location 3301](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3301))
- Given the reasonable probability of failure for many data technologies, you need to consider how easy it is to transition from a chosen technology. ([Location 3305](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3305))
- Companies now have numerous options when deciding where to run their technology stacks. A slow shift toward the cloud culminates in a veritable stampede of companies spinning up workloads on AWS, Azure, and Google Cloud Platform (GCP). ([Location 3310](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3310))
- While new startups are increasingly born in the cloud, on-premises systems are still the default for established companies. ([Location 3317](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3317))
- Essentially, these companies own their hardware, which may live in data centers they own or in leased colocation space.
([Location 3319](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3319)) - The cloud flips the on-premises model on its head. Instead of purchasing hardware, you simply rent hardware and managed services from a cloud provider (such as AWS, Azure, or Google Cloud). ([Location 3335](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3335)) - PaaS includes IaaS products but adds more sophisticated managed services to support applications. ([Location 3348](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3348)) - Many people quibble with the term serverless; after all, the code must run somewhere. In practice, serverless usually means many invisible servers. ([Location 3358](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3358)) - Enterprises that migrate to the cloud often make major deployment errors by not appropriately adapting their practices to the cloud pricing model. ([Location 3366](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3366)) - Any of these limits (IOPs, storage capacity, bandwidth) is a potential bottleneck for a cloud provider. ([Location 3387](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3387)) - We often think of this optimization as leading to lower costs, but we should also strive to increase business value by exploiting the dynamic nature of the cloud. ([Location 3412](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3412)) - Data gravity is real: once data lands in a cloud, the cost to extract it and migrate processes can be very high. ([Location 3421](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3421)) - The hybrid cloud model assumes that an organization will indefinitely maintain some workloads outside the cloud. ([Location 3428](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3428)) - This pattern of putting analytics in the cloud is beautiful because data flows primarily in one direction, minimizing data egress costs ([Location 3433](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3433)) - That is, on-premises applications generate event data that can be pushed to the cloud essentially for free. ([Location 3434](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3434)) - Multicloud simply refers to deploying workloads to multiple public clouds. ([Location 3442](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3442)) - Another common motivation for employing a multicloud approach is to take advantage of the best services across several clouds. ([Location 3449](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3449)) - A multicloud methodology has several disadvantages. As we just mentioned, data egress costs and networking bottlenecks are critical. ([Location 3454](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3454)) - multicloud networking can be diabolically complicated. ([Location 3456](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3456)) - A new generation of “cloud of clouds” services aims to facilitate multicloud with reduced complexity by offering services across clouds and seamlessly replicating data between clouds or managing workloads on several clouds through a single pane of glass. 
([Location 3457](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3457)) - The “cloud of clouds” space is evolving quickly; within a few years of this book’s publication, many more of these services will be available. Data engineers and architects would do well to maintain awareness of this quickly changing cloud landscape. ([Location 3462](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3462)) - Though not widely used now, it’s worth briefly mentioning a new trend that might become popular over the next decade: decentralized computing. Whereas today’s applications mainly run on premises and in the cloud, the rise of blockchain, Web 3.0, and edge computing may invert this paradigm. ([Location 3465](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3465)) - We believe that it is critical to avoid this endless trap of analysis. Instead, plan for the present. Choose the best technologies for your current needs and concrete plans for the near future. Choose your deployment platform based on real business needs while focusing on simplicity and flexibility. ([Location 3487](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3487)) - On the other hand, have an escape plan. As we’ve emphasized before, every technology—even open source software—comes with some degree of lock-in. ([Location 3493](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3493)) - Consider continuing to run workloads on premises or repatriating cloud workloads if you run a truly cloud-scale service. ([Location 3545](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3545)) - The argument supporting buying comes down to resource constraints and expertise; do you have the expertise to build a better solution than something already available? Either decision comes down to TCO, TOCO, and whether the solution provides a competitive advantage to your organization. ([Location 3559](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3559)) - it’s that we suggest investing in building and customizing when doing so will provide a competitive advantage for your business. ([Location 3562](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3562)) - Whenever possible, lean toward type A behavior; avoid undifferentiated heavy lifting and embrace abstraction. Use open source frameworks, or if this is too much trouble, look at buying a suitable managed or proprietary solution. ([Location 3570](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3570)) - Whereas in the past, IT used to make most of the software purchase and adoption decisions in a top-down manner, these days, the trend is for bottom-up software adoption in a company, driven by developers, data engineers, data scientists, and other technical roles. Technology adoption within companies is becoming an organic, continuous process. ([Location 3574](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3574)) - Open source software (OSS) is a software distribution model in which software, and the underlying codebase, is made available for general use, typically under specific licensing terms. ([Location 3578](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3578)) - OSS has two main flavors: community managed and commercial OSS. ([Location 3586](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3586)) - OSS projects succeed with a strong community and vibrant user base. 
Community-managed OSS is a prevalent path for OSS projects. The community opens up high rates of innovations and contributions from developers worldwide with popular OSS projects. ([Location 3588](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3588)) - Mindshare Avoid adopting OSS projects that don’t have traction and popularity. ([Location 3591](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3591)) - Maturity How long has the project been around, how active is it today, and how usable are people finding it in production? ([Location 3595](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3595)) - Troubleshooting How will you have to handle problems if they arise? ([Location 3597](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3597)) - Project management Look at Git issues and the way they’re addressed. ([Location 3598](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3598)) - Team Is a company sponsoring the OSS project? ([Location 3600](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3600)) - Developer relations and community management What is the project doing to encourage uptake and adoption? ([Location 3601](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3601)) - Contributing Does the project encourage and accept pull requests? ([Location 3602](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3602)) - Roadmap Is there a project roadmap? ([Location 3604](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3604)) - Self-hosting and maintenance Do you have the resources to host and maintain the OSS solution? ([Location 3604](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3604)) - Giving back to the community If you like the project and are actively using it, consider investing in it. ([Location 3606](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3606)) - Commercial vendors try to solve this management headache by hosting and managing the OSS solution for you, typically as a cloud SaaS offering. ([Location 3614](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3614)) - This model is called commercial OSS (COSS). Typically, a vendor will offer the “core” of the OSS for free while charging for enhancements, curated code distributions, or fully managed services. ([Location 3616](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3616)) - This is a widespread trend: an OSS project becomes popular, an affiliated company raises truckloads of venture capital (VC) money to commercialize the OSS project, and the company scales as a fast-moving rocket ship. ([Location 3620](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3620)) - Value Is the vendor offering a better value than if you managed the OSS technology yourself? ([Location 3625](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3625)) - Delivery model How do you access the service? ([Location 3627](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3627)) - Support Support cannot be understated, and it’s often opaque to the buyer. ([Location 3628](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3628)) - Releases and bug fixes Is the vendor transparent about the release schedule, improvements, and bug fixes? 
([Location 3632](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3632)) - Sales cycle and pricing Often a vendor will offer on-demand pricing, especially for a SaaS product, and offer you a discount if you commit to an extended agreement. ([Location 3633](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3633)) - Company finances Is the company viable? If the company has raised VC funds, you can check their funding on sites like Crunchbase. ([Location 3635](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3635)) - Logos versus revenue Is the company focused on growing the number of customers (logos), or is it trying to grow revenue? ([Location 3637](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3637)) - Community support Is the company truly supporting the community version of the OSS project? ([Location 3640](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3640)) - Often a company selling a data tool will not release it as OSS, instead offering a proprietary solution. Although you won’t have the transparency of a pure OSS solution, a proprietary independent solution can work quite well, especially as a fully managed service in the cloud. ([Location 3660](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3660)) - Interoperability Make sure that the tool interoperates with other tools you’ve chosen (OSS, other independents, cloud offerings, etc.). ([Location 3662](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3662)) - Mindshare and market share Is the solution popular? ([Location 3664](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3664)) - Documentation and support Problems and questions will inevitably arise. ([Location 3665](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3665)) - Pricing Is the pricing understandable? ([Location 3667](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3667)) - Longevity Will the company survive long enough for you to get value from its product? ([Location 3669](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3669)) - Cloud vendors will often bundle their products to work well together. Each cloud can create stickiness with its user base by creating a strong integrated ecosystem. ([Location 3679](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3679)) - Performance versus price comparisons Is the cloud offering substantially better than an independent or OSS version? ([Location 3681](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3681)) - Purchase considerations On-demand pricing can be expensive. ([Location 3682](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3682)) - In general, we favor OSS and COSS by default, which frees you to focus on improving those areas where these options are insufficient. ([Location 3689](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3689)) - Don’t treat internal operational overhead as a sunk cost. ([Location 3690](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3690)) - Monoliths versus modular systems is another longtime debate in the software architecture space. ([Location 3700](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3700)) - Monolithic systems are self-contained, often performing multiple functions under a single system. The monolith camp favors the simplicity of having everything in one place. 
([Location 3703](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3703)) - The modular camp leans toward decoupled, best-of-breed technologies performing tasks at which they are uniquely great. ([Location 3705](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3705)) - The monolith (Figure 4-4) has been a technology mainstay for decades. The old days of waterfall meant that software releases were huge, tightly coupled, and moved at a slow cadence. ([Location 3709](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3709)) - monoliths are brittle. ([Location 3717](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3717)) - Another con of monoliths is that switching to a new system will be painful if the vendor or open source project dies. ([Location 3725](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3725)) - Modularity (Figure 4-5) is an old concept in software engineering, but modular distributed systems truly came into vogue with the rise of microservices. ([Location 3729](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3729)) - Microservices can communicate via APIs, allowing developers to focus on their domains while making their applications accessible to other microservices. ([Location 3732](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3732)) - The famous Bezos API mandate decreases coupling between applications, allowing refactoring and decomposition. Bezos also imposed the two-pizza rule (no team should be so large that two pizzas can’t feed the whole group). Effectively, this means that a team will have at most five members. This cap also limits the complexity of a team’s domain of responsibility—in particular, the codebase that it can manage. ([Location 3736](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3736)) - In a modular microservice environment, components are swappable, and it’s possible to create a polyglot (multiprogramming language) application; a Java service can replace a service written in Python. Service customers need worry only about the technical specifications of the service API, not behind-the-scenes details of implementation. ([Location 3741](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3741)) - Data-processing technologies have shifted toward a modular model by providing strong support for interoperability. Data is stored in object storage in a standard format such as Parquet in data lakes and lakehouses. ([Location 3744](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3744)) - We view data modularity as a more powerful paradigm than monolithic data engineering. Modularity allows engineers to choose the best technology for each job or step along the pipeline. ([Location 3749](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3749)) - The cons of modularity are that there’s more to reason about. Instead of handling a single system of concern, now you potentially have countless systems to understand and operate. Interoperability is a potential headache; hopefully, these systems all play nicely together. ([Location 3751](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3751)) - Orchestration becomes the glue that binds data stack modules together. 
- The distributed monolith pattern is a distributed architecture that still suffers from many of the limitations of monolithic architecture. The basic idea is that one runs a distributed system with different services to perform different tasks. Still, services and nodes share a common set of dependencies or a common codebase. ([Location 3757](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3757))
- Some modern Python-based orchestration technologies (e.g., Apache Airflow) also suffer from this problem. While they utilize a highly decoupled and asynchronous architecture, every service runs the same codebase with the same dependencies. Any executor can execute any task, so a client library for a single task run in one DAG must be installed on the whole cluster. Orchestrating many tools entails installing client libraries for a host of APIs. Dependency conflicts are a constant problem. ([Location 3766](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3766))
- One solution to the problems of the distributed monolith is ephemeral infrastructure in a cloud setting. Each job gets its own temporary server or cluster installed with dependencies. Each cluster remains highly monolithic, but separating jobs dramatically reduces conflicts. ([Location 3769](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3769))
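One lightweight way to get this isolation is to run each task in its own container image with its own pinned dependencies, so nothing task-specific needs to be installed on the orchestrator's workers. A sketch assuming the apache-airflow-providers-docker package; image names and scripts are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="isolated_dependencies",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # The ingest image pins one set of client libraries...
    ingest = DockerOperator(
        task_id="ingest",
        image="mycompany/ingest-job:1.4.2",
        command="python ingest.py",
    )
    # ...while the scoring image can pin conflicting versions without any
    # clash, because the two environments never share an interpreter.
    score = DockerOperator(
        task_id="score",
        image="mycompany/score-job:0.9.0",
        command="python score.py",
    )
    ingest >> score
```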
- Here are some things to consider when evaluating monoliths versus modular options: Interoperability: architect for sharing and interoperability. Avoiding the “bear trap”: something that is easy to get into might be painful or impossible to escape. Flexibility: things are moving so fast in the data space right now; committing to a monolith reduces flexibility and makes decisions harder to reverse. ([Location 3779](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3779))
- A big trend for cloud providers is serverless, allowing developers and data engineers to run applications without managing servers behind the scenes. ([Location 3785](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3785))
- With the promise of executing small chunks of code on an as-needed basis without having to manage a server, serverless exploded in popularity. ([Location 3792](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3792))
- Looking specifically at the case of AWS Lambda, various engineers have found hacks to run batch workloads at meager costs. ([Location 3801](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3801))
- On the other hand, serverless functions suffer from an inherent overhead inefficiency. Handling one event per function call at a high event rate can be catastrophically expensive, especially when simpler approaches like multithreading or multiprocessing are great alternatives. ([Location 3802](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3802))
- Monitor to determine the cost per event and the maximum length of serverless execution in a real-world environment, then use that cost per event to model overall costs as event rates grow (see the cost-model sketch below). ([Location 3805](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3805))
- Containers are one of the most powerful trending operational technologies as of this writing. ([Location 3809](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3809))
- Containers are often referred to as lightweight virtual machines. Whereas a traditional VM wraps up an entire operating system, a container packages an isolated user space (such as a filesystem and a few processes); many such containers can coexist on a single host operating system. ([Location 3811](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3811))
- Container escape, broadly a class of exploits whereby code in a container gains privileges outside the container at the OS level, is common enough to be considered a risk for multitenancy. ([Location 3821](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3821))
- Containerized function platforms run containers as ephemeral units triggered by events rather than persistent services. ([Location 3826](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3826))
- Serverless makes less sense when the usage and cost exceed the ongoing cost of running and maintaining a server (Figure 4-6). ([Location 3838](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3838))
- Customization, power, and control are other major reasons to favor servers over serverless. ([Location 3842](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3842))
- Expect servers to fail. Server failure will happen. ([Location 3844](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3844))
- Use clusters and autoscaling. Take advantage of the cloud’s ability to grow and shrink compute resources on demand. ([Location 3848](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3848))
- Treat your infrastructure as code. Automation doesn’t apply to just servers and should extend to your infrastructure whenever possible. ([Location 3850](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3850))
- Use containers. For more sophisticated or heavy-duty workloads with complex installed dependencies, consider using containers on either a single server or Kubernetes. ([Location 3852](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3852))
- Workload size and complexity: Serverless works best for simple, discrete tasks and workloads. It’s not as suitable if you have many moving parts or require a lot of compute or memory horsepower. ([Location 3857](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3857))
- Execution frequency and duration: How many requests per second will your serverless application process? How long will each request take to process? Cloud serverless platforms have limits on execution frequency, concurrency, and duration. If your application can’t function neatly within these limits, it is time to consider a container-oriented approach. ([Location 3859](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3859))
- Requests and networking: Serverless platforms often utilize some form of simplified networking and don’t support all cloud virtual networking features, such as VPCs and firewalls. ([Location 3862](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3862))
- Language: What language do you typically use? If it’s not one of the languages officially supported by the serverless platform, you should consider containers instead. ([Location 3863](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3863))
- Runtime limitations: Serverless platforms don’t give you complete operating system abstractions. ([Location 3865](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3865))
- Cost: Serverless functions are incredibly convenient but potentially expensive. ([Location 3866](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3866))
- In the end, abstraction tends to win. We suggest looking at using serverless first and then servers, with containers and orchestration if possible, once you’ve outgrown serverless options. ([Location 3870](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3870))
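The cost-per-event model mentioned above is easy to sketch. All prices here are invented; in practice, cost_per_event comes from monitoring real invocations, and the server cost from an instance sized for the same workload:

```python
cost_per_event = 0.0000021   # measured $/invocation (hypothetical)
server_monthly_cost = 150.0  # always-on server for the same workload (hypothetical)

for events_per_month in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    serverless_cost = events_per_month * cost_per_event
    winner = "serverless" if serverless_cost < server_monthly_cost else "server"
    print(f"{events_per_month:>13,} events/mo: serverless ${serverless_cost:>9,.2f} -> {winner}")

# The crossover point where an always-on server becomes cheaper:
print(f"break-even at ~{server_monthly_cost / cost_per_event:,.0f} events/month")
```

Under these made-up numbers, serverless wins up to roughly 71 million events per month and becomes catastrophically expensive well past the crossover, which matches the advice to start with serverless and move to servers once you have outgrown it.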
- Benchmarks either compare databases that are optimized for completely different use cases or use test scenarios that bear no resemblance to real-world needs. ([Location 3888](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3888))
- We applaud benchmarks and are glad to see many database vendors finally dropping DeWitt clauses from their customer contracts. Even so, let the buyer beware: the data space is full of nonsensical benchmarks. ([Location 3890](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3890))
- To benchmark for real-world use cases, you must simulate anticipated real-world data and query size. Evaluate query performance and resource costs based on a detailed evaluation of your needs (see the benchmarking sketch below). ([Location 3898](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3898))
- Nonsensical cost comparisons are a standard trick when analyzing price/performance or TCO. For instance, many MPP systems can’t be readily created and deleted even when they reside in a cloud environment; these systems run for years on end once they’ve been configured. ([Location 3901](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3901))
- The deceit of asymmetric optimization appears in many guises, but here’s one example. Often a vendor will compare a row-based MPP system against a columnar database by using a benchmark that runs complex join queries on highly normalized data. ([Location 3906](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3906))
- As with all things in data technology, let the buyer beware. Do your homework before blindly relying on vendor benchmarks to evaluate and choose technology. ([Location 3912](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3912))
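A minimal harness in that spirit, sketched with DuckDB purely as a stand-in engine; the schema, data volume, and query are hypothetical placeholders for your anticipated workload:

```python
import time

import duckdb

con = duckdb.connect()
# Generate data at roughly the scale and shape you expect in production.
con.execute("""
    CREATE TABLE orders AS
    SELECT (random() * 1000000)::BIGINT AS customer_id,
           random() * 500 AS amount
    FROM range(10000000)
""")

# Use a query that resembles what you will actually run, not a vendor's demo.
query = "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"

start = time.perf_counter()
con.execute(query).fetchall()
print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Run the same query shape at realistic volume against each candidate system, and record resource cost alongside elapsed time, so the comparison reflects your needs rather than a vendor's asymmetric optimization.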
- Whatever technology you choose, be sure to understand how it supports the undercurrents of the data engineering lifecycle. ([Location 3918](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3918))
- Data management is a broad area, and it isn’t always apparent whether a given technology adopts data management as a principal concern. ([Location 3920](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3920))
- When evaluating a new technology, ask: How much control do you have over deploying new code? How will you be alerted if there’s a problem? How will you respond when there’s a problem? ([Location 3934](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3934))
- Good data architecture means assessing trade-offs and choosing the best tools for the job while keeping your decisions reversible. With the data landscape morphing at warp speed, the best tool for the job is a moving target. The main goals are to avoid unnecessary lock-in, ensure interoperability across the data stack, and produce high ROI. ([Location 3942](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3942))
- Airflow relies on a few core nonscalable components (the scheduler and backend database) that can become bottlenecks for performance, scale, and reliability; the scalable parts of Airflow still follow a distributed monolith pattern. ([Location 3957](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3957))
- Finally, Airflow lacks support for many data-native constructs, such as schema management, lineage, and cataloging; and it is challenging to develop and test Airflow workflows. ([Location 3959](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3959))
- Prefect and Dagster aim to solve some of the problems discussed previously by rethinking components of the Airflow architecture. ([Location 3962](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3962))
- As a data engineer, you should strive for simplification and abstraction across the data stack. Buy or use prebuilt open source solutions whenever possible. ([Location 3966](https://readwise.io/to_kindle?action=open&asin=B0B4VH4T37&location=3966))
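As a point of contrast with Airflow's task-centric model, here is a minimal sketch of the asset-centric style Dagster introduces (Prefect takes a similar approach with flows and tasks). Asset names and logic are hypothetical, and API details vary by version:

```python
from dagster import asset, materialize


@asset
def raw_orders():
    # In practice: ingest from a source system.
    return [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 24.50}]


@asset
def order_totals(raw_orders):
    # The dependency is declared by the parameter name, and the framework can
    # track lineage between assets, one of the data-native constructs the
    # authors note Airflow lacks.
    return sum(order["amount"] for order in raw_orders)


if __name__ == "__main__":
    materialize([raw_orders, order_totals])
```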