# Data Engineering With DBT

![rw-book-cover](https://m.media-amazon.com/images/I/81KhKl9-qtL._SY160.jpg)

## Metadata

- Author: [[Roberto Zagni]]
- Full Title: Data Engineering With DBT
- Category: #python #sql #data-engineering

## Highlights

- When working with dbt, you write code in Structured Query Language (normally referred to as SQL, pronounced sequel or as its letters: es-qu-el), with a spritz of Jinja templating. ([Location 548](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=548))
- SQL was created in the 70s and by the end of the 80s, had become the de facto standard to interact with Relational Databases (RDBs), and it now powers most of the data management industry in the world. ([Location 580](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=580))
- Database Management Systems (DBMSs) of all sorts provide a SQL interface, even on top of many non-RDB systems. ([Location 585](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=585))
- With SQL, you work at the logical level, so you do not have to deal with implementation details, and it is a declarative language; you describe in a rigorous way what you want to achieve, not how to do it. The database engine has the freedom to store data, be implemented, and perform the request in its own way, as long as it produces the correct result according to SQL specifications. ([Location 597](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=597))
- A database contains tables, organized in schemata (plural of schema) ([Location 616](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=616))
- Tables store data in rows that have one value for each column defined in the table ([Location 618](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=618))
- Columns have a name and can contain only data of the declared data type ([Location 620](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=620))
- To regulate access to the data, privileges are assigned to users and roles ([Location 621](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=621))
- The table is the most central concept in SQL, as it is the object that contains data. ([Location 629](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=629))
- Table definition: It lists the columns that make up the table, and each column provides the data type and other optional details. ([Location 651](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=651))
- Table content: It is organized in rows, each row containing one value for each column defined for the table or null if no value is provided. ([Location 653](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=653))
- Data value: All the values in a column must be compatible with the type declared for the column. null is a special value that corresponds to the absence of data and is compatible with all data types. ([Location 655](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=655))
- If you already have data and you want to make it available with some transformation or filtering, you can create a view. You can think of a view like a table, but with the column definition and data both coming from a query, reading from one or more tables. ([Location 673](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=673))
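A minimal sketch of the table and view concepts from the highlights above; the table, view, and column names are invented for illustration.

```sql
-- A table with typed columns; every row must provide values compatible with these types (or null).
CREATE TABLE orders (
    order_id    INTEGER,
    customer_id INTEGER,
    order_date  DATE,
    amount      NUMBER(10,2)
);

-- A view: both the column definition and the data come from a query reading one or more tables.
CREATE VIEW big_orders AS
SELECT order_id, customer_id, amount
FROM orders
WHERE amount > 100;
```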
- We see that this simple create view statement provides the name for the view and uses a query, which is a SELECT statement, to define what data is made available by the view. ([Location 680](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=680))
- A DB can be further divided and organized using schema objects. ([Location 695](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=695))
- The database.schema notation, also known as a fully qualified name, allows us to precisely describe in which database to create the schema and, after its creation, uniquely identifies the schema. ([Location 708](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=708))
- A user represents one individual user or service that can access the database, while a role represents a named entity that can be granted privileges. ([Location 718](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=718))
- Data Definition Language (DDL): DDL contains the commands that are used to manage the structure and organization of a database ([Location 737](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=737))
- Data Manipulation Language (DML): DML contains the commands that are used to manipulate data, for example, INSERT, DELETE, and UPDATE ([Location 739](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=739))
- Data Query Language (DQL): DQL contains the SELECT command and is the central part of SQL that allows querying and transforming the data stored in a database ([Location 742](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=742))
- Data Control Language (DCL): DCL contains the GRANT and REVOKE commands, which are used to manage the privileges that control the access to database resources and objects ([Location 744](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=744))
- Transaction Control Language (TCL): TCL contains the commands to manage transactions ([Location 747](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=747))
- DDL commands do not deal directly with the data but are used to create and maintain the structure and organization of the database, including creating the tables where the data is stored. ([Location 751](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=751))
- When working with dbt, we use the DDL and DML commands only in macros. ([Location 767](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=767))
- DML provides the commands to manipulate data in a database and carry out bulk data loading. ([Location 770](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=770))
- DQL is the reason why SQL exists: to query and transform data. The command that is used to query data is SELECT, which is without any doubt the most important and versatile command in all of SQL. ([Location 793](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=793))
- DCL contains the GRANT and REVOKE commands, which are used to manage privileges and roles that control access to, or use of, database resources and objects. Together with the DDL commands to create roles, users, and other database objects, the DCL commands are used to manage users and security: ([Location 808](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=808))
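A minimal sketch of DCL together with the related DDL for roles and users, assuming Snowflake-style syntax; the database, schema, role, and user names are invented.

```sql
-- DDL: create a role and a user
CREATE ROLE reporter;
CREATE USER alice;

-- DCL: grant privileges to the role, then grant the role to the user
GRANT USAGE ON DATABASE analytics TO ROLE reporter;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE reporter;
GRANT ROLE reporter TO USER alice;

-- REVOKE removes a previously granted privilege
REVOKE SELECT ON ALL TABLES IN SCHEMA analytics.marts FROM ROLE reporter;
```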
- A transaction groups a set of SQL commands into a single execution unit and guarantees that either all the effects of all commands are applied, if the transaction completes with success, or no effect at all is applied if the transaction fails. This can also be described with the ACID acronym, which stands for atomic, consistent, isolated, and durable. ([Location 818](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=818))
- A user can have the privilege directly, that is, assigned to the user, or indirectly, through a role that the user can impersonate. ([Location 883](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=883))
- If you are familiar with SQL from other database systems, you will wonder what the QUALIFY clause is. It is an optional SQL clause that is very well suited to the analytical kind of work that Snowflake is used for and that not all database engines implement. ([Location 1026](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1026))
- The term query should really only refer to SELECT statements, as SELECT statements are used to query data from the database, but query is often used with any statement that has to do with data. ([Location 1035](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1035))
- The WITH clause is optional and can only precede the SELECT command to define one or more Common Table Expressions (CTEs). A CTE associates a name with the results of another SELECT statement, which can be used later in the main SELECT statement as any other table-like object. ([Location 1043](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1043))
- Defining a CTE is useful for the following:
    - Clarity: You can provide an informative name to a piece of SQL
    - Reuse, maintenance, and efficiency: You can define a supporting query, whose results you might use more than once in the main query, and the database engine will execute the supporting query once
    - Creating recursive queries: Defining a CTE that queries itself is the only way to use recursion in SQL
    - Break a complex transformation into smaller, simpler steps that are easier to code and understand and limit the scope of future maintenance ([Location 1048](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1048))
- The SELECT command can appear in two possible forms: as the central clause of a SELECT statement or used as a clause in other statements. ([Location 1101](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1101))
- Tables and views: Tables and views are the main data objects SQL works with, so most of the time, this is what you find in the FROM clause. ([Location 1197](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1197))
- Table functions: Table functions are a category of functions that return a set of rows, with one or more columns. They can be predefined functions or user-defined ones. ([Location 1200](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1200))
- A VALUES clause: This clause allows us to build a set of rows, using constant values. This allows us to create an inline table that can be suitable as a reference table, mapping table, or test input. ([Location 1202](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1202))
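A minimal sketch combining the WITH clause and an inline VALUES table from the highlights above; the customers table, the mapping values, and all names are invented.

```sql
-- The CTE gives a name to an inline table built with VALUES; the main SELECT
-- then uses it like any other table-like object.
WITH country_mapping AS (
    SELECT * FROM (VALUES
        ('IT', 'Italy'),
        ('SE', 'Sweden')
    ) AS v (country_code, country_name)
)
SELECT c.customer_id, m.country_name
FROM customers AS c
LEFT JOIN country_mapping AS m ON m.country_code = c.country_code;
```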
- JOIN is a subclause of the FROM clause and is used to describe in a more precise way how to combine the rows of two tables. JOIN cannot exist outside of a FROM clause. ([Location 1251](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1251))
- The condition in each join is a Boolean expression, often using columns from the base table and the joined table. It does not need to be an equality check and can use columns from multiple tables or even constants. ([Location 1258](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1258))
- The WHERE clause specifies the conditions that the rows involved in the command need to match. ([Location 1272](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1272))
- A GROUP BY clause is used to calculate aggregate functions on groups of rows that produce the same value for the group by expression. ([Location 1311](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1311))
- The HAVING clause filters the rows resulting from a GROUP BY clause according to a predicate: ([Location 1344](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1344))
- The QUALIFY clause filters on the results of the window functions, which is the third type of data calculation that we can have in a query after expressions and group by. ([Location 1368](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1368))
- WHERE applies to the data being read from the sources specified in the FROM / JOIN clauses; HAVING applies to the data that has been grouped by a GROUP BY clause; QUALIFY applies to the data that has been calculated by a window function ([Location 1375](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1375))
- The ORDER BY clause specifies in which order the query or window function lays out the rows. ([Location 1426](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1426))
- The LIMIT clause is used to restrict the number of rows returned by the query. ([Location 1442](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1442))
- Now is a good time to bring your attention to the fact that these clauses are generally evaluated in the following specific order, as well as what it is important to pay attention to for each clause:
    1. FROM and its JOIN subclause, which are used to identify the source data for the query.
    2. The WHERE clause, which is used to filter out the source data that we do not want. This is probably the most important clause for performance, because the less data a query works on, the quicker it is. Use WHERE whenever possible to just bring in the data you need.
    3. The GROUP BY clause, which groups the source data left after applying the WHERE clause and calculates the aggregate functions on the grouped data.
    4. The HAVING clause, which filters on the results of GROUP BY.
    5. Partitioning of the windows and calculation of the window functions.
    6. The QUALIFY clause, which filters on the results of the window functions.
    7. The DISTINCT keyword, if applied to the SELECT clause, which removes duplicated rows.
    8. The ORDER BY clause, which puts the resulting rows in the desired order.
    9. The LIMIT clause, which caps the rows returned by the query to the desired amount. ([Location 1462](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1462))
- Arithmetic operators: These are the traditional +, -, *, /, and % (modulo). ([Location 1496](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1496))
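A minimal sketch putting most of these clauses into one query, with the evaluation order noted in the comments; the orders table and column names are invented, and QUALIFY assumes an engine that supports it, such as Snowflake.

```sql
-- Evaluation order: FROM/JOIN -> WHERE -> GROUP BY -> HAVING -> window functions
--                   -> QUALIFY -> DISTINCT -> ORDER BY -> LIMIT
SELECT
    customer_id,
    SUM(amount) AS total_amount
FROM orders
WHERE order_date >= '2023-01-01'                          -- filter the source rows first
GROUP BY customer_id                                      -- group and aggregate what is left
HAVING SUM(amount) > 0                                    -- filter the grouped rows
QUALIFY RANK() OVER (ORDER BY SUM(amount) DESC) <= 10     -- filter on a window function
ORDER BY total_amount DESC                                -- order the result
LIMIT 10;                                                 -- cap the rows returned
```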
- Comparison operators: These are used to test two values for equality or other comparisons, and are = (equal), != (not equal), <> (also not equal), < (less than), <= (less than or equal), > (greater than), and >= (greater than or equal). ([Location 1505](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1505))
- Logical operators: These are the traditional AND, OR, and NOT Boolean operators. ([Location 1522](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1522))
- Set operators: These operators are used to put together the results from multiple queries, and are INTERSECT, MINUS or EXCEPT, and UNION [ALL]. ([Location 1534](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1534))
- INTERSECT returns the rows that appear in both queries, checking all columns to have the same value; MINUS or EXCEPT returns the rows from the first query that do not appear in the second; UNION [ALL] returns the rows from both queries, with ALL keeping duplicates ([Location 1544](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1544))
- Subquery operators: These operators allow us to use subqueries in WHERE clauses. ([Location 1561](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1561))
- The JOIN clause, and the equivalent forms of the FROM clause with multiple tables, is used to combine the rows from two tables to create a row with the columns (that you select) from both tables. ([Location 1592](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1592))
- In the most common cases, the join type is one of [INNER] or { LEFT | RIGHT | FULL } [ OUTER ]. ([Location 1689](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1689))
- A window function is a SQL function that operates on a set of rows, called a window of rows. ([Location 1845](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1845))
- A window is a set of correlated rows, containing any amount from 0 to all the rows resulting from a query. ([Location 1848](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1848))
- A window is defined using the OVER clause, which specifies how to partition and order the rows of the incoming data to apply the window function. ([Location 1882](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1882))
- The PARTITION BY clause defines how the rows are partitioned and <part_expr> is a comma-separated list of columns or expressions calculated on the query data, in a similar way to what we have seen for the group by expressions and shown in the previous example. ([Location 1890](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1890))
- The ORDER BY clause specifies the ordering of the rows to be used to calculate the function and it is not visible outside of the window function. The order by clause and <order_expr> work pretty much as they do in the SELECT statement to order the results of the query. ([Location 1893](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1893))
- Window frame functions: These functions work on a window frame, which is a dynamic subset of rows of a window. As we saw previously, they can use cumulative or sliding window frames. The next section will delve into these. ([Location 1908](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1908))
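A minimal sketch of a window function with its OVER clause, partitioned and ordered as described above; the orders table and column names are invented.

```sql
-- RANK() is calculated separately inside each customer's partition, ordered by amount;
-- the ORDER BY inside OVER is not visible outside the window function.
SELECT
    customer_id,
    order_id,
    amount,
    RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
FROM orders;
```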
- A cumulative window frame is a subset of the window that is either growing or shrinking based on the current row inside the window. ([Location 1919](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1919))
- A sliding window frame is a subset of rows that extends from a number of N rows before the current row to a number of M rows after the current row. One or both sides can be fixed to the start or end of the partition. ([Location 1927](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1927))
- ROWS: Indicates that the fixed side of the frame extends to the limit of the partition: from the start of the partition to the current row or from the current row to the end of the partition. ([Location 1941](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1941))
- RANGE: Indicates that the fixed side of the frame extends to the previous/following rows that have the same value for the ORDER BY value. The frame operates on the preceding/following rows that are in the same position in the ordering. As an example, you can think of stepping through many teams that have the same number of points in a championship list. ([Location 1944](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=1944))
- Version control (VC) also allows easily identifying the changes between one version and the next. ([Location 2005](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2005))
- A repository, often called a repo for short, is the container of the files for a project. ([Location 2025](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2025))
- A commit action is the action of saving the current version of some files and assigning an identifier to that set. A commit action moves the branch pointer one step forward, from the previous commit to the one being created. ([Location 2029](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2029))
- A commit represents the state of all files at the specific moment in time when the commit was done, and it is identified by a unique ID. ([Location 2031](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2031))
- A branch is used to create an alternative development line that can evolve independently from other branches. ([Location 2035](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2035))
- A merge is an operation that combines the changes that happened to the files in one branch into another branch. Just after the merge, the contents of the two branches are equal, but then they can keep evolving independently. ([Location 2041](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2041))
- A merge conflict is a situation when a file has been changed in both branches involved in a merge operation and the system cannot determine automatically how to mix the changes. A simple, manual intervention is needed to edit and save the desired final version of the conflicted files. ([Location 2043](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2043))
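Going back to the window frame highlights above, a minimal sketch of a cumulative frame (running total) and a sliding frame (moving average); the orders table and column names are invented.

```sql
SELECT
    order_date,
    amount,
    -- cumulative frame: from the start of the partition up to the current row
    SUM(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    -- sliding frame: from 2 rows before the current row to the current row
    AVG(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3
FROM orders;
```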
- Let’s list what the dbt default project contains:
    - README.md: This is a text file with some instructions and pointers to the dbt documentation
    - dbt_project.yml: The main configuration file
    - .gitignore: A git-specific file that lists resources to exclude from VC, such as the dbt_packages, target, and logs folders
    - Inside the models/example folder, we have two models and a config file:
        - my_first_dbt_model.sql: As the name suggests, this is the first model, which is just made up of plain SQL that will generate two rows of hardcoded data
        - my_second_dbt_model.sql: This is the second model, and it reads from the first model, introducing the use of the ref function to reference the other model
        - schema.yml: Another configuration file that provides metadata for the models and also declares tests that we want to be run on these models ([Location 2490](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2490))
- At the base of a relational database, there is the Entity-Relationship (E-R) model. Therefore, you will learn how you can use E-R models to represent data models that describe the data you have or want to collect. ([Location 2697](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2697))
- Data modeling encompasses the activities to provide a clear context for your data, at multiple levels. ([Location 2734](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2734))
- The typical components of a visual data model are as follows:
    - Entity: An entity is a concept that we want to analyze, a thing of interest that we want to collect data about, such as a car, an order, or a page visit
    - Attribute: An attribute is a property of an entity for which we want to store values, such as a car plate, a customer, or the total amount of a transaction
    - Relationship: A relationship is a connection between two entities that are related to each other and captures how they are related, such as the ownership relation between a car and its owner, or the purchasing relation between a customer and a product ([Location 2742](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2742))
- Data modeling is generally used in two main contexts:
    - To describe what data should be generated or collected by an application or in a project. In this case, we build the data model before having the data.
    - To describe some data that already exists. ([Location 2764](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2764))
- In data warehousing, we always start with the second case, as we get data “as is” from source systems, and if we do not have a data model of the source system, we reverse-engineer one that describes the data to the best of our knowledge. ([Location 2772](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2772))
- A good practice is to explicitly design this new data model before coding the transformations, drawing at least one model that represents the business concepts that are the goal of the integration and will form the refined targets of our transformations. This model can be at a high level of abstraction, describing the business and only the main entities, or at a greater level of detail, already representing the technical design. ([Location 2779](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2779))
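As an illustration of the default project highlight above, a minimal sketch of what the two example models could look like; the exact contents shipped by dbt may differ.

```sql
-- models/example/my_first_dbt_model.sql: plain SQL generating two rows of hardcoded data
select 1 as id
union all
select null as id
```

```sql
-- models/example/my_second_dbt_model.sql: reads from the first model through the ref function,
-- so dbt knows the dependency and builds the models in the correct order
select *
from {{ ref('my_first_dbt_model') }}
where id = 1
```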
- Conceptual data model: This is the most abstract model, defining what will be in the domain of the project, providing the general scope
- Logical data model: This model provides much greater detail, defining what the data will look like
- Physical data model: This is the most detailed model, describing exactly how the data will be stored in the database ([Location 2817](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2817))
- The logical data model is a technical model that defines how the data will be organized, without yet committing to a particular technical implementation. This model starts to identify the key attributes of the entities, such as Natural Keys (NKs) and Business Keys (BKs), which will be used to identify the entity instances or to connect to other entities or reference tables. It also identifies key measures and categorizations that are important for the use of the data. ([Location 2848](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2848))
- The following logical data model is drawn using the crow’s foot notation, which is the simplest and most common for E-R diagrams. ([Location 2859](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2859))
- The physical data model describes the actual implementation of the data model in a specific database. The name physical indicates it is extremely detailed, actually one-to-one with the DB definitions, and can often be used to generate the commands to instantiate the database or be created by reverse-engineering of the database. This model includes the Primary Keys (PKs), indexes, and everything needed to instantiate the database, including the conversion of many-to-many relations into associative tables. ([Location 2885](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2885))
- The cardinality of a relationship indicates how many instances of one entity on one side of the relationship participate in the relationship with how many instances of the entity on the other side. ([Location 2942](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2942))
- One-to-one cardinality: This is quite a rare case where only one instance of each entity participates on each side. ([Location 2945](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2945))
- One-to-many cardinality: This is the most common type of relationship, where one entity on one side participates with many entities on the other side. ([Location 2955](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2955))
- Many-to-many cardinality: This is a relatively uncommon case where many entities on one side participate in the relationship with many entities on the other side. ([Location 2963](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2963))
- What is the reasoning process to follow to understand cardinalities or find them for your model? The answer is simple: you take a generic entity on one side of the relation and think about how many entities on the other side of the relationship it will be related to in your domain. ([Location 2975](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2975))
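A minimal physical-model sketch showing how a many-to-many relationship becomes an associative table, as mentioned in the highlights above; the entities, columns, and keys are invented.

```sql
-- Two entities, each with its own primary key
CREATE TABLE student (
    student_id  INTEGER PRIMARY KEY,
    full_name   VARCHAR
);

CREATE TABLE course (
    course_id   INTEGER PRIMARY KEY,
    course_name VARCHAR
);

-- The many-to-many relation "student enrolls in course" is converted into an
-- associative table whose composite primary key is made of the two foreign keys.
CREATE TABLE enrollment (
    student_id  INTEGER REFERENCES student (student_id),
    course_id   INTEGER REFERENCES course (course_id),
    enrolled_on DATE,
    PRIMARY KEY (student_id, course_id)
);
```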
- Primary Key (PK): This is the field or set of fields that uniquely identifies an instance of an entity. Primary indicates that it is the key that is used to ensure a uniqueness constraint on the entity. It is the main key used to determine whether two rows relate to the same instance or not when storing data.
- Natural Key (NK): This is a key that has a business significance and uniquely identifies the instances of the entity in the given context. An NK also exists outside of the database world. It is something printed on forms or that a person might use to identify an instance of the entity in the real world. An example could be the plate for a car inside a state or an email for a user inside a system.
- Surrogate Key (SK): This is a key that has no business significance and is used as a key inside the system that generated it. An old way to generate SKs was to use sequences. A more modern way is to use hash functions. SKs are often used to simplify identification, especially when no good NK exists or when NKs or PKs are made of multiple fields.
- Foreign Key (FK): This is a field or set of fields in one entity that matches the key of another entity. The field names do not have to match, as long as the content does. ([Location 2991](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=2991))
- A weak entity is an entity that does not have any meaning without the header entity, such as an order line without an order header. ([Location 3160](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3160))
- Bill of Materials (BOM): This is a common occurrence in production organizations. ([Location 3236](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3236))
- Work Breakdown Structure (WBS): A WBS represents all the tasks needed to complete a project. It is usually represented by progressively smaller subprojects, which are tasks themselves, building a multilevel hierarchy of tasks. ([Location 3241](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3241))
- John Giles, in his book The Elephant in the Fridge, introduces a series of 12 general data models that cover most needs of a company. ([Location 3371](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3371))
- The fan trap is a very common problem that can happen every time you have a join in a one-to-many relationship. It is not a problem of the relationship, but of how you might use it. The fan trap problem causes the calculations done on measures joined from the one side of the one-to-many relationship to be wrong. ([Location 3386](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3386))
- In essence, a chasm trap is an unmanaged and often unintentional many-to-many relationship. ([Location 3458](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3458))
- The short answer is that two tables that have an FK in common, such as a customer key or a product key, are probably related in some way, but the common key alone is not enough to express the possibly multiple and complex relations between the two tables. Trying to extract information without enough data in general does not provide meaningful results. ([Location 3505](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3505))
- Probably the most ubiquitous design style in the data warehousing world is the dimensional model from Ralph Kimball, which is the de facto standard in the BI area. The signature feature of the dimensional modeling design is to produce a data model structure called a star schema, which is the access interface to the data for BI tools. ([Location 3542](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3542))
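A minimal sketch of how the fan trap described above distorts aggregates, assuming invented orders and order_lines tables where the measure (the order total) lives on the one side of the one-to-many relationship.

```sql
-- Fan trap: the order total is repeated on every joined line,
-- so summing it after the join multiplies it by the number of lines per order.
SELECT o.customer_id, SUM(o.order_total) AS inflated_total
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
GROUP BY o.customer_id;

-- One way to avoid it: aggregate the "many" side first, then join at the same grain.
WITH lines_per_order AS (
    SELECT order_id, COUNT(*) AS num_lines
    FROM order_lines
    GROUP BY order_id
)
SELECT o.customer_id, SUM(o.order_total) AS correct_total
FROM orders o
LEFT JOIN lines_per_order l ON l.order_id = o.order_id
GROUP BY o.customer_id;
```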
- In a star schema, a fact table, containing the events or records with the measures of interest, is surrounded by several dimension tables. A dimension table is a denormalized table that contains the descriptive data at the lowest level of detail that you want to access and should not contain measures (in the original Kimball definition) to avoid the fan trap issue. A measure is a number that you want to use for calculations. ([Location 3546](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3546))
- The main advantage of a star schema is its simplicity, which makes it very simple to query and easy to understand even for non-technical people. ([Location 3554](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3554))
- We have seen the main concepts of a Kimball warehouse, so let’s also look at some of its downsides:
    - According to the original definition, there should not be measures in dimensions, because if you have them and you use them for calculations involving the fact, you always have a fan trap. Not having or not using important metrics like the price in a product dimension is very limiting, so in practice we do have measures in the dimensions, and you then need to be aware of when a fan trap is triggered.
    - Another limitation according to the Kimball rules is that facts are only connected to dimensions, not other facts. But some of the most interesting analyses are built by using data from multiple facts, such as orders and deliveries or returns. Kimball star schemas do not help here, so you can look at other methodologies, such as the USS, or write your own SQL to produce a new joined fact to be analyzed.
    - The Kimball method is oriented to report business processes, not build a data view of the enterprise, so the facts and conformed dimensions might not handle other reporting needs besides the ones that the star schemas were designed for.
    - The process of loading and storing the data from the data sources is deeply intertwined with the implementation of the business rules because the conformed dimensions and the facts are both the goal of the business transformations and the primary storage for all the data. ([Location 3584](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3584))
- The USS (Unified Star Schema) is an extension of the Kimball star schema that addresses some of its limitations, such as the fan trap and the impossibility of querying more than one fact at a time without the risk of a chasm trap. ([Location 3611](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3611))
- The USS produces a single star schema with the Puppini Bridge at the center like a super fact and all the dimensions around it. This makes it extremely easy to use for every kind of user as the data model is fixed and trivially simple, while powerful enough to answer more BI questions than a full set of data marts. ([Location 3613](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3613))
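A minimal sketch of querying a star schema as described above: a fact table joined to denormalized dimensions, grouping by descriptive attributes and aggregating the measures; all table and column names are invented.

```sql
SELECT
    d.calendar_month,
    p.product_category,
    SUM(f.sales_amount) AS total_sales,
    SUM(f.quantity)     AS total_quantity
FROM fact_sales  AS f
JOIN dim_date    AS d ON d.date_key    = f.date_key
JOIN dim_product AS p ON p.product_key = f.product_key
GROUP BY d.calendar_month, p.product_category
ORDER BY d.calendar_month, p.product_category;
```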
- The following figure shows the Oriented Data Model (ODM) from the USS methodology that always represents one-to-many relationships as an arrow going in the lookup direction, that is from left (the many side) to right (the one side) with the arrow pointing to the one side of the relation: ([Location 3623](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3623))
- The signature concept of the Inmon design style is the goal to create a corporate-wide, integrated, and interoperable data repository. While Kimball was working to satisfy the need for reporting, Inmon tried to build a repository of data that describes the organization and its business processes, providing a company-wide single source of truth. ([Location 3662](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3662))
- The goal is therefore to identify the business concepts, that is, the entities, that describe the company and how it works and take the input source data and turn it into homogeneous data under the correct concept. To avoid redundancy and maximize the expressive power, the data is normalized in Third Normal Form (3NF), and all interesting relations can be captured between the desired concepts. ([Location 3665](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3665))
- DV (Data Vault) is a data platform-building methodology invented by Dan Linstedt. ([Location 3687](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3687))
- The most central idea in DV is that all the information is aggregated around the identified business concepts and data is integrated by the BK. Concept definition and identity management are two important cornerstones of DV2. The signature DV model is based on these three core elements:
    - Hub: Contains all the unique instances of BK for the concept of the hub
    - Link: Contains all the unique instances of a relationship between hubs, as a tuple of BKs, providing a very flexible many-to-many relation
    - Satellite: Contains the changes over time of the descriptive information connected with a hub or link ([Location 3692](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3692))
- The last step is taking the required data from the DV and BV to produce the data marts, where the data is usually organized according to the Kimball star schema, the USS, or whatever model your BI tool and users in your organization need to receive the information, that is, data understandable by business users. ([Location 3732](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3732))
- The following is a list of logical and technical reasons why I like DV:
    - The greatest advancement of DV is the separation of the storage of the historical data from the application of the business rules, which gives each part of the architecture only one reason to change, drastically reducing the need for reworking to add or change a data source or to modify or add a business rule.
    - The use of just three model elements, hub, link, and satellite, which are highly standardized, allows fully automating and automatically generating most of the loading logic.
    - The use of natural keys (NKs), instead of surrogate keys (SKs) such as sequences, greatly simplifies the ETL as there is no need for SK lookups; this allows for huge parallelism in the processing. Note that NKs can be used “as is”, including composite keys, or through hashing or other transformations to build a single key field derived from the NK.
    - The use of Hashed Difference (HashDIFF) fields to check changes between instances of an entity greatly simplifies and automates the loading patterns.
    - The adoption of loading patterns that use insert-only queries to store and update a complete and auditable history is a great advancement and enables the manipulation of billions of rows exploiting the scalability of the new cloud-based platforms.
    - The ability to handle bitemporality, as in, when something happened versus when we got that information, is very important in some contexts, such as compliance departments.
- The following is a list of reasons why I do not like DVs:
    - The creation of a DV platform requires the presence of skilled data professionals with training and experience in DV2, which is a hard requirement to meet in many situations.
    - DV models require a huge number of joins to collate all the bits and pieces of data back in the correct form. This is generally beyond the comfort zone of most developers and certainly out of reach for analysts and business users.
    - The creation of a proper DV requires a thorough analysis of the business concepts of the domain under development. This requires sponsorship at a high enough level, some time, availability of experts on the business side, and someone who is able to put all that together. ([Location 3735](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3735))
- The data mesh approach was born in some of the most tech-heavy companies to obviate a very common problem that affects most of the data engineering industry: data ownership. ([Location 3773](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3773))
- Instead of trying to build a unified, centralized data platform, a data mesh is built out of business-unit-scale data platforms that can interact through known and well-defined interfaces, with clear semantics, each maintained by a different team. ([Location 3789](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3789))
- In short, a data product can be one of the following: An application that allows doing something using the underlying data that you do not have but is in the system providing the product. ([Location 3794](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3794))
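Going back to the Data Vault highlights above, a minimal sketch of what a hub and a satellite for a customer concept could look like; the names, keys, and columns are invented and heavily simplified compared to a full DV2 implementation.

```sql
-- Hub: one row per unique business key (BK) of the customer concept
CREATE TABLE hub_customer (
    customer_hkey VARCHAR,      -- key derived from the BK (for example, a hash of it)
    customer_bk   VARCHAR,      -- the business key itself
    load_ts       TIMESTAMP,    -- when we first saw this BK
    record_source VARCHAR       -- the source system it came from
);

-- Satellite: the changes over time of the descriptive data connected to the hub
CREATE TABLE sat_customer_details (
    customer_hkey  VARCHAR,     -- points back to the hub
    load_ts        TIMESTAMP,   -- when we received this version of the data
    hash_diff      VARCHAR,     -- hash of the descriptive payload, used to detect changes
    customer_name  VARCHAR,
    customer_email VARCHAR
);
```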
([Location 3909](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3909)) - Data movement and storage: Data from multiple data sources is moved to some place to be analyzed. Let’s call it a data platform, to be generic and implementation agnostic. We call the data stored here with the desired time dept historical data or just history. ([Location 3913](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3913)) - Data transformation: The goal of data engineering is to take the data made available on the data platform and transform it to produce integrated, meaningful, and easy-to-consume information for business users. This is the key stage of a data platform, where value is added by producing the refined data used in reports and understood by business users. ([Location 3917](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3917)) - Access to reports: The most common interface to the refined data is the business reports built with reporting tools that provide simple access to pre-built reports for general business users and may allow varying grades of self-service reporting for power users. ([Location 3920](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3920)) - Data write back: Eventually, business users can interact with the data in the reporting tools to produce data that is fed back into the data platform. Business user input, as well as machine learning insights, can produce data that is even sent back to the source systems. ([Location 3923](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3923)) - Event-based observability: This is when the system itself publishes the changes that are considered useful for external applications to know. ([Location 3983](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3983)) - Change data capture: This is when we can connect to the DB log or some other event stream from the source system. ([Location 3988](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3988)) - Application Programming Interface (API): Another very common use case is when a system provides an API that we can use to read data from the system. ([Location 3992](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=3992)) - Database snapshot: This is another very common method, where the data from the source system is exported at a specific time (whether at regular intervals or on demand) and might contain all the data (full data export) or only the most recent data (created or changed since a known time). ([Location 4000](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4000)) - Queueing systems: Storing the data in queue systems, such as Kafka, is very common when the source system is designed to provide external observability, sending itself messages about its state or events. ([Location 4020](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4020)) - suggestion is to keep the data unaltered as much as possible and as close as possible to how it was in the source system. ([Location 4059](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4059)) - Our advice on this matter is to move and save the raw data as much as possible “as is”, performing only the very important tasks of properly identifying the concepts, their identity, by clearly naming and testing the primary and business keys, and their way of changing, as these are three key aspects to store history properly. 
- Nowadays, we have two main use cases: classical business reporting, derived from source data applying business rules, and advanced analytics, where we include the reporting insights generated by machine learning or AI. ([Location 4179](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4179))
- We have seen that the analytics engineer is a bit of a data engineer, in collecting and understanding the raw data, and a bit of a data analyst, in being aware of how to create datasets that enable good analysis. But the core of this role is to transform the data into information using the best practices that allow software engineers to produce and maintain reliable software. ([Location 4317](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4317))
- The analytics engineer organizes the data transformations and their dependencies just as a software engineer designs an application, using reusable patterns, and applies the separation of concerns to minimize the need to change a model for many different reasons. ([Location 4325](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4325))
- In short, the core ideas of DevOps are to provide the team with the tools it needs, as well as the authority and the responsibility for all of the development cycle: from software coding to Quality Assurance (QA), to releasing and then running the production operations. ([Location 4336](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4336))
- These ideas are exemplified by the four-eyes principle, requiring each piece of code to be seen and approved by at least two people (through pairing, pull requests, or code walkthroughs), and the boy scout rule, which prompts you to ensure a piece of code you touch is left in better shape than it was before you touched it, independently of the person that wrote it, therefore improving the work over time. ([Location 4341](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4341))
- The three cornerstones of DataOps, ownership, freedom, and responsibility, heavily rely on the fact that the team has a solid quality assurance process in place. ([Location 4369](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4369))
- The modern data stack allows you to build reusable models so that you can code the customer only once, maybe composing it by putting together lower-level models that represent the online customers, the store customers, and the wholesale customers. ([Location 4398](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4398))
- Continuous integration is the next step in increasing the speed and trust in the development process, making sure that all the individual features that have been developed are working well together. ([Location 4420](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4420))
- Develop being mindful about the future, that is, taking into consideration evolution and maintenance. ([Location 4427](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4427))
- Avoid long and complex queries by breaking down your logical steps into individual Common Table Expressions (CTEs) so that you can provide proper naming for each step and create clean, readable code. ([Location 4436](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4436))
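A minimal sketch of a dbt model broken into named CTEs, as the highlight above suggests; the model and column names are invented.

```sql
-- models/marts/customer_orders.sql
with

orders as (               -- import step: read upstream models through ref
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

orders_per_customer as (  -- one named CTE per logical step
    select
        customer_id,
        count(*)    as order_count,
        sum(amount) as total_amount
    from orders
    group by customer_id
)

select
    c.customer_id,
    c.customer_name,
    o.order_count,
    o.total_amount
from customers as c
left join orders_per_customer as o on o.customer_id = c.customer_id
```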
- Keep your models under 100 lines and the number of joins to two or three, unless a model has the single purpose of joining many tables. ([Location 4438](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4438))
- SRP: The single-responsibility principle says that a model should have one, and only one, reason to change. ([Location 4451](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4451))
- OCP: The open/closed principle says that a model should be open for extension and closed for (interface) modification. This means that the models that we use as connecting points between systems and models should be stable in the interfaces they expose and should also be easily extensible. ([Location 4454](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4454))
- ISP: The interface segregation principle says that a client should not depend upon an interface it does not use. ([Location 4458](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4458))
- LSP: The least surprise principle says that a model should do what seems obvious it would do from its name and position in a project. ([Location 4463](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4463))
- The signature feature of the architecture that we propose is the separation of the source data ingestion from the application of the business logic and the creation of the business concepts. ([Location 4556](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4556))
- Decoupling the two most important activities, ingesting and building the business concepts, simplifies the overall development, allowing the use of the best solution for each part of the problem. ([Location 4559](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4559))
- Sources: The general mechanism for using external data in dbt is to define the initial data you want to start your transformation from as a source, defined by a source system definition that points to a database and schema, and one or more table definitions that select one table or view from that database and schema. ([Location 4777](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4777))
- Seeds: In dbt, you can put CSV files into designated folders and have dbt load them into a table with the dbt seed command. Note that this is not a general-purpose CSV data loading mechanism by design. ([Location 4782](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4782))
- Created data: As strange as it sounds, in a DW project, some data can be created out of nowhere. A common example is the calendar dimension, but this is true for anything that can be generated by some logic and little starting data. ([Location 4785](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4785))
- Direct hardcoded references: When you start writing code in dbt, it will be common for you to use direct references to actual database tables, as you do in normal SQL queries, instead of using the source and ref references. Please resist this at all costs. ([Location 4789](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4789))
- The data suitable to be loaded as a seed has the following properties: changes infrequently, is generally maintained by hand, and is suitable or even required to be versioned inside your Git repository to track changes. ([Location 4880](https://readwise.io/to_kindle?action=open&asin=B0C4LL19G7&location=4880))
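A minimal sketch of replacing a direct hardcoded reference with a source reference, as the highlights above recommend; the source name, table name, and database identifiers are invented, and the source itself would be declared in a YAML file of the dbt project.

```sql
-- Avoid: a direct hardcoded reference to a database table
-- select * from raw_db.crm.customers

-- Prefer: a source reference that dbt resolves from the source definition,
-- so the project keeps working when the source database or schema changes
select * from {{ source('crm', 'customers') }}
```

Using source and ref also lets dbt build the dependency graph between sources and models, which is what enables lineage, documentation, and running models in the correct order.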