# AWS for Solutions Architects

![rw-book-cover](https://m.media-amazon.com/images/I/91K-xhrBRXL._SY160.jpg)

## Metadata

- Author: [[Alberto Artasanchez]]
- Full Title: AWS for Solutions Architects
- Category: #aws #software-architecture #cloud-computing

## Highlights

- the cloud is just a bunch of servers and other computing resources managed by a third-party provider in a data center somewhere. ([Location 615](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=615))
- One important characteristic of the leading cloud providers is the ability to quickly and frictionlessly provision resources. ([Location 620](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=620))
- Virtualization is the process of running multiple virtual instances on top of a physical computer system, using an abstraction layer that sits on top of the actual hardware. ([Location 652](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=652))
- A hypervisor is a computing layer that enables multiple operating systems to execute on the same physical compute resource. The operating systems running on top of these hypervisors are VMs – components that can emulate a complete computing environment using only software, as if they were running on bare metal. ([Location 657](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=657))
- Hypervisors, also known as Virtual Machine Monitors (VMMs), manage these VMs as they run side by side. A hypervisor creates a logical separation between VMs, and it provides each of them with a slice of the available compute, memory, and storage resources. ([Location 660](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=660))
- The cloud computing model is one that offers computing services such as compute, storage, databases, networking, software, machine learning, and analytics over the internet and on demand. ([Location 667](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=667))
- A private cloud just becomes a fancy name for a data center managed by a trusted third party, and all the elasticity benefits wither away. ([Location 701](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=701))
- Terraform by HashiCorp may be a better alternative since Terraform is cloud-agnostic. ([Location 772](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=772))
- There are multiple reasons why the cloud market is growing so fast. Some of them are listed here: elasticity, security, availability, faster hardware cycles, system administration staff, and faster time to market. ([Location 856](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=856))
- Elasticity may be one of the most important reasons for the cloud's popularity. ([Location 862](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=862))
- More formally defined, elasticity is the ability of a computing environment to adapt to changes in workload by automatically provisioning or shutting down computing resources to match the capacity needed by the current workload. ([Location 876](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=876))
- You probably have a better chance of getting into the Pentagon without a badge than getting into an Amazon data center. ([Location 906](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=906))
- Amazon employees with access to the building must authenticate themselves four times to step onto the data center floor. ([Location 914](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=914))
- In a cloud environment, spinning up new resources can take just a few minutes. So, we can configure minimal environments knowing that additional resources are a click away.
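The formal definition of elasticity above can be sketched in a few lines: compute how much capacity the current workload needs and adjust the fleet to match. The function name, thresholds, and limits below are illustrative, not an AWS API.

```python
import math

def desired_instances(requests_per_second: float,
                      capacity_per_instance: float,
                      min_instances: int = 1,
                      max_instances: int = 20) -> int:
    """Return the instance count needed to serve the current workload."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # Clamp to an allowed range so we never scale to zero or beyond budget.
    return max(min_instances, min(max_instances, needed))
```

An autoscaler would evaluate this periodically and provision or shut down instances to close the gap between the current and desired counts.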
([Location 956](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=956))
- To increase resilience, data centers have discrete Uninterruptible Power Supplies (UPSes) and onsite backup generators. ([Location 966](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=966))
- For example, whenever AWS offers new and more powerful processor types, using them is as simple as stopping an instance, changing the instance type, and starting the instance again. In many cases, AWS may keep the price the same even when better and faster processors and technology become available. ([Location 975](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=975))
- If there is one must-read white paper from AWS, it is the one titled AWS Well-Architected Framework, which spells out the five pillars of a well-architected framework. The full paper can be found here: https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf ([Location 985](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=985))
- First pillar – security ([Location 990](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=990))
- To enable system security and to guard against nefarious actors and vulnerabilities, AWS recommends these architectural principles ([Location 993](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=993)):
    - Always enable traceability.
    - Apply security at all levels.
    - Implement the principle of least privilege.
    - Secure the system at all levels: application, data, operating system, and hardware.
    - Automate security best practices.
- Second pillar – reliability ([Location 999](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=999))
- At any given time, there are at least six copies of any object stored in Amazon S3. ([Location 1003](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1003))
- The well-architected framework paper recommends these design principles to enhance reliability ([Location 1007](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1007)):
    - Continuously test backup and recovery processes.
    - Design systems so that they can automatically recover from a single component failure.
    - Leverage horizontal scalability whenever possible to enhance overall system availability.
    - Use automation to provision and shut down resources depending on traffic and usage to minimize resource bottlenecks.
    - Manage change with automation.
- Whenever possible, changes to the infrastructure should occur in an automated fashion. ([Location 1013](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1013))
- Third pillar – performance efficiency ([Location 1014](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1014))
- When it comes to performance efficiency, the recommended design best practices are as follows ([Location 1019](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1019)):
    - Democratize advanced technologies.
    - Take advantage of AWS's global infrastructure to deploy your application globally with minimal cost and to provide low latency.
    - Leverage serverless architectures wherever possible.
    - Deploy multiple configurations to see which one delivers better performance.
- Fourth pillar – cost optimization ([Location 1025](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1025))
- To enhance cost optimization, these principles are suggested:
    - Use a consumption model.
    - Leverage economies of scale whenever possible.
    - Reduce expenses by limiting the use of company-owned data centers.
    - Constantly analyze and account for infrastructure expenses.
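One reliability principle above – designing systems that automatically recover from a single component failure – can be illustrated with a toy fleet-healing routine. All names here are invented for the sketch; a real system would rely on health checks and an auto scaling service.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)

@dataclass
class Instance:
    id: int = field(default_factory=lambda: next(_ids))
    healthy: bool = True

def heal(fleet: list) -> list:
    """Replace every unhealthy instance with a fresh one, keeping fleet size constant."""
    return [inst if inst.healthy else Instance() for inst in fleet]

# Demo: one instance fails; healing restores a fully healthy fleet of the same size.
fleet = [Instance(), Instance(), Instance()]
fleet[1].healthy = False
healed = heal(fleet)
```

Run periodically, this loop is the essence of "automatically recover from a single component failure": no human intervenes, and healthy members are left untouched.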
([Location 1032](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1032))
- Whenever possible, use AWS-managed services instead of services that you need to manage yourself. This should lower your administration expenses. ([Location 1036](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1036))
- The operational excellence of a workload should be measured across these dimensions: agility, reliability, and performance. ([Location 1038](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1038))
- To achieve operational excellence, AWS recommends these principles ([Location 1042](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1042)):
    - Provision infrastructure through code (for example, via CloudFormation).
    - Align operations and applications with business requirements and objectives.
    - Change your systems by making incremental and regular changes.
    - Constantly test both normal and abnormal scenarios.
    - Record lessons learned from operational events and failures.
    - Write down the standard operating procedures manual and keep it up to date.
- Before we get to the best way to get certified, let's look at the worst way. Amazon offers extremely comprehensive documentation. You can find this documentation here: https://docs.aws.amazon.com/ ([Location 1174](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1174))
- AWS Regions exist in separate geographic areas. Each AWS Region comprises several independent and isolated data centers, dubbed AZs, that provide a full array of AWS services. ([Location 1377](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1377))
- Local Zones (LZs) can be thought of as mini-AZs that provide core services that are latency-sensitive. ([Location 1379](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=1379))
- The data lake pattern is an incredibly useful pattern in today's enterprises to overcome this challenge.
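The first operational-excellence principle, provisioning infrastructure through code, looks like this in its simplest form: a CloudFormation-style template expressed as data, so it can be versioned, reviewed, and applied repeatably. The single bucket resource and its properties are a minimal illustration, not a production template.

```python
import json

# A minimal CloudFormation-shaped template: one versioned S3 bucket.
# Logical name "DataLakeBucket" is arbitrary; AWS assigns the physical name.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal IaC example: one S3 bucket",
    "Resources": {
        "DataLakeBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"VersioningConfiguration": {"Status": "Enabled"}},
        }
    },
}

print(json.dumps(template, indent=2))
```

The JSON this emits is what you would hand to CloudFormation (or generate with a tool such as the CDK or Terraform) instead of clicking through the console.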
([Location 6269](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6269))
- A data lake is a centralized data repository that can contain structured, semi-structured, and unstructured data at any scale. ([Location 6296](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6296))
- Some of the benefits of having a data lake are as follows ([Location 6305](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6305)):
    - Increasing operational efficiency: Finding your data and deriving insights from it becomes easier with a data lake.
    - Making data more available across organizations and busting silos: Having a centralized location will enable everyone in the organization to have access to the same data, if they are authorized to access it.
    - Lowering transactional costs: Having the right data at the right time and with minimal effort will invariably result in lower costs.
    - Removing load from operational systems such as mainframes and data warehouses: Having a dedicated data lake will enable you to optimize it for analytical processing, and enable your operational systems to focus on their main mission of supporting day-to-day transactions and operations.
- An Aberdeen survey found that enterprises that deploy a data lake in their organization can outperform competitors by 9% in incremental revenue growth. ([Location 6315](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6315))
- Landing or transient data zone: This is a buffer used to temporarily host data as you prepare to permanently move it to the raw data zone defined next. It contains temporary data, such as a streaming spool, an interim copy, and other non-permanent data, before being ingested. ([Location 6324](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6324))
- Raw data zone: After quality checks and security transformations have been performed in the transient data zone, the data can be loaded into the raw data zone for permanent storage ([Location 6328](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6328)):
    - In the raw zone, files are transferred in their native format, without changing them or associating them with any business logic.
    - The only change applied to raw zone files is tagging to specify the source system.
    - All data in the lake should land in the raw zone initially.
    - Most users will not have access to the raw zone; mostly, it will be processes that copy data into the trusted data zone and curate it.
- Organized or trusted data zone: This is where the data is placed after it has been checked to comply with all government, industry, and corporate policies. It has also been checked for quality ([Location 6336](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6336)):
    - Terminology is standardized in the trusted data zone.
    - The trusted data zone serves as the single source of truth across the data lake for users and downstream systems.
    - Data stewards associate business terms with the technical metadata and can apply governance to the data.
    - No duplicate records should exist in the trusted data zone.
    - Normally, users only have read access in the trusted data zone.
- Curated or refined data zone: In this zone, data goes through more transformation steps. Files may be converted to a common format to facilitate access, and data quality checks are performed. The purpose of this process is to prepare the data to be in a format that can be more easily consumed and analyzed ([Location 6344](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6344)):
    - Transformed data is stored in this zone.
    - Data in this zone can be bucketed into topics, categories, and ontologies.
    - Refined data could be used by a broad audience but is not yet fully approved for public consumption across the organization. In other words, users beyond specific security groups may not be allowed to access refined data, since it has not yet been validated by all the necessary approvers.
- Sandboxes are an integral part of the data lake because they allow data scientists, analysts, and other users to manipulate, twist, and turn data to suit their individual use cases. Sandboxes are a play area where analysts can make data changes without affecting other users ([Location 6352](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6352)):
    - Authorized users can transfer data from other zones to their own personal sandbox zone.
    - Data in a personal private zone can be transformed, morphed, and filtered for private use without affecting the original data source.
- These characteristics can be measured and help us gauge the success or failure of a data lake ([Location 6373](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6373)):
    - Size: This is the "volume" in the often-mentioned three Vs of big data (volume, variety, velocity) – how big is the lake?
    - Governability: How easy is it to verify and certify the data in your lake?
    - Quality: What is the quality of the data contained in the lake? Are some records and files invalid? Are there duplicates? Can you determine the source and lineage of the data in the lake?
    - Usage: How many visitors, sources, and downstream systems does the lake have? How easy is it to populate and access the data in the lake?
    - Variety: Does the data that the lake holds have many types? Are there many types of data sources that feed the lake? Can the data in the lake be extracted in different ways and formats, such as files, Amazon S3, HDFS, traditional databases, NoSQL, and so on?
    - Speed: How quickly can you populate and access the lake?
    - Stakeholder and customer satisfaction: Users, downstream systems, and source systems are the data lake customers. We recommend periodically probing the data lake customers in a formal and measurable fashion – for example, with a survey – to get feedback on levels of satisfaction or dissatisfaction.
    - Security: Is the lake properly secured? Can only users with the proper access obtain data in the lake? Is data encrypted? Is Personally Identifiable Information (PII) properly masked for people without access?
- Faceted search enables end users to find resources using categorical metadata that has been previously assigned to the resource. ([Location 6450](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6450))
- A facet is an attribute/value pair. ([Location 6452](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6452))
- Amazon Kendra is a service that enables more relevant search results by using artificial intelligence to analyze user behavior, watching user navigation patterns to understand which resources should be given more importance for a given keyword and a given user. ([Location 6506](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6506))
- A common way to use machine learning classification is to let the machine learning algorithm take a first pass at the data and classify items for which the algorithm has a high level of confidence (say, 80%). ([Location 6510](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6510))
- Recommendation engines can suggest other documents or resources that a user might be interested in based on their previous search behavior. ([Location 6516](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6516))
- Natural Language Understanding (NLU) can make search applications much more robust. By using NLU, search queries can become much smarter.
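The facet definition above (an attribute/value pair) makes faceted search easy to sketch: keep only the documents that match every selected pair. The documents and facet names below are made up for illustration.

```python
docs = [
    {"title": "Q3 sales report", "department": "sales", "year": 2023},
    {"title": "Hiring policy",   "department": "hr",    "year": 2023},
    {"title": "Q4 sales report", "department": "sales", "year": 2024},
]

def facet_search(docs, facets):
    """Return documents matching every attribute/value pair in `facets`."""
    return [d for d in docs
            if all(d.get(attr) == val for attr, val in facets.items())]

# Narrowing by two facets at once, as a faceted UI would.
sales_2023 = facet_search(docs, {"department": "sales", "year": 2023})
```

A real search service would also report the counts per remaining facet value, so users can see how each additional filter narrows the results.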
([Location 6520](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6520))
- Entity extraction is a machine learning technique used to identify and classify key elements in a text, bucketing some of those elements into pre-defined categories. ([Location 6526](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6526))
- The silo mentality: Depending on your company culture, and regardless of how good your technology stack is, you might have a mindset roadblock among your ranks, where departments within the enterprise still have a tribal mentality and refuse to disseminate information outside of their domain. For this reason, when implementing your data lake, it is critical to ensure that this mentality does not persist in the new environment. Establishing a well-architected enterprise data lake can go a long way toward breaking down these silos. ([Location 6539](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6539))
- Raw data will not be valuable if it does not have structure and a connection to the business, and is not cleansed and deduplicated. If there isn't data governance built for the lake, users would be hard-pressed to trust the data in the lake. ([Location 6548](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6548))
- Data governance is the process that organizations use to make sure that the data used throughout the organization is of high quality, can be sourced, and can therefore be trusted. ([Location 6561](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6561))
- Data governance enables the identification of data ownership, which aids in understanding who has the answers if you have questions about the data. ([Location 6566](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6566))
- Data governance facilitates the adoption of data definitions and standards that help to relate technical metadata to business terms.
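Entity extraction as described above can be mimicked with plain regular expressions for illustration. A real system would use an ML model (Amazon Comprehend, for instance); the categories and patterns below are assumptions made only to make the bucketing idea concrete.

```python
import re

# Pre-defined categories, each with a toy pattern.
PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "YEAR":  r"\b(19|20)\d{2}\b",
}

def extract_entities(text):
    """Return {category: [matched elements]} for each pre-defined category."""
    return {cat: [m.group(0) for m in re.finditer(pat, text)]
            for cat, pat in PATTERNS.items()}

entities = extract_entities("Email ana@example.com about the 2021 audit")
```

The output buckets each recognized element of the text under its category, which is exactly what downstream search and governance tooling consumes.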
([Location 6568](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6568))
- Data governance aids in the remediation processes that need to be done for data by providing workflows and escalation procedures to report inaccuracies in data. ([Location 6573](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6573))
- Data governance allows us to assess the data's usability for a given business domain, which minimizes the likelihood of errors and inconsistencies when creating reports and deriving insights. ([Location 6575](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6575))
- Data governance enables the lockdown of sensitive data, and it helps you to implement controls on the authorized users of the data. ([Location 6579](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6579))
- ACL: The access control list for the resource (allow or, in rare cases, deny). ([Location 6584](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6584))
- Owner: The party responsible for this resource. ([Location 6586](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6586))
- Date created: The date the resource was created. ([Location 6587](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6587))
- Data source and lineage: The origin and lineage path for the resource. ([Location 6589](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6589))
- Job name: The name of the job that ingested and/or transformed the file. ([Location 6594](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6594))
- Data quality: For some of the data in the lake, data quality metrics will be applied after the data is loaded, and the data quality score will be recorded in the metadata. ([Location 6595](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6595))
- Format type: With some file formats, it is not immediately apparent what the format of the file is. ([Location 6599](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6599))
- File structure: In the case of JSON, XML, and similar semi-structured formats, a reference to a metadata definition can be useful. ([Location 6602](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6602))
- Approval and certification: Once a file has been validated by either automated or manual processes, the associated metadata indicating this approval and certification will be appended to the metadata. ([Location 6603](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6603))
- Business term mappings: Any technical metadata items, such as tables and columns, always have a corresponding business term associated with them. ([Location 6606](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6606))
- PII, General Data Protection Regulation (GDPR), confidential, restricted, and other flags and labels ([Location 6612](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6612))
- Physical structure, redundancy checks, and job validation ([Location 6614](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6614))
- Data business purpose and reason ([Location 6616](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6616))
- Data domain and meaning ([Location 6618](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6618))
- There are a variety of ways that data governance metadata can be tracked. The recommended approaches are as follows: S3 metadata, S3 tags, or an enhanced data catalog or vendor tool to maintain this information. ([Location 6620](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6620))
- Metrics to gauge the success of your data lake ([Location 6633](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6633))
- Size: You may want to track two measurements: total lake size and trusted zone size. ([Location 6636](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6636))
- Governability: This might be a difficult characteristic to measure, but it's an important one. Not all data must be governed. The critical data needs to be identified, and a governance layer should be added on top of it. ([Location 6643](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6643))
- Data that is deemed critical to track is dubbed a Critical Data Element (CDE). ([Location 6647](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6647))
- Quality: Data quality does not need to be perfect. ([Location 6651](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6651))
- Usage: Borrowing a term from the internet, you might want to track the number of page requests, the number of visits, and the number of visitors to your data lake in general. ([Location 6655](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6655))
- AWS provides a convenient way to track your usage metrics: running SQL queries with Amazon Athena directly against your AWS CloudTrail logs. ([Location 6659](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6659))
- Variety: Measure the variety of a couple of components of the data lake. ([Location 6661](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6661))
- Speed: There are two useful measurements to use when it comes to speed.
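Since S3 tags are one of the recommended places to keep governance metadata, here is a sketch that shapes a flat metadata dict into the `TagSet` structure that S3 tagging APIs accept. The field names and values are illustrative, not a mandated schema.

```python
def to_tag_set(metadata):
    """Convert a flat governance-metadata dict into an S3-style TagSet."""
    # S3 tag values must be strings, hence str() on every value.
    return {"TagSet": [{"Key": k, "Value": str(v)} for k, v in metadata.items()]}

# Hypothetical governance metadata for one object in the trusted zone.
tags = to_tag_set({
    "owner": "data-engineering",
    "zone": "trusted",
    "pii": "false",
    "quality-score": 0.97,
})
```

With boto3, a dict in this shape could then be passed as the `Tagging` argument of `put_object_tagging`; the catalog-based approaches mentioned above would store the same fields in a catalog instead.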
([Location 6668](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6668))
- Customer satisfaction: Other than security, this might be one of the most important metrics to continuously track. ([Location 6676](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6676))
- Security: Compromising on your security metrics is normally not an option. It is paramount to ensure that the data lake is secure and users have access only to their data. ([Location 6685](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6685))
- To minimize this risk, AWS offers Amazon Macie, which can automatically scan your data lake to locate and flag errant PII in your repositories. ([Location 6690](https://readwise.io/to_kindle?action=open&asin=B08MQ28DMY&location=6690))
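Once a service like Macie has flagged errant PII, a downstream step can mask it for users without access. A toy sketch with simple substitutions follows; real PII detection is far more involved, and these two patterns are illustrative only.

```python
import re

def mask_pii(text):
    """Mask email addresses and SSN-like numbers for unauthorized viewers."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)   # emails
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "<SSN>", text)       # SSN-shaped IDs
    return text

masked = mask_pii("Reach bob@x.io or 123-45-6789")
```

Applied at read time, per requester, this preserves the raw data in the lake while showing redacted values to anyone outside the authorized group.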