Department of Electronics and Communication Engineering
Chalapathi Institute of Technology
(Approved by AICTE and permanently affiliated to JNTUK Kakinada)
Mothadaka, Guntur - 522016, AP, India

CERTIFICATE
This is to certify that the internship report entitled "AWS Data Engineering" is being submitted by SHAIK ABRAHAM EMMANUEL RAJ of 4th year ECE, Roll No. 22HT5A0413.
Signature of ____________          Signature of Examiner

INTERNSHIP REPORT ON AWS DATA ENGINEERING
Data Engineering
Submitted by: Shaik Abraham Emmanuel Raj
Roll No: 22HT5A0413
Branch: Electronics and Communication Engineering

Framing the ML problem:
*What is the business problem?
*What pain is it causing?
*Why does this problem need to be resolved?
*What will happen if you don't solve this problem?
*How will you measure success?

Example: ML problem framing
A data scientist for AnyCompany worked with a domain expert for car insurance claims who identified a set of relational database tables with information about the prior year's claims. The domain expert also has the information necessary to identify which claims turned out to be fraudulent.
*They have approximately 1,000 available claims records from the previous 12-month period.
*The availability of labeled data makes this a good candidate for supervised learning.
*The target (is this fraud?) is a binary classification problem (the answer is one of two choices: yes or no).
*The data scientist plans to use an open-source binary classification algorithm.
Fig- 8.5: ML Framing

Collecting Data: Key steps in collecting data to be used in ML:
*Protect data veracity.
*Collect enough data to train and test the ML model and ingest it into the pipeline.
*Apply labels to training data with known targets.
Fig- 8.6: Collecting Data

WEEK- 8: PROCESSING BIG DATA & DATA FOR ML
8.1 Big Data Processing Concepts:
(Figure: big data processing concepts — frequently accessed versus infrequently accessed data; examples include Amazon EMR and Apache Hadoop.)

Apache Hadoop:
Fig- 8.2: Apache Hadoop

Apache Spark:
Apache Spark characteristics:
*Is an open-source, distributed processing framework
*Uses in-memory caching and optimized query processing
*Supports code reuse across multiple workloads
*Clusters consist of leader and worker nodes

Amazon EMR characteristics:
*Managed cluster platform
*Big data solution for petabyte-scale data processing, interactive analytics, and machine learning

WEEK- 7: STORING AND ORGANIZING DATA
7.1 Storage in the modern data architecture:
Fig- 7.1: Storage in modern architecture
Data in cloud object storage is handled as objects. Each object is assigned a key, which is a unique identifier. When the key is paired with metadata that is attached to the object, other AWS services can use the information to unlock a multitude of capabilities (see the short boto3 sketch at the end of this section). Thanks to economies of scale, cloud object storage comes at a lower cost than traditional storage.

Data Warehouse Storage:
*Provides a centralized repository
*Stores structured and semi-structured data
*Stores data in one of two ways: frequently accessed data in fast storage, infrequently accessed data in cheap storage
*Might contain multiple databases that are organized into tables and columns
*Separates analytics processing from transactional databases
*Example: Amazon Redshift

Purpose-Built Databases:
*ETL pipelines transform data in buffered memory prior to loading the data into a data lake or data warehouse for storage.
*ELT pipelines extract and load data into a data lake or data warehouse for storage without transformation.
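The following is the boto3 sketch referenced above, illustrating how an object's key and its attached metadata travel together in object storage. The bucket name, object key, and metadata values are hypothetical examples and are not part of the course material.

    import boto3

    s3 = boto3.client("s3")

    # Store an object under a unique key and attach user-defined metadata to it.
    s3.put_object(
        Bucket="example-data-lake",            # hypothetical bucket name
        Key="raw/claims/2024/claims-01.csv",   # the object's unique key
        Body=b"claim_id,amount\n1001,2500\n",
        Metadata={"source-system": "claims-db", "ingest-date": "2024-10-20"},
    )

    # Other services and later pipeline stages can read the key and metadata back.
    head = s3.head_object(Bucket="example-data-lake", Key="raw/claims/2024/claims-01.csv")
    print(head["Metadata"])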
Here are a few key points to summarize this section. Storage plays an integral part in ETL and ELT pipelines, and data often moves in and out of storage numerous times, based on the pipeline type and workload type. ETL pipelines transform data in buffered memory prior to loading it into a data lake or data warehouse for storage; the level of buffered memory varies by service. ELT pipelines extract and load data into data lake or data warehouse storage without transformation; the transformation of the data is part of the target system's workload.

Securing Storage:
Security for a data warehouse in Amazon Redshift:
*Amazon Redshift database security is distinct from the security of the service itself.
*Amazon Redshift provides additional features to manage database security.
*Due to third-party auditing, Amazon Redshift can help to support applications that are required to meet international compliance standards.

WEEK- 5: INGESTING & PREPARING DATA
5.1 ETL and ELT comparison:
Fig- 5.1: ETL & ELT comparison

Data wrangling: Transforming large amounts of unstructured or structured raw data from multiple sources with different schemas into a meaningful set of data that has value for downstream processes or users.

Data Structuring: For the scenario that was described previously, the structuring step includes exporting a JSON file from the customer support ticket system, loading the JSON file into Excel, and letting Excel parse the file. For the mapping step for the supp2 data, the data engineer would modify the cust_num field to match the customer_id field in the data warehouse. For this example, you would perform additional data wrangling steps before compressing the file for upload to the S3 bucket.

Data Cleaning: It includes:
*Remove unwanted data.
*Fix missing data values.
*Convert or modify data types.
*Fix outliers.

WEEK- 4: SECURING & SCALING DATA PIPELINE
4.1 Scaling: An overview:
Types of scaling
Fig- 4.1: Types of Scaling

4.2 Creating a scalable infrastructure:
Fig- 4.2: Template Structure
Fig- 4.3: AWS CloudFormation
AWS CloudFormation is a fully managed service that provides a common language for you to describe and provision all of the infrastructure resources in your cloud environment. CloudFormation creates, updates, and deletes the resources for your applications in environments called stacks. A stack is a collection of AWS resources that are managed as a single unit. CloudFormation is all about automated resource provisioning: it simplifies the task of repeatedly and predictably creating groups of related resources that power your applications. Resources are written in text files by using JSON or YAML format.

Additional transformations are applied to make the data conform to requirements that are established for the trusted zone. Finally, the processing layer prepares the data for the curated zone by modeling and augmenting it to be joined with other datasets (enrichment) and then stores the transformed, validated data in the curated layer. Datasets from the curated layer are ready to be ingested into the data warehouse to make them available for low-latency access or complex SQL querying.

Streaming analytics pipeline: Producers ingest records onto the stream. Producers are integrations that collect data from a source and load it onto the stream. Consumers process records; they read data from the stream and perform their own processing on it. The stream itself provides a temporary but durable storage layer for the streaming solution.
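As a rough illustration of the producer and consumer roles described above, the following boto3 sketch puts a record onto a Kinesis data stream and reads it back. The stream name and record contents are hypothetical, and error handling and shard management are omitted.

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # Producer: collect a record from a source and load it onto the stream.
    kinesis.put_record(
        StreamName="pipeline-events",   # hypothetical stream name
        Data=json.dumps({"event": "order_created", "order_id": 123}).encode("utf-8"),
        PartitionKey="order-123",
    )

    # Consumer: read records from the stream and perform processing on them.
    shard_id = kinesis.describe_stream(StreamName="pipeline-events")["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName="pipeline-events",
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    for record in kinesis.get_records(ShardIterator=iterator)["Records"]:
        print(json.loads(record["Data"]))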
In the pipeline that is depicted in this slide, Amazon CloudWatch Events is the producer that puts CloudWatch Events event data onto the stream. Kinesis Data Streams provides the storage. The data is then available to multiple consumers.

WEEK- 3: THE ELEMENTS OF DATA, DESIGN PRINCIPLES & PATTERNS FOR DATA PIPELINES
3.1 The five Vs of data - volume, velocity, variety, veracity & value:
Data characteristics that drive infrastructure decisions:
*Volume: How big is the dataset? How much new data is generated?
*Velocity: How frequently is new data generated and ingested?
*Variety: What types and formats? How many different sources does the data come from?
*Veracity: How accurate, precise, and trusted is the data?
*Value: What insights can be pulled from the data?
Fig- 3.1: Data Characteristics

The evolution of data architectures: So, which of these data stores or data architectures is the best one for your data pipeline? The reality is that a modern architecture might include all of these elements. The key to a modern data architecture is to apply the three-pronged strategy that you learned about earlier: modernize the technology that you are using, unify your data sources to create a single source of truth that can be accessed and used across the organization, and innovate to get higher-value analysis from the data that you have.

Modern data architecture on AWS: The architecture illustrates the following other AWS purpose-built services that integrate with Amazon S3 and map to each component that was described on the previous slide:
*Amazon Redshift is a fully managed data warehouse service.
*Amazon OpenSearch Service is a purpose-built data store and search engine that is optimized for real-time analytics, including log analytics.
*Amazon EMR provides big data processing and simplifies some of the most complex elements of setting up big data processing.
*Amazon Aurora provides a relational database engine that was built for the cloud.
*Amazon DynamoDB is a fully managed nonrelational database that is designed to run high-performance applications.
*Amazon SageMaker is an AI/ML service that democratizes access to the ML process.

3.2 Modern data architecture pipeline: Ingestion and storage:
Data being ingested into the Amazon S3 data lake arrives at the landing zone, where it is first cleaned and stored into the raw zone for permanent storage. Because data that is destined for the data warehouse needs to be highly trusted and conformed to a schema, the data needs to be processed further; additional transformations would include applying the schema and partitioning (structuring), as well as other transformations that are required to make the data conform to the requirements that are established for the trusted zone.

WEEK- 2: DATA DRIVEN ORGANIZATIONS
2.1 Data Driven Decisions: How do organizations decide...
*Which of these customer transactions should be flagged as fraud?
*Which webpage design leads to the most completed sales?
*Which patients are most likely to have a relapse?
*Which type of online activity represents a security issue?
*When is the optimum time to harvest this year's crop?

2.2 The data pipeline - infrastructure for data-driven decisions:
Fig- 2.1: Data Pipeline
Another key characteristic of deriving insights by using your data pipeline is that the process will almost always be iterative. You have a hypothesis about what you expect to find in the data, and you need to experiment and see where it takes you. You might develop your hypothesis by using BI tools to do initial discovery and analysis of data that has already been collected.
You might iterate within a pipeline segment, or you might iterate across the entire pipeline. For example, in this illustration, the initial iteration (number 1) yielded a result that wasn't as defined as was desired. Therefore, the data scientist refined the model and reprocessed the data to get a better result (number 2). After reviewing those results, they determined that additional data could improve the detail available in their result, so an additional data source was tapped and ingested through the pipeline to produce the desired result (number 3). A pipeline often has iterations of storage and processing. For example, after the external data is ingested into pipeline storage, iterative processing transforms the data into different levels of refinement for different needs.

COURSE MODULES

WEEK- 1: OVERVIEW OF AWS ACADEMY DATA ENGINEERING
Course objective: This course prepares you to do the following:
*Summarize the role and value of data science in a data-driven organization.
*Recognize how the elements of data influence decisions about the infrastructure of a data pipeline.
*Illustrate a data pipeline by using AWS services to meet a generalized use case.
*Identify the risks and approaches to secure and govern data at each step and each transition of the data pipeline.
*Identify scaling considerations and best practices for building pipelines that handle large-scale datasets.
*Design and build a data collection process while considering constraints such as scalability, cost, fault tolerance, and latency.

Code generation:
*Code suggestions
*Code completion
*Code generation from comments
*Alternate code suggestions
*Option to accept or reject suggestions
*Reference tracking for code that resembles open-source training data
The example in the figure generates boto3 code from a comment:
    import boto3
    # Create an S3 bucket named ew59323
    s3 = boto3.resource('s3')
    s3.create_bucket(Bucket='ew59323')
    # Upload a file to the bucket
Fig- 1.1: Code Generation

Open Code Reference Log: CodeWhisperer learns from open-source projects, and the code it suggests might occasionally resemble code samples from the training data. With the reference log, you can view references to code suggestions that are similar to the training data. When such occurrences happen, CodeWhisperer notifies you and provides repository and licensing information. Use this information to make decisions about whether to use the code in your project and properly attribute the source code as desired.

Benefits of Amazon CodeWhisperer:
Value to organizations:
*Increase velocity.
*Use at all experience levels.
*Spend less time writing code.
*Support open-source attribution.
*Receive help directly within your IDE.
*Reduce the risk of security vulnerabilities.
*Find security vulnerabilities in your code.
*Increase code quality and developer productivity.
Fig- 1.2: Benefits of Amazon CodeWhisperer

CodeWhisperer code generation offers many benefits for software development organizations. It accelerates application development for faster delivery of software solutions. By automating repetitive tasks, it optimizes the use of developer time, so developers can focus on more critical aspects of the project. Additionally, code generation helps mitigate security vulnerabilities, safeguarding the integrity of the codebase. CodeWhisperer also protects open-source intellectual property by providing the open-source reference tracker.
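The snippet in Fig- 1.1 stops at the comment about uploading a file. Purely as an illustration of what a generated completion might look like (this is not the slide's own output, and the local file name and object key are hypothetical):

    import boto3

    s3 = boto3.resource("s3")
    s3.create_bucket(Bucket="ew59323")  # bucket name as it appears in the slide's snippet

    # Upload a local file to the bucket under a chosen object key (names are examples).
    s3.Bucket("ew59323").upload_file("report.pdf", "uploads/report.pdf")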
CodeWhisperer enhances code quality and reliability, leading to robust and efficient applications. It also supports an efficient response to evolving software threats, keeping the codebase up to date with the latest security practices. CodeWhisperer has the potential to increase development speed, security, and the quality of software.

WEEK- 6: INGESTING BY BATCH OR BY STREAM
6.1 Comparing batch and stream ingestion:
Batch and streaming ingestion data flow
Fig- 6.1: Batch & Streaming Ingestion
To generalize the characteristics of batch processing, batch ingestion involves running batch jobs that query a source, move the resulting dataset or datasets to durable storage in the pipeline, and then perform whatever transformations are required for the use case. As noted in the Ingesting and Preparing Data module, this could be just cleaning and minimally formatting data to put it into the lake. Or, it could be more complex enrichment, augmentation, and processing to support complex querying or big data and machine learning (ML) applications. Batch processing might be started on demand, run on a schedule, or initiated by an event. Traditional extract, transform, and load (ETL) uses batch processing, but extract, load, and transform (ELT) processing might also be done by batch.

Batch Ingestion Processing: The process of transporting data from one or more sources to a target site for further processing and analysis. This data can originate from a range of sources, including data lakes, IoT devices, on-premises databases, and SaaS apps, and end up in different target environments, such as cloud data warehouses or data marts.

Purpose-Built Ingestion Tools:
Fig- 6.2: Purpose-Built Ingestion Tools
Use Amazon AppFlow to ingest data from a software as a service (SaaS) application. You can do the following with Amazon AppFlow:
*Create a connector that reads from a SaaS source and includes filters.
*Map fields in each source object to fields in the destination and perform transformations.
*Perform validation on records to be transferred.
*Securely transfer to Amazon S3 or Amazon Redshift.
You can trigger an ingestion on demand, on an event, or on a schedule. An example use case for Amazon AppFlow is to ingest customer support ticket data from the Zendesk SaaS product.
Other purpose-built tools:
*Process data for analytics and BI workloads using big data frameworks.
*Transform and move large amounts of data into and out of AWS data stores.

ML Concepts: Algorithms are used to train ML models. There are three general types of ML models.
Fig- 8.3: ML models (the figure contrasts models that are given inputs with expected targets and models that find patterns on their own)

ML Life Cycle: The ML lifecycle defines a set of iterative phases.
Fig- 8.4: ML life cycle

Framing the ML problem to meet business goals: working backwards from the business problem to be solved.

WEEK- 9: ANALYZING & VISUALIZING DATA
Consideration factors that influence tool selection:
Factors to consider when selecting tools include understanding the type and quality of the data and how it needs to be processed.
Fig- 9.1: Factors & needs

Data characteristics:
*How much data is there?
*At what speed and volume does it arrive?
*How frequently is it updated?
*How quickly is it processed?
*What type of data is it?

9.2 Comparing AWS tools and services:
For accessibility: Data from multiple sources is put in Amazon S3, where Athena can be used for one-time queries (see the short query sketch below).
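As a minimal sketch of such a one-time Athena query, assuming a database named data_lake_db, a table named support_tickets, and an S3 location for query results (all hypothetical placeholders):

    import time
    import boto3

    athena = boto3.client("athena")

    # Submit a one-time SQL query against data that sits in Amazon S3.
    query = athena.start_query_execution(
        QueryString="SELECT status, COUNT(*) AS tickets FROM support_tickets GROUP BY status",
        QueryExecutionContext={"Database": "data_lake_db"},
        ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
    )
    execution_id = query["QueryExecutionId"]

    # Poll until the query finishes, then read the result rows.
    while athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows)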
Amazon EMR aggregates the data and stores the aggregates in S3. Athena can be used to query the aggregated datasets. From S3, the data can be used in Amazon Redshift, where QuickSight can access the data to create visualizations.
Fig- 9.2: QuickSight Example

WEEK- 10: AUTOMATING THE PIPELINE
Automating infrastructure deployment:
Fig- 10.1: Automating Infrastructure
If you build infrastructure with code, you gain the benefits of repeatability and reusability while you build your environments. In the example shown, a single template is used to deploy Network Load Balancers and Auto Scaling groups that contain Amazon Elastic Compute Cloud (Amazon EC2) instances. Network Load Balancers distribute traffic evenly across targets.

CI/CD:
CI/CD can be pictured as a pipeline, where new code is submitted on one end, tested over a series of stages (source, build, test, staging, and production), and then published as production-ready code.

Automating with Step Functions:
How Step Functions works
Fig- 10.2: Step Functions
*With Step Functions, you can use visual workflows to coordinate the components of distributed applications and microservices.
*You define a workflow, which is also referred to as a state machine, as a series of steps and transitions between each step.
*Step Functions is integrated with Athena to facilitate building workflows that include Athena queries and data processing operations.
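To show how such a workflow could be kicked off programmatically, here is a minimal boto3 sketch that starts a Step Functions state machine execution; the state machine ARN and the input payload are hypothetical placeholders, not part of the course material.

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    # Start an execution of a previously defined state machine (workflow).
    execution = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:example-pipeline",
        input=json.dumps({"date": "2024-10-20"}),
    )

    # The execution steps through the defined states (for example, an Athena query
    # followed by downstream processing); its status can be checked at any time.
    status = sfn.describe_execution(executionArn=execution["executionArn"])["status"]
    print(status)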
