AWS Certified Data Engineer Associate - Exam Guide
Introduction
The AWS Certified Data Engineer - Associate (DEA-C01) exam validates a candidate’s
ability to implement data pipelines and to monitor, troubleshoot, and optimize cost
and performance issues in accordance with best practices.
The exam also validates a candidate’s ability to complete the following tasks:
• Ingest and transform data, and orchestrate data pipelines while applying
programming concepts.
• Choose an optimal data store, design data models, catalog data schemas, and
manage data lifecycles.
• Operationalize, maintain, and monitor data pipelines. Analyze data and ensure
data quality.
• Implement appropriate authentication, authorization, data encryption, privacy,
and governance. Enable logging.
Recommended general IT knowledge
• Setup and maintenance of extract, transform, and load (ETL) pipelines from
ingestion to destination
• Application of high-level but language-agnostic programming concepts as
required by the pipeline
• How to use Git commands for source control
• How to use data lakes to store data
• General concepts for networking, storage, and compute
Recommended AWS knowledge
• How to use AWS services to accomplish the tasks listed in the Introduction
section of this exam guide
• An understanding of the AWS services for encryption, governance, protection,
and logging of all data that is part of data pipelines
• The ability to compare AWS services to understand the cost, performance, and
functional differences between services
• How to structure SQL queries and how to run SQL queries on AWS services
• An understanding of how to analyze data, verify data quality, and ensure data
consistency by using AWS services
Job tasks that are out of scope for the target candidate
The following list contains job tasks that the target candidate is not expected to be
able to perform. This list is non-exhaustive. These tasks are out of scope for the exam:
Refer to the Appendix for a list of in-scope AWS services and features and a list of
out-of-scope AWS services and features.
Exam content
Response types
• Multiple choice: Has one correct response and three incorrect responses
(distractors)
• Multiple response: Has two or more correct responses out of five or more
response options
Unanswered questions are scored as incorrect; there is no penalty for guessing. The
exam includes 50 questions that affect your score.
Unscored content
The exam includes 15 unscored questions that do not affect your score. AWS collects
information about performance on these unscored questions to evaluate these
questions for future use as scored questions. These unscored questions are not
identified on the exam.
Exam results
The AWS Certified Data Engineer - Associate (DEA-C01) exam has a pass or fail
designation. The exam is scored against a minimum standard established by AWS
professionals who follow certification industry best practices and guidelines.
Your results for the exam are reported as a scaled score of 100–1,000. The minimum
passing score is 720. Your score shows how you performed on the exam as a whole
and whether you passed. Scaled scoring models help equate scores across multiple
exam forms that might have slightly different difficulty levels.
Your score report could contain a table of classifications of your performance at each
section level. The exam uses a compensatory scoring model, which means that you do
not need to achieve a passing score in each section. You need to pass only the overall
exam.
Each section of the exam has a specific weighting, so some sections have more
questions than other sections have. The table of classifications contains general
information that highlights your strengths and weaknesses. Use caution when you
interpret section-level feedback.
This exam guide includes weightings, content domains, and task statements for the
exam. This guide does not provide a comprehensive list of the content on the exam.
However, additional context for each task statement is available to help you prepare
for the exam.
Knowledge of:
• Throughput and latency characteristics for AWS services that ingest data
• Data ingestion patterns (for example, frequency and data history)
• Streaming data ingestion
• Batch data ingestion (for example, scheduled ingestion, event-driven
ingestion)
• Replayability of data ingestion pipelines
• Stateful and stateless data transactions
Skills in:
• Reading data from streaming sources (for example, Amazon Kinesis, Amazon
Managed Streaming for Apache Kafka [Amazon MSK], Amazon DynamoDB
Streams, AWS Database Migration Service [AWS DMS], AWS Glue, Amazon
Redshift)
• Reading data from batch sources (for example, Amazon S3, AWS Glue,
Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon
AppFlow)
• Implementing appropriate configuration options for batch ingestion
• Consuming data APIs
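Replayability usually depends on ingestion being idempotent: processing the same record twice must leave the destination unchanged. The following is a minimal Python sketch of that idea, not an AWS API. It assumes each record carries a unique `event_id` field (a hypothetical name) and uses an in-memory dictionary as a stand-in for the destination store.

```python
from typing import Dict, Iterable, List


def ingest(batch: Iterable[dict], store: Dict[str, dict]) -> int:
    """Write records keyed by event_id and return how many were new."""
    new_records = 0
    for record in batch:
        key = record["event_id"]  # hypothetical unique identifier
        if key not in store:
            new_records += 1
        # Upsert by key, so a replayed record simply overwrites itself.
        store[key] = record
    return new_records


batch: List[dict] = [
    {"event_id": "a1", "value": 10},
    {"event_id": "a2", "value": 20},
]
store: Dict[str, dict] = {}
first_run = ingest(batch, store)
replay_run = ingest(batch, store)  # replay the same batch
print(first_run, replay_run, len(store))
```

Because writes are keyed by the identifier, replaying a batch after a failure adds no duplicates.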
Knowledge of:
• Creation of ETL pipelines based on business requirements
• Volume, velocity, and variety of data (for example, structured data,
unstructured data)
• Cloud computing and distributed computing
• How to use Apache Spark to process data
• Intermediate data staging locations
Skills in:
• Optimizing container usage for performance needs (for example, Amazon
Elastic Kubernetes Service [Amazon EKS], Amazon Elastic Container Service
[Amazon ECS])
• Connecting to different data sources (for example, Java Database
Connectivity [JDBC], Open Database Connectivity [ODBC])
• Integrating data from multiple sources
• Optimizing costs while processing data
• Implementing data transformation services based on requirements (for
example, Amazon EMR, AWS Glue, Lambda, Amazon Redshift)
• Transforming data between formats (for example, from .csv to Apache
Parquet)
• Troubleshooting and debugging common transformation failures and
performance issues
• Creating data APIs to make data available to other systems by using AWS
services
Knowledge of:
• How to integrate various AWS services to create ETL pipelines
• Event-driven architecture
• How to configure AWS services for data pipelines based on schedules or
dependencies
• Serverless workflows
Skills in:
• Using orchestration services to build workflows for data ETL pipelines (for
example, Lambda, EventBridge, Amazon Managed Workflows for Apache
Airflow [Amazon MWAA], AWS Step Functions, AWS Glue workflows)
• Building data pipelines for performance, availability, scalability, resiliency,
and fault tolerance
• Implementing and maintaining serverless workflows
• Using notification services to send alerts (for example, Amazon Simple
Notification Service [Amazon SNS], Amazon Simple Queue Service [Amazon
SQS])
Knowledge of:
• Continuous integration and continuous delivery (CI/CD) (implementation,
testing, and deployment of data pipelines)
• SQL queries (for data source queries and data transformations)
• Infrastructure as code (IaC) for repeatable deployments (for example, AWS
Cloud Development Kit [AWS CDK], AWS CloudFormation)
• Distributed computing
• Data structures and algorithms (for example, graph data structures and tree
data structures)
• SQL query optimization
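Graph data structures appear in pipeline work when steps depend on one another: the steps form a directed acyclic graph, and a topological sort yields a valid run order. The following is a minimal Python sketch that uses the standard library `graphlib` module (Python 3.9 or later). The step names are hypothetical and do not come from any AWS service.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps mapped to the steps they depend on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform", "quality_check"},
}

# static_order() yields each step only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Orchestration tools apply the same principle when they schedule tasks based on declared dependencies.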
Knowledge of:
• Storage platforms and their characteristics
• Storage services and configurations for specific performance demands
• Data storage formats (for example, .csv, .txt, Parquet)
• How to align data storage with data migration requirements
• How to determine the appropriate storage solution for specific access
patterns
• How to manage locks to prevent access to data (for example, Amazon
Redshift, Amazon RDS)
Skills in:
• Implementing the appropriate storage services for specific cost and
performance requirements (for example, Amazon Redshift, Amazon EMR,
AWS Lake Formation, Amazon RDS, DynamoDB, Amazon Kinesis Data
Streams, Amazon MSK)
• Configuring the appropriate storage services for specific access patterns and
requirements (for example, Amazon Redshift, Amazon EMR, Lake
Formation, Amazon RDS, DynamoDB)
Knowledge of:
• How to create a data catalog
• Data classification based on requirements
• Components of metadata and data catalogs
Skills in:
• Using data catalogs to consume data from the data’s source
• Building and referencing a data catalog (for example, AWS Glue Data
Catalog, Apache Hive metastore)
• Discovering schemas and using AWS Glue crawlers to populate data
catalogs
• Synchronizing partitions with a data catalog
• Creating new source or target connections for cataloging (for example, AWS
Glue)
Knowledge of:
• Appropriate storage solutions to address hot and cold data requirements
• How to optimize the cost of storage based on the data lifecycle
• How to delete data to meet business and legal requirements
• Data retention policies and archiving strategies
• How to protect data with appropriate resiliency and availability
Knowledge of:
• Data modeling concepts
• How to ensure accuracy and trustworthiness of data by using data lineage
• Best practices for indexing, partitioning strategies, compression, and other
data optimization techniques
• How to model structured, semi-structured, and unstructured data
• Schema evolution techniques
Skills in:
• Designing schemas for Amazon Redshift, DynamoDB, and Lake Formation
• Addressing changes to the characteristics of data
• Performing schema conversion (for example, by using the AWS Schema
Conversion Tool [AWS SCT] and AWS DMS Schema Conversion)
• Establishing data lineage by using AWS tools (for example, Amazon
SageMaker ML Lineage Tracking)
Knowledge of:
• How to maintain and troubleshoot data processing for repeatable business
outcomes
• API calls for data processing
• Which services accept scripting (for example, Amazon EMR, Amazon
Redshift, AWS Glue)
Knowledge of:
• Tradeoffs between provisioned services and serverless services
• SQL queries (for example, SELECT statements with multiple qualifiers or
JOIN clauses)
• How to visualize data for analysis
• When and how to apply cleansing techniques
• Data aggregation, rolling average, grouping, and pivoting
Skills in:
• Visualizing data by using AWS services and tools (for example, AWS Glue
DataBrew, Amazon QuickSight)
• Verifying and cleaning data (for example, Lambda, Athena, QuickSight,
Jupyter Notebooks, Amazon SageMaker Data Wrangler)
• Using Athena to query data or to create views
• Using Athena notebooks that use Apache Spark to explore data
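The SQL and aggregation concepts listed above can be practiced locally. The following is a minimal Python sketch that uses the standard library `sqlite3` module as a stand-in for a query service. The table names, column names, and values are hypothetical, and SQL dialects differ between SQLite, Athena, and Amazon Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO orders VALUES (1, 10.0), (1, 30.0), (2, 50.0), (3, 20.0);
""")

# JOIN with a WHERE filter, then group and aggregate per region.
rows = conn.execute("""
    SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount >= 20.0
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(rows)


def rolling_average(values, window):
    """Mean of each consecutive run of `window` values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


averages = rolling_average([10, 20, 30, 40], window=2)
print(averages)
```

The query filters before it groups, so the 10.0 order is excluded from both the count and the total.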
Knowledge of:
• How to log application data
• Best practices for performance tuning
• How to log access to AWS services
• Amazon Macie, AWS CloudTrail, and Amazon CloudWatch
Knowledge of:
• Data sampling techniques
• How to implement data skew mechanisms
• Data validation (data completeness, consistency, accuracy, and integrity)
• Data profiling
Skills in:
• Running data quality checks while processing the data (for example,
checking for empty fields)
• Defining data quality rules (for example, AWS Glue DataBrew)
• Investigating data consistency (for example, AWS Glue DataBrew)
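A check for empty fields, the example given above, can be illustrated without any AWS service. The following is a minimal Python sketch that scans CSV text and reports blank or whitespace-only fields. The sample data and the function name `find_empty_fields` are hypothetical, and the sketch assumes every row has the same number of columns as the header.

```python
import csv
import io

# Hypothetical sample data; the second and third records have empty fields.
raw = "id,name,email\n1,Ana,ana@example.com\n2,,bo@example.com\n3,Cy,  \n"


def find_empty_fields(csv_text):
    """Return (line_number, column) pairs for blank or whitespace-only fields."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, value in row.items():
            if value is None or value.strip() == "":
                problems.append((line_number, column))
    return problems


problems = find_empty_fields(raw)
print(problems)
```

A pipeline could reject or quarantine the flagged rows before loading the remaining data.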
Knowledge of:
• VPC security networking concepts
• Differences between managed services and unmanaged services
• Authentication methods (password-based, certificate-based, and role-based)
• Differences between AWS managed policies and customer managed policies
Skills in:
• Creating custom IAM policies when a managed policy does not meet the
needs
• Storing application and database credentials (for example, Secrets Manager,
AWS Systems Manager Parameter Store)
• Providing database users, groups, and roles access and authority in a
database (for example, for Amazon Redshift)
• Managing permissions through Lake Formation (for Amazon Redshift,
Amazon EMR, Athena, and Amazon S3)
Knowledge of:
• Data encryption options available in AWS analytics services (for example,
Amazon Redshift, Amazon EMR, AWS Glue)
• Differences between client-side encryption and server-side encryption
• Protection of sensitive data
• Data anonymization, masking, and key salting
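Masking and salting can be illustrated with the Python standard library. The following is a minimal sketch, not a production design: `mask_email` hides most of a value for display, and `pseudonymize` replaces a value with a salted SHA-256 digest. Both function names and the sample address are hypothetical, and real systems would also need secure salt storage and key management.

```python
import hashlib
import secrets


def mask_email(email):
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


def pseudonymize(value, salt):
    """Salted SHA-256 digest: the same value and salt give the same token."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


salt = secrets.token_bytes(16)  # random salt; store it securely, not in code
masked = mask_email("ana@example.com")
token_a = pseudonymize("ana@example.com", salt)
token_b = pseudonymize("ana@example.com", salt)
token_other_salt = pseudonymize("ana@example.com", secrets.token_bytes(16))
print(masked)
```

With a fixed salt the tokens stay consistent, so records can still be joined on the pseudonymized value. A different salt produces an unrelated token.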
Knowledge of:
• How to log application data
• How to log access to AWS services
• Centralized AWS logs
Skills in:
• Using CloudTrail to track API calls
• Using CloudWatch Logs to store application logs
• Using AWS CloudTrail Lake for centralized logging queries
• Analyzing logs by using AWS services (for example, Athena, CloudWatch
Logs Insights, Amazon OpenSearch Service)
• Integrating various AWS services to perform logging (for example, Amazon
EMR in cases of large volumes of log data)
Knowledge of:
• How to protect personally identifiable information (PII)
• Data sovereignty
Skills in:
• Granting permissions for data sharing (for example, data sharing for
Amazon Redshift)
• Implementing PII identification (for example, Macie with Lake Formation)
• Implementing data privacy strategies to prevent backups or replications of
data to disallowed AWS Regions
• Managing configuration changes that have occurred in an account (for
example, AWS Config)
The following list contains AWS services and features that are in scope for the exam.
This list is non-exhaustive and is subject to change. AWS offerings appear in
categories that align with the offerings’ primary functions:
Analytics:
• Amazon Athena
• Amazon EMR
• AWS Glue
• AWS Glue DataBrew
• AWS Lake Formation
• Amazon Kinesis Data Analytics
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Streams
• Amazon Managed Streaming for Apache Kafka (Amazon MSK)
• Amazon OpenSearch Service
• Amazon QuickSight
Application Integration:
• Amazon AppFlow
• Amazon EventBridge
• Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
• Amazon Simple Notification Service (Amazon SNS)
• Amazon Simple Queue Service (Amazon SQS)
• AWS Step Functions
Cloud Financial Management:
• AWS Budgets
• AWS Cost Explorer
Compute:
• AWS Batch
• Amazon EC2
• AWS Lambda
• AWS Serverless Application Model (AWS SAM)
Containers:
Database:
Developer Tools:
• AWS CLI
• AWS Cloud9
• AWS Cloud Development Kit (AWS CDK)
• AWS CodeBuild
• AWS CodeCommit
• AWS CodeDeploy
• AWS CodePipeline
Machine Learning:
• Amazon SageMaker
Management and Governance:
• AWS CloudFormation
• AWS CloudTrail
• Amazon CloudWatch
• Amazon CloudWatch Logs
• AWS Config
• Amazon Managed Grafana
• AWS Systems Manager
• AWS Well-Architected Tool
Networking and Content Delivery:
• Amazon CloudFront
• AWS PrivateLink
• Amazon Route 53
• Amazon VPC
Storage:
• AWS Backup
• Amazon Elastic Block Store (Amazon EBS)
• Amazon Elastic File System (Amazon EFS)
• Amazon S3
• Amazon S3 Glacier
The following list contains AWS services and features that are out of scope for the
exam. This list is non-exhaustive and is subject to change. AWS offerings that are
entirely unrelated to the target job roles for the exam are excluded from this list:
Analytics:
• Amazon FinSpace
Business Applications:
Compute:
Containers:
Database:
• Amazon Timestream
Developer Tools:
• AWS Amplify
• AWS AppSync
• AWS Device Farm
• Amazon Location Service
• Amazon Pinpoint
• Amazon Simple Email Service (Amazon SES)
Internet of Things (IoT):
• FreeRTOS
• AWS IoT 1-Click
• AWS IoT Device Defender
• AWS IoT Device Management
• AWS IoT Events
• AWS IoT FleetWise
• AWS IoT RoboRunner
• AWS IoT SiteWise
• AWS IoT TwinMaker
Machine Learning:
• Amazon CodeWhisperer
• Amazon DevOps Guru
• AWS Activate
• AWS Managed Services (AMS)
Storage: