Lab: Performing ETL on a Dataset by Using AWS Glue
[Version 1.0.21]
Duration
This lab will require approximately 90 minutes to complete.
Scenario
The data science team has asked you to create a series of proofs of concept (POCs) to use AWS
services to address many of the data engineering and data analysis needs of the university. Mary,
one of the data science team members, has seen what Athena can do to create tables that have
defined schemas and is impressed. She asks you if it's possible to infer the columns and data
types automatically. Defining the schema takes much of her time when she deals with large
amounts of varied data. You want to develop a POC to use AWS Glue, which is designed for use
cases that are similar to this one.
To develop the POC, Mary suggests that you use a publicly available dataset, the Global Historical Climatology Network Daily (GHCN-D) dataset, which contains daily weather summaries from ground-based stations, going back to 1763. The dataset is publicly available in an S3 bucket.
Mary explains that the most common recorded parameters in the dataset are daily temperatures,
rainfall, and snowfall. These parameters are useful to assess risks for drought, flooding, and
extreme weather. The data definitions used in this lab are available on the NOAA Global Historical
Climatology Network Daily (GHCN-D) Dataset page.
Note: As of October 2022, the dataset has been split into two sub-datasets, by_year and by_station. Throughout this lab, you will use by_year, which is located at s3://noaa-ghcn-pds/csv/by_year/.
When you start the lab, the environment will contain the resources that are shown in the following
diagram. For this lab environment, the original data source is an S3 bucket that exists in another
AWS account.
The lab environment is created with a CloudFormation template that is deployed when you launch the lab. The resulting CloudFormation stack creates two S3 buckets named data-science-bucket and glue-1950-bucket, an IAM policy named Policy-For-Data-Scientists, and an IAM role named gluelab.
Tip: To review the CloudFormation template that built this environment, navigate to the
CloudFormation console. In the navigation pane, choose Stacks.
By the end of the lab, you will have created the additional architecture shown in the following
diagram. The table after the diagram provides a detailed explanation of the architecture and how it
relates to the tasks that you will complete in this lab.
Numbered task | Detail
1 | You will create and use an AWS Glue crawler named weather with the gluelab IAM role to access the GHCN-D dataset, which is in an S3 bucket in another AWS account. The crawler will populate the weatherdata database in the AWS Glue Data Catalog.
2 | You will also use the AWS Glue console to transform the database by modifying its schema.
3 | When the Data Catalog is ready, you will use Athena to query the database and build tables. The results of the queries that you run will be stored in the data-science-bucket S3 bucket.
4 | You will create an AWS Glue database table using another query that only includes data since 1950 and then store the results of the query in the glue-1950-bucket S3 bucket.
5 | You will create views and use them to calculate the average temperature of each year.
6 | You will use the AWS CLI within the AWS Cloud9 terminal to create a CloudFormation template. The template will create the crawler as cfn-crawler-weather. Other teams and other university departments will be able to use the template to create the crawler as needed.
7 | You will review the Policy-For-Data-Scientists IAM policy to understand the team's access to the workflow.
8 | You will test Mary's access to the cfn-crawler-weather crawler and run it by using her credentials.
Notice that this is where you could add tags or extra security configurations. Keep the
default settings.
Choose Next at the bottom of the page.
Choose Add a data source and configure the following:
Data source: Choose S3.
Location of S3 data: Choose In a different account.
S3 path: Enter the following S3 bucket location for the publicly available dataset:
s3://noaa-ghcn-pds/csv/by_year/
The gluelab role that the crawler uses is defined in the environment's CloudFormation template as follows:

GlueLab:
  Type: AWS::IAM::Role
  Properties:
    RoleName: "gluelab"
    Path: "/"
    AssumeRolePolicyDocument:
      Version: 2012-10-17
      Statement:
        - Effect: Allow
          Principal:
            Service:
              - glue.amazonaws.com
          Action:
            - sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
      - arn:aws:iam::aws:policy/AmazonS3FullAccess
Choose Next.
In the Output configuration section, choose Add database.
A new browser tab opens.
For Name, enter weatherdata
Choose Create database.
Return to the browser tab that is open to the Set output and scheduling page in the
AWS Glue console.
For Target database, choose the weatherdata database that you just created.
Tip: To refresh the list of available databases, choose the refresh icon to the right of the
dropdown list.
In the Crawler schedule section, for Frequency, keep the default On demand.
Choose Next.
Confirm your crawler configuration is similar to the following.
AWS Glue creates a table to store metadata about the GHCN-D dataset. Next, you will
inspect the data that AWS Glue captured about the data source.
Now you will edit the schema of the database, which is part of transforming data in the
ETL process.
The column names in the updated schema are as follows (original name | new name):

date | date
element | type
data_value | observation
m_flag | mflag
q_flag | qflag
s_flag | sflag
Task 1 summary
In this task, you used the console to create a crawler in AWS Glue. You directed the crawler to
data that is stored in an S3 bucket, and the crawler discovered the data. Then, the crawler stored
the associated metadata (the table definition and schema) in a Data Catalog. By using a crawler in
AWS Glue, you can inspect a data source and infer its schema.
The team can now use crawlers to inspect data sources quickly and reduce the manual steps to
create database schemas from data sources in Amazon S3. You share the results of this POC with
Mary, and she is happy with the new functionality. Next, she wants to be able to do more analysis
on the data in the AWS Glue Data Catalog.
Notice the run time and amount of data that was scanned for the query. As you develop
more complex applications, it is important to minimize resource consumption to optimize
costs. You will see examples of how to optimize cost for Athena queries later in this task.
In the next step, you will create an AWS Glue database table that only includes data since 1950.
To optimize your use of Athena, you will store data in the Apache Parquet format. Apache Parquet
is an open-source columnar data format that is optimized for performance and storage.
9. Create a table for data after 1950.
First, you need to retrieve the name of the bucket that was created for you to store this data.
In the search box next to Services, search for and choose S3.
In the Buckets list, copy the bucket name that contains glue-1950-bucket to a text
editor of your choice.
Return to the Athena query editor.
Copy and paste the following query into a query tab in the editor. Replace <glue-1950-bucket> with the name of the bucket that you recorded:
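The lab supplies the exact query. As a hedged illustration only, an Athena CREATE TABLE AS SELECT (CTAS) statement of roughly this shape would produce the Parquet table; the source table name (by_year), the output prefix, and the date filter are assumptions, while the late20th table name and the date, type, and observation columns come from the steps above:

-- Illustrative sketch only; use the query provided in the lab.
-- Assumes the crawler-created source table is named by_year and that
-- date is stored as a bigint in YYYYMMDD form.
CREATE TABLE late20th
WITH (
    format = 'PARQUET',
    external_location = 's3://<glue-1950-bucket>/lab/'
) AS
SELECT date, type, observation
FROM by_year
WHERE date / 10000 >= 1950;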
Choose Run.
After the query runs, the run time and data scanned values are similar to the following:
To preview the results, in the Tables section, to the right of the late20th table, choose the
ellipsis icon, and then choose Preview Table.
The results are similar to the following screenshot.
Now that you have isolated the data that you are interested in, you can write queries for
further analysis.
10. Run a query on the new table.
First, create a view that only includes the maximum temperature reading, or TMAX, value.
Run the following query in a new query tab:
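The lab supplies the exact query. A minimal sketch of such a view, assuming the late20th table and the column names from the previous step:

-- Illustrative sketch only; use the query provided in the lab.
CREATE VIEW tmax AS
SELECT date, observation, type
FROM late20th
WHERE type = 'TMAX';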
To preview the results, in the Views section, to the right of the tmax view, choose the
ellipsis icon, and then choose Preview View.
The results are similar to the following screenshot:
The purpose of this query is to calculate the average maximum temperature for each year in
the dataset.
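The lab supplies the exact query. As a hedged sketch, an aggregation over the tmax view might look like the following; the avg_tmax view name and the year arithmetic are assumptions:

-- Illustrative sketch only; use the query provided in the lab.
-- Assumes date is a bigint in YYYYMMDD form, so integer division
-- by 10000 yields the year.
CREATE VIEW avg_tmax AS
SELECT date / 10000 AS year, AVG(observation) AS avg_max_temp
FROM tmax
GROUP BY date / 10000;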
After the query runs, the run time and data scanned values are similar to the following:
The results display the average maximum temperature for each year from 1950 to 2015. The
following screenshot displays an example:
Remember that when you create queries with Athena, the results of the query must be stored
back in Amazon S3. In a previous step, you specified the location in Amazon S3 where your
queries are stored.
When using AWS services, you generally pay for what you use. Because you reduced your query to only three columns of temperature data from 1950 through 2015, you reduced your storage costs. Also, because you stored the query data in the columnar Apache Parquet format, Athena scanned less data to perform the queries in this task, which reduced both the query time and the cost.
You show Mary that she can speed up her process by using AWS Glue in combination with
Athena and still use views. She is delighted.
Task 2 summary
In this task, you learned how to use Athena to query tables in a database that an AWS Glue
crawler created. You built a table for all data after 1950 from the original dataset. You used the
Apache Parquet format to optimize your Athena queries, which reduced the time that it took to
complete each query, resulting in less cost. After isolating this data, you created a view that
calculated the average maximum temperature for each year.
AWS Glue integrates with original datasets stored in Amazon S3. AWS Glue can create a crawler
to ingest the original dataset into a database and infer the appropriate schema. Then, you can
quickly shift to Athena to develop queries and better understand your data. This integration
reduces the time that it takes to derive insights from your data and apply these insights to make
better decisions.
Using the AWS Cloud9 editor, you create a CloudFormation template file with the following content:

AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  # The name of the crawler to be created
  CFNCrawlerName:
    Type: String
    Default: cfn-crawler-weather
  CFNDatabaseName:
    Type: String
    Default: cfn-database-weather
  CFNTablePrefixName:
    Type: String
    Default: cfn_sample_1-weather
# Resources section defines metadata for the Data Catalog
Resources:
  # Create a database to contain tables created by the crawler
  CFNDatabaseWeather:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: !Ref CFNDatabaseName
        Description: "AWS Glue container to hold metadata tables for the weather crawler"
  # Create a crawler to crawl the weather data on a public S3 bucket
  CFNCrawlerWeather:
    Type: AWS::Glue::Crawler
    Properties:
      Name: !Ref CFNCrawlerName
      Role: <GLUELAB-ROLE-ARN>
      # Classifiers: none, use the default classifier
      Description: AWS Glue crawler to crawl weather data
      # Schedule: none, use default run-on-demand
      DatabaseName: !Ref CFNDatabaseName
      Targets:
        S3Targets:
          # Public S3 bucket with the weather data
          - Path: "s3://noaa-ghcn-pds/csv/by_year/"
      TablePrefix: !Ref CFNTablePrefixName
      SchemaChangePolicy:
        UpdateBehavior: "UPDATE_IN_DATABASE"
        DeleteBehavior: "LOG"
      Configuration: "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
Note: This code block uses the Sample AWS CloudFormation Template for an AWS Glue Crawler for Amazon S3 from AWS CloudFormation for AWS Glue in the AWS Glue Developer Guide.
In the file, replace <GLUELAB-ROLE-ARN> with the ARN for the gluelab IAM role. Look
for the line that starts with Role, which is around line 27.
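For example, after the replacement, the line might look like the following (the account ID shown here is a placeholder, not a value from your lab environment):

Role: arn:aws:iam::111122223333:role/gluelab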
Note: If you receive an error that says YAML not well-formed, check the value for the name of
the gluelab role. Also check the tabs and spacing for each line. YAML documents require
exact spacing, and the parser will encounter errors if the spacing doesn't match.
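The lab supplies the exact command. A sketch of validating the template with the AWS CLI, assuming that you saved the template as gluecrawler.cfn.yaml in your AWS Cloud9 environment:

# The file name is an assumption; use the name from the lab instructions.
aws cloudformation validate-template --template-body file://gluecrawler.cfn.yaml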
If the template is validated, the following output displays:
{
    "Parameters": [
        {
            "ParameterKey": "CFNCrawlerName",
            "DefaultValue": "cfn-crawler-weather",
            "NoEcho": false
        },
        {
            "ParameterKey": "CFNTablePrefixName",
            "DefaultValue": "cfn_sample_1-weather",
            "NoEcho": false
        },
        {
            "ParameterKey": "CFNDatabaseName",
            "DefaultValue": "cfn-database-weather",
            "NoEcho": false
        }
    ]
}
Now you will use the template to create a CloudFormation stack. A stack implements and manages the group of resources that are outlined in a template, so you can manage the state and dependencies of those resources together. Think of the CloudFormation template as a blueprint: the stack is the instance of that blueprint that is registered in AWS and actually creates the resources.
16. To create the CloudFormation stack, run the following command:
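The lab supplies the exact command. A sketch, assuming the same gluecrawler.cfn.yaml file name and a stack named gluecrawler (the stack name matches the ARN in the sample output that follows):

# The file and stack names are assumptions based on the sample output.
aws cloudformation create-stack --stack-name gluecrawler --template-body file://gluecrawler.cfn.yaml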
If the template passes validation, the stack ARN displays in the output, similar to the following:
{
    "StackId": "arn:aws:cloudformation:us-east-1:338778555682:stack/gluecrawler/2d8cec90-5c42-11ec-8fbf-12034b0079a5"
}
The CloudFormation create-stack command creates the stack and deploys it. If validation
passes and nothing causes the stack creation to roll back, proceed to the next step.
Tip: To check the progress of stack creation, navigate to the CloudFormation console. In the
navigation pane, choose Stacks.
17. To verify that the AWS Glue database was created in the stack, run the following command:
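A sketch of the command (the lab supplies the exact one):

aws glue get-databases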
{
    "DatabaseList": [
        {
            "Name": "cfn-database-weather",
            "Description": "AWS Glue container to hold metadata tables for the weather crawler",
            "Parameters": {},
            "CreateTime": 1649267047.0,
            "CreateTableDefaultPermissions": [
                {
                    "Principal": {
                        "DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
                    },
                    "Permissions": [
                        "ALL"
                    ]
                }
            ],
            "CatalogId": "034140262343"
        },
        {
            "Name": "weatherdata",
            "CreateTime": 1649263434.0,
            "CreateTableDefaultPermissions": [
                {
                    "Principal": {
                        "DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
                    },
                    "Permissions": [
                        "ALL"
                    ]
                }
            ],
            "CatalogId": "034140262343"
        }
    ]
}
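Next, verify that the crawlers exist. A sketch of the command that the lab uses here (the lab supplies the exact one):

aws glue list-crawlers

The output lists both the crawler that you created in the console and the crawler that the stack created: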
{
    "CrawlerNames": [
        "Weather",
        "cfn-crawler-weather"
    ]
}
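To review the configuration of the crawler that the stack created, retrieve it by name. A sketch of the command (the lab supplies the exact one):

aws glue get-crawler --name cfn-crawler-weather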
{
    "Crawler": {
        "Name": "cfn-crawler-weather",
        "Role": "WeatherCrawler-001-CFNRoleWeather-17WB9OM5H5MFL",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://noaa-ghcn-pds/csv/by_year/",
                    "Exclusions": []
                }
            ],
            "JdbcTargets": [],
            "MongoDBTargets": [],
            "DynamoDBTargets": [],
            "CatalogTargets": [],
            "DeltaTargets": []
        },
        "DatabaseName": "cfn-database-weather",
        "Description": "AWS Glue crawler to crawl weather data",
        "Classifiers": [],
        "RecrawlPolicy": {
            "RecrawlBehavior": "CRAWL_EVERYTHING"
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG"
        },
        "LineageConfiguration": {
            "CrawlerLineageSettings": "DISABLE"
        },
        "State": "READY",
        "TablePrefix": "cfn_sample_1-weather",
        "CrawlElapsedTime": 0,
        "CreationTime": 1649083535.0,
        "LastUpdated": 1649083535.0,
        "Version": 1,
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}",
        "LakeFormationConfiguration": {
            "UseLakeFormationCredentials": false,
            "AccountId": ""
        }
    }
}
Review the response. Notice that the state of the crawler is READY. This means that the crawler is
deployed, but it hasn't run yet. You will use Mary's IAM user to run the crawler later in the lab.
Task 3 summary
In this task, you learned how to integrate an AWS Glue crawler into a CloudFormation template.
You also learned how to use the AWS CLI within the AWS Cloud9 terminal to validate and deploy
the template to create the crawler. With the template, you can reuse the crawler in other AWS accounts. Then, you learned how to confirm that the stack's resources (the AWS Glue database and the crawler) were created.
Many companies use multiple accounts with AWS to maintain separate development, testing, and
production environments. Isolating these environments helps to ensure that teams follow best
practices. Building crawlers in a development account and then testing them in a controlled
account with production data can help to ensure that the crawler is designed as intended and
extracts, transforms, and loads the data that is intended for the specific business task. After
validating the crawler, you can use DevOps best practices with CloudFormation to quickly move it
to production so that the appropriate business stakeholders can reuse the crawler without having to build it from scratch.
Task 4 summary
In this task, you reviewed the IAM policy for the DataScienceGroup. The policy contains
permissions for limited access to Amazon S3, AWS Glue, and Athena. The policy could be used
as an example policy for users who intend to reuse crawlers built by the operations team. As with
all services in AWS, IAM users must have the appropriate permissions applied to be able to
perform actions.
In the AWS Cloud9 terminal, you store Mary's credentials in bash variables, replacing the placeholders with the values provided in the lab:

AK=<ACCESS-KEY>
SAK=<SECRET-ACCESS-KEY>
To test whether the mary user can perform a specific command, you can pass the user's
credentials as bash variables (AK and SAK) with the command. The API will then try to perform
that command as the specified user.
21. Test Mary's access to the AWS Glue crawler.
To test whether the mary user can perform the list-crawlers command, run the following
command:
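The lab supplies the exact command. A sketch, assuming Mary's credentials are passed as environment variables by using the AK and SAK values that you set earlier:

AWS_ACCESS_KEY_ID=$AK AWS_SECRET_ACCESS_KEY=$SAK aws glue list-crawlers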
The output is similar to the following and looks like the output that was displayed after
you ran the command earlier:
{
    "CrawlerNames": [
        "Weather",
        "cfn-crawler-weather"
    ]
}
To test whether the mary user can perform the get-crawler command, run the following
command:
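A sketch, following the same pattern of passing Mary's credentials:

AWS_ACCESS_KEY_ID=$AK AWS_SECRET_ACCESS_KEY=$SAK aws glue get-crawler --name cfn-crawler-weather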
The output is similar to the following and looks like the output that was displayed after
you ran the command earlier. Note that the state of the crawler is READY, but no status
information is displayed. This is because the crawler hasn't run yet.
{
    "Crawler": {
        "Name": "cfn-crawler-weather",
        "Role": "gluelab",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://noaa-ghcn-pds/csv/by_year/",
                    "Exclusions": []
                }
            ],
            "JdbcTargets": [],
            "MongoDBTargets": [],
            "DynamoDBTargets": [],
            "CatalogTargets": [],
            "DeltaTargets": []
        },
        "DatabaseName": "cfn-database-weather",
        "Description": "AWS Glue crawler to crawl weather data",
        "Classifiers": [],
        "RecrawlPolicy": {
            "RecrawlBehavior": "CRAWL_EVERYTHING"
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG"
        },
        "LineageConfiguration": {
            "CrawlerLineageSettings": "DISABLE"
        },
        "State": "READY",
        "TablePrefix": "cfn_sample_1-weather",
        "CrawlElapsedTime": 0,
        "CreationTime": 1649267047.0,
        "LastUpdated": 1649267047.0,
        "Version": 1,
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}",
        "LakeFormationConfiguration": {
            "UseLakeFormationCredentials": false,
            "AccountId": ""
        }
    }
}
22. Test that the mary user can run the crawler.
Run the following command.
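A sketch of the command, again assuming that Mary's credentials are passed as environment variables:

AWS_ACCESS_KEY_ID=$AK AWS_SECRET_ACCESS_KEY=$SAK aws glue start-crawler --name cfn-crawler-weather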
If the crawler runs successfully, the terminal doesn't display any output.
To observe the crawler running and adding data to the table, navigate to the AWS Glue
console.
In the navigation pane, choose Crawlers.
Here you can see status information for the crawler, as shown in the following screenshot.
When the status changes to Ready, the crawler is finished running. It might take a few
minutes.
Return to the AWS Cloud9 terminal.
To confirm that the crawler is finished running, run the following command.
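A sketch of the confirmation command, following the same pattern:

AWS_ACCESS_KEY_ID=$AK AWS_SECRET_ACCESS_KEY=$SAK aws glue get-crawler --name cfn-crawler-weather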
{
    "Crawler": {
        "Name": "cfn-crawler-weather",
        "Role": "gluelab",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://noaa-ghcn-pds/csv/by_year/",
                    "Exclusions": []
                }
            ],
            "JdbcTargets": [],
            "MongoDBTargets": [],
            "DynamoDBTargets": [],
            "CatalogTargets": [],
            "DeltaTargets": []
        },
        "DatabaseName": "cfn-database-weather",
        "Description": "AWS Glue crawler to crawl weather data",
        "Classifiers": [],
        "RecrawlPolicy": {
            "RecrawlBehavior": "CRAWL_EVERYTHING"
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG"
        },
        "LineageConfiguration": {
            "CrawlerLineageSettings": "DISABLE"
        },
        "State": "READY",
        "TablePrefix": "cfn_sample_1-weather",
        "CrawlElapsedTime": 0,
        "CreationTime": 1649267047.0,
        "LastUpdated": 1649267047.0,
        "LastCrawl": {
            "Status": "SUCCEEDED",
            "LogGroup": "/aws-glue/crawlers",
            "LogStream": "cfn-crawler-weather",
            "MessagePrefix": "5ef3cff5-ce6c-45d5-8359-e223a4227570",
            "StartTime": 1649267649.0
        },
        "Version": 1,
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}",
        "LakeFormationConfiguration": {
            "UseLakeFormationCredentials": false,
            "AccountId": ""
        }
    }
}
Notice that the LastCrawl section is included, and the status in that section is
SUCCEEDED. This means that Mary was able to run the crawler successfully.
Task 5 summary
This result confirms that Mary has access to the AWS Glue crawler that you created and deployed
with CloudFormation. This is because the permissions in the IAM policy allow her to list, get, and
retrieve the metadata for the crawler. Other permissions associated with the policy include the
following:
For AWS Glue: List, read, and tag resources, and run a crawler deployed with CloudFormation, but not create, remove, or manage resources.
For Athena: List, read, and tag resources, but not create or remove specific resources. (For example, this policy does not provide permissions to create or remove a named query or Data Catalog, which your own user has permissions to do.)
For Amazon S3: Access buckets, list bucket contents, and read objects, but not create a bucket. Access is limited to specific buckets, such as the DataScienceBucket that was created for you.
Congratulations! You have learned how to create an AWS Glue crawler manually and by using
CloudFormation so that you can deploy it to users with a secure IAM policy. Because the crawler
is in a CloudFormation template, you can reuse the template to create and deploy the crawler in
any AWS account and change the parameters as desired.
After a couple of minutes, the grades panel appears and shows you how many points you
earned for each task. If the results don't display after a couple of minutes, choose Grades at
the top of these instructions.
Important: Some of the checks made by the submission process in this lab will give you credit only if at least 5 minutes have passed since you completed the action. If you don't receive credit the first time you submit, you might need to wait a couple of minutes and then submit again to receive credit for these items.
Tip: You can submit your work multiple times. After you change your work, choose Submit
again. Your last submission is recorded for this lab.
25. To find detailed feedback about your work, choose Submission Report.
Lab complete
Congratulations! You have completed the lab.
26. At the top of this page, choose End Lab, and then choose Yes to confirm that you want to
end the lab.
A message panel indicates that the lab is terminating.
27. To close the panel, choose Close in the upper-right corner.
Additional resources
For more information about the services and concepts covered in this lab, see the following
resources:
Getting Started with AWS Glue
Scheduling an AWS Glue Crawler
Apache Parquet
How Crawlers Work
CreateStack Request in CloudFormation
© 2022, Amazon Web Services, Inc. and its affiliates. All rights reserved. This work may not be
reproduced or redistributed, in whole or in part, without prior written permission from Amazon
Web Services, Inc. Commercial copying, lending, or selling is prohibited.