Exercise 3 - Processing Data in a Data Lake
Note
The exercises in this course will have an associated charge in your AWS account. In
this exercise, you will create the following resources: an IAM role, an Amazon S3
bucket, an AWS Glue database with crawlers, tables, and an ETL job, and Amazon
Athena queries.
The final exercise task includes instructions to delete all the resources that you
create for this exercise.
Familiarize yourself with AWS Glue pricing, Amazon S3 pricing, Amazon Athena
pricing, and the AWS Free Tier.
Setting up
This exercise requires an IAM role and an Amazon S3 bucket. You will create these
resources by using the provided CloudFormation template.
Note: If you have an existing virtual private cloud (VPC) with the Classless Inter-
Domain Routing (CIDR) block 10.16.0.0/16, you must edit the template and change its
CIDR block.
2. Sign in to the AWS Management Console as a user who has the necessary
permissions to create an IAM role and a CloudFormation stack. You already
created the CloudFormation role in exercise 2 of this course. If you don't have a user
or role with permissions to create a stack in AWS CloudFormation, you must create
the user or role before you proceed to the next step. If you are unsure how to create a
role, see the step-by-step instructions in the Setting up section of exercise 2.
3. After you create the user or role that has permissions to work with AWS
CloudFormation, open the CloudFormation console. Make sure that you are in the
US East (N. Virginia) Region.
7. Choose Next.
11. After the stack is created, choose the Outputs tab and copy the name of the S3
bucket.
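If you prefer to script this step instead of using the console, a minimal boto3 sketch
such as the following can read the same stack outputs. The stack name shown here is
an assumption; substitute the name of your own stack.

import boto3

# Assumed stack name; the Region comes from this exercise (us-east-1).
cfn = boto3.client("cloudformation", region_name="us-east-1")
stack = cfn.describe_stacks(StackName="gluelab")["Stacks"][0]

# Print every output so you can locate the S3 bucket name.
for output in stack["Outputs"]:
    print(output["OutputKey"], "=", output["OutputValue"])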
You can add a table manually or by using a crawler. A crawler is a program that
connects to a data store and progresses through a prioritized list of classifiers to
determine the schema for your data. AWS Glue provides classifiers for common file
types, such as CSV, JavaScript Object Notation (JSON), Apache Avro, and others. You
can also write your own classifier by using a grok pattern.
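If your data doesn't match any of the built-in classifiers, you can register a custom
grok classifier before you run the crawler. The following boto3 sketch is only an
illustration; the classifier name, classification label, and pattern are assumptions and
aren't used in this exercise.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Hypothetical grok classifier for simple timestamped log lines.
glue.create_classifier(
    GrokClassifier={
        "Name": "my-log-classifier",        # assumed name
        "Classification": "custom-logs",    # label applied to matched data
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}",
    }
)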
9. For Data source configuration, under Is your data already mapped to Glue
tables?, keep Not yet selected.
s3://aws-tc-largeobjects/DEV-AWS-MO-Designing_DataLakes/week3/
This S3 bucket contains the data file, which includes data for all green taxi rides in
January 2020.
13. Keep all other settings for this page at their default values. Then choose Add an S3
data source.
15. On the Configure security settings page, under Existing IAM role, choose
AWSGlueServiceRoleDefault, and then choose Next.
16. On the Set output and scheduling page, under Target database, choose nycitytaxi.
17. For Frequency, keep Run on demand selected and choose Next.
When the crawler finishes running, one table is added to the database. After the
crawler stops, you should see that the Tables added column now shows 1.
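You can verify the same result programmatically. In this boto3 sketch, the crawler
name is an assumption (substitute the name that you gave your crawler); the
nycitytaxi database name comes from this exercise.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# READY means that the crawler has finished running.
# Replace "nytaxicrawler" with the name of your crawler.
print(glue.get_crawler(Name="nytaxicrawler")["Crawler"]["State"])

# List the tables that were added to the nycitytaxi database.
for table in glue.get_tables(DatabaseName="nycitytaxi")["TableList"]:
    print(table["Name"])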
This screen describes the table, including its schema, properties, and other
information. If you want to look at the schema information, you can choose Edit
schema.
1. In the navigation pane of AWS Glue, in the ETL section, choose Jobs.
2. In the Create job section, keep all the default settings and choose Create.
Example:
s3://glue-934169e0/data/
6. At the top of the pane, choose the Job details tab and configure the following
settings:
Name: nytaxiparquet
IAM role: AWSGlueServiceRoleDefault
Note: This role grants access to resources that AWS Glue needs to automatically
generate the nytaxi-csv-parquet script.
7. To verify the script, choose the Script tab. Feel free to review the script.
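The generated script varies, but for a CSV-to-Parquet conversion it typically
resembles the following sketch. The source table name ("week3") and the output path
are assumptions; your script will use the table that your crawler created and the
bucket from your stack.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV-backed table from the Data Catalog. The table name
# depends on the crawled S3 prefix; "week3" is an assumption.
source = glue_context.create_dynamic_frame.from_catalog(
    database="nycitytaxi", table_name="week3"
)

# Write the same records to your bucket in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://glue-934169e0/data/"},  # your bucket
    format="parquet",
)

job.commit()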
Wait for the job to complete. You can view the status by choosing the Run details link
(in the system message at the top of the pane) or by choosing the Runs tab.
10. In the navigation pane, in the Data catalog section, choose Crawlers.
15. For S3 path, paste s3://<FMI>/data/. Replace the FMI with the name of the bucket
where the Parquet file is located.
Example:
s3://glue-934169e0/data/
17. On the Configure security settings page, under Existing IAM role, choose
AWSGlueServiceRoleDefault. Then choose Next.
19. For Frequency, keep Run on demand selected and choose Next.
21. Select the nytaxiparquet crawler and choose Run crawler. Wait for the job to finish.
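Equivalently, you can start this crawler with boto3 and poll it until it returns to the
READY state:

import boto3
import time

glue = boto3.client("glue", region_name="us-east-1")

# Start the Parquet crawler that you created in the previous steps.
glue.start_crawler(Name="nytaxiparquet")

# Poll until the crawler finishes and returns to the READY state.
while glue.get_crawler(Name="nytaxiparquet")["Crawler"]["State"] != "READY":
    time.sleep(10)
print("Crawler finished.")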
22. In the navigation pane, in the Data catalog section, choose Tables.
2. If you are a new user, choose Explore the query editor. If you are an existing user,
the query editor might open automatically.
6. For both Query name and Query description, paste taxidata and choose Save
query.
7. At the top of the query editor pane, choose the Settings tab and then choose
Manage.
8. For Location of query result, paste the following path for your bucket and replace the
FMI with your bucket name.
s3://<FMI>/sql/
Example:
s3://glue-934169e0/sql/
9. Choose Save.
You can now browse the results and see information such as the passenger_count,
trip_distance, and tip_amount.
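You can also run the same kind of query outside the console with boto3. In this
sketch, the table name ("data") is an assumption based on the s3://<bucket>/data/
prefix; substitute the table name that your Parquet crawler actually created.

import boto3
import time

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; "data" is an assumed table name.
query_id = athena.start_query_execution(
    QueryString=(
        "SELECT passenger_count, trip_distance, tip_amount "
        "FROM data LIMIT 10"
    ),
    QueryExecutionContext={"Database": "nycitytaxi"},
    ResultConfiguration={"OutputLocation": "s3://glue-934169e0/sql/"},
)["QueryExecutionId"]

# Wait for the query to finish.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the first page of results, one row per line.
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])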
After you query the data, you can optionally connect Amazon Athena with Amazon
QuickSight to visualize data through dashboards.
Cleaning up
Delete the AWS resources that you created for this exercise by completing the following
steps.
Congratulations! You successfully completed the final exercise of this course. In this exercise,
you gained a deeper understanding of how to transform and process data in a data lake.
You defined a database, configured a crawler, and created a table in AWS Glue. You then
transformed the CSV file into Parquet to save data processing costs, and repeated the steps
to create a table for the Parquet data. Finally, you queried the data with Amazon Athena.
© 2022 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be
reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web
Services, Inc. Commercial copying, lending, or selling is prohibited. Corrections, feedback, or
other questions? Contact us at https://support.aws.amazon.com/#/contacts/aws-training. All
trademarks are the property of their owners.