Understanding AWS Glue for ETL
In the big data world, the biggest problem for many companies might be getting insights from data before it’s outdated. If you need to process the different types of data with speed, and efficiency to get the best value from the data, then you need to leverage ETL tools and standards.
Extract Transform and Load is the process of extracting information from different sources such as databases, then converting files or tables to usable formats, and finally loading load that data into systems end-users can access to solve business problems.
ETL Use Case Example
360 degree view of customer data — Company X gathers customer information from different sources such as ticket support systems, the company website, sales systems, call centers, etc. This data is then transformed by filtering what is necessary, cleaning the data up to a point where the data is ready for use, stored in a different storage tool, and use cases such as analytics applied to that data.
Some common ETL tools include AWS Glue, Talend, Informatica PowerCenter, and Oracle Data Integrator among others.
In this article, we will focus on AWS Glue as an ETL Tool. When you log in to AWS and Select Glue Service, this is the view you get.
AWS has split the Glue service into
- Data Catalog
The Data Catalog contains
- Databases and tables
- Crawlers and classifiers
- AWS Glue Schema Registry
The Data Catalog consists of databases and tables. A table can be in only one database. Your database can contain tables from many different sources that AWS Glue supports.
- Databases and Tables— Before any analysis can happen for any use case, we need to get data from somewhere. In AWS context this could be Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, DynamoDB, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases among others.
You, therefore, need to tell AWS Glue, which data source we are focusing on. For this context, let’s narrow down to S3. But before that, you need to set up a database that serves as a container where all your data will reside in.
Create a new database. I have created a new database below known as demodatabase.
2. AWS Crawlers and Tables— We will need tables to be created from our S3 folder. What enables us to achieve this is AWS Glue Crawler. Simply put, the Crawler will “crawl” check our data source and convert this to tables.
We will go through all the steps of adding a crawler. You can see below I have selected S3 and specified the actual dataset. If you select the S3 bucket itself, all the files in the bucket will be crawled.
I then specify the output of my crawler result which is the database I had previously created.
Once done, the crawler will be in a ready state. This means I can now trigger it to create tables in my database from my S3 bucket.
Crawler is now ready to be run. The status will change from Ready to Running to Stopping and at this point, the tables will be created.
On checking the tables, the data from S3 is now present. Great!
The data schema will also be shown upon selecting that table.
Next -> ETL — Jobs
With the AWS Data Catalog, we were establishing the groundwork for our ETL process. What enables us to extract data from the data source, transform this data and load it to the data target area is AWS Glue Job.
A job is the business logic that performs the extract, transform, and load (ETL) work in AWS Glue. In this case, I could structure my glue job as a logic that just simply drops two columns and outputs this in an S3 bucket ready for analysis.
AWS Glue allows us to create jobs in different ways as shown below.
You can also see that you can specify your data source and where the result of your data transformation will be stored.
For this demo, I will select a visual source and target. The image below shows us a clear preview of the ETL process. I will therefore need to specify my Data Source, define what needs to happen in the transformation layer, and finally where my result will be stored.
My glue job logic will just entail dropping two fields[price and address]and outputting the rest of the fields. NB Your logic can be written as a script with much more complex transformations to suit your business context
Save your job and run it. You can run it on-demand or schedule it as per your need in the Schedules tab.
Once done, if the job has been executed successfully, the run status will be indicated as “Succeeded”. This is great. We can now check the S3 data-target we specified in the ETL process to confirm our result is there.
Voila! Our data is there in S3 ready for analysis.
Let’s open the file to confirm the transformation happened. Our initial dataset had price and address. The final output the two columns have been dropped.
This was a gentle introduction to AWS Glue, and from here you can now leverage the AWS documentation for more details.
NB: AWS Glue is not free so delete any feature created if it was just for practice :)