Use AWS Glue Workflows to Convert Semi-structured Data

AWS Glue is an orchestration platform for ETL jobs. It is used in DevOps workflows for data warehouses, machine learning, and loading data into accounting or inventory management systems.
Glue is based on open source software, namely Apache Spark. It interacts with other open source products that AWS operates, as well as proprietary products, including Amazon S3 object storage and the Amazon DynamoDB database.
Glue is not a database; it’s a schema store, also called a metadata catalog. It contains data tables that describe other data tables. Glue provides triggers, schedulers, and manual runs to pull data from one platform and push it to another.
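For instance, a job that’s already defined in Glue can be run on demand or put on a schedule with a few boto3 calls. This is a minimal sketch; the job name and the cron schedule are hypothetical placeholders:

import boto3

glue = boto3.client("glue")

# Start an existing Glue job manually (the job name is a placeholder)
run = glue.start_job_run(JobName="nightly-etl")
print(run["JobRunId"])

# Or create a trigger that runs the same job every night at 2 a.m. UTC
glue.create_trigger(
    Name="nightly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "nightly-etl"}],
    StartOnCreation=True)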
Glue performs transformations through its web console and through its Python and Scala APIs. To demonstrate how your IT team can use Glue for their Extract, Transform, and Load (ETL) tasks, let’s review some of the basic components of the workflow. Next, we’ll explain how to use the service to organize semi-structured data for analysis and to train and support a machine learning model.
Glue workflow
With Glue, the workflow typically follows these steps:
1. Load external data into Amazon S3, DynamoDB, or any row-and-column database that supports Java Database Connectivity (JDBC), which includes most SQL databases. Glue supports the JSON, XML, Apache Parquet, CSV, and Avro file formats.
2. Glue uses Apache Spark to create data tables, stored in Apache Hive format on top of a Hadoop file system, on the virtual machines that run Glue.
3. Load more data from another source.
4. Perform a transformation, such as joins, deletes, aggregation, or mapping, on the combined datasets from steps 1 and 3 (see the sketch after this list).
5. Load the data into a data warehouse such as Snowflake or Amazon Redshift, or use an API or bulk loader to move it into SAP.
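Here’s a minimal sketch of that flow in Glue’s PySpark API, assuming two hypothetical JSON datasets in S3 joined on a shared customer_id column; the bucket, paths, and column name are placeholders:

from pyspark import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Join

glueContext = GlueContext(SparkContext.getOrCreate())

# Load two datasets from S3 (steps 1 and 3; bucket and paths are hypothetical)
orders = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/orders/"]},
    format="json")
customers = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/customers/"]},
    format="json")

# Join the two datasets on a shared key (step 4)
joined = Join.apply(orders, customers, "customer_id", "customer_id")

# Write the combined result out as Parquet for the warehouse to load (step 5)
glueContext.write_dynamic_frame_from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/joined/"},
    format="parquet")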
Because Glue orchestrates the workflow itself, it can also perform the tasks, which avoids the need for a separate DevOps tool like SaltStack. Administrators could write the code in Spark or Python and then use Salt to orchestrate the tasks, but Salt doesn’t have a database, so the operations would be disjointed. With Glue, everything runs under the same code and in the same context.
Glue-DevOps Examples
Now that we’ve looked at how Glue works, let’s examine two use cases, data transformation and a machine learning workflow, to better understand its practical application.
Author’s Note: To save money, download the Glue libraries and run them locally on your own computer to do some of the work.
ETL programming
Glue has an Apache Spark shell, which can be run with the command gluepyspark. It works like PySpark, the Python shell for Spark.
Run gluepyspark to start a Spark instance, then write Python code interactively in this shell, or work in batch mode and submit ETL jobs with gluesparksubmit.
Spark uses DataFrames, which facilitate SQL operations on data created from JSON, CSV, and other file formats. Spark DataFrames are different from pandas DataFrames, which are what most machine learning SDKs require. This limits Glue to a single built-in machine learning operation, described below.
One big downside to Spark DataFrames is that they require a schema up front. Glue introduces the DynamicFrame, which infers a schema when possible.
For example, here’s how to open a JSON file in Glue to create a DynamicFrame:
from pyspark import SparkContext
from awsglue.context import GlueContext

# Create a GlueContext on top of the running SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())

# Read a JSON file from S3 into a DynamicFrame; the schema is inferred
inputDF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://walkerbank/transactions.json"]},
    format="json")
From there, run SQL and Python operations on it to aggregate or transform the data, then push it to the next system, such as the data warehouse.
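Continuing the example above, one way to do this is to convert the DynamicFrame to a Spark DataFrame, register it as a temporary view, and aggregate it with Spark SQL. The column names here are hypothetical:

# Convert the DynamicFrame to a Spark DataFrame to enable SQL operations
df = inputDF.toDF()
df.createOrReplaceTempView("transactions")

# Aggregate with Spark SQL; account_id and amount are hypothetical columns
spark = glueContext.spark_session
totals = spark.sql(
    "SELECT account_id, SUM(amount) AS total "
    "FROM transactions GROUP BY account_id")
totals.show()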
Glue FindMatches Machine Learning
As the name suggests, AWS Glue’s FindMatches ML Transform feature finds matching records. IT teams can use it to group related items or to deduplicate records, two common operations in ETL jobs.
To use Glue for machine learning, train the model with labeled data, as with any other supervised learning tool. Then run the test data, and Glue assigns a label ID to the records, finding the records that you defined as equivalent in the training data.
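Once the transform is trained, it can also be applied inside a Glue job through the Glue PySpark library. A minimal sketch, assuming a DynamicFrame like the inputDF built earlier and a hypothetical transform ID:

from awsglue.ml import FindMatches

# Apply a trained FindMatches transform; the transform ID is a placeholder
matched = FindMatches.apply(frame=inputDF, transformId="tfm-0123456789abcdef")

# FindMatches adds a match_id column that groups equivalent records
matched.toDF().show()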
As with other Glue operations, administrators can configure most of these steps in the Glue console.
Start by adding the first transformation step in the console, as shown in Figure 1.
Then perform other tasks. For example, tune the model, make other changes, or export the results.

The screen in Figure 3 shows the tuning parameters defined by AWS. There is a trade-off between recall and precision; choose which one to favor and how strongly to optimize for that metric.

Recall and precision have formal mathematical definitions. Here’s a simple way to tell them apart:
- Precision: true positives / (true positives + false positives)
- Recall: true positives / (true positives + false negatives)
Create labels to detect and tag data by uploading a training set, which AWS calls a labeling file. This is basically a set of labeled records (see Figure 4), where labeling_set_id identifies the label, and the label is the basis of all classification problems.
In Figure 4, we train the model to classify the first two rows as belonging to the same LBL123 group and the last row to the LBL345 group. These records could be anything, such as transactions, customers, or scientific data.

Note that there are no obvious similarities between the features, but that is the point. We use machine learning to identify relationships that might not be obvious at first glance. And those similarities will depend on the use case and how your organization defines them.