
Use AWS Glue Workflows to Convert Semi-structured Data

By Warren B. Obrien
May 21, 2021

AWS Glue is an orchestration platform for ETL jobs. It is used in DevOps workflows for data warehouses, machine learning, and loading data into accounting or inventory management systems.

Glue is based on open source software, namely Apache Spark. It interacts with other open source products that AWS operates, as well as proprietary products, including Amazon S3 object storage and the Amazon DynamoDB database.

Glue is not a database; it's a catalog of schemas – also called metadata. It contains data tables that describe other data tables. Glue provides triggers, schedulers, and manual invocation to use these schemas to pull data from one platform and push it to another.

Glue performs transformations through its web console and through Python and Scala APIs. To demonstrate how an IT team can use Glue for Extract, Transform, and Load (ETL) tasks, let's review the basic components of the workflow. Then we'll explain how to use the service to organize semi-structured data for analysis and to train and support a machine learning model.

Glue workflow

With Glue, the workflow typically follows these steps:

  1. Load external data into Amazon S3, DynamoDB, or any row-and-column database that supports Java Database Connectivity (JDBC), which includes most SQL databases. Glue supports the JSON, XML, Apache Parquet, CSV, and Avro file formats.
  2. Glue uses Apache Spark to create data tables, in Apache Hive format on top of a Hadoop file system, in the virtual machines that run Glue.
  3. Load more data from another source.
  4. Perform a transformation – such as joins, deletes, aggregation, or mapping – on the combined datasets from steps 1 and 3.
  5. Load the data into a data warehouse such as Snowflake or Amazon Redshift, or use an API or bulk loader to move it into SAP.

Because Glue is itself a workflow engine, it can sequence these tasks, which avoids the need for a DevOps tool like SaltStack. Administrators could instead write the code in Spark or Python and use Salt to orchestrate the tasks, but Salt has no data catalog, so operations would be disjointed. With Glue, everything runs under the same code and the same context.

Glue-DevOps Examples

Now that we’ve looked at how Glue works, let’s take a look at two use cases – data transformation and a machine learning workflow – to better understand its practical application.

Author’s Note: To save money, download the Glue libraries and run them locally on your own computer to do some of the work.

ETL programming

Glue ships with an Apache Spark shell, which can be started with the command gluepyspark. It works like PySpark, the Python shell for Spark.

Run gluepyspark to start a Spark instance with the Glue classes loaded. You can then write Python code interactively in this shell, or work in batch mode and submit ETL jobs with gluesparksubmit.

Spark uses DataFrames, which make it easy to run SQL operations on data created from JSON, CSV, and other file formats. Spark DataFrames are different from pandas DataFrames, which most machine learning SDKs require. That limits Glue to the one built-in machine learning operation, described below.

One big downside to Spark DataFrames is that they require a schema up front. Glue introduces the DynamicFrame, which infers a schema where possible.
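To illustrate the idea behind that inference, here is a minimal sketch in plain Python: walk the records, note each field's type, and flag fields whose type conflicts across records. The function name and the sample rows are invented for illustration; this is not Glue's implementation.

```python
# Minimal illustration of schema inference, the idea behind Glue's
# DynamicFrame. This is plain Python, not the Glue implementation.
def infer_schema(records):
    schema = {}
    for rec in records:
        for field, value in rec.items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                # Conflicting types across records; DynamicFrame handles
                # this case with a "choice" type.
                schema[field] = "choice"
    return schema

rows = [
    {"id": 1, "amount": 9.99},
    {"id": 2, "amount": "9.99"},  # amount arrives as a string here
]
print(infer_schema(rows))  # {'id': 'int', 'amount': 'choice'}
```

Semi-structured sources like JSON routinely mix types in one field this way, which is exactly why a fixed up-front schema is painful.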

For example, here’s how to open a JSON file in Glue to create a DynamicFrame:

from pyspark import SparkContext
from awsglue.context import GlueContext

# Create a GlueContext on top of the shared SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())

# Read the JSON objects from S3 into a DynamicFrame
inputDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://walkerbank/transactions.json"]},
    format = "json")

Then run SQL and Python operations on it to aggregate or transform the data and push it to the next system, such as a data warehouse.
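In a Glue job that aggregation would run as Spark SQL over the DynamicFrame's data. The sketch below mimics the same step with Python's built-in sqlite3 so it can run anywhere; the transactions table, its columns, and the sample values are invented for illustration.

```python
import json
import sqlite3

# Stand-in for the JSON records the DynamicFrame would hold;
# field names and values are invented for illustration.
raw = '''[
 {"account": "A1", "amount": 40.0},
 {"account": "A1", "amount": 60.0},
 {"account": "B2", "amount": 25.0}
]'''
rows = json.loads(raw)

# Aggregate per account -- the same shape of SQL a Glue job
# would run via Spark before loading the warehouse.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (account TEXT, amount REAL)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(r["account"], r["amount"]) for r in rows])
totals = con.execute(
    "SELECT account, SUM(amount) FROM transactions "
    "GROUP BY account ORDER BY account").fetchall()
print(totals)  # [('A1', 100.0), ('B2', 25.0)]
```

The output rows are what would be handed to the next system in the workflow, such as Redshift or Snowflake.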

Glue FindMatches Machine Learning

As the name suggests, AWS Glue’s FindMatches ML Transform feature finds matching records. IT teams can use it to group related items or to deduplicate records, two common operations in ETL jobs.

To use Glue for machine learning, train the model with labeled data, as with any other supervised learning tool. Then run the test data: Glue assigns a label ID to each record, searching for records equivalent to the ones you defined as matches in the training data.

As with other Glue tasks, administrators can configure most of these steps in the Glue console.

Start by adding the first transformation step in the console, as shown in Figure 1.

Figure 1. Configure transform properties in Glue

Then perform other tasks. For example, tune the model, make other changes, or export the results.

Figure 2. List of available actions: edit properties, teach the transform, tune, export all labels, and delete

The screen in Figure 3 shows the tuning parameters defined by AWS. There is a trade-off between recall and precision: choose which metric you prefer and how aggressively you want to optimize it.

Figure 3. Glue tuning parameters

Recall and precision have formal mathematical definitions. Here’s a simple way to tell them apart:

  • Precision: true positives / (true positives + false positives)
  • Recall: true positives / (true positives + false negatives)
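As a quick arithmetic check, the two formulas above can be sketched in plain Python. The counts used here are invented example numbers, not output from FindMatches.

```python
# Precision and recall computed from match counts, as used when
# tuning FindMatches. The counts below are invented examples.
def precision(tp, fp):
    # Of the matches the model proposed, how many were real?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the real matches, how many did the model find?
    return tp / (tp + fn)

# Suppose the transform proposed 10 matches: 8 correct (true positives),
# 2 wrong (false positives), and it missed 4 real matches (false negatives).
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # 0.6666666666666666
```

Optimizing for precision makes the proposed matches trustworthy at the cost of missing some; optimizing for recall finds more matches but admits more false ones.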

Create the labels used to detect and tag data by uploading a training set, which AWS calls a labeling file. This is essentially a set of labeled features – see Figure 4 – where labeling_set_id is the label, and labels are the basis of all classification problems.

In Figure 4, we train the model to classify the first two rows as belonging to the same group, LBL123, and the last row to group LBL345. The records can be anything, such as transactions, customers, or scientific data.

Figure 4. Example rows from a labeling file for the machine learning model
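A labeling file of that shape can be sketched as a small CSV, built here with Python's csv module. Only the labeling_set_id column and the LBL123/LBL345 groups come from the example above; the feature columns (name, city) and their values are invented, and a real FindMatches labeling file may carry additional columns.

```python
import csv
import io

# A minimal labeling file like the one in Figure 4: the first two rows
# share labeling_set_id LBL123 (the model should learn they match) and
# the last row sits alone in LBL345. Feature columns are invented.
rows = [
    {"labeling_set_id": "LBL123", "name": "ACME Corp",  "city": "Boston"},
    {"labeling_set_id": "LBL123", "name": "Acme Corp.", "city": "boston"},
    {"labeling_set_id": "LBL345", "name": "Widget LLC", "city": "Denver"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["labeling_set_id", "name", "city"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().splitlines()[0])  # labeling_set_id,name,city
```

In practice the finished CSV would be uploaded to S3 and pointed at from the Teach transform step in the console.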

Note that there are no obvious similarities between the features, but that is the point. We use machine learning to identify relationships that might not be obvious at first glance. And those similarities will depend on the use case and how your organization defines them.


