Sunday, May 26, 2024

End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue

Data is a key enabler for your business. Many AWS customers have integrated their data across multiple data sources using AWS Glue, a serverless data integration service, in order to make data-driven business decisions. To grow the power of data at scale for the long term, it’s highly recommended to design an end-to-end development lifecycle for your data integration pipelines. The following are common asks from our customers:

  • Is it possible to develop and test AWS Glue data integration jobs on my local laptop?
  • Are there recommended approaches to provisioning components for data integration?
  • How can we build a continuous integration and continuous delivery (CI/CD) pipeline for our data integration pipeline?
  • What is the best practice to move from a pre-production environment to production?

To address these asks, this post defines the development lifecycle for data integration and demonstrates how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, using a sample baseline template.

End-to-end development lifecycle for a data integration pipeline

Today, it’s common to define not only data integration jobs but also all the data components in code. This means that you can rely on standard software best practices to build your data integration pipeline. The software development lifecycle on AWS defines the following six phases: Plan, Design, Implement, Test, Deploy, and Maintain.

In this section, we discuss each phase in the context of a data integration pipeline.


In the planning phase, developers collect requirements from stakeholders such as end-users to define a data requirement. This could be what the use cases are (for example, ad hoc queries, dashboard, or troubleshooting), how much data to process (for example, 1 TB per day), what kinds of data, how many different data sources to pull from, how much data latency to accept to make it queryable (for example, 15 minutes), and so on.
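To make such requirements concrete, it can help to turn them into numbers early. The following is a minimal sketch, not part of the baseline template: the class name `DataRequirement` and its fields are hypothetical, and it assumes data is processed in one batch per latency window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequirement:
    """Hypothetical container for requirements gathered during planning."""
    daily_volume_gb: float     # for example, 1 TB per day = 1000 GB
    max_latency_minutes: int   # for example, 15 minutes until data is queryable
    source_count: int          # number of data sources to pull from

    def batches_per_day(self) -> int:
        # One batch per latency window keeps data within the accepted latency.
        return (24 * 60) // self.max_latency_minutes

    def gb_per_batch(self) -> float:
        # Average volume each batch has to process.
        return self.daily_volume_gb / self.batches_per_day()


requirement = DataRequirement(daily_volume_gb=1000, max_latency_minutes=15, source_count=3)
print(requirement.batches_per_day())         # 96
print(round(requirement.gb_per_batch(), 1))  # 10.4
```

With the example figures from this section, each 15-minute batch would handle roughly 10 GB on average, which is a useful starting point when sizing jobs in the design phase.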


In the design phase, you analyze requirements and identify the best solution to build the data integration pipeline. In AWS, you need to choose the right services to achieve the goal and come up with the architecture by integrating these services and defining dependencies between components. For example, you may choose AWS Glue jobs as a core component for loading data from different sources, including Amazon Simple Storage Service (Amazon S3), then integrating them and preprocessing and enriching data. Then you may want to chain multiple AWS Glue jobs and orchestrate them. Finally, you may want to use Amazon Athena and Amazon QuickSight to present the enriched data to end-users.


In the implementation phase, data engineers code the data integration pipeline. They analyze the requirements to identify coding tasks to achieve the final result. The code includes the following:

  • AWS resource definition
  • Data integration logic

When using AWS Glue, you can define the data integration logic in a job script, which can be written in Python or Scala. You can use your preferred IDE to implement AWS resource definition using the AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation, and also the business logic of AWS Glue job scripts for data integration. To learn more about how to implement your AWS Glue job scripts locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.


In the testing phase, you check the implementation for bugs. Quality analysis includes testing the code for errors and checking if it meets the requirements. Because many teams immediately test the code you write, the testing phase often runs parallel to the development phase. There are different types of testing:

  • Unit testing
  • Integration testing
  • Performance testing

For unit testing, even for data integration, you can rely on a standard testing framework such as pytest and ScalaTest. To learn more about how to achieve unit testing locally, refer to Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container.
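As a small illustration of what a pytest-style unit test for transformation logic can look like, here is a sketch that does not need Spark or AWS Glue at all. The function `rename_keys` is a hypothetical stand-in for a piece of transformation logic, not something from the baseline template.

```python
def rename_keys(record: dict, mapping: dict) -> dict:
    """Return a copy of record with keys renamed according to mapping."""
    return {mapping.get(key, key): value for key, value in record.items()}


def test_rename_keys():
    record = {"id": "o1", "name": "House"}
    renamed = rename_keys(record, {"id": "org_id", "name": "org_name"})
    assert renamed == {"org_id": "o1", "org_name": "House"}
    # The input record must not be modified.
    assert record == {"id": "o1", "name": "House"}


# pytest discovers functions named test_* automatically;
# the direct call below only makes the sketch runnable on its own.
test_rename_keys()
```

Keeping transformation logic in plain functions like this, separate from job setup, is one way to make it testable without starting a Spark session.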


When data engineers develop a data integration pipeline, they code and test on a different copy of the product than the one that the end-users have access to. The environment that end-users use is called production, whereas other copies are said to be in the development or the pre-production environment.

Having separate build and production environments ensures that you can continue to use the data integration pipeline even while it’s being changed or upgraded. The deployment phase includes several tasks to move the latest build copy to the production environment, such as packaging, environment configuration, and installation.

The following components are deployed through the AWS CDK or AWS CloudFormation:

  • AWS resources
  • Data integration job scripts for AWS Glue

AWS CodePipeline helps you build a mechanism to automate deployments among different environments, including development, pre-production, and production. When you commit your code to AWS CodeCommit, CodePipeline automatically provisions AWS resources based on the CloudFormation templates included in the commit and uploads script files included in the commit to Amazon S3.


Even after you deploy your solution to a production environment, it’s not the end of your project. You need to monitor the data integration pipeline continuously and keep maintaining and improving it. More specifically, you also need to fix bugs, resolve customer issues, and manage software changes. In addition, you need to monitor the overall system performance, security, and user experience to identify new ways to improve the existing data integration pipeline.

Solution overview

Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this post, we assume the following three accounts:

  • Pipeline account – This hosts the end-to-end pipeline
  • Dev account – This hosts the data integration pipeline in the development environment
  • Prod account – This hosts the data integration pipeline in the production environment

If you want, you can use the same account and the same Region for all three.

To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared the baseline template aws-glue-cdk-baseline using the AWS CDK. The template is built on top of AWS CDK v2 and CDK Pipelines. It provisions two kinds of stacks:

  • AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
  • Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account

The AWS Glue app stack provisions the data integration pipeline, including the following resources:

  • AWS Glue jobs
  • AWS Glue job scripts

The following diagram illustrates this architecture.

At the time of publishing of this post, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. The sample AWS Glue app stack is defined using aws-glue-alpha, the L2 construct for AWS Glue, because it’s straightforward to define and manage AWS Glue resources. If you want to use the L1 construct, refer to Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines.

The pipeline stack provisions the whole CI/CD pipeline, together with the next assets:

The next diagram illustrates the pipeline workflow.

Every time the business requirement changes (such as adding data sources or changing data transformation logic), you make changes on the AWS Glue app stack and re-provision the stack to reflect your changes. This is done by committing your changes in the AWS CDK template to the CodeCommit repository, then CodePipeline reflects the changes on AWS resources using CloudFormation change sets.

In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.


You need the following resources:

Initialize the project

To initialize the project, complete the following steps:

  1. Clone the baseline template to your workplace:
    $ git clone
    $ cd aws-glue-cdk-baseline

  2. Create a Python virtual environment specific to the project on the client machine:
    $ python3 -m venv .venv

We use a virtual environment in order to isolate the Python environment for this project and not install software globally.

  3. Activate the virtual environment according to your OS:
    • On MacOS and Linux, use the following command:
      $ source .venv/bin/activate

    • On a Windows platform, use the following command:
      % .venv\Scripts\activate.bat

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

  4. Install the required dependencies described in requirements.txt to the virtual environment:
    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt

  5. Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):
    awsAccountId: 123456789101
    awsRegion: us-east-1
    awsAccountId: 123456789102
    awsRegion: us-east-1
    awsAccountId: 123456789103
    awsRegion: us-east-1

  6. Run pytest to initialize the snapshot test files by running the following command:
    $ python3 -m pytest --snapshot-update
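Before bootstrapping, it can be worth sanity-checking the account settings you entered in default-config.yaml. The following is a minimal sketch under stated assumptions: the configuration is shown as an in-memory dictionary, and the top-level key names (`pipelineAccount`, `devAccount`, `prodAccount`) as well as the function `validate_accounts` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical in-memory form of the account settings; the top-level key
# names are assumptions, only awsAccountId and awsRegion come from the post.
config = {
    "pipelineAccount": {"awsAccountId": 123456789101, "awsRegion": "us-east-1"},
    "devAccount": {"awsAccountId": 123456789102, "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": 123456789103, "awsRegion": "us-east-1"},
}


def validate_accounts(accounts: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for name, account in accounts.items():
        account_id = str(account.get("awsAccountId", ""))
        # AWS account IDs are 12-digit numbers.
        if not re.fullmatch(r"\d{12}", account_id):
            problems.append(f"{name}: awsAccountId must be 12 digits")
        if not account.get("awsRegion"):
            problems.append(f"{name}: awsRegion is missing")
    return problems


print(validate_accounts(config))  # []
```

A check like this only catches formatting mistakes; it does not confirm that the accounts exist or that your profiles can access them.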

Bootstrap your AWS environments

Run the following commands to bootstrap your AWS environments:

  1. In the pipeline account, replace PIPELINE-ACCOUNT-NUMBER, REGION, and PIPELINE-PROFILE with your own values:
    $ cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

  2. In the dev account, replace PIPELINE-ACCOUNT-NUMBER, DEV-ACCOUNT-NUMBER, REGION, and DEV-PROFILE with your own values:
    $ cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>

  3. In the prod account, replace PIPELINE-ACCOUNT-NUMBER, PROD-ACCOUNT-NUMBER, REGION, and PROD-PROFILE with your own values:
    $ cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>

When you use only one account for all environments, you can just run the cdk bootstrap command one time.
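If you bootstrap several accounts regularly, the commands above can be generated rather than typed. The following sketch only builds the argument lists; the helper name `bootstrap_command`, the account numbers, and the profile names are placeholders, and adding `--trust` for the dev and prod accounts is an assumption about a cross-account setup in which the pipeline account deploys into them.

```python
ADMIN_POLICY = "arn:aws:iam::aws:policy/AdministratorAccess"


def bootstrap_command(account, region, profile, trust_account=None):
    """Build the argument list for one `cdk bootstrap` invocation."""
    parts = [
        "cdk", "bootstrap", f"aws://{account}/{region}",
        "--profile", profile,
        "--cloudformation-execution-policies", ADMIN_POLICY,
    ]
    if trust_account is not None:
        # Allow the given account (here: the pipeline account) to deploy
        # into the environment being bootstrapped.
        parts += ["--trust", trust_account]
    return parts


# Placeholder account numbers and profile names.
pipeline_cmd = bootstrap_command("111111111111", "us-east-1", "pipeline-profile")
dev_cmd = bootstrap_command("222222222222", "us-east-1", "dev-profile",
                            trust_account="111111111111")

print(" ".join(pipeline_cmd))
print(" ".join(dev_cmd))
```

Returning a list rather than a single string keeps the result suitable for passing to `subprocess.run` without shell quoting concerns.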

Deploy your AWS resources

Run the command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:

$ cdk deploy --profile <PIPELINE-PROFILE>

This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.

When the cdk deploy command is completed, let’s verify the pipeline using the pipeline account.

On the CodePipeline console, navigate to GluePipeline. Then verify that GluePipeline has the following stages: Source, Build, UpdatePipeline, Assets, DeployDev, and DeployProd. Also verify that the stages Source, Build, UpdatePipeline, Assets, and DeployDev have succeeded, and DeployProd is pending. It can take about 15 minutes.
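The same check can be scripted. The sketch below works on a dictionary shaped like the response of the CodePipeline `get_pipeline_state` API (`stageStates`, `stageName`, `latestExecution.status`); the sample data is hand-written rather than real output, the function names are hypothetical, and treating any non-succeeded DeployProd status as "pending" is a simplifying assumption.

```python
EXPECTED_STAGES = ["Source", "Build", "UpdatePipeline", "Assets", "DeployDev", "DeployProd"]


def summarize_stages(pipeline_state: dict) -> dict:
    """Map each stage name to its latest execution status."""
    return {
        stage["stageName"]: stage.get("latestExecution", {}).get("status", "NotStarted")
        for stage in pipeline_state.get("stageStates", [])
    }


def awaiting_prod_approval(pipeline_state: dict) -> bool:
    """True when every stage before DeployProd succeeded and DeployProd has not."""
    statuses = summarize_stages(pipeline_state)
    if list(statuses) != EXPECTED_STAGES:
        return False
    earlier_ok = all(statuses[name] == "Succeeded" for name in EXPECTED_STAGES[:-1])
    return earlier_ok and statuses["DeployProd"] != "Succeeded"


# Hand-written sample in the shape of a get_pipeline_state response.
sample_state = {
    "pipelineName": "GluePipeline",
    "stageStates": [
        {"stageName": name, "latestExecution": {"status": "Succeeded"}}
        for name in EXPECTED_STAGES[:-1]
    ] + [{"stageName": "DeployProd", "latestExecution": {"status": "InProgress"}}],
}

print(awaiting_prod_approval(sample_state))  # True
```

In a real script you would obtain the dictionary from the AWS SDK using the pipeline account credentials instead of the sample above.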

Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resource on the AWS CloudFormation console in the dev account.

At this step, the AWS Glue app stack is deployed only in the dev account. You can try to run the AWS Glue job ProcessLegislators to see how it works.

Configure your Git repository with CodeCommit

In an earlier step, you cloned the Git repository from GitHub. Although it’s possible to configure the AWS CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, for this post, we use CodeCommit. If you prefer those third-party Git providers, configure the connections and edit the pipeline stack so that the variable source uses the target Git provider through CodePipelineSource.

Because you already ran the cdk deploy command, the CodeCommit repository has already been created with all the required code and related files. The first step is to set up access to CodeCommit. The next step is to clone the repository from the CodeCommit repository to your local machine. Run the following commands:

$ mkdir aws-glue-cdk-baseline-codecommit
$ cd aws-glue-cdk-baseline-codecommit
$ git clone ssh://

In the next step, we make changes in this local copy of the CodeCommit repository.

End-to-end development lifecycle

Now that the environment has been successfully created, you’re ready to start developing a data integration pipeline using this baseline template. Let’s walk through the end-to-end development lifecycle.

When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this post, let’s assume the use case is to add a new AWS Glue job with a new job script that reads multiple S3 locations and joins them.

Implement and test in your local environment

First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code.

Set up your development environment by following the steps in Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container. The following steps are required in the context of this post:

  1. Start Docker.
  2. Pull the Docker image that has the local development environment using the AWS Glue ETL library:
    $ docker pull

  3. Run the following command to define the AWS named profile name:
    $ PROFILE_NAME="<DEV-PROFILE>"

  4. Run the following commands to make the workspace available with the baseline template:
    $ cd aws-glue-cdk-baseline/
    $ WORKSPACE_LOCATION=$(pwd)

  5. Run the Docker container:
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true \
    --rm -p 4040:4040 -p 18080:18080 \
    --name glue_pyspark pyspark

  6. Start Visual Studio Code.
  7. Choose Remote Explorer in the navigation pane, then choose the arrow icon of the workspace folder in the container.

If the workspace folder is not shown, choose Open folder and select /home/glue_user/workspace.

Then you will see a view similar to the following screenshot.

Optionally, you can install the AWS Toolkit for Visual Studio Code and start Amazon CodeWhisperer to enable code recommendations powered by a machine learning model. For example, in aws_glue_cdk_baseline/job_scripts/, you can put comments like “# Write a DataFrame in Parquet format to S3”, press the Enter key, then CodeWhisperer will recommend a code snippet similar to the following:

CodeWhisperer on Visual Studio Code

Now you install the required dependencies described in requirements.txt to the container environment.

  1. Run the following commands in the terminal in Visual Studio Code:
    $ pip install -r requirements.txt
    $ pip install -r requirements-dev.txt

  2. Implement the code.

Now let’s make the required changes for a new AWS Glue job here.

  3. Edit the stack definition file under aws_glue_cdk_baseline/. Add the following new code block after the existing job definition of ProcessLegislators to add the new AWS Glue job JoinLegislators:
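            # Assumes the stack file already imports `path` from os and the
            # aws-glue-alpha module as `glue`.
            self.new_glue_job = glue.Job(self, "JoinLegislators",
                executable=glue.JobExecutable.python_etl(
                    glue_version=glue.GlueVersion.V4_0,  # assumed; match the Glue version you target
                    python_version=glue.PythonVersion.THREE,
                    script=glue.Code.from_asset(
                        path.join(path.dirname(__file__), "job_scripts/join_legislators.py"))),
                description="a new example PySpark job",
                default_arguments={
                    "--input_path_orgs": config[stage]['jobs']['JoinLegislators']['inputLocationOrgs'],
                    "--input_path_persons": config[stage]['jobs']['JoinLegislators']['inputLocationPersons'],
                    "--input_path_memberships": config[stage]['jobs']['JoinLegislators']['inputLocationMemberships']
                },
                tags={
                    "environment": self.environment,
                    "artifact_id": self.artifact_id,
                    "stack_id": self.stack_id,
                    "stack_name": self.stack_name
                })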

Here, you added three job parameters for different S3 locations using the variable config. It is the dictionary generated from default-config.yaml. In this baseline template, we use this central config file for managing parameters for all the Glue jobs in the structure <stage name>/jobs/<job name>/<parameter name>. In the following steps, you provide those locations through the AWS Glue job parameters.
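The lookup structure described above can be sketched in plain Python. This is only an illustration of the <stage name>/jobs/<job name>/<parameter name> convention: the helper `job_parameter` is hypothetical and not part of the baseline template, and the S3 paths are placeholders.

```python
# Placeholder configuration following <stage name>/jobs/<job name>/<parameter name>.
config = {
    "dev": {
        "jobs": {
            "JoinLegislators": {
                "inputLocationOrgs": "s3://path_to_data_orgs",
                "inputLocationPersons": "s3://path_to_data_persons",
                "inputLocationMemberships": "s3://path_to_data_memberships",
            }
        }
    }
}


def job_parameter(config: dict, stage: str, job_name: str, parameter_name: str) -> str:
    """Look up one job parameter and fail with a readable path when it is missing."""
    try:
        return config[stage]["jobs"][job_name][parameter_name]
    except KeyError as missing:
        raise KeyError(
            f"{stage}/jobs/{job_name}/{parameter_name} is not defined (missing {missing})"
        ) from None


print(job_parameter(config, "dev", "JoinLegislators", "inputLocationOrgs"))
# s3://path_to_data_orgs
```

Failing with the full path makes it easier to spot a stage (for example, prod) where a newly added job parameter was forgotten.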

  4. Create a new job script called aws_glue_cdk_baseline/job_scripts/join_legislators.py:
    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions


    class JoinLegislators:
        def __init__(self):
            # Resolve job parameters only when the script runs as an AWS Glue job.
            params = []
            if '--JOB_NAME' in sys.argv:
                params.append('JOB_NAME')
                params.append('input_path_orgs')
                params.append('input_path_persons')
                params.append('input_path_memberships')
            args = getResolvedOptions(sys.argv, params)

            self.context = GlueContext(SparkContext.getOrCreate())
            self.job = Job(self.context)

            if 'JOB_NAME' in args:
                jobname = args['JOB_NAME']
                self.input_path_orgs = args['input_path_orgs']
                self.input_path_persons = args['input_path_persons']
                self.input_path_memberships = args['input_path_memberships']
            else:
                # Defaults for local runs outside AWS Glue.
                jobname = "test"
                self.input_path_orgs = "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
                self.input_path_persons = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
                self.input_path_memberships = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
            self.job.init(jobname, args)

        def run(self):
            dyf = join_legislators(self.context, self.input_path_orgs, self.input_path_persons, self.input_path_memberships)
            df = dyf.toDF()
            # Inspect the joined result.
            df.printSchema()
            df.show()


    def read_dynamic_frame_from_json(glue_context, path):
        return glue_context.create_dynamic_frame.from_options(
            connection_type='s3',
            connection_options={
                'paths': [path],
                'recurse': True
            },
            format='json'
        )


    def join_legislators(glue_context, path_orgs, path_persons, path_memberships):
        orgs = read_dynamic_frame_from_json(glue_context, path_orgs)
        persons = read_dynamic_frame_from_json(glue_context, path_persons)
        memberships = read_dynamic_frame_from_json(glue_context, path_memberships)
        orgs = orgs.drop_fields(['other_names', 'identifiers']).rename_field('id', 'org_id').rename_field('name', 'org_name')
        # Join persons with memberships first, then join the result with organizations.
        dynamicframe_joined = Join.apply(orgs, Join.apply(persons, memberships, 'id', 'person_id'), 'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
        return dynamicframe_joined


    if __name__ == '__main__':
        JoinLegislators().run()

  5. Create a new unit test script for the new AWS Glue job called aws_glue_cdk_baseline/job_scripts/tests/
    import pytest
    import sys
    import join_legislators
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions


    @pytest.fixture(scope="module", autouse=True)
    def glue_context():
        # getResolvedOptions requires JOB_NAME on the command line.
        sys.argv.append('--JOB_NAME')
        sys.argv.append('test_count')

        args = getResolvedOptions(sys.argv, ['JOB_NAME'])
        context = GlueContext(SparkContext.getOrCreate())
        job = Job(context)
        job.init(args['JOB_NAME'], args)

        yield context

        job.commit()


    def test_counts(glue_context):
        dyf = join_legislators.join_legislators(glue_context,
            "s3://awsglue-datasets/examples/us-legislators/all/organizations.json",
            "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
            "s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
        assert dyf.toDF().count() == 10439

  6. In default-config.yaml, add the following under jobs for both prod and dev:
        JoinLegislators:
          inputLocationOrgs: "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
          inputLocationPersons: "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
          inputLocationMemberships: "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"

  7. Add the following under "jobs" in the variable config in tests/unit/, tests/unit/, and tests/snapshot/ (no need to replace S3 locations):
                "JoinLegislators": {
                    "inputLocationOrgs": "s3://path_to_data_orgs",
                    "inputLocationPersons": "s3://path_to_data_persons",
                    "inputLocationMemberships": "s3://path_to_data_memberships"
                }

  8. Choose Run at the top right to run the individual job scripts.

If the Run button is not shown, install Python into the container through Extensions in the navigation pane.

  9. For local unit testing, run the following command in the terminal in Visual Studio Code:
    $ cd aws_glue_cdk_baseline/job_scripts/
    $ python3 -m pytest

Then you can verify that the newly added unit test passed successfully.

  10. Run pytest to initialize the snapshot test files by running the following command:
    $ cd ../../
    $ python3 -m pytest --snapshot-update
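If snapshot testing is new to you, the idea behind the --snapshot-update step can be sketched in a few lines. This is a conceptual illustration only, not how the pytest snapshot plugin is implemented: the function `matches_snapshot`, the file name, and the template content are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def matches_snapshot(template: dict, snapshot_file: Path, update: bool = False) -> bool:
    """Compare a rendered template with the stored snapshot, or (re)write the snapshot."""
    rendered = json.dumps(template, indent=2, sort_keys=True)
    if update or not snapshot_file.exists():
        snapshot_file.write_text(rendered)
        return True
    return snapshot_file.read_text() == rendered


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "glue_app_stack.json"
    template = {"Resources": {"JoinLegislators": {"Type": "AWS::Glue::Job"}}}

    first = matches_snapshot(template, snapshot, update=True)   # record the snapshot
    template["Resources"]["JoinLegislators"]["Properties"] = {"Description": "changed"}
    second = matches_snapshot(template, snapshot)               # change is detected
    third = matches_snapshot(template, snapshot, update=True)   # accept the change

print(first, second, third)  # True False True
```

This is why you rerun the update step after adding a job: the generated template changed on purpose, so the stored snapshot has to be refreshed before the snapshot test can pass again.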

Deploy to the development environment

Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:

  1. Set up access to CodeCommit.
  2. Commit and push your changes to the CodeCommit repo:
    $ git add .
    $ git commit -m "Add the second Glue job"
    $ git push

You can see that the pipeline is successfully triggered.

Integration test

There is nothing required for running the integration test for the newly added AWS Glue job. The integration test script runs all the jobs that include a specific tag, then verifies the state and its duration. If you want to change the condition or the threshold, you can edit assertions at the end of the integ_test_glue_job method.
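To give a feel for what such assertions can look like, here is a minimal sketch. It is not the template's integ_test_glue_job method: the function `check_job_run` and the 600-second threshold are hypothetical, and the sample dictionary is hand-written using field names from the AWS Glue JobRun structure (`JobName`, `JobRunState`, `ExecutionTime` in seconds).

```python
def check_job_run(job_run: dict, max_seconds: int = 600) -> None:
    """Assert that a finished job run succeeded within the allowed duration."""
    state = job_run["JobRunState"]
    assert state == "SUCCEEDED", f"{job_run['JobName']} finished as {state}"
    duration = job_run["ExecutionTime"]
    assert duration < max_seconds, (
        f"{job_run['JobName']} took {duration}s, limit is {max_seconds}s"
    )


# Hand-written sample; a real test would fetch this from the AWS Glue API.
sample_run = {"JobName": "JoinLegislators", "JobRunState": "SUCCEEDED", "ExecutionTime": 95}

check_job_run(sample_run)
print("integration checks passed")
```

Changing the condition or the threshold then amounts to editing the two assertions, for example by passing a larger `max_seconds` for jobs that process more data.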

Deploy to the production environment

Complete the following steps to deploy the AWS Glue app stack to the production environment:

  1. On the CodePipeline console, navigate to GluePipeline.
  2. Choose Review under the DeployProd stage.
  3. Choose Approve.

Wait for the DeployProd stage to complete, then you can verify the AWS Glue app stack resource in the prod account.

Clean up

To clean up your resources, complete the following steps:

  1. Run the following command using the pipeline account:
    $ cdk destroy --profile <PIPELINE-PROFILE>

  2. Delete the AWS Glue app stack within the dev account and prod account.


In this post, you learned how to define the development lifecycle for data integration and how software engineers and data engineers can design an end-to-end development lifecycle using AWS Glue, including development, testing, and CI/CD, through a sample AWS CDK template. You can get started building your own end-to-end development lifecycle for your workload using AWS Glue.

About the author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.
