AWS Glue Development Environment

Updated: Oct 8


We have built a complete ETL pipeline and data warehouse using AWS Glue and AWS S3 services for EdCast. In this blog, we will share the key learnings from that experience.


Glue Development Environment

Glue ETL scripts can be developed and tested in several ways. The most prominent options are:

  1. Using a development endpoint and a notebook (AWS hosted)

  2. Using a development endpoint and a Zeppelin notebook server in the local environment

  3. Local development using the Glue ETL library


Using development endpoint and notebooks (remote)

An AWS development endpoint is a managed (paid) Glue environment for developing and testing ETL scripts. It includes Apache Spark and the Glue libraries, along with network configuration that allows secure access to the environment from a Jupyter notebook.

AWS supports launching an EC2 machine with a Jupyter Notebook server. The notebook can be used to interactively author and test the ETL scripts that will be used in Glue jobs.


For more information on development endpoints:

https://docs.aws.amazon.com/glue/latest/dg/console-development-endpoint.html


Pros

  • Easy to launch and use

  • Since the development endpoint is similar to the actual AWS Glue environment, it's easy to develop and test in a production-like environment.

Cons

  • Expensive: the dev endpoint costs $1500 per month (as of 03/22/2020), plus the cost of the EC2 machine that hosts the notebook server

  • Requires an internet connection to the notebook server for development

  • No easy way to write unit test cases
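
To see how a monthly figure of this size can arise, here is a rough back-of-the-envelope sketch. The rate of $0.44 per DPU-hour and the endpoint size of 5 DPUs are assumptions for illustration only, not numbers taken from our bill, and the function name is hypothetical. Check the current AWS Glue pricing page for your region before relying on it.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month with the endpoint always on


def monthly_endpoint_cost(dpus: int, usd_per_dpu_hour: float,
                          hours: float = HOURS_PER_MONTH) -> float:
    """Return the approximate monthly cost in USD, rounded to cents."""
    return round(dpus * usd_per_dpu_hour * hours, 2)


# Assumed values: 5 DPUs at $0.44 per DPU-hour, running all month.
always_on = monthly_endpoint_cost(dpus=5, usd_per_dpu_hour=0.44)

# Same assumed endpoint, kept up only 8 hours a day for 22 working days.
working_hours = monthly_endpoint_cost(dpus=5, usd_per_dpu_hour=0.44, hours=8 * 22)

print(always_on, working_hours)
```

Under these assumptions the always-on estimate lands in the same range as the $1500 quoted above, and the cost scales linearly with the hours the endpoint is kept running.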

Using development endpoint and Zeppelin notebook server in local environment

This option is the same as the one above, except that the Zeppelin notebook server runs in the local environment and connects to the development endpoint via an SSH tunnel. To use this option, the development endpoint must be updated with user-specific SSH public keys (RSA). This can be done through the "Rotate SSH Keys" option on the dev endpoint home page.
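
As a minimal sketch of the SSH tunnel described above, the helper below only assembles the `ssh` command line; it does not open the connection. The forwarded port (9007), the remote address (169.254.76.1) and the `glue` user are the values we recall from the AWS Zeppelin tutorial, so treat them as assumptions and confirm them against the AWS documentation. The key path, host name and function name are hypothetical placeholders.

```python
import shlex
from typing import List


def build_tunnel_command(private_key_path: str, endpoint_dns: str,
                         local_port: int = 9007,
                         remote_host: str = "169.254.76.1",
                         remote_port: int = 9007) -> List[str]:
    """Build the argv for an SSH local port forward to the dev endpoint."""
    forward = f"{local_port}:{remote_host}:{remote_port}"
    return [
        "ssh",
        "-i", private_key_path,  # private half of the key registered on the endpoint
        "-N",                    # do not run a remote command
        "-T",                    # do not allocate a terminal
        "-L", forward,           # forward local_port to remote_host:remote_port
        f"glue@{endpoint_dns}",  # assumed login user for dev endpoints
    ]


# Hypothetical key path and endpoint address, for illustration only.
cmd = build_tunnel_command("/home/me/.ssh/glue_dev_rsa",
                           "dev-endpoint-public-dns.example.com")
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` to keep the tunnel open while the local notebook server is in use.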


The Docker version of Zeppelin 0.8.2 doesn't work with the Glue development endpoint due to some bugs, so we installed Zeppelin locally from the binary distribution and used that.


Pros

  • No need to have a separate notebook server

  • Easy to use on a local machine (with the AWS-hosted notebook we observed a flaky UI due to a poor network connection)

Cons

  • Still expensive: the dev endpoint costs $1500 per month

  • Slow development, as the dev endpoint runs remotely

  • No easy way to write unit test cases

Local development using the ETL library

AWS has recently released the AWS Glue libraries, which can be used to set up a local development environment. This helps integrate Glue ETL jobs with the Maven build system for building and testing.


ETL development can be done using a Zeppelin server, or an IDE such as PyCharm (Professional 2019.3) or Visual Studio Code. We write our ETL scripts in PySpark, so we use PyCharm to develop them. PyCharm lets us run and debug the job scripts locally, and also debug issues remotely.


Pros

  • Easy to use and faster development and testing

  • Cheaper

  • Unit testing can be done
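
As an illustration of the unit-testing point, one possible approach is to keep record-level logic in a plain function and test it without Spark. This is a sketch, not our production code: the `email` and `has_email` fields and all function names are hypothetical, and it assumes Glue records behave like dictionaries. `Map.apply` is the Glue transform that applies a function to every record of a DynamicFrame.

```python
import unittest


def normalize_user(rec):
    """Trim and lower-case the email field and add a `has_email` flag.

    Works on any dict-like record, so it can be tested with a plain dict.
    """
    email = (rec.get("email") or "").strip().lower()
    rec["email"] = email
    rec["has_email"] = bool(email)
    return rec


def apply_to_dynamic_frame(dynamic_frame):
    """Apply normalize_user to every record of a Glue DynamicFrame.

    The import is local so the tests below run without awsglue installed.
    """
    from awsglue.transforms import Map
    return Map.apply(frame=dynamic_frame, f=normalize_user)


class NormalizeUserTest(unittest.TestCase):
    def test_email_is_trimmed_and_lowercased(self):
        rec = normalize_user({"id": 1, "email": "  Alice@Example.COM "})
        self.assertEqual(rec["email"], "alice@example.com")
        self.assertTrue(rec["has_email"])

    def test_missing_email_is_flagged(self):
        rec = normalize_user({"id": 2, "email": None})
        self.assertEqual(rec["email"], "")
        self.assertFalse(rec["has_email"])
```

Saved as a module, the tests can be run with `python -m unittest`; only `apply_to_dynamic_frame` needs the Glue libraries.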

For more information on the steps to set up and run Glue jobs locally:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html

In the next post, we will explain the steps required to set up PyCharm with the Glue ETL library for local debugging.


© SquareShift Technologies Pte. Ltd.
