[SOLVED] Mastering AWS Glue with NumPy and Pandas: A Comprehensive Guide

In this chapter, we look at how to use AWS Glue with the Python libraries NumPy and Pandas. AWS Glue is a managed ETL service, which means it helps us extract, transform, and load data for analysis. By bringing in NumPy and Pandas, we can do complex data work inside AWS Glue much more easily. This guide shows the main steps for using these tools together, so we can manage our data tasks in the cloud better.

Solutions We Will Discuss:

  • Part 1 - Setting Up Your AWS Glue Environment
  • Part 2 - Installing NumPy and Pandas in AWS Glue
  • Part 3 - Creating a Glue Job with NumPy and Pandas
  • Part 4 - Reading Data with Pandas in AWS Glue
  • Part 5 - Performing Data Transformations Using NumPy
  • Part 6 - Writing Data Back to S3 with Pandas
  • Frequently Asked Questions

By the end of this chapter, we will understand how to use AWS Glue with NumPy and Pandas and improve our data processing skills. If you want more information, you can check our guides on how to set up AWS Lambda and how to connect to Amazon EC2 to learn more about other AWS services.

Part 1 - Setting Up Your AWS Glue Environment

We need to set up our AWS Glue environment before we can use it with NumPy and Pandas. Let’s follow these steps:

  1. Create an AWS Account: If we do not have an AWS account, we can create one at the AWS Free Tier.

  2. Access AWS Glue:

    • We sign in to the AWS Management Console.
    • We find the AWS Glue service.
  3. Create a Database:

    • In the AWS Glue console, we go to “Databases”.
    • We click on “Add database”.
    • We give a name for our database and click “Create”.
  4. Create an IAM Role:

    • We go to the IAM console.
    • We select “Roles” and click “Create role”.
    • We choose “AWS Service” and select “Glue”.
    • We attach policies for access to S3 (like AmazonS3FullAccess) and any other services we need.
    • We name the role and create it.
  5. Configure AWS Glue:

    • In the AWS Glue console, we go to “Settings”.
    • We set the Glue version (we choose Glue 2.0 or later, which runs Spark 2.4+ and Python 3, for working with Pandas and NumPy).
    • We specify a default database and the role we made before.
  6. Create a Glue Crawler (Optional):

    • If we want to organize our data, we can create a crawler.
    • We set it up to access our S3 bucket and check for data formats.
    • We run the crawler to fill the Glue Data Catalog.
  7. Set Up Glue Development Endpoints (Optional):

    • We go to “Dev endpoints” in the AWS Glue console.
    • We click “Add endpoint”.
    • We select the IAM role and choose the security group.
    • We pick the Glue version and specify how many worker nodes we want. (On newer Glue versions, interactive sessions are the recommended replacement for development endpoints.)

By following these steps, we will have set up our AWS Glue environment. Now we can use NumPy and Pandas for our data tasks. For more details, we can look at this AWS Glue setup guide.
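
If we prefer to script this setup instead of clicking through the console, here is a minimal sketch using boto3; the database and role names (my_glue_db, MyGlueRole) are placeholders, and we assume our AWS credentials are already configured:

    import boto3

    # Create a Glue Data Catalog database (same as "Add database" in the console).
    glue = boto3.client("glue")
    glue.create_database(
        DatabaseInput={
            "Name": "my_glue_db",  # hypothetical database name
            "Description": "Database for NumPy/Pandas Glue jobs",
        }
    )

    # Verify that the IAM role we created is visible.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="MyGlueRole")  # hypothetical role name
    print(role["Role"]["Arn"])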

Part 2 - Installing NumPy and Pandas in AWS Glue

To use NumPy and Pandas in AWS Glue, we need to make these packages available to our jobs. AWS Glue can load custom Python libraries from .zip archives or Python wheels stored in S3. Let’s follow these steps to package and install NumPy and Pandas for our AWS Glue environment.

  1. Create a Requirements File: First, we create a requirements.txt file. This file includes the libraries we want to install. For example:

    numpy
    pandas
  2. Package Libraries: Next, we use pip to install our requirements into a local folder. We can run this command:

    pip install -r requirements.txt -t ./python

    This command puts NumPy and Pandas into the python folder.

  3. Create a Zip File: Now, we need to zip the python folder:

    zip -r numpy_pandas.zip python
  4. Upload to S3: Then, we upload the numpy_pandas.zip file to an S3 bucket. We can do this using the AWS Management Console or the AWS CLI:

    aws s3 cp numpy_pandas.zip s3://your-bucket-name/
  5. Configure AWS Glue Job: When we create or change our Glue job, we need to tell it where to find the numpy_pandas.zip file. We do this in the “Python library path” section. This will let AWS Glue use the libraries when it runs the job.

  6. Set Glue Job Parameters: Finally, we make sure our Glue job has these parameters in the job settings:

    • Job Type: Spark
    • Python Version: Python 3
    • Glue Version: Choose the right version (like Glue 2.0 or above).

After doing these steps, our AWS Glue job can use NumPy and Pandas for data processing tasks. For more details, we can check the AWS Glue documentation on using libraries.
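
On Glue 2.0 and later, recent builds of NumPy and Pandas already ship with the runtime, and the documented way to add or upgrade Python packages is the --additional-python-modules job parameter, which Glue pip-installs when the job starts. Here is a minimal sketch that sets it while creating a job with boto3 (the job name, role, and script path are placeholders):

    import boto3

    glue = boto3.client("glue")
    glue.create_job(
        Name="numpy-pandas-job",   # hypothetical job name
        Role="MyGlueRole",         # hypothetical IAM role from Part 1
        Command={
            "Name": "glueetl",     # Spark ETL job type
            "ScriptLocation": "s3://your-bucket-name/scripts/job.py",
            "PythonVersion": "3",
        },
        GlueVersion="3.0",
        DefaultArguments={
            # Version pins (for example numpy==1.21.6) can be added if needed.
            "--additional-python-modules": "numpy,pandas",
        },
    )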

If we want more info on AWS Glue, we can look at this guide on running AWS Glue jobs.

Part 3 - Creating a Glue Job with NumPy and Pandas

To make an AWS Glue job that uses NumPy and Pandas, we need to follow these steps:

  1. Navigate to AWS Glue Console:

    • We sign in to the AWS Management Console and open the AWS Glue service.
  2. Create a New Job:

    • Select Jobs on the left side.
    • Click on Add job.
  3. Configure Job Properties:

    • Name: Write a name for your job.
    • IAM Role: Pick an IAM role that can access the needed resources like S3.
    • Type: Choose “Spark” as the job type.
    • Glue Version: Pick the right Glue version like Glue 2.0 or later.
  4. Script Libraries and Job Parameters:

    • In the Script libraries part, add the S3 path for the libraries we packaged. If you uploaded the archive from Part 2, you can use:
      • s3://your-bucket-name/numpy_pandas.zip
  5. Select a Glue Data Source:

    • Choose the data catalog source or specify a data source from S3.
  6. Job Script:

    • Write your job script in the Script area. Here is an example that shows how we can use Pandas and NumPy:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext
    import pandas as pd
    import numpy as np
    
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glueContext = GlueContext(SparkContext.getOrCreate())
    spark = glueContext.spark_session
    
    # Read data from the AWS Glue Data Catalog
    datasource = glueContext.create_dynamic_frame.from_catalog(
        database="your_database", table_name="your_table"
    )
    
    # Convert to a Pandas DataFrame (this collects all rows to the driver)
    df = datasource.toDF().toPandas()
    
    # Perform operations using NumPy and Pandas
    df['new_column'] = np.where(df['existing_column'] > 0, 'Positive', 'Negative')
    
    # Convert the Pandas result back to a Spark DataFrame, then to a DynamicFrame
    dynamic_frame = DynamicFrame.fromDF(spark.createDataFrame(df), glueContext, "dynamic_frame")
    
    # Write the result back to S3 in Parquet format
    glueContext.write_dynamic_frame.from_options(
        dynamic_frame,
        connection_type="s3",
        connection_options={"path": "s3://your-bucket/output/"},
        format="parquet",
    )
  7. Save and Run the Job:

    • After we write the script, we click Save.
    • We can run the job by selecting it in the Jobs list and clicking Run Job, or start it from code as shown below.
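
If we prefer to start the job programmatically instead of using the console, here is a minimal sketch with boto3 (the job name numpy-pandas-job is a placeholder):

    import boto3

    glue = boto3.client("glue")

    # Start the job and check its status. JobName must match the job we created.
    run = glue.start_job_run(JobName="numpy-pandas-job")  # hypothetical name
    status = glue.get_job_run(JobName="numpy-pandas-job", RunId=run["JobRunId"])
    print(status["JobRun"]["JobRunState"])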

By doing these steps, we can create a Glue job that uses NumPy and Pandas for data processing in AWS Glue. For more information on setting up your environment, check Part 1 - Setting Up Your AWS Glue Environment.

For more details on job settings, we can look at Part 2 - Installing NumPy and Pandas in AWS Glue.

Part 4 - Reading Data with Pandas in AWS Glue

We can read data with Pandas in AWS Glue by using GlueContext and Spark DataFrames, and then converting the result into a Pandas DataFrame. Here are the steps we need to follow:

  1. Initialize GlueContext: First, we need to create a GlueContext in our job script.

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext
    
    spark_context = SparkContext()
    glue_context = GlueContext(spark_context)
  2. Read Data from S3: Next, we can use the create_dynamic_frame.from_options method. This helps us read data from S3 or other sources.

    datasource = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://your-bucket-name/path/to/data"]},
        format="csv",
        format_options={"withHeader": True}
    )
  3. Convert to Spark DataFrame: Now, we convert the DynamicFrame to a Spark DataFrame. This is for more processing.

    spark_dataframe = datasource.toDF()
  4. Convert to Pandas DataFrame: After that, we convert the Spark DataFrame to a Pandas DataFrame. This makes it easier to manipulate data.

    import pandas as pd
    
    pandas_dataframe = spark_dataframe.toPandas()
  5. Using the Pandas DataFrame: Now we can do many operations with the Pandas package.

    # Example operation: Display first few rows
    print(pandas_dataframe.head())
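
One caution: toPandas() collects the whole dataset into the driver’s memory, so it only suits data that fits on a single node. For a quick look at a large dataset, we can limit the Spark DataFrame before converting; a small sketch:

    # Convert only the first 100,000 rows so the driver does not have to
    # hold the full dataset in memory.
    sample_dataframe = spark_dataframe.limit(100000).toPandas()
    print(sample_dataframe.shape)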

Make sure the needed libraries, NumPy and Pandas, are available in our AWS Glue job. We can check Part 2 - Installing NumPy and Pandas in AWS Glue for more info on how to set up the environment.

This method lets us read and work with data using Pandas inside an AWS Glue job, combining the strengths of AWS Glue and Pandas. For more details on AWS Glue and data handling, we can look at how to use AWS Glue with NumPy and Pandas.

Part 5 - Performing Data Transformations Using NumPy

In this part, we will learn how to transform data using NumPy in AWS Glue. First, we need to make sure our Glue job has the NumPy library, along with awswrangler (the AWS SDK for pandas), which we use below to read from and write to S3; if it is not preinstalled in our Glue version, we install it the same way as NumPy and Pandas in Part 2. After our setup is done, we can use NumPy for fast numeric calculations and transformations.

Here are the steps to do data transformations with NumPy:

  1. Import Necessary Libraries: First, we need to import NumPy and other libraries for our Glue job.

    import numpy as np
    import pandas as pd
    import awswrangler as wr
  2. Read Data into a Pandas DataFrame: We will use AWS Wrangler to read data from an S3 bucket into a Pandas DataFrame.

    df = wr.s3.read_csv('s3://your-bucket/path/to/data.csv')
  3. Convert DataFrame to NumPy Array: Next, we convert the numeric columns of the DataFrame to a NumPy array for transformation (the normalization below only makes sense for numeric data).

    data_array = df.select_dtypes(include=[np.number]).to_numpy()
  4. Perform Transformations: Now we can use NumPy functions to change our data. We can normalize it, reshape it, or do other math operations.

    # Example: Normalize the data
    normalized_array = (data_array - np.mean(data_array, axis=0)) / np.std(data_array, axis=0)
    
    # Example: Reshape the array if needed
    reshaped_array = normalized_array.reshape(-1, 1)  # Reshape to a column vector
  5. Convert Back to DataFrame: After we finish the changes, we will turn the NumPy array back into a Pandas DataFrame for more work.

    transformed_df = pd.DataFrame(reshaped_array, columns=['Normalized Data'])
  6. Save Transformed Data: Finally, we will save the changed DataFrame back to S3 using AWS Wrangler.

    wr.s3.to_csv(transformed_df, 's3://your-bucket/path/to/transformed_data.csv')
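
As an alternative to the reshape step, we can also normalize directly in Pandas and keep the original column names; a small sketch under the same assumptions (numeric columns only):

    # Column-wise z-score normalization that preserves column names.
    # Note: Pandas' std() uses ddof=1 (sample standard deviation), while
    # np.std defaults to ddof=0, so results can differ slightly.
    numeric_df = df.select_dtypes(include=[np.number])
    normalized_df = (numeric_df - numeric_df.mean()) / numeric_df.std()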

By following these steps, we can easily do data transformations using NumPy in AWS Glue. For more info on how to set up your AWS Glue or how to install NumPy and Pandas, check Part 1 - Setting Up Your AWS Glue Environment and Part 2 - Installing NumPy and Pandas in AWS Glue.

Part 6 - Writing Data Back to S3 with Pandas

We can write data back to Amazon S3 using Pandas in AWS Glue with the to_csv() or to_parquet() methods; which one we choose depends on the file format we want. Note that writing directly to an s3:// path from Pandas relies on the s3fs package (and pyarrow or fastparquet for Parquet), so these must be available in the job. Here is a simple guide to help us do this.

  1. Set Up Your S3 Bucket: First, we need to have an S3 bucket ready. This is where we will write the data. Let us note down the bucket name and where we want to save the file.

  2. Import Required Libraries: Next, we should import the libraries we need for our AWS Glue job.

    import pandas as pd
    import boto3
  3. Create a Pandas DataFrame: After we change our data, we will create a DataFrame. This is the data we want to write to S3.

    data = {'column1': [1, 2, 3], 'column2': ['A', 'B', 'C']}
    df = pd.DataFrame(data)
  4. Write DataFrame to S3: Now we will use the to_csv() or to_parquet() method. This will help us write the DataFrame to S3. We must make sure our Glue job has the right IAM permissions to write to S3.

    # Writing as CSV
    output_bucket = 'your-s3-bucket-name'
    output_path = 's3://{}/path/to/output.csv'.format(output_bucket)
    df.to_csv(output_path, index=False)
    
    # Or Writing as Parquet
    output_path_parquet = 's3://{}/path/to/output.parquet'.format(output_bucket)
    df.to_parquet(output_path_parquet, index=False)
  5. AWS Glue Configuration: When we run our Glue job, we need to make sure it can access the S3 bucket. We can set this in the job settings under “IAM role”. Here, we can attach the needed policies.
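
If s3fs is not available in the job, we can upload through boto3 instead (which is why the example imports it): we serialize the DataFrame into an in-memory buffer and upload the bytes. Here is a minimal sketch with the same placeholder bucket:

    import io
    import boto3

    # Serialize the DataFrame to an in-memory CSV buffer.
    csv_buffer = io.StringIO()
    df.to_csv(csv_buffer, index=False)

    # Upload the buffer contents with boto3 (no s3fs dependency needed).
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="your-s3-bucket-name",
        Key="path/to/output.csv",
        Body=csv_buffer.getvalue(),
    )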

For more details about using S3 with AWS Glue, we can check this AWS documentation.

By following these steps, we can write our transformed data back to S3 using Pandas in AWS Glue and access it later for more analysis or processing.

Frequently Asked Questions

1. How can we use AWS Glue with NumPy and Pandas?

To use AWS Glue with NumPy and Pandas, we first set up our AWS Glue environment and install these packages. After that, we can create a Glue job that uses these libraries for data tasks such as transformations and analytics. For more help, see our article on setting up your AWS Glue environment.

2. What are the benefits of using Pandas in AWS Glue?

Pandas is a powerful data tool that makes data tasks simple. Using Pandas in AWS Glue simplifies data loading, transformation, and analysis. It is great for data scientists and analysts who need to handle data efficiently. For more details on using Pandas in AWS Glue, look at our section on reading data with Pandas.

3. Can we perform data transformations using NumPy in AWS Glue?

Yes, we can perform data transformations using NumPy in AWS Glue. NumPy gives us array operations and math functions that help with data tasks. By using NumPy in our Glue jobs, we can work with large datasets efficiently. For step-by-step help, see our section on performing data transformations using NumPy.

4. How do we write data back to S3 using Pandas in AWS Glue?

To write data back to Amazon S3 using Pandas in AWS Glue, we use the to_csv() or to_parquet() methods after we finish processing our DataFrame. We can set the S3 path in these methods to save our output, which combines data processing and storage smoothly. For more details, see our instructions on writing data back to S3 with Pandas.

5. What troubleshooting tips should we follow when using AWS Glue with Python packages?

When we use AWS Glue with Python packages like NumPy and Pandas, we must make sure our Glue job has the right permissions to access these libraries and data sources. We should also check for version conflicts between the libraries. If we see errors, read our guide on how to handle errors with AWS Glue for troubleshooting tips.
