Transcription of AWS Glue - Developer Guide
AWS Glue Developer Guide

Copyright 2019 Amazon Web Services, Inc. and/or its affiliates. All rights reserved.

Amazon's trademarks and trade dress may not be used in connection with any product or service that is not Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may or may not be affiliated with, connected to, or sponsored by Amazon.

Table of Contents

What Is AWS Glue?
When Should I Use AWS Glue?
How It Works
Serverless ETL Jobs Run in Isolation
Concepts
AWS Glue Terminology
AWS Glue Console
AWS Glue Data Catalog
AWS Glue Crawlers and Classifiers
AWS Glue ETL Operations
The AWS Glue Jobs
Converting Semi-Structured Schemas to Relational Schemas
Getting Started
Setting up IAM Permissions for AWS Glue
Step 1: Create an IAM Policy for the AWS Glue Service
Step 2: Create an IAM Role for AWS Glue
Step 3: Attach a Policy to IAM Users That Access AWS Glue
Step 4: Create an IAM Policy for Notebook Servers
Step 5: Create an IAM Role for Notebook Servers
Step 6: Create an IAM Policy for Amazon SageMaker Notebooks
Step 7: Create an IAM Role for Amazon SageMaker Notebooks
Setting Up DNS in Your VPC
Setting Up Your Environment to Access Data Stores
Amazon VPC Endpoints for Amazon S3
Setting Up a VPC to Connect to JDBC Data Stores
Setting Up Your Environment for Development Endpoints
Setting Up Your Network for a Development Endpoint
Setting Up Amazon EC2 for a Notebook Server
Setting Up Encryption
Console Workflow Overview
Security
Authentication and Access Control
Access-Control Overview
Cross-Account Access
Resource ARNs
Policy Examples
API Permissions Reference
Encryption and Secure Access
Encrypting Your Data Catalog
Encrypting Connection Passwords
Encrypting Data Written by AWS Glue
Populating the AWS Glue Data Catalog
Defining a Database in Your Data Catalog
Working with Databases on the Console
Defining Tables in the AWS Glue Data Catalog
Table Partitions
Working with Tables on the Console
Adding a Connection to Your Data Store
When Is a Connection Used?
Defining a Connection in the AWS Glue Data Catalog
Connecting to a JDBC Data Store in a VPC
Working with Connections on the Console
Cataloging Tables with a Crawler
Defining a Crawler in the AWS Glue Data Catalog
Which Data Stores Can I Crawl?
Using Include and Exclude Patterns
What Happens When a Crawler Runs?
Are Amazon S3 Folders Created as Tables or Partitions?
Configuring a Crawler
Scheduling a Crawler
Working with Crawlers on the Console
Adding Classifiers to a Crawler
When Do I Use a Classifier?
Custom
Built-In Classifiers in AWS Glue
Writing Custom Classifiers
Working with Classifiers on the Console
Working with Data Catalog Settings on the AWS Glue Console
Populating the Data Catalog Using AWS CloudFormation Templates
Sample
Sample Database, Table, Partitions
Sample Grok Classifier
Sample JSON Classifier
Sample XML
Sample Amazon S3 Crawler
Sample Connection
Sample JDBC Crawler
Sample Job for Amazon S3 to Amazon S3
Sample Job for JDBC to Amazon S3
Sample On-Demand Trigger
Sample Scheduled Trigger
Sample Conditional Trigger
Sample Development Endpoint
Authoring
Workflow Overview
Adding Jobs
Defining Job Properties
Built-In Transforms
Jobs on the
Editing Scripts
Defining a
Scripts on the
Providing Your Own Custom Scripts
Triggering Jobs
Triggering Jobs Based on Schedules or Events
Defining Trigger Types
Working with Triggers on the Console
Using Development Endpoints
Managing the Environment
Using a Dev
Accessing Your Dev Endpoint
Development Endpoints on the Console
Tutorial Prerequisites
Tutorial: Local Zeppelin Notebook
Tutorial: Amazon EC2 Zeppelin Notebook Server
Tutorial: Use a REPL Shell
Tutorial: Use PyCharm Professional
Managing
Notebook Server Considerations
Notebooks on the
Running and
Automated Tools
Time-Based Schedules for Jobs and Crawlers
Cron Expressions
Job
Using Job
Using an AWS Glue Script
Using Modification
Automating with CloudWatch Events
Monitoring with Amazon CloudWatch
Using CloudWatch Metrics
Setting Up Amazon CloudWatch Alarms on AWS Glue Job Profiles
Job Monitoring and
Debugging OOM Exceptions and Job Abnormalities
Debugging Demanding Stages and Straggler Tasks
Monitoring the Progress of Multiple Jobs
Monitoring for DPU Capacity Planning
Logging Using CloudTrail
AWS Glue Information in CloudTrail
Understanding AWS Glue Log File Entries
Troubleshooting
Gathering AWS Glue Troubleshooting Information
Troubleshooting Connection Issues
Troubleshooting Errors
Error: Resource Unavailable
Error: Could Not Find S3 Endpoint or NAT Gateway for subnetId in VPC
Error: Inbound Rule in Security Group Required
Error: Outbound Rule in Security Group Required
Error: Custom DNS Resolution Failures
Error: Job Run Failed Because the Role Passed Should Be Given Assume Role Permissions for the AWS Glue Service
Error: DescribeVpcEndpoints Action Is Unauthorized. Unable to Validate VPC ID vpc-id
Error: DescribeRouteTables Action Is Unauthorized. Unable to Validate Subnet Id: subnet-id in VPC id:
Error: Failed to Call ec2:DescribeSubnets
Error: Failed to Call ec2:DescribeSecurityGroups
Error: Could Not Find Subnet for AZ
Error: Job Run Exception When Writing to a JDBC Target
Error: Amazon S3 Timeout
Error: Amazon S3 Access Denied
Error: Amazon S3 Access Key ID Does Not Exist
Error: Job Run Fails When Accessing Amazon S3 with an s3a://
Error: Amazon S3 Service Token Expired
Error: No Private DNS for Network Interface Found
Error: Development Endpoint Provisioning Failed
Error: Notebook Server CREATE_FAILED
Error: Local Notebook Fails to Start
Error: Notebook Usage Errors
Error: Running Crawler Failed
Error: Upgrading Athena Data Catalog
Error: A Job is Reprocessing Data When Job Bookmarks Are Enabled
AWS Glue Limits
ETL Programming
General
Special Parameters
Connection Parameters
Format Options
Managing Partitions
Grouping Input Files
Reading from JDBC in Parallel
Moving Data to and from Amazon Redshift
ETL Programming in Python
Using Python
List of Extensions
List of Transforms
Python Setup
Calling
Python Libraries
Python Samples
PySpark Extensions
PySpark Transforms
ETL Programming in Scala
Using
Scala API
AWS Glue API
Security
data types
DataCatalogEncryptionSettings
EncryptionAtRest
ConnectionPasswordEncryption
EncryptionConfiguration
S3 Encryption
CloudWatchEncryption
JobBookmarksEncryption
SecurityConfiguration
operations
GetDataCatalogEncryptionSettings (get_data_catalog_encryption_settings)
PutDataCatalogEncryptionSettings (put_data_catalog_encryption_settings)
PutResourcePolicy (put_resource_policy)
GetResourcePolicy (get_resource_policy)
DeleteResourcePolicy (delete_resource_policy)
CreateSecurityConfiguration (create_security_configuration)
DeleteSecurityConfiguration (delete_security_configuration)
GetSecurityConfiguration (get_security_configuration)
GetSecurityConfigurations (get_security_configurations)
Tables