Amazon Athena
User Guide
Amazon's trademarks and trade dress may not be used in connection with any product or service that is not
Amazon's, in any manner that is likely to cause confusion among customers, or in any manner that disparages or
discredits Amazon. All other trademarks not owned by Amazon are the property of their respective owners, who may
or may not be affiliated with, connected to, or sponsored by Amazon.
Amazon Athena User Guide
Table of Contents
What is Amazon Athena? .................................................................................................................... 1
When should I use Athena? ......................................................................................................... 1
Accessing Athena ....................................................................................................................... 1
Understanding Tables, Databases, and the Data Catalog ................................................................. 2
AWS Service Integrations with Athena .......................................................................................... 3
Release Notes .................................................................................................................................... 6
March 05, 2019 ......................................................................................................................... 7
February 22, 2019 ...................................................................................................................... 7
February 18, 2019 ...................................................................................................................... 8
November 20, 2018 ................................................................................................................... 9
October 15, 2018 ....................................................................................................................... 9
October 10, 2018 ..................................................................................................................... 10
September 6, 2018 ................................................................................................................... 10
August 23, 2018 ...................................................................................................................... 11
August 16, 2018 ...................................................................................................................... 11
August 7, 2018 ........................................................................................................................ 12
June 5, 2018 ........................................................................................................................... 12
Support for Views ............................................................................................................ 12
Improvements and Updates to Error Messages ..................................................................... 12
Bug Fixes ......................................................................................................................... 13
May 17, 2018 .......................................................................................................................... 13
April 19, 2018 ......................................................................................................................... 13
April 6, 2018 ........................................................................................................................... 14
March 15, 2018 ........................................................................................................................ 14
February 2, 2018 ...................................................................................................................... 14
January 19, 2018 ..................................................................................................................... 14
November 13, 2017 .................................................................................................................. 15
November 1, 2017 ................................................................................................................... 15
October 19, 2017 ..................................................................................................................... 15
October 3, 2017 ....................................................................................................................... 15
September 25, 2017 ................................................................................................................. 15
August 14, 2017 ...................................................................................................................... 16
August 4, 2017 ........................................................................................................................ 16
June 22, 2017 .......................................................................................................................... 16
June 8, 2017 ........................................................................................................................... 16
May 19, 2017 .......................................................................................................................... 16
Improvements .................................................................................................................. 17
Bug Fixes ......................................................................................................................... 17
April 4, 2017 ........................................................................................................................... 17
Features .......................................................................................................................... 17
Improvements .................................................................................................................. 18
Bug Fixes ......................................................................................................................... 18
March 24, 2017 ........................................................................................................................ 18
Features .......................................................................................................................... 18
Improvements .................................................................................................................. 18
Bug Fixes ......................................................................................................................... 19
February 20, 2017 .................................................................................................................... 19
Features .......................................................................................................................... 19
Improvements .................................................................................................................. 20
Setting Up ....................................................................................................................................... 21
Sign Up for AWS ...................................................................................................................... 21
To create an AWS account ................................................................................................. 21
Create an IAM User .................................................................................................................. 21
To create a group for administrators ................................................................................... 21
To create an IAM user for yourself, add the user to the administrators group, and create a
password for the user ....................................................................................................... 22
Attach Managed Policies for Using Athena .................................................................................. 22
Getting Started ................................................................................................................................ 23
Prerequisites ............................................................................................................................ 23
Step 1: Create a Database ......................................................................................................... 23
Step 2: Create a Table .............................................................................................................. 24
Step 3: Query Data .................................................................................................................. 25
Accessing Amazon Athena ................................................................................................................. 27
Using the Console .................................................................................................................... 27
Using the API .......................................................................................................................... 27
Using the CLI ........................................................................................................................... 27
Integration with AWS Glue ................................................................................................................ 28
Upgrading to the AWS Glue Data Catalog Step-by-Step ................................................................ 29
Step 1 - Allow a User to Perform the Upgrade ..................................................................... 29
Step 2 - Update Customer-Managed/Inline Policies Associated with Athena Users ..................... 29
Step 3 - Choose Upgrade in the Athena Console .................................................................. 30
FAQ: Upgrading to the AWS Glue Data Catalog ............................................................................ 31
Why should I upgrade to the AWS Glue Data Catalog? .......................................................... 31
Are there separate charges for AWS Glue? ........................................................................... 32
Upgrade process FAQ ........................................................................................................ 32
Best Practices When Using Athena with AWS Glue ........................................................................ 33
Database, Table, and Column Names .................................................................................. 34
Using AWS Glue Crawlers .................................................................................................. 34
Working with CSV Files ..................................................................................................... 38
Using AWS Glue Jobs for ETL with Athena ........................................................................... 40
Connecting to Amazon Athena with ODBC and JDBC Drivers .................................................................. 43
Using Athena with the JDBC Driver ............................................................................................ 43
Links for Downloading the JDBC Driver ............................................................................... 43
JDBC Driver Release Notes, License Agreement, and Notices ................................................... 43
Migration from Previous Version of the JDBC Driver .............................................................. 44
Connecting to Amazon Athena with ODBC .................................................................................. 44
Amazon Athena ODBC Driver License Agreement ................................................................. 44
Windows ......................................................................................................................... 44
Linux .............................................................................................................................. 44
OSX ................................................................................................................................ 45
ODBC Driver Documentation .............................................................................................. 45
Migration from the Previous Version of the ODBC Driver ....................................................... 45
Previous Versions of the ODBC Driver ................................................................................. 45
Security ........................................................................................................................................... 47
Access Control Policies .............................................................................................................. 47
Managed Policies for User Access ....................................................................................... 47
Access to Amazon S3 ........................................................................................................ 51
Fine-Grained Access to Databases and Tables ....................................................................... 51
Access to Encrypted Metadata in the Data Catalog ............................................................... 58
Cross-account Access ........................................................................................................ 58
Workgroup and Tag Policies ............................................................................................... 59
Enabling Federated Access to Athena API ............................................................................ 59
Configuring Encryption Options ................................................................................................. 62
Amazon S3 Encryption Options Supported in Athena ............................................................ 62
Encrypting Query Results Stored in Amazon S3 .................................................................... 63
Permissions to Encrypted Data in Amazon S3 ...................................................................... 64
Permissions to Encrypted Metadata in the AWS Glue Data Catalog .......................................... 65
Creating Tables Based on Encrypted Datasets in Amazon S3 .................................................. 65
Working with Source Data ................................................................................................................. 67
Tables and Databases Creation Process in Athena ......................................................................... 67
Requirements for Tables in Athena and Data in Amazon S3 ................................................... 68
Adding Columns at the Beginning or in the Middle of the Table ............................................ 152
Adding Columns at the End of the Table ........................................................................... 152
Removing Columns ......................................................................................................... 153
Renaming Columns ......................................................................................................... 153
Reordering Columns ........................................................................................................ 154
Changing a Column's Data Type ....................................................................................... 155
Updates in Tables with Partitions ............................................................................................. 156
Avoiding Schema Mismatch Errors for Tables with Partitions ................................................ 156
Using Workgroups to Control Query Access and Costs ......................................................................... 158
Using Workgroups for Running Queries ..................................................................................... 158
Benefits of Using Workgroups .......................................................................................... 158
How Workgroups Work ................................................................................................... 159
Setting up Workgroups ................................................................................................... 160
IAM Policies for Accessing Workgroups ............................................................................. 161
Workgroup Example Policies ............................................................................................ 162
Workgroup Settings ........................................................................................................ 166
Managing Workgroups .................................................................................................... 167
Athena Workgroup APIs ................................................................................................. 172
Troubleshooting Workgroups ........................................................................................... 172
Controlling Costs and Monitoring Queries with CloudWatch Metrics .............................................. 174
Enabling CloudWatch Query Metrics ................................................................................. 174
Monitoring Athena Queries with CloudWatch Metrics .......................................................... 175
Setting Data Usage Control Limits .................................................................................... 176
Tagging Workgroups ....................................................................................................................... 180
Tag Basics .............................................................................................................................. 180
Tag Restrictions ...................................................................................................................... 181
Working with Tags Using the Console ....................................................................................... 181
Displaying Tags for Individual Workgroups ......................................................................... 181
Adding and Deleting Tags on an Individual Workgroup ........................................................ 181
Working with Tags Using the API Actions .................................................................................. 183
Tag-Based IAM Access Control Policies ...................................................................................... 184
Tag Policy Examples ....................................................................................................... 184
Monitoring Logs and Troubleshooting ............................................................................................... 187
Logging Amazon Athena API Calls with AWS CloudTrail ............................................................... 187
Athena Information in CloudTrail ...................................................................................... 187
Understanding Athena Log File Entries .............................................................................. 188
Troubleshooting ..................................................................................................................... 190
SerDe Reference ............................................................................................................................. 191
Using a SerDe ........................................................................................................................ 191
To Use a SerDe in Queries ............................................................................................... 191
Supported SerDes and Data Formats ........................................................................................ 192
Avro SerDe .................................................................................................................... 193
RegexSerDe for Processing Apache Web Server Logs ........................................................... 195
CloudTrail SerDe ............................................................................................................. 196
OpenCSVSerDe for Processing CSV ................................................................................... 198
Grok SerDe .................................................................................................................... 200
JSON SerDe Libraries ...................................................................................................... 203
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files ................................................. 206
ORC SerDe ..................................................................................................................... 211
Parquet SerDe ................................................................................................................ 214
Compression Formats .............................................................................................................. 216
DDL and SQL Reference .................................................................................................................. 217
Data Types ............................................................................................................................ 217
List of Supported Data Types in Athena ............................................................................ 217
DDL Statements ..................................................................................................................... 218
ALTER DATABASE SET DBPROPERTIES ............................................................................... 219
ALTER TABLE ADD PARTITION .......................................................................................... 219
When should I use Athena?
Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries
you run. Athena scales automatically—executing queries in parallel—so results are fast, even with large
datasets and complex queries.
Topics
• When should I use Athena? (p. 1)
• Accessing Athena (p. 1)
• Understanding Tables, Databases, and the Data Catalog (p. 2)
• AWS Service Integrations with Athena (p. 3)
Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate
reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an
ODBC driver. For more information, see What is Amazon QuickSight in the Amazon QuickSight User Guide
and Connecting to Amazon Athena with ODBC and JDBC Drivers (p. 43).
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your
data in Amazon S3. This allows you to create tables and query data in Athena based on a central
metadata store available throughout your AWS account and integrated with the ETL and data discovery
features of AWS Glue. For more information, see Integration with AWS Glue (p. 28) and What is AWS
Glue in the AWS Glue Developer Guide.
For a list of AWS services that Athena leverages or integrates with, see the section called “AWS Service
Integrations with Athena” (p. 3).
Accessing Athena
You can access Athena using the AWS Management Console, through a JDBC or ODBC connection, using
the Athena API, or using the Athena CLI.
• To get started with the console, see Getting Started (p. 23).
• To learn how to use JDBC or ODBC drivers, see Connecting to Amazon Athena with JDBC (p. 43)
and Connecting to Amazon Athena with ODBC (p. 44).
• To use the Athena API, see the Amazon Athena API Reference.
Understanding Tables, Databases, and the Data Catalog
• To use the CLI, install the AWS CLI and then type aws athena help from the command line to see
available commands. For information about available commands, see the AWS Athena command line
reference.
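As a sketch of CLI usage, the following commands run a query and retrieve its results. The S3 output location and the query string are illustrative placeholders, and the commands assume the AWS CLI is installed and configured with credentials for your account:

```
# List the available Athena subcommands
aws athena help

# Start a query; the command returns a QueryExecutionId
aws athena start-query-execution \
    --query-string "SELECT 1" \
    --result-configuration "OutputLocation=s3://my-athena-results-bucket/"

# Retrieve the results using the QueryExecutionId from the previous command
aws athena get-query-results --query-execution-id <query-execution-id>
```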
For each dataset that you'd like to query, Athena must have an underlying table that it uses to obtain
and return query results. Therefore, before querying data, you must register a table in Athena. The
registration occurs when you create the table, either automatically or manually.
Regardless of how a table is created, the table creation process registers the dataset with Athena.
This registration occurs either in the AWS Glue Data Catalog or in the internal Athena data catalog, and
it enables Athena to run queries on the data.
• To create a table automatically, use an AWS Glue crawler from within Athena. For more information
about AWS Glue and crawlers, see Integration with AWS Glue (p. 28). When AWS Glue creates a
table, it registers it in its own AWS Glue Data Catalog. Athena uses the AWS Glue Data Catalog to store
and retrieve this metadata, using it when you run queries to analyze the underlying dataset.
The AWS Glue Data Catalog is accessible throughout your AWS account. Other AWS services can share
the AWS Glue Data Catalog, so you can see databases and tables created throughout your organization
using Athena and vice versa. In addition, AWS Glue lets you automatically discover data schema and
extract, transform, and load (ETL) data.
Note
You use the internal Athena data catalog in regions where AWS Glue is not available and where
the AWS Glue Data Catalog cannot be used.
When you create tables and databases manually, Athena uses HiveQL data definition language (DDL)
statements such as CREATE TABLE, CREATE DATABASE, and DROP TABLE under the hood to create
tables and databases in the AWS Glue Data Catalog, or in its internal data catalog in those regions where
AWS Glue is not available.
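As a sketch of what these DDL statements look like, the following creates a database and registers an external table over data in Amazon S3. The database name, table name, columns, and bucket location are hypothetical placeholders:

```sql
-- Create a database in the data catalog
CREATE DATABASE IF NOT EXISTS mydatabase;

-- Register a table over comma-delimited data stored in Amazon S3
CREATE EXTERNAL TABLE IF NOT EXISTS mydatabase.access_logs (
  request_date STRING,
  request_time STRING,
  request_ip   STRING,
  bytes_sent   BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-athena-example-bucket/logs/';
```

Dropping the table with DROP TABLE removes only the catalog entry; the underlying data in Amazon S3 is not affected.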
Note
If you have tables in Athena created before August 14, 2017, they were created in an Athena-
managed data catalog that exists side-by-side with the AWS Glue Data Catalog until you
choose to update. For more information, see Upgrading to the AWS Glue Data Catalog Step-by-
Step (p. 29).
When you query an existing table, under the hood, Amazon Athena uses Presto, a distributed SQL
engine. We have examples with sample data within Athena to show you how to create a table and then
issue a query against it using Athena. Athena also has a tutorial in the console that helps you get started
creating a table based on data that is stored in Amazon S3.
• For a step-by-step tutorial on creating a table and writing queries in the Athena Query Editor, see
Getting Started (p. 23).
• Run the Athena tutorial in the console. The tutorial launches automatically the first time you log in to
https://console.aws.amazon.com/athena/. You can also choose Tutorial in the console to launch it.
AWS Service Integrations with Athena
Amazon Virtual Private Cloud — Querying Amazon VPC Flow Logs (p. 145)
Amazon Virtual Private Cloud flow logs capture information about the IP traffic going to and from
network interfaces in a VPC. Query the logs in Athena to investigate network traffic patterns and
identify threats and risks across your Amazon VPC network.
AWS Glue Data Catalog — Integration with AWS Glue (p. 28)
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your
data in Amazon S3. This allows you to create tables and query data in Athena based on a central
metadata store available throughout your AWS account and integrated with the ETL and data discovery
features of AWS Glue. For more information, see Integration with AWS Glue (p. 28) and What is AWS
Glue in the AWS Glue Developer Guide.
IAM — Actions for Amazon Athena
You can use Athena API actions in IAM permission policies. See Actions for Amazon Athena and Access
Control Policies (p. 47).
Release Notes
Describes Amazon Athena features, improvements, and bug fixes by release date.
Contents
• March 05, 2019 (p. 7)
• February 22, 2019 (p. 7)
• February 18, 2019 (p. 8)
• November 20, 2018 (p. 9)
• October 15, 2018 (p. 9)
• October 10, 2018 (p. 10)
• September 6, 2018 (p. 10)
• August 23, 2018 (p. 11)
• August 16, 2018 (p. 11)
• August 7, 2018 (p. 12)
• June 5, 2018 (p. 12)
• Support for Views (p. 12)
• Improvements and Updates to Error Messages (p. 12)
• Bug Fixes (p. 13)
• May 17, 2018 (p. 13)
• April 19, 2018 (p. 13)
• April 6, 2018 (p. 14)
• March 15, 2018 (p. 14)
• February 2, 2018 (p. 14)
• January 19, 2018 (p. 14)
• November 13, 2017 (p. 15)
• November 1, 2017 (p. 15)
• October 19, 2017 (p. 15)
• October 3, 2017 (p. 15)
• September 25, 2017 (p. 15)
• August 14, 2017 (p. 16)
• August 4, 2017 (p. 16)
• June 22, 2017 (p. 16)
• June 8, 2017 (p. 16)
• May 19, 2017 (p. 16)
• Improvements (p. 17)
• Bug Fixes (p. 17)
• April 4, 2017 (p. 17)
• Features (p. 17)
• Improvements (p. 18)
• Bug Fixes (p. 18)
• March 24, 2017 (p. 18)
March 05, 2019
Amazon Athena is now available in the Canada (Central) Region. For a list of supported Regions, see
AWS Regions and Endpoints. Also released a new version of the ODBC driver (version 1.0.5) with
support for Athena workgroups. To download the driver and its documentation, see Connecting to
Amazon Athena with ODBC (p. 44). For information about this version, see the ODBC Driver Release
Notes.
To use workgroups with the ODBC driver, set the new connection property, Workgroup, in the
connection string as shown in the following example:
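A connection string that sets the Workgroup property might look like the following sketch. The driver name, Region, S3 output location, and workgroup value shown here are illustrative placeholders; check the driver's Installation and Configuration Guide for the exact property names your driver version supports:

```
Driver=Simba Athena ODBC Driver;AwsRegion=us-east-1;
S3OutputLocation=s3://my-athena-results-bucket/;
AuthenticationType=IAM Credentials;Workgroup=primary
```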
For more information, search for "workgroup" in the ODBC Driver Installation and Configuration
Guide version 1.0.5. There are no changes to the ODBC driver connection string when you use tags on
workgroups. To use tags, upgrade to the latest version of the ODBC driver, which is this current version.
This driver version lets you use Athena API workgroup actions (p. 172) to create and manage
workgroups, and Athena API tag actions (p. 183) to add, list, or remove tags on workgroups. Before you
begin, make sure that you have resource-level permissions in IAM for actions on workgroups and tags.
For more information, see the following topics:
• Using Workgroups for Running Queries (p. 158) and Workgroup Example Policies (p. 162).
• Tagging Workgroups (p. 180) and Tag-Based IAM Access Control Policies (p. 184).
If you use the JDBC driver or the AWS SDK, upgrade to the latest version of the driver and SDK, both
of which already include support for workgroups and tags in Athena. For more information, see Using
Athena with the JDBC Driver (p. 43).
Added tag support for workgroups in Amazon Athena. A tag consists of a key and a value, both of
which you define. When you tag a workgroup, you assign custom metadata to it. You can add tags to
workgroups to help categorize them, using AWS tagging best practices. You can use tags to restrict
access to workgroups, and to track costs. For example, create a workgroup for each cost center. Then,
by adding tags to these workgroups, you can track your Athena spending for each cost center. For more
information, see Using Tags for Billing in the AWS Billing and Cost Management User Guide.
February 18, 2019
You can work with tags by using the Athena console or the API operations. For more information, see
Tagging Workgroups (p. 180).
In the Athena console, you can add one or more tags to each of your workgroups, and search by tags.
Workgroups are an IAM-controlled resource in Athena. In IAM, you can restrict who can add, remove, or
list tags on workgroups that you create. You can also use the CreateWorkGroup API operation, which
accepts an optional tag parameter for adding one or more tags to the workgroup. To add, remove, or list tags,
use TagResource, UntagResource, and ListTagsForResource. For more information, see Working
with Tags Using the API Actions (p. 180).
To allow users to add tags when creating workgroups, ensure that you give each user IAM permissions to
both the TagResource and CreateWorkGroup API actions. For more information and examples, see
Tag-Based IAM Access Control Policies (p. 184).
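As an illustrative sketch (not an official managed policy), an identity-based policy that grants both actions might look like the following. The Region, account ID, and workgroup ARN pattern are placeholders, not values from this guide:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:CreateWorkGroup",
        "athena:TagResource"
      ],
      "Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/*"
    }
  ]
}
```

Attach a policy like this to each user who should be able to create and tag workgroups.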
There are no changes to the JDBC driver when you use tags on workgroups. If you create new
workgroups and use the JDBC driver or the AWS SDK, upgrade to the latest version of the driver and
SDK. For information, see Using Athena with the JDBC Driver (p. 43).
Added ability to control query costs by running queries in workgroups. For information, see Using
Workgroups to Control Query Access and Costs (p. 158). Improved the JSON OpenX SerDe used in
Athena, fixed an issue where Athena did not ignore objects transitioned to the GLACIER storage class,
and added examples for querying Network Load Balancer logs.
• Added support for workgroups. Use workgroups to separate users, teams, applications, or workloads,
and to set limits on the amount of data each query or the entire workgroup can process. Because
workgroups act as IAM resources, you can use resource-level permissions to control access to a
specific workgroup. You can also view query-related metrics in Amazon CloudWatch, control query
costs by configuring limits on the amount of data scanned, create thresholds, and trigger actions,
such as Amazon SNS alarms, when these thresholds are breached. For more information, see Using
Workgroups for Running Queries (p. 158) and Controlling Costs and Monitoring Queries with
CloudWatch Metrics (p. 174).
Workgroups are an IAM resource. For a full list of workgroup-related actions, resources, and conditions
in IAM, see Actions, Resources, and Condition Keys for Amazon Athena in the IAM User Guide. Before
you create new workgroups, make sure that you use workgroup IAM policies (p. 161), and the
AmazonAthenaFullAccess Managed Policy (p. 48).
You can start using workgroups in the console, with the workgroup API operations (p. 172), or with
the JDBC driver. For a high-level procedure, see Setting up Workgroups (p. 160). To download the
JDBC driver with workgroup support, see Using Athena with the JDBC Driver (p. 43).
If you use workgroups with the JDBC driver, you must set the workgroup name in the connection string
using the Workgroup configuration parameter as in the following example:
jdbc:awsathena://AwsRegion=<AWSREGION>;UID=<ACCESSKEY>;PWD=<SECRETKEY>;S3OutputLocation=s3://<athena-output>-<AWSREGION>/;Workgroup=<WORKGROUPNAME>;
There are no changes in the way you run SQL statements or make JDBC API calls to the driver. The
driver passes the workgroup name to Athena.
For information about differences introduced with workgroups, see Athena Workgroup APIs (p. 172)
and Troubleshooting Workgroups (p. 172).
• Improved the JSON OpenX SerDe used in Athena. The improvements include, but are not limited to,
the following:
• Support for the ConvertDotsInJsonKeysToUnderscores property. When set to TRUE, it allows
the SerDe to replace the dots in key names with underscores. For example, if the JSON dataset
contains a key with the name "a.b", you can use this property to define the column name to be
"a_b" in Athena. The default is FALSE. By default, Athena does not allow dots in column names.
• Support for the case.insensitive property. By default, Athena requires that all keys in your
JSON dataset use lowercase. Using WITH SERDEPROPERTIES ("case.insensitive" = "FALSE")
allows you to use case-sensitive key names in your data. The default is TRUE. When set to
TRUE, the SerDe converts all uppercase columns to lowercase.
For more information, see the section called “OpenX JSON SerDe” (p. 204).
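As a sketch of how these properties are set (the table name, columns, and Amazon S3 location here are hypothetical), a table definition using the OpenX JSON SerDe might look like this:

```sql
CREATE EXTERNAL TABLE example_json_table (
  a_b string,   -- stores the JSON key "a.b" with the dot replaced by an underscore
  id int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  "ConvertDotsInJsonKeysToUnderscores" = "TRUE",
  "case.insensitive" = "FALSE"
)
LOCATION 's3://my-example-bucket/json-data/';
```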
• Fixed an issue where Athena returned "access denied" error messages when it processed Amazon
S3 objects that were archived to Glacier by Amazon S3 lifecycle policies. As a result of fixing this issue,
Athena ignores objects transitioned to the GLACIER storage class. Athena does not support querying
data from the GLACIER storage class.
For more information, see the section called “Requirements for Tables in Athena and Data in Amazon
S3” (p. 68) and Transitioning to the GLACIER Storage Class (Object Archival) in the Amazon Simple
Storage Service Developer Guide.
• Added examples for querying Network Load Balancer access logs that receive information about
the Transport Layer Security (TLS) requests. For more information, see the section called “Querying
Network Load Balancer Logs” (p. 142).
Released new versions of the JDBC and ODBC drivers with support for federated access to the Athena API
using AD FS and SAML 2.0 (Security Assertion Markup Language 2.0). For details, see the JDBC Driver
Release Notes and ODBC Driver Release Notes.
With this release, federated access to Athena is supported for Active Directory Federation Services
(AD FS) 3.0. Access is established through the versions of the JDBC or ODBC drivers that support SAML 2.0.
For information about configuring federated access to the Athena API, see the section called “Enabling
Federated Access to Athena API” (p. 59).
To download the JDBC driver version 2.0.6 and its documentation, see Using Athena with the JDBC
Driver (p. 43). For information about this version, see JDBC Driver Release Notes.
To download the ODBC driver version 1.0.4 and its documentation, see Connecting to Amazon Athena
with ODBC (p. 44). For information about this version, see the ODBC Driver Release Notes.
For more information about SAML 2.0 support in AWS, see About SAML 2.0 Federation in the IAM User
Guide.
If you have upgraded to the AWS Glue Data Catalog, there are two new features that provide support for:
• Encryption of the Data Catalog metadata. If you choose to encrypt metadata in the Data Catalog, you
must add specific policies to Athena. For more information, see Access to Encrypted Metadata in the
AWS Glue Data Catalog (p. 58).
• Fine-grained permissions to access resources in the AWS Glue Data Catalog. You can now define
identity-based (IAM) policies that restrict or allow access to specific databases and tables from the
Data Catalog used in Athena. For more information, see Fine-Grained Access to Databases and Tables
in the AWS Glue Data Catalog (p. 51).
Note
Data resides in Amazon S3 buckets, and access to it is governed by Amazon S3
permissions (p. 51). To access data in databases and tables, continue to use access control
policies for the Amazon S3 buckets that store the data.
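For example, a minimal identity-based policy that lets a user read the data that Athena queries from a bucket might look like the following sketch; the bucket name is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-athena-data-bucket",
        "arn:aws:s3:::my-athena-data-bucket/*"
      ]
    }
  ]
}
```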
Athena supports CREATE TABLE AS SELECT, which creates a table from the result of a SELECT query
statement. For details, see Creating a Table from Query Results (CTAS).
Before you create CTAS queries, it is important to learn about their behavior. The Athena
documentation describes the location for saving query results in Amazon S3, the list of supported
formats for storing CTAS query results, the number of partitions you can create, and supported
compression formats. For more information, see Considerations and Limitations for CTAS
Queries (p. 91).
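As a minimal sketch (the table names, columns, and S3 location are hypothetical), a CTAS statement that writes partitioned Parquet output might look like this:

```sql
CREATE TABLE new_parquet_table
WITH (
  format = 'PARQUET',
  external_location = 's3://my-example-bucket/ctas-results/',
  partitioned_by = ARRAY['dt']
) AS
SELECT id, name, dt          -- partition columns must come last in the select list
FROM source_table
WHERE dt >= '2018-01-01';
```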
September 6, 2018
Published on 2018-09-06
Released the new version of the ODBC driver (version 1.0.3). The new version of the ODBC driver streams
results by default, instead of paging through them, allowing business intelligence tools to retrieve large
data sets faster. This version also includes improvements, bug fixes, and updated documentation for
"Using SSL with a Proxy Server". For details, see the Release Notes for the driver.
To download the ODBC driver version 1.0.3 and its documentation, see Connecting to Amazon
Athena with ODBC (p. 44).
The streaming results feature is available with this new version of the ODBC driver. It is also available
with the JDBC driver. For information about streaming results, see the ODBC Driver Installation and
Configuration Guide, and search for UseResultsetStreaming.
The ODBC driver version 1.0.3 is a drop-in replacement for the previous version of the driver. We
recommend that you migrate to the current driver.
Important
To use the ODBC driver version 1.0.3, follow these requirements:
Added support for these DDL-related features and fixed several bugs, as follows:
• Added support for BINARY and DATE data types for data in Parquet, and for DATE and TIMESTAMP
data types for data in Avro.
• Added support for INT and DOUBLE in DDL queries. INTEGER is an alias to INT, and DOUBLE
PRECISION is an alias to DOUBLE.
• Improved performance of DROP TABLE and DROP DATABASE queries.
• Removed the creation of _$folder$ object in Amazon S3 when a data bucket is empty.
• Fixed an issue where ALTER TABLE ADD PARTITION threw an error when no partition value was
provided.
• Fixed an issue where DROP TABLE ignored the database name when checking partitions after the
qualified name had been specified in the statement.
For more about the data types supported in Athena, see Data Types (p. 217).
For information about supported data type mappings between types in Athena, the JDBC driver, and
Java data types, see the "Data Types" section in the JDBC Driver Installation and Configuration Guide.
Released the JDBC driver version 2.0.5. The new version of the JDBC driver streams results by default,
instead of paging through them, allowing business intelligence tools to retrieve large data sets
faster. Compared to the previous version of the JDBC driver, there are the following performance
improvements:
The streaming results feature is available only with the JDBC driver. It is not available with the ODBC
driver. You cannot use it with the Athena API. For information about streaming results, see the JDBC
Driver Installation and Configuration Guide, and search for UseResultsetStreaming.
To download the JDBC driver version 2.0.5 and its documentation, see Using Athena with the JDBC
Driver (p. 43).
The JDBC driver version 2.0.5 is a drop-in replacement for the previous version of the driver (2.0.2). To
ensure that you can use the JDBC driver version 2.0.5, add the athena:GetQueryResultsStream
policy action to the list of policies for Athena. This policy action is not exposed directly with the API
and is only used with the JDBC driver, as part of streaming results support. For an example policy, see
AWSQuicksightAthenaAccess Managed Policy (p. 50). For more information about migrating from
version 2.0.2 to version 2.0.5 of the driver, see the JDBC Driver Migration Guide.
If you are migrating from a 1.x driver to a 2.x driver, you will need to migrate your existing configurations
to the new configuration. We highly recommend that you migrate to the current version of the driver.
For more information, see Using the Previous Version of the JDBC Driver (p. 249), and the JDBC Driver
Migration Guide.
August 7, 2018
Published on 2018-08-07
You can now store Amazon Virtual Private Cloud flow logs directly in Amazon S3 in a GZIP format, where
you can query them in Athena. For information, see Querying Amazon VPC Flow Logs (p. 145) and
Amazon VPC Flow Logs can now be delivered to S3.
June 5, 2018
Published on 2018-06-05
Topics
• Support for Views (p. 12)
• Improvements and Updates to Error Messages (p. 12)
• Bug Fixes (p. 13)
Bug Fixes
Fixed the following bugs:
• Fixed an issue so that Athena internally translates REAL to FLOAT data types. This improves
integration with the AWS Glue Crawler, which returns FLOAT data types.
• Fixed an issue where Athena was not converting AVRO DECIMAL (a logical type) to a DECIMAL type.
• Fixed an issue where Athena did not return results for queries on Parquet data with WHERE clauses that
referenced values in the TIMESTAMP data type.
Increased query concurrency limits in Athena from five to twenty. This means that you can submit and
run up to twenty DDL queries and twenty SELECT queries at a time. Note that the concurrency limits are
separate for DDL and SELECT queries.
Concurrency limits in Athena are defined as the number of queries that can be submitted to the service
concurrently. You can submit up to twenty queries of the same type (DDL or SELECT) at a time. If you
submit a query that exceeds the concurrent query limit, the Athena API displays an error message: "You
have exceeded the limit for the number of queries you can run concurrently. Reduce the number of
concurrent queries submitted by this account. Contact customer support to request a concurrent query
limit increase."
After you submit your queries to Athena, it processes the queries by assigning resources based on
the overall service load and the amount of incoming requests. We continuously monitor and make
adjustments to the service so that your queries process as fast as possible.
For information, see Service Limits (p. 253). This is a soft limit and you can request a limit increase for
concurrent queries.
Released the new version of the JDBC driver (version 2.0.2) with support for returning the ResultSet
data as an Array data type, improvements, and bug fixes. For details, see the Release Notes for the driver.
For information about downloading the new JDBC driver version 2.0.2 and its documentation, see Using
Athena with the JDBC Driver (p. 43).
The latest version of the JDBC driver is 2.0.2. If you are migrating from a 1.x driver to a 2.x driver, you will
need to migrate your existing configurations to the new configuration. We highly recommend that you
migrate to the current driver.
For information about the changes introduced in the new version of the driver, the version differences,
and examples, see the JDBC Driver Migration Guide.
For information about the previous version of the JDBC driver, see Using Athena with the Previous
Version of the JDBC Driver (p. 249).
April 6, 2018
Published on 2018-04-06
Added an ability to automatically create Athena tables for CloudTrail log files directly from the
CloudTrail console. For information, see Creating a Table for CloudTrail Logs in the CloudTrail
Console (p. 136).
February 2, 2018
Published on 2018-02-12
Added an ability to securely offload intermediate data to disk for memory-intensive queries that use the
GROUP BY clause. This improves the reliability of such queries, preventing "Query resource exhausted"
errors.
With Athena, there are no versions to manage. We have transparently upgraded the underlying engine in
Athena to a version based on Presto version 0.172. No action is required on your end.
With the upgrade, you can now use Presto 0.172 Functions and Operators, including Presto 0.172
Lambda Expressions in Athena.
Major updates for this release, including the community-contributed fixes, include:
• Support for ignoring headers. You can use the skip.header.line.count property when defining
tables, to allow Athena to ignore headers. This is currently supported for queries that use the
OpenCSV SerDe, and not for Grok or Regex SerDes.
• Support for the CHAR(n) data type in STRING functions. The range for CHAR(n) is [1, 255], while
the range for VARCHAR(n) is [1, 65535].
• Support for correlated subqueries.
• Support for Presto Lambda expressions and functions.
• Improved performance of the DECIMAL type and operators.
• Support for filtered aggregations, such as SELECT sum(col_name) FILTER (WHERE id > 0).
• Push-down predicates for the DECIMAL, TINYINT, SMALLINT, and REAL data types.
• Support for quantified comparison predicates: ALL, ANY, and SOME.
• Added functions: arrays_overlap(), array_except(), levenshtein_distance(),
codepoint(), skewness(), kurtosis(), and typeof().
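A few of the features above can be illustrated in a single query; the table and column names here are hypothetical:

```sql
SELECT
  sum(col_name) FILTER (WHERE id > 0) AS positive_sum,  -- filtered aggregation
  array_except(ARRAY[1, 2, 3], ARRAY[2]) AS diff,       -- one of the new array functions
  filter(ARRAY[-1, 2, 3], x -> x > 0) AS positives,     -- a Presto lambda expression
  levenshtein_distance('kitten', 'sitting') AS dist     -- new string-distance function
FROM example_table;
```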
For a complete list of functions and operators, see SQL Queries, Functions, and Operators (p. 234) in
this guide, and Presto 0.172 Functions.
Athena does not support all of Presto's features. For more information, see Limitations (p. 239).
Added support for connecting to Athena with an ODBC driver. For information, see Connecting to Amazon
Athena with ODBC (p. 44).
November 1, 2017
Published on 2017-11-01
Added support for querying geospatial data, and for Asia Pacific (Seoul), Asia Pacific (Mumbai), and
EU (London) regions. For information, see Querying Geospatial Data (p. 123) and AWS Regions and
Endpoints.
Added support for EU (Frankfurt). For a list of supported regions, see AWS Regions and Endpoints.
October 3, 2017
Published on 2017-10-03
Create named Athena queries with CloudFormation. For more information, see
AWS::Athena::NamedQuery in the AWS CloudFormation User Guide.
August 14, 2017
Added support for Asia Pacific (Sydney). For a list of supported regions, see AWS Regions and Endpoints.
Added integration with the AWS Glue Data Catalog and a migration wizard for updating from the Athena
managed data catalog to the AWS Glue Data Catalog. For more information, see Integration with AWS
Glue (p. 28).
August 4, 2017
Published on 2017-08-04
Added support for Grok SerDe, which provides easier pattern matching for records in unstructured text
files such as logs. For more information, see Grok SerDe (p. 200). Added keyboard shortcuts to scroll
through query history using the console (CTRL + ⇧/⇩ using Windows, CMD + ⇧/⇩ using Mac).
Added support for Asia Pacific (Tokyo) and Asia Pacific (Singapore). For a list of supported regions, see
AWS Regions and Endpoints.
June 8, 2017
Published on 2017-06-08
Added support for EU (Ireland). For more information, see AWS Regions and Endpoints.
Added the Amazon Athena API and AWS CLI support for Athena; updated the JDBC driver to version 1.1.0;
fixed various issues.
• The Amazon Athena API enables application programming for Athena. For more information, see the
Amazon Athena API Reference. The latest AWS SDKs include support for the Athena API. For links to
documentation and downloads, see the SDKs section in Tools for Amazon Web Services.
• The AWS CLI includes new commands for Athena. For more information, see the Amazon Athena API
Reference.
• A new JDBC driver 1.1.0 is available, which supports the new Athena API as well as the latest features
and bug fixes. Download the driver at https://s3.amazonaws.com/athena-downloads/drivers/
AthenaJDBC41-1.1.0.jar. We recommend upgrading to the latest Athena JDBC driver; however, you
may still use the earlier driver version. Earlier driver versions do not support the Athena API. For more
information, see Using Athena with the JDBC Driver (p. 43).
• Actions specific to policy statements in earlier versions of Athena have been deprecated. If you
upgrade to JDBC driver version 1.1.0 and have customer-managed or inline IAM policies attached to
JDBC users, you must update the IAM policies. In contrast, earlier versions of the JDBC driver do not
support the Athena API, so you can specify only deprecated actions in policies attached to earlier
version JDBC users. For this reason, you shouldn't need to update customer-managed or inline IAM
policies.
• These policy-specific actions were used in Athena before the release of the Athena API. Use these
deprecated actions in policies only with JDBC drivers earlier than version 1.1.0. If you are upgrading
the JDBC driver, replace policy statements that allow or deny deprecated actions with the appropriate
API actions as listed below, or errors will occur:
• Replace athena:RunQuery with athena:StartQueryExecution.
• Replace athena:CancelQueryExecution with athena:StopQueryExecution.
• Replace athena:GetQueryExecutions with athena:ListQueryExecutions.
Improvements
• Increased the query string length limit to 256 KB.
Bug Fixes
• Fixed an issue that caused query results to look malformed when scrolling through results in the
console.
• Fixed an issue where a \u0000 character string in Amazon S3 data files would cause errors.
• Fixed an issue that caused requests to cancel a query made through the JDBC driver to fail.
• Fixed an issue that caused the AWS CloudTrail SerDe to fail with Amazon S3 data in US East (Ohio).
• Fixed an issue that caused DROP TABLE to fail on a partitioned table.
April 4, 2017
Published on 2017-04-04
Added support for Amazon S3 data encryption and released JDBC driver update (version 1.0.1) with
encryption support, improvements, and bug fixes.
Features
• Added the following encryption features:
• Support for querying encrypted data in Amazon S3.
• Support for encrypting Athena query results.
• A new version of the driver supports new encryption features, adds improvements, and fixes issues.
Improvements
• Added the ability to add, replace, and change columns using ALTER TABLE. For more information, see
Alter Column in the Hive documentation.
• Added support for querying LZO-compressed data.
Improvements
• Better JDBC query performance with page-size improvements, returning 1,000 rows instead of 100.
• Added ability to cancel a query using the JDBC driver interface.
• Added ability to specify JDBC options in the JDBC connection URL. For more information, see Using
Athena with the Previous Version of the JDBC Driver (p. 249) for the previous version of the driver,
and Connect with the JDBC (p. 43), for the most current version.
• Added PROXY setting in the driver, which can now be set using ClientConfiguration in the AWS SDK for
Java.
Bug Fixes
Fixed the following bugs:
• Throttling errors would occur when multiple queries were issued using the JDBC driver interface.
• The JDBC driver would abort when projecting a decimal data type.
• The JDBC driver would return every data type as a string, regardless of how the data type
was defined in the table. For example, selecting a column defined as an INT data type using
resultSet.GetObject() would return a STRING data type instead of INT.
• The JDBC driver would verify credentials at the time a connection was made, rather than at the time a
query would run.
• Queries made through the JDBC driver would fail when a schema was specified along with the URL.
Added the AWS CloudTrail SerDe, improved performance, fixed partition issues.
Features
• Added the AWS CloudTrail SerDe. For more information, see CloudTrail SerDe (p. 196). For detailed
usage examples, see the AWS Big Data Blog post, Analyze Security, Compliance, and Operational
Activity Using AWS CloudTrail and Amazon Athena.
Improvements
• Improved performance when scanning a large number of partitions.
• Improved performance of the MSCK REPAIR TABLE operation.
• Added ability to query Amazon S3 data stored in regions other than your primary Region. Standard
inter-region data transfer rates for Amazon S3 apply in addition to standard Athena charges.
Bug Fixes
• Fixed a bug where a "table not found" error might occur if no partitions are loaded.
• Fixed a bug to avoid throwing an exception with ALTER TABLE ADD PARTITION IF NOT EXISTS
queries.
• Fixed a bug in DROP PARTITIONS.
Added support for AvroSerDe and OpenCSVSerDe, US East (Ohio) Region, and bulk editing columns in
the console wizard. Improved performance on large Parquet tables.
Features
• Introduced support for new SerDes:
• Avro SerDe (p. 193)
• OpenCSVSerDe for Processing CSV (p. 198)
• US East (Ohio) Region (us-east-2) launch. You can now run queries in this region.
• You can now use the Add Table wizard to define table schema in bulk. Choose Catalog Manager, Add
table, and then choose Bulk add columns as you walk through the steps to define the table.
Type name-value pairs in the text box and choose Add.
Improvements
• Improved performance on large Parquet tables.
Setting Up
If you've already signed up for Amazon Web Services (AWS), you can start using Amazon Athena
immediately. If you haven't signed up for AWS, or if you need assistance querying data using Athena, first
complete the tasks below:
If you have an AWS account already, skip to the next task. If you don't have an AWS account, use the
following procedure to create one.
Note your AWS account number, because you need it for the next task.
If you signed up for AWS but have not created an IAM user for yourself, you can create one using the IAM
console. If you aren't familiar with using the console, see Working with the AWS Management Console.
To create an IAM user for yourself, add the user to the
administrators group, and create a password for the user
https://*your_account_alias*.signin.aws.amazon.com/console/
It is also possible that the sign-in link will use your account name instead of your account number. To
verify the sign-in link for IAM users for your account, open the IAM console and check under IAM users
sign-in link on the dashboard.
Getting Started
This tutorial walks you through using Amazon Athena to query data. You'll create a table based on
sample data stored in Amazon Simple Storage Service (Amazon S3), query the table, and check the results
of the query.
This tutorial uses live resources, so you are charged for the queries that you run. You aren't charged
for the sample datasets that you use, but if you upload your own data files to Amazon S3, charges do
apply.
Prerequisites
If you have not already done so, sign up for an account in Setting Up (p. 21).
To create a database
4. To create a database named mydatabase, enter the following CREATE DATABASE statement, and
then choose Run Query:
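For a database named mydatabase, the statement is:

```sql
CREATE DATABASE mydatabase;
```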
5. Confirm that the catalog display refreshes and mydatabase appears in the DATABASE list in the
Catalog dashboard on the left side.
Step 2: Create a Table
To create a table
1. Make sure that mydatabase is selected for DATABASE and then choose New Query.
2. In the query pane, enter the following CREATE TABLE statement, and then choose Run Query:
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://
athena-examples-us-east-1/path/to/data/.
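An abbreviated, hypothetical sketch of such a statement (the column list and row format are assumptions; the guide's full statement defines more columns) has this general shape:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `date` date,
  time string,
  location string,
  bytes bigint,
  requestip string,
  method string,
  uri string,
  status int
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
LOCATION 's3://athena-examples-myregion/path/to/data/';
```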
The table cloudfront_logs is created and appears in the Catalog dashboard for your database.
Step 3: Query Data
To run a query
1. Choose New Query, enter the following statement anywhere in the query pane, and then choose
Run Query:
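As a hypothetical example of a query you might run against the tutorial table (the column names are assumptions):

```sql
SELECT method, status, COUNT(*) AS request_count
FROM cloudfront_logs
GROUP BY method, status
ORDER BY request_count DESC
LIMIT 10;
```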
2. Optionally, you can save the results of a query to CSV by choosing the file icon on the Results pane.
You can also view the results of previous queries or queries that may take some time to complete.
Choose History, then either search for your query or choose View or Download to view or download the
results of previous completed queries. This also displays the status of queries that are currently running.
Query history is retained for 45 days. For information, see Viewing Query History (p. 85).
Using the Console
In the right pane, the Query Editor displays an introductory screen that prompts you to create your first
table. You can view your tables under Tables in the left pane.
• Preview tables – View the query syntax in the Query Editor on the right.
• Show properties – Show a table's name, its location in Amazon S3, input and output formats, the
serialization (SerDe) library used, and whether the table has encrypted data.
• Delete table – Delete a table.
• Generate CREATE TABLE DDL – Generate the query behind a table and view it in the query editor.
For examples of using the AWS SDK for Java with Athena, see Code Samples (p. 241).
For more information about AWS SDK for Java documentation and downloads, see the SDKs section in
Tools for Amazon Web Services.
In regions where AWS Glue is supported, Athena uses the AWS Glue Data Catalog as a central location
to store and retrieve table metadata throughout an AWS account. The Athena execution engine requires
table metadata that instructs it where to read data, how to read it, and other information necessary to
process the data. The AWS Glue Data Catalog provides a unified metadata repository across a variety
of data sources and data formats, integrating not only with Athena, but with Amazon S3, Amazon RDS,
Amazon Redshift, Amazon Redshift Spectrum, Amazon EMR, and any application compatible with the
Apache Hive metastore.
For more information about the AWS Glue Data Catalog, see Populating the AWS Glue Data Catalog
in the AWS Glue Developer Guide. For a list of regions where AWS Glue is available, see Regions and
Endpoints in the AWS General Reference.
Separate charges apply to AWS Glue. For more information, see AWS Glue Pricing and Are there separate
charges for AWS Glue? (p. 32) For more information about the benefits of using AWS Glue with
Athena, see Why should I upgrade to the AWS Glue Data Catalog? (p. 31)
Topics
• Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 29)
• FAQ: Upgrading to the AWS Glue Data Catalog (p. 31)
• Best Practices When Using Athena with AWS Glue (p. 33)
Upgrading to the AWS Glue Data Catalog Step-by-Step
Until you upgrade, the Athena-managed data catalog continues to store your table and database
metadata, and you see the option to upgrade at the top of the console. The metadata in the Athena-
managed catalog isn't available in the AWS Glue Data Catalog or vice versa. While the catalogs exist side-
by-side, you aren't able to create tables or databases with the same names, and the creation process in
either AWS Glue or Athena fails in this case.
We created a wizard in the Athena console to walk you through the steps of upgrading to the AWS
Glue Data Catalog. The upgrade takes just a few minutes, and you can pick up where you left off. For more
information about each upgrade step, see the topics in this section. For more information about working
with data and tables in the AWS Glue Data Catalog, see the guidelines in Best Practices When Using
Athena with AWS Glue (p. 33).
Before the upgrade can be performed, you need to attach a customer-managed IAM policy, with a policy
statement that allows the upgrade action, to the user who performs the migration.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:ImportCatalogToGlue"
],
"Resource": [ "*" ]
}
]
}
Step 3 - Choose Upgrade in the Athena Console
{
"Effect":"Allow",
"Action":[
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource":[
"*"
]
}
When you create a table using the console, you now have the option to create a table using an AWS Glue
crawler. For more information, see Using AWS Glue Crawlers (p. 34).
FAQ: Upgrading to the AWS Glue Data Catalog
• An AWS Glue crawler can automatically scan your data sources, identify data formats, and infer
schema.
• A fully managed ETL service allows you to transform and move data to various destinations.
• The AWS Glue Data Catalog stores metadata information about databases and tables, pointing to a
data store in Amazon S3 or a JDBC-compliant data store.
Upgrading to the AWS Glue Data Catalog has the following benefits.
Are there separate charges for AWS Glue?
For more information, see Populating the AWS Glue Data Catalog.
Easy-to-build pipelines
The AWS Glue ETL engine generates Python code that is entirely customizable, reusable, and portable.
You can edit the code using your favorite IDE or notebook and share it with others using GitHub. After
your ETL job is ready, you can schedule it to run on the fully managed, scale-out Spark infrastructure of
AWS Glue. AWS Glue handles provisioning, configuration, and scaling of the resources required to run
your ETL jobs, allowing you to tightly integrate ETL with your workflow.
For more information, see Authoring AWS Glue Jobs in the AWS Glue Developer Guide.
Best Practices When Using Athena with AWS Glue
Under the hood, Athena uses Presto to execute DML statements and Hive to execute the DDL statements
that create and modify schema. With these technologies, there are a couple of conventions to follow so
that Athena and AWS Glue work well together.
In this topic
• Using AWS Glue Jobs for ETL with Athena (p. 40)
• Creating Tables Using Athena for AWS Glue ETL Jobs (p. 40)
• Using ETL Jobs to Optimize Query Performance (p. 41)
• Converting SMALLINT and TINYINT Datatypes to INT When Converting to ORC (p. 42)
• Automating AWS Glue Jobs for ETL (p. 42)
Database, Table, and Column Names
You can use the AWS Glue Catalog Manager to rename columns, but at this time table names and
database names cannot be changed using the AWS Glue console. To correct database names, you need to
create a new database and copy tables to it (in other words, copy the metadata to a new entity). You can
follow a similar process for tables. You can use the AWS Glue SDK or AWS CLI to do this.
If you have data that arrives for a partitioned table at a fixed time, you can set up an AWS Glue crawler
to run on a schedule to detect and update table partitions. This can eliminate the need to run a
potentially long and expensive MSCK REPAIR command or to manually execute an ALTER TABLE ADD
PARTITION command. For more information, see Table Partitions in the AWS Glue Developer Guide.
Using AWS Glue Crawlers
s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
s3://bucket01/folder1/table2/partition5/file.txt
If the schemas for table1 and table2 are similar, and a single data source is set to
s3://bucket01/folder1/ in AWS Glue, the crawler may create a single table with two partition
columns: one partition column that contains table1 and table2, and a second partition column that
contains partition1 through partition5.
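The crawler's grouping can be illustrated with a short, purely illustrative Python sketch. This is a simplification of what the crawler does, not actual AWS Glue code; the function name is made up for this example:

```python
# Illustrative only: shows why files under one include path surface two
# "varying" directory segments, which the crawler may treat as partition
# columns. This simplifies, and is not, actual AWS Glue crawler logic.

def varying_segments(paths, include_path):
    """Return the directory segments of each file below include_path."""
    rows = []
    for path in paths:
        relative = path[len(include_path):].strip("/")
        rows.append(relative.split("/")[:-1])  # drop the file name
    return rows

paths = [
    "s3://bucket01/folder1/table1/partition1/file.txt",
    "s3://bucket01/folder1/table2/partition4/file.txt",
]

# With the data source set to s3://bucket01/folder1/, every file has two
# directory levels below it (table1/table2 and partition1...partition5),
# matching the two partition columns described above.
print(varying_segments(paths, "s3://bucket01/folder1/"))
```

Setting the include path one level deeper (at each table directory) leaves only one varying segment per file, which is why separate data sources yield separate tables.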
To have the AWS Glue crawler create two separate tables, set the crawler to have two data sources,
s3://bucket01/folder1/table1/ and s3://bucket01/folder1/table2/, as shown in the
following procedure.
3. Under Add information about your crawler, choose additional settings as appropriate, and then
choose Next.
4. Under Add a data store, change Include path to the table-level directory. For instance, given the
example above, you would change it from s3://bucket01/folder1 to
s3://bucket01/folder1/table1/. Choose Next.
6. For Include path, enter your other table-level directory (for example,
s3://bucket01/folder1/table2/) and choose Next.
7. Repeat steps 3-5 for any additional table-level directories, and finish the crawler configuration.
The new values for Include locations appear under data stores as follows:
When Athena runs a query, it validates the schema of the table and the schema of any partitions
necessary for the query. The validation compares the column data types in order and makes sure
that they match for the columns that overlap. This prevents unexpected operations such as adding
or removing columns from the middle of a table. If Athena detects that the schema of a partition
differs from the schema of the table, Athena may not be able to process the query and fails with
HIVE_PARTITION_SCHEMA_MISMATCH.
There are a few ways to fix this issue. First, if the data was accidentally added, you can remove the data
files that cause the difference in schema, drop the partition, and re-crawl the data. Second, you can drop
the individual partition and then run MSCK REPAIR within Athena to re-create the partition using the
table's schema. This second option works only if you are confident that the schema applied will continue
to read the data correctly.
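The second option can be scripted against the Athena API. The following is a hedged sketch, not a complete tool: start_query_execution is the real boto3 Athena client method, but the helper name, database, and partition specification are illustrative:

```python
# Sketch of the second fix: drop the mismatched partition, then run MSCK
# REPAIR so Athena re-creates it from the table's schema. The helper name
# and identifiers are illustrative; start_query_execution is the boto3
# Athena client method.

def repair_partition(athena_client, database, table, partition_spec, output_location):
    """Submit the two DDL statements and return their query execution IDs."""
    statements = [
        f"ALTER TABLE {table} DROP PARTITION ({partition_spec})",
        f"MSCK REPAIR TABLE {table}",
    ]
    ids = []
    for statement in statements:
        response = athena_client.start_query_execution(
            QueryString=statement,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )
        ids.append(response["QueryExecutionId"])
    return ids
```

In a real script you would poll GetQueryExecution until the first statement completes before submitting the second, since the two DDL statements must run in order.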
AWS Glue may mis-assign metadata when a CSV file has quotes around each data field, setting the
serializationLib property incorrectly. For more information, see CSV Data Enclosed in Quotes (p. 38).
Working with CSV Files
• Change the serializationLib property under the SerDeInfo field in the table to
org.apache.hadoop.hive.serde2.OpenCSVSerde.
• Enter appropriate values for separatorChar, quoteChar, and escapeChar. The separatorChar
value is a comma (,), the quoteChar value is a double quote ("), and the escapeChar value is a
backslash (\).
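The effect of these three properties can be reproduced with any CSV parser. A minimal illustration using Python's csv module follows; the sample record is made up, and the point is the mapping of the SerDe properties onto the reader arguments:

```python
import csv
import io

# separatorChar, quoteChar, and escapeChar map onto the delimiter,
# quotechar, and escapechar arguments of Python's csv.reader.
sample = io.StringIO('"John","Doe","123 Main St, Anytown"\n')
rows = list(csv.reader(sample, delimiter=",", quotechar='"', escapechar="\\"))

# Because the fields are quoted, the comma inside the address is data,
# not a separator - the same distinction OpenCSVSerde makes for Athena.
print(rows)
```

Without the correct quoteChar, the third field would be split on its embedded comma, which is exactly the mis-parsing the OpenCSVSerde settings prevent.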
For a CSV file with records enclosed in double quotes, you can use the AWS Glue console to edit the
table details accordingly.
38
Amazon Athena User Guide
Working with CSV Files
Alternatively, you can update the table definition in AWS Glue to have a SerDeInfo block such as the
following:
"SerDeInfo": {
   "name": "",
   "serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
   "parameters": {
      "separatorChar": ",",
      "quoteChar": "\"",
      "escapeChar": "\\"
   }
},
For more information, see Viewing and Editing Table Details in the AWS Glue Developer Guide.
Using AWS Glue Jobs for ETL with Athena
The following example shows a function in an AWS Glue script that writes out a dynamic frame
using from_options, and sets the writeHeader format option to false, which removes the header
information:
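A hedged sketch of such a function follows. The GlueContext is passed in as a parameter; in a real AWS Glue job it comes from the awsglue library, the helper name and S3 path are illustrative, and the option names follow the AWS Glue CSV format options:

```python
# Sketch of a Glue script step that writes a DynamicFrame as CSV without
# a header row. glue_context is a GlueContext in a real job; passing it
# in keeps the function environment-independent. The helper name and the
# S3 path are illustrative.

def write_without_header(glue_context, frame, output_path):
    """Write `frame` to S3 as CSV with writeHeader set to false."""
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": output_path},
        format="csv",
        # writeHeader: false removes the header information
        format_options={"writeHeader": False},
    )
```

Suppressing the header matters because Athena otherwise reads the column-name row as data.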
If the table property was not added when the table was created, you can add it using the AWS Glue
console.
For more information, see Working with Tables in the AWS Glue Developer Guide.
We recommend using the Parquet and ORC data formats. AWS Glue supports writing to both of these
data formats, which can make it easier and faster for you to transform data to an optimal format for
Athena. For more information about these formats and other ways to improve performance, see Top
Performance Tuning Tips for Amazon Athena.
Using Athena with the JDBC Driver
Topics
• Using Athena with the JDBC Driver (p. 43)
• Connecting to Amazon Athena with ODBC (p. 44)
Download the driver that matches your version of the JDK and the JDBC data standards:
• The AthenaJDBC41-2.0.7.jar is compatible with JDBC 4.1 and requires JDK 7.0 or later.
• The AthenaJDBC42-2.0.7.jar is compatible with JDBC 4.2 and requires JDK 8.0 or later.
• Release Notes
• License Agreement
• Notices
• Third-Party Licenses
Now you are ready to migrate from the previous version and install and configure this version of the
JDBC driver.
Migration from Previous Version of the JDBC Driver
• JDBC Driver Installation and Configuration Guide. Use this guide to install and configure the driver.
• JDBC Driver Migration Guide. Use this guide to migrate from previous versions to the current version.
For more information about the previous versions of the JDBC driver, see Using the Previous Version of
the JDBC Driver (p. 249).
If you are migrating from a 1.x driver to a 2.x driver, you will need to migrate your existing configurations
to the new configuration. We highly recommend that you migrate to the driver version 2.x. For
information, see the JDBC Driver Migration Guide.
Windows
Driver Version Download Link
ODBC 1.0.5 for Windows 32-bit Windows 32 bit ODBC Driver 1.0.5
ODBC 1.0.5 for Windows 64-bit Windows 64 bit ODBC Driver 1.0.5
Linux
Driver Version Download Link
ODBC 1.0.5 for Linux 32-bit Linux 32 bit ODBC Driver 1.0.5
ODBC 1.0.5 for Linux 64-bit Linux 64 bit ODBC Driver 1.0.5
OSX
Driver Version Download Link
Documentation for ODBC 1.0.5 ODBC Driver Installation and Configuration Guide version 1.0.5
Release Notes for ODBC 1.0.5 ODBC Driver Release Notes version 1.0.5
ODBC 1.0.4 for Windows 32-bit Windows 32 bit ODBC Driver 1.0.4
ODBC 1.0.4 for Windows 64-bit Windows 64 bit ODBC Driver 1.0.4
ODBC 1.0.4 for Linux 32-bit Linux 32 bit ODBC Driver 1.0.4
ODBC 1.0.4 for Linux 64-bit Linux 64 bit ODBC Driver 1.0.4
Previous Versions of the ODBC Driver
Documentation for ODBC 1.0.4 ODBC Driver Installation and Configuration Guide version 1.0.4
ODBC 1.0.3 for Windows 32-bit Windows 32-bit ODBC Driver 1.0.3
ODBC 1.0.3 for Windows 64-bit Windows 64-bit ODBC Driver 1.0.3
ODBC 1.0.3 for Linux 32-bit Linux 32-bit ODBC Driver 1.0.3
ODBC 1.0.3 for Linux 64-bit Linux 64-bit ODBC Driver 1.0.3
Documentation for ODBC 1.0.3 ODBC Driver Installation and Configuration Guide version 1.0.3
ODBC 1.0.2 for Windows 32-bit Windows 32-bit ODBC Driver 1.0.2
ODBC 1.0.2 for Windows 64-bit Windows 64-bit ODBC Driver 1.0.2
ODBC 1.0.2 for Linux 32-bit Linux 32-bit ODBC Driver 1.0.2
ODBC 1.0.2 for Linux 64-bit Linux 64-bit ODBC Driver 1.0.2
Documentation for ODBC 1.0.2 ODBC Driver Installation and Configuration Guide version 1.0.2
Security
Amazon Athena uses IAM policies to restrict access to Athena operations. Encryption options enable you
to encrypt query result files in Amazon S3 and query data encrypted in Amazon S3. Users must have the
appropriate permissions to access the Amazon S3 locations and decrypt files.
Topics
• Access Control Policies (p. 47)
• Configuring Encryption Options (p. 62)
If you are an administrator for other users, make sure that they have appropriate permissions associated
with their user profiles.
Topics
• Managed Policies for User Access (p. 47)
• Access to Amazon S3 (p. 51)
• Fine-Grained Access to Databases and Tables in the AWS Glue Data Catalog (p. 51)
• Access to Encrypted Metadata in the AWS Glue Data Catalog (p. 58)
• Cross-account Access (p. 58)
• Workgroup and Tag Policies (p. 59)
• Enabling Federated Access to Athena API (p. 59)
Each identity-based (IAM) policy consists of statements that define the actions that are allowed or
denied. For a list of actions, see the Amazon Athena API Reference.
Managed policies are easy to use and are automatically updated with the required actions as the
service evolves. For more information and step-by-step instructions for attaching a policy to a user, see
Attaching Managed Policies in the AWS Identity and Access Management User Guide.
Managed Policies for User Access
• The AmazonAthenaFullAccess managed policy grants full access to Athena. Attach it to users
and other principals who need full access to Athena. See AmazonAthenaFullAccess Managed
Policy (p. 48).
• The AWSQuicksightAthenaAccess managed policy grants access to actions that Amazon
QuickSight needs to integrate with Athena. Attach this policy to principals who use Amazon QuickSight
in conjunction with Athena. See AWSQuicksightAthenaAccess Managed Policy (p. 50).
Customer-managed and inline identity-based policies allow you to specify more detailed Athena actions
within a policy to fine-tune access. We recommend that you use the AmazonAthenaFullAccess policy
as a starting point and then allow or deny specific actions listed in the Amazon Athena API Reference. For
more information about inline policies, see Managed Policies and Inline Policies in the AWS Identity and
Access Management User Guide.
If you also have principals that connect using JDBC, you must provide the JDBC driver credentials to your
application. For more information, see Service Actions for JDBC Connections (p. 51).
If you have migrated to using AWS Glue with Athena, and have chosen to encrypt your AWS Glue Data
Catalog, you must specify additional actions to the identity-based IAM policies that you use in Athena.
For more information, see Access to Encrypted Metadata in the AWS Glue Data Catalog (p. 58).
Important
If you create and use workgroups, make sure your policies include appropriate access to
workgroup actions. For detailed information, see the section called “IAM Policies for Accessing
Workgroups” (p. 161) and the section called “Workgroup Example Policies” (p. 162).
Managed policy contents change, so the policy shown here may be out-of-date. Check the IAM console
for the most up-to-date policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:*"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::athena-examples*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListAllMyBuckets"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"sns:ListTopics",
"sns:GetTopicAttributes"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricAlarm",
"cloudwatch:DescribeAlarms",
"cloudwatch:DeleteAlarms"
],
"Resource": [
"*"
]
}
]
}
Managed policy contents change, so the policy shown here may be out-of-date. Check the IAM console
for the most up-to-date policy.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:BatchGetQueryExecution",
"athena:CancelQueryExecution",
"athena:GetCatalogs",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetNamespaces",
"athena:GetQueryExecution",
"athena:GetQueryExecutions",
"athena:GetQueryResults",
"athena:GetQueryResultsStream",
"athena:GetTable",
"athena:GetTables",
"athena:ListQueryExecutions",
"athena:RunQuery",
"athena:StartQueryExecution",
"athena:StopQueryExecution"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:DeleteDatabase",
"glue:GetDatabase",
"glue:GetDatabases",
"glue:UpdateDatabase",
"glue:CreateTable",
"glue:DeleteTable",
"glue:BatchDeleteTable",
"glue:UpdateTable",
"glue:GetTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:BatchDeletePartition",
"glue:UpdatePartition",
"glue:GetPartition",
"glue:GetPartitions",
"glue:BatchGetPartition"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:CreateBucket",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::aws-athena-query-results-*"
]
}
]
}
For information about the latest version of the JDBC driver, see Connect with the JDBC Driver (p. 43).
For information about the latest version of the ODBC driver, see Connect with the ODBC Driver (p. 44).
Access to Amazon S3
In addition to the allowed actions for Athena that you define in IAM identity-based policies, if you or
your users need to create tables and work with underlying data, you must grant appropriate access to
the Amazon S3 location of the data.
You can do this using identity-based policies, bucket resource policies, or both. For detailed information
and scenarios about how to grant Amazon S3 access, see Example Walkthroughs: Managing Access in the
Amazon Simple Storage Service Developer Guide. For more information and an example of which Amazon
S3 actions to allow, see the example bucket policy later in Cross-Account Access (p. 58).
Note
Athena does not support restricting or allowing access to Amazon S3 resources based on the
aws:SourceIp condition key.
Fine-Grained Access to Databases and Tables
These policies enable you to define fine-grained access to the following resources:
• Databases
• Tables
You define resource-level permissions in identity-based (IAM) policies in the IAM Console.
Important
This section discusses identity-based (IAM) policies that allow you to define fine-grained access
to specific resources. These are not the same as resource-based policies. For more information
about the differences, see Identity-Based Policies and Resource-Based Policies in the AWS
Identity and Access Management User Guide.
To create an IAM policy that defines fine-grained access to resources, see Creating IAM Policies in the
AWS Identity and Access Management User Guide.
To learn about identity-based (IAM) policies used in AWS Glue, see Identity-Based Policies (IAM
Policies) in the AWS Glue Developer Guide.
In this section
Limitations
Consider the following limitations when using fine-grained access control with the AWS Glue Data
Catalog and Athena:
• You must upgrade (p. 29) from a data catalog managed in Athena to the AWS Glue Data Catalog.
• You can limit access only to databases and tables. Fine-grained access controls apply at the table level
and you cannot limit access to individual partitions within a table. For more information, see Table
Partitions and Versions in AWS Glue (p. 54).
• Athena does not support cross-account access to the AWS Glue Data Catalog.
• The AWS Glue Data Catalog contains the following resources: CATALOG, DATABASE, TABLE, and
FUNCTION.
Note
From this list, the resources that are common to Athena and the AWS Glue Data Catalog are
TABLE, DATABASE, and CATALOG for each account. FUNCTION is specific to AWS Glue. However,
for delete actions in Athena, such as dropping a database, you must include permissions to the
functions in that database. See Fine-Grained Policy Examples (p. 54).
The hierarchy is as follows: CATALOG is an ancestor of all DATABASES in each account, and each
DATABASE is an ancestor for all of its TABLES and FUNCTIONS. For example, for a table named
table_test that belongs to a database db in the catalog in your account, its ancestors are db and
the catalog in your account. For the db database, its ancestor is the catalog in your account, and
its descendants are tables and functions. For more information about the hierarchical structure of
resources, see List of ARNs in Data Catalog in the AWS Glue Developer Guide.
• For any non-delete Athena action on a resource, such as CREATE DATABASE, CREATE TABLE, SHOW
DATABASE, SHOW TABLE, or ALTER TABLE, you need permissions to call this action on this resource
(table or database) and all ancestors of this resource in the Data Catalog. For example, for a table, its
ancestors are database and catalog for the account. For a database, its ancestor is the catalog for this
account. See Fine-Grained Policy Examples (p. 54).
• For a delete action in Athena, such as DROP DATABASE or DROP TABLE, you additionally need
permissions to call this delete action on all descendants of this resource in the Data Catalog. For
example, to delete a database you need permissions on the database, on its ancestor, which is the
catalog, and on all descendants, which are the tables and the user defined functions. A table does not
have descendants, and therefore, to run DROP TABLE, you need permissions to this action on the table
and its ancestors. See Fine-Grained Policy Examples (p. 54).
• When limiting access to a specific database in the Data Catalog, you must also specify the access policy
to the default database and catalog for each AWS Region for GetDatabase and CreateDatabase
actions. If you use Athena in more than one Region, add a separate line to the policy for the resource
ARN for each default database and catalog in each Region.
For example, to allow GetDatabase access to example_db in the us-east-1 (N. Virginia) Region,
also include the default database and catalog in the policy for that Region for two actions:
GetDatabase and CreateDatabase:
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default",
"arn:aws:glue:us-east-1:123456789012:database/example_db"
]
}
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
}
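Because this default-database statement must be repeated for each Region, it can be generated rather than hand-written. The following is a minimal sketch; the helper function is hypothetical and not part of any AWS SDK:

```python
# Hypothetical helper (not part of any AWS SDK) that builds the
# GetDatabase/CreateDatabase statement for one Region, optionally adding
# specific databases such as example_db.

def default_db_statement(region, account_id, databases=()):
    resources = [
        f"arn:aws:glue:{region}:{account_id}:catalog",
        f"arn:aws:glue:{region}:{account_id}:database/default",
    ] + [f"arn:aws:glue:{region}:{account_id}:database/{db}" for db in databases]
    return {
        "Effect": "Allow",
        "Action": ["glue:GetDatabase", "glue:CreateDatabase"],
        "Resource": resources,
    }

# One statement per Region in which you use Athena:
statements = [
    default_db_statement(region, "123456789012")
    for region in ("us-east-1", "us-west-2")
]
```

Generating the statements this way keeps the per-Region catalog and default-database ARNs consistent, which is the requirement the text above describes.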
• Fine-grained access controls apply at the table level. You can limit access only to databases and tables.
For example, if you allow access to a partitioned table, this access applies to all partitions in the table.
You cannot limit access to individual partitions within a table.
Important
Having access to all partitions within a table is not sufficient if you need to run actions in
AWS Glue on partitions. To run actions on partitions, you need permissions for those actions.
For example, to run GetPartitions on table myTable in the database myDB, you need
permissions for the action glue:GetPartitions in the Data Catalog, the myDB database,
and myTable.
• Fine-grained access controls do not apply to table versions. As with partitions, access to previous
versions of a table is granted through access to the table version APIs in AWS Glue on the table, and to
the table ancestors.
For information about permissions on AWS Glue actions, see AWS Glue API Permissions: Actions and
Resources Reference in the AWS Glue Developer Guide.
As with any IAM policy, you define these policies in the IAM Console. We recommend that you start with
these examples and, depending on your needs, adjust them to allow or deny specific actions to particular
databases and tables.
These examples include the access policy to the default database and catalog, for GetDatabase and
CreateDatabase actions. This policy is required for Athena and the AWS Glue Data Catalog to work
together. For multiple AWS Regions, include this policy for each of the default databases and their
catalogs, one line for each Region.
In addition, replace the example_db database and test table names with the names for your databases
and tables.
DDL Statement Example of an IAM access policy granting access to the resource
CREATE DATABASE Allows you to create the example_db database.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default",
"arn:aws:glue:us-east-1:123456789012:database/example_db"
]
}
ALTER DATABASE Allows you to modify the properties for the example_db database.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:UpdateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db"
]
}
DROP DATABASE Allows you to drop the example_db database, including all tables in it.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:DeleteDatabase",
"glue:GetTables",
"glue:GetTable",
"glue:DeleteTable"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:table/example_db/*",
"arn:aws:glue:us-east-1:123456789012:userDefinedFunction/example_db/*"
]
}
SHOW DATABASES Allows you to list all databases in the AWS Glue Data Catalog.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabases"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/*"
]
}
CREATE TABLE Allows you to create a table named test in the example_db database.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetTable",
"glue:CreateTable"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:table/example_db/test"
]
}
SHOW TABLES Allows you to list all tables in the example_db database.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetTables"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:table/example_db/*"
]
}
DROP TABLE Allows you to drop a partitioned table named test in the example_db
database. If your table does not have partitions, do not include partition
actions.
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/default"
]
},
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetTable",
"glue:DeleteTable",
"glue:GetPartitions",
"glue:GetPartition",
"glue:DeletePartition"
],
"Resource": [
"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/example_db",
"arn:aws:glue:us-east-1:123456789012:table/example_db/test"
]
}
Access to Encrypted Metadata in the Data Catalog
You can optionally enable encryption in the AWS Glue Data Catalog using the AWS Glue console, or the
API. For information, see Encrypting Your Data Catalog in the AWS Glue Developer Guide.
If you encrypt your AWS Glue Data Catalog, you must add the following actions to all of your policies
used to access Athena:
{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Action": [
"kms:GenerateDataKey",
"kms:Decrypt",
"kms:Encrypt"
],
"Resource": "(arn of key being used to encrypt the catalog)"
}
}
Cross-account Access
A common scenario is granting access to users in an account different from the bucket owner so that
they can perform queries. In this case, use a bucket policy to grant access.
The following example bucket policy, created and applied to bucket s3://my-athena-data-bucket
by the bucket owner, grants access to all users in account 123456789123, which is a different account.
{
"Version": "2012-10-17",
"Id": "MyPolicyID",
"Statement": [
{
"Sid": "MyStatementSid",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789123:root"
},
"Action": [
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:ListMultipartUploadParts",
"s3:AbortMultipartUpload",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::my-athena-data-bucket",
"arn:aws:s3:::my-athena-data-bucket/*"
]
}
    ]
}
Workgroup and Tag Policies
To grant access to a particular user in an account, replace the Principal key with
a key that specifies the user instead of root. For example, for user profile Dave, use
arn:aws:iam::123456789123:user/Dave.
"Resource": ["arn:aws:athena:<region>:<user-account>:workgroup/<workgroup-name>"]
Where <workgroup-name> is the name of your workgroup. For example, for a workgroup named
test_workgroup, specify it as a resource as follows:
"Resource": ["arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"]
For a list of workgroup policies, see the section called “Workgroup Example Policies” (p. 162).
For a list of tag-based policies for workgroups, see Tag-Based IAM Access Control Policies (p. 184).
For a complete list of Amazon Athena actions, see the API action names in the Amazon Athena API
Reference.
For more information about IAM policies, see Creating Policies with the Visual Editor in the IAM User
Guide.
For more information about creating IAM policies for workgroups, see Workgroup IAM Policies (p. 161).
In this case, to authenticate users, use the JDBC or ODBC driver with SAML 2.0 support that accesses
Active Directory Federation Services (AD FS) 3.0 and enables a client application to call Athena API
operations.
For more information about SAML 2.0 support in AWS, see About SAML 2.0 Federation in the IAM User
Guide.
Note
Federated access to Athena API is supported for a particular type of identity provider (IdP),
the Active Directory Federation Service (AD FS 3.0), which is part of Windows Server. Access
is established through the versions of JDBC or ODBC drivers that support SAML 2.0. For
information, see Using Athena with the JDBC Driver (p. 43) and Connecting to Amazon Athena
with ODBC (p. 44).
Topics
• Before You Begin (p. 60)
• Architecture Diagram (p. 60)
• Procedure: SAML-based Federated Access to Athena API (p. 61)
Before You Begin
• Inside your organization, install and configure the AD FS 3.0 as your IdP.
• On the Athena side, install and configure the latest available versions of JDBC or ODBC drivers that
include support for federated access compatible with SAML 2.0. For information, see Using Athena
with the JDBC Driver (p. 43) and Connecting to Amazon Athena with ODBC (p. 44).
Architecture Diagram
The following diagram illustrates this process.
In this diagram:
1. A user in your organization uses a client application with the JDBC or ODBC driver to request
authentication from your organization's IdP. The IdP is AD FS 3.0.
2. The IdP authenticates the user against Active Directory, which is your organization's Identity Store.
3. The IdP constructs a SAML assertion with information about the user and sends the assertion to the
client application via the JDBC or ODBC driver.
4. The JDBC or ODBC driver calls the AWS Security Token Service AssumeRoleWithSAML API operation,
passing it the following parameters:
• The ARN of the SAML provider
• The ARN of the role to assume
• The SAML assertion from the IdP
For more information, see AssumeRoleWithSAML, in the AWS Security Token Service API Reference.
5. The API response to the client application via the JDBC or ODBC driver includes temporary security
credentials.
6. The client application uses the temporary security credentials to call Athena API operations, allowing
your users to access Athena API operations.
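The call in step 4 can be sketched with the AWS SDK for Python (Boto3). The ARNs and the assertion below are hypothetical placeholders; in practice, the JDBC or ODBC driver obtains the assertion from AD FS and makes this call for you.

```python
import base64

def build_assume_role_with_saml_request(provider_arn, role_arn, saml_assertion_xml):
    """Build the parameters for sts.assume_role_with_saml (step 4 above)."""
    return {
        "PrincipalArn": provider_arn,  # ARN of the SAML provider in IAM
        "RoleArn": role_arn,           # ARN of the role to assume
        # The SAML assertion is passed base64-encoded
        "SAMLAssertion": base64.b64encode(saml_assertion_xml.encode()).decode(),
    }

params = build_assume_role_with_saml_request(
    "arn:aws:iam::123456789012:saml-provider/adfs",    # hypothetical provider ARN
    "arn:aws:iam::123456789012:role/AthenaFederated",  # hypothetical role ARN
    "<samlp:Response>...</samlp:Response>",            # assertion from the IdP
)
# The driver then calls boto3.client("sts").assume_role_with_saml(**params)
# and uses the temporary security credentials in the response (step 5).
```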
Procedure: SAML-based Federated Access to Athena API
1. In your organization, register AWS as a service provider (SP) in your IdP. This process is known as
creating a relying party trust. For more information, see Configuring your SAML 2.0 IdP with Relying
Party Trust in the IAM User Guide. As part of this task, perform these steps:
a. Obtain the sample SAML metadata document from this URL: https://signin.aws.amazon.com/
static/saml-metadata.xml.
b. In your organization's IdP (AD FS), generate an equivalent metadata XML file that describes your
IdP as an identity provider to AWS. Your metadata file must include the issuer name, creation
date, expiration date, and keys that AWS uses to validate authentication responses (assertions)
from your organization.
2. In the IAM console, create a SAML identity provider entity. For more information, see Creating SAML
Identity Providers in the IAM User Guide.
3. In the IAM console, create an IAM role for federated access. As part of this step, do the following:
• In the role's permission policy, list actions that users from your organization are allowed to do in
AWS.
• In the role's trust policy, set the SAML provider entity that you created in Step 2 of this procedure
as the principal.
4. In AD FS, configure claim rules that map users or groups in your organization's Active Directory to
the IAM role. For information about configuring the mapping in AD FS, see the blog post: Enabling
Federation to AWS Using Windows Active Directory, ADFS, and SAML 2.0.
5. Install and configure the JDBC or ODBC driver with SAML 2.0 support. For information, see Using
Athena with the JDBC Driver (p. 43) and Connecting to Amazon Athena with ODBC (p. 44).
6. In your application, specify the connection string to the JDBC or ODBC driver. For information
about the connection string that your application should use, see the topic "Using the Active
Directory Federation Services (ADFS) Credentials Provider" in the JDBC Driver Installation and
Configuration Guide, or a similar topic in the ODBC Driver Installation and Configuration Guide.
The high-level summary of configuring the connection string to the drivers is as follows:
• For UID and PWD, provide the AD domain user credentials. When using the driver on Windows,
if UID and PWD are not provided, the driver attempts to obtain the user credentials of the user
logged in to the Windows machine.
• Optionally, set ssl_insecure to true. In this case, the driver does not check the authenticity
of the SSL certificate for the AD FS IdP server. Setting to true is needed if the AD FS IdP’s SSL
certificate has not been configured to be trusted by the driver.
• To enable mapping of an Active Directory domain user or group to one or more IAM roles
(as mentioned in step 4 of this procedure), in the preferred_role for the JDBC or ODBC
connection, specify the IAM role (ARN) to assume for the driver connection. Specifying the
preferred_role is optional, and is useful if the role is not the first role listed in the claim rule.
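The following sketch assembles a connection string with these parameters. Only UID, PWD, ssl_insecure, and preferred_role come from this guide; the URL prefix and the semicolon-separated property syntax are assumptions — consult the driver's Installation and Configuration Guide for the exact syntax your driver version expects.

```python
# Illustrative sketch only: assembles a JDBC-style connection string from
# the parameters discussed above. Property names other than UID, PWD,
# ssl_insecure, and preferred_role are assumptions, not documented values.
def build_connection_string(region, output_location, properties):
    props = ";".join(f"{k}={v}" for k, v in properties.items())
    return f"jdbc:awsathena://AwsRegion={region};S3OutputLocation={output_location};{props}"

conn = build_connection_string(
    "us-east-1",
    "s3://my-athena-results/",          # hypothetical results bucket
    {
        "UID": "NETDOMAIN\\ADFS-Dev",   # AD domain user credentials
        "PWD": "password",
        "ssl_insecure": "true",         # skip the AD FS certificate check
        "preferred_role": "arn:aws:iam::123456789012:role/AthenaAccess",
    },
)
```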
7. The JDBC or ODBC driver calls the AWS STS AssumeRoleWithSAML API operation, and passes it the
assertion, as shown in step 4 of the architecture diagram (p. 60).
8. AWS makes sure that the request to assume the role comes from the IdP referenced in the SAML
provider entity.
9. If the request is successful, the AWS STS AssumeRoleWithSAML API operation returns a set of
temporary security credentials, which your client application uses to make signed requests to
Athena.
Your application now has information about the current user and can access Athena
programmatically.
Configuring Encryption Options
You can use Athena to set up encryption for the following:
• The results of all queries in Amazon S3, which Athena stores in a location known as the S3 staging
directory. You can encrypt query results stored in Amazon S3 whether the underlying dataset is
encrypted in Amazon S3 or not. For information, see Permissions to Encrypted Query Results Stored in
Amazon S3 (p. 63).
• The data in the AWS Glue Data Catalog. For information, see Permissions to Encrypted Metadata in the
AWS Glue Data Catalog (p. 65).
Topics
• Amazon S3 Encryption Options Supported in Athena (p. 62)
• Encrypting Query Results Stored in Amazon S3 (p. 63)
• Permissions to Encrypted Data in Amazon S3 (p. 64)
• Permissions to Encrypted Metadata in the AWS Glue Data Catalog (p. 65)
• Creating Tables Based on Encrypted Datasets in Amazon S3 (p. 65)
Amazon S3 Encryption Options Supported in Athena
Note
With SSE-KMS, Athena does not require you to indicate that data is encrypted when creating a
table.
These options encrypt data at rest in Amazon S3. Regardless of whether you use these options, Transport
Layer Security (TLS) encrypts objects in transit between Athena resources and between Athena and
Amazon S3. Query results stream to JDBC clients as plain text but are encrypted in transit using TLS.
Important
The setup for querying an encrypted dataset in Amazon S3 and the options in Athena to encrypt
query results are independent. Each option is enabled and configured separately. You can use
different encryption methods or keys for each. This means that reading encrypted data in
Amazon S3 doesn't automatically encrypt Athena query results in Amazon S3. The opposite is
also true. Encrypting Athena query results in Amazon S3 doesn't encrypt the underlying dataset
in Amazon S3.
For more information about AWS KMS encryption with Amazon S3, see What is AWS Key Management
Service and How Amazon Simple Storage Service (Amazon S3) Uses AWS KMS in the AWS Key
Management Service Developer Guide.
Athena does not support SSE with customer-provided keys (SSE-C), nor does it support client-side
encryption using a client-side master key. To compare Amazon S3 encryption options, see Protecting
Data Using Encryption in the Amazon Simple Storage Service Developer Guide.
Athena does not support running queries from one Region on encrypted data stored in Amazon S3 in
another Region.
To configure the JDBC driver to encrypt your query results using any of the encryption protocols that
Athena supports, see Using Athena with the JDBC Driver (p. 43).
Encrypting Query Results Stored in Amazon S3
You can configure the setting for encryption of query results in two ways:
• Client-side settings. When you use Settings in the console or the API operations to indicate that you
want to encrypt query results, this is known as using client-side settings. Client-side settings include
query results location and encryption. If you specify them, they are used, unless they are overridden by
the workgroup settings.
• Workgroup settings. When you create or edit a workgroup (p. 168) and select the Override client-
side settings field, then all queries that run in this workgroup use the workgroup settings. For more
information, see Workgroup Settings Override Client-Side Settings (p. 166). Workgroup settings
include query results location and encryption.
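With the API, client-side settings correspond to the ResultConfiguration parameter of the StartQueryExecution operation. The following Boto3 sketch shows its shape; the bucket and KMS key ARN are hypothetical placeholders.

```python
# Sketch of client-side query settings via the Athena API. If the query
# runs in a workgroup that overrides client-side settings, the workgroup's
# values are used instead. Bucket and key ARN are placeholders.
request = {
    "QueryString": "SELECT 1",
    "ResultConfiguration": {
        "OutputLocation": "s3://my-athena-results/",  # query results location
        "EncryptionConfiguration": {
            "EncryptionOption": "SSE_KMS",            # or SSE_S3 / CSE_KMS
            "KmsKey": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
        },
    },
}
# boto3.client("athena").start_query_execution(**request)
```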
To encrypt query results stored in Amazon S3 using the console
1. In the Athena console, choose Settings.
2. For Query result location, enter a custom value or leave the default. This is the Amazon S3 staging
directory where query results are stored.
3. Choose Encrypt query results.
4. If your account has access to an existing AWS KMS customer managed key (CMK), choose its alias, or
choose Enter a KMS key ARN and then enter an ARN.
5. If your account does not have access to an existing AWS KMS customer managed key (CMK),
choose Create KMS key, and then open the AWS KMS console. In the navigation pane, choose
AWS managed keys. For more information, see Creating Keys in the AWS Key Management Service
Developer Guide.
6. Return to the Athena console to specify the key by alias or ARN as described in the previous step.
7. Choose Save.
Permissions to Encrypted Data in Amazon S3
• SSE-S3. If you use SSE-S3 for encryption, Athena users require no additional permissions in their
policies. It is sufficient to have the appropriate Amazon S3 permissions for the appropriate Amazon
S3 location and for Athena actions. For more information about policies that allow appropriate
Athena and Amazon S3 permissions, see IAM Policies for User Access (p. 47) and Amazon S3
Permissions (p. 51).
• AWS KMS. If you use AWS KMS for encryption, Athena users must be allowed to perform particular
AWS KMS actions in addition to Athena and Amazon S3 permissions. You allow these actions by
editing the key policy for the AWS KMS customer managed keys (CMKs) that are used to encrypt data
in Amazon S3. The easiest way to do this is to use the IAM console to add key users to the appropriate
AWS KMS key policies. For information about how to add a user to an AWS KMS key policy, see How to
Modify a Key Policy in the AWS Key Management Service Developer Guide.
Note
Advanced key policy administrators can adjust key policies. kms:Decrypt is the minimum
allowed action for an Athena user to work with an encrypted dataset. To work with encrypted
query results, the minimum allowed actions are kms:GenerateDataKey and kms:Decrypt.
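As a sketch, a key policy statement granting a user these minimum actions might look like the following; the account ID and user name are hypothetical.

```python
# Sketch of a KMS key policy statement granting the minimum actions noted
# above: kms:Decrypt for reading an encrypted dataset, plus
# kms:GenerateDataKey for writing encrypted query results.
# The principal ARN is a hypothetical placeholder.
statement = {
    "Sid": "AllowAthenaUserUseOfTheKey",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:user/athena-user"},
    "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
    "Resource": "*",  # in a key policy, * refers to this CMK
}
```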
When using Athena to query datasets in Amazon S3 with a large number of objects that are encrypted
with AWS KMS, AWS KMS may throttle query results. This is more likely when there are a large number
of small objects. Athena backs off and retries the requests, but a throttling error might still occur. In this
case, visit the AWS Support Center and create a case to increase your limit. For more information about
limits and AWS KMS throttling, see Limits in the AWS Key Management Service Developer Guide.
Creating Tables Based on Encrypted Datasets in Amazon S3
Users that run queries, including the user who creates the table, must have the appropriate permissions
as described earlier in this topic.
Important
If you use Amazon EMR along with EMRFS to upload encrypted Parquet files, you must
disable multipart uploads by setting fs.s3n.multipart.uploads.enabled to
false. If you don't do this, Athena is unable to determine the Parquet file length and a
HIVE_CANNOT_OPEN_SPLIT error occurs. For more information, see Configure Multipart
Upload for Amazon S3 in the Amazon EMR Management Guide.
Indicate that the dataset is encrypted in Amazon S3 in one of the following ways. This step is not
required if SSE-KMS is used.
• Use the CREATE TABLE (p. 223) statement with a TBLPROPERTIES clause that specifies
'has_encrypted_data'='true'.
• Use the JDBC driver (p. 43) and set the TBLPROPERTIES value as shown in the previous example,
when you execute CREATE TABLE (p. 223) using statement.executeQuery().
• Use the Add table wizard in the Athena console, and then choose Encrypted data set when you
specify a value for Location of input data set.
Tables based on encrypted data in Amazon S3 appear in the Database list with an encryption icon.
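As a sketch, the TBLPROPERTIES clause from the first option above looks like the following in a CREATE TABLE statement; the table name, column, and location are hypothetical.

```python
# Sketch of a CREATE TABLE statement with the TBLPROPERTIES clause from
# the first option above. Table name, column, and location are hypothetical.
ddl = """
CREATE EXTERNAL TABLE my_encrypted_table (
  id string
)
LOCATION 's3://my-encrypted-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='true')
"""
# With the JDBC driver, pass this string to statement.executeQuery();
# with the API, pass it as QueryString to StartQueryExecution.
```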
Working with Source Data
When you create a database and table in Athena, you describe the schema and the location of the data,
making the data in the table ready for real-time querying.
To improve query performance and reduce costs, we recommend that you partition your data and use
open source columnar formats for storage in Amazon S3, such as Apache Parquet or ORC.
Topics
• Tables and Databases Creation Process in Athena (p. 67)
• Names for Tables, Databases, and Columns (p. 71)
• Reserved Keywords (p. 72)
• Table Location in Amazon S3 (p. 73)
• Partitioning Data (p. 74)
• Columnar Storage Formats (p. 78)
• Converting to Columnar Formats (p. 78)
Tables and Databases Creation Process in Athena
When you create a new table schema in Athena, Athena stores the schema in a data catalog and uses it
when you run queries.
Athena uses an approach known as schema-on-read, which means a schema is projected onto your data
at the time you run a query. This eliminates the need for data loading or transformation.
Athena uses Apache Hive to define tables and create databases, which are essentially a logical
namespace of tables.
When you create a database and table in Athena, you simply describe the schema and the location of
the table data in Amazon S3 for read-time querying. Database and table therefore have a slightly
different meaning than they do for traditional relational database systems, because the data isn't stored
along with the schema definition for the database and table.
When you query, you query the table using standard SQL and the data is read at that time. You can find
guidance for how to create databases and tables using Apache Hive documentation, but the following
provides guidance specifically for Athena.
Requirements for Tables in Athena and Data in Amazon S3
Hive supports multiple data formats through the use of serializer-deserializer (SerDe) libraries. You
can also define complex schemas using regular expressions. For a list of supported SerDe libraries, see
Supported Data Formats, SerDes, and Compression Formats (p. 192).
• Athena can only query the latest version of data on a versioned Amazon S3 bucket, and cannot query
previous versions of the data.
• You must have the appropriate permissions to work with data in the Amazon S3 location. For more
information, see Setting User and Amazon S3 Bucket Permissions (p. 47).
• If the data is not encrypted in Amazon S3, it can be stored in a different Region from the primary
region where you run Athena. Standard inter-region data transfer rates for Amazon S3 apply in
addition to standard Athena charges.
• If the data is encrypted in Amazon S3, it must be stored in the same Region, and the user or principal
who creates the table in Athena must have the appropriate permissions to decrypt the data. For more
information, see Configuring Encryption Options (p. 62).
• Athena supports querying objects that are stored with multiple storage classes in the same bucket
specified by the LOCATION clause. For example, you can query data in objects that are stored in
different storage classes (Standard, Standard-IA, and Intelligent-Tiering) in Amazon S3.
• Athena does not support Requester Pays buckets.
• Athena does not support querying the data in the GLACIER storage class. It ignores objects
transitioned to the GLACIER storage class based on an Amazon S3 lifecycle policy.
For more information, see Storage Classes, Changing the Storage Class of an Object in Amazon S3,
Transitioning to the GLACIER Storage Class (Object Archival) , and Requester Pays Buckets in the
Amazon Simple Storage Service Developer Guide.
• If you issue queries against Amazon S3 buckets with a large number of objects and the data is not
partitioned, such queries may affect the Get request rate limits in Amazon S3 and lead to Amazon
S3 exceptions. To prevent errors, partition your data. Additionally, consider tuning your Amazon S3
request rates. For more information, see Request Rate and Performance Considerations.
Functions Supported
The functions supported in Athena queries are those found within Presto. For more information, see
Presto 0.172 Functions and Operators in the Presto documentation.
All Tables Are EXTERNAL
To create a table using Hive DDL
1. In the database that you created, create a table by entering the following statement and choosing
Run Query:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
OS String,
Browser String,
BrowserVersion String
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s
+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://athena-examples/cloudfront/plaintext/';
2. If the table was successfully created, you can then run queries against your data.
Names for Tables, Databases, and Columns
Queries with mixedCase column names, such as profileURI, or uppercase column names do not work.
Reserved Keywords
When you run queries in Athena that include reserved keywords, you must escape them by enclosing
them in special characters. Use the lists in this topic to check which keywords are reserved in Athena.
To escape reserved keywords in DDL statements, enclose them in backticks (`). To escape reserved
keywords in SQL SELECT statements and in queries on Views (p. 86), enclose them in double quotes
(").
You cannot use DDL reserved keywords as identifier names in DDL statements without enclosing them in
backticks (`).
ALL, ALTER, AND, ARRAY, AS, AUTHORIZATION, BETWEEN, BIGINT, BINARY, BOOLEAN, BOTH,
BY, CASE, CACHE, CAST, CHAR, COLUMN, CONF, CONSTRAINT, COMMIT, CREATE, CROSS, CUBE,
CURRENT, CURRENT_DATE, CURRENT_TIMESTAMP, CURSOR, DATABASE, DATE, DAYOFWEEK, DECIMAL,
DELETE, DESCRIBE, DISTINCT, DOUBLE, DROP, ELSE, END, EXCHANGE, EXISTS, EXTENDED,
EXTERNAL, EXTRACT, FALSE, FETCH, FLOAT, FLOOR, FOLLOWING, FOR, FOREIGN, FROM, FULL,
FUNCTION, GRANT, GROUP, GROUPING, HAVING, IF, IMPORT, IN, INNER, INSERT, INT, INTEGER,
INTERSECT, INTERVAL, INTO, IS, JOIN, LATERAL, LEFT, LESS, LIKE, LOCAL, MACRO, MAP, MORE,
NONE, NOT, NULL, NUMERIC, OF, ON, ONLY, OR, ORDER, OUT, OUTER, OVER, PARTIALSCAN, PARTITION,
PERCENT, PRECEDING, PRECISION, PRESERVE, PRIMARY, PROCEDURE, RANGE, READS, REDUCE, REGEXP,
REFERENCES, REVOKE, RIGHT, RLIKE, ROLLBACK, ROLLUP, ROW, ROWS, SELECT, SET, SMALLINT, START,
TABLE, TABLESAMPLE, THEN, TIME, TIMESTAMP, TO, TRANSFORM, TRIGGER, TRUE, TRUNCATE,
UNBOUNDED, UNION, UNIQUEJOIN, UPDATE, USER, USING, UTC_TIMESTAMP, VALUES, VARCHAR, VIEWS,
WHEN, WHERE, WINDOW, WITH
If you use these keywords as identifiers, you must enclose them in double quotes (") in your query
statements.
72
Amazon Athena User Guide
Examples of Queries with Reserved Words
The following example queries include a column name containing the DDL-related reserved keywords in
ALTER TABLE ADD PARTITION and ALTER TABLE DROP PARTITION statements. The DDL reserved
keywords are enclosed in backticks (`):
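The DDL examples are not reproduced in this extract. The following is a sketch with a hypothetical table name and partition value, using the reserved keyword date enclosed in backticks:

```python
# Sketch of the ALTER TABLE examples described above. The table name and
# partition value are hypothetical; `date` is a DDL reserved keyword, so
# it is enclosed in backticks.
add_partition = "ALTER TABLE orders ADD PARTITION (`date` = '2016-05-14')"
drop_partition = "ALTER TABLE orders DROP PARTITION (`date` = '2016-05-14')"
```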
The following example query includes a reserved keyword (end) as an identifier in a SELECT statement.
The keyword is escaped in double quotes:
SELECT *
FROM TestTable
WHERE "end" IS NOT NULL;
The following example query includes a reserved keyword (first) in a SELECT statement. The keyword is
escaped in double quotes:
SELECT "itemId"."first"
FROM testTable
LIMIT 10;
Table Location in Amazon S3
Use these tips and examples when you specify the location in Amazon S3.
• Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and
cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path
only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location,
remove any empty files, and move unneeded files to another location.
• In the LOCATION clause, use a trailing slash for your bucket.
Use:
s3://bucketname/prefix/
• Do not use filenames, underscores, wildcards, or glob patterns for specifying file locations.
• Do not add the full HTTP notation, such as s3.amazon.com to the Amazon S3 bucket path.
• Do not use empty prefixes (with the extra /) in the path, as follows: S3://bucketname/prefix//
prefix/. While this is a valid Amazon S3 path, Athena does not allow it and changes it to s3://
bucketname/prefix/prefix/, removing the extra /.
Do not use:
s3://path_to_bucket
s3://path_to_bucket/*
s3://path_to_bucket/mySpecialFile.dat
s3://bucketname/prefix/filename.csv
s3://test-bucket.s3.amazon.com
S3://bucket/prefix//prefix/
arn:aws:s3:::bucketname/prefix
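These rules can be summarized in a small checker. This is an illustrative sketch of the tips above, not an exhaustive validation of what Athena accepts.

```python
# A small checker encoding the tips above: require the lowercase s3://
# scheme, a trailing slash, and no wildcards, glob patterns, or empty
# prefixes. Illustrative sketch only.
def is_usable_table_location(location: str) -> bool:
    if not location.startswith("s3://"):
        return False  # rejects S3:// (uppercase), ARNs, and HTTP-style paths
    if not location.endswith("/"):
        return False  # rejects file names and bare bucket paths
    if any(ch in location for ch in "*?"):
        return False  # rejects wildcards and glob patterns
    if "//" in location[len("s3://"):]:
        return False  # rejects empty prefixes like prefix//prefix/
    return True

assert is_usable_table_location("s3://bucketname/prefix/")
assert not is_usable_table_location("s3://path_to_bucket")
assert not is_usable_table_location("s3://path_to_bucket/*")
assert not is_usable_table_location("S3://bucket/prefix//prefix/")
```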
Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving
performance and reducing cost. Athena leverages Hive for partitioning data. You can partition your
data by any key. A common practice is to partition the data based on time, often leading to a multi-
level partitioning scheme. For example, a customer who has data coming in every hour might decide to
partition by year, month, date, and hour. Another customer, who has data coming from many different
sources but loaded one time per day, may partition by a data source identifier and date.
If you issue queries against Amazon S3 buckets with a large number of objects and the data is not
partitioned, such queries may affect the Get request rate limits in Amazon S3 and lead to Amazon S3
exceptions. To prevent errors, partition your data. Additionally, consider tuning your Amazon S3 request
rates. For more information, see Request Rate and Performance Considerations.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use
PARTITIONED BY to define the keys by which to partition data. There are two scenarios discussed
below:
1. Data is already partitioned, stored on Amazon S3, and you need to access the data on Athena.
2. Data is not partitioned.
Scenario 1: Data already partitioned and stored on S3 in Hive format
aws s3 ls s3://elasticmapreduce/samples/hive-ads/tables/impressions/
PRE dt=2009-04-12-13-00/
PRE dt=2009-04-12-13-05/
PRE dt=2009-04-12-13-10/
PRE dt=2009-04-12-13-15/
PRE dt=2009-04-12-13-20/
PRE dt=2009-04-12-14-00/
PRE dt=2009-04-12-14-05/
PRE dt=2009-04-12-14-10/
PRE dt=2009-04-12-14-15/
PRE dt=2009-04-12-14-20/
PRE dt=2009-04-12-15-00/
PRE dt=2009-04-12-15-05/
Here, logs are stored with the column name (dt) set equal to date, hour, and minute increments. When
you give a DDL statement the location of the parent folder, the schema, and the name of the partitioned
column, Athena can query data in those subfolders.
Creating a Table
To make a table out of this data, create a partition along 'dt' as in the following Athena DDL statement:
This table uses Hive's native JSON serializer-deserializer to read JSON data stored in Amazon S3. For
more information about the formats supported, see Supported Data Formats, SerDes, and Compression
Formats (p. 192).
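The DDL statement itself is not reproduced in this extract. The following is a sketch reconstructed from the partition listing above and the sample impression fields shown later in this guide; the exact statement may differ, and the column list is abbreviated.

```python
# Reconstruction sketch of the partitioned-table DDL described above.
# Columns are a subset of the fields in the sample impression records;
# the exact statement in the original example may differ.
ddl = """
CREATE EXTERNAL TABLE impressions (
  requestBeginTime string, adId string, impressionId string,
  referrer string, userAgent string, userCookie string, ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/'
"""
```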
After you execute this statement in Athena, choose New Query and execute:
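The statement to run is not reproduced in this extract. Loading the partitions that already exist under the table's location is typically done with MSCK REPAIR TABLE; this sketch assumes the impressions table from the sample data above.

```python
# Sketch: load all existing Hive-format partitions for the table.
# Assumes the table name impressions from the sample data above.
load_partitions = "MSCK REPAIR TABLE impressions"
```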
2009-04-12-13-20 ap3HcVKAWfXtgIPu6WpuUfAfL0DQEc
2009-04-12-13-20 17uchtodoS9kdeQP1x0XThKl5IuRsV
2009-04-12-13-20 JOUf1SCtRwviGw8sVcghqE5h0nkgtp
2009-04-12-13-20 NQ2XP0J0dvVbCXJ0pb4XvqJ5A4QxxH
2009-04-12-13-20 fFAItiBMsgqro9kRdIwbeX60SROaxr
2009-04-12-13-20 V4og4R9W6G3QjHHwF7gI1cSqig5D1G
2009-04-12-13-20 hPEPtBwk45msmwWTxPVVo1kVu4v11b
2009-04-12-13-20 v0SkfxegheD90gp31UCr6FplnKpx6i
2009-04-12-13-20 1iD9odVgOIi4QWkwHMcOhmwTkWDKfj
2009-04-12-13-20 b31tJiIA25CK8eDHQrHnbcknfSndUk
Scenario 2: Data is not partitioned
In this case, you would have to use ALTER TABLE ADD PARTITION to add each partition manually.
For example, to load the data in s3://athena-examples/elb/plaintext/2015/01/01/, you can run the
following:
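The statement is not reproduced in this extract; the following is a sketch with a hypothetical table name and partition keys, using the path mentioned above.

```python
# Sketch of adding one partition manually. The table name and partition
# keys are hypothetical; the location matches the path mentioned above.
add_partition = (
    "ALTER TABLE elb_logs ADD PARTITION (year='2015', month='01', day='01') "
    "LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/'"
)
```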
You can also automate adding partitions by using the JDBC driver (p. 43).
77
Amazon Athena User Guide
Columnar Storage Formats
Columnar storage formats have the following characteristics that make them suitable for use with
Athena:
• Compression by column, with compression algorithm selected for the column data type to save storage
space in Amazon S3 and reduce disk space and I/O during query processing.
• Predicate pushdown in Parquet and ORC enables Athena queries to fetch only the blocks they need,
improving query performance. When an Athena query obtains specific column values from your data,
it uses statistics from data block predicates, such as max/min values, to determine whether to read or
skip the block.
• Splitting of data in Parquet and ORC allows Athena to split the reading of data to multiple readers and
increase parallelism during its query processing.
To convert your existing raw data from other storage formats to Parquet or ORC, you can run CREATE
TABLE AS SELECT (CTAS) (p. 91) queries in Athena and specify a data storage format as Parquet or
ORC, or use the AWS Glue Crawler.
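As a sketch, a CTAS query of this kind might look like the following; the table names and external location are hypothetical.

```python
# Sketch of a CTAS query that writes a new table in Parquet format.
# Table names and external location are hypothetical.
ctas = """
CREATE TABLE my_parquet_table
WITH (format = 'PARQUET',
      external_location = 's3://my-bucket/parquet/')
AS SELECT * FROM my_raw_table
"""
```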
Converting to Columnar Formats
You can also convert existing Amazon S3 data sources by creating a cluster in Amazon EMR and
converting the data using Hive. The following example using the AWS CLI shows you how to do this with
a script and data stored in Amazon S3.
Overview
The process for converting to columnar formats using an EMR cluster is as follows:
s3://athena-examples/conversion/write-parquet-to-s3.q
78
Amazon Athena User Guide
Overview
Note
Replace REGION in the LOCATION clause with the region where you are running queries. For
example, if your console is in us-east-1, REGION is s3://us-east-1.elasticmapreduce/
samples/hive-ads/tables/.
This creates the table in Hive on the cluster which uses samples located in the Amazon EMR samples
bucket.
3. On Amazon EMR release 4.7.0, include the ADD JAR line to find the appropriate JsonSerDe. The
prettified sample data looks like the following:
{
"number": "977680",
"referrer": "fastcompany.com",
"processId": "1823",
"adId": "TRktxshQXAHWo261jAHubijAoNlAqA",
"browserCookie": "mvlrdwrmef",
"userCookie": "emFlrLGrm5fA2xLFT5npwbPuG7kf6X",
"requestEndTime": "1239714001000",
"impressionId": "1I5G20RmOuG2rt7fFGFgsaWk9Xpkfb",
"userAgent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR
2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506; InfoPa",
"timers": {
"modelLookup": "0.3292",
"requestTime": "0.6398"
},
"threadId": "99",
"ip": "67.189.155.225",
"modelId": "bxxiuxduad",
"hostname": "ec2-0-51-75-39.amazon.com",
"sessionId": "J9NOccA3dDMFlixCuSOtl9QBbjs6aS",
"requestBeginTime": "1239714000000"
}
4. In Hive, load the data from the partitions, so the script runs the following:
The script then creates a table that stores your data in a Parquet-formatted file on Amazon S3:
The data is inserted from the impressions table into parquet_hive:
The script stores the above impressions table columns from the date, 2009-04-14-04-05, into s3://
myBucket/myParquet/ in a Parquet-formatted file.
5. After your EMR cluster is terminated, create your table in Athena, which uses the data in the format
produced by the cluster.
3. Create an Amazon EMR cluster using the emr-4.7.0 release to convert the data using the following
AWS CLI emr create-cluster command:
export REGION=us-east-1
export SAMPLEURI=s3://${REGION}.elasticmapreduce/samples/hive-ads/tables/impressions/
export S3BUCKET=myBucketName
For more information, see Create and Use IAM Roles for Amazon EMR in the Amazon EMR
Management Guide.
Look for the script step status. If it is COMPLETED, then the conversion is done and you are ready to
query the data.
5. Create the same table that you created on the EMR cluster.
You can use the same statement as above. Log into Athena and enter the statement in the Query
Editor window:
Alternatively, you can select the view (eye) icon next to the table's name in Catalog:
Querying Data in Amazon Athena Tables
Topics
• Query Results (p. 83)
• Viewing Query History (p. 85)
• Views (p. 86)
• Creating a Table from Query Results (CTAS) (p. 91)
• Querying Arrays (p. 100)
• Querying Arrays with Complex Types and Nested Structures (p. 109)
• Querying Arrays with Maps (p. 115)
• Querying JSON (p. 116)
Query Results
Athena stores query results in Amazon S3.
If necessary, you can access the result files to work with them.
Topics
• Saving Query Results (p. 84)
• The Default Location for Query Results (p. 84)
You can specify the query results location in one of the following ways:
• For individual queries. This way of specifying the settings is known as client-side query settings. The
client-side settings include query results location and encryption configuration. To view or change the
query results location for individual queries, choose Settings in the upper right pane of the Athena
console, or use the OutputLocation API.
• For all queries in a workgroup. This way of specifying the settings is known as workgroup settings. The
workgroup settings include the results location and encryption configuration. To view or change the
query results location for all queries in a workgroup, choose the Workgroup:<workgroup_name> tab
in the console, switch to your workgroup, view or edit the workgroup, and specify the location in the
Query result location field. For more information, see Workgroup Settings (p. 166) and Managing
Workgroups (p. 167). You can also specify this setting with WorkGroupConfiguration in the API.
Important
If you use the API or the drivers, you must specify the query results location in one of the two
ways: for individual queries (client-side query settings), or for all queries in the workgroup. You
cannot omit specifying the location altogether. If the location is not specified in either way,
Athena issues an error at query execution. If you use the Athena console, and don't specify the
query results location using one of the methods, Athena uses the default location (p. 84).
The client-side query settings are used only if your workgroup's settings, which include query
results location and encryption, are not enforced. If your workgroup's settings override client-
side settings, then the location and encryption configuration you specify in the Settings are not
used. This applies to queries you run in the console, by using the API operations, or the driver. If
Override client-side settings is selected, your query uses the workgroup's settings, even though
the settings specified for this particular query may differ from the workgroup's settings. For
more information, see Workgroup Settings Override Client-Side Settings (p. 166).
If you select Override client-side settings for the workgroup, the following screen displays
when you choose Settings. This indicates that client-side settings are not used for queries in this
workgroup.
Saving Query Results
To save the results of the most recent query to CSV, choose the file icon.
To save the results of a query you ran previously, choose History, locate your query, and use Download
Results.
The Default Location for Query Results
Athena uses a default results location when you run queries in the Athena console and your
workgroup settings do not override client-side settings. For more information, see Workgroup
Settings Override Client-Side Settings (p. 166).
In this case only, Athena stores individual query results in this Amazon S3 bucket by default:
aws-athena-query-results-<ACCOUNTID>-<REGION>
Note
The default location is used only for queries you run in the Athena console. If you use the API
or the drivers to run queries, you must specify the query results location using one of the ways:
either for individual queries, using OutputLocation (client-side), or in the workgroup, using
WorkGroupConfiguration.
Query results are saved based on the name of the query and the date the query ran, as follows:
{QueryLocation}/{QueryName|Unsaved}/{yyyy}/{mm}/{dd}/{QueryID}.csv
{QueryLocation}/{QueryName|Unsaved}/{yyyy}/{mm}/{dd}/{QueryID}.csv.metadata
In this notation:
• QueryLocation is the base location for all query results if the workgroup's settings are not used. To
view or change this location, choose Settings in the upper right pane. You can enter a new value for
Query result location at any time. You can also choose to encrypt individual query results in Amazon
S3. For more information, see Configuring Encryption Options (p. 62).
• QueryName is the name of the query for which the results are saved. If the query wasn't saved,
Unsaved appears. To see a list of queries and examine their SQL statements, choose Saved queries.
• yyyy/mm/dd/ is the date the query ran.
• QueryID is the unique ID of the query.
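The naming pattern above can be sketched as a small helper for locating result files. It is an illustrative sketch, not an Athena API; the bucket and query ID are hypothetical.

```python
import datetime

# Reproduces the result-file naming pattern described above:
# {QueryLocation}/{QueryName|Unsaved}/{yyyy}/{mm}/{dd}/{QueryID}.csv
def result_file_key(query_location, query_name, run_date, query_id):
    name = query_name or "Unsaved"  # unsaved queries go under "Unsaved"
    return f"{query_location}/{name}/{run_date:%Y/%m/%d}/{query_id}.csv"

key = result_file_key(
    "s3://aws-athena-query-results-123456789012-us-east-1",  # hypothetical default location
    None,
    datetime.date(2019, 3, 5),
    "5f0f7f70-1234-5678-9abc-def012345678",                  # hypothetical query ID
)
```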
Viewing Query History
To retain query history for a longer period, write a program using methods from the Athena API and the
AWS CLI to periodically retrieve the query history and save it to a data store:
Views
A view in Amazon Athena is a logical table, not a physical table. The query that defines a view runs each
time the view is referenced in a query.
You can create a view from a SELECT query and then reference this view in future queries. For more
information, see CREATE VIEW (p. 228).
Topics
• When to Use Views? (p. 86)
• Supported Actions for Views in Athena (p. 87)
• Considerations for Views (p. 87)
• Limitations for Views (p. 88)
• Working with Views in the Console (p. 88)
• Creating Views (p. 89)
• Examples of Views (p. 90)
• Updating Views (p. 91)
• Deleting Views (p. 91)
When to Use Views?
You may want to create views to:
• Query a subset of data. For example, you can create a table with a subset of columns from the original
table to simplify querying data.
• Combine multiple tables in one query. When you have multiple tables and want to combine them with
UNION ALL, you can create a view with that expression to simplify queries against the combined
tables.
• Hide the complexity of existing base queries and simplify queries run by users. Base queries often include
joins between tables, expressions in the column list, and other SQL syntax that make it difficult to
understand and debug them. You might create a view that hides the complexity and simplifies queries.
• Experiment with optimization techniques and create optimized queries. For example, if you find a
combination of WHERE conditions, JOIN order, or other expressions that demonstrate the best
performance, you can create a view with these clauses and expressions. Applications can then make
relatively simple queries against this view. If you later find a better way to optimize the original query,
when you recreate the view, all the applications immediately take advantage of the optimized base
query.
• Hide the underlying table and column names, and minimize maintenance problems if those names
change. In that case, you recreate the view using the new names. All queries that use the view rather
than the underlying tables keep running with no changes.
Supported Actions for Views in Athena
Athena supports the following actions for views. You can run these statements in the Query Editor.
Statement Description
CREATE VIEW (p. 228) Creates a new view from a specified SELECT query. For more information,
see Creating Views (p. 89).
The optional OR REPLACE clause lets you update the existing view by
replacing it.
DESCRIBE VIEW (p. 229) Shows the list of columns for the named view. This allows you to examine
the attributes of a complex view.
DROP VIEW (p. 230) Deletes an existing view. The optional IF EXISTS clause suppresses
the error if the view does not exist. For more information, see Deleting
Views (p. 91).
SHOW CREATE VIEW (p. 232) Shows the SQL statement that creates the specified view.
SHOW VIEWS (p. 234) Lists the views in the specified database, or in the current database if you
omit the database name. Use the optional LIKE clause with a regular
expression to restrict the list of view names. You can also see the list of
views in the left pane in the console.
Considerations for Views
• In Athena, you can preview and work with views created in the Athena console, in the AWS Glue Data
Catalog, if you have migrated to using it, or with Presto running on an Amazon EMR cluster connected
to the same catalog. You cannot preview or add to Athena views that were created in other ways.
• If you are creating views through the AWS Glue Data Catalog, you must include the PartitionKeys
parameter and set its value to an empty list, as follows: "PartitionKeys":[]. Otherwise, your view
query will fail in Athena. The following example shows a view created from the Data Catalog with
"PartitionKeys":[]:
"Location":"s3://xxxxx/Oct2018/25Oct2018/",
"InputFormat":"org.apache.hadoop.mapred.TextInputFormat",
"OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
"SerdeInfo":{"SerializationLibrary":"org.apache.hadoop.hive.serde2.OpenCSVSerde",
"Parameters":{"separatorChar": "|", "serialization.format": "1"}}},"PartitionKeys":
[]}'
• If you have created Athena views in the Data Catalog, the Data Catalog treats views as tables. You can
use table-level fine-grained access control in the Data Catalog to restrict access (p. 51) to these views.
Limitations for Views
• Athena prevents you from running recursive views and displays an error message in such cases. A
recursive view is a view query that references itself.
• Athena detects stale views and displays an error message in such cases. A stale view is a view query
that references tables or databases that do not exist.
• You can create and run nested views as long as the query behind the nested view is valid and the
tables and databases exist.
Working with Views in the Console
In the Athena console, you can:
• Locate all views in the left pane, where tables are listed. Athena runs a SHOW VIEWS (p. 234)
operation to present this list to you.
• Filter views.
• Preview a view, show its properties, edit it, or delete it.
A view shows up in the console only if you have already created it.
1. In the Athena console, choose Views, choose a view, and then expand it. The view displays with the
columns it contains.
2. In the list of views, choose a view, and open the context (right-click) menu. The actions menu icon (⋮)
is highlighted for the view that you chose, and the list of actions opens.
3. Choose an option. For example, Show properties shows the view name, the name of the database in
which the view's table is created in Athena, and the time stamp when the view was created.
Creating Views
You can create a view from any SELECT query.
Before you create a view, choose a database and then choose a table. Run a SELECT query on a table and
then create a view from it.
View names cannot contain special characters, other than underscore (_). See Names for Tables,
Databases, and Columns (p. 71). Avoid using Reserved Keywords (p. 72) for naming views.
Run the view query, debug it if needed, and save it.
Alternatively, create a query in the Query Editor, and then use Create view from query.
If you run a view that is not valid, Athena displays an error message.
If you delete a table from which the view was created, when you attempt to run the view, Athena
displays an error message.
You can create a nested view, which is a view on top of an existing view. Athena prevents you from
running a recursive view that references itself.
Examples of Views
To show the syntax of the view query, use SHOW CREATE VIEW (p. 232).
Example 1
Consider the following two tables: a table employees with two columns, id and name, and a table
salaries, with two columns, id and salary.
In this example, we create a view named name_salary as a SELECT query that obtains a list of IDs
mapped to salaries from the tables employees and salaries:
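The CREATE VIEW statement itself is not shown in this excerpt. A minimal sketch, assuming the employees and salaries tables described above share an id column:

```sql
-- Sketch: maps employee IDs to salaries
CREATE VIEW name_salary AS
SELECT employees.id, salaries.salary
FROM employees
JOIN salaries ON employees.id = salaries.id;
```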
Example 2
In the following example, we create a view named view1 that enables you to hide more complex query
syntax.
This view runs on top of two tables, table1 and table2, where each table is a different SELECT query.
The view selects all columns from table1 and joins the results with table2. The join is based on
column a that is present in both tables.
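A sketch of such a view definition; the subqueries table1 and table2 and their source tables x and y are hypothetical names used only for illustration:

```sql
-- Sketch: hides a join between two SELECT subqueries behind one view
CREATE VIEW view1 AS
WITH
  table1 AS (
    SELECT a, MAX(b) AS the_max FROM x GROUP BY a),
  table2 AS (
    SELECT a, AVG(d) AS the_avg FROM y GROUP BY a)
SELECT table1.*, table2.the_avg
FROM table1
JOIN table2 ON table1.a = table2.a;  -- join on column a, present in both
```

Applications can then run SELECT * FROM view1 instead of repeating the join.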
Updating Views
After you create a view, it appears in the Views list in the left pane.
To edit the view, choose it, choose the context (right-click) menu, and then choose Show/edit query. You
can also edit the view in the Query Editor. For more information, see CREATE VIEW (p. 228).
Deleting Views
To delete a view, choose it, choose the context (right-click) menu, and then choose Delete view. For more
information, see DROP VIEW (p. 230).
Creating a Table from Query Results (CTAS)
A CREATE TABLE AS SELECT (CTAS) query creates a new table in Athena from the results of a SELECT
statement from another query. Athena stores data files created by the CTAS statement in a specified
location in Amazon S3. Use CTAS queries to:
• Create tables from query results in one step, without repeatedly querying raw data sets. This makes it
easier to work with raw data sets.
• Transform query results into other storage formats, such as Parquet and ORC. This improves
query performance and reduces query costs in Athena. For information, see Columnar Storage
Formats (p. 78).
• Create copies of existing tables that contain only the data you need.
Topics
• Considerations and Limitations for CTAS Queries (p. 91)
• Running CTAS Queries in the Console (p. 93)
• Bucketing vs Partitioning (p. 96)
• Examples of CTAS Queries (p. 97)
CTAS query syntax: The CTAS query syntax differs from the syntax of CREATE [EXTERNAL] TABLE used
for creating tables. See CREATE TABLE AS (p. 226).
Note
Table, database, or column names for CTAS queries should not contain
quotes or backticks. To ensure this, check that your table, database, or
column names do not represent reserved words (p. 72), and do not contain
special characters (which require enclosing them in quotes or backticks). For
more information, see Names for Tables, Databases, and Columns (p. 71).
CTAS queries vs views: CTAS queries write new data to a specified location in Amazon S3, whereas views
do not write any data.
Location of CTAS query results: The location for storing CTAS query results in Amazon S3 must be
empty. A CTAS query checks that the path location (prefix) in the bucket is empty and never overwrites
the data if the location already has data in it. To use the same location again, delete the data in the key
prefix location in the bucket; otherwise, your CTAS query will fail.
You can specify the location for storing your CTAS query results. If omitted and if your
workgroup does not override client-side settings (p. 166), Athena uses this location
by default: s3://aws-athena-query-results-<account>-<region>/<query-
name-or-unsaved>/<year>/<month>/<date>/<query-id>/.
If your workgroup overrides client-side settings, this means that the workgroup's
query result location is used for your CTAS queries. If you specify a different
results location, your query will fail. To obtain the results location specified for the
workgroup, view workgroup's details (p. 170).
If the workgroup in which a query will run is configured with an enforced query
results location (p. 166), do not specify an external_location for the CTAS
query. Athena issues an error and fails a query that specifies an external_location in this case. For
example, the following query fails if the workgroup overrides client-side settings for the query results
location, enforcing use of its own location:
CREATE TABLE <DB>.<TABLE1> WITH (format='Parquet', external_location='s3://my_test/test/') AS SELECT * FROM <DB>.<TABLE2> LIMIT 10;
Formats for storing query results: The results of CTAS queries are stored in Parquet by default, if you
don't specify a data storage format. You can store CTAS results in PARQUET, ORC, AVRO, JSON, and
TEXTFILE. CTAS queries do not require specifying a SerDe to interpret format transformations. See
Example 5: Storing Results of a CTAS Query in Another Format (p. 98).
Compression formats: GZIP compression is used for CTAS query results by default. For Parquet and
ORC, you can also specify SNAPPY. See Example 4: Specifying Data Storage and Compression Formats
for CTAS Query Results (p. 98).
Partitioning: You can partition the results data of a CTAS query by one or more columns. When creating
a partitioned table, Athena automatically adds partitions to the AWS Glue Data Catalog.
Important
The results of a CTAS query in Athena can have a maximum of 100 partitions
that Athena creates for you when writing CTAS query results to a specified
location in Amazon S3. If the number of partitions to which the results data
is written in parallel exceeds 100, Athena issues an error.
List partition columns at the end of the SELECT statement in a CTAS query.
For more information, see Example 7: CTAS Queries with Partitions (p. 99)
and Bucketing vs Partitioning (p. 96).
Bucketing: You can configure buckets for storing the results of a CTAS query and bucket data by one
or more columns. There is no limit to the number of buckets you can specify. For more information, see
Example 8: A CTAS Query with Bucketing (p. 100) and Bucketing vs Partitioning (p. 96).
Encryption: You can encrypt CTAS query results in Amazon S3, similar to the way you encrypt other
query results in Athena. For more information, see Configuring Encryption Options (p. 62).
Data types: Column data types for a CTAS query are the same as specified for the original query.
Running CTAS Queries in the Console
To create a CTAS query from another query
1. Run the query, choose Create, and then choose Create table from query.
2. In the Create a new table on the results of a query form, complete the fields as follows:
f. Choose Next to review your query and revise it as needed. For query syntax, see CREATE TABLE
AS (p. 226). The preview window opens.
g. Choose Create.
3. Choose Run query.
Use the CREATE TABLE AS SELECT template to create a CTAS query from scratch.
1. In the Athena console, choose Create table, and then choose CREATE TABLE AS SELECT.
2. In the Query Editor, edit the query as needed. For query syntax, see CREATE TABLE AS (p. 226).
3. Choose Run query.
4. Optionally, choose Save as to save the query.
Bucketing vs Partitioning
You can use partitioning and bucketing for storing data from CTAS query results in Amazon S3. For
information about CTAS queries, see CREATE TABLE AS SELECT (CTAS) (p. 91).
This section discusses partitioning and bucketing as they apply to CTAS queries only. For general
guidelines about using partitioning in CREATE TABLE queries, see Top Performance Tuning Tips for
Amazon Athena.
Use the following tips to decide whether to partition and/or to configure bucketing, and to select
columns in your CTAS queries by which to do so:
• Partitioning CTAS query results works well when the number of partitions you plan to have is limited.
When you run a CTAS query, Athena writes the results to a specified location in Amazon S3. If you
specify partitions, it creates them and stores each partition in a separate partition folder in the same
location. The maximum number of partitions you can configure with CTAS query results is 100.
Having partitions in Amazon S3 helps with Athena query performance, because this helps you run
targeted queries for only specific partitions. Athena then scans only those partitions, saving you query
costs and query time. For information about partitioning syntax, search for partitioned_by in CREATE
TABLE AS (p. 226).
Partition data by those columns that have similar characteristics, such as records from the same
department, and that can have a limited number of possible values, such as a limited number of
distinct departments in an organization. This characteristic is known as data cardinality. For example,
if you partition by the column department, and this column has a limited number of distinct values,
partitioning by department works well and decreases query latency.
• Bucketing CTAS query results works well when you bucket data by the column that has high cardinality
and evenly distributed values.
For example, columns storing timestamp data could potentially have a very large number of distinct
values, and their data is evenly distributed across the data set. This means that a column storing
timestamp type data will most likely have values and won't have nulls. This also means that data
from such a column can be put in many buckets, where each bucket will have roughly the same
amount of data stored in Amazon S3.
You can specify any number of buckets for your CTAS query results, bucketing by one or more columns.
To choose the column by which to bucket the CTAS query results, use the column that has a high
number of values (high cardinality) and whose data can be split for storage into many buckets that
will have roughly the same amount of data. Columns that are sparsely populated with values are not
good candidates for bucketing. This is because you will end up with buckets that have less data and
other buckets that have a lot of data. By comparison, columns that you predict will almost always have
values, such as timestamp type values, are good candidates for bucketing. This is because their data
has high cardinality and can be stored in roughly equal chunks.
For more information about bucketing syntax, search for bucketed_by in CREATE TABLE
AS (p. 226).
In summary, you can use both partitioning and bucketing for storing results of the same CTAS query.
These techniques for writing data are not mutually exclusive. Typically, the columns you use for
bucketing differ from those you use for partitioning.
For example, if your dataset has columns department, sales_quarter, and ts (for storing
timestamp type data), you can partition your CTAS query results by department and sales_quarter.
These columns have relatively low cardinality of values: a limited number of departments and sales
quarters. Also, for partitions, it does not matter if some records in your dataset have null or no values
assigned for these columns. What matters is that data with the same characteristics, such as data from
the same department, will be in one partition that you can query in Athena.
At the same time, because all of your data has timestamp type values stored in a ts column, you can
configure bucketing for the same query results by the column ts. This column has high cardinality. You
can store its data in more than one bucket in Amazon S3. Consider an opposite scenario: if you don't
create buckets for timestamp type data and run a query for particular date or time values, then you
would have to scan a very large amount of data stored in a single location in Amazon S3. Instead, if you
configure buckets for storing your date- and time-related results, you scan and query only the buckets
that contain your value, and avoid long-running queries that scan a large amount of data.
Examples of CTAS Queries
Example 1: Duplicating a Table by Selecting All Columns
The following example creates a table by copying all columns from a table:
In the following variation of the same example, your SELECT statement also includes a WHERE clause. In
this case, the query selects only those rows from the table that satisfy the WHERE clause:
Example 2: Selecting Specific Columns from One or More Tables with CTAS
The following example creates a new query that runs on a set of columns from another table:
This variation of the same example creates a new table from specific columns from multiple tables:
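Sketches of both variants; the table and column names are hypothetical:

```sql
-- Select specific columns from one table
CREATE TABLE new_table AS
SELECT column_1, column_2 FROM old_table;

-- Select specific columns from multiple tables
CREATE TABLE new_table AS
SELECT a.column_1, b.column_2
FROM old_table_1 a
JOIN old_table_2 b ON a.id = b.id;
```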
Example 3: Creating an Empty Copy of an Existing Table
The following example uses WITH NO DATA to create a new table that is empty and has the same
schema as the original table:
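A sketch with hypothetical table names:

```sql
-- WITH NO DATA copies only the schema, not the rows
CREATE TABLE new_table AS
SELECT * FROM old_table
WITH NO DATA;
```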
Example 4: Specifying Data Storage and Compression Formats for CTAS Query Results
The following example creates a new CTAS query that saves data in Parquet. This allows you to change
the storage format from the one used by the original table. You can specify PARQUET, ORC, AVRO, JSON,
and TEXTFILE in a similar way.
This example also specifies compression as SNAPPY. If omitted, GZIP is used. GZIP and SNAPPY are the
supported compression formats for CTAS query results stored in Parquet and ORC.
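A sketch of such a query; the table names are hypothetical, while format and parquet_compression are the CTAS table properties for Parquet:

```sql
-- Store CTAS results as Parquet with SNAPPY compression
CREATE TABLE new_table
WITH (format = 'PARQUET',
      parquet_compression = 'SNAPPY')
AS SELECT * FROM old_table;
```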
The following example is similar, but it stores the CTAS query results in ORC and uses the
orc_compression parameter to specify the compression format. If you omit the compression format,
Athena uses GZIP by default.
Example 5: Storing Results of a CTAS Query in Another Format
The following CTAS query takes the results from another query, which could be stored in CSV or another
text format, and stores them in ORC:
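A sketch, assuming a source table backed by text-format files; the table names and Amazon S3 location are hypothetical:

```sql
-- Convert text-format data to ORC via CTAS
CREATE TABLE my_orc_ctas_table
WITH (external_location = 's3://my_test_bucket/ctas/',  -- hypothetical location
      format = 'ORC',
      orc_compression = 'SNAPPY')
AS SELECT * FROM my_csv_table;
```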
Example 6: Creating Unpartitioned CTAS Tables
The following examples create tables that are not partitioned. The table data is stored in different
formats. Some of these examples specify the external location.
The following example creates a CTAS query that stores the results as a text file:
In the following example, results are stored in Parquet, and the default results location is used:
In the following query, the table is stored in JSON, and specific columns are selected from the original
table's results:
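Sketches of the three variants described above; all table names, column names, and Amazon S3 locations are hypothetical:

```sql
-- Results stored as a delimited text file
CREATE TABLE ctas_textfile
WITH (format = 'TEXTFILE',
      field_delimiter = ',',
      external_location = 's3://my_bucket/tables/textfile/')
AS SELECT key1, name1 FROM table1;

-- Results stored as Parquet in the default results location
CREATE TABLE ctas_parquet
WITH (format = 'PARQUET')
AS SELECT key1, name1, comment1 FROM table1;

-- Results stored as JSON, selecting specific columns
CREATE TABLE ctas_json
WITH (format = 'JSON',
      external_location = 's3://my_bucket/tables/json/')
AS SELECT key1, name1, address1 FROM table1;
```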
Example 8: A CTAS Query with Bucketing
The following example shows a CREATE TABLE AS SELECT query that uses both partitioning and
bucketing for storing query results in Amazon S3. The table results are partitioned and bucketed by
different columns. You can create an unlimited number of buckets and bucket by one or more columns.
For syntax, see CTAS Table Properties (p. 227).
For information about choosing the columns for bucketing, see Bucketing vs Partitioning (p. 96).
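A sketch; the names and location are hypothetical, and the partition column comes last in the SELECT list as CTAS requires:

```sql
-- Partition by one column and bucket by another
CREATE TABLE ctas_partitioned_bucketed
WITH (format = 'PARQUET',
      external_location = 's3://my_bucket/tables/partitioned_bucketed/',
      partitioned_by = ARRAY['department'],
      bucketed_by = ARRAY['ts'],
      bucket_count = 10)
AS SELECT name, ts, department  -- partition column department is last
FROM source_table;
```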
Querying Arrays
Amazon Athena lets you create arrays, concatenate them, convert them to different data types, and then
filter, flatten, and sort them.
Topics
• Creating Arrays (p. 100)
• Concatenating Arrays (p. 102)
• Converting Array Data Types (p. 103)
• Finding Lengths (p. 103)
• Accessing Array Elements (p. 103)
• Flattening Nested Arrays (p. 104)
• Creating Arrays from Subqueries (p. 107)
• Filtering Arrays (p. 107)
• Sorting Arrays (p. 108)
• Using Aggregation Functions with Arrays (p. 108)
• Converting Arrays to Strings (p. 109)
Creating Arrays
To build an array literal in Athena, use the ARRAY keyword, followed by brackets [ ], and include the
array elements separated by commas.
Examples
This query creates one array with four elements.
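The query itself is missing from this excerpt; it can be as simple as:

```sql
SELECT ARRAY[1, 2, 3, 4] AS items
```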
It returns:
+-----------+
| items |
+-----------+
| [1,2,3,4] |
+-----------+
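The next query, which creates an array of arrays, is also missing from this excerpt; a sketch that matches the result shown below:

```sql
SELECT ARRAY[ ARRAY[1, 2], ARRAY[3, 4] ] AS items
```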
It returns:
+--------------------+
| items |
+--------------------+
| [[1, 2], [3, 4]] |
+--------------------+
To create an array from selected columns of compatible types, use a query, as in this example:
WITH
dataset AS (
SELECT 1 AS x, 2 AS y, 3 AS z
)
SELECT ARRAY [x,y,z] AS items FROM dataset
+-----------+
| items |
+-----------+
| [1,2,3] |
+-----------+
In the following example, two arrays are selected and returned as a welcome message.
WITH
dataset AS (
SELECT
ARRAY ['hello', 'amazon', 'athena'] AS words,
ARRAY ['hi', 'alexa'] AS alexa
)
SELECT ARRAY[words, alexa] AS welcome_msg
FROM dataset
+----------------------------------------+
| welcome_msg |
+----------------------------------------+
| [[hello, amazon, athena], [hi, alexa]] |
+----------------------------------------+
To create an array of key-value pairs, use the MAP function, which takes an array of keys followed by an
array of values, as in this example:
SELECT ARRAY[
MAP(ARRAY['first', 'last', 'age'],ARRAY['Bob', 'Smith', '40']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Jane', 'Doe', '30']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Billy', 'Smith', '8'])
] AS people
+-----------------------------------------------------------------------------------------------------
+
| people
|
+-----------------------------------------------------------------------------------------------------
+
| [{last=Smith, first=Bob, age=40}, {last=Doe, first=Jane, age=30}, {last=Smith,
first=Billy, age=8}] |
+-----------------------------------------------------------------------------------------------------
+
Concatenating Arrays
To concatenate multiple arrays, use the double pipe || operator between them.
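The example query is not shown in this excerpt; a sketch that matches the result below. When one operand is a single array and the other is an array of arrays, || prepends the single array as an element:

```sql
SELECT ARRAY[4, 5] || ARRAY[ ARRAY[1, 2], ARRAY[3, 4] ] AS items
```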
+--------------------------+
| items |
+--------------------------+
| [[4, 5], [1, 2], [3, 4]] |
+--------------------------+
To combine multiple arrays into a single array, use the concat function.
WITH
dataset AS (
SELECT
ARRAY ['hello', 'amazon', 'athena'] AS words,
ARRAY ['hi', 'alexa'] AS alexa
)
SELECT concat(words, alexa) AS welcome_msg
FROM dataset
+------------------------------------+
| welcome_msg                        |
+------------------------------------+
| [hello, amazon, athena, hi, alexa] |
+------------------------------------+
Converting Array Data Types
To convert data in arrays to supported data types, use the CAST operator. The following example
converts integer elements to VARCHAR:
SELECT
ARRAY [CAST(4 AS VARCHAR), CAST(5 AS VARCHAR)]
AS items
+-------+
| items |
+-------+
| [4,5] |
+-------+
Create two arrays with key-value pair elements, convert them to JSON, and concatenate, as in this
example:
SELECT
ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items
+--------------------------------------------------+
| items |
+--------------------------------------------------+
| [{"a1":1,"a2":2,"a3":3}, {"b1":4,"b2":5,"b3":6}] |
+--------------------------------------------------+
Finding Lengths
The cardinality function returns the length of an array, as in this example:
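A sketch of such a query:

```sql
SELECT cardinality(ARRAY[1, 2, 3, 4]) AS item_count
```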
+------------+
| item_count |
+------------+
| 4 |
+------------+
Accessing Array Elements
To access array elements, use the [] operator, with 1 specifying the first element:
WITH dataset AS (
SELECT
ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items )
SELECT items[1] AS item FROM dataset
+------------------------+
| item |
+------------------------+
| {"a1":1,"a2":2,"a3":3} |
+------------------------+
To access the elements of an array at a given position (known as the index position), use the
element_at() function and specify the array name and the index position:
• If the index is greater than 0, element_at() returns the element that you specify, counting from the
beginning to the end of the array. It behaves as the [] operator.
• If the index is less than 0, element_at() returns the element counting from the end to the beginning
of the array.
The following query creates an array words, and selects the first element hello from it as the
first_word, the second element amazon (counting from the end of the array) as the middle_word,
and the third element athena, as the last_word.
WITH dataset AS (
SELECT ARRAY ['hello', 'amazon', 'athena'] AS words
)
SELECT
element_at(words, 1) AS first_word,
element_at(words, -2) AS middle_word,
element_at(words, cardinality(words)) AS last_word
FROM dataset
+----------------------------------------+
| first_word | middle_word | last_word |
+----------------------------------------+
| hello | amazon | athena |
+----------------------------------------+
Flattening Nested Arrays
To flatten a nested array's elements into a single array of values, use the flatten function. The query
returns one row containing the flattened array:
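A sketch of a flatten query that matches the result below:

```sql
SELECT flatten(ARRAY[ ARRAY[1, 2], ARRAY[3, 4] ]) AS items
```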
+-----------+
| items |
+-----------+
| [1,2,3,4] |
+-----------+
To flatten an array into multiple rows, use CROSS JOIN in conjunction with the UNNEST operator, as in
this example:
WITH dataset AS (
SELECT
'engineering' as department,
ARRAY['Sharon', 'John', 'Bob', 'Sally'] as users
)
SELECT department, names FROM dataset
CROSS JOIN UNNEST(users) as t(names)
+----------------------+
| department | names |
+----------------------+
| engineering | Sharon |
+----------------------|
| engineering | John |
+----------------------|
| engineering | Bob |
+----------------------|
| engineering | Sally |
+----------------------+
To flatten an array of key-value pairs, transpose selected keys into columns, as in this example:
WITH
dataset AS (
SELECT
'engineering' as department,
ARRAY[
MAP(ARRAY['first', 'last', 'age'],ARRAY['Bob', 'Smith', '40']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Jane', 'Doe', '30']),
MAP(ARRAY['first', 'last', 'age'],ARRAY['Billy', 'Smith', '8'])
] AS people
)
SELECT names['first'] AS
first_name,
names['last'] AS last_name,
department FROM dataset
CROSS JOIN UNNEST(people) AS t(names)
+--------------------------------------+
| first_name | last_name | department |
+--------------------------------------+
| Bob | Smith | engineering |
| Jane | Doe | engineering |
| Billy | Smith | engineering |
+--------------------------------------+
From a list of employees, select the employee with the highest combined scores. UNNEST can be used
in the FROM clause without a preceding CROSS JOIN as it is the default join operator and therefore
implied.
WITH
dataset AS (
SELECT ARRAY[
CAST(ROW('Sally', 'engineering', ARRAY[1,2,3,4]) AS ROW(name VARCHAR, department
VARCHAR, scores ARRAY(INTEGER))),
CAST(ROW('John', 'finance', ARRAY[7,8,9]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER))),
CAST(ROW('Amy', 'devops', ARRAY[12,13,14,15]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER)))
] AS users
),
users AS (
SELECT person, score
FROM
dataset,
UNNEST(dataset.users) AS t(person),
UNNEST(person.scores) AS t(score)
)
SELECT person.name, person.department, SUM(score) AS total_score FROM users
GROUP BY (person.name, person.department)
ORDER BY (total_score) DESC
LIMIT 1
+---------------------------------+
| name | department | total_score |
+---------------------------------+
| Amy | devops | 54 |
+---------------------------------+
From a list of employees, select the employee with the highest individual score.
WITH
dataset AS (
SELECT ARRAY[
CAST(ROW('Sally', 'engineering', ARRAY[1,2,3,4]) AS ROW(name VARCHAR, department
VARCHAR, scores ARRAY(INTEGER))),
CAST(ROW('John', 'finance', ARRAY[7,8,9]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER))),
CAST(ROW('Amy', 'devops', ARRAY[12,13,14,15]) AS ROW(name VARCHAR, department VARCHAR,
scores ARRAY(INTEGER)))
] AS users
),
users AS (
SELECT person, score
FROM
dataset,
UNNEST(dataset.users) AS t(person),
UNNEST(person.scores) AS t(score)
)
SELECT person.name, score FROM users
ORDER BY (score) DESC
LIMIT 1
+--------------+
| name | score |
+--------------+
| Amy | 15 |
+--------------+
Creating Arrays from Subqueries
To create an array from a collection of rows, use the array_agg function, as in this example:
WITH
dataset AS (
SELECT ARRAY[1,2,3,4,5] AS items
)
SELECT array_agg(i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
+-----------------+
| array_items |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
To create an array of unique values from a set of rows, use the distinct keyword.
WITH
dataset AS (
SELECT ARRAY [1,2,2,3,3,4,5] AS items
)
SELECT array_agg(distinct i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
This query returns the following result. Note that ordering is not guaranteed.
+-----------------+
| array_items |
+-----------------+
| [1, 2, 3, 4, 5] |
+-----------------+
Filtering Arrays
Create an array from a collection of rows if they match the filter criteria.
WITH
dataset AS (
SELECT ARRAY[1,2,3,4,5] AS items
)
SELECT array_agg(i) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
WHERE i > 3
+-------------+
| array_items |
+-------------+
| [4, 5] |
+-------------+
Filter an array based on whether one of its elements contains a specific value, such as 2, as in this
example:
WITH
dataset AS (
SELECT ARRAY
[
ARRAY[1,2,3,4],
ARRAY[5,6,7,8],
ARRAY[9,0]
] AS items
)
SELECT i AS array_items FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
WHERE contains(i, 2)
+--------------+
| array_items |
+--------------+
| [1, 2, 3, 4] |
+--------------+
Sorting Arrays
Create a sorted array of unique values from a set of rows.
WITH
dataset AS (
SELECT ARRAY[3,1,2,5,2,3,6,3,4,5] AS items
)
SELECT array_sort(array_agg(distinct i)) AS array_items
FROM dataset
CROSS JOIN UNNEST(items) AS t(i)
+--------------------+
| array_items |
+--------------------+
| [1, 2, 3, 4, 5, 6] |
+--------------------+
Using Aggregation Functions with Arrays
Note
ORDER BY is not supported for aggregation functions, for example, you cannot use it within
array_agg(x).
WITH
dataset AS (
SELECT ARRAY
[
ARRAY[1,2,3,4],
ARRAY[5,6,7,8],
ARRAY[9,0]
] AS items
),
item AS (
SELECT i AS array_items
FROM dataset, UNNEST(items) AS t(i)
)
SELECT array_items, sum(val) AS total
FROM item, UNNEST(array_items) AS t(val)
GROUP BY array_items
This query returns the following results. The order of returned results is not guaranteed.
+----------------------+
| array_items | total |
+----------------------+
| [1, 2, 3, 4] | 10 |
| [5, 6, 7, 8] | 26 |
| [9, 0] | 9 |
+----------------------+
Converting Arrays to Strings
To convert an array into a single string, use the array_join function, as in this example:
WITH
dataset AS (
SELECT ARRAY ['hello', 'amazon', 'athena'] AS words
)
SELECT array_join(words, ' ') AS welcome_msg
FROM dataset
+---------------------+
| welcome_msg |
+---------------------+
| hello amazon athena |
+---------------------+
Creating a ROW
Note
The examples in this section use ROW as a means to create sample data to work with. When
you query tables within Athena, you do not need to create ROW data types, as they are already
created from your data source. When you use CREATE TABLE, Athena defines a STRUCT in it,
populates it with data, and creates the ROW data type for you, for each row in the dataset. The
underlying ROW data type consists of named fields of any supported SQL data types.
WITH dataset AS (
SELECT
ROW('Bob', 38) AS users
)
SELECT * FROM dataset
+-------------------------+
| users |
+-------------------------+
| {field0=Bob, field1=38} |
+-------------------------+
WITH dataset AS (
SELECT
CAST(
ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)
) AS users
)
SELECT * FROM dataset
+--------------------+
| users |
+--------------------+
| {NAME=Bob, AGE=38} |
+--------------------+
Note
In the example above, you declare name as a VARCHAR because this is its type in Presto. If you
declare this STRUCT inside a CREATE TABLE statement, use String type because Hive defines
this data type as String.
Filtering Arrays Using the . Notation
In the following example, select the accountid field from the useridentity column of an AWS CloudTrail logs table by using the dot (.) notation:
SELECT
CAST(useridentity.accountid AS bigint) as newid
FROM cloudtrail_logs
LIMIT 2;
+--------------+
| newid |
+--------------+
| 112233445566 |
+--------------+
| 998877665544 |
+--------------+
WITH dataset AS (
SELECT ARRAY[
CAST(ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Alice', 35) AS ROW(name VARCHAR, age INTEGER)),
CAST(ROW('Jane', 27) AS ROW(name VARCHAR, age INTEGER))
] AS users
)
SELECT * FROM dataset
+-----------------------------------------------------------------+
| users |
+-----------------------------------------------------------------+
| [{NAME=Bob, AGE=38}, {NAME=Alice, AGE=35}, {NAME=Jane, AGE=27}] |
+-----------------------------------------------------------------+
To define a dataset for an array of values that includes a nested BOOLEAN value, issue this query:
WITH dataset AS (
SELECT
CAST(
ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
) AS sites
)
SELECT * FROM dataset
+----------------------------------------------------------+
| sites |
+----------------------------------------------------------+
| {HOSTNAME=aws.amazon.com, FLAGGEDACTIVITY={ISNEW=true}} |
+----------------------------------------------------------+
Next, to filter and access the BOOLEAN value of that element, continue to use the dot . notation.
WITH dataset AS (
SELECT
CAST(
ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
) AS sites
)
SELECT sites.hostname, sites.flaggedactivity.isnew
FROM dataset
This query selects the nested fields and returns this result:
+------------------------+
| hostname | isnew |
+------------------------+
| aws.amazon.com | true |
+------------------------+
Filtering Arrays Using UNNEST
To filter an array that includes a nested structure by one of its child elements, issue a query with an UNNEST operator:
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
),
CAST(
ROW('news.cnn.com', ROW(false)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
),
CAST(
ROW('netflix.com', ROW(false)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew
BOOLEAN))
)
] as items
)
SELECT sites.hostname, sites.flaggedActivity.isNew
FROM dataset, UNNEST(items) t(sites)
WHERE sites.flaggedActivity.isNew = true
It returns:
+------------------------+
| hostname | isnew |
+------------------------+
| aws.amazon.com | true |
112
Amazon Athena User Guide
Finding Keywords in Arrays Using regexp_like
+------------------------+
The regular expression pattern only needs to be contained within the string; it does not have to match
the entire string. To match the entire string, enclose the pattern with ^ at the beginning and $ at the
end, as in '^pattern$'.
Consider an array of sites containing their hostname, and a flaggedActivity element. This element
includes an ARRAY, containing several MAP elements, each listing different popular keywords and their
popularity count. Assume you want to find a particular keyword inside a MAP in this array.
To search this dataset for sites with a specific keyword, we use regexp_like instead of the similar SQL
LIKE operator, because searching for a large number of keywords is more efficient with regexp_like.
The query in this example uses the regexp_like function to search for terms 'politics|bigdata',
found in values within arrays:
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('news.cnn.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('netflix.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
)
] AS items
),
sites AS (
SELECT sites.hostname, sites.flaggedactivity
FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
+----------------+
| hostname |
+----------------+
| aws.amazon.com |
+----------------+
| news.cnn.com |
+----------------+
The query in the following example adds up the total popularity scores for the sites matching your
search terms with the regexp_like function, and then orders them from highest to lowest.
WITH dataset AS (
SELECT ARRAY[
CAST(
ROW('aws.amazon.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['bigdata', '10']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '50']),
MAP(ARRAY['term', 'count'], ARRAY['analytics', '82']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '74'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('news.cnn.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['politics', '241']),
MAP(ARRAY['term', 'count'], ARRAY['technology', '211']),
MAP(ARRAY['term', 'count'], ARRAY['serverless', '25']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '170'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
),
CAST(
ROW('netflix.com', ROW(ARRAY[
MAP(ARRAY['term', 'count'], ARRAY['cartoons', '1020']),
MAP(ARRAY['term', 'count'], ARRAY['house of cards', '112042']),
MAP(ARRAY['term', 'count'], ARRAY['orange is the new black', '342']),
MAP(ARRAY['term', 'count'], ARRAY['iot', '4'])
])
) AS ROW(hostname VARCHAR, flaggedActivity ROW(flags ARRAY(MAP(VARCHAR, VARCHAR)) ))
)
] AS items
),
sites AS (
SELECT sites.hostname, sites.flaggedactivity
FROM dataset, UNNEST(items) t(sites)
)
SELECT hostname, array_agg(flags['term']) AS terms, SUM(CAST(flags['count'] AS INTEGER)) AS
total
FROM sites, UNNEST(sites.flaggedActivity.flags) t(flags)
WHERE regexp_like(flags['term'], 'politics|bigdata')
GROUP BY (hostname)
ORDER BY total DESC
+------------------------------------+
| hostname | terms | total |
+----------------+-------------------+
| news.cnn.com | politics | 241 |
+----------------+-------------------+
| aws.amazon.com | bigdata  | 10 |
+----------------+-------------------+
Querying Arrays with Maps
To create maps, use the MAP operator and pass it two arrays: the first contains the column (key) names,
and the second contains the values. All values in the arrays must be of the same type. If any of the map
value array elements need to be of a different type, you can convert them later.
Examples
This example selects a user from a dataset. It uses the MAP operator and passes it two arrays. The first
array includes values for column names, such as "first", "last", and "age". The second array consists of
values for each of these columns, such as "Bob", "Smith", "35".
WITH dataset AS (
SELECT MAP(
ARRAY['first', 'last', 'age'],
ARRAY['Bob', 'Smith', '35']
) AS user
)
SELECT user FROM dataset
+---------------------------------+
| user |
+---------------------------------+
| {last=Smith, first=Bob, age=35} |
+---------------------------------+
You can retrieve Map values by selecting the field name followed by [key_name], as in this example:
WITH dataset AS (
SELECT MAP(
ARRAY['first', 'last', 'age'],
ARRAY['Bob', 'Smith', '35']
) AS user
)
SELECT user['first'] AS first_name FROM dataset
+------------+
| first_name |
+------------+
| Bob |
+------------+
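Because this MAP stores every value as a string, you can cast a retrieved value to another type when you
need it. A sketch based on the same sample dataset:
WITH dataset AS (
  SELECT MAP(
    ARRAY['first', 'last', 'age'],
    ARRAY['Bob', 'Smith', '35']
  ) AS user
)
SELECT CAST(user['age'] AS integer) AS age FROM dataset
This returns the age as the integer value 35.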
Querying JSON
Amazon Athena lets you parse JSON-encoded values, extract data from JSON, search for values, and find
the length and size of JSON arrays.
Topics
• Best Practices for Reading JSON Data (p. 116)
• Extracting Data from JSON (p. 117)
• Searching for Values (p. 119)
• Obtaining Length and Size of JSON Arrays (p. 121)
In Amazon Athena, you can create tables from external data and include the JSON-encoded data in
them. For such types of source data, use Athena together with JSON SerDe Libraries (p. 203).
• Convert fields in source data that have an undetermined schema to JSON-encoded strings in Athena.
When Athena creates tables backed by JSON data, it parses the data based on the existing and
predefined schema. However, not all of your data may have a predefined schema. To simplify schema
management in such cases, it is often useful to convert fields in source data that have an undetermined
schema to JSON strings in Athena, and then use JSON SerDe Libraries (p. 203).
For example, consider an IoT application that publishes events with common fields from different
sensors. One of those fields must store a custom payload that is unique to the sensor sending the event.
In this case, since you don't know the schema, we recommend that you store the information as a JSON-
encoded string. To do this, convert data in your Athena table to JSON, as in the following example. You
can also convert JSON-encoded data to Athena data types.
WITH dataset AS (
SELECT
CAST('HELLO ATHENA' AS JSON) AS hello_msg,
CAST(12345 AS JSON) AS some_int,
CAST(MAP(ARRAY['a', 'b'], ARRAY[1,2]) AS JSON) AS some_map
)
SELECT * FROM dataset
+-------------------------------------------+
| hello_msg | some_int | some_map |
+-------------------------------------------+
| "HELLO ATHENA" | 12345 | {"a":1,"b":2} |
+-------------------------------------------+
WITH dataset AS (
SELECT
CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
CAST(JSON '12345' AS INTEGER) AS some_int,
CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
)
SELECT * FROM dataset
+-------------------------------------+
| hello_msg | some_int | some_map |
+-------------------------------------+
| HELLO ATHENA | 12345 | {a:1,b:2} |
+-------------------------------------+
Extracting Data from JSON
Consider a dataset in which each value is a JSON-encoded string with nested properties, as in this
example:
{"name": "Susan Smith",
 "org": "engineering",
 "projects":
    [
     {"name":"project1", "completed":false},
     {"name":"project2", "completed":true}
    ]
}
To extract the name and projects properties from the JSON string, use the json_extract function, as
in the following example:
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset
The returned value is a JSON-encoded string, and not a native Athena data type.
+---------------+------------------------------------------------------------------------------+
| name          | projects                                                                     |
+---------------+------------------------------------------------------------------------------+
| "Susan Smith" | [{"name":"project1","completed":false},{"name":"project2","completed":true}] |
+---------------+------------------------------------------------------------------------------+
To extract the scalar value from the JSON string, use the json_extract_scalar function. It is similar
to json_extract, but returns only scalar values (Boolean, number, or string).
Note
Do not use the json_extract_scalar function on arrays, maps, or structs.
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT
json_extract_scalar(blob, '$.name') AS name,
json_extract_scalar(blob, '$.projects') AS projects
FROM dataset
+---------------------------+
| name | projects |
+---------------------------+
| Susan Smith | |
+---------------------------+
To obtain the first element of the projects property in the example array, use the json_array_get
function and specify the index position.
WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_array_get(json_extract(blob, '$.projects'), 0) AS item
FROM dataset
It returns the value at the specified index position in the JSON-encoded array.
+---------------------------------------+
| item |
+---------------------------------------+
| {"name":"project1","completed":false} |
+---------------------------------------+
To return an Athena string type, use the [] operator inside a JSONPath expression, and then use
the json_extract_scalar function. For more information about [], see Accessing Array
Elements (p. 103).
WITH dataset AS (
SELECT '{"name": "Bob Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},{"name":"project2",
"completed":true}]}'
AS blob
)
SELECT json_extract_scalar(blob, '$.projects[0].name') AS project_name
FROM dataset
+--------------+
| project_name |
+--------------+
| project1 |
+--------------+
Searching for Values
The following query lists the names of the users who are participating in "project2".
WITH dataset AS (
  SELECT * FROM (VALUES
    (JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
    (JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
    (JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project2"]}')
  ) AS t (users)
)
SELECT json_extract_scalar(users, '$.name') AS user
FROM dataset
WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')
+-------------+
| user |
+-------------+
| Susan Smith |
+-------------+
| Jane Smith |
+-------------+
The following query example lists the names of users who have completed projects along with the total
number of completed projects. It performs these actions:
• Defines nested SELECT statements for clarity.
• Extracts the array of projects.
• Converts the array to a native array of key-value pairs using CAST.
• Extracts each individual array element using the UNNEST operator.
• Filters obtained values by completed projects and counts them.
Note
When using CAST to MAP, you can specify the key element as VARCHAR (native String in Presto),
but leave the value as JSON, because the values in the MAP are of different types: String for the
first key-value pair, and Boolean for the second.
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith",
"org": "legal",
"projects": [{"name":"project1", "completed":false}]}'),
(JSON '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project2", "completed":true},
{"name":"project3", "completed":true}]}'),
(JSON '{"name": "Jane Smith",
"org": "finance",
"projects": [{"name":"project2", "completed":true}]}')
) AS t (users)
),
employees AS (
SELECT users, CAST(json_extract(users, '$.projects') AS
ARRAY(MAP(VARCHAR, JSON))) AS projects_array
FROM dataset
),
names AS (
SELECT json_extract_scalar(users, '$.name') AS name, projects
FROM employees, UNNEST (projects_array) AS t(projects)
)
SELECT name, count(projects) AS completed_projects FROM names
WHERE cast(element_at(projects, 'completed') AS BOOLEAN) = true
GROUP BY name
+----------------------------------+
| name | completed_projects |
+----------------------------------+
| Susan Smith | 2 |
+----------------------------------+
| Jane Smith | 1 |
+----------------------------------+
Obtaining Length and Size of JSON Arrays
Example: json_array_length
To obtain the length of a JSON-encoded array, use the json_array_length function, as in this
example:
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name":
"Bob Smith",
"org":
"legal",
"projects": [{"name":"project1", "completed":false}]}'),
(JSON '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project2", "completed":true},
{"name":"project3", "completed":true}]}'),
(JSON '{"name": "Jane Smith",
"org": "finance",
"projects": [{"name":"project2", "completed":true}]}')
) AS t (users)
)
SELECT
json_extract_scalar(users, '$.name') as name,
json_array_length(json_extract(users, '$.projects')) as count
FROM dataset
ORDER BY count DESC
+---------------------+
| name | count |
+---------------------+
| Susan Smith | 2 |
+---------------------+
| Bob Smith | 1 |
+---------------------+
| Jane Smith | 1 |
+---------------------+
Example: json_size
To obtain the size of a JSON-encoded array or object, use the json_size function, and specify the
column containing the JSON string and the JSONPath expression to the array or object.
WITH dataset AS (
  SELECT * FROM (VALUES
    (JSON '{"name": "Bob Smith",
      "org": "legal",
      "projects": [{"name":"project1", "completed":false}]}'),
    (JSON '{"name": "Susan Smith",
      "org": "engineering",
      "projects": [{"name":"project2", "completed":true},
                   {"name":"project3", "completed":true}]}'),
    (JSON '{"name": "Jane Smith",
      "org": "finance",
      "projects": [{"name":"project2", "completed":true}]}')
  ) AS t (users)
)
SELECT
  json_extract_scalar(users, '$.name') AS name,
  json_size(users, '$.projects') AS count
FROM dataset
ORDER BY count DESC
+---------------------+
| name | count |
+---------------------+
| Susan Smith | 2 |
+---------------------+
| Bob Smith | 1 |
+---------------------+
| Jane Smith | 1 |
+---------------------+
What is a Geospatial Query?
Geospatial identifiers, such as latitude and longitude, allow you to convert any mailing address into a set
of geographic coordinates.
Topics
• What is a Geospatial Query? (p. 123)
• Input Data Formats and Geometry Data Types (p. 123)
• List of Supported Geospatial Functions (p. 124)
• Examples: Geospatial Queries (p. 132)
Using geospatial queries in Athena, you can run these and other similar operations:
• Using the following specialized geometry data types: point, line, multiline, polygon, and
multipolygon.
• Expressing relationships between geometry data types, such as distance, equals, crosses,
touches, overlaps, disjoint, and others.
For example, to obtain a point geometry data type from a pair of double values for the geographic
coordinates of Mount Rainier in Athena, use the ST_POINT (double, double) (longitude,
latitude) geospatial function, specifying the longitude first, then latitude:
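For example, a sketch of such a query (the coordinates shown are approximate values for Mount Rainier
and are illustrative only):
SELECT ST_POINT(-121.7602, 46.8527);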
Geometry Data Types
• point
• line
• polygon
• multiline
• multipolygon
eventid string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 's3://my-query-log/csv'
Some of the subsequent examples are based on these tables and rely on two sample files stored in the
Amazon S3 location. These files are not included with Athena and are used for illustration purposes only:
• An earthquakes.csv file, which lists earthquakes that occurred in California. This file has fields that
correspond to the fields in the table earthquakes.
• A california-counties.json file, which lists JSON-encoded county data in the ESRI-compliant
format, and includes many fields, such as AREA, PERIMETER, STATE, COUNTY, and NAME. The
counties table is based on this file and has two fields only: Name (string), and BoundaryShape
(binary).
Constructor Functions
Use constructor functions to obtain binary representations of point, line, or polygon geometry data
types. You can also use these functions to convert binary data to text, and obtain binary values for
geometry data that is expressed as Well-Known Text (WKT).
ST_POINT(double, double)
Returns a binary representation of a point geometry data type.
To obtain the point geometry data type, use the ST_POINT function in Athena. For the input data
values to this function, use geometric values, such as values in the Universal Transverse Mercator (UTM)
Cartesian coordinate system, or geographic map units (longitude and latitude) in decimal degrees. The
longitude and latitude values use the World Geodetic System, also known as WGS 1984, or EPSG:4326.
WGS 1984 is the coordinate system used by the Global Positioning System (GPS).
Syntax:
In the alternative syntax, you can also specify the coordinates as a point data type with two values:
Example. This example uses specific longitude and latitude coordinates from earthquakes.csv:
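A statement of the following shape fits this description; the coordinate values and the FROM clause are
assumptions based on the earthquakes table and the binary output that follows:
SELECT ST_POINT(61.56, -158.54)
FROM earthquakes
LIMIT 1;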
00 00 00 00 01 01 00 00 00 48 e1 7a 14 ae c7 4e 40 e1 7a 14 ae 47 d1 63 c0
00 00 00 00 01 01 00 00 00 20 25 76 6d 6f 80 52 c0 18 3e 22 a6 44 5a 44 40
In the following example, we use the ST_GEOMETRY_TO_TEXT function to obtain the WKT
representation from the binary values:
This query returns a WKT representation of the point geometry type: 1 POINT (-74.006801
40.70522).
ST_LINE(varchar)
Returns a value in the line data type, which is a binary representation of the geometry data type line.
Example:
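A representative call, using an illustrative WKT literal:
SELECT ST_LINE('linestring(1 1, 2 2, 3 3)')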
ST_POLYGON(varchar)
Returns a value in the polygon data type, which is a binary representation of the geometry data type
polygon. Example:
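A representative call, using an illustrative WKT literal:
SELECT ST_POLYGON('polygon ((1 1, 1 4, 4 4, 4 1))')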
ST_GEOMETRY_TO_TEXT (varbinary)
Converts each of the specified geometry data types to text. Returns a value in a geometry data type,
which is a WKT representation of the geometry data type. Example:
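For instance, this sketch converts a point created with ST_POINT back to its WKT text form (the
coordinates are illustrative):
SELECT ST_GEOMETRY_TO_TEXT(ST_POINT(61.56, -158.54))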
ST_GEOMETRY_FROM_TEXT (varchar)
Converts text into a geometry data type. Returns a value in a geometry data type, which is a binary
representation of the geometry data type. Example:
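A sketch that round-trips a point through its text form (the coordinates are illustrative):
SELECT ST_GEOMETRY_FROM_TEXT(ST_GEOMETRY_TO_TEXT(ST_POINT(1, 2)))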
Geospatial Relationship Functions
ST_CONTAINS (geometry, geometry)
Returns TRUE (type boolean) if and only if the left geometry contains the right geometry. Example:
SELECT ST_CONTAINS('POLYGON((0 2,1 1,0 -1,0 2))', 'POLYGON((-1 3,2 1,0 -3,-1 3))')
Operation Functions
Use operation functions to perform operations on geometry data type values. For example, you can
obtain the boundaries of a single geometry data type; intersections between two geometry data types;
difference between left and right geometries, where each is of the same geometry data type; or an
exterior buffer or ring around a particular geometry data type.
All operation functions take as an input one of the geometry data types and return their binary
representations.
ST_BOUNDARY (geometry)
Takes as an input one of the geometry data types, and returns a binary representation of the boundary
geometry data type.
Examples:
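A representative call on a polygon (the WKT literal is illustrative):
SELECT ST_BOUNDARY(ST_POLYGON('polygon ((1 1, 1 4, 4 4, 4 1))'))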
ST_BUFFER (geometry, double)
Takes as an input one of the geometry data types and a distance (type double). Returns a binary
representation of the input geometry, buffered by the specified distance. In the following example, the
map coordinates are specified in longitude and latitude, and the value .072284, which is the buffer
distance, is specified in angular units as decimal degrees:
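A sketch of such a call (the point shown is illustrative):
SELECT ST_BUFFER(ST_POINT(-74.006801, 40.70522), .072284)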
ST_ENVELOPE (geometry)
Takes as an input line, polygon, multiline, and multipolygon geometry data types. Does not
support point geometry data type. Returns a binary representation of an envelope, where an envelope
is a rectangle around the specified geometry data type. Examples:
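A representative call on a line (the WKT literal is illustrative):
SELECT ST_ENVELOPE(ST_LINE('linestring(0 1, 1 0)'))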
ST_EXTERIOR_RING (geometry)
Returns a binary representation of the exterior ring of the input type polygon. Examples:
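A representative call (the WKT literal is illustrative):
SELECT ST_EXTERIOR_RING(ST_POLYGON('polygon ((1 1, 1 4, 4 4, 4 1))'))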
Accessor Functions
Accessor functions are useful to obtain values in types varchar, bigint, or double from different
geometry data types, where geometry is any of the geometry data types supported in Athena: point,
line, polygon, multiline, and multipolygon. For example, you can obtain the area of a polygon
geometry data type, the maximum and minimum X and Y values for a specified geometry data type, the
length of a line, or the number of points in a specified geometry data type.
ST_AREA (geometry)
Takes as an input a geometry data type polygon and returns an area in type double. Example:
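A representative call (the WKT literal is illustrative):
SELECT ST_AREA(ST_POLYGON('polygon ((1 1, 4 1, 4 4, 1 4))'))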
ST_CENTROID (geometry)
Takes as an input a geometry data type polygon, and returns a point that is the center of the polygon's
envelope in type varchar. Examples:
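A representative call (the WKT literal is illustrative):
SELECT ST_CENTROID(ST_POLYGON('polygon ((1 1, 4 1, 4 4, 1 4))'))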
ST_COORDINATE_DIMENSION (geometry)
Takes as input one of the supported geometry types, and returns the count of coordinate components in
type bigint. Example:
SELECT ST_COORDINATE_DIMENSION(ST_POINT(1.5,2.5))
ST_DIMENSION (geometry)
Takes as an input one of the supported geometry types, and returns the spatial dimension of a geometry
in type bigint. Example:
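A representative call (the WKT literal is illustrative); a polygon has a spatial dimension of 2:
SELECT ST_DIMENSION(ST_POLYGON('polygon ((1 1, 4 1, 4 4, 1 4))'))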
ST_IS_CLOSED (geometry)
Returns TRUE (type boolean) if and only if the line is closed. Example:
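A representative call on an open line, which returns FALSE (the WKT literal is illustrative):
SELECT ST_IS_CLOSED(ST_LINE('linestring(0 2, 2 2)'))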
ST_IS_EMPTY (geometry)
Returns TRUE (type boolean) if and only if the specified geometry is empty. Example:
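A representative call on a non-empty point, which returns FALSE (the coordinates are illustrative):
SELECT ST_IS_EMPTY(ST_POINT(1.5, 2.5))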
ST_IS_RING (geometry)
Returns TRUE (type boolean) if and only if the line type is closed and simple. Example:
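A representative call on a line that is both closed and simple (the WKT literal is illustrative):
SELECT ST_IS_RING(ST_LINE('linestring(0 2, 1 1, 0 -1, 0 2)'))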
ST_LENGTH (geometry)
Returns the length of line in type double. Example:
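A representative call (the WKT literal is illustrative); this line has a length of 2.0:
SELECT ST_LENGTH(ST_LINE('linestring(0 2, 2 2)'))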
ST_MAX_X (geometry)
Returns the maximum X coordinate of a geometry in type double. Example:
ST_MAX_Y (geometry)
Returns the maximum Y coordinate of a geometry in type double. Example:
ST_MIN_X (geometry)
Returns the minimum X coordinate of a geometry in type double. Example:
ST_MIN_Y (geometry)
Returns the minimum Y coordinate of a geometry in type double. Example:
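One illustrative query covering the four preceding functions (the WKT literal is illustrative):
SELECT
  ST_MAX_X(ST_LINE('linestring(0 2, 2 2)')) AS max_x,
  ST_MAX_Y(ST_LINE('linestring(0 2, 2 2)')) AS max_y,
  ST_MIN_X(ST_LINE('linestring(0 2, 2 2)')) AS min_x,
  ST_MIN_Y(ST_LINE('linestring(0 2, 2 2)')) AS min_y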
ST_START_POINT (geometry)
Returns the first point of a line geometry data type in type point. Example:
ST_END_POINT (geometry)
Returns the last point of a line geometry data type in type point. Example:
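Representative calls covering ST_START_POINT and ST_END_POINT (the WKT literal is illustrative):
SELECT ST_START_POINT(ST_LINE('linestring(0 2, 2 2)'));
SELECT ST_END_POINT(ST_LINE('linestring(0 2, 2 2)'));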
ST_X (point)
Returns the X coordinate of a point in type double. Example:
ST_Y (point)
Returns the Y coordinate of a point in type double. Example:
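One illustrative query covering ST_X and ST_Y (the coordinates are illustrative); it returns 1.5 and 2.5:
SELECT
  ST_X(ST_POINT(1.5, 2.5)) AS x,
  ST_Y(ST_POINT(1.5, 2.5)) AS y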
ST_POINT_NUMBER (geometry)
Returns the number of points in the geometry in type bigint. Example:
ST_INTERIOR_RING_NUMBER (geometry)
Returns the number of interior rings in the polygon geometry in type bigint. Example:
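A representative call on a polygon with one interior ring (the WKT literal is illustrative):
SELECT ST_INTERIOR_RING_NUMBER(ST_POLYGON('polygon ((0 0, 8 0, 0 8, 0 0), (1 1, 1 5, 5 1, 1 1))'))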
Examples: Geospatial Queries
These examples rely on two sample files:
• An earthquakes.csv sample file, which lists earthquakes that occurred in California. This file has
fields that correspond to the fields in the table earthquakes in the following example.
• A california-counties.json file, which lists JSON-encoded county data in the ESRI-compliant
format, and includes many fields such as AREA, PERIMETER, STATE, COUNTY, and NAME. The
following example shows the counties table from this file with two fields only: Name (string), and
BoundaryShape (binary).
The following code example uses CROSS JOIN on the two tables created earlier.
Additionally, for both tables, it uses ST_CONTAINS and asks for counties whose boundaries include a
geographical location of the earthquakes, specified with ST_POINT. It then groups such counties by
name, orders them by count, and returns them in descending order.
SELECT counties.name,
COUNT(*) cnt
FROM counties
CROSS JOIN earthquakes
WHERE ST_CONTAINS (counties.boundaryshape, ST_POINT(earthquakes.longitude,
earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt DESC
+------------------------+
| name | cnt |
+------------------------+
| Kern | 36 |
+------------------------+
| San Bernardino | 35 |
+------------------------+
| Imperial | 28 |
+------------------------+
| Inyo | 20 |
+------------------------+
| Los Angeles | 18 |
+------------------------+
| Riverside | 14 |
+------------------------+
| Monterey | 14 |
+------------------------+
| Santa Clara | 12 |
+------------------------+
| San Benito | 11 |
+------------------------+
| Fresno | 11 |
+------------------------+
| San Diego | 7 |
+------------------------+
| Santa Cruz | 5 |
+------------------------+
| Ventura | 3 |
+------------------------+
| San Luis Obispo | 3 |
+------------------------+
| Orange | 2 |
+------------------------+
| San Mateo | 1 |
+------------------------+
Querying AWS Service Logs
The tasks in this section use the Athena console, but you can also use other tools that connect via JDBC.
For more information, see Using Athena with the JDBC Driver (p. 43), the AWS CLI, or the Amazon Athena
API Reference.
The topics in this section assume that you have set up both an IAM user with appropriate permissions to
access Athena and the Amazon S3 bucket where the data to query should reside. For more information,
see Setting Up (p. 21) and Getting Started (p. 23).
Topics
• Querying AWS CloudTrail Logs (p. 135)
• Querying Amazon CloudFront Logs (p. 139)
• Querying Classic Load Balancer Logs (p. 140)
• Querying Network Load Balancer Logs (p. 142)
• Querying Application Load Balancer Logs (p. 143)
• Querying Amazon VPC Flow Logs (p. 145)
Querying AWS CloudTrail Logs
CloudTrail logs include details about any API calls made to your AWS services, including the console.
CloudTrail generates encrypted log files and stores them in Amazon S3. For more information, see the
AWS CloudTrail User Guide.
Using Athena with CloudTrail logs is a powerful way to enhance your analysis of AWS service activity. For
example, you can use queries to identify trends and further isolate activity by attributes, such as source
IP address or user.
A common application is to use CloudTrail logs to analyze operational activity for security and
compliance. For information about a detailed example, see the AWS Big Data Blog post, Analyze Security,
Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena.
You can use Athena to query these log files directly from Amazon S3, specifying the LOCATION of log
files. You can do this one of two ways:
• By creating tables for CloudTrail log files directly from the CloudTrail console.
• By manually creating tables for CloudTrail log files in the Athena console.
Topics
• Understanding CloudTrail Logs and Athena Tables (p. 136)
• Creating a Table for CloudTrail Logs in the CloudTrail Console (p. 136)
• Manually Creating the Table for CloudTrail Logs in Athena (p. 137)
• Example Query for CloudTrail Logs (p. 138)
• Tips for Querying CloudTrail Logs (p. 138)
Understanding CloudTrail Logs and Athena Tables
CloudTrail saves logs as JSON text files in compressed gzip format (*.json.gzip). The location of the log
files depends on how you set up trails, the AWS Region or Regions in which you are logging, and other
factors.
For more information about where logs are stored, the JSON structure, and the record file contents, see
the AWS CloudTrail User Guide.
To collect logs and save them to Amazon S3, enable CloudTrail for the console. For more information,
see Creating a Trail in the AWS CloudTrail User Guide.
Note the destination Amazon S3 bucket where you save the logs. Replace the LOCATION clause with
the path to the CloudTrail log location and the set of objects with which to work. The example uses a
LOCATION value of logs for a particular account, but you can use the degree of specificity that suits your
application.
For example:
• To analyze data from multiple accounts, you can roll back the LOCATION specifier to indicate all
AWSLogs by using LOCATION 's3://MyLogFiles/AWSLogs/'.
• To analyze data from a specific date, account, and Region, use LOCATION 's3://
MyLogFiles/123456789012/CloudTrail/us-east-1/2016/03/14/'.
Using the highest level in the object hierarchy gives you the greatest flexibility when you query using
Athena.
Manually Creating the Table for CloudTrail Logs in Athena
1. Copy and paste the following DDL statement into the Athena console.
2. Modify the s3://CloudTrail_bucket_name/AWSLogs/Account_ID/ to point to the Amazon S3
bucket that contains your logs data.
3. Verify that fields are listed correctly. For more information about the full list of fields in a CloudTrail
record, see CloudTrail Record Contents.
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
ARN:STRING,
accountId:STRING,
type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://CloudTrail_bucket_name/AWSLogs/Account_ID/';
4. Run the query in the Athena console. After the query completes, Athena registers
cloudtrail_logs, making the data in it ready for you to issue queries.
SELECT *
FROM cloudtrail_logs
WHERE
eventsource = 's3.amazonaws.com' AND
eventname in ('GetObject') AND
useridentity.accountid LIKE '%ANONYMOUS%' AND
useridentity.arn IS NULL AND
requestparameters LIKE '%[your bucket name]%'
For more information, see the AWS Big Data blog post Analyze Security, Compliance, and Operational
Activity Using AWS CloudTrail and Amazon Athena.
• Before querying the logs, verify that your logs table looks the same as the one in the section called
“Manually Creating the Table for CloudTrail Logs in Athena” (p. 137). If it does not, delete
the existing table using the following command: DROP TABLE cloudtrail_logs;.
• After you drop the existing table, re-create it. For more information, see Creating the Table for
CloudTrail Logs (p. 137).
Verify that fields in your Athena query are listed correctly. For information about the full list of fields
in a CloudTrail record, see CloudTrail Record Contents.
If your query includes fields in JSON formats, such as STRUCT, extract data from JSON. For more
information, see Extracting Data From JSON (p. 117).
Now you are ready to issue queries against your CloudTrail table.
• Start by looking at which IAM users called which API operations and from which source IP addresses.
• Use the following basic SQL query as your template. Paste the query into the Athena console and run it.
SELECT
useridentity.arn,
eventname,
sourceipaddress,
eventtime
FROM cloudtrail_logs
LIMIT 100;
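You can build on this template to aggregate activity. For example, the following query (a sketch that
assumes the cloudtrail_logs table above) counts calls per service and API operation:

SELECT
 eventsource,
 eventname,
 count(*) AS event_count
FROM cloudtrail_logs
GROUP BY eventsource, eventname
ORDER BY event_count DESC
LIMIT 10;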
Querying Amazon CloudFront Logs
Before you begin querying the logs, enable access logging on your CloudFront web distribution. For
information, see Access Logs in the Amazon CloudFront Developer Guide.
The CREATE TABLE statement for CloudFront logs uses the LazySimpleSerDe (p. 206) by default, so the
ROW FORMAT SERDE clause is omitted.
The column date is escaped using backticks (`) because it is a reserved word in Athena. For
information, see Reserved Keywords (p. 72).
Example Query for CloudFront Logs
user_agent STRING,
query_string STRING,
cookie STRING,
result_type STRING,
request_id STRING,
host_header STRING,
request_protocol STRING,
request_bytes BIGINT,
time_taken FLOAT,
xforwarded_for STRING,
ssl_protocol STRING,
ssl_cipher STRING,
response_result_type STRING,
http_version STRING,
fle_status STRING,
fle_encrypted_fields INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://CloudFront_bucket_name/AWSLogs/ACCOUNT_ID/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
2. Run the query in the Athena console. After the query completes, Athena registers the cloudfront_logs
table, making the data in it ready for you to issue queries.
In some cases, you might need to eliminate empty values from your query results for the CloudFront
table. To do so, run:
SELECT DISTINCT *
FROM cloudfront_logs
LIMIT 10;
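As another example, the following query (a sketch that uses only columns from the DDL excerpt above)
shows how CloudFront result types are distributed across requests:

SELECT
 result_type,
 count(*) AS request_count
FROM cloudfront_logs
GROUP BY result_type
ORDER BY request_count DESC;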
For more information, see the AWS Big Data Blog post Build a Serverless Architecture to Analyze Amazon
CloudFront Access Logs Using AWS Lambda, Amazon Athena, and Amazon Kinesis Analytics.
Querying Classic Load Balancer Logs
Before you analyze the Elastic Load Balancing logs, configure them for saving to the destination Amazon
S3 bucket. For more information, see Enable Access Logs for Your Classic Load Balancer.
• Create the table for Elastic Load Balancing logs (p. 141)
• Elastic Load Balancing Example Queries (p. 141)
To create the table for Elastic Load Balancing logs
timestamp string,
elb_name string,
request_ip string,
request_port int,
backend_ip string,
backend_port int,
request_processing_time double,
backend_processing_time double,
client_response_time double,
elb_response_code string,
backend_response_code string,
received_bytes bigint,
sent_bytes bigint,
request_verb string,
url string,
protocol string,
user_agent string,
ssl_cipher string,
ssl_protocol string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*)[:-]([0-9]*) ([-.0-9]*)
([-.0-9]*) ([-.0-9]*) (|[-0-9]*) (-|[-0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*)
(- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$' )
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/elasticloadbalancing/'
2. Modify the Amazon S3 bucket in the LOCATION clause to point to the destination of your Elastic Load
Balancing logs.
3. Run the query in the Athena console. After the query completes, Athena registers the elb_logs
table, making the data in it ready for queries. For more information, see Elastic Load Balancing
Example Queries (p. 141).
SELECT
request_timestamp,
elb_name,
backend_ip,
backend_response_code
FROM elb_logs
WHERE backend_response_code LIKE '4%' OR
backend_response_code LIKE '5%'
LIMIT 100;
Use a subsequent query to sum up the response time of all the transactions grouped by the backend IP
address and Elastic Load Balancing instance name.
SELECT sum(backend_processing_time) AS
total_ms,
elb_name,
backend_ip
FROM elb_logs WHERE backend_ip <> ''
GROUP BY backend_ip, elb_name
LIMIT 100;
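Along the same lines, the following sketch uses the columns defined above to find the backend instances
with the highest average processing time:

SELECT
 backend_ip,
 avg(backend_processing_time) AS avg_backend_time
FROM elb_logs
WHERE backend_ip <> ''
GROUP BY backend_ip
ORDER BY avg_backend_time DESC
LIMIT 10;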
Querying Network Load Balancer Logs
Before you analyze the Network Load Balancer access logs, enable and configure them for saving to the
destination Amazon S3 bucket. For more information, see Access Logs for Your Network Load Balancer.
• Create the table for Network Load Balancer logs (p. 142)
• Network Load Balancer Example Queries (p. 143)
Network Load Balancer Example Queries
LOCATION 's3://your_log_bucket/prefix/AWSLogs/AWS_account_ID/
elasticloadbalancing/region';
2. Modify the Amazon S3 bucket in the LOCATION clause to point to the destination of your Network
Load Balancer logs.
3. Run the query in the Athena console. After the query completes, Athena registers the
nlb_tls_logs table, making the data in it ready for queries.
SELECT count(*) AS
ct,
cert_arn
FROM "nlb_tls_logs"
GROUP BY cert_arn;
The following query shows how many users are using older TLS versions:
SELECT tls_protocol_version,
COUNT(tls_protocol_version) AS
num_connections,
client_ip
FROM "nlb_tls_logs"
WHERE tls_protocol_version < 'tlsv12'
GROUP BY tls_protocol_version, client_ip;
Use the following query to identify connections that take a long TLS handshake time:
SELECT *
FROM "nlb_tls_logs"
ORDER BY tls_handshake_time_ms DESC
LIMIT 10;
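Similarly, the following sketch (using the same columns as the queries above) averages the handshake
time per client to find clients with consistently slow TLS negotiations:

SELECT
 client_ip,
 avg(tls_handshake_time_ms) AS avg_handshake_ms
FROM "nlb_tls_logs"
GROUP BY client_ip
ORDER BY avg_handshake_ms DESC
LIMIT 10;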
Querying Application Load Balancer Logs
Before you begin, enable access logging so that Application Load Balancer logs are saved to your Amazon
S3 bucket.
Example Queries for ALB Logs
2. Run the query in the Athena console. After the query completes, Athena registers the alb_logs
table, making the data in it ready for you to issue queries.
SELECT COUNT(request_verb) AS
count,
request_verb,
client_ip
FROM alb_logs
GROUP BY request_verb, client_ip
LIMIT 100;
SELECT request_url
FROM alb_logs
WHERE user_agent LIKE '%Safari%'
LIMIT 10;
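You can also group requests by the load balancer's response code. The following sketch assumes that
your alb_logs table defines an elb_status_code column, as in the standard ALB access log format:

SELECT
 elb_status_code,
 count(*) AS response_count
FROM alb_logs
GROUP BY elb_status_code
ORDER BY response_count DESC;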
Querying Amazon VPC Flow Logs
Before you begin querying the logs in Athena, enable VPC flow logs, and configure them to be saved to
your Amazon S3 bucket. After you create the logs, let them run for a few minutes to collect some data.
The logs are created in GZIP format, which Athena can query directly.
Example Queries for Amazon VPC Flow Logs
destinationport int,
protocol int,
numpackets int,
numbytes bigint,
starttime int,
endtime int,
action string,
logstatus string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://your_log_bucket/prefix/AWSLogs/{subscribe_account_id}/vpcflowlogs/
{region_code}/'
TBLPROPERTIES ("skip.header.line.count"="1");
3. Run the query in the Athena console. After the query completes, Athena registers the vpc_flow_logs
table, making the data in it ready for you to issue queries.
4. Create partitions to be able to read the data, as in the following sample query. This query creates a
single partition for a specified date. Replace the placeholders for date and location as needed.
Note
This query creates a single partition only, for a date that you specify. To automate the
process, use a script that runs this query and creates partitions this way for the year/
month/day, or use AWS Glue Crawler to create partitions for a given Amazon S3 bucket. For
information, see Scheduling a Crawler to Keep the AWS Glue Data Catalog and Amazon S3 in
Sync (p. 34).
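A single-partition statement of the kind described in the note might look like the following sketch, where
the bucket, prefix, account ID, Region, and date are placeholder values to replace with your own:

ALTER TABLE vpc_flow_logs
ADD PARTITION (dt='2019-03-14')
LOCATION 's3://your_log_bucket/prefix/AWSLogs/your_account_id/vpcflowlogs/us-east-1/2019/03/14';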
This query uses Date and Time Functions and Operators. It converts values in the dt String column to
timestamp with the date function from_iso8601_timestamp(string), and extracts the day of the
week from timestamp with day_of_week.
SELECT day_of_week(from_iso8601_timestamp(dt)) AS
day,
dt,
interfaceid,
sourceaddress,
action,
protocol
FROM vpc_flow_logs
WHERE action = 'REJECT' AND protocol = 6
LIMIT 100;
To see which one of your servers is receiving the highest number of HTTPS requests, use this query. It
counts the number of packets received on HTTPS port 443, groups them by destination IP address, and
returns the top 10.
SELECT SUM(numpackets) AS
packetcount,
destinationaddress
FROM vpc_flow_logs
WHERE destinationport = 443
GROUP BY destinationaddress
ORDER BY packetcount DESC
LIMIT 10;
For more information, see the AWS Big Data blog post Analyzing VPC Flow Logs with Amazon Kinesis
Firehose, Athena, and Amazon QuickSight.
If you anticipate changes in table schemas, consider creating them in a data format that is suitable for
your needs. Your goals are to reuse existing Athena queries against evolving schemas, and avoid schema
mismatch errors when querying tables with partitions.
Important
Schema updates described in this section do not work on tables with complex or nested data
types, such as arrays and structs.
To achieve these goals, choose a table's data format based on the table in the following topic.
Topics
• Summary: Updates and Data Formats in Athena (p. 148)
• Index Access in ORC and Parquet (p. 149)
• Types of Updates (p. 151)
• Updates in Tables with Partitions (p. 156)
Summary: Updates and Data Formats in Athena
In this table, observe that Parquet and ORC are columnar formats with different default column access
methods. By default, Parquet accesses columns by name and ORC by index (ordinal value). Therefore,
Athena provides a SerDe property, defined when you create a table, to toggle the default column access
method, which enables greater flexibility with schema evolution.
For Parquet, the parquet.column.index.access property may be set to TRUE, which sets the column
access method to use the column’s ordinal number. Setting this property to FALSE will change the
column access method to use column name. Similarly, for ORC use the orc.column.index.access
property to control the column access method. For more information, see Index Access in ORC and
Parquet (p. 149).
CSV and TSV allow you to do all schema manipulations except reordering columns or adding columns
at the beginning of the table. For example, if your schema evolution requires only renaming columns, but
not removing them, you can create your tables in CSV or TSV. If you need to remove columns,
do not use CSV or TSV; instead, use any of the other supported formats, preferably a columnar
format such as Parquet or ORC.
Index Access in ORC and Parquet
ORC: Read by Index
Because these are the defaults, specifying these SerDe properties in your CREATE TABLE queries is
optional; they are used implicitly. When used, they allow you to run some schema update operations
while preventing other such operations. To enable those operations, run another CREATE TABLE query
and change the SerDe settings.
Note
The SerDe properties are not automatically propagated to each partition. Use ALTER TABLE
ADD PARTITION statements to set the SerDe properties for each partition. To automate this
process, write a script that runs ALTER TABLE ADD PARTITION statements.
WITH SERDEPROPERTIES (
'orc.column.index.access'='true')
Reading by index allows you to rename columns. But then you lose the ability to remove columns or add
them in the middle of the table.
To make ORC read by name, which will allow you to add columns in the middle of the table or remove
columns in ORC, set the SerDe property orc.column.index.access to FALSE in the CREATE TABLE
statement. In this configuration, you will lose the ability to rename columns.
The following example illustrates how to configure ORC to read by name:
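The relevant SerDe setting is the following (a minimal sketch; the rest of the CREATE TABLE statement is
the same as in the read-by-index example):

WITH SERDEPROPERTIES (
 'orc.column.index.access'='false')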
Parquet: Read by Name
WITH SERDEPROPERTIES (
'parquet.column.index.access'='false')
Reading by name allows you to add columns in the middle of the table and remove columns. But then
you lose the ability to rename columns.
To make Parquet read by index, which will allow you to rename columns, you must create a table with
parquet.column.index.access SerDe property set to TRUE.
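A minimal sketch of that setting follows; the rest of the CREATE TABLE statement is unchanged:

WITH SERDEPROPERTIES (
 'parquet.column.index.access'='true')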
Types of Updates
Here are the types of updates that a table’s schema can have. We review each type of schema update
and specify which data formats allow you to have them in Athena.
Important
Schema updates described in this section do not work on tables with complex or nested data
types, such as arrays and structs.
Depending on how you expect your schemas to evolve, to continue using Athena queries, choose a
compatible data format.
Let’s consider an application that reads orders information from an orders table that exists in two
formats: CSV and Parquet.
In the following sections, we review how updates to these tables affect Athena queries.
Adding Columns at the Beginning or in the Middle of the Table
To add columns at the beginning or in the middle of the table, and continue running queries against
existing tables, use Avro, JSON, or Parquet and ORC with their SerDe property set to read by name. For
information, see Index Access in ORC and Parquet (p. 149).
Do not add columns at the beginning or in the middle of the table in CSV and TSV, as these formats
depend on ordering. Adding a column in such cases will lead to schema mismatch errors when the
schema of partitions changes.
The following example shows adding a column to a JSON table in the middle of the table:
In the following example, drop an existing table in Parquet, and add a new Parquet table with a new
`comment` column at the end of the table:
)
STORED AS PARQUET
LOCATION 's3://schema_updates/orders_parquet/';
In the following example, drop an existing table in CSV and add a new CSV table with a new `comment`
column at the end of the table:
Removing Columns
You may need to remove columns from tables if they no longer contain data, or to restrict access to the
data in them.
• You can remove columns from tables in JSON, Avro, and in Parquet and ORC if they are read by name.
For information, see Index Access in ORC and Parquet (p. 149).
• We do not recommend removing columns from tables in CSV and TSV if you want to retain the tables
you have already created in Athena. Removing a column breaks the schema and requires that you
recreate the table without the removed column.
In this example, remove a column `totalprice` from a table in Parquet and run a query. In Athena,
Parquet is read by name by default, which is why we omit the SERDEPROPERTIES configuration that
specifies reading by name. Notice that the following query succeeds, even though you changed the
schema:
Renaming Columns
You may want to rename columns in your tables to correct spelling, make column names more
descriptive, or to reuse an existing column to avoid column reordering.
You can rename columns if you store your data in CSV and TSV, or in Parquet and ORC that are
configured to read by index. For information, see Index Access in ORC and Parquet (p. 149).
Athena reads data in CSV and TSV in the order of the columns in the schema and returns them in the
same order. It does not use column names for mapping data to a column, which is why you can rename
columns in CSV or TSV without breaking Athena queries.
In this example, rename the column `o_totalprice` to `o_total_price` in the Parquet table, and
then run a query in Athena:
In the Parquet table case, the following query runs, but the renamed column does not show data,
because the column is accessed by name (the default in Parquet) rather than by index:
SELECT *
FROM orders_parquet_column_renamed;
In the CSV table case, the following query runs and the data displays in all columns, including the one
that was renamed:
SELECT *
FROM orders_csv_column_renamed;
Reordering Columns
You can reorder columns only for tables with data in formats that read by name, such as JSON or
Parquet, which reads by name by default. You can also make ORC read by name, if needed. For
information, see Index Access in ORC and Parquet (p. 149).
see Index Access in ORC and Parquet (p. 149).
`o_comment` string,
`o_orderkey` int,
`o_custkey` int,
`o_orderpriority` string,
`o_orderstatus` string,
`o_clerk` string,
`o_shippriority` int,
`o_orderdate` string
)
STORED AS PARQUET
LOCATION 's3://schema_updates/orders_parquet/';
Changing a Column's Data Type
• Only certain data types can be converted to other data types. See the table in this section for data
types that can change.
• For data in Parquet and ORC, you cannot change a column's data type if the table is not partitioned.
For partitioned tables in Parquet and ORC, a partition's column type can be different from another
partition's column type, and Athena will CAST to the desired type, if possible. For information, see
Avoiding Schema Mismatch Errors for Tables with Partitions (p. 156).
Important
We strongly suggest that you test and verify your queries before performing data type
translations. If Athena cannot convert the data type from the original data type to the target
data type, the CREATE TABLE query may fail.
The following table lists data types that you can change:
• INT to BIGINT
• FLOAT to DOUBLE
In the following example of the orders_json table, change the data type for the column
`o_shippriority` to BIGINT:
`o_totalprice` double,
`o_orderdate` string,
`o_orderpriority` string,
`o_clerk` string,
`o_shippriority` BIGINT
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://schema_updates/orders_json';
The following query runs successfully, similar to the original SELECT query, before the data type change:
Updates in Tables with Partitions
• If your table's schema changes, the schemas for partitions are not updated to remain in sync with the
table's schema.
• The AWS Glue Crawler allows you to discover data in partitions with different schemas. This means
that if you create a table in Athena with AWS Glue, after the crawler finishes processing, the schemas
for the table and its partitions may be different.
• If you add partitions directly using an AWS API.
Avoiding Schema Mismatch Errors for Tables with Partitions
Athena processes tables with partitions successfully if they meet the following constraints. If these
constraints are not met, Athena issues a HIVE_PARTITION_SCHEMA_MISMATCH error.
For example, for CSV and TSV formats, you can rename columns, add new columns at the end of the
table, and change a column's data type if the types are compatible, but you cannot remove columns.
For other formats, you can add or remove columns, or change a column's data type to another if the
types are compatible. For information, see Summary: Updates and Data Formats in Athena (p. 148).
Important
Schema updates described in this section do not work on tables with complex or nested data
types, such as arrays and structs.
• For Parquet and ORC data storage types, Athena relies on the column names and uses them for its
column name-based schema verification. This eliminates HIVE_PARTITION_SCHEMA_MISMATCH
errors for tables with partitions in Parquet and ORC. (This is true for ORC if the SerDe property is set to
access columns by name: orc.column.index.access=FALSE. Parquet reads columns by name by
default.)
• For CSV, JSON, and Avro, Athena uses an index-based schema verification. This means that if you
encounter a schema mismatch error, you should drop the partition that is causing a schema mismatch
and recreate it, so that Athena can query it without failing.
Athena compares the table's schema to the partition schemas. If you create a table in CSV, JSON,
or Avro in Athena with AWS Glue Crawler, after the crawler finishes processing, the schemas for
the table and its partitions may be different. If there is a mismatch between the table's schema and
the partition schemas, your queries fail in Athena with a schema verification error similar to this:
'crawler_test.click_avro' is declared as type 'string', but partition 'partition_0=2017-01-17' declared
column 'col68' as type 'double'.
A typical workaround for such errors is to drop the partition that is causing the error and recreate it.
Using Workgroups for Running Queries
Workgroups integrate with IAM, CloudWatch, and Amazon Simple Notification Service as follows:
• IAM identity-based policies with resource-level permissions control who can run queries in a
workgroup.
• Athena publishes the workgroup query metrics to CloudWatch, if you enable query metrics.
• In Amazon SNS, you can create Amazon SNS topics that issue alarms to specified workgroup users
when data usage for queries in a workgroup exceeds your established thresholds.
Topics
• Using Workgroups for Running Queries (p. 158)
• Controlling Costs and Monitoring Queries with CloudWatch Metrics (p. 174)
Topics
• Benefits of Using Workgroups (p. 158)
• How Workgroups Work (p. 159)
• Setting up Workgroups (p. 160)
• IAM Policies for Accessing Workgroups (p. 161)
• Workgroup Example Policies (p. 162)
• Workgroup Settings (p. 166)
• Managing Workgroups (p. 167)
• Athena Workgroup APIs (p. 172)
• Troubleshooting Workgroups (p. 172)
Isolate users, teams, applications, or workloads into groups. Each workgroup has its own distinct
query history and a list of saved queries. For more information, see How Workgroups Work (p. 159).
For all queries in the workgroup, you can choose to configure workgroup settings. They include an
Amazon S3 location for storing query results, and encryption configuration. You can also enforce
workgroup settings. For more information, see Workgroup Settings (p. 166).
Enforce cost constraints. You can set two types of cost constraints for queries in a workgroup: a
per-query data usage limit, and a per-workgroup data usage limit. For detailed steps, see Setting Data
Usage Control Limits (p. 176).
Track query-related metrics for all workgroup queries in CloudWatch. For each query that runs in a
workgroup, if you configure the workgroup to publish metrics, Athena publishes them to CloudWatch.
You can view query metrics (p. 175) for each of your workgroups within the Athena console. In
CloudWatch, you can create custom dashboards, and set thresholds and alarms on these metrics.
How Workgroups Work
• By default, each account has a primary workgroup and the default permissions allow all authenticated
users access to this workgroup. The primary workgroup cannot be deleted.
• Each workgroup that you create shows saved queries and query history only for queries that ran in it,
and not for all queries in the account. This separates your queries from other queries within an account
and makes it more efficient for you to locate your own saved queries and queries in history.
• Disabling a workgroup prevents queries from running in it; queries sent to a disabled workgroup fail
until you enable the workgroup again.
• If you have permissions, you can delete an empty workgroup, and a workgroup that contains saved
queries. In this case, before deleting a workgroup, Athena warns you that saved queries are deleted.
Before deleting a workgroup to which other users have access, make sure its users have access to other
workgroups in which they can continue to run queries.
• You can set up workgroup-wide settings and enforce their usage by all queries that run in a workgroup.
The settings include query results location in Amazon S3 and encryption configuration.
Important
When you enforce workgroup-wide settings, all queries that run in this workgroup use
workgroup settings. This happens even if their client-side settings may differ from workgroup
settings. For information, see Workgroup Settings Override Client-Side Settings (p. 166).
Setting up Workgroups
Setting up workgroups involves creating them and establishing permissions for their usage. First, decide
which workgroups your organization needs, and create them. Next, set up IAM workgroup policies that
control user access and actions on a workgroup resource. Users with access to these workgroups can
now run queries in them.
Note
Use these tasks for setting up workgroups when you begin to use them for the first time. If
your Athena account already uses workgroups, each account's user requires permissions to run
queries in one or more workgroups in the account. Before you run queries, check your IAM policy
to see which workgroups you can access, adjust your policy if needed, and switch (p. 171) to a
workgroup you intend to use.
By default, if you have not created any workgroups, all queries in your account run in the primary
workgroup.
Workgroups display in the Athena console in the Workgroup:<workgroup_name> tab. The console lists
the workgroup that you have switched to. When you run queries, they run in this workgroup. You can run
queries in the workgroup in the console, or by using the API operations, the command line interface, or a
client application through the JDBC or ODBC driver. When you have access to a workgroup, you can view
the workgroup's settings, metrics, and data usage control limits. Additionally, you can have permissions
to edit the settings and data usage control limits.
To Set Up Workgroups
1. Decide which workgroups to create. For example, you can decide the following:
• Who can run queries in each workgroup, and who owns workgroup configuration. This
determines IAM policies you create. For more information, see IAM Policies for Accessing
Workgroups (p. 161).
• Which locations in Amazon S3 to use for the query results for queries that run in each workgroup.
A location must exist in Amazon S3 before you can specify it for the workgroup query results.
All users who use a workgroup must have access to this location. For more information, see
Workgroup Settings (p. 166).
• Which encryption settings are required, and which workgroups have queries that must be
encrypted. We recommend that you create separate workgroups for encrypted and non-encrypted
queries. That way, you can enforce encryption for a workgroup that applies to all queries that run
in it. For more information, see Encrypting Query Results Stored in Amazon S3 (p. 63).
2. Create workgroups as needed, and add tags to them. Open the Athena console, choose the
Workgroup:<workgroup_name> tab, and then choose Create workgroup. For detailed steps, see
Create a Workgroup (p. 168).
3. Create IAM policies for your users, groups, or roles to enable their access to workgroups. The
policies establish the workgroup membership and access to actions on a workgroup resource. For
detailed steps, see IAM Policies for Accessing Workgroups (p. 161). For example JSON policies, see
Workgroup Example Policies (p. 162).
4. Set workgroup settings. Specify a location in Amazon S3 for query results and encryption
settings, if needed. You can enforce workgroup settings. For more information, see workgroup
settings (p. 166).
Important
If you override client-side settings (p. 166), Athena will use the workgroup's settings. This
affects queries that you run in the console, by using the drivers, the command line interface,
or the API operations.
Queries continue to run, but automation built on the availability of results in a particular
Amazon S3 bucket may break. We recommend that you inform your users before overriding settings.
After workgroup settings are set to override, you can omit specifying client-side settings in
the drivers or the API.
5. Notify users which workgroups to use for running queries. Send an email to inform your account's
users about workgroup names that they can use, the required IAM policies, and the workgroup
settings.
6. Configure cost control limits, also known as data usage control limits, for queries and workgroups.
To notify you when a threshold is breached, create an Amazon SNS topic and configure
subscriptions. For detailed steps, see Setting Data Usage Control Limits (p. 176) and Creating an
Amazon SNS Topic in the Amazon Simple Notification Service Getting Started Guide.
7. Switch to the appropriate workgroup so that you can run queries. For detailed steps, see the section
called “Specify a Workgroup in Which to Run Queries” (p. 172).
IAM Policies for Accessing Workgroups
For IAM-specific information, see the links listed at the end of this section. For information about
example JSON workgroup policies, see Workgroup Example Policies (p. 162).
To use the visual editor in the IAM console to create a workgroup policy
1. Sign in to the AWS Management Console and open the IAM console at https://
console.aws.amazon.com/iam/.
2. In the navigation pane on the left, choose Policies, and then choose Create policy.
3. On the Visual editor tab, choose Choose a service. Then choose Athena to add to the policy.
4. Choose Select actions, and then choose the actions to add to the policy. The visual editor shows the
actions available in Athena. For more information, see Actions, Resources, and Condition Keys for
Amazon Athena in the IAM User Guide.
5. Choose add actions to type a specific action or use wildcards (*) to specify multiple actions.
By default, the policy that you are creating allows the actions that you choose. If you chose one or
more actions that support resource-level permissions to the workgroup resource in Athena, then
the editor lists the workgroup resource.
6. Choose Resources to specify the specific workgroups for your policy. For example JSON workgroup
policies, see Workgroup Example Policies (p. 162).
7. Specify the workgroup resource as follows:
arn:aws:athena:<region>:<user-account>:workgroup/<workgroup-name>
8. Choose Review policy, and then type a Name and a Description (optional) for the policy that you
are creating. Review the policy summary to make sure that you granted the intended permissions.
For more information, see the following topics in the IAM User Guide:
For example JSON workgroup policies, see Workgroup Example Policies (p. 162).
For a complete list of Amazon Athena actions, see the API action names in the Amazon Athena API
Reference.
Workgroup Example Policies
A workgroup is an IAM resource managed by Athena. Therefore, if your workgroup policy uses actions
that take workgroup as an input, you must specify the workgroup's ARN as follows:
"Resource": ["arn:aws:athena:<region>:<user-account>:workgroup/<workgroup-name>"]
Where <workgroup-name> is the name of your workgroup. For example, for a workgroup named
test_workgroup, specify it as a resource as follows:
"Resource": ["arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"]
For a complete list of Amazon Athena actions, see the API action names in the Amazon Athena API
Reference. For more information about IAM policies, see Creating Policies with the Visual Editor in the
IAM User Guide. For more information about creating IAM policies for workgroups, see IAM Policies for
Accessing Workgroups (p. 161).
The following policy allows full access to all workgroup resources that might exist in the account. We
recommend that you use this policy for those users in your account who must administer and manage
workgroups for all other users.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:*"
],
"Resource": [
"*"
]
}
]
}
The following policy allows full access to a single workgroup resource named workgroupA. You could
use this policy for users who have full control over a particular workgroup.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:ListWorkGroups",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetCatalogs",
"athena:GetNamespaces",
"athena:GetTables",
"athena:GetTable"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA"
]
},
{
"Effect": "Allow",
"Action": [
"athena:DeleteWorkGroup",
"athena:UpdateWorkGroup",
"athena:GetWorkGroup",
"athena:CreateWorkGroup"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA"
]
}
]
}
In the following policy, a user is allowed to run queries in the specified workgroup workgroupA and to
view them. The user is not allowed to perform management tasks on the workgroup itself, such as
updating or deleting it.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:ListWorkGroups",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetCatalogs",
"athena:GetNamespaces",
"athena:GetTables",
"athena:GetTable"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution",
"athena:GetWorkGroup"
],
"Resource": [
"arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA"
]
}
]
}
The following example shows a policy addition that allows a particular user to run queries in the
primary workgroup.
Note
We recommend that you add this policy to all users who are otherwise configured to run queries
in their designated workgroups. Adding this policy to their workgroup user policies is useful in
case their designated workgroup is deleted or is disabled. In this case, they can continue running
queries in the primary workgroup.
To allow users in your account to run queries in the primary workgroup, add the following resource to
the Resource section of the Example Policy for Running Queries in a Specified Workgroup (p. 164):
"arn:aws:athena:us-east-1:123456789012:workgroup/primary"
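After the addition, the Resource section of that policy might look like the following sketch (the account ID and workgroup name are illustrative):

```json
"Resource": [
    "arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA",
    "arn:aws:athena:us-east-1:123456789012:workgroup/primary"
]
```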
In the following policy, a user is allowed to create, delete, obtain details about, and update the
workgroup test_workgroup, and to list all workgroups in the account.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:CreateWorkGroup",
                "athena:GetWorkGroup",
                "athena:DeleteWorkGroup",
                "athena:UpdateWorkGroup"
            ],
            "Resource": [
                "arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "athena:ListWorkGroups"
            ],
            "Resource": "*"
        }
    ]
}
Example Policy for Running and Stopping Queries in a Specific Workgroup
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:StartQueryExecution",
                "athena:StopQueryExecution"
            ],
            "Resource": [
                "arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"
            ]
        }
    ]
}
Example Policy for Working with Named Queries in a Specific Workgroup
In the following policy, a user has permissions to create, delete, and obtain information about named
queries in the specified workgroup:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:CreateNamedQuery",
                "athena:GetNamedQuery",
                "athena:DeleteNamedQuery"
            ],
            "Resource": [
                "arn:aws:athena:us-east-1:123456789012:workgroup/test_workgroup"
            ]
        }
    ]
}
Workgroup Settings
Each workgroup has the following settings:
• A unique name. It can contain from 1 to 128 characters, including alphanumeric characters, dashes,
and underscores. After you create a workgroup, you cannot change its name. You can, however, create
a new workgroup with the same settings and a different name.
• Settings that apply to all queries running in the workgroup. They include:
• A location in Amazon S3 for storing query results for all queries that run in this workgroup. This
location must exist before you specify it for the workgroup when you create it.
• An encryption setting, if you use encryption for workgroup queries. Encryption applies to all
queries in a workgroup; you cannot encrypt only some of them. It is best to create separate workgroups
for encrypted and unencrypted queries.
In addition, you can override client-side settings (p. 166). Before the release of workgroups, you
could specify a results location and encryption options as parameters in the JDBC or ODBC driver, or on
the Properties tab in the Athena console. These settings could also be specified directly with the API
operations. These settings are known as "client-side settings". With workgroups, you can configure these
settings at the workgroup level and enforce control over them, which spares your users from setting
them individually. If you select Override Client-Side Settings, queries use the workgroup settings and
ignore the client-side settings.
If Override Client-Side Settings is selected, the user is notified on the console that their settings have
changed. If workgroup settings are enforced this way, users can omit corresponding client-side settings.
In this case, if you run queries in the console, the workgroup's settings are used for them even if any
queries have client-side settings. Also, if you run queries in this workgroup through the command
line interface, API operations, or the drivers, any settings that you specified are overwritten by the
workgroup's settings. This affects the query results location and encryption. To check which settings are
used for the workgroup, view the workgroup's details (p. 170).
You can also set query limits (p. 174) for queries in workgroups.
• If Override client-side settings is not selected, workgroup settings are not enforced. In this case, for
all queries that run in this workgroup, Athena uses the client-side settings for the query results location
and encryption. Each user can specify client-side settings in the Settings menu on the console. If
client-side settings are not specified, the workgroup-wide settings apply, but are not enforced. Also, if you
run queries in this workgroup through the API operations, the command line interface, or the JDBC
and ODBC drivers, and specify your query results location and encryption there, your queries continue
using those settings.
• If Override client-side settings is selected, Athena uses the workgroup-wide settings for query results
location and encryption. It also overrides any other settings that you specified for the query in the
console, by using the API operations, or with the drivers. This affects you only if you run queries in this
workgroup. If you do, workgroup settings are used.
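In the Athena API, selecting Override client-side settings corresponds to the EnforceWorkGroupConfiguration flag in WorkGroupConfiguration. The following is a minimal sketch of a configuration that enforces a results location and SSE-S3 encryption (the bucket name is illustrative):

```json
"WorkGroupConfiguration": {
    "EnforceWorkGroupConfiguration": true,
    "ResultConfiguration": {
        "OutputLocation": "s3://my-athena-results/",
        "EncryptionConfiguration": {
            "EncryptionOption": "SSE_S3"
        }
    }
}
```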
If you override client-side settings, then the next time that you or any workgroup user opens the Athena
console, a notification dialog box displays. It notifies you that queries in this workgroup use the
workgroup's settings, and prompts you to acknowledge this change.
Important
If you run queries through the API operations, the command line interface, or the JDBC and
ODBC drivers, and have not updated your settings to match those of the workgroup, your
queries run, but use the workgroup's settings. For consistency, we recommend that you omit
client-side settings in this case or update your query settings to match the workgroup's
settings for the results location and encryption. To check which settings are used for the
workgroup, view the workgroup's details (p. 170).
Managing Workgroups
In the Athena console (https://console.aws.amazon.com/athena/), you can perform the following tasks:

• Edit a Workgroup (p. 169): Edit a workgroup and change its settings. You cannot change a
workgroup's name, but you can create a new workgroup with the same settings and a different name.
• View the Workgroup's Details (p. 170): View the workgroup's details, such as its name, description,
data usage limits, location of query results, and encryption. You can also verify whether this workgroup
enforces its settings, if Override client-side settings is checked.
• Enable and Disable a Workgroup (p. 171): Enable or disable a workgroup. When a workgroup is
disabled, its users cannot run queries or create new named queries. If you have access to it, you can still
view metrics, data usage limit controls, workgroup settings, query history, and saved queries.
• Specify a Workgroup in Which to Run Queries (p. 172): Before you can run queries, you must specify
to Athena which workgroup to use. You must have permissions to the workgroup.
Create a Workgroup
Creating a workgroup requires permissions for the CreateWorkGroup API action. See Access to Athena
Workgroups (p. 59) and IAM Policies for Accessing Workgroups (p. 161). If you are adding tags, you also
need permissions for the TagResource action. See the section called “Tag Policy Examples” (p. 184).
The Create workgroup dialog box contains the following fields:

• Workgroup name: Required. Enter a unique name for your workgroup. Use 1 to 128 characters
(A-Z, a-z, 0-9, _, -, .). This name cannot be changed.
• Query result location: Optional. Enter a path to an Amazon S3 bucket or prefix. This bucket
and prefix must exist before you specify them.
Note
If you run queries in the console, specifying the query results location is optional. If you
don't specify it for the workgroup or in Settings, Athena uses the default query result
location. If you run queries with the API or the drivers, you must specify the query results
location in at least one of the two places: for individual queries with OutputLocation, or
for the workgroup, with WorkGroupConfiguration.
• Encrypt query results: Optional. Encrypt results stored in Amazon S3. If selected, all queries in
the workgroup are encrypted. You can then select the Encryption type and the Encryption key,
and enter the KMS Key ARN. If you don't have the key, open the AWS KMS console to create it. For
more information, see Creating Keys in the AWS Key Management Service Developer Guide.
• Publish to CloudWatch: Selected by default. Publish query metrics to Amazon CloudWatch. See
Viewing Query Metrics (p. 175).
• Override client-side settings: Unselected by default. If you select it, workgroup settings apply
to all queries in the workgroup and override client-side settings. For more information, see
Workgroup Settings Override Client-Side Settings (p. 166).
• Tags: Optional. Add one or more tags to the workgroup. A tag is a label that you assign to an
Athena workgroup resource. It consists of a key and a value. Use best practices for AWS tagging
strategies to create a consistent set of tags and categorize workgroups by purpose, owner, or
environment. You can also use tags in IAM policies and to control billing costs. Do not use
duplicate tag keys in the same workgroup. For more information, see Tagging
Workgroups (p. 180).
4. Choose Create workgroup. The workgroup appears in the list in the Workgroups panel.
Edit a Workgroup
Editing a workgroup requires permissions for the UpdateWorkGroup API operation. See Access to
Athena Workgroups (p. 59) and IAM Policies for Accessing Workgroups (p. 161). If you are adding or
editing tags, you also need permissions for the TagResource action. See the section called “Tag Policy
Examples” (p. 184).
2. In the Workgroups panel, choose the workgroup that you want to edit. The View details panel for
the workgroup displays, with the Overview tab selected.
3. Choose Edit workgroup.
4. Change the fields as needed. For the list of fields, see Create workgroup (p. 168). You can change
all fields except for the workgroup's name. If you need to change the name, create another
workgroup with the new name and the same settings.
5. Choose Save. The updated workgroup appears in the list in the Workgroups panel.
Delete a Workgroup
You can delete a workgroup if you have permissions to do so. The primary workgroup cannot be deleted.
If you have permissions, you can delete an empty workgroup at any time. You can also delete a
workgroup that contains saved queries. In this case, before proceeding to delete a workgroup, Athena
warns you that saved queries are deleted.
If you delete a workgroup while you are in it, the console switches focus to the primary workgroup. If you
have access to it, you can run queries and view its settings.
If you delete a workgroup, its settings and per-query data limit controls are deleted. The workgroup-wide
data limit controls remain in CloudWatch, and you can delete them there if needed.
Important
Before deleting a workgroup, ensure that its users also belong to other workgroups where
they can continue to run queries. If the users' IAM policies allowed them to run queries only in
this workgroup, and you delete it, they no longer have permissions to run queries. For more
information, see Example Policy for Running Queries in the Primary Workgroup (p. 164).
To delete a workgroup with the API operation, use the DeleteWorkGroup action.
You can open up to ten query tabs within each workgroup. When you switch between workgroups, your
query tabs remain open for up to three workgroups.
3. Choose Switch. The console shows the Workgroup: <workgroup_name> tab with the name of the
workgroup that you switched to. You can now run queries in this workgroup.
Athena Workgroup APIs
1. Make sure that your permissions allow you to run queries in the workgroup that you intend to use. For
more information, see the section called “IAM Policies for Accessing Workgroups” (p. 161).
2. To specify the workgroup to Athena, use one of these options:
• If you are accessing Athena via the console, set the workgroup by switching workgroups (p. 171).
• If you are using the Athena API operations, specify the workgroup name in the API action. For
example, you can set the workgroup name in StartQueryExecution.
• If you are using the JDBC or ODBC driver, set the workgroup name in the connection string using
the Workgroup configuration parameter. The driver passes the workgroup name to Athena.
Specify the workgroup parameter in the connection string as in the following example:
jdbc:awsathena://AwsRegion=<AWSREGION>;UID=<ACCESSKEY>;
PWD=<SECRETKEY>;S3OutputLocation=s3://<athena-output>-<AWSREGION>/;
Workgroup=<WORKGROUPNAME>;
For more information, search for "Workgroup" in the driver documentation link included in JDBC
Driver Documentation (p. 44).
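As an illustration of the API option above, a StartQueryExecution request payload that targets a specific workgroup might look like the following sketch (the query, database, and workgroup name are examples):

```json
{
    "QueryString": "SELECT * FROM my_table LIMIT 10;",
    "QueryExecutionContext": { "Database": "mydatabase" },
    "WorkGroup": "test_workgroup"
}
```

If WorkGroup is omitted, Athena runs the query in the primary workgroup.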
Athena supports the following workgroup API operations:
• CreateWorkGroup
• DeleteWorkGroup
• GetWorkGroup
• ListWorkGroups
• UpdateWorkGroup
Troubleshooting Workgroups
Use the following tips to troubleshoot workgroups.
• Check permissions for individual users in your account. They must have access to the location for query
results, and to the workgroup in which they want to run queries. To switch workgroups, users need
permissions to both workgroups. For more information, see IAM Policies for Accessing
Workgroups (p. 161).
• Pay attention to the context in the Athena console, to see in which workgroup you are going to run
queries. If you use the driver, make sure to set the workgroup to the one you need. For information,
see the section called “Specify a Workgroup in Which to Run Queries” (p. 172).
• If you use the API or the drivers to run queries, you must specify the query results location in one
of two ways: for individual queries, using OutputLocation (client-side), or in the workgroup, using
WorkGroupConfiguration. If the location is not specified in either way, Athena issues an error at query
execution. If you use the Athena console and don't specify the query results location using one of these
methods, Athena uses the default location (p. 84).
• If you override client-side settings with workgroup settings, you may encounter errors with query
result location. For example, a workgroup's user may not have permissions to the workgroup's location
in Amazon S3 for storing query results. In this case, add the necessary permissions.
• Workgroups introduce changes in the behavior of the API operations. Calls to the following existing
API operations require that users in your account have resource-based permissions in IAM to the
workgroups in which they make them. If no permissions to the workgroup and to workgroup
actions exist, the following API actions throw AccessDeniedException: CreateNamedQuery,
DeleteNamedQuery, GetNamedQuery, ListNamedQueries, StartQueryExecution,
StopQueryExecution, ListQueryExecutions, GetQueryExecution, GetQueryResults, and
GetQueryResultsStream (this API action is only available for use with the driver and is not exposed
otherwise for public use). For more information, see Actions, Resources, and Condition Keys for
Amazon Athena in the IAM User Guide.
You may see the following errors related to workgroups. The list below describes each error and
suggests a solution.

Workgroup errors

• "query state CANCELED. Bytes scanned limit was exceeded." A query hit a per-query data limit
and was canceled. Consider rewriting the query so that it reads less data, or contact your account
administrator.
• "INVALID_INPUT. WorkGroup <name> is not found." A user ran a query in a workgroup, but the
workgroup does not exist. This could happen if the workgroup was deleted. Switch to another
workgroup to run your query.
• "InvalidRequestException: when calling the StartQueryExecution operation: No output location
provided. An output location is required either through the Workgroup result configuration setting
or as an API input." A user ran a query with the API without specifying the location for query results.
Set the output location for query results in one of two ways: for individual queries, using
OutputLocation (client-side), or in the workgroup, using WorkGroupConfiguration.
• "The Create Table As Select query failed because it was submitted with an 'external_location'
property to an Athena Workgroup that enforces a centralized output location for all queries. Please
remove the 'external_location' property and resubmit the query." The workgroup in which the query
ran is configured with an enforced query results location (p. 166), and the CTAS query specified an
external_location. Remove the external_location property and rerun the query.

Controlling Costs and Monitoring Queries with CloudWatch Metrics
In workgroups, you can do the following:
• Configure Data usage controls per query and per workgroup, and establish actions to take if
queries breach the thresholds.
• View and analyze query metrics, and publish them to CloudWatch. If you create a workgroup in the
console, the setting for publishing the metrics to CloudWatch is selected for you. If you use the API
operations, you must enable publishing the metrics (p. 174). When metrics are published, they are
displayed under the Metrics tab in the Workgroups panel. Metrics are disabled by default for the
primary workgroup.
Topics
• Enabling CloudWatch Query Metrics (p. 174)
• Monitoring Athena Queries with CloudWatch Metrics (p. 175)
• Setting Data Usage Control Limits (p. 176)
If you use API operations, the command line interface, or the client application with the JDBC driver to
create workgroups, to enable publishing of query metrics, set PublishCloudWatchMetricsEnabled
to true in WorkGroupConfiguration. The following example shows only the metrics configuration and
omits other configuration:
"WorkGroupConfiguration": {
    "PublishCloudWatchMetricsEnabled": true,
    ...
}
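For example, a complete CreateWorkGroup request that enables metrics publishing might look like the following sketch (the workgroup name and output location are illustrative):

```json
{
    "Name": "analytics_workgroup",
    "Configuration": {
        "PublishCloudWatchMetricsEnabled": true,
        "ResultConfiguration": {
            "OutputLocation": "s3://my-athena-results/"
        }
    }
}
```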
When you enable query metrics for queries in workgroups, the metrics are displayed within the Metrics
tab in the Workgroups panel, for each workgroup in the Athena console.
To view a workgroup's metrics, you don't need to switch to it and can remain in another workgroup.
You do need to select the workgroup from the list. You also must have permissions to view its
metrics.
3. Select the workgroup from the list, and then choose View details. If you have permissions, the
workgroup's details display in the Overview tab.
4. Choose the Metrics tab.
The Metrics tab displays the following metrics:

• Total amount of data scanned per query: The amount of data, in megabytes, that Athena scanned
per query.
• Total query execution time: The amount of time, in seconds, that it takes Athena to run the query.

Setting Data Usage Control Limits
• The per-query control limit specifies the total amount of data scanned per query. If any query that
runs in the workgroup exceeds the limit, it is canceled. You can create only one per-query control limit
in a workgroup and it applies to each query that runs in it. Edit the limit if you need to change it. For
detailed steps, see To create a per-query data usage control (p. 177).
• The workgroup-wide data usage control limit specifies the total amount of data scanned for all
queries that run in this workgroup during the specified time period. You can create multiple limits per
workgroup. The workgroup-wide query limit allows you to set multiple thresholds on hourly or daily
aggregates on data scanned by queries running in the workgroup.
If the aggregate amount of data scanned exceeds the threshold, you can choose to take one of the
following actions:
• Configure an Amazon SNS alarm and an action in the Athena console to notify an administrator
when the limit is breached. For detailed steps, see To create a per-workgroup data usage
control (p. 178). You can also create an alarm and an action on any metric that Athena publishes
from the CloudWatch console. For example, you can set an alert on a number of failed queries. This
alert can trigger an email to an administrator if the number crosses a certain threshold. If the limit is
exceeded, an action sends an Amazon SNS alarm notification to the specified users.
• Invoke a Lambda function. For more information, see Invoking Lambda functions using Amazon SNS
notifications in the Amazon Simple Notification Service Developer Guide.
• Disable the workgroup, stopping any further queries from running.
The per-query and per-workgroup limits are independent of each other. A specified action is taken
whenever either limit is exceeded. If two or more users run queries at the same time in the same
workgroup, it is possible that each query does not exceed any of the specified limits, but the total sum of
data scanned exceeds the data usage limit per workgroup. In this case, an Amazon SNS alarm is sent to
the user.
The per-query control limit specifies the total amount of data scanned per query. If any query that runs
in the workgroup exceeds the limit, it is canceled. Canceled queries are charged according to Amazon
Athena pricing.
Note
In the case of canceled or failed queries, Athena may have already written partial results to
Amazon S3. In such cases, Athena does not delete partial results from the Amazon S3 prefix
where results are stored. You must remove the Amazon S3 prefix with partial results. Athena
uses Amazon S3 multipart uploads to write data to Amazon S3. We recommend that you set
the bucket lifecycle policy to abort multipart uploads when queries fail. For more
information, see Aborting Incomplete Multipart Uploads Using a Bucket Lifecycle Policy in the
Amazon Simple Storage Service Developer Guide.
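As a sketch, a bucket lifecycle configuration that aborts incomplete multipart uploads might look like the following (the rule ID and the seven-day window are illustrative choices):

```json
{
    "Rules": [
        {
            "ID": "abort-incomplete-athena-uploads",
            "Status": "Enabled",
            "Filter": { "Prefix": "" },
            "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
        }
    ]
}
```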
You can create only one per-query control limit in a workgroup and it applies to each query that runs in
it. Edit the limit if you need to change it.
To create a data usage control for a query in a particular workgroup, you don't need to switch to it
and can remain in another workgroup. You do need to select the workgroup from the list and have
permissions to edit the workgroup.
3. Select the workgroup from the list, and then choose View details. If you have permissions, the
workgroup's details display in the Overview tab.
4. Choose the Data usage controls tab. The Per Query Data Usage Control dialog displays.
• For Data limit, specify a value between 10000 KB (minimum) and 7 EB (maximum).
Note
These are limits imposed by the console for data usage controls within workgroups. They
do not represent any query limits in Athena.
• For units, select the unit value from the drop-down list.
• Review the default Action. The default Action is to cancel the query if it exceeds the limit. This
action cannot be changed.
6. Choose Create if you are creating a new limit, or Update if you are editing an existing limit. If you
are editing an existing limit, refresh the Overview tab to see the updated limit.
To create a data usage control for a particular workgroup, you don't need to switch to it and
can remain in another workgroup. You do need to select the workgroup from the list and have
permissions to edit the workgroup.
3. Select the workgroup from the list, and then choose View details. If you have edit permissions, the
workgroup's details display in the Overview tab.
4. Choose the Data usage controls tab, and scroll down. Then choose Workgroup Data Usage Control
to create a new limit or edit an existing limit. The Create workgroup data usage control dialog
displays.
• For Data limits, specify a value between 10000 KB (minimum) and 7 EB (maximum).
Note
These are limits imposed by the console for data usage controls within workgroups. They
do not represent any query limits in Athena.
• For units, select the unit value from the drop-down list.
• For time period, choose a time period from the drop-down list.
• For Action, choose the Amazon SNS topic from the drop-down list, if you have it configured.
Or, choose Create an Amazon SNS topic to go directly to the Amazon SNS console, create the
Amazon SNS topic, and set up a subscription for it for one of the users in your Athena account. For
more information, see Creating an Amazon SNS Topic in the Amazon Simple Notification Service
Getting Started Guide.
6. Choose Create if you are creating a new limit, or Save if you are editing an existing limit. If you are
editing an existing limit, refresh the Overview tab for the workgroup to see the updated limit.
Tagging Workgroups
A tag consists of a key and a value, both of which you define. When you tag a workgroup, you assign
custom metadata to it. You can use tags to categorize your AWS resources in different ways; for example,
by purpose, owner, or environment. For Athena, the workgroup is the resource that you can tag. For
example, you can create a set of tags for workgroups in your account that helps you track workgroup
owners, or identify workgroups by their purpose. We recommend that you use AWS tagging best
practices to create a consistent set of tags that meets your organization's requirements.
You can work with tags using the Athena console or the API operations.
Topics
• Tag Basics (p. 180)
• Tag Restrictions (p. 181)
• Working with Tags Using the Console (p. 181)
• Working with Tags Using the API Actions (p. 183)
• Tag-Based IAM Access Control Policies (p. 184)
Tag Basics
A tag is a label that you assign to an Athena resource. Each tag consists of a key and an optional value,
both of which you define.
Tags enable you to categorize your AWS resources in different ways. For example, you can define a set of
tags for your account's workgroups that helps you track each workgroup owner or purpose.
You can add tags when creating a new Athena workgroup, or you can add, edit, or remove tags from
an existing workgroup. You can edit a tag in the console. To edit a tag using the API operations,
remove the old tag and add a new one. If you delete a workgroup, its tags are also deleted. Other
workgroups in your account continue using the same tags.
Athena does not automatically assign tags to your resources, such as your workgroups. You can edit tag
keys and values, and you can remove tags from a workgroup at any time. You can set the value of a tag
to an empty string, but you can't set the value of a tag to null. Do not add duplicate tag keys at the same
time to the same workgroup. If you do, Athena issues an error message. If you tag a workgroup using an
existing tag key in a separate TagResource action, the new tag value overwrites the old value.
In IAM, you can control which users in your AWS account have permission to create, edit, remove, or list
tags. For more information, see the section called “Tag Policy Examples” (p. 184).
For a complete list of Amazon Athena tag actions, see the API action names in the Amazon Athena API
Reference.
You can use the same tags for billing. For more information, see Using Tags for Billing in the AWS Billing
and Cost Management User Guide.
Tag Restrictions
Tags have the following restrictions:
• Tagging operations, such as adding, editing, removing, or listing tags, require that you specify an ARN
for the workgroup resource.
• Athena allows you to use letters, numbers, spaces represented in UTF-8, and the following characters:
+ - = . _ : / @.
• Tag keys and values are case-sensitive.
• Don't use the "aws:" prefix in tag keys; it's reserved for AWS use. You can't edit or delete tag keys
with this prefix. Tags with this prefix do not count against your per-resource tags limit.
• The tags you assign are available only to your AWS account.
Topics
• Displaying Tags for Individual Workgroups (p. 181)
• Adding and Deleting Tags on an Individual Workgroup (p. 181)
Displaying Tags for Individual Workgroups
To view a list of tags for a workgroup, select the workgroup, choose View Details, and then choose
the Tags tab. The list of tags for the workgroup displays. You can also view tags on a workgroup if you
choose Edit Workgroup.
To search for tags, choose the Tags tab, and then choose Manage Tags. Then, enter a tag name into the
search tool.
Adding and Deleting Tags on an Individual Workgroup
1. Open the Athena console, and then choose the Workgroups tab.
2. In the workgroup list, select the workgroup, choose View details, and then choose the Tags tab.
3. On the Tags tab, choose Manage tags.
4. In the list of tags, select the delete button (a cross) for the tag, and choose Save.

Working with Tags Using the API Actions
Example TagResource
In the following example, we add two tags to workgroupA:
client.tagResource(request);
Note
Do not add duplicate tag keys at the same time to the same workgroup. If you do, Athena issues
an error message. If you tag a workgroup using an existing tag key in a separate TagResource
action, the new tag value overwrites the old value.
Example UntagResource
In the following example, we remove tagKey2 from workgroupA:
client.untagResource(request);
Example ListTagsForResource
In the following example, we list tags for workgroupA:
Tag-Based IAM Access Control Policies
The following IAM policy allows you to run queries and interact with tags for the workgroup named
workgroupA:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"athena:ListWorkGroups",
"athena:GetExecutionEngine",
"athena:GetExecutionEngines",
"athena:GetNamespace",
"athena:GetCatalogs",
"athena:GetNamespaces",
"athena:GetTables",
"athena:GetTable"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:GetQueryExecutions",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution",
"athena:GetWorkGroup",
"athena:TagResource",
"athena:UntagResource",
"athena:ListTagsForResource"
],
"Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/workgroupA"
}
]
}
Example 2: Policy Block that Denies Actions on a Workgroup Based on a Tag Key and Tag
Value Pair
Tags that are associated with an existing workgroup are referred to as resource tags. Resource tags let
you write policy blocks, such as the following, which deny the listed actions on any workgroup tagged
with a particular tag key and value pair, such as the key stack and the value production.
{
"Effect": "Deny",
"Action": [
"athena:StartQueryExecution",
"athena:GetQueryResults",
"athena:DeleteNamedQuery",
"athena:UpdateWorkGroup",
"athena:GetNamedQuery",
"athena:ListQueryExecutions",
"athena:GetWorkGroup",
"athena:StopQueryExecution",
"athena:GetQueryResultsStream",
"athena:GetQueryExecutions",
"athena:ListNamedQueries",
"athena:CreateNamedQuery",
"athena:GetQueryExecution",
"athena:BatchGetNamedQuery",
"athena:BatchGetQueryExecution",
"athena:TagResource",
"athena:UntagResource",
"athena:ListTagsForResource"
],
"Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/*",
"Condition": {
"StringEquals": {
"aws:ResourceTag/stack": "production"
}
}
}
Example 3. Policy Block that Restricts Tag-Changing Action Requests to Specified Tags
Tags passed in as parameters to a tag-mutating API action, such as CreateWorkGroup with tags,
TagResource, and UntagResource, are referred to as request tags. Use these tags, as shown in the
following example policy block. This allows CreateWorkGroup only if one of the tags included when you
create a workgroup is a tag with the costcenter key with one of the allowed tag values: 1, 2, or 3.
Note: If you want to allow users to pass in tags upon CreateWorkGroup, make sure that you give them
IAM permissions to both the TagResource and CreateWorkGroup API operations.
{
"Effect": "Allow",
"Action": [
"athena:CreateWorkGroup",
"athena:TagResource"
],
"Resource": "arn:aws:athena:us-east-1:123456789012:workgroup/*",
"Condition": {
"StringEquals": {
"aws:RequestTag/costcenter": [
"1",
"2",
"3"
]
}
}
}
Logging Amazon Athena API Calls with AWS CloudTrail
Topics
• Logging Amazon Athena API Calls with AWS CloudTrail (p. 187)
• Troubleshooting (p. 190)
CloudTrail captures all API calls for Athena as events. The calls captured include calls from the Athena
console and code calls to the Athena API operations. If you create a trail, you can enable continuous
delivery of CloudTrail events to an Amazon S3 bucket, including events for Athena. If you don't configure
a trail, you can still view the most recent events in the CloudTrail console in Event history.
Using the information collected by CloudTrail, you can determine the request that was made to Athena,
the IP address from which the request was made, who made the request, when it was made, and
additional details.
To learn more about CloudTrail, see the AWS CloudTrail User Guide.
You can also use Athena to query CloudTrail log files for insight. For more information, see Querying
AWS CloudTrail Logs (p. 135) and CloudTrail SerDe (p. 196).
For an ongoing record of events in your AWS account, including events for Athena, create a trail. A trail
enables CloudTrail to deliver log files to an Amazon S3 bucket. By default, when you create a trail in the
console, the trail applies to all AWS Regions. The trail logs events from all Regions in the AWS partition
and delivers the log files to the Amazon S3 bucket that you specify. Additionally, you can configure
other AWS services to further analyze and act upon the event data collected in CloudTrail logs. For more
information, see the AWS CloudTrail User Guide.
Understanding Athena Log File Entries
All Athena actions are logged by CloudTrail and are documented in the Amazon Athena API Reference.
For example, calls to the StartQueryExecution and GetQueryResults actions generate entries in the
CloudTrail log files.
Every event or log entry contains information about who generated the request. The identity
information helps you determine the following:
• Whether the request was made with root or AWS Identity and Access Management (IAM) user
credentials.
• Whether the request was made with temporary security credentials for a role or federated user.
• Whether the request was made by another AWS service.
StartQueryExecution (Successful)
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-04T00:23:55Z",
"eventSource":"athena.amazonaws.com",
"eventName":"StartQueryExecution",
"awsRegion":"us-east-1",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-internal/3",
"requestParameters":{
"clientRequestToken":"16bc6e70-f972-4260-b18a-db1b623cb35c",
"resultConfiguration":{
"outputLocation":"s3://athena-johndoe-test/test/"
},
"query":"Select 10"
},
"responseElements":{
"queryExecutionId":"b621c254-74e0-48e3-9630-78ed857782f9"
},
"requestID":"f5039b01-305f-11e7-b146-c3fc56a7dc7a",
"eventID":"c97cf8c8-6112-467a-8777-53bb38f83fd5",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
StartQueryExecution (Failed)
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-04T00:21:57Z",
"eventSource":"athena.amazonaws.com",
"eventName":"StartQueryExecution",
"awsRegion":"us-east-1",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-internal/3",
"errorCode":"InvalidRequestException",
"errorMessage":"Invalid result configuration. Should specify either output location or
result configuration",
"requestParameters":{
"clientRequestToken":"ca0e965f-d6d8-4277-8257-814a57f57446",
"query":"Select 10"
},
"responseElements":null,
"requestID":"aefbc057-305f-11e7-9f39-bbc56d5d161e",
"eventID":"6e1fc69b-d076-477e-8dec-024ee51488c4",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
CreateNamedQuery
{
"eventVersion":"1.05",
"userIdentity":{
"type":"IAMUser",
"principalId":"EXAMPLE_PRINCIPAL_ID",
"arn":"arn:aws:iam::123456789012:user/johndoe",
"accountId":"123456789012",
"accessKeyId":"EXAMPLE_KEY_ID",
"userName":"johndoe"
},
"eventTime":"2017-05-16T22:00:58Z",
"eventSource":"athena.amazonaws.com",
"eventName":"CreateNamedQuery",
"awsRegion":"us-west-2",
"sourceIPAddress":"77.88.999.69",
"userAgent":"aws-cli/1.11.85 Python/2.7.10 Darwin/16.6.0 botocore/1.5.48",
"requestParameters":{
"name":"johndoetest",
"queryString":"select 10",
"database":"default",
"clientRequestToken":"fc1ad880-69ee-4df0-bb0f-1770d9a539b1"
},
"responseElements":{
"namedQueryId":"cdd0fe29-4787-4263-9188-a9c8db29f2d6"
},
"requestID":"2487dd96-3a83-11e7-8f67-c9de5ac76512",
"eventID":"15e3d3b5-6c3b-4c7c-bc0b-36a8dd95227b",
"eventType":"AwsApiCall",
"recipientAccountId":"123456789012"
}
Troubleshooting
Use these documentation topics to troubleshoot problems with Amazon Athena.
SerDe Reference
Athena supports several SerDe libraries for parsing data from different data formats, such as CSV, JSON,
Parquet, and ORC. Athena does not support custom SerDes.
Topics
• Using a SerDe (p. 191)
• Supported SerDes and Data Formats (p. 192)
• Compression Formats (p. 216)
Using a SerDe
A SerDe (Serializer/Deserializer) is a way in which Athena interacts with data in various formats.
It is the SerDe you specify, and not the DDL, that defines the table schema. In other words, the SerDe can
override the DDL configuration that you specify in Athena when you create your table.
• Use DDL statements that describe how to read and write data to the table, and do not specify a ROW
FORMAT, as in this example. This omits listing the actual SerDe type, and the native LazySimpleSerDe is
used by default.
In general, Athena uses the LazySimpleSerDe if you do not specify a ROW FORMAT, or if you specify
ROW FORMAT DELIMITED.
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
ESCAPED BY '\\'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
• Explicitly specify the type of SerDe Athena should use when it reads and writes data to the table. Also,
specify additional properties in SERDEPROPERTIES, as in this example.
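The explicit form looks like the following sketch. The table name, columns, and location are illustrative, not taken from this guide; the point is the ROW FORMAT SERDE and SERDEPROPERTIES clauses:

```sql
CREATE EXTERNAL TABLE mytable (
  col1 STRING,
  col2 INT
)
-- Name the SerDe class explicitly instead of relying on the default
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://mybucket/mytable/';
```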
Supported SerDes and Data Formats
To create tables and query data in these formats in Athena, specify a serializer-deserializer class (SerDe)
so that Athena knows which format is used and how to parse the data.
A SerDe is a custom library that tells the data catalog used by Athena how to handle the data. You
specify a SerDe type by listing it explicitly in the ROW FORMAT part of your CREATE TABLE statement in
Athena. In some cases, you can omit the SerDe name because Athena uses some SerDe types by default
for certain types of data formats.
The following table lists the data formats supported in Athena and their corresponding SerDe libraries.
CSV (Comma-Separated Values)
For data in CSV, each line represents a data record, and each record consists of one or more fields,
separated by commas.
• Use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files (p. 206) if your data does not
include values enclosed in quotes.
• Use the OpenCSVSerDe for Processing CSV (p. 198) when your data includes quotes in values, or
different separator or escape characters.
TSV (Tab-Separated Values)
For data in TSV, each line represents a data record, and each record consists of one or more fields,
separated by tabs. Use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files (p. 206) and
specify the separator character as FIELDS TERMINATED BY '\t'.
Custom-Delimited
For data in this format, each line represents a data record, and records are separated by custom
delimiters. Use the LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files (p. 206) and specify
custom delimiters.
JSON (JavaScript Object Notation)
For JSON data, each line represents a data record, and each record consists of attribute–value pairs and
arrays, separated by commas.
• Use the Hive JSON SerDe (p. 203).
• Use the OpenX JSON SerDe (p. 204).
Apache Avro
A format for storing data in Hadoop that uses JSON-based schemas for record values. Use the Avro
SerDe (p. 193).
ORC (Optimized Row Columnar)
A format for optimized columnar storage of Hive data. Use the ORC SerDe (p. 211) and ZLIB
compression.
Apache Parquet
A format for columnar storage of data in Hadoop. Use the Parquet SerDe (p. 214) and SNAPPY
compression.
Logstash logs
A format for storing logs in Logstash. Use the Grok SerDe (p. 200).
Apache WebServer logs
A format for storing logs in Apache WebServer. Use the RegexSerDe for Processing Apache Web Server
Logs (p. 195).
Topics
• Avro SerDe (p. 193)
• RegexSerDe for Processing Apache Web Server Logs (p. 195)
• CloudTrail SerDe (p. 196)
• OpenCSVSerDe for Processing CSV (p. 198)
• Grok SerDe (p. 200)
• JSON SerDe Libraries (p. 203)
• LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files (p. 206)
• ORC SerDe (p. 211)
• Parquet SerDe (p. 214)
Avro SerDe
SerDe Name
Avro SerDe
Library Name
org.apache.hadoop.hive.serde2.avro.AvroSerDe
Examples
Athena does not support using avro.schema.url to specify table schema for security reasons. Use
avro.schema.literal. To extract schema from data in the Avro format, use the Apache
avro-tools-<version>.jar with the getschema parameter. This returns a schema that you can use in your
WITH SERDEPROPERTIES statement. For example: java -jar avro-tools-<version>.jar getschema
file.avro, where file.avro is your Avro data file.
The avro-tools-<version>.jar file is located in the java subdirectory of your installed Avro
release. To download Avro, see Apache Avro Releases. To download Apache Avro Tools directly, see the
Apache Avro Tools Maven Repository.
After you obtain the schema, use a CREATE TABLE statement to create an Athena table based on
underlying Avro data stored in Amazon S3. In ROW FORMAT, specify the Avro SerDe as follows: ROW
FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'. In SERDEPROPERTIES,
specify the schema, as shown in the following example.
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
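A minimal sketch of such a statement follows. The table name, columns, and schema literal are illustrative; in practice, the avro.schema.literal value is the schema returned by avro-tools getschema for your data:

```sql
CREATE EXTERNAL TABLE my_avro_table (
  station STRING,
  temp INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  -- The schema literal must match the columns declared above
  'avro.schema.literal' = '{
    "type": "record",
    "name": "my_avro_table",
    "fields": [
      {"name": "station", "type": "string"},
      {"name": "temp", "type": "int"}
    ]}'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://athena-examples-myregion/path/to/data/';
```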
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata.
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
RegexSerDe for Processing Apache Web Server Logs
SerDe Name
RegexSerDe
Library Name
org.apache.hadoop.hive.serde2.RegexSerDe
Examples
The following example creates a table from CloudFront logs using the RegExSerDe from the Getting
Started tutorial.
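A sketch of such a statement follows. The column set and regular expression are simplified and illustrative; each capture group in input.regex maps, in order, to a column in the table:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  log_date STRING,
  log_time STRING,
  location STRING,
  bytes INT,
  request_ip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- Skips comment lines starting with #; one capture group per column
  "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+).*$"
)
LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';
```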
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
CloudTrail SerDe
AWS CloudTrail is a service that records AWS API calls and events for AWS accounts. CloudTrail generates
encrypted logs and stores them in Amazon S3. You can use Athena to query these logs directly from
Amazon S3, specifying the LOCATION of logs.
To query CloudTrail logs in Athena, create a table from the logs and use the CloudTrail SerDe to
deserialize the log data.
In addition to using the CloudTrail SerDe, instances exist where you need to use a different SerDe or to
extract data from JSON. Certain fields in CloudTrail logs are STRING values that may have a variable
data format, which depends on the service. As a result, the CloudTrail SerDe is unable to predictably
deserialize them. To query the following fields, identify the data pattern and then use a different
SerDe, such as the OpenX JSON SerDe (p. 204). Alternatively, to get data out of these fields, use
JSON_EXTRACT functions. For more information, see Extracting Data From JSON (p. 117).
• requestParameters
• responseElements
• additionalEventData
• serviceEventDetails
SerDe Name
CloudTrail SerDe
Library Name
com.amazon.emr.hive.serde.CloudTrailSerde
Examples
The following example uses the CloudTrail SerDe on a fictional set of logs to create a table based on
them.
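A sketch of such a statement follows, with an abbreviated field list (the useridentity struct in CloudTrail logs contains more fields than shown, and the bucket path is illustrative). Note that requestparameters, responseelements, and additionaleventdata are declared as STRING:

```sql
CREATE EXTERNAL TABLE cloudtrail_logs (
  eventversion STRING,
  useridentity STRUCT<
    type: STRING,
    principalid: STRING,
    arn: STRING,
    accountid: STRING,
    username: STRING>,
  eventtime STRING,
  eventsource STRING,
  eventname STRING,
  awsregion STRING,
  sourceipaddress STRING,
  useragent STRING,
  errorcode STRING,
  errormessage STRING,
  -- Variable-format fields are kept as STRING and parsed later
  requestparameters STRING,
  responseelements STRING,
  additionaleventdata STRING,
  requestid STRING,
  eventid STRING,
  eventtype STRING,
  recipientaccountid STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://mycloudtrailbucket/AWSLogs/123456789012/';
```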
The following query returns the logins that occurred over a 24-hour period:
SELECT
useridentity.username,
sourceipaddress,
eventtime,
additionaleventdata
FROM default.cloudtrail_logs
WHERE eventname = 'ConsoleLogin'
AND eventtime >= '2017-02-17T00:00:00Z'
AND eventtime < '2017-02-18T00:00:00Z';
For more information, see Querying AWS CloudTrail Logs (p. 135).
OpenCSVSerDe for Processing CSV
• If data contains values enclosed in double quotes ("), you can use the OpenCSV SerDe to deserialize
the values in Athena. In the following sections, note the behavior of this SerDe with STRING data
types.
• If data does not contain values enclosed in double quotes ("), you can omit specifying any SerDe. In
this case, Athena uses the default LazySimpleSerDe. For information, see LazySimpleSerDe for CSV,
TSV, and Custom-Delimited Files (p. 206).
• This SerDe cannot escape \t or \n directly. To escape them, use "escapeChar" = "\\". See the
example in this topic.
• This SerDe does not support embedded line breaks in CSV files.
Note
When you use Athena with OpenCSVSerDe, the SerDe converts all column types to STRING.
Next, the parser in Athena parses the values from STRING into actual types based on what it
finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types
when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses
them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT.
DATE type values are also parsed as INT.
To further convert columns to the desired type in a table, you can create a view (p. 86) over the
table and use CAST to convert to the desired type.
For data types other than STRING, when the parser in Athena can recognize them, this SerDe behaves as
follows:
• Recognizes BOOLEAN, BIGINT, INT, and DOUBLE data types and parses them without changes.
• Recognizes the TIMESTAMP type if it is specified in the UNIX format, such as yyyy-mm-dd
hh:mm:ss[.f...], as the type LONG.
• Does not support TIMESTAMP in the JDBC-compliant java.sql.Timestamp format, such as "YYYY-
MM-DD HH:MM:SS.fffffffff" (9 decimal place precision). If you are processing CSV data from Hive,
use the UNIX format for TIMESTAMP.
• Recognizes the DATE type if it is specified in the UNIX format, such as YYYY-MM-DD, as the type LONG.
• Does not support DATE in another format. If you are processing CSV data from Hive, use the UNIX
format for DATE.
The following statement creates a table in Athena, specifying that "escapeChar" = "\\".
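The statement can be sketched as follows; the table name and location are illustrative, and the f1 and s2 columns correspond to the sample query output shown next:

```sql
CREATE EXTERNAL TABLE test_escaped (
  f1 STRING,
  s2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\"",
  -- Allows \t and \n to appear in values when written as escaped sequences
  "escapeChar" = "\\"
)
LOCATION 's3://mybucket/escaped-data/';
```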
f1 s2
\t\t\n 123 \t\t\n abc
456 xyz
SerDe Name
CSV SerDe
Library Name
To use this SerDe, specify its fully qualified class name in ROW FORMAT. Also specify the delimiters inside
SERDEPROPERTIES, as follows:
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "`",
"escapeChar" = "\\"
)
Example
This example presumes data in CSV saved in s3://mybucket/mycsv/ with the following contents:
"a1","a2","a3","a4"
"1","2","abc","def"
"a","a1","abc3","ab4"
Use a CREATE TABLE statement to create an Athena table based on the data, and reference the
OpenCSVSerDe class in ROW FORMAT, also specifying SerDe properties for character separator, quote
character, and escape character, as follows:
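A sketch of such a statement, with illustrative column names (because this SerDe converts all columns to STRING, the columns are declared as STRING):

```sql
CREATE EXTERNAL TABLE myopencsvtable (
  col1 STRING,
  col2 STRING,
  col3 STRING,
  col4 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/mycsv/';
```

Running SELECT * FROM myopencsvtable then returns the values without the enclosing quotes.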
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
Grok SerDe
The Logstash Grok SerDe is a library with a set of specialized patterns for deserialization of unstructured
text data, usually logs. Each Grok pattern is a named regular expression. You can identify and re-use
these deserialization patterns as needed. This makes it easier to use Grok compared with using regular
expressions. Grok provides a set of pre-defined patterns. You can also create custom patterns.
To specify the Grok SerDe when creating a table in Athena, use the ROW FORMAT SERDE
'com.amazonaws.glue.serde.GrokSerDe' clause, followed by the WITH SERDEPROPERTIES clause
that specifies the patterns to match in your data, where:
• The input.format expression defines the patterns to match in the data. It is required.
• The input.grokCustomPatterns expression defines a named custom pattern, which you
can subsequently use within the input.format expression. It is optional. To include multiple
pattern entries into the input.grokCustomPatterns expression, use the newline escape
character (\n) to separate them, as follows: 'input.grokCustomPatterns'='INSIDE_QS ([^
\"]*)\nINSIDE_BRACKETS ([^\\]]*)').
Examples
These examples rely on the list of predefined Grok patterns. See pre-defined patterns.
Example 1
This example uses source data from Postfix maillog entries saved in s3://mybucket/groksample.
The following statement creates a table in Athena called mygroktable from the source data, using a
custom pattern and the predefined patterns that you specify:
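A sketch of that statement follows. It assumes Postfix syslog lines; the custom POSTFIX_QUEUEID pattern and the column names are illustrative:

```sql
CREATE EXTERNAL TABLE mygroktable (
  syslogbase STRING,
  queue_id STRING,
  syslog_message STRING
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  -- Custom pattern for the Postfix queue ID, used inside input.format below
  'input.grokCustomPatterns' = 'POSTFIX_QUEUEID [0-9A-F]{7,12}',
  'input.format' = '%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://mybucket/groksample/';
```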
Start with a simple pattern, such as %{NOTSPACE:column}, to get the columns mapped first and then
specialize the columns if needed.
Example 2
In the following example, you create a query for Log4j logs. The example logs have the entries in this
format:
• Add the Grok pattern to the input.format for each column. For example, for timestamp, add
%{TIMESTAMP_ISO8601:timestamp}. For loglevel, add %{LOGLEVEL:loglevel}.
• Make sure the pattern in input.format matches the format of the log exactly, by mapping the
dashes (-) and the commas that separate the entries in the log format.
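Putting those steps together, a statement of the following shape could be used. The log format, columns, and location here are illustrative assumptions:

```sql
CREATE EXTERNAL TABLE log4j_logs (
  log_timestamp STRING,
  loglevel STRING,
  message STRING
)
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  -- Assumes entries like: 2019-03-05 12:10:34,972 INFO - processing file
  'input.format' = '%{TIMESTAMP_ISO8601:log_timestamp} %{LOGLEVEL:loglevel} - %{GREEDYDATA:message}'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://mybucket/log4jsample/';
```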
Example 3
The following example of querying Amazon S3 logs shows the 'input.grokCustomPatterns'
expression that contains two pattern entries, separated by the newline escape character (\n), as
shown in this snippet from the example query: 'input.grokCustomPatterns'='INSIDE_QS ([^
\"]*)\nINSIDE_BRACKETS ([^\\]]*)').
CREATE EXTERNAL TABLE s3_access_logs(
  bucket_owner string, bucket string, time string, remote_ip string,
  requester string, request_id string, operation string, key string,
  request_uri string, http_status string, error_code string, bytes_sent string,
  object_size string, total_time string, turnaround_time string, referrer string,
  user_agent string, version_id string)
ROW FORMAT SERDE
'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format'='%{NOTSPACE:bucket_owner} %{NOTSPACE:bucket} \\[%{INSIDE_BRACKETS:time}\\]
%{NOTSPACE:remote_ip} %{NOTSPACE:requester} %{NOTSPACE:request_id} %{NOTSPACE:operation}
%{NOTSPACE:key} \"?%{INSIDE_QS:request_uri}\"? %{NOTSPACE:http_status}
%{NOTSPACE:error_code} %{NOTSPACE:bytes_sent} %{NOTSPACE:object_size}
%{NOTSPACE:total_time} %{NOTSPACE:turnaround_time} \"?%{INSIDE_QS:referrer}\"? \"?
%{INSIDE_QS:user_agent}\"? %{NOTSPACE:version_id}',
'input.grokCustomPatterns'='INSIDE_QS ([^\"]*)\nINSIDE_BRACKETS ([^\\]]*)')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-for-service-logs/s3_access'
JSON SerDe Libraries
SerDe Names
Hive-JsonSerDe
Openx-JsonSerDe
Library Names
Use one of the following:
org.apache.hive.hcatalog.data.JsonSerDe
org.openx.data.jsonserde.JsonSerDe
You can also use the Hive JSON SerDe to parse more complex JSON-encoded data with nested
structures. However, this requires having a matching DDL representing the complex data types. See
Example: Deserializing Nested JSON (p. 205).
With this SerDe, duplicate keys are not allowed in map (or struct) key names.
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
This SerDe has a few useful properties that you can specify when creating tables in Athena, to help
address inconsistencies in the data:
ignore.malformed.json
Optional. When set to TRUE, lets you skip malformed JSON syntax. The default is FALSE.
ConvertDotsInJsonKeysToUnderscores
Optional. The default is FALSE. When set to TRUE, allows the SerDe to replace the dots in key names
with underscores. For example, if the JSON dataset contains a key with the name "a.b", you can
use this property to define the column name to be "a_b" in Athena. By default (without this SerDe),
Athena does not allow dots in column names.
case.insensitive
Optional. By default, Athena requires that all keys in your JSON dataset use lowercase. The default is
TRUE. When set to TRUE, the SerDe converts all uppercase columns to lowercase. Using WITH
SERDEPROPERTIES ("case.insensitive" = FALSE) allows you to use case-sensitive key names in
your data.
ColumnToJsonKeyMappings
Optional. Maps column names to JSON keys that aren't identical to the column names. This is useful
when the JSON data contains keys that are keywords (p. 72). For example, if you have a JSON key
named timestamp, set this parameter to {"ts": "timestamp"} to map this key to a column
named ts. This parameter takes values of type string. It uses the following key pattern: ^\S+$ and
the following value pattern: ^(?!\s*$).+
With this SerDe, duplicate keys are not allowed in map (or struct) key names.
When dealing with complex nested JSON, there are common issues you may encounter. For more
information about these issues and troubleshooting practices, see the AWS Knowledge Center Article I
receive errors when I try to read JSON data in Amazon Athena.
For more information about common scenarios and query tips, see Create Tables in Amazon Athena from
Nested JSON and Mappings Using JSONSerDe.
The following example demonstrates a simple approach to creating an Athena table from data with
nested structures in JSON. To parse JSON-encoded data in Athena, each JSON document must be on its
own line, separated by a new line.
{
"DocId": "AWS",
"User": {
"Id": 1234,
"Username": "bob1234",
"Name": "Bob",
"ShippingAddress": {
"Address1": "123 Main St.",
"Address2": null,
"City": "Seattle",
"State": "WA"
},
"Orders": [
{
"ItemId": 6789,
"OrderDate": "11/11/2017"
},
{
"ItemId": 4352,
"OrderDate": "12/12/2017"
}
]
}
}
The following CREATE TABLE command uses the Openx-JsonSerDe with collection data types like
struct and array to establish groups of objects. Each JSON document is listed on its own line,
separated by a new line. To avoid errors, the data being queried does not include duplicate keys in
struct and map key names.
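A sketch of that command follows. The table name is illustrative, and the struct and array definitions mirror the JSON document shown above (`user` is quoted with backticks because it can collide with a reserved word):

```sql
CREATE EXTERNAL TABLE complex_json (
  docid STRING,
  `user` STRUCT<
    id: INT,
    username: STRING,
    name: STRING,
    shippingaddress: STRUCT<
      address1: STRING,
      address2: STRING,
      city: STRING,
      state: STRING>,
    -- Each order is a struct; Orders in the JSON is an array of them
    orders: ARRAY<STRUCT<
      itemid: INT,
      orderdate: STRING>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/myjsondata/';
```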
LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files
Library Name
The Class library name for the LazySimpleSerDe is
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. For more information, see
LazySimpleSerDe.
Examples
The following examples show how to create tables in Athena from CSV and TSV, using the
LazySimpleSerDe. To deserialize custom-delimited files using this SerDe, specify the delimiters similar
to the following examples.
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
CSV Example
Use the CREATE TABLE statement to create an Athena table from the underlying data in CSV stored in
Amazon S3.
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/csv/';
Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table.
TSV Example
This example presumes source data in TSV saved in s3://mybucket/mytsv/.
Use a CREATE TABLE statement to create an Athena table from the TSV data stored in Amazon
S3. Notice that this example does not reference a SerDe class in ROW FORMAT: because it uses the
LazySimpleSerDe, the class name can be omitted. The example specifies SerDe properties for character
and line separators, and an escape character:
crsdeptime STRING,
deptime STRING,
depdelay INT,
depdelayminutes INT,
depdel15 INT,
departuredelaygroups INT,
deptimeblk STRING,
taxiout INT,
wheelsoff STRING,
wheelson STRING,
taxiin INT,
crsarrtime INT,
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://athena-examples-myregion/flight/tsv/';
Run MSCK REPAIR TABLE to refresh partition metadata each time a new partition is added to this table.
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
ORC SerDe
SerDe Name
OrcSerDe
Library Name
The SerDe class for data in the ORC format is org.apache.hadoop.hive.ql.io.orc.OrcSerde. It passes the
object from ORC to the reader and from ORC to the writer.
Examples
Note
You can query data in regions other than the region where you run Athena. Standard inter-
region data transfer rates for Amazon S3 apply in addition to standard Athena charges. To
reduce data transfer charges, replace myregion in s3://athena-examples-myregion/
path/to/data/ with the region identifier where you run Athena, for example, s3://athena-
examples-us-east-1/path/to/data/.
The following example creates a table for the flight delays data in ORC. The table includes partitions:
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
STORED AS ORC
LOCATION 's3://athena-examples-myregion/flight/orc/'
tblproperties ("orc.compress"="ZLIB");
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata:
MSCK REPAIR TABLE flight_delays_orc;
Use this query to obtain the top 10 routes delayed by more than 1 hour:
SELECT origin, dest, count(*) as delays
FROM flight_delays_orc
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY 3 DESC
LIMIT 10;
Parquet SerDe
SerDe Name
ParquetHiveSerDe is used for data stored in the Parquet format.
Library Name
Athena uses this class when it needs to deserialize data stored in Parquet:
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.
Use the following CREATE TABLE statement to create an Athena table from underlying data stored in
Amazon S3 in the Parquet format:
deptime STRING,
depdelay INT,
depdelayminutes INT,
depdel15 INT,
departuredelaygroups INT,
deptimeblk STRING,
taxiout INT,
wheelsoff STRING,
wheelson STRING,
taxiin INT,
crsarrtime INT,
arrtime STRING,
arrdelay INT,
arrdelayminutes INT,
arrdel15 INT,
arrivaldelaygroups INT,
arrtimeblk STRING,
cancelled INT,
cancellationcode STRING,
diverted INT,
crselapsedtime INT,
actualelapsedtime INT,
airtime INT,
flights INT,
distance INT,
distancegroup INT,
carrierdelay INT,
weatherdelay INT,
nasdelay INT,
securitydelay INT,
lateaircraftdelay INT,
firstdeptime STRING,
totaladdgtime INT,
longestaddgtime INT,
divairportlandings INT,
divreacheddest INT,
divactualelapsedtime INT,
divarrdelay INT,
divdistance INT,
div1airport STRING,
div1airportid INT,
div1airportseqid INT,
div1wheelson STRING,
div1totalgtime INT,
div1longestgtime INT,
div1wheelsoff STRING,
div1tailnum STRING,
div2airport STRING,
div2airportid INT,
div2airportseqid INT,
div2wheelson STRING,
div2totalgtime INT,
div2longestgtime INT,
div2wheelsoff STRING,
div2tailnum STRING,
div3airport STRING,
div3airportid INT,
div3airportseqid INT,
div3wheelson STRING,
div3totalgtime INT,
div3longestgtime INT,
div3wheelsoff STRING,
div3tailnum STRING,
div4airport STRING,
div4airportid INT,
div4airportseqid INT,
div4wheelson STRING,
div4totalgtime INT,
div4longestgtime INT,
div4wheelsoff STRING,
div4tailnum STRING,
div5airport STRING,
div5airportid INT,
div5airportseqid INT,
div5wheelson STRING,
div5totalgtime INT,
div5longestgtime INT,
div5wheelsoff STRING,
div5tailnum STRING
)
PARTITIONED BY (year STRING)
STORED AS PARQUET
LOCATION 's3://athena-examples-myregion/flight/parquet/'
tblproperties ("parquet.compress"="SNAPPY");
Run the MSCK REPAIR TABLE statement on the table to refresh partition metadata.
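For example, assuming the Parquet table is named flight_delays_pq (the CREATE TABLE line is not shown in this excerpt, so the name is an assumption), the statement would look like this:

```sql
MSCK REPAIR TABLE flight_delays_pq;
```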
Note
The flight table data comes from Flights provided by US Department of Transportation, Bureau
of Transportation Statistics. Desaturated from original.
Compression Formats
Athena supports the following compression formats:
Note
The compression formats listed in this section are used for CREATE TABLE (p. 223) queries. For
CTAS queries, Athena supports GZIP and SNAPPY (for data stored in Parquet and ORC). If you
omit a format, GZIP is used by default. For more information, see CREATE TABLE AS (p. 226).
• SNAPPY. This is the default compression format for files in the Parquet data storage format.
• ZLIB. This is the default compression format for files in the ORC data storage format.
• LZO
• GZIP
For data in CSV, TSV, and JSON, Athena determines the compression type from the file extension. If
no file extension is present, the data is not decompressed. If your data is compressed, make sure the
file name includes the compression extension, such as gz.
Use GZIP compression when you query Amazon Kinesis Data Firehose logs with Athena. Athena
and Amazon Kinesis Data Firehose each support different versions of SNAPPY, so GZIP is the only
compatible format.
Topics
• Data Types (p. 217)
• DDL Statements (p. 218)
• SQL Queries, Functions, and Operators (p. 234)
• Unsupported DDL (p. 238)
• Limitations (p. 239)
Data Types
When you run CREATE TABLE, you must specify column names and their data types. For a complete
syntax of this command, see CREATE TABLE (p. 223).
To specify decimal values as literals, such as when selecting rows with a specific decimal value in a
query DDL expression, specify the DECIMAL type definition, and list the decimal value as a literal (in
single quotes) in your query, as in this example: decimal_value = DECIMAL '0.12'.
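For example, a query that filters rows on a decimal column could look like the following sketch. The table name my_table and column decimal_value are hypothetical:

```sql
SELECT *
FROM my_table
WHERE decimal_value = DECIMAL '0.12';
```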
• String types
• CHAR. Fixed length character data, with a specified length between 1 and 255, such as char(10).
For more information, see CHAR Hive Data Type.
• VARCHAR. Variable length character data, with a specified length between 1 and 65535, such as
varchar(10). For more information, see VARCHAR Hive Data Type.
• BINARY (for data in Parquet)
• Date and time types
• DATE, in the UNIX format, such as YYYY-MM-DD.
• TIMESTAMP. Instant in time and date in the UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...].
For example, TIMESTAMP '2008-09-15 03:04:05.324'. This format uses the session time zone.
• Structural types
• ARRAY < data_type >
• MAP < primitive_type, data_type >
• STRUCT < col_name : data_type [COMMENT col_comment] [, ...] >
For information about supported data type mappings between types in Athena, the JDBC driver, and
Java data types, see the "Data Types" section in the JDBC Driver Installation and Configuration Guide.
DDL Statements
Use the following DDL statements directly in Athena.
Athena does not support all DDL statements. For information, see Unsupported DDL (p. 238).
Topics
• ALTER DATABASE SET DBPROPERTIES (p. 219)
• ALTER TABLE ADD PARTITION (p. 219)
• ALTER TABLE DROP PARTITION (p. 220)
• ALTER TABLE RENAME PARTITION (p. 221)
• ALTER TABLE SET LOCATION (p. 221)
• ALTER TABLE SET TBLPROPERTIES (p. 222)
• CREATE DATABASE (p. 222)
• CREATE TABLE (p. 223)
• CREATE TABLE AS (p. 226)
• CREATE VIEW (p. 228)
• DESCRIBE TABLE (p. 228)
• DESCRIBE VIEW (p. 229)
• DROP DATABASE (p. 229)
• DROP TABLE (p. 230)
• DROP VIEW (p. 230)
• MSCK REPAIR TABLE (p. 231)
• SHOW COLUMNS (p. 231)
• SHOW CREATE TABLE (p. 231)
• SHOW CREATE VIEW (p. 232)
• SHOW DATABASES (p. 232)
• SHOW PARTITIONS (p. 232)
• SHOW TABLES (p. 233)
ALTER DATABASE SET DBPROPERTIES
Synopsis
ALTER (DATABASE|SCHEMA) database_name
SET DBPROPERTIES ('property_name'='property_value' [, ...] )
Parameters
SET DBPROPERTIES ('property_name'='property_value' [, ...]
Specifies one or more properties for the database, each named property_name, and establishes the
value for each as the corresponding property_value. If property_name already exists, its old value
is overwritten with property_value.
Examples
ALTER DATABASE jd_datasets
SET DBPROPERTIES ('creator'='John Doe', 'department'='applied mathematics');
ALTER TABLE ADD PARTITION
Synopsis
ALTER TABLE table_name ADD [IF NOT EXISTS]
PARTITION
(partition_col1_name = partition_col1_value
[,partition_col2_name = partition_col2_value]
[,...])
[LOCATION 'location1']
[PARTITION
(partition_colA_name = partition_colA_value
[,partition_colB_name = partition_colB_value
[,...])]
[LOCATION 'location2']
[,...]
Parameters
[IF NOT EXISTS]
Causes the error to be suppressed if a partition with the same definition already exists.
PARTITION (partition_col_name = partition_col_value [,...])
Creates a partition with the column name/value combinations that you specify. Enclose
partition_col_value in quotation marks only if the data type of the column is a string.
[LOCATION 'location']
Specifies the directory in which to store the partitions defined by the preceding statement.
Examples
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN');
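The synopsis also allows several partitions in one statement, each with its own optional LOCATION. The following sketch adds two partitions to the orders table and points each at its own Amazon S3 directory; the bucket and paths are assumptions:

```sql
ALTER TABLE orders ADD IF NOT EXISTS
  PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016/'
  PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016/';
```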
ALTER TABLE DROP PARTITION
Synopsis
ALTER TABLE table_name DROP [IF EXISTS] PARTITION (partition_spec) [, PARTITION
(partition_spec)]
Parameters
[IF EXISTS]
Suppresses the error message if the partition specified does not exist.
PARTITION (partition_spec)
Each partition_spec specifies a column name/value combination in the form partition_col_name
= partition_col_value [,...].
Examples
ALTER TABLE orders DROP PARTITION (dt = '2014-05-14', country = 'IN');
ALTER TABLE orders DROP PARTITION (dt = '2014-05-14', country = 'IN'), PARTITION (dt =
'2014-05-15', country = 'IN');
ALTER TABLE RENAME PARTITION
Synopsis
ALTER TABLE table_name PARTITION (partition_spec) RENAME TO PARTITION (new_partition_spec)
Parameters
PARTITION (partition_spec)
Each partition_spec specifies a column name/value combination in the form partition_col_name
= partition_col_value [,...].
Examples
ALTER TABLE orders PARTITION (dt = '2014-05-14', country = 'IN') RENAME TO PARTITION (dt =
'2014-05-15', country = 'IN');
ALTER TABLE SET LOCATION
Synopsis
ALTER TABLE table_name [ PARTITION (partition_spec) ] SET LOCATION 'new location'
Parameters
PARTITION (partition_spec)
Specifies the partition with parameters partition_spec whose location you want to change. The
partition_spec specifies a column name/value combination in the form partition_col_name
= partition_col_value.
SET LOCATION 'new location'
Specifies the new location, which must be an Amazon S3 location.
Examples
ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION 's3://mystorage/
custdata';
ALTER TABLE SET TBLPROPERTIES
Synopsis
ALTER TABLE table_name SET TBLPROPERTIES ('property_name' = 'property_value' [ , ... ])
Parameters
SET TBLPROPERTIES ('property_name' = 'property_value' [ , ... ])
Specifies the metadata properties to add as property_name and the value for each as
property_value. If property_name already exists, its value is reset to property_value.
Examples
ALTER TABLE orders SET TBLPROPERTIES ('notes'="Please don't drop this table.");
CREATE DATABASE
Creates a database. The use of DATABASE and SCHEMA is interchangeable. They mean the same thing.
Synopsis
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT 'database_comment']
[LOCATION 'S3_loc']
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...]]
Parameters
[IF NOT EXISTS]
Causes the error to be suppressed if a database named database_name already exists.
[COMMENT database_comment]
Establishes the metadata value for the built-in metadata property named comment and the value
you provide for database_comment.
[LOCATION S3_loc]
Specifies the location where database files and metastore will exist as S3_loc. The location must be
an Amazon S3 location.
[WITH DBPROPERTIES ('property_name' = 'property_value') [, ...] ]
Allows you to specify custom metadata properties for the database definition.
Examples
CREATE DATABASE clickstreams;
CREATE TABLE
Creates a table with the name and the parameters that you specify.
Synopsis
CREATE [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name [(col_name data_type [COMMENT col_comment] [, ...] )]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
[STORED AS file_format]
[WITH SERDEPROPERTIES (...)]
[LOCATION 's3_loc']
[TBLPROPERTIES ( ['has_encrypted_data'='true | false',]
['classification'='aws_glue_classification',] property_name=property_value [, ...] ) ]
Parameters
[EXTERNAL]
Specifies that the table is based on an underlying data file that exists in Amazon S3, in the
LOCATION that you specify. When you create an external table, the data referenced must comply
with the default format or the format that you specify with the ROW FORMAT, STORED AS, and WITH
SERDEPROPERTIES clauses.
[IF NOT EXISTS]
Causes the error message to be suppressed if a table named table_name already exists.
[db_name.]table_name
Specifies a name for the table to be created. The optional db_name parameter specifies the
database where the table exists. If omitted, the current database is assumed. If the table name
includes numbers, enclose table_name in quotation marks, for example "table123". If
table_name begins with an underscore, use backticks, for example, `_mytable`. Special
characters (other than underscore) are not supported.
Athena table names are case-insensitive; however, if you work with Apache Spark, Spark requires
lowercase table names.
[ ( col_name data_type [COMMENT col_comment] [, ...] ) ]
Specifies the name for each column to be created, along with the column's data type. Column names
do not allow special characters other than underscore (_). If col_name begins with an underscore,
enclose the column name in backticks, for example `_mycolumn`.
• SMALLINT. A 16-bit signed INTEGER in two’s complement format, with a minimum value of -2^15
and a maximum value of 2^15-1.
• INT. Athena combines two different implementations of the INTEGER data type. In Data
Definition Language (DDL) queries, Athena uses the INT data type. In all other queries, Athena
uses the INTEGER data type, where INTEGER is represented as a 32-bit signed value in two's
complement format, with a minimum value of -2^31 and a maximum value of 2^31-1. In the JDBC
driver, INTEGER is returned, to ensure compatibility with business analytics applications.
• BIGINT. A 64-bit signed INTEGER in two's complement format, with a minimum value of -2^63
and a maximum value of 2^63-1.
• DOUBLE
• FLOAT
• DECIMAL [ (precision, scale) ], where precision is the total number of digits, and
scale (optional) is the number of digits in the fractional part; the default is 0. For example, use
these type definitions: DECIMAL(11,5), DECIMAL(15).
To specify decimal values as literals, such as when selecting rows with a specific decimal value in a
query DDL expression, specify the DECIMAL type definition, and list the decimal value as a literal
(in single quotes) in your query, as in this example: decimal_value = DECIMAL '0.12'.
• CHAR. Fixed length character data, with a specified length between 1 and 255, such as char(10).
For more information, see CHAR Hive Data Type.
• VARCHAR. Variable length character data, with a specified length between 1 and 65535, such as
varchar(10). For more information, see VARCHAR Hive Data Type.
• BINARY (for data in Parquet)
• Date and time types
• DATE, in the UNIX format, such as YYYY-MM-DD.
• TIMESTAMP. Instant in time and date in the UNIX format, such as yyyy-mm-dd
hh:mm:ss[.f...]. For example, TIMESTAMP '2008-09-15 03:04:05.324'. This format uses
the session time zone.
• ARRAY < data_type >
• MAP < primitive_type, data_type >
• STRUCT < col_name : data_type [COMMENT col_comment] [, ...] >
[COMMENT table_comment]
Creates the comment table property and populates it with the table_comment you specify.
[PARTITIONED BY (col_name data_type [ COMMENT col_comment ], ... ) ]
Creates a partitioned table with one or more partition columns that have the col_name,
data_type and col_comment specified. A table can have one or more partitions, which consist of a
distinct column name and value combination. A separate data directory is created for each specified
combination, which can improve query performance in some circumstances. Partitioned columns
don't exist within the table data itself. If you use a value for col_name that is the same as a table
column, you get an error. For more information, see Partitioning Data (p. 74).
Note
After you create a table with partitions, run a subsequent query that consists of the MSCK
REPAIR TABLE (p. 231) clause to refresh partition metadata, for example, MSCK REPAIR
TABLE cloudfront_logs;.
[ROW FORMAT row_format]
Specifies the row format of the table and its underlying source data if applicable. For row_format,
you can specify one or more delimiters with the DELIMITED clause or, alternatively, use the SERDE
clause as described below. If ROW FORMAT is omitted or ROW FORMAT DELIMITED is specified, a
native SerDe is used.
• [DELIMITED FIELDS TERMINATED BY char [ESCAPED BY char]]
• [DELIMITED COLLECTION ITEMS TERMINATED BY char]
• [MAP KEYS TERMINATED BY char]
• [LINES TERMINATED BY char]
• [NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
--OR--
• SERDE 'serde_name' [WITH SERDEPROPERTIES ("property_name" = "property_value",
"property_name" = "property_value" [, ...] )]
The serde_name indicates the SerDe to use. The WITH SERDEPROPERTIES clause allows you to
provide one or more custom properties allowed by the SerDe.
[STORED AS file_format]
Specifies the file format for table data. If omitted, TEXTFILE is the default. Options for
file_format are:
• SEQUENCEFILE
• TEXTFILE
• RCFILE
• ORC
• PARQUET
• AVRO
• INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
[LOCATION 'S3_loc']
Specifies the location of the underlying data in Amazon S3 from which the table is created, for
example, 's3://mystorage/'. For more information about considerations such as data format
and permissions, see Requirements for Tables in Athena and Data in Amazon S3 (p. 68).
Use a trailing slash for your folder or bucket. Do not use file names or glob characters.
Use: s3://mybucket/key/
[TBLPROPERTIES ( ['has_encrypted_data'='true | false',]
['classification'='aws_glue_classification',] property_name=property_value [, ...] ) ]
Specifies custom metadata key-value pairs for the table definition in addition to predefined table
properties, such as "comment".
Athena has a built-in property, has_encrypted_data. Set this property to true to indicate that
the underlying dataset specified by LOCATION is encrypted. If omitted and if the workgroup's
settings do not override client-side settings, false is assumed. If omitted or set to false when
underlying data is encrypted, the query results in an error. For more information, see Configuring
Encryption Options (p. 62).
To run ETL jobs, AWS Glue requires that you create a table with the classification property
to indicate the data type for AWS Glue as csv, parquet, orc, avro, or json. For example,
'classification'='csv'. ETL jobs will fail if you do not specify this property. You can
subsequently specify it using the AWS Glue console, API, or CLI. For more information, see Using
AWS Glue Jobs for ETL with Athena (p. 40) and Authoring Jobs in Glue in the AWS Glue Developer
Guide.
Examples
CREATE EXTERNAL TABLE IF NOT EXISTS mydatabase.cloudfront_logs (
Date DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://athena-examples/cloudfront/plaintext/';
CREATE TABLE AS
Creates a new table populated with the results of a SELECT (p. 235) query. To create an empty table,
use CREATE TABLE (p. 223).
Topics
• Synopsis (p. 226)
• CTAS Table Properties (p. 227)
Synopsis
CREATE TABLE table_name
[ WITH ( property_name = expression [, ...] ) ]
AS query
[ WITH [ NO ] DATA ]
Where:
WITH ( property_name = expression [, ...] )
A list of optional CTAS table properties, some of which are specific to the data storage format. See
CTAS Table Properties (p. 227).
query
The SELECT query that Athena runs to create the new table and populate it with data.
[ WITH [ NO ] DATA ]
If WITH NO DATA is used, a new empty table with the same schema as the original table is created.
external_location = [location]
The location where Athena saves your CTAS query in Amazon S3, for example, WITH
(external_location ='s3://my-bucket/tables/parquet_table/'). This property
is optional. When you don’t specify any location and your workgroup does not override
client-side settings (p. 166), Athena stores the CTAS query results in external_location
= 's3://aws-athena-query-results-<account>-<region>/<query-name-or-
unsaved>/<year>/<month>/<date>/<query-id>/', and does not use the same path
again. If you specify the location manually, make sure that the Amazon S3 location has no
data. Athena never attempts to delete your data. If you want to use the same location again,
manually clean the data, otherwise your CTAS query will fail.
If the workgroup in which a query will run is configured with an enforced query results
location (p. 166), do not specify an external_location for the CTAS query. Athena
issues an error and fails a query that specifies an external_location in this case. For
example, this query fails, if you enforce the workgroup to use its own location: CREATE TABLE
<DB>.<TABLE1> WITH (format='Parquet', external_location='s3://my_test/
test/') AS SELECT * FROM <DB>.<TABLE2> LIMIT 10;
To obtain the results location specified for the workgroup, view workgroup's details (p. 170).
format = [format]
The data format for the CTAS query results, such as ORC, PARQUET, AVRO, JSON, or TEXTFILE.
For example, WITH (format = 'PARQUET'). If omitted, PARQUET is used by default. The
name of this parameter, format, must be listed in lowercase, or your CTAS query will fail.
partitioned_by = ( [col_name,…])
Optional. An array list of columns by which the CTAS table will be partitioned. Verify that the
names of partitioned columns are listed last in the list of columns in the SELECT statement.
bucketed_by( [bucket_name,…])
An array list of buckets to bucket data. If omitted, Athena does not bucket your data in this
query.
bucket_count = [int]
The number of buckets for bucketing your data. If omitted, Athena does not bucket your data.
orc_compression = [format]
The compression type to use for ORC data. For example, WITH (orc_compression =
'ZLIB'). If omitted, GZIP compression is used by default for ORC and other data storage
formats supported by CTAS.
parquet_compression = [format]
The compression type to use for Parquet data. For example, WITH (parquet_compression =
'SNAPPY'). If omitted, GZIP compression is used by default for Parquet and other data storage
formats supported by CTAS.
field_delimiter = [delimiter]
Optional and specific to text-based data storage formats. The field delimiter for files in CSV,
TSV, and text files. For example, WITH (field_delimiter = ','). If you don't specify a field
delimiter, \001 is used by default.
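To illustrate how these properties combine, the following sketch creates a partitioned Parquet table from an existing table. The source table flight_delays_csv, its columns, and the S3 location are assumptions for illustration only:

```sql
CREATE TABLE ctas_parquet_table
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/tables/parquet_table/',
  partitioned_by = ARRAY['year']
)
AS SELECT origin, dest, depdelayminutes, year
FROM flight_delays_csv;
```

Note that the partition column year appears last in the SELECT list, as required by partitioned_by, and that you should omit external_location if your workgroup enforces a query results location.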
CREATE VIEW
Creates a new view from a specified SELECT query. The view is a logical table that can be referenced by
future queries. Views do not contain any data and do not write data. Instead, the query specified by the
view runs each time you reference the view by another query.
The optional OR REPLACE clause lets you update the existing view by replacing it. For more information,
see Creating Views (p. 89).
Synopsis
CREATE [ OR REPLACE ] VIEW view_name AS query
Examples
To create a view test from the table orders, use a query similar to the following:
CREATE VIEW test AS
SELECT
orderkey,
orderstatus,
totalprice / 2 AS half
FROM orders;
To create a view orders_by_date from the table orders, use the following query:
CREATE VIEW orders_by_date AS
SELECT orderdate, sum(totalprice) AS price
FROM orders
GROUP BY orderdate;
See also SHOW COLUMNS (p. 231), SHOW CREATE VIEW (p. 232), DESCRIBE VIEW (p. 229), and
DROP VIEW (p. 230).
DESCRIBE TABLE
Shows the list of columns, including partition columns, for the named table. This allows you to
examine the attributes of a complex column.
Synopsis
DESCRIBE [EXTENDED | FORMATTED] [db_name.]table_name [PARTITION partition_spec] [col_name
( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )]
Parameters
[EXTENDED | FORMATTED]
Determines the format of the output. If you specify EXTENDED, all metadata for the table is
output in Thrift serialized form. This is useful primarily for debugging and not for general use. Use
FORMATTED or omit the clause to show the metadata in tabular format.
[PARTITION partition_spec]
If included, lists the metadata for the partition with the partition_spec that you specify.
[col_name ( [.field_name] | [.'$elem$'] | [.'$key$'] | [.'$value$'] )]
Specifies the column and attributes to examine. You can specify .field_name for an element of a
struct, '$elem$' for an array element, '$key$' for a map key, and '$value$' for a map value. You can
specify this recursively to further explore the complex column.
Examples
DESCRIBE orders;
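Following the synopsis, you can also drill into a complex column. The sketch below assumes orders has a column shipaddress of a STRUCT type with a field city; both names are hypothetical:

```sql
DESCRIBE orders shipaddress.city;
```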
DESCRIBE VIEW
Shows the list of columns for the named view. This allows you to examine the attributes of a complex
view.
Synopsis
DESCRIBE view_name
Example
DESCRIBE orders;
See also SHOW COLUMNS (p. 231), SHOW CREATE VIEW (p. 232), SHOW VIEWS (p. 234), and DROP
VIEW (p. 230).
DROP DATABASE
Removes the named database from the catalog. If the database contains tables, you must either drop the
tables before running DROP DATABASE or use the CASCADE clause. The use of DATABASE and SCHEMA
is interchangeable. They mean the same thing.
Synopsis
DROP {DATABASE | SCHEMA} [IF EXISTS] database_name [RESTRICT | CASCADE]
Parameters
[IF EXISTS]
Causes the error to be suppressed if database_name doesn't exist.
[RESTRICT|CASCADE]
Determines how tables within database_name are regarded during the DROP operation. If you
specify RESTRICT, the database is not dropped if it contains tables. This is the default behavior.
Specifying CASCADE causes the database and all its tables to be dropped.
Examples
DROP DATABASE clickstreams;
DROP TABLE
Removes the metadata table definition for the table named table_name. When you drop an external
table, the underlying data remains intact because all tables in Athena are EXTERNAL.
Synopsis
DROP TABLE [IF EXISTS] table_name
Parameters
[ IF EXISTS ]
Causes the error to be suppressed if table_name doesn't exist.
Examples
DROP TABLE fulfilled_orders;
DROP VIEW
Drops (deletes) an existing view. The optional IF EXISTS clause causes the error to be suppressed if the
view does not exist.
Synopsis
DROP VIEW [ IF EXISTS ] view_name
Examples
DROP VIEW orders_by_date
See also CREATE VIEW (p. 228), SHOW COLUMNS (p. 231), SHOW CREATE VIEW (p. 232), SHOW
VIEWS (p. 234), and DESCRIBE VIEW (p. 229).
MSCK REPAIR TABLE
Recovers partitions and data associated with partitions. Use this statement when you add partitions to
the catalog.
Synopsis
MSCK REPAIR TABLE table_name
Examples
MSCK REPAIR TABLE orders;
SHOW COLUMNS
Lists the columns in the schema for a base table or a view.
Synopsis
SHOW COLUMNS IN table_name|view_name
Examples
SHOW COLUMNS IN clicks;
SHOW CREATE TABLE
Synopsis
SHOW CREATE TABLE [db_name.]table_name
Parameters
TABLE [db_name.]table_name
The db_name parameter is optional. If omitted, the context defaults to the current database.
Note
The table name is required.
Examples
SHOW CREATE TABLE orderclickstoday;
SHOW CREATE VIEW
Synopsis
SHOW CREATE VIEW view_name
Examples
SHOW CREATE VIEW orders_by_date
See also CREATE VIEW (p. 228) and DROP VIEW (p. 230).
SHOW DATABASES
Lists all databases defined in the metastore. You can use DATABASES or SCHEMAS. They mean the same
thing.
Synopsis
SHOW {DATABASES | SCHEMAS} [LIKE 'regular_expression']
Parameters
[LIKE 'regular_expression']
Filters the list of databases to those that match the regular_expression you specify. Wildcards
can only be *, which indicates any character, or |, which indicates a choice between characters.
Examples
SHOW SCHEMAS;
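To filter the list with the supported wildcards, a LIKE clause can be sketched as follows; the pattern is an arbitrary example:

```sql
SHOW DATABASES LIKE '*analytics*';
```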
SHOW PARTITIONS
Lists all the partitions in a table.
Synopsis
SHOW PARTITIONS table_name
Examples
SHOW PARTITIONS clicks;
SHOW TABLES
Lists all the base tables and views in a database.
Synopsis
SHOW TABLES [IN database_name] ['regular_expression']
Parameters
[IN database_name]
Specifies the database_name from which tables will be listed. If omitted, the database from the
current context is assumed.
['regular_expression']
Filters the list of tables to those that match the regular_expression you specify. Only the
wildcard *, which indicates any character, or |, which indicates a choice between characters, can be
used.
Examples
SHOW TABLES;
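A sketch that combines both optional parameters, listing tables in a hypothetical sampledb database whose names contain flights:

```sql
SHOW TABLES IN sampledb '*flights*';
```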
SHOW TBLPROPERTIES
Lists table properties for the named table.
Synopsis
SHOW TBLPROPERTIES table_name [('property_name')]
Parameters
[('property_name')]
If included, only the value of the property named property_name is listed.
Examples
SHOW TBLPROPERTIES orders;
SHOW VIEWS
Lists the views in the specified database, or in the current database if you omit the database name. Use
the optional LIKE clause with a regular expression to restrict the list of view names.
Athena returns a list of STRING type values where each value is a view name.
Synopsis
SHOW VIEWS [IN database_name] [LIKE 'regular_expression']
Parameters
[IN database_name]
Specifies the database_name from which views will be listed. If omitted, the database from the
current context is assumed.
[LIKE 'regular_expression']
Filters the list of views to those that match the regular_expression you specify. Only the
wildcard *, which indicates any character, or |, which indicates a choice between characters, can be
used.
Examples
SHOW VIEWS;
See also SHOW COLUMNS (p. 231), SHOW CREATE VIEW (p. 232), DESCRIBE VIEW (p. 229), and
DROP VIEW (p. 230).
SQL Queries, Functions, and Operators
The Amazon Athena query engine is based on Presto 0.172. For more information about these functions,
see Presto 0.172 Functions and Operators.
Athena does not support all of Presto's features. For information, see Limitations (p. 239).
• Logical Operators
• Comparison Functions and Operators
• Conditional Expressions
• Conversion Functions
• Mathematical Functions and Operators
• Bitwise Functions
• Decimal Functions and Operators
• String Functions and Operators
• Binary Functions
• Date and Time Functions and Operators
• Regular Expression Functions
• JSON Functions and Operators
• URL Functions
• Aggregate Functions
• Window Functions
• Color Functions
• Array Functions and Operators
• Map Functions and Operators
• Lambda Expressions and Functions
• Teradata Functions
SELECT
Retrieves rows from zero or more tables.
Synopsis
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] union_query ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]
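As a minimal illustration of the synopsis, the following sketch combines several of the clauses described below. It reuses the flight_delays_orc table from the ORC SerDe example earlier in this guide:

```sql
SELECT origin, dest, count(*) AS delays
FROM flight_delays_orc
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY delays DESC
LIMIT 10;
```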
Parameters
[ WITH with_query [, ...] ]
The WITH clause precedes the SELECT list in a query and defines one or more subqueries for use
within the SELECT query.
Each subquery defines a temporary table, similar to a view definition, which you can reference in the
FROM clause. The tables are used only when the query runs.
WITH subquery_table_name [ ( column_name [, ...] ) ] AS (subquery)
Where:
• subquery_table_name is a unique name for a temporary table that defines the results of the
WITH clause subquery. Each subquery must have a table name that can be referenced in the
FROM clause.
• column_name [, ...] is an optional list of output column names. The number of column
names must be equal to or less than the number of columns defined by subquery.
• subquery is any query statement.
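A sketch of a WITH clause, reusing the flight_delays_orc table from the earlier ORC SerDe example; the subquery table and column names are chosen for illustration:

```sql
WITH top_routes (origin, dest, delays) AS (
  SELECT origin, dest, count(*)
  FROM flight_delays_orc
  WHERE depdelayminutes > 60
  GROUP BY origin, dest
)
SELECT origin, dest
FROM top_routes
ORDER BY delays DESC
LIMIT 10;
```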
[ ALL | DISTINCT ] select_expr
ALL is the default. Using ALL is treated the same as if it were omitted; all rows for all columns are
selected and duplicates are kept.
Use DISTINCT to return only distinct values when a column contains duplicate values.
FROM from_item [, ...]
Indicates the input to the query, where from_item can be a view, a join construct, or a subquery as
described below.
• table_name [ [ AS ] alias [ (column_alias [, ...]) ] ]
Where table_name is the name of the target table from which to select rows, alias is the name
to give the output of the SELECT statement, and column_alias defines the columns for the
alias specified.
-OR-
• join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ]
[ GROUP BY [ ALL | DISTINCT ] grouping_expressions [, ...] ]
Divides the output of the SELECT statement into rows with matching values.
ALL and DISTINCT determine whether duplicate grouping sets each produce distinct output rows. If
omitted, ALL is assumed.
The grouping_expressions element can be any function, such as SUM, AVG, or COUNT, performed
on input columns, or an ordinal number that selects an output column by position, starting at
one.
GROUP BY expressions can group output by input column names that don't appear in the output of
the SELECT statement.
All output expressions must be either aggregate functions or columns present in the GROUP BY
clause.
You can use a single query to perform analysis that requires aggregating multiple column sets.
These complex grouping operations don't support expressions comprising input columns. Only
column names or ordinals are allowed.
You can often use UNION ALL to achieve the same results as these GROUP BY operations, but
queries that use GROUP BY have the advantage of reading the data one time, whereas UNION ALL
reads the underlying data once for each SELECT and may produce inconsistent results when the
data source is subject to change.
GROUP BY CUBE generates all possible grouping sets for a given set of columns. GROUP BY
ROLLUP generates all possible subtotals for a given set of columns.
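As a sketch of these operations (the sales table and its columns are hypothetical), GROUP BY CUBE produces one output group for every combination of region and product, including the per-column subtotals and the grand total:

```sql
-- CUBE (region, product) is equivalent to GROUPING SETS
-- ((region, product), (region), (product), ()).
SELECT region, product, SUM(amount) AS total
FROM sales
GROUP BY CUBE (region, product);
```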
[ HAVING condition ]
Used with aggregate functions and the GROUP BY clause. Controls which groups are selected,
eliminating groups that don't satisfy condition. This filtering occurs after groups and aggregates
are computed.
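For example, using the cloudfront_logs table from the examples later in this topic, HAVING keeps only the groups whose aggregate satisfies the condition:

```sql
-- Groups with 100 or fewer rows are eliminated after aggregation.
SELECT os, COUNT(*) AS total
FROM cloudfront_logs
GROUP BY os
HAVING COUNT(*) > 100;
```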
[ UNION [ ALL | DISTINCT ] union_query ]
Combines the results of more than one SELECT statement into a single query. ALL or DISTINCT
control which rows are included in the final result set.
ALL causes all rows to be included, even if the rows are identical.
DISTINCT causes only unique rows to be included in the combined result set. DISTINCT is the
default.
Multiple UNION clauses are processed left to right unless you use parentheses to explicitly define the
order of processing.
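As an illustration (the second table name is hypothetical), UNION with the default DISTINCT behavior returns each value once even if it appears in both inputs:

```sql
-- Add ALL after UNION to keep duplicate rows instead.
SELECT os FROM cloudfront_logs
UNION
SELECT os FROM cloudfront_logs_archive;
```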
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
When the clause contains multiple expressions, the result set is sorted according to the first
expression. Then the second expression is applied to rows that have matching values from the
first expression, and so on.
Each expression may specify output columns from SELECT or an ordinal number for an output
column by position, starting at one.
ORDER BY is evaluated as the last step after any GROUP BY or HAVING clause. ASC and DESC
determine whether results are sorted in ascending or descending order.
The default null ordering is NULLS LAST, regardless of ascending or descending sort order.
LIMIT [ count | ALL ]
Restricts the number of rows in the result set to count. LIMIT ALL is the same as omitting the
LIMIT clause. If the query has no ORDER BY clause, the results are arbitrary.
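The following sketch combines ORDER BY and LIMIT, using the cloudfront_logs table from the examples later in this topic:

```sql
-- Sort groups by descending count (NULLS LAST by default) and keep the top 10 rows.
SELECT os, COUNT(*) AS total
FROM cloudfront_logs
GROUP BY os
ORDER BY total DESC
LIMIT 10;
```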
[ TABLESAMPLE BERNOULLI | SYSTEM (percentage) ]
Optional operator to select rows from a table based on a sampling method. BERNOULLI selects
each row to be in the table sample with a probability of percentage. All physical blocks of the
table are scanned, and certain rows are skipped based on a comparison between the sample
percentage and a random value calculated at runtime.
With SYSTEM, the table is divided into logical segments of data, and the table is sampled at this
granularity.
Either all rows from a particular segment are selected, or the segment is skipped based on a
comparison between the sample percentage and a random value calculated at runtime. SYTSTEM
sampling is dependent on the connector. This method does not guarantee independent sampling
probabilities.
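For example, the following sketch samples roughly 10 percent of the rows of the cloudfront_logs table row by row; replacing BERNOULLI with SYSTEM would sample whole segments instead:

```sql
-- The sampled row count is approximate, not exact.
SELECT * FROM cloudfront_logs TABLESAMPLE BERNOULLI (10);
```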
[ UNNEST (array_or_map) [WITH ORDINALITY] ]
Expands an array or map into a relation. Arrays are expanded into a single column. Maps are
expanded into two columns (key, value).
You can use UNNEST with multiple arguments, which are expanded into multiple columns with as
many rows as the highest cardinality argument.
UNNEST is usually used with a JOIN and can reference columns from relations on the left side of the
JOIN.
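As a sketch (the orders table and its array column items are hypothetical), UNNEST expands one row per array element, and WITH ORDINALITY adds a column with each element's position:

```sql
-- Each order row is repeated once per element of its "items" array.
SELECT o.id, t.item, t.pos
FROM orders AS o
CROSS JOIN UNNEST(o.items) WITH ORDINALITY AS t (item, pos);
```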
Examples
SELECT * FROM table;
SELECT os, COUNT(*) count FROM cloudfront_logs WHERE date BETWEEN date '2014-07-05' AND
date '2014-08-05' GROUP BY os;
For more examples, see Querying Data in Amazon Athena Tables (p. 83).
Unsupported DDL
The following native Hive DDLs are not supported by Athena:
• ALTER INDEX
• ALTER TABLE table_name ARCHIVE PARTITION
• ALTER TABLE table_name CLUSTERED BY
• ALTER TABLE table_name EXCHANGE PARTITION
• ALTER TABLE table_name NOT CLUSTERED
• ALTER TABLE table_name NOT SKEWED
• ALTER TABLE table_name NOT SORTED
• ALTER TABLE table_name NOT STORED AS DIRECTORIES
• ALTER TABLE table_name partitionSpec ADD COLUMNS
• ALTER TABLE table_name partitionSpec CHANGE COLUMNS
Limitations
Athena does not support the following features, which are supported by an open source Presto version
0.172.
Code Samples
Use the links in this section to find code samples, instructions for using the previous version of the JDBC driver, and Athena service limits.
Topics
• Code Samples (p. 241)
• Using the Previous Version of the JDBC Driver (p. 249)
• Service Limits (p. 253)
Code Samples
Use examples in this topic as a starting point for writing Athena applications using the SDK for Java 2.x.
Note
These samples use constants (for example, ATHENA_SAMPLE_QUERY) for strings, which are
defined in an ExampleConstants.java class declaration. Replace these constants with your
own strings or defined constants.
Constants
The ExampleConstants.java class demonstrates how to query a table created by the Getting
Started (p. 23) tutorial in Athena.
package aws.example.athena;
public class ExampleConstants {
    // Sample values; replace these with your own database, bucket, and query.
    public static final String ATHENA_DEFAULT_DATABASE = "default";
    public static final String ATHENA_OUTPUT_BUCKET = "s3://my-athena-bucket";
    public static final String ATHENA_SAMPLE_QUERY = "SELECT * FROM cloudfront_logs LIMIT 10;";
    public static final long SLEEP_AMOUNT_IN_MS = 1000;
}
Create a Client to Access Athena
package aws.example.athena;
import software.amazon.awssdk.auth.credentials.InstanceProfileCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.AthenaClientBuilder;
/**
* AthenaClientFactory
* -------------------------------------
* This code shows how to create and configure an Amazon Athena client.
*/
public class AthenaClientFactory {
/**
* AthenaClientBuilder to build Athena with the following properties:
* - Set the region of the client
* - Use the instance profile from the EC2 instance as the credentials provider
* - Configure the client to increase the execution timeout.
*/
private final AthenaClientBuilder builder = AthenaClient.builder()
.region(Region.US_WEST_2)
.credentialsProvider(InstanceProfileCredentialsProvider.create());
public AthenaClient createClient() {
return builder.build();
}
}
package aws.example.athena;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.*;
import software.amazon.awssdk.services.athena.paginators.GetQueryResultsIterable;
import java.util.List;
/**
* StartQueryExample
* -------------------------------------
* This code shows how to submit a query to Athena for execution, wait until the
* results are available, and then process the results.
*/
public class StartQueryExample {
public static void main(String[] args) throws InterruptedException {
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AthenaClient athenaClient = factory.createClient();
String queryExecutionId = submitAthenaQuery(athenaClient);
waitForQueryToComplete(athenaClient, queryExecutionId);
processResultRows(athenaClient, queryExecutionId);
}
/**
* Submits a sample query to Athena and returns the execution ID of the query.
*/
private static String submitAthenaQuery(AthenaClient athenaClient) {
// The QueryExecutionContext allows us to set the Database.
QueryExecutionContext queryExecutionContext = QueryExecutionContext.builder()
.database(ExampleConstants.ATHENA_DEFAULT_DATABASE).build();
// The result configuration specifies where the results of the query should go in S3 and encryption options
ResultConfiguration resultConfiguration = ResultConfiguration.builder()
// You can provide encryption options for the output that is written.
// .withEncryptionConfiguration(encryptionConfiguration)
.outputLocation(ExampleConstants.ATHENA_OUTPUT_BUCKET).build();
// Create the StartQueryExecutionRequest to send to Athena, which starts the query.
StartQueryExecutionRequest startQueryExecutionRequest = StartQueryExecutionRequest.builder()
.queryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.queryExecutionContext(queryExecutionContext)
.resultConfiguration(resultConfiguration).build();
StartQueryExecutionResponse startQueryExecutionResponse =
athenaClient.startQueryExecution(startQueryExecutionRequest);
return startQueryExecutionResponse.queryExecutionId();
}
/**
* Wait for an Athena query to complete, fail, or be cancelled. This is done by
* polling Athena over an interval of time. If a query fails or is cancelled,
* this method throws an exception.
*/
private static void waitForQueryToComplete(AthenaClient athenaClient, String queryExecutionId)
throws InterruptedException {
GetQueryExecutionRequest getQueryExecutionRequest = GetQueryExecutionRequest.builder()
.queryExecutionId(queryExecutionId).build();
GetQueryExecutionResponse getQueryExecutionResponse;
boolean isQueryStillRunning = true;
while (isQueryStillRunning) {
getQueryExecutionResponse =
athenaClient.getQueryExecution(getQueryExecutionRequest);
String queryState =
getQueryExecutionResponse.queryExecution().status().state().toString();
if (queryState.equals(QueryExecutionState.FAILED.toString())) {
throw new RuntimeException("Query Failed to run with Error Message: " +
getQueryExecutionResponse
.queryExecution().status().stateChangeReason());
} else if (queryState.equals(QueryExecutionState.CANCELLED.toString())) {
throw new RuntimeException("Query was cancelled.");
} else if (queryState.equals(QueryExecutionState.SUCCEEDED.toString())) {
isQueryStillRunning = false;
} else {
// Sleep an amount of time before retrying.
Thread.sleep(ExampleConstants.SLEEP_AMOUNT_IN_MS);
}
System.out.println("Current Status is: " + queryState);
}
}
/**
* This code calls Athena and retrieves the results of a query.
* The query must be in a completed state before the results can be retrieved and
* paginated. The first row of results contains the column headers.
*/
private static void processResultRows(AthenaClient athenaClient, String
queryExecutionId) {
GetQueryResultsRequest getQueryResultsRequest = GetQueryResultsRequest.builder()
// MaxResults can be set, but if it is not set,
// the service chooses the maximum page size.
// As of the writing of this code, the maximum value is 1000.
// .maxResults(1000)
.queryExecutionId(queryExecutionId).build();
GetQueryResultsIterable getQueryResultsResults =
athenaClient.getQueryResultsPaginator(getQueryResultsRequest);
break;
default:
throw new RuntimeException("Unexpected type: " +
columnInfo.type());
}
}
}
}
package aws.example.athena;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.*;
/**
* StopQueryExecutionExample
* -------------------------------------
* This code runs an example query, immediately stops the query, and checks the
* status of the query to ensure that it was cancelled.
*/
public class StopQueryExecutionExample {
public static void main(String[] args) throws Exception {
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AthenaClient athenaClient = factory.createClient();
String sampleQueryExecutionId = submitAthenaQuery(athenaClient);
StopQueryExecutionRequest stopQueryExecutionRequest = StopQueryExecutionRequest.builder()
.queryExecutionId(sampleQueryExecutionId).build();
StopQueryExecutionResponse stopQueryExecutionResponse =
athenaClient.stopQueryExecution(stopQueryExecutionRequest);
GetQueryExecutionRequest getQueryExecutionRequest = GetQueryExecutionRequest.builder()
.queryExecutionId(sampleQueryExecutionId).build();
GetQueryExecutionResponse getQueryExecutionResponse =
athenaClient.getQueryExecution(getQueryExecutionRequest);
if (getQueryExecutionResponse.queryExecution()
.status()
.state()
.equals(QueryExecutionState.CANCELLED)) {
// Query was cancelled.
System.out.println("Query has been cancelled");
}
}
/**
* Submits an example query and returns the query execution ID of a running query to stop.
*/
public static String submitAthenaQuery(AthenaClient athenaClient) {
QueryExecutionContext queryExecutionContext = QueryExecutionContext.builder()
.database(ExampleConstants.ATHENA_DEFAULT_DATABASE).build();
ResultConfiguration resultConfiguration = ResultConfiguration.builder()
.outputLocation(ExampleConstants.ATHENA_OUTPUT_BUCKET).build();
StartQueryExecutionRequest startQueryExecutionRequest =
StartQueryExecutionRequest.builder()
.queryExecutionContext(queryExecutionContext)
.queryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.resultConfiguration(resultConfiguration).build();
StartQueryExecutionResponse startQueryExecutionResponse =
athenaClient.startQueryExecution(startQueryExecutionRequest);
return startQueryExecutionResponse.queryExecutionId();
}
}
package aws.example.athena;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.ListQueryExecutionsRequest;
import software.amazon.awssdk.services.athena.model.ListQueryExecutionsResponse;
import software.amazon.awssdk.services.athena.paginators.ListQueryExecutionsIterable;
import java.util.List;
/**
* ListQueryExecutionsExample
* -------------------------------------
* This code shows how to obtain a list of query execution IDs.
*/
public class ListQueryExecutionsExample {
public static void main(String[] args) throws Exception {
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AthenaClient athenaClient = factory.createClient();
ListQueryExecutionsRequest listQueryExecutionsRequest = ListQueryExecutionsRequest.builder().build();
ListQueryExecutionsIterable listQueryExecutionResponses =
athenaClient.listQueryExecutionsPaginator(listQueryExecutionsRequest);
for (ListQueryExecutionsResponse response : listQueryExecutionResponses) {
List<String> queryExecutionIds = response.queryExecutionIds();
System.out.println(queryExecutionIds);
}
}
}
Create a Named Query
package aws.example.athena;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.CreateNamedQueryRequest;
import software.amazon.awssdk.services.athena.model.CreateNamedQueryResponse;
/**
* CreateNamedQueryExample
* -------------------------------------
* This code shows how to create a named query.
*/
public class CreateNamedQueryExample {
public static void main(String[] args) throws Exception {
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AthenaClient athenaClient = factory.createClient();
// Create the named query request.
CreateNamedQueryRequest createNamedQueryRequest = CreateNamedQueryRequest.builder()
.database(ExampleConstants.ATHENA_DEFAULT_DATABASE)
.queryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.name("SampleQuery")
.description("Sample Description").build();
// Call Athena to create the named query.
CreateNamedQueryResponse createNamedQueryResponse =
athenaClient.createNamedQuery(createNamedQueryRequest);
}
}
package aws.example.athena;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.CreateNamedQueryRequest;
import software.amazon.awssdk.services.athena.model.CreateNamedQueryResponse;
import software.amazon.awssdk.services.athena.model.DeleteNamedQueryRequest;
import software.amazon.awssdk.services.athena.model.DeleteNamedQueryResponse;
/**
* DeleteNamedQueryExample
* -------------------------------------
* This code shows how to delete a named query by using the named query ID.
*/
public class DeleteNamedQueryExample {
private static String getNamedQueryId(AthenaClient athenaClient) {
// Create the NameQuery Request.
CreateNamedQueryRequest createNamedQueryRequest = CreateNamedQueryRequest.builder()
.database(ExampleConstants.ATHENA_DEFAULT_DATABASE)
.queryString(ExampleConstants.ATHENA_SAMPLE_QUERY)
.name("SampleQueryName")
.description("Sample Description").build();
CreateNamedQueryResponse createNamedQueryResponse =
athenaClient.createNamedQuery(createNamedQueryRequest);
return createNamedQueryResponse.namedQueryId();
}
List Named Queries
package aws.example.athena;
import software.amazon.awssdk.services.athena.AthenaClient;
import software.amazon.awssdk.services.athena.model.ListNamedQueriesRequest;
import software.amazon.awssdk.services.athena.model.ListNamedQueriesResponse;
import software.amazon.awssdk.services.athena.paginators.ListNamedQueriesIterable;
import java.util.List;
/**
* ListNamedQueryExample
* -------------------------------------
* This code shows how to obtain a list of named query IDs.
*/
public class ListNamedQueryExample {
public static void main(String[] args) throws Exception {
// Build an Athena client
AthenaClientFactory factory = new AthenaClientFactory();
AthenaClient athenaClient = factory.createClient();
ListNamedQueriesRequest listNamedQueriesRequest = ListNamedQueriesRequest.builder().build();
ListNamedQueriesIterable listNamedQueriesResponses =
athenaClient.listNamedQueriesPaginator(listNamedQueriesRequest);
for (ListNamedQueriesResponse response : listNamedQueriesResponses) {
List<String> namedQueryIds = response.namedQueryIds();
System.out.println(namedQueryIds);
}
}
}
Using the Previous Version of the JDBC Driver
The previous versions of the JDBC driver are 2.0.6, 2.0.5, and 2.0.2.
The JDBC driver version 1.1.0 is also available for download; however, we highly recommend that you
migrate to the current version of the driver. For information, see the JDBC Driver Migration Guide.
The JDBC driver version 1.0.1 and earlier versions are deprecated.
2. Download the Release Notes, the License Agreement and Notices for the driver you downloaded in
step 1.
3. Use the AWS CLI with the following command:
• To install and configure the JDBC driver version 2.0.6, see the JDBC Driver Installation and
Configuration Guide.
• To migrate to this version of the JDBC driver from a 1.x version, see the JDBC Driver Migration
Guide.
Instructions for JDBC Driver version 1.1.0
Download the JDBC driver version 1.1.0 that is compatible with JDBC 4.1 and JDK 7.0:
AthenaJDBC41-1.1.0.jar. Also, download the driver license, and the third-party licenses for the
driver. Use the AWS CLI with the following command: aws s3 cp s3://path_to_the_driver
[local_directory], and then use the remaining instructions in this section.
Note
The following instructions are specific to JDBC version 1.1.0 and earlier.
jdbc:awsathena://athena.{REGION}.amazonaws.com:443
where {REGION} is a region identifier, such as us-west-2. For information on Athena regions, see
Regions.
JDBC Driver Version 1.1.0: Specify the JDBC Driver Class Name
To use the driver in custom applications, set up your Java class path to the location of the JAR file
that you downloaded from Amazon S3 https://s3.amazonaws.com/athena-downloads/drivers/JDBC/
AthenaJDBC_1.1.0/AthenaJDBC41-1.1.0.jar. This makes the classes within the JAR available for use. The
main JDBC driver class is com.amazonaws.athena.jdbc.AthenaDriver.
Another way to supply credentials to BI tools, such as SQL Workbench, is to provide the AWS access
key and AWS secret key as the JDBC user and password properties, respectively.
Users who connect through the JDBC driver and have custom access policies attached to their profiles
need permissions for policy actions in addition to those in the Amazon Athena API Reference.
• athena:GetCatalogs
• athena:GetExecutionEngine
• athena:GetExecutionEngines
• athena:GetNamespace
• athena:GetNamespaces
• athena:GetTable
• athena:GetTables
• log_path: Local path of the Athena JDBC driver logs. If no log path is provided, then no log files
are created. Default: N/A. Required: No.
• log_level: Log level of the Athena JDBC driver logs. Valid values: INFO, DEBUG, WARN, ERROR,
ALL, OFF, FATAL, TRACE. Default: N/A. Required: No.
Examples: Using the 1.1.0 Version of the JDBC Driver with the
JDK
The following code examples demonstrate how to use the JDBC driver version 1.1.0 in a Java application.
These examples assume that the AWS SDK for Java is included in your classpath, specifically the
aws-java-sdk-core module, which includes the authorization packages (com.amazonaws.auth.*)
referenced in the examples.
info.put("aws_credentials_provider_class","com.amazonaws.auth.DefaultAWSCredentialsProviderChain");
Class.forName("com.amazonaws.athena.jdbc.AthenaDriver");
The following examples demonstrate different ways to use a credentials provider that implements the
AWSCredentialsProvider interface with the previous version of the JDBC driver.
myProps.put("aws_credentials_provider_class","com.amazonaws.auth.PropertiesFileCredentialsProvider");
myProps.put("aws_credentials_provider_arguments","/Users/myUser/.athenaCredentials");
accessKey = ACCESSKEY
secretKey = SECRETKEY
Replace the right part of the assignments with your account's AWS access and secret keys.
myProps.put("aws_credentials_provider_class","com.amazonaws.athena.jdbc.CustomSessionsCredentialsProvider");
String providerArgs = "My_Access_Key," + "My_Secret_Key," + "My_Token";
myProps.put("aws_credentials_provider_arguments",providerArgs);
Note
If you use the InstanceProfileCredentialsProvider, you don't need to supply any
credential provider arguments because they are provided using the Amazon EC2 instance
profile for the instance on which you are running your application. You would still set the
aws_credentials_provider_class property to this class name, however.
The following deprecated policy actions correspond to current Athena API actions:
• athena:RunQuery → athena:StartQueryExecution
• athena:CancelQueryExecution → athena:StopQueryExecution
• athena:GetQueryExecutions → athena:ListQueryExecutions
Service Limits
Note
You can contact AWS Support to request a limit increase for the limits listed here.
• 20 DML queries at the same time. DML queries include SELECT and CREATE TABLE AS (CTAS)
queries.
After you submit your queries to Athena, it processes the queries by assigning resources based on
the overall service load and the number of incoming requests. We continuously monitor and make
adjustments to the service so that your queries process as fast as possible.
Athena service limits are shared across all workgroups in the account.
These are soft limits and you can request a limit increase. These limits in Athena are defined as the
number of queries that can be submitted to the service at the same time. You can submit up to 20
queries of the same type (DDL or DML) at a time. If you submit a query that exceeds the query limit,
the Athena API displays an error message: "You have exceeded the limit for the number of queries you
can run concurrently. Reduce the number of concurrent queries submitted by this account. Contact
customer support to request a concurrent query limit increase."
• If you use Athena in regions where AWS Glue is available, migrate to AWS Glue Data Catalog. See
Upgrading to the AWS Glue Data Catalog Step-by-Step (p. 29).
• If you have migrated to AWS Glue Data Catalog, for service limits on tables, databases, and
partitions in Athena, see AWS Glue Limits.
• If you have not migrated to AWS Glue Data Catalog, the number of partitions per table is 20,000.
You can request a limit increase.
• You may encounter a limit for Amazon S3 buckets per account, which is 100. Athena also needs a
separate bucket to log results.
• The query timeout is 30 minutes.
• The maximum allowed query string length is 262144 bytes, where the strings are encoded in UTF-8.
• Use these tips (p. 71) for naming columns, tables, and databases in Athena.
• The maximum number of workgroups you can create per Region in your account is 1000.
• Athena APIs have the following default limits for the number of calls to the API per account (not per
query):
• BatchGetNamedQuery, ListNamedQueries, ListQueryExecutions: 5 calls per second, with a burst
capacity of up to 10 calls.
• CreateNamedQuery, DeleteNamedQuery, GetNamedQuery: 5 calls per second, with a burst capacity
of up to 20 calls.
• BatchGetQueryExecution: 20 calls per second, with a burst capacity of up to 40 calls.
• StartQueryExecution, StopQueryExecution: 20 calls per second, with a burst capacity of up to 80
calls.
For example, for StartQueryExecution, you can make up to 20 calls per second. In addition, if this
API is not called for 4 seconds, your account accumulates a burst capacity of up to 80 calls. In this case,
your application can make up to 80 calls to this API in burst mode.
If you use any of these APIs and exceed the default limit for the number of calls per second, or the
burst capacity in your account, the Athena API issues an error similar to the following: "ClientError: An
error occurred (ThrottlingException) when calling the <API_name> operation: Rate exceeded." Reduce
the number of calls per second, or the burst capacity for the API for this account. You can contact AWS
Support to request a limit increase.
Document History
This documentation is associated with the May 18, 2017 version of Amazon Athena.
Released the new To download the ODBC driver version 1.0.5 and its March 5, 2019
version of the documentation, see Connecting to Amazon Athena with
ODBC driver with ODBC (p. 44). For information about this version, see the
support for Athena ODBC Driver Release Notes.
workgroups.
For more information, search for "workgroup" in the ODBC
Driver Installation and Configuration Guide version 1.0.5.
There are no changes to the ODBC driver connection string
when you use tags on workgroups. To use tags, upgrade to
the latest version of the ODBC driver, which is this current
version.
Added tag support A tag consists of a key and a value, both of which you February 22,
for workgroups in define. When you tag a workgroup, you assign custom 2019
Amazon Athena. metadata to it. You can add tags to workgroups to help
categorize them, using AWS tagging best practices. You
can use tags to restrict access to workgroups, and to track
costs. For example, create a workgroup for each cost center.
Then, by adding tags to these workgroups, you can track
your Athena spending for each cost center. For more
information, see Using Tags for Billing in the AWS Billing
and Cost Management User Guide.
Improved the JSON The improvements include, but are not limited to, the February 18,
OpenX SerDe used in following: 2019
Athena.
• Support for the
ConvertDotsInJsonKeysToUnderscores property.
When set to TRUE, it allows the SerDe to replace the dots
in key names with underscores. For example, if the JSON
dataset contains a key with the name "a.b", you can use
this property to define the column name to be "a_b" in
Athena. The default is FALSE. By default, Athena does
not allow dots in column names.
• Support for the case.insensitive property. By
default, Athena requires that all keys in your JSON
dataset use lowercase. Using WITH SERDE PROPERTIES
("case.insensitive"= FALSE;) allows you to
use case-sensitive key names in your data. The default
255
Amazon Athena User Guide
Added support for Use workgroups to separate users, teams, applications, or February 18,
workgroups. workloads, and to set limits on amount of data each query 2019
or the entire workgroup can process. Because workgroups
act as IAM resources, you can use resource-level permissions
to control access to a specific workgroup. You can also
view query-related metrics in Amazon CloudWatch, control
query costs by configuring limits on the amount of data
scanned, create thresholds, and trigger actions, such as
Amazon SNS alarms, when these thresholds are breached.
For more information, see Using Workgroups for Running
Queries (p. 158) and Controlling Costs and Monitoring
Queries with CloudWatch Metrics (p. 174).
Added support Added example Athena queries for analyzing logs from January 24, 2019
for analyzing logs Network Load Balancer. These logs receive detailed
from Network Load information about the Transport Layer Security (TLS)
Balancer. requests sent to the Network Load Balancer. You can
use these access logs to analyze traffic patterns and
troubleshoot issues. For information, see the section called
“Querying Network Load Balancer Logs” (p. 142).
Released the new With this release of the drivers, federated access to Athena November 10,
versions of the JDBC is supported for the Active Directory Federation Service (AD 2018
and ODBC driver FS 3.0). Access is established through the versions of JDBC
with support for or ODBC drivers that support SAML 2.0. For information
federated access to about configuring federated access to the Athena API, see
Athena API with the the section called “Enabling Federated Access to Athena
AD FS and SAML 2.0 API” (p. 59).
(Security Assertion
Markup Language
2.0).
Added support Added support for creating identity-based (IAM) policies October 15, 2018
for fine-grained that provide fine-grained access control to resources in the
access control to AWS Glue Data Catalog, such as databases and tables used
databases and in Athena.
tables in Athena.
Additionally, added Additionally, you can encrypt database and table metadata
policies in Athena in the Data Catalog, by adding specific policies to Athena.
that allow you to
encrypt database For details, see Access Control Policies.
and table metadata
in the Data Catalog.
256
Amazon Athena User Guide
Added support for Added support for CREATE TABLE AS SELECT October 10, 2018
CREATE TABLE AS statements. See Creating a Table from Query
SELECT statements. Results (p. 91), Considerations and Limitations (p. 91), and
Examples (p. 97).
Made other
improvements in the
documentation.
Released the ODBC The ODBC driver version 1.0.3 supports streaming results September 6,
driver version 1.0.3 and also includes improvements, bug fixes, and an updated 2018
with support for documentation for "Using SSL with a Proxy Server". For
streaming results details, see the Release Notes for the driver.
instead of fetching
them in pages. For downloading the ODBC driver version 1.0.3 and its
documentation, see Connecting to Amazon Athena with
Made other ODBC (p. 44).
improvements in the
documentation.
Released the JDBC Released the JDBC driver 2.0.5 with default support for August 16, 2018
driver version 2.0.5 streaming results instead of fetching them in pages. For
with default support information, see Using Athena with the JDBC Driver (p. 43).
for streaming results
instead of fetching For information about streaming results, search for
them in pages. UseResultsetStreaming in the JDBC Driver Installation and
Configuration Guide.
Made other
improvements in the
documentation.
Updated the Updated the documentation for querying Amazon Virtual August 7, 2018
documentation for Private Cloud flow logs, which can be stored directly in
querying Amazon Amazon S3 in a GZIP format. For information, see Querying
Virtual Private Cloud Amazon VPC Flow Logs (p. 145).
flow logs, which can
be stored directly in Updated examples for querying ALB logs. For information,
Amazon S3 in a GZIP see Querying Application Load Balancer Logs (p. 143).
format.
Updated examples
for querying ALB
logs.
Added support Added support for views. For information, see Views (p. 86). June 5, 2018
for views. Added
guidelines Updated this guide with guidance on handling schema
for schema updates for various data storage formats. For information,
manipulations for see Handling Schema Updates (p. 148).
various data storage
formats.
Increased default You can submit and run up to twenty DDL queries and May 17, 2018
query concurrency twenty SELECT queries at a time. For information, see
limits from five to Service Limits (p. 253).
twenty.
257
Amazon Athena User Guide
Added query tabs, Added query tabs, and an ability to configure auto- May 8, 2018
and an ability to complete in the Query Editor. For information, see Using
configure auto- the Console (p. 27).
complete in the
Query Editor.
April 19, 2018: Released the new version of the JDBC driver (version 2.0.2). For information, see Using Athena with the JDBC Driver (p. 43).

April 6, 2018: Added auto-complete for typing queries in the Athena console.

March 15, 2018: Added the ability to automatically create Athena tables for CloudTrail log files directly from the CloudTrail console. For information, see Creating a Table for CloudTrail Logs in the CloudTrail Console (p. 136).

February 2, 2018: Added the ability to securely offload intermediate data to disk for memory-intensive queries that use the GROUP BY clause. This improves the reliability of such queries, preventing "Query resource exhausted" errors. For more information, see the release note for February 2, 2018 (p. 14).

January 19, 2018: Upgraded the underlying engine in Amazon Athena to a version based on Presto version 0.172. For more information, see the release note for January 19, 2018 (p. 14).

November 13, 2017: Added support for connecting Athena to the ODBC driver. For information, see Connecting to Amazon Athena with ODBC.

November 1, 2017: Added support for querying geospatial data, and for the Asia Pacific (Seoul), Asia Pacific (Mumbai), and EU (London) Regions. For information, see Querying Geospatial Data and AWS Regions and Endpoints.

October 19, 2017: Added support for EU (Frankfurt). For a list of supported regions, see AWS Regions and Endpoints.

October 3, 2017: Added support for creating named Athena queries with AWS CloudFormation. For more information, see AWS::Athena::NamedQuery in the AWS CloudFormation User Guide.

September 25, 2017: Added support for Asia Pacific (Sydney). For a list of supported regions, see AWS Regions and Endpoints.
September 5, 2017: Added examples for Querying AWS Service Logs (p. 135) and for querying different types of data in Athena, including maps, arrays, nested data, and data containing JSON. For information, see Querying Data in Amazon Athena Tables (p. 83).

August 14, 2017: Added integration with the AWS Glue Data Catalog and a migration wizard for updating from the Athena managed data catalog to the AWS Glue Data Catalog. For more information, see Integration with AWS Glue and AWS Glue.

August 4, 2017: Added support for the Grok SerDe, which provides easier pattern matching for records in unstructured text files such as logs. For more information, see Grok SerDe. Added keyboard shortcuts to scroll through query history in the console (CTRL + ⇧/⇩ on Windows, CMD + ⇧/⇩ on Mac).

June 22, 2017: Added support for Asia Pacific (Tokyo) and Asia Pacific (Singapore). For a list of supported regions, see AWS Regions and Endpoints.

June 8, 2017: Added support for EU (Ireland). For more information, see AWS Regions and Endpoints.

May 19, 2017: Added an Amazon Athena API and AWS CLI support for Athena. Updated the JDBC driver to version 1.1.0.

April 4, 2017: Added support for Amazon S3 data encryption and released a JDBC driver update (version 1.0.1) with encryption support, improvements, and bug fixes. For more information, see Configuring Encryption Options (p. 62).

March 24, 2017: Added the AWS CloudTrail SerDe, improved performance, and fixed partition issues. For more information, see CloudTrail SerDe (p. 196).

February 20, 2017: Added support for the Avro SerDe (p. 193) and the OpenCSVSerDe for Processing CSV (p. 198), added support for US East (Ohio), and added bulk editing of columns in the console wizard. Improved performance on large Parquet tables.

November 2016: Initial release of the Amazon Athena User Guide.
AWS Glossary
For the latest AWS terminology, see the AWS Glossary in the AWS General Reference.