Amazon Athena
Based on Presto
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there's no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning.
When should I use Athena
Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3. Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena.
Athena integrates with Amazon QuickSight for easy data visualization. You can use Athena to generate reports or to explore data with business intelligence tools or SQL clients connected with a JDBC or an ODBC driver. For more information, see What is Amazon QuickSight in theAmazon QuickSight User Guideand Connecting to Amazon Athena with ODBC and JDBC Drivers.
Athena integrates with the AWS Glue Data Catalog, which offers a persistent metadata store for your data in Amazon S3. This allows you to create tables and query data in Athena based on a central metadata store available throughout your AWS account and integrated with the ETL and data discovery features of AWS Glue. For more information, see Integration with AWS Glue and What is AWS Glue in theAWS Glue Developer Guide.
https://docs.aws.amazon.com/athena/latest/ug/when-should-i-use-ate.html
Partitioning Data
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost. Athena leverages Hive for partitioning data. You can partition your data by any key. A common practice is to partition the data based on time, often leading to a multi-level partitioning scheme. For example, a customer who has data coming in every hour might decide to partition by year, month, date, and hour. Another customer, who has data coming from many different sources but loaded one time per day, may partition by a data source identifier and date.
If you issue queries against Amazon S3 buckets with a large number of objects and the data is not partitioned, such queries may affect the Get request rate limits in Amazon S3 and lead to Amazon S3 exceptions. To prevent errors, partition your data. Additionally, consider tuning your Amazon S3 request rates.
- Get started managing partitions for Amazon S3 tables backed by the AWS Glue Data Catalog | AWS Big Data Blog
- What is partitioning? - Amazon Athena
- Partitioning Data on S3 to Improve Performance in Athena/Presto | Upsolver
- How Not to Partition Data in S3 (And What to Do Instead)
s3://bucket/data/dt=2025-01-01/events.parquetinstead ofs3://bucket/data/year=2025/month=01/day=01/events.parquet
Note
If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data only from that partition. For more information, see Table Location and Partitions.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data. There are two scenarios discussed in the following sections:
- Data is already partitioned, stored on Amazon S3, and you need to access the data on Athena.
- Data is not partitioned.
https://docs.aws.amazon.com/athena/latest/ug/partitions.html
When you create a table from CSV data in Athena, determine what types of values it contains:
- If data contains values enclosed in double quotes ("), you can use the OpenCSV SerDe to deserialize the values in Athena. In the following sections, note the behavior of this SerDe withSTRINGdata types.
- If data does not contain values enclosed in double quotes ("), you can omit specifying any SerDe. In this case, Athena uses the defaultLazySimpleSerDe. For information, see LazySimpleSerDe for CSV, TSV, and Custom-Delimited Files.
Key Points
- Does not support embedded line breaks in CSV files.
- Does not support empty fields in columns defined as a numeric data type.