Deep dive / Best practices
Insert performance
https://stackoverflow.com/questions/16485425/aws-redshift-jdbc-insert-performance
Optimizations / Best practices
- https://aws.amazon.com/about-aws/whats-new/2020/12/amazon-redshift-announces-automatic-table-optimization
- https://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
- https://docs.aws.amazon.com/redshift/latest/dg/c_designing-queries-best-practices.html
- https://www.youtube.com/watch?v=TJDtQom7SAA
- https://aws.amazon.com/blogs/big-data/top-8-best-practices-for-high-performance-etl-processing-using-amazon-redshift
- https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift
- https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization
- Improve your ETL performance using multiple Redshift warehouses for writes (Preview) | AWS Big Data Blog
Data storage, ingestion and ELT
Redundancy
Amazon Redshift utilizes locally attached storage devices
Compute nodes have 2 to 3 times the advertised storage capacity
A global commit ensures that all permanent tables have their blocks written to another node in the cluster for data redundancy
Blocks are asynchronously backed up to Amazon S3 - snapshots are always consistent
Every 5 GB of changed data or eight hours
User on-demand manual snapshots
To disable backups at the table level: CREATE TABLE example (id int) BACKUP NO;
Temporary tables
Blocks are not mirrored to the remote partition - two-times faster write performance
Do not trigger a full commit or backups
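As a sketch (table and column names are hypothetical), both variants below skip the Amazon S3 backup; the temporary table additionally skips mirroring and the full commit:

    CREATE TEMPORARY TABLE stage_events (id INT, payload VARCHAR(64));    -- not mirrored, no backup, no full commit
    CREATE TABLE scratch_events (id INT, payload VARCHAR(64)) BACKUP NO;  -- mirrored, but excluded from snapshots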
Transactions
Amazon Redshift is a fully transactional, ACID-compliant data warehouse
- Isolation level is serializable
- Two-phase commit (local and global commit phases)
Design consideration
- Because of the expense of commit overhead, limit commits by explicitly creating transactions
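For example, a minimal sketch (table names are hypothetical) that batches three statements under one explicit transaction, paying the commit overhead once instead of three times:

    BEGIN;
    UPDATE dim_customer SET is_active = false WHERE last_seen < '2020-01-01';
    INSERT INTO fact_sales SELECT * FROM stage_sales;
    DELETE FROM stage_sales;
    COMMIT;  -- single global commit for all three statements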
Data ingestion: COPY statement
Ingestion throughput
Each slice's query processors can load one file at a time:
- Streaming decompression
- Parse
- Distribute
- Write
Loading a single file realizes only partial cluster usage - just 6.25% (1 of 16) of slices are active
Number of input files should be a multiple of the number of slices
By splitting the single file into 16 input files, all slices are kept busy, maximizing ingestion performance
COPY continues to scale linearly as you add nodes
Recommendation is to use delimited files - 1 MB to 1 GB after gzip compression
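For example (bucket and IAM role are placeholders, lineitem is a hypothetical table), a single COPY pointed at a common key prefix loads every matching file, so 16 gzipped parts can keep 16 slices busy in parallel:

    COPY lineitem
    FROM 's3://my-bucket/lineitem/part_'    -- matches part_00.gz through part_15.gz
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    GZIP DELIMITER '|';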
Best practices: COPY ingestion
Delimited files are recommended
- Pick a simple delimiter: '|', ',', or tab
- Pick a simple NULL character (\N)
- Use double quotes and an escape character ('\') for varchars
- UTF-8 varchar columns take four bytes per char
Split files into a number that is a multiple of the total number of slices in the Amazon Redshift cluster
SELECT count(slice) from stv_slices;
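Extending the earlier COPY sketch with these formatting options (the NULL marker and escape settings are assumptions about how the input files were written):

    COPY customers
    FROM 's3://my-bucket/customers/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    GZIP
    DELIMITER '|'
    NULL AS '\N'   -- assumes the extract writes \N for NULL fields
    ESCAPE;        -- backslash escapes delimiters and quotes embedded in varchars

Choose the number of part_ files as a multiple of the slice count returned by the query above.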
Data ingestion: Amazon Redshift Spectrum
Use INSERT INTO SELECT against external Amazon S3 tables
- Aggregate incoming data
- Select subset of columns and/or rows
- Manipulate incoming column data with SQL
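A sketch of that pattern, assuming an external schema (here called spectrum_schema) has already been defined over the S3 data; table and column names are placeholders:

    INSERT INTO daily_sales_summary
    SELECT sale_date, store_id, SUM(amount) AS total_amount
    FROM spectrum_schema.raw_sales          -- external table over Amazon S3
    WHERE sale_date = '2021-06-01'          -- filter the incoming rows
    GROUP BY sale_date, store_id;           -- aggregate before the data lands in Redshift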
Best practices:
- Save cluster resources for querying and reporting rather than spending them on ELT
- Filtering/aggregating incoming data can improve performance over COPY
Design considerations
- Repeated reads against Amazon S3 are not transactional
- $5/TB of (compressed) data scanned
Design considerations: Data ingestion
Designed for large writes
- Batch processing system, optimized for processing massive amounts of data
- The 1 MB block size plus immutable blocks mean that blocks are cloned on write so as not to introduce fragmentation
- Small write (~1-10 rows) has a similar cost to a larger write (~100K rows)
UPDATE and DELETE
- Immutable blocks mean that rows are only logically deleted on UPDATE or DELETE
- Must VACUUM or deep copy to remove ghost rows from the table
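A deep copy (recreate and reload) can be sketched roughly as follows; table names are hypothetical:

    CREATE TABLE orders_new (LIKE orders);        -- copies column encodings, DISTKEY, and SORTKEY
    INSERT INTO orders_new SELECT * FROM orders;  -- bulk insert writes only live rows, fully sorted
    DROP TABLE orders;
    ALTER TABLE orders_new RENAME TO orders;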
Steps: deduplication / UPSERT
- Load CSV data into a staging table
- Delete duplicate data from the production table
- Insert (or append) data from the staging into the production table
- Wrap all of the above in a single, explicit transaction (see the sketch below)
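Put together, a hedged sketch of that workflow (table, key column, bucket, and IAM role are placeholders):

    BEGIN;
    CREATE TEMPORARY TABLE stage_orders (LIKE orders);
    COPY stage_orders FROM 's3://my-bucket/orders/part_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        GZIP DELIMITER '|';
    DELETE FROM orders USING stage_orders
        WHERE orders.order_id = stage_orders.order_id;  -- drop rows being re-loaded
    INSERT INTO orders SELECT * FROM stage_orders;
    COMMIT;  -- one commit for the whole workflow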
Best practices: ELT
Wrap workflow/statements in an explicit transaction
Consider using DROP TABLE or TRUNCATE instead of DELETE
Staging tables:
- Use a temporary table or a permanent table with the "BACKUP NO" option
- If possible, use DISTSTYLE KEY on both the staging table and the production table to speed up the INSERT INTO SELECT statement
- Turn off automatic compression - COMPUPDATE OFF
- Copy compression settings from the production table or use the ANALYZE COMPRESSION statement
- Use CREATE TABLE LIKE or write encodings into the DDL
- For copying a large number of rows (> hundreds of millions) consider using ALTER TABLE APPEND instead of INSERT INTO SELECT
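A sketch of a staging table that follows these guidelines (table names are hypothetical; the load itself would look like the earlier COPY sketches, with COMPUPDATE OFF added):

    -- Permanent staging table, excluded from snapshots; LIKE copies encodings, DISTKEY, and SORTKEY
    CREATE TABLE stage_sales (LIKE sales) BACKUP NO;
    -- ... COPY into stage_sales with COMPUPDATE OFF ...
    -- For very large loads, move the blocks instead of copying rows; stage_sales is left empty
    ALTER TABLE sales APPEND FROM stage_sales;

Note that ALTER TABLE APPEND cannot run inside an explicit transaction block, unlike INSERT INTO SELECT.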
VACUUM and ANALYZE
VACUUM will globally sort the table and remove rows that are marked as deleted
- For tables with a sort key, ingestion operations will locally sort new data and write it into the unsorted region
ANALYZE collects table statistics for optimal query planning
Best Practices:
- VACUUM should be run only as necessary
- Typically nightly or weekly
- Consider Deep Copy (recreating and copying data) for larger or wide tables
- ANALYZE can be run periodically after ingestion on just the columns that WHERE predicates filter on
- Utility to VACUUM and ANALYZE all the tables in the cluster
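For example (table name is hypothetical):

    VACUUM FULL sales;                -- globally sort and reclaim space from deleted (ghost) rows
    VACUUM DELETE ONLY sales;         -- reclaim space only, skipping the sort
    ANALYZE sales PREDICATE COLUMNS;  -- refresh statistics only for columns used as predicates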