Query Optimizations

-- Unoptimized
WHERE order_date >= DATE_TRUNC(CURRENT_DATE('Asia/Kolkata'), MONTH) - INTERVAL 1 MONTH

-- Optimized
WHERE order_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE('Asia/Kolkata'), INTERVAL 1 MONTH), MONTH)  AND order_date <= CURRENT_DATE('Asia/Kolkata')

1. Use SELECT Specific Columns

Avoid SELECT * as it scans all columns, including unnecessary ones. Instead, select only the columns you need.

Example:

SELECT customer_id, total_amount FROM orders;

2. Leverage Partitioning

Partition tables by DATE or TIMESTAMP columns to minimize the data scanned.
Query specific partitions using the WHERE clause.

Example:

SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

3. Use Clustering

Cluster data based on frequently filtered columns (e.g., region, product_id). This optimizes how data is stored within partitions, speeding up queries.

Example:

CREATE TABLE sales (
  region STRING, product_id STRING, revenue FLOAT64
) PARTITION BY DATE(sale_date) CLUSTER BY region,
product_id;

4. Filter Early in Queries

Use filters as early as possible in your query to minimize the amount of data processed.

Example:

-- Less efficient SELECT * FROM orders WHERE total_amount > 100;
-- More efficient
SELECT order_id, total_amount FROM orders WHERE total_amount > 100;

5. Use WITH Clauses (Common Table Expressions)

Break down complex queries into manageable, reusable blocks using WITH.

Example:

WITH filtered_data AS (
  SELECT
    customer_id,
    total_amount
  FROM
    orders
  WHERE
    order_date >= '2024-01-01'
)
SELECT
  customer_id,
  SUM(total_amount) AS total_spent
FROM
  filtered_data
GROUP BY
  customer_id;

6. Leverage Query Caching

BigQuery automatically caches query results. If the same query (without changes) is run within 24 hours, cached results are used, avoiding redundant computation.

7. Use APPROX Functions

For large datasets where approximate results are sufficient, use APPROX functions like APPROX_COUNT_DISTINCT to reduce processing.

Example:

SELECT APPROX_COUNT_DISTINCT(customer_id) AS unique_customers FROM orders;

8. Optimize JOINs

Reduce JOIN data size by filtering and aggregating before performing the JOIN.
Ensure the smaller table is on the right-hand side of the JOIN for performance.

Example:

-- Filter smaller dataset first
WITH filtered_customers AS (
  SELECT
    customer_id,
    region
  FROM
    customers
  WHERE
    region = 'North America'
)
SELECT
  o.order_id,
  c.customer_id
FROM
  orders o
  JOIN filtered_customers c ON o.customer_id = c.customer_id;

9. Use ARRAY Aggregation and UNNEST Appropriately

Use ARRAY types to combine multiple rows into a single row when applicable. When querying nested or repeated fields, use UNNEST efficiently.

Example:

-- Efficiently flatten repeated fields
SELECT customer_id, item FROM orders, UNNEST(items) AS item

10. Limit Data with Pre-Aggregation

Pre-aggregate data using GROUP BY or materialized views to reduce scanned rows in subsequent queries.

Example:

-- Pre-aggregate sales by region
CREATE MATERIALIZED VIEW regional_sales AS
SELECT
  region,
  SU (revenue) AS total_revenue
FROM
  sales
GROUP BY
  region;

11. Use TEMP Tables for Large Intermediate Results

Store intermediate results in temporary tables for reuse in multiple queries.

Example:

CREATE TEMP TABLE temp_results AS
SELECT
  customer_id,
  COUNT(order_id) AS order_count
FROM
  orders
GROUP BY
  customer_id;
SELECT
  *
FROM
  temp_results
WHERE
  order_count > 10;

12. Optimize Window Functions

Minimize the use of window functions (OVER clause) when not necessary. Use aggregate functions with GROUP BY instead.

Example:

-- Instead of this
SELECT customer_id, SUM(total_amount) OVER(PARTITION BY customer_id) AS total_spent FROM orders;
-- Use this
SELECT customer_id, SUM(total_amount) AS total_spent FROM orders GROUP BY customer_id;

13. Materialized Views

Use materialized views for frequently run, repetitive queries. These precompute and store results, reducing query time.

In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency.
Use Cases:
- Pre-aggregate data - Aggregation of streaming data.
- Pre-filter data - Run queries that only read a particular subset of the table.
- Pre-join data - Query joins, especially between large and small tables.
- Recluster data - Run queries that would benefit from a clustering scheme that differs from the base tables.

Comparison to other BigQuery techniques

The following table summarizes the similarities and differences between BigQuery caching, scheduled queries, logical views, and materialized views.

Component	Caching	Scheduled queries	Logical views	Materialized views
Optimize compute	Yes	No	No	Yes
Query support	All	All	All	Limited1
Partitioning and clustering	No	Yes	N/A	Yes
Incremental refresh	No	No	No	Yes
Additional storage	No	Yes	No	Yes
Query rewrite	No	No	No	Yes
Maintenance costs	No	Yes	N/A	Yes
Data staleness	Never	Yes	Never	Optional 2

Introduction to materialized views | BigQuery | Google Cloud

14. Avoid Cross Joins

Cross joins process the Cartesian product of two tables, scanning massive amounts of data. Avoid unless absolutely necessary.

15. Monitor Query Execution

Use the Query Execution Details tab in the BigQuery console to identify bottlenecks like high bytes processed or skewed slot utilization.

Others

Avoid Overuse of Nested Queries

Flatten nested queries into simpler steps where possible. Complex subqueries can slow down execution.

Leverage Table Sampling

Use TABLESAMPLE for debugging or testing on large datasets instead of processing the full table.

Example:

SELECT * FROM my_table TABLESAMPLE SYSTEM (10 PERCENT);

Use Query Parameters

Use parameterized queries instead of hardcoding values. This improves performance and allows caching.

Example:

DECLARE start_date DATE;
DECLARE end_date DATE;
SET
  start_date = '2024-01-01';
SET
  end_date = '2024-01-31';
SELECT
  *
FROM
  orders
WHERE
  order_date BETWEEN start_date
  AND end_date;

Use Native BigQuery Functions

Use built-in BigQuery functions like STRING_AGG, ARRAY_AGG, etc., instead of writing custom logic. These are highly optimized.

Streaming Inserts vs Batch Inserts

If you use real-time inserts, ensure the streaming buffer is not overutilized. Batch inserts with larger data sizes can be more cost-effective.

Avoid Repeated Computations

Compute values once and reuse them. For example, store common calculations in temporary tables or variables.

Limit Result Sets

Use LIMIT when you don’t need full datasets, especially for exploratory queries.

Optimize Data Formats

Prefer efficient data formats like Avro or Parquet over CSV, as they are compressed and optimized for analytical workloads.

Take Advantage of BI Engine

If using BigQuery for dashboards, enable BigQuery BI Engine for faster response times.

1. Use SELECT Specific Columns​

2. Leverage Partitioning​

3. Use Clustering​

4. Filter Early in Queries​

5. Use WITH Clauses (Common Table Expressions)​

6. Leverage Query Caching​

7. Use APPROX Functions​

8. Optimize JOINs​

9. Use ARRAY Aggregation and UNNEST Appropriately​

10. Limit Data with Pre-Aggregation​

11. Use TEMP Tables for Large Intermediate Results​

12. Optimize Window Functions​

13. Materialized Views​

Comparison to other BigQuery techniques​

Links​

14. Avoid Cross Joins​

15. Monitor Query Execution​

Others​

Avoid Overuse of Nested Queries​

Leverage Table Sampling​

Use Query Parameters​

Use Native BigQuery Functions​

Streaming Inserts vs Batch Inserts​

Avoid Repeated Computations​

Limit Result Sets​

Optimize Data Formats​

Take Advantage of BI Engine​

1. Use SELECT Specific Columns

2. Leverage Partitioning

3. Use Clustering

4. Filter Early in Queries

5. Use WITH Clauses (Common Table Expressions)

6. Leverage Query Caching

7. Use APPROX Functions

8. Optimize JOINs

9. Use ARRAY Aggregation and UNNEST Appropriately

10. Limit Data with Pre-Aggregation

11. Use TEMP Tables for Large Intermediate Results

12. Optimize Window Functions

13. Materialized Views

Comparison to other BigQuery techniques

Links

14. Avoid Cross Joins

15. Monitor Query Execution

Others

Avoid Overuse of Nested Queries

Leverage Table Sampling

Use Query Parameters

Use Native BigQuery Functions

Streaming Inserts vs Batch Inserts

Avoid Repeated Computations

Limit Result Sets

Optimize Data Formats

Take Advantage of BI Engine