Query Optimizations
-- Unoptimized
WHERE order_date >= DATE_TRUNC(CURRENT_DATE('Asia/Kolkata'), MONTH) - INTERVAL 1 MONTH
-- Optimized
WHERE order_date >= DATE_TRUNC(DATE_SUB(CURRENT_DATE('Asia/Kolkata'), INTERVAL 1 MONTH), MONTH) AND order_date <= CURRENT_DATE('Asia/Kolkata')
1. Use SELECT Specific Columns
Avoid SELECT *
as it scans all columns, including unnecessary ones. Instead, select only the columns you need.
Example:
SELECT customer_id, total_amount FROM orders;
2. Leverage Partitioning
- Partition tables by
DATE
orTIMESTAMP
columns to minimize the data scanned. - Query specific partitions using the
WHERE
clause.
Example:
SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';
3. Use Clustering
Cluster data based on frequently filtered columns (e.g., region
, product_id
). This optimizes how data is stored within partitions, speeding up queries.
Example:
CREATE TABLE sales (
region STRING, product_id STRING, revenue FLOAT64
) PARTITION BY DATE(sale_date) CLUSTER BY region,
product_id;
4. Filter Early in Queries
Use filters as early as possible in your query to minimize the amount of data processed.
Example:
-- Less efficient SELECT * FROM orders WHERE total_amount > 100;
-- More efficient
SELECT order_id, total_amount FROM orders WHERE total_amount > 100;
5. Use WITH Clauses (Common Table Expressions)
Break down complex queries into manageable, reusable blocks using WITH
.
Example:
WITH filtered_data AS (
SELECT
customer_id,
total_amount
FROM
orders
WHERE
order_date >= '2024-01-01'
)
SELECT
customer_id,
SUM(total_amount) AS total_spent
FROM
filtered_data
GROUP BY
customer_id;
6. Leverage Query Caching
BigQuery automatically caches query results. If the same query (without changes) is run within 24 hours, cached results are used, avoiding redundant computation.
7. Use APPROX Functions
For large datasets where approximate results are sufficient, use APPROX
functions like APPROX_COUNT_DISTINCT
to reduce processing.
Example:
SELECT APPROX_COUNT_DISTINCT(customer_id) AS unique_customers FROM orders;
8. Optimize JOINs
- Reduce JOIN data size by filtering and aggregating before performing the JOIN.
- Ensure the smaller table is on the right-hand side of the JOIN for performance.
Example:
-- Filter smaller dataset first
WITH filtered_customers AS (
SELECT
customer_id,
region
FROM
customers
WHERE
region = 'North America'
)
SELECT
o.order_id,
c.customer_id
FROM
orders o
JOIN filtered_customers c ON o.customer_id = c.customer_id;
9. Use ARRAY Aggregation and UNNEST Appropriately
Use ARRAY
types to combine multiple rows into a single row when applicable. When querying nested or repeated fields, use UNNEST
efficiently.
Example:
-- Efficiently flatten repeated fields
SELECT customer_id, item FROM orders, UNNEST(items) AS item
10. Limit Data with Pre-Aggregation
Pre-aggregate data using GROUP BY
or materialized views to reduce scanned rows in subsequent queries.
Example:
-- Pre-aggregate sales by region
CREATE MATERIALIZED VIEW regional_sales AS
SELECT
region,
SU (revenue) AS total_revenue
FROM
sales
GROUP BY
region;
11. Use TEMP Tables for Large Intermediate Results
Store intermediate results in temporary tables for reuse in multiple queries.
Example:
CREATE TEMP TABLE temp_results AS
SELECT
customer_id,
COUNT(order_id) AS order_count
FROM
orders
GROUP BY
customer_id;
SELECT
*
FROM
temp_results
WHERE
order_count > 10;
12. Optimize Window Functions
Minimize the use of window functions (OVER
clause) when not necessary. Use aggregate functions with GROUP BY
instead.
Example:
-- Instead of this
SELECT customer_id, SUM(total_amount) OVER(PARTITION BY customer_id) AS total_spent FROM orders;
-- Use this
SELECT customer_id, SUM(total_amount) AS total_spent FROM orders GROUP BY customer_id;
13. Materialized Views
Use materialized views for frequently run, repetitive queries. These precompute and store results, reducing query time.
- In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency.