Optimizing BigQuery SQL Queries: A Comprehensive Guide
Table of Contents
1. Introduction
2. Initial Query Analysis
3. Optimization Opportunities
○ 3.1 Minimize Data Scanning from Large Tables
○ 3.2 Optimize Regular Expressions and String Functions
○ 3.3 Reduce Redundant Data Processing
○ 3.4 Use QUALIFY and Window Functions Wisely
○ 3.5 Leverage BigQuery Partitioning and Clustering
○ 3.6 Optimize JSON Processing
○ 3.7 Simplify Complex Calculations
○ 3.8 Reduce Data in Early Stages
○ 3.9 Use Approximate Aggregations
○ 3.10 Analyze the Query Execution Plan
4. Correcting Runtime Errors
5. Rewritten Query with Optimizations
6. Conclusion
7. Additional Resources
Introduction
Optimizing SQL queries is essential for improving performance, reducing costs, and ensuring
efficient data processing, especially when dealing with large datasets in BigQuery. This guide
provides a step-by-step walkthrough of optimizing a complex BigQuery SQL query, identifying
inefficiencies, correcting errors, and applying best practices.
Initial Query Analysis
The original query performs multiple operations:
● Reads alert records from `server_incidents.maintenance_compil`.
● Selects the latest alert per client_id and server_id with a window function.
● Parses duration strings in server_disconnected (for example '2 hours', '3 days') to derive last_lost_date.
● Extracts elements from the JSON resolution column.
● Joins `datamarts.dm_sf_opportunities` on ClientId to pull opportunity data such as LastActivityDate.
● Computes derived fields such as SCORE and calledto_restart_server in the final SELECT.
While the query is functional, there are several areas where it can be optimized for better performance and maintainability.
Optimization Opportunities
1. Minimize Data Scanning from Large Tables
Issue: Reading maintenance_compil in full scans far more data than needed when only recent rows matter.
Solution:
● Filter Data Early: Apply a WHERE clause to limit the data based on a date range or
relevant conditions whenever you read from the table.
● Use a Base Common Table Expression (CTE): Read from maintenance_compil
once into a base CTE with necessary filters and select only the required columns.
Implementation:
WITH base_maintenance_compil AS (
  SELECT
    client_id,
    server_id,
    timestamp,
    -- Other required columns
  FROM `server_incidents.maintenance_compil`
  WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
)
By filtering data early, subsequent operations work with a smaller dataset, improving
performance.
2. Optimize Regular Expressions and String Functions
Issue: Regular expression functions such as REGEXP_CONTAINS and REGEXP_EXTRACT are more computationally expensive than plain string functions.
Solution:
● Simplify Regular Expressions: Use simpler patterns or string functions where possible.
● Use String Functions: Replace regex functions with LIKE, SPLIT, or SUBSTR when
appropriate.
Implementation:
Replace:

REGEXP_CONTAINS(server_disconnected, r'(\d+)\s*hour')

With:

server_disconnected LIKE '%hour%'

Note that LIKE '%hour%' only checks that the unit is present; to extract the numeric value itself, split the string instead of using a regex:

CAST(SPLIT(server_disconnected, ' ')[OFFSET(0)] AS INT64)
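A guarded variant of the extraction (a sketch; SAFE_CAST and SAFE_OFFSET return NULL instead of raising an error when the first token is not numeric, and hours_down is a hypothetical alias):

SELECT
  server_disconnected,
  CASE
    WHEN server_disconnected LIKE '%hour%'
      THEN SAFE_CAST(SPLIT(server_disconnected, ' ')[SAFE_OFFSET(0)] AS INT64)
  END AS hours_down  -- NULL when the string has no leading number
FROM base_maintenance_compil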
3. Reduce Redundant Data Processing
Issue: The same table is joined multiple times just to get client_id, leading to redundant data processing.
Solution:
● Eliminate Unnecessary Joins: Use the existing list of client_id values from
previous CTEs to avoid extra joins.
Implementation:
opportunity_data AS (
  SELECT
    io.ClientId,
    -- Other columns
  FROM `datamarts.dm_sf_opportunities` AS io
  JOIN (SELECT DISTINCT client_id FROM last_alerts) AS l
    ON io.ClientId = l.client_id
)
Apply similar changes to other CTEs that unnecessarily join the same tables.
4. Use QUALIFY and Window Functions Wisely
Issue: Running the same window function or QUALIFY filter more than once repeats the deduplication work and can reintroduce duplicates.
Solution:
● Deduplicate Once: Compute the latest record per client_id and server_id in a single CTE and reuse it, as in the sketch below.
Implementation:
Since last_alerts already contains the latest records, remove any additional QUALIFY clauses in the final SELECT.
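A minimal sketch of the single deduplication step (mirroring the last_alerts CTE in the rewritten query at the end of this guide):

last_alerts AS (
  SELECT
    client_id,
    server_id,
    timestamp
    -- Other required columns
  FROM base_maintenance_compil
  WHERE TRUE  -- BigQuery expects a WHERE/GROUP BY/HAVING clause alongside QUALIFY
  QUALIFY ROW_NUMBER() OVER (PARTITION BY client_id, server_id ORDER BY timestamp DESC) = 1
)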
5. Leverage BigQuery Partitioning and Clustering
Issue: Scanning entire tables without utilizing partitioning and clustering can be inefficient.
Solution:
● Partition Tables: Ensure large tables are partitioned by date or another logical
partitioning key.
● Cluster Tables: Cluster tables on columns frequently used in filters or joins.
Implementation:
If you have control over the dataset schema, partition maintenance_compil by timestamp and cluster by client_id or server_id.
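If the table must be rebuilt, a hedged DDL sketch (the _partitioned table name is hypothetical):

CREATE TABLE `server_incidents.maintenance_compil_partitioned`
PARTITION BY DATE(timestamp)     -- lets the WHERE filter from section 1 prune whole partitions
CLUSTER BY client_id, server_id  -- co-locates rows used in joins and filters
AS
SELECT * FROM `server_incidents.maintenance_compil`

With the 30-day timestamp filter from section 1 in place, BigQuery then scans only the partitions inside that window instead of the whole table.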
6. Optimize JSON Processing
Issue: Unnesting and parsing JSON for every row is expensive, especially when it happens before filtering.
Solution:
● Filter Before Unnesting: Apply necessary filters before unnesting to reduce the number
of elements processed.
● Use Efficient Functions: Prefer JSON_VALUE over JSON_EXTRACT_SCALAR and JSON_QUERY_ARRAY over JSON_EXTRACT_ARRAY where appropriate.
Implementation:
resolution_data AS (
  SELECT
    client_id,
    JSON_QUERY_ARRAY(resolution) AS resolution_array
  FROM last_alerts
),
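Downstream, scalar fields can then be pulled out with JSON_VALUE after unnesting (a sketch; the $.source path comes from the original query):

SELECT
  r.client_id,
  JSON_VALUE(res, '$.source') AS source  -- JSON_VALUE returns a scalar, replacing JSON_EXTRACT_SCALAR
FROM resolution_data AS r,
  UNNEST(r.resolution_array) AS res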
7. Simplify Complex Calculations
Issue: Complex CASE statements and calculations in the final SELECT can be hard to read and
maintain.
Solution:
● Compute Values in CTEs: Move complex calculations to earlier CTEs for clarity and
reusability.
● Simplify Logic: Review and simplify conditions in CASE statements where possible.
Implementation:
score_calculation AS (
  SELECT
    l.client_id,
    l.server_id,
    CASE
      -- Complex conditions
    END AS SCORE
  FROM last_alerts AS l
  -- Necessary joins
)
8. Reduce Data in Early Stages
Issue: Processing unnecessary columns or records can slow down the query.
Solution:
● Select Only Needed Columns: In each CTE, select only the columns required for
downstream processing.
● Apply Filters Early: Use WHERE clauses as soon as possible to limit the data being
processed.
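For example, a sketch of pruning opportunity_data at the source (the LastActivityDate filter and its 90-day window are hypothetical; keep only the columns used downstream):

opportunity_data AS (
  SELECT
    io.ClientId,
    io.LastActivityDate  -- only the two columns the final SELECT needs
  FROM `datamarts.dm_sf_opportunities` AS io
  WHERE io.LastActivityDate >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
)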
9. Use Approximate Aggregations
Issue: Exact counts can be expensive and unnecessary when approximate values suffice.
Solution:
● Use Approximate Functions: Functions like APPROX_COUNT_DISTINCT provide faster results with minimal accuracy loss.
Implementation:
Replace:

COUNT(DISTINCT some_column)

With:

APPROX_COUNT_DISTINCT(some_column)
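For example, a sketch counting distinct servers per client with this guide's column names:

SELECT
  client_id,
  APPROX_COUNT_DISTINCT(server_id) AS approx_server_count  -- HyperLogLog-based estimate
FROM base_maintenance_compil
GROUP BY client_id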
10. Analyze the Query Execution Plan
Issue: Without understanding the execution plan, it's challenging to identify bottlenecks.
Solution:
● Inspect Execution Details: BigQuery has no EXPLAIN statement; after a query runs, open the "Execution details" tab in the BigQuery console, or read the job statistics from the jobs API, to see per-stage timing and data volumes.
● Dry-Run Queries: Run a dry run first to estimate how many bytes a query would scan before executing it.
Implementation:
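One SQL-only option is to review recent job statistics from INFORMATION_SCHEMA (a sketch; the region-us qualifier and the one-day window are assumptions to adapt to your project):

SELECT
  job_id,
  total_bytes_processed,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)  -- look back one day
ORDER BY total_bytes_processed DESC  -- most expensive jobs first
LIMIT 10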
Correcting Runtime Errors
Problem 1: TIMESTAMP_SUB fails on MONTH and WEEK intervals.
Error Message:

TIMESTAMP_SUB does not support the MONTH date part when the argument is TIMESTAMP type

TIMESTAMP_SUB only accepts date parts of DAY and smaller; MONTH and WEEK arithmetic requires a DATE or DATETIME argument.
Solution: Cast the TIMESTAMP to DATETIME and use DATETIME_SUB for MONTH and WEEK intervals, keeping TIMESTAMP_SUB for DAY and HOUR.
Implementation:
CASE
  WHEN server_disconnected IS NULL THEN 'Never been down'
  WHEN server_disconnected LIKE '%month%' THEN
    CAST(DATETIME_SUB(CAST(timestamp AS DATETIME),
      INTERVAL CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) MONTH) AS STRING)
  WHEN server_disconnected LIKE '%week%' THEN
    CAST(DATETIME_SUB(CAST(timestamp AS DATETIME),
      INTERVAL CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) WEEK) AS STRING)
  WHEN server_disconnected LIKE '%day%' THEN
    CAST(TIMESTAMP_SUB(timestamp,
      INTERVAL CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) DAY) AS STRING)
  WHEN server_disconnected LIKE '%hour%' THEN
    CAST(TIMESTAMP_SUB(timestamp,
      INTERVAL CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) HOUR) AS STRING)
  ELSE 'not found'
END AS last_lost_date
Problem 2: The same error appears in the calledto_restart_server calculation.
● As above, using TIMESTAMP_SUB with MONTH or WEEK intervals causes an error.
Solution: Apply the same DATETIME_SUB fix, and cast both sides of each comparison to the same type.
Implementation:
CASE
  WHEN l.server_disconnected IS NULL THEN 'No'
  WHEN l.server_disconnected LIKE '%month%' THEN
    CASE
      WHEN DATETIME_SUB(CAST(l.timestamp AS DATETIME),
        INTERVAL CAST(REGEXP_EXTRACT(l.server_disconnected, r'\d+') AS INT64) MONTH)
        > CAST(io.LastActivityDate AS DATETIME) THEN 'No'
      ELSE 'Yes'
    END
  WHEN l.server_disconnected LIKE '%week%' THEN
    CASE
      WHEN DATETIME_SUB(CAST(l.timestamp AS DATETIME),
        INTERVAL CAST(REGEXP_EXTRACT(l.server_disconnected, r'\d+') AS INT64) WEEK)
        > CAST(io.LastActivityDate AS DATETIME) THEN 'No'
      ELSE 'Yes'
    END
  WHEN l.server_disconnected LIKE '%day%' THEN
    CASE
      WHEN TIMESTAMP_SUB(l.timestamp,
        INTERVAL CAST(REGEXP_EXTRACT(l.server_disconnected, r'\d+') AS INT64) DAY)
        > io.LastActivityDate THEN 'No'
      ELSE 'Yes'
    END
  WHEN l.server_disconnected LIKE '%hour%' THEN
    CASE
      WHEN TIMESTAMP_SUB(l.timestamp,
        INTERVAL CAST(REGEXP_EXTRACT(l.server_disconnected, r'\d+') AS INT64) HOUR)
        > io.LastActivityDate THEN 'No'
      ELSE 'Yes'
    END
  ELSE 'No'
END AS calledto_restart_server
Best Practices:
● Data Type Consistency: Ensure both sides of comparisons are of the same data type.
● Avoid Unnecessary Casting: Simplify code by avoiding redundant casts and conversions.
● Verify Column Data Types: Check the data types of columns like io.LastActivityDate so you can cast them appropriately.
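As a quick sanity check, the sketch below exercises the date parts each function accepts (all three expressions are valid; swapping MONTH into TIMESTAMP_SUB reproduces the error above):

SELECT
  TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AS ts_day_ok,    -- TIMESTAMP_SUB: DAY and smaller only
  DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1 WEEK)  AS dt_week_ok,   -- DATETIME_SUB: WEEK supported
  DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 1 MONTH) AS dt_month_ok   -- DATETIME_SUB: MONTH supported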
Rewritten Query with Optimizations
Putting the corrections and optimizations together, the core of the rewritten query looks like this:
WITH
base_maintenance_compil AS (
  SELECT
    client_id,
    server_id,
    timestamp,
    server_disconnected,
    unstable_stream,
    fps_fail_ratio_last_24h,
    trashed_alert,
    dropped_alert,
    gpu_restart_indicator,
    gpu_import_indicator,
    minimum_temperature_last_24h,
    streams,
    cpu_full,
    disk_full,
    upload,
    download,
    watcher_disconnected,
    resolution,
    fibre,
    version,
    movement_percentage_last_45_minutes,
    server_detection_rate,
    reason
  FROM `server_incidents.maintenance_compil`
  WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
),
last_alerts AS (
  SELECT
    client_id,
    server_id,
    version,
    server_disconnected,
    streams AS theoretical_flux_capacity,
    server_detection_rate,
    movement_percentage_last_45_minutes,
    fps_fail_ratio_last_24h,
    gpu_restart_indicator,
    trashed_alert,
    dropped_alert,
    unstable_stream,
    gpu_import_indicator,
    minimum_temperature_last_24h,
    streams,
    cpu_full,
    disk_full,
    upload,
    download,
    watcher_disconnected,
    resolution,
    fibre,
    reason,
    timestamp,
    CASE
      WHEN server_disconnected IS NULL THEN 'Never been down'
      WHEN server_disconnected LIKE '%month%' THEN
        CAST(DATETIME_SUB(CAST(timestamp AS DATETIME),
          INTERVAL CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) MONTH) AS STRING)
      WHEN server_disconnected LIKE '%week%' THEN
        CAST(DATETIME_SUB(CAST(timestamp AS DATETIME),
          INTERVAL CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) WEEK) AS STRING)
      WHEN server_disconnected LIKE '%day%' THEN
        CAST(TIMESTAMP_SUB(timestamp,
          INTERVAL CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) DAY) AS STRING)
      WHEN server_disconnected LIKE '%hour%' THEN
        CAST(TIMESTAMP_SUB(timestamp,
          INTERVAL CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) HOUR) AS STRING)
      ELSE 'not found'
    END AS last_lost_date
  FROM base_maintenance_compil
  WHERE TRUE  -- BigQuery expects a WHERE/GROUP BY/HAVING clause alongside QUALIFY
  QUALIFY ROW_NUMBER() OVER (PARTITION BY client_id, server_id ORDER BY timestamp DESC) = 1
),
-- Additional CTEs (opportunity_data, resolution_data, score_calculation) go here
SELECT
  -- Final SELECT statement with all required columns and calculations
FROM last_alerts AS l
-- Necessary LEFT JOINs with other CTEs
ORDER BY l.timestamp DESC
Conclusion
By applying the discussed optimizations and corrections, the query becomes more efficient,
readable, and maintainable. Key takeaways include:
● Filtering Data Early: Reduces the amount of data processed in subsequent steps.
● Simplifying Functions: Improves performance by using less computationally intensive
operations.
● Eliminating Redundant Joins: Streamlines the query and avoids unnecessary data
processing.
● Handling Errors Promptly: Resolves issues by understanding function limitations and data type requirements.
● Organizing Calculations: Enhances readability and makes the query easier to debug
and maintain.
Additional Resources
● BigQuery Best Practices: Performance Optimization
● SQL Query Optimization Techniques: SQL Optimization Guide
● Understanding CTEs: Common Table Expressions (CTEs)
● BigQuery Date and Time Functions: Date Functions
By studying these optimizations and the reasoning behind them, you'll enhance your SQL skills
and be better equipped to write efficient queries in the future. Remember to test your queries
incrementally and validate results at each step to ensure accuracy.
Feel free to reach out if you have any questions or need further clarification on any of the topics
covered.