

Optimizing BigQuery SQL Queries: A

Comprehensive Guide
Table of Contents
1. Introduction
2. Overview of the Original Query
3. Identifying Optimization Opportunities
○ 3.1 Minimize Data Scanning from Large Tables
○ 3.2 Optimize Regular Expressions and String Functions
○ 3.3 Reduce Redundant Data Processing
○ 3.4 Use QUALIFY and Window Functions Wisely
○ 3.5 Leverage BigQuery Partitioning and Clustering
○ 3.6 Optimize JSON Processing
○ 3.7 Simplify Complex Calculations
○ 3.8 Reduce Data in Early Stages
○ 3.9 Use Approximate Aggregations
○ 3.10 Analyze Query Execution with EXPLAIN
4. Rewritten Query with Optimizations
5. Conclusion
6. Additional Resources

Introduction
Optimizing SQL queries is crucial for improving performance, reducing costs, and ensuring
efficient data processing, especially when dealing with large datasets in BigQuery. This guide
walks you through the process of optimizing a complex SQL query by identifying inefficiencies
and applying best practices.

Overview of the Original Query


The original query performs multiple operations:

● Aggregates alert counts over the last 30 days.


● Calculates daily and average problems from server incidents.
● Extracts and processes JSON data.
● Joins multiple datasets to enrich the final output.
● Applies complex transformations and calculations in the final SELECT statement.

While functional, the query can be optimized for better performance and efficiency.

Identifying Optimization Opportunities


1. Minimize Data Scanning from Large Tables

Issue: The server_incidents.maintenance_compil table is accessed multiple times without filtering, leading to unnecessary data scanning.

Solution:

● Filter Data Early: Apply a WHERE clause to limit the data based on a date range or
relevant conditions whenever you read from the table.
● Use a Base Common Table Expression (CTE): Read from maintenance_compil
once into a base CTE with necessary filters and select only the required columns.

Implementation:

sql
WITH base_maintenance_compil AS (
SELECT
client_id,
server_id,
timestamp,
-- Add other required columns
FROM `server_incidents.maintenance_compil`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30
DAY)
)

By filtering the data early, subsequent operations work with a smaller dataset, improving
performance.
2. Optimize Regular Expressions and String Functions

Issue: Using REGEXP_CONTAINS and REGEXP_EXTRACT extensively can be computationally expensive.

Solution:

● Simplify Regular Expressions: Combine or simplify patterns where possible.


● Use String Functions: Replace regex functions with simpler string operations like
LIKE, SUBSTR, or INSTR when appropriate.

Implementation:

Replace:

sql
REGEXP_CONTAINS(server_disconnected, r'(\d+)\s*hour')

With:

sql
server_disconnected LIKE '%hour%'

This reduces computational overhead and improves query performance.
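Where you still need the numeric value, a cheap LIKE check can act as a guard so the more expensive regular expression only runs on rows that can actually match. A minimal sketch, assuming the column names used elsewhere in this guide:

sql
-- Sketch: LIKE pre-filters the rows; REGEXP_EXTRACT runs only where it can match.
SELECT
  client_id,
  CASE
    WHEN server_disconnected LIKE '%hour%'
      THEN CAST(REGEXP_EXTRACT(server_disconnected, r'(\d+)\s*hour') AS INT64)
  END AS hours_disconnected
FROM base_maintenance_compil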

3. Reduce Redundant Data Processing

Issue: The same table is joined multiple times just to get client_id, leading to redundant
data processing.

Solution:

● Eliminate Unnecessary Joins: Use the existing list of client_id values from
previous CTEs to avoid extra joins.

Implementation:

Modify the opportunity_data CTE:

sql
opportunity_data AS (
SELECT
io.ClientId,
-- Other columns
FROM `datamarts.dm_sf_opportunities` AS io
JOIN (SELECT DISTINCT client_id FROM last_alerts) AS l
ON io.ClientId = l.client_id
)

Apply similar changes to other CTEs that unnecessarily join the same tables.

4. Use QUALIFY and Window Functions Wisely

Issue: Repeatedly applying QUALIFY ROW_NUMBER() can lead to unnecessary computations.

Solution:

● Consolidate Filtering: Ensure that ROW_NUMBER() is only applied where necessary.


● Prevent Duplicate Rows: Structure joins carefully to avoid duplication, eliminating the
need for extra filtering.

Implementation:

Ensure that the last_alerts CTE correctly selects the latest records and that subsequent
joins do not reintroduce duplicates.
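One way to consolidate this is to deduplicate once in a dedicated CTE and reuse it everywhere else, rather than repeating the window function. A minimal sketch using QUALIFY, assuming the columns from this query:

sql
-- Sketch: keep only the latest row per client_id/server_id, in one place.
last_alerts AS (
  SELECT *
  FROM base_maintenance_compil
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY client_id, server_id
    ORDER BY timestamp DESC
  ) = 1
)

Downstream CTEs can then join to last_alerts without re-ranking or re-filtering.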

5. Leverage BigQuery Partitioning and Clustering

Issue: Scanning entire tables without utilizing partitioning and clustering can be inefficient.

Solution:

● Partition Tables: Ensure large tables are partitioned by date or another logical
partitioning key.
● Cluster Tables: Cluster tables on columns frequently used in filters or joins.

Implementation:
If you have control over the dataset schema, partition maintenance_compil by timestamp
and cluster by client_id or server_id.
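If recreating the table is acceptable, a CREATE TABLE AS SELECT statement along these lines sets up both properties (the _partitioned table name is illustrative):

sql
-- Sketch: build a partitioned and clustered copy of the table.
CREATE TABLE `server_incidents.maintenance_compil_partitioned`
PARTITION BY DATE(timestamp)
CLUSTER BY client_id, server_id AS
SELECT *
FROM `server_incidents.maintenance_compil`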

6. Optimize JSON Processing

Issue: Unnesting and processing JSON arrays can be resource-intensive.

Solution:

● Filter Before Unnesting: Apply any possible filters before UNNEST to reduce the
number of elements processed.
● Use Efficient Functions: Use JSON_VALUE and JSON_QUERY instead of
JSON_EXTRACT_SCALAR and JSON_EXTRACT_ARRAY when appropriate.

Implementation:

Replace:

sql
JSON_EXTRACT_SCALAR(res, "$.source")

With:

sql
JSON_VALUE(res, "$.source")

These functions are designed for scalar values and can improve performance.
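To filter before unnesting, restrict the rows in a subquery (or an earlier CTE) and only then expand the JSON array. A minimal sketch, assuming resolution holds a JSON array as in this query:

sql
-- Sketch: drop rows without a resolution payload before UNNEST runs.
SELECT
  filtered.client_id,
  JSON_VALUE(res, "$.source") AS source
FROM (
  SELECT client_id, resolution
  FROM base_maintenance_compil
  WHERE resolution IS NOT NULL
) AS filtered,
UNNEST(JSON_QUERY_ARRAY(filtered.resolution)) AS res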

7. Simplify Complex Calculations

Issue: Complex CASE statements and calculations in the final SELECT can be hard to read and
maintain.

Solution:

● Compute Values in CTEs: Move complex calculations to earlier CTEs for clarity and
reusability.
● Simplify Logic: Review and simplify conditions in CASE statements where possible.

Implementation:

Create a last_lost_dates CTE:

sql
last_lost_dates AS (
SELECT
client_id,
server_id,
CASE
WHEN server_disconnected IS NULL THEN 'Never been down'
-- Other conditions
END AS last_lost_date
FROM last_alerts
)

This makes the main SELECT cleaner and easier to understand.

8. Reduce Data in Early Stages

Issue: Processing large volumes of data in early stages affects overall query performance.

Solution:

● Select Only Necessary Columns: In each CTE, only include columns that are needed
in subsequent steps.
● Apply Filters Early: Use WHERE clauses as soon as possible to limit the data being processed, as shown in the sketch below.
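A minimal sketch of both ideas combined (the slim_alerts name and the filter condition are illustrative, not part of the original query):

sql
-- Sketch: keep only the columns later CTEs use, and filter as early as possible.
slim_alerts AS (
  SELECT
    client_id,
    server_id,
    timestamp,
    server_disconnected
  FROM base_maintenance_compil
  WHERE server_disconnected IS NOT NULL
)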

9. Use Approximate Aggregations

Issue: Exact counts can be expensive and unnecessary when approximate values suffice.

Solution:
● Use Approximate Functions: Functions like APPROX_COUNT_DISTINCT provide faster
results with minimal accuracy loss.

Implementation:

If precise counts aren't essential, replace:

sql
COUNT(DISTINCT some_column)

With:

sql
APPROX_COUNT_DISTINCT(some_column)

10. Analyze Query Execution with EXPLAIN

Issue: Without understanding the execution plan, it's challenging to identify bottlenecks.

Solution:

● Use EXPLAIN: Analyze the query execution plan to identify inefficiencies.

Implementation:

Before running the query, execute:

sql
EXPLAIN
SELECT -- your query here

Review the output to understand how BigQuery processes the query.

Rewritten Query with Optimizations


Below is the optimized version of your query, incorporating the discussed improvements.
sql
WITH
base_maintenance_compil AS (
SELECT
client_id,
server_id,
timestamp,
version,
server_disconnected,
unstable_stream,
fps_fail_ratio_last_24h,
trashed_alert,
dropped_alert,
gpu_restart_indicator,
gpu_import_indicator,
minimum_temperature_last_24h,
streams,
cpu_full,
disk_full,
upload,
download,
watcher_disconnected,
resolution,
fibre,
movement_percentage_last_45_minutes,
server_detection_rate
FROM `server_incidents.maintenance_compil`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30
DAY)
),
alerts_last_30_days AS (
SELECT
clientid AS client_id,
COUNT(alertid) AS alert_count
FROM `datamarts.dm_vn_alerts`
WHERE SentAt >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30
DAY)
GROUP BY clientid
),
daily_problems AS (
SELECT
client_id,
server_id,
DATE(timestamp) AS date,
SUM(CASE
WHEN server_disconnected LIKE '%hour%' THEN
CAST(REGEXP_EXTRACT(server_disconnected, r'(\d+)\s*hour') AS INT64) * 3600
WHEN server_disconnected LIKE '%minute%' THEN
CAST(REGEXP_EXTRACT(server_disconnected, r'(\d+)\s*minute') AS INT64) * 60
ELSE 0
END) AS total_downtime_seconds,
AVG(unstable_stream) AS avg_unstable_flux,
AVG(fps_fail_ratio_last_24h) AS avg_fail_fps,
AVG(trashed_alert) AS avg_trashed_alert,
AVG(dropped_alert) AS avg_dropped_alert
FROM base_maintenance_compil
GROUP BY client_id, server_id, DATE(timestamp)
),
avg_problems AS (
SELECT
client_id,
server_id,
SUM(total_downtime_seconds) AS total_downtime_seconds,
30 * 24 * 3600 AS total_available_seconds,
(SUM(total_downtime_seconds) / (30 * 24 * 3600)) * 100 AS
downtime_percentage,
AVG(avg_unstable_flux) AS daily_avg_unstable_flux_percentage,
AVG(avg_fail_fps) AS daily_avg_fail_fps_percentage,
AVG(avg_trashed_alert) AS daily_avg_trashed_alert,
AVG(avg_dropped_alert) AS daily_avg_dropped_alert
FROM daily_problems
GROUP BY client_id, server_id
),
last_alerts AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY client_id, server_id ORDER BY
timestamp DESC) AS rn
FROM base_maintenance_compil
),
filtered_last_alerts AS (
SELECT * FROM last_alerts WHERE rn = 1
),
recurring_alerts AS (
SELECT
client_id,
server_id,
COUNTIF(gpu_restart_indicator > 0) AS alert_restart_count,
COUNTIF(trashed_alert > 0) AS alert_trashed_count,
COUNTIF(dropped_alert > 0) AS alert_dropped_count,
COUNTIF(unstable_stream > 0) AS alert_unstable_count,
COUNTIF(gpu_import_indicator > 0) AS alert_import_count
FROM base_maintenance_compil
GROUP BY client_id, server_id
HAVING COUNT(*) > 1
),
resolution_data AS (
SELECT
client_id,
JSON_QUERY_ARRAY(resolution) AS resolution_array
FROM base_maintenance_compil
),
extracted_data AS (
SELECT
client_id,
JSON_VALUE(res, "$.source") AS source,
JSON_VALUE(res, "$.width") AS width,
JSON_VALUE(res, "$.height") AS height,
REGEXP_EXTRACT(JSON_VALUE(res, "$.source"),
r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})") AS ip_address
FROM resolution_data,
UNNEST(resolution_array) AS res
),
ranked_data AS (
SELECT
client_id,
source,
width,
height,
ip_address,
ROW_NUMBER() OVER (PARTITION BY client_id, ip_address ORDER BY
client_id) AS row_num,
CASE
WHEN CAST(width AS INT64) < 600
OR CAST(width AS INT64) > 2000
OR CAST(height AS INT64) < 600
OR CAST(height AS INT64) > 2000 THEN '🔴'
WHEN width IS NULL OR height IS NULL THEN 'NA'
ELSE '🟢'
END AS resolution_status
FROM extracted_data
),
opportunity_data AS (
SELECT
io.ClientId,
io.ResellerName,
io.StoreCountryCode,
io.LastActivityDate,
io.TypeOfSale,
io.CameraRange AS camera_range,
CAST(SPLIT(io.CameraRange, '-')[SAFE_OFFSET(0)] AS INT64) AS
min_camera_range,
CAST(SPLIT(io.CameraRange, '-')[SAFE_OFFSET(1)] AS INT64) AS
max_camera_range
FROM `datamarts.dm_sf_opportunities` AS io
JOIN (SELECT DISTINCT client_id FROM filtered_last_alerts) AS l
ON io.ClientId = l.client_id
),
subscription_data AS (
SELECT
sd.ClientId,
sd.IsChurned,
sd.Active AS is_active
FROM `datamarts.dm_sf_subscriptions` AS sd
JOIN (SELECT DISTINCT client_id FROM filtered_last_alerts) AS l
ON sd.ClientId = l.client_id
),
last_lost_dates AS (
SELECT
client_id,
server_id,
CASE
WHEN server_disconnected IS NULL THEN 'Never been down'
WHEN server_disconnected LIKE '%month%' THEN
TIMESTAMP_SUB(timestamp, INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'(\d+)') AS INT64) MONTH)
WHEN server_disconnected LIKE '%week%' THEN
TIMESTAMP_SUB(timestamp, INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'(\d+)') AS INT64) WEEK)
WHEN server_disconnected LIKE '%day%' THEN
TIMESTAMP_SUB(timestamp, INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'(\d+)') AS INT64) DAY)
WHEN server_disconnected LIKE '%hour%' THEN
TIMESTAMP_SUB(timestamp, INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'(\d+)') AS INT64) HOUR)
ELSE NULL
END AS last_lost_date
FROM filtered_last_alerts
)
SELECT
l.client_id,
l.server_id,
l.server_disconnected AS lost_duration,
l.streams AS theoretical_flux_capacity,
l.server_detection_rate AS alert_type,
l.timestamp AS last_alert_date,
CASE WHEN l.server_disconnected IS NOT NULL THEN 'Lost ❌' ELSE
'Active ✅' END AS server_status,
TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), l.timestamp, HOUR) AS
hours_since_alert,
-- Other columns and calculations
FROM filtered_last_alerts AS l
LEFT JOIN recurring_alerts AS ra ON l.client_id = ra.client_id AND
l.server_id = ra.server_id
LEFT JOIN avg_problems AS ap ON l.client_id = ap.client_id AND
l.server_id = ap.server_id
LEFT JOIN `server_incidents.deep_dive_image_quality` AS dd ON
l.client_id = dd.name
LEFT JOIN alerts_last_30_days AS al ON l.client_id = al.client_id
LEFT JOIN ranked_data AS rkd ON l.client_id = rkd.client_id
LEFT JOIN opportunity_data AS io ON l.client_id = io.ClientId
LEFT JOIN subscription_data AS sd ON l.client_id = sd.ClientId
LEFT JOIN last_lost_dates AS ld ON l.client_id = ld.client_id AND
l.server_id = ld.server_id
LEFT JOIN `veesion_manager.machines_machine` AS mach ON mach.hostname
= l.server_id
ORDER BY l.timestamp DESC

Conclusion
By applying these optimizations, the query becomes more efficient, readable, and maintainable.
Key takeaways include:

● Filtering Data Early: Reduces the amount of data processed in subsequent steps.
● Simplifying Functions: Improves performance by using less computationally intensive
operations.
● Eliminating Redundant Joins: Streamlines the query and avoids unnecessary data
processing.
● Organizing Calculations: Enhances readability and makes the query easier to debug
and maintain.

Additional Resources
● BigQuery Best Practices: Performance Optimization
● SQL Query Optimization Techniques: SQL Optimization Guide
● Understanding CTEs: Common Table Expressions (CTEs)

Optimizing BigQuery SQL Queries: A


Comprehensive Guide
Table of Contents
1. Introduction
2. Initial Query Analysis
3. Optimization Opportunities
○ 3.1 Minimize Data Scanning from Large Tables
○ 3.2 Optimize Regular Expressions and String Functions
○ 3.3 Reduce Redundant Data Processing
○ 3.4 Use QUALIFY and Window Functions Wisely
○ 3.5 Leverage BigQuery Partitioning and Clustering
○ 3.6 Optimize JSON Processing
○ 3.7 Simplify Complex Calculations
○ 3.8 Reduce Data in Early Stages
○ 3.9 Use Approximate Aggregations
○ 3.10 Analyze Query Execution with EXPLAIN
4. Rewriting the Query with Optimizations
5. Handling Errors and Corrections
○ 5.1 Error: TIMESTAMP_SUB Does Not Support MONTH Interval
○ 5.2 Correcting the last_lost_date Calculation
○ 5.3 Error in calledto_restart_server Calculation
○ 5.4 Final Corrections and Recommendations
6. Final Optimized Query
7. Conclusion
8. Additional Resources

Introduction
Optimizing SQL queries is essential for improving performance, reducing costs, and ensuring
efficient data processing, especially when dealing with large datasets in BigQuery. This guide
provides a step-by-step walkthrough of optimizing a complex BigQuery SQL query, identifying
inefficiencies, correcting errors, and applying best practices.
Initial Query Analysis
The original query performs multiple operations:

● Aggregates alert counts over the last 30 days.


● Calculates daily and average problems from server incidents.
● Extracts and processes JSON data.
● Joins multiple datasets to enrich the final output.
● Applies complex transformations and calculations in the final SELECT statement.

While the query is functional, there are several areas where it can be optimized for better
performance and maintainability.

Optimization Opportunities
1. Minimize Data Scanning from Large Tables

Issue: The server_incidents.maintenance_compil table is accessed multiple times without adequate filtering, leading to unnecessary data scanning.

Solution:

● Filter Data Early: Apply a WHERE clause to limit the data based on a date range or
relevant conditions whenever you read from the table.
● Use a Base Common Table Expression (CTE): Read from maintenance_compil
once into a base CTE with necessary filters and select only the required columns.

Implementation:

sql
WITH base_maintenance_compil AS (
SELECT
client_id,
server_id,
timestamp,
-- Other required columns
FROM `server_incidents.maintenance_compil`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30
DAY)
)

By filtering data early, subsequent operations work with a smaller dataset, improving
performance.

2. Optimize Regular Expressions and String Functions

Issue: Extensive use of REGEXP_CONTAINS and REGEXP_EXTRACT can be computationally expensive.

Solution:

● Simplify Regular Expressions: Use simpler patterns or string functions where possible.
● Use String Functions: Replace regex functions with LIKE, SPLIT, or SUBSTR when
appropriate.

Implementation:

Replace:

sql
REGEXP_CONTAINS(server_disconnected, r'(\d+)\s*hour')

With:

sql
server_disconnected LIKE '%hour%'

For extracting numeric values:

sql
CAST(SPLIT(server_disconnected, ' ')[OFFSET(0)] AS INT64)

This reduces computational overhead and improves query performance.


3. Reduce Redundant Data Processing

Issue: The same table is joined multiple times to get client_id, leading to redundant data
processing.

Solution:

● Eliminate Unnecessary Joins: Use the existing list of client_id values from
previous CTEs to avoid extra joins.

Implementation:

Modify the opportunity_data CTE:

sql
opportunity_data AS (
SELECT
io.ClientId,
-- Other columns
FROM `datamarts.dm_sf_opportunities` AS io
JOIN (SELECT DISTINCT client_id FROM last_alerts) AS l
ON io.ClientId = l.client_id
)

Apply similar changes to other CTEs that unnecessarily join the same tables.

4. Use QUALIFY and Window Functions Wisely

Issue: Repeatedly applying QUALIFY ROW_NUMBER() can lead to unnecessary computations.

Solution:

● Consolidate Filtering: Ensure that ROW_NUMBER() is only applied where necessary.


● Prevent Duplicate Rows: Structure joins carefully to avoid duplication, eliminating the
need for extra filtering.

Implementation:

Since last_alerts already contains the latest records, remove any additional QUALIFY
clauses in the final SELECT.
5. Leverage BigQuery Partitioning and Clustering

Issue: Scanning entire tables without utilizing partitioning and clustering can be inefficient.

Solution:

● Partition Tables: Ensure large tables are partitioned by date or another logical
partitioning key.
● Cluster Tables: Cluster tables on columns frequently used in filters or joins.

If you have control over the dataset schema, partition maintenance_compil by timestamp
and cluster by client_id or server_id.
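Once the table is partitioned on the timestamp column, keeping a filter on that column in the base CTE (as in base_maintenance_compil above) gives BigQuery the chance to prune partitions instead of scanning the whole table. A sketch of the query-side pattern:

sql
-- Sketch: a filter on the partitioning column enables partition pruning.
SELECT client_id, server_id, timestamp
FROM `server_incidents.maintenance_compil`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)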

6. Optimize JSON Processing

Issue: Unnesting and processing JSON arrays can be resource-intensive.

Solution:

● Filter Before Unnesting: Apply necessary filters before unnesting to reduce the number
of elements processed.
● Use Efficient Functions: Use JSON_VALUE and JSON_QUERY instead of
JSON_EXTRACT_SCALAR and JSON_EXTRACT_ARRAY when appropriate.

Implementation:

Process only the required records by building resolution_data from last_alerts:

sql
resolution_data AS (
SELECT
client_id,
JSON_QUERY_ARRAY(resolution) AS resolution_array
FROM
last_alerts
),
7. Simplify Complex Calculations

Issue: Complex CASE statements and calculations in the final SELECT can be hard to read and
maintain.

Solution:

● Compute Values in CTEs: Move complex calculations to earlier CTEs for clarity and
reusability.
● Simplify Logic: Review and simplify conditions in CASE statements where possible.

Implementation:

Create a score_calculation CTE to compute the SCORE:

sql
score_calculation AS (
SELECT
l.client_id,
l.server_id,
CASE
-- Complex conditions
END AS SCORE
FROM last_alerts AS l
-- Necessary joins
)

Then join score_calculation in the final SELECT.
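The final SELECT can then pick up the precomputed value with a simple join, along these lines (a sketch; column names follow the CTE above):

sql
SELECT
  l.client_id,
  l.server_id,
  sc.SCORE
FROM last_alerts AS l
LEFT JOIN score_calculation AS sc
  ON l.client_id = sc.client_id
  AND l.server_id = sc.server_id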

8. Reduce Data in Early Stages

Issue: Processing unnecessary columns or records can slow down the query.

Solution:

● Select Only Needed Columns: In each CTE, select only the columns required for
downstream processing.
● Apply Filters Early: Use WHERE clauses as soon as possible to limit the data being
processed.
9. Use Approximate Aggregations

Issue: Exact counts can be expensive and unnecessary when approximate values suffice.

Solution:

● Use Approximate Functions: Functions like APPROX_COUNT_DISTINCT provide faster results with minimal accuracy loss.

If precise counts aren't essential, replace:

sql
COUNT(DISTINCT some_column)

With:

sql
APPROX_COUNT_DISTINCT(some_column)

10. Analyze Query Execution with EXPLAIN

Issue: Without understanding the execution plan, it's challenging to identify bottlenecks.

Solution:

● Use EXPLAIN: Analyze the query execution plan to identify inefficiencies.

Implementation:

Before running the query, execute:

sql
EXPLAIN
SELECT -- your query here

Review the output to understand how BigQuery processes the query.


Rewriting the Query with Optimizations
Applying the above optimizations, the query is rewritten to enhance performance and
maintainability.

Key Changes:

● Introduced base_maintenance_compil CTE with early filtering.


● Simplified regular expressions and string functions.
● Eliminated redundant joins by using existing client_id values.
● Moved complex calculations to dedicated CTEs.
● Reduced data processed by selecting only necessary columns.

Handling Errors and Corrections


Error: TIMESTAMP_SUB Does Not Support MONTH Interval

Problem:

● When attempting to use TIMESTAMP_SUB with 'MONTH' or 'WEEK' intervals on TIMESTAMP data types, BigQuery returns an error.

Error Message:

TIMESTAMP_SUB does not support the MONTH date part when the argument
is TIMESTAMP type

Correcting the last_lost_date Calculation

Solution:

● Use DATETIME_SUB or DATE_SUB for 'MONTH' and 'WEEK' intervals.


● Cast TIMESTAMP to DATETIME when using DATETIME_SUB.

Implementation:
sql
CASE
WHEN server_disconnected IS NULL THEN 'Never been down'
WHEN server_disconnected LIKE '%month%' THEN
CAST(DATETIME_SUB(CAST(timestamp AS DATETIME), INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) MONTH) AS
STRING)
WHEN server_disconnected LIKE '%week%' THEN
CAST(DATETIME_SUB(CAST(timestamp AS DATETIME), INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) WEEK) AS
STRING)
WHEN server_disconnected LIKE '%day%' THEN
CAST(TIMESTAMP_SUB(timestamp, INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) DAY) AS
STRING)
WHEN server_disconnected LIKE '%hour%' THEN
CAST(TIMESTAMP_SUB(timestamp, INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) HOUR) AS
STRING)
ELSE 'not found'
END AS last_lost_date

Error in calledto_restart_server Calculation

Problem:

● Similar to the previous error, using TIMESTAMP_SUB with 'MONTH' or 'WEEK' intervals
causes an error in the calledto_restart_server calculation.

Solution:

● Use DATETIME_SUB for 'MONTH' and 'WEEK' intervals.


● Ensure consistent data types in comparisons.

Implementation:

sql
CASE
WHEN l.server_disconnected IS NULL THEN 'No'
WHEN l.server_disconnected LIKE '%month%' THEN
CASE
WHEN DATETIME_SUB(CAST(l.timestamp AS DATETIME), INTERVAL
CAST(REGEXP_EXTRACT(l.server_disconnected, r'\d+') AS INT64) MONTH) >
CAST(io.LastActivityDate AS DATETIME) THEN 'No'
ELSE 'Yes'
END
WHEN l.server_disconnected LIKE '%week%' THEN
CASE
WHEN DATETIME_SUB(CAST(l.timestamp AS DATETIME), INTERVAL
CAST(REGEXP_EXTRACT(l.server_disconnected, r'\d+') AS INT64) WEEK) >
CAST(io.LastActivityDate AS DATETIME) THEN 'No'
ELSE 'Yes'
END
WHEN l.server_disconnected LIKE '%day%' THEN
CASE
WHEN TIMESTAMP_SUB(l.timestamp, INTERVAL
CAST(REGEXP_EXTRACT(l.server_disconnected, r'\d+') AS INT64) DAY) >
io.LastActivityDate THEN 'No'
ELSE 'Yes'
END
WHEN l.server_disconnected LIKE '%hour%' THEN
CASE
WHEN TIMESTAMP_SUB(l.timestamp, INTERVAL
CAST(REGEXP_EXTRACT(l.server_disconnected, r'\d+') AS INT64) HOUR) >
io.LastActivityDate THEN 'No'
ELSE 'Yes'
END
ELSE 'No'
END AS calledto_restart_server

Final Corrections and Recommendations

● Data Type Consistency: Ensure both sides of comparisons are of the same data type.
● Avoid Unnecessary Casting: Simplify code by avoiding redundant casts and
conversions.
● Verify Column Data Types: Check the data types of columns like io.LastActivityDate so you can cast them appropriately, as in the sketch below.
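As an illustration of keeping both sides of a comparison on the same type, the sketch below casts both operands to DATETIME before comparing. It assumes io.LastActivityDate is a DATE or TIMESTAMP column, so adjust the casts to the actual schema:

sql
-- Sketch: compare DATETIME against DATETIME rather than mixing types.
-- INTERVAL 1 MONTH stands in for the value extracted from server_disconnected.
SELECT
  l.client_id,
  DATETIME_SUB(CAST(l.timestamp AS DATETIME), INTERVAL 1 MONTH)
    > CAST(io.LastActivityDate AS DATETIME) AS disconnected_before_last_activity
FROM last_alerts AS l
JOIN opportunity_data AS io
  ON l.client_id = io.ClientId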

Final Optimized Query


Below is the final optimized query, incorporating all recommendations and corrections:

sql
WITH
base_maintenance_compil AS (
SELECT
client_id,
server_id,
timestamp,
server_disconnected,
unstable_stream,
fps_fail_ratio_last_24h,
trashed_alert,
dropped_alert,
gpu_restart_indicator,
gpu_import_indicator,
minimum_temperature_last_24h,
streams,
cpu_full,
disk_full,
upload,
download,
watcher_disconnected,
resolution,
fibre,
version,
movement_percentage_last_45_minutes,
server_detection_rate,
reason
FROM `server_incidents.maintenance_compil`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30
DAY)
),

last_alerts AS (
SELECT
client_id,
server_id,
version,
server_disconnected,
streams AS theoretical_flux_capacity,
server_detection_rate,
movement_percentage_last_45_minutes,
fps_fail_ratio_last_24h,
gpu_restart_indicator,
trashed_alert,
dropped_alert,
unstable_stream,
gpu_import_indicator,
minimum_temperature_last_24h,
streams,
cpu_full,
disk_full,
upload,
download,
watcher_disconnected,
resolution,
fibre,
reason,
timestamp,
CASE
WHEN server_disconnected IS NULL THEN 'Never been down'
WHEN server_disconnected LIKE '%month%' THEN
CAST(DATETIME_SUB(CAST(timestamp AS DATETIME), INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) MONTH) AS
STRING)
WHEN server_disconnected LIKE '%week%' THEN
CAST(DATETIME_SUB(CAST(timestamp AS DATETIME), INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) WEEK) AS
STRING)
WHEN server_disconnected LIKE '%day%' THEN
CAST(TIMESTAMP_SUB(timestamp, INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) DAY) AS
STRING)
WHEN server_disconnected LIKE '%hour%' THEN
CAST(TIMESTAMP_SUB(timestamp, INTERVAL
CAST(REGEXP_EXTRACT(server_disconnected, r'\d+') AS INT64) HOUR) AS
STRING)
ELSE 'not found'
END AS last_lost_date
FROM base_maintenance_compil
QUALIFY ROW_NUMBER() OVER (PARTITION BY client_id, server_id ORDER
BY timestamp DESC) = 1
),

-- Rest of the query, including other CTEs such as alerts_last_30_days,
-- daily_problems, avg_problems, recurring_alerts, resolution_data,
-- extracted_data, ranked_data, opportunity_data, subscription_data,
-- and score_calculation

SELECT
-- Final SELECT statement with all required columns and calculations
FROM last_alerts AS l
-- Necessary LEFT JOINs with other CTEs
ORDER BY
l.timestamp DESC

Conclusion
By applying the discussed optimizations and corrections, the query becomes more efficient,
readable, and maintainable. Key takeaways include:

● Filtering Data Early: Reduces the amount of data processed in subsequent steps.
● Simplifying Functions: Improves performance by using less computationally intensive
operations.
● Eliminating Redundant Joins: Streamlines the query and avoids unnecessary data
processing.
● Handling Errors Promptly: Resolves issues by understanding function limitations and data type requirements.
● Organizing Calculations: Enhances readability and makes the query easier to debug
and maintain.

Additional Resources
● BigQuery Best Practices: Performance Optimization
● SQL Query Optimization Techniques: SQL Optimization Guide
● Understanding CTEs: Common Table Expressions (CTEs)
● BigQuery Date and Time Functions: Date Functions

By studying these optimizations and the reasoning behind them, you'll enhance your SQL skills
and be better equipped to write efficient queries in the future. Remember to test your queries
incrementally and validate results at each step to ensure accuracy.

Feel free to reach out if you have any questions or need further clarification on any of the topics
covered.
