100 SQL Tips
To Take Your Skills to the Next Level
KYLE MALONE
Table of Contents
Introduction
Tip #1: The Most Common Pattern for Exploring Data
Tip #2: How to Do SUMIFS in SQL
Tip #3: My Favorite String Function
Tip #4: How to Extract Part of a String in SQL
Tip #5: Understanding SQL’s Order of Execution
Tip #6: Proper Use of the ORDER BY Clause
Tip #7: A SQL Formatting Tip
Tip #8: Writing More Readable, Explicit SQL
Tip #9: Working with UNIX Timestamps in SQL
Tip #10: An Alternative to CASE Statements
Tip #11: ROUND function in SQL
Tip #12: A Special Way to Round Numbers in SQL
Tip #13: CTEs vs. Subqueries
Tip #14: Pivot Data with CASE Statements
Tip #15: INNER JOIN vs. WHERE IN
Tip #16: Introducing GROUP BY ALL
Tip #17: Time Series Analysis with LAG
Tip #18: Self Joins
Tip #19: Cleaning Data with TRIM
Tip #20: Avoid SQL Errors with SAFE_CAST
Tip #21: NULLIF Function in SQL
Tip #22: PERCENT_RANK in SQL
Tip #23: Check Divisibility with MOD
Tip #24: Changing Data Types in SQL
Tip #25: Fix Data Issues at the Source
Tip #26: DDL, DML, & DQL — What’s the Difference?
Tip #27: A Quick SQL Tip to Save You Time
Tip #28: Speed Up Your Queries By Joining on This Data Type
Tip #29: Understanding Column vs. Row Store Databases
Tip #30: The DATE_TRUNC Function
Tip #31: FULL OUTER JOIN in SQL
Tip #32: Dealing with NULL Values in SQL
Tip #33: 4 Common SQL Mistakes and How to Avoid Them
Tip #34: How to Rank Values in SQL
Tip #35: Write Faster SELECT Statements with EXCEPT
Tip #36: A Cleaner Way to Filter on Window Functions
Tip #37: Simplify Your CASE Statements
Tip #38: Appending Together Two Datasets
Tip #39: Identifying Duplicates in Your Dataset
Tip #40: Working with Unstructured Data in SQL
Tip #41: The EXTRACT Function
Tip #42: A SQL Function for Cleaning Data
Tip #43: How to Deal with a Divide by Zero Error
Tip #44: My Most Used Data Cleaning Function
Tip #45: Referencing a Table Name Alias in the WHERE Clause
Tip #46: How to Learn Joins
Tip #47: Generating All Possible Combinations with CROSS JOIN
Tip #48: Traversing Hierarchical Data with Recursive CTEs
Tip #49: Filtering Aggregated Data with HAVING (vs. WHERE)
Tip #50: WHERE TRUE
Tip #51: An ORDER BY Trick
Tip #52: CTEs are Simpler Than You Think
Tip #53: SQL JOINs Explained
Tip #54: NULL Values in WHERE Clause
Tip #55: Calculating Running Totals with Window Functions
Tip #56: Generate a “Date Spine” to Avoid Missing Dates in Charts
Tip #57: Validating String Length
Tip #58: Combining Columns into a Unique Key with CONCAT
Tip #59: Optimize Your Query Performance and Cost
Tip #60: Fact vs. Dimension Tables
Tip #61: Aggregate Your Data at Multiple Levels with GROUPING SETS
Tip #62: Check for Multiple NULL Columns Using COALESCE
Tip #63: Calculate Moving Averages in SQL
Tip #64: Spot Checking and Testing
Tip #65: Query Multiple Tables at Once with Wildcard Tables
Tip #66: Creating Your Own SQL Functions to Reuse Logic
Tip #67: When to Use DISTINCT vs. GROUP BY
Tip #68: Slowly Changing Dimensions
Tip #69: Sometimes You Just Need a Giant CASE Statement
Tip #70: Avoiding the “Double Counting” Pitfall
Tip #71: Combine Multiple Rows into One with STRING_AGG
Tip #72: Parsing Text with REGEX_EXTRACT: Don’t Fear the Regex!
Tip #73: Easily Group Data by Month with LAST_DAY
Tip #74: Anonymizing Personally Identifiable Information with a Hash Function
Tip #75: Joining Beyond Simple IDs
Tip #76: Parsing Dates with PARSE_DATE: Fixing Messy Date Formats
Tip #77: Splitting Bad Rows into an Error Table with SAFE_CAST
Tip #78: Filtering on Boolean
Tip #79: Using GREATEST and LEAST to Compare Values in a Row
Tip #80: When to Use FIRST_VALUE and LAST_VALUE
Tip #81: Demystifying Window Function Keywords
Tip #82: How to Handle Missing Data
Tip #83: Understanding the Difference Between NULL and Zero
Tip #84: Understanding COUNT(*), COUNT(column), and COUNT(1)
Tip #85: Best Practices for Using INNER JOIN vs. LEFT JOIN and Filtering Data
Tip #86: Exploring and Validating Your Data with Simple Queries
Tip #87: Why You Might Want to Use a View in BigQuery
Tip #88: Using the REPEAT Function
Tip #89: Using the SPLIT Function for String Manipulation
Tip #90: Using the BAG_OF_WORDS Function for Text Analysis
Tip #91: You’re Never Going to Get the Data Exactly How You Want It — And That’s Okay
Tip #92: Why GROUP BY Columns Must Appear in the SELECT Clause (Unless Aggregated)
Tip #93: WHERE EXISTS
Tip #94: Why You Should Always Comment Your Code
Tip #95: When to Use REGEXP_CONTAINS Instead of LIKE
Tip #96: The SELECT Clause is Independent of the Query’s Processing
Tip #97: Don’t Use CASE Statements as Join Conditions
Tip #98: HAVING MAX and HAVING MIN
Tip #99: Data Normalization Basics
Tip #100: Using || for String Concatenation
Conclusion
Introduction
Welcome to 100 SQL Tips to Level Up Your Skills!
In this e-book, I share with you the top 100 SQL techniques that I’ve learned over the last decade of my
career in data analytics.
It’s the difference between a data analyst who’s just competent and one who stands out, gets
promotions and pay increases, and drives the business strategy of their company.
I’ve seen over and over how the level of someone’s SQL skills can be the difference-maker in
advancing their data career.
I once interviewed a hungry candidate who was eager to join our analytics team, but their SQL just wasn’t there yet.
No one had time to teach SQL from scratch with everything else on our plates.
You can have the best communication skills and business acumen, but if you can’t write SQL queries, you’ll struggle to add real value as a data analyst.
I want you to crush it in your career, get promoted, and achieve whatever you dream of.
These 100 tips condense the lessons I’ve learned (sometimes the hard way) through years of experience with SQL.
I remember earlier in my career, I didn’t know where to turn to improve my SQL skills.
And I want to help those of you in the same place that I was in years ago.
These tips are designed to close knowledge gaps and provide a solid foundation you can build on.
If you apply them, you’ll move past the basics and start writing queries that unlock real business value.
Think of this as your shortcut to mastering the most important tool in data analytics.
By the last page, you’ll be writing SQL with a level of confidence that sets you apart, whether you’re eyeing your first data job or gunning for that senior role.
P.S. The tips are not in any particular order — just an order that I thought had a nice flow.
P.P.S. Please excuse the inconsistent format of the SQL code images. I like to mix it up! 🤣
Tip #1: The Most Common Pattern for Exploring Data
As a data analyst, exploring and understanding your dataset is crucial before diving into more complex analysis.
One common pattern for exploring data is to find the most frequent values for a specific column, ranked from top to bottom.
Here's a SQL tip for finding the most common values for a column in your dataset.
Let's consider an example where we have a table named “orders” with the following data:
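Say each row is one order and the table has a product_category column (an assumption for illustration). A minimal sketch of the counting query:

SELECT
  product_category,
  COUNT(*) AS frequency
FROM orders
GROUP BY product_category
ORDER BY frequency DESC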
A query to find the most common product categories would look something like the query above.
In this example, the query returns the product categories and their respective frequencies, ordered by
frequency in descending order.
This helps us quickly identify the most common product categories in our dataset.
One other tip is you can try to GROUP BY 1 ORDER BY 2, instead of typing the column names explicitly.
Not every SQL environment supports this but it can make your query cleaner and faster to write.
As a data analyst, using this pattern for exploring your data will enable you to better understand your dataset and make more informed decisions during your analysis.
It's a simple yet effective way to quickly gain insights into the distribution of values within your data.
Tip #2: How to Do SUMIFS in SQL
SUMIFS is my favorite spreadsheet function.
In SQL, the combination of the CASE statement within the SUM function can be used to achieve
functionality similar to SUMIFS in Excel.
The first time I saw that you could put a CASE statement inside a SUM, it blew my mind. 🤯
The CASE statement allows you to apply conditional logic within the aggregation function. This lets you sum the values based on specific criteria.
We want to calculate the total quantity sold for product_id 1, but only for completed sales. In this case, we can use a CASE statement within the SUM function to apply the necessary conditions:
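A sketch of that conditional sum, assuming a sales table with product_id, sale_status, and quantity columns:

SELECT
  SUM(CASE WHEN product_id = 1 AND sale_status = 'Completed' THEN quantity ELSE 0 END) AS total_completed_quantity
FROM sales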
In this example, the CASE statement checks if the product_id is 1 and if the sale_status is 'Completed.'
If both conditions are met, it returns the quantity value; otherwise, it returns 0.
The SUM function then adds up the returned values, giving us the total quantity sold for product_id 1 with completed sales.
Using a CASE statement within a SUM function allows you to perform conditional aggregations similar to SUMIFS in Excel.
Tip #3: My Favorite String Function
Working with strings in SQL is one of the most common things you do as a data analyst.
As a part of that, you may need to combine the values of multiple columns or strings in your SQL queries.
In this lesson, we'll introduce you to the CONCAT function and explain how it can be helpful in your
SQL queries.
What is CONCAT?
The CONCAT function is used to concatenate (or join) two or more strings or column values together.
It returns a single string that is the result of combining the input strings in the order they are provided.
Suppose we have a table called “employees” with the following columns: first_name, last_name, and email.
We want to create a query that returns a full name by combining the first_name and last_name columns.
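A minimal sketch of that query:

SELECT
  CONCAT(first_name, ' ', last_name) AS full_name
FROM employees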
In this query, we use the CONCAT function to join the first_name and last_name columns, separated by
a space character.
By using this function, you can easily create new strings by joining existing ones, which can be useful for formatting.
P.S. Another cool thing CONCAT can do is create a unique key on your table. If none of the columns are unique by themselves but a combination of columns is unique, then you can just concatenate them together to create a unique key.
Tip #4: How to Extract Part of a String in SQL
Cleaning data in SQL can be one of the hardest parts of a data analyst's job.
In SQL, string functions like STRPOS() and SUBSTRING() can be used to manipulate and extract
specific parts of text data.
STRPOS() returns the position of the first occurrence of a part of a piece of text.
In combination with other string functions like SUBSTRING(), you can extract parts of a string based on
the position of a specific character.
Let's consider an example where we have a table named “file_paths” with the following data...
We want to extract the file names, which are all characters to the right of the last / in the file_path column.
We can use the STRPOS() and SUBSTRING() functions in a query to achieve this.
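A sketch of that query. Note that this version keys off the first / in the string, so it assumes each file_path contains only one /:

SELECT
  file_path,
  SUBSTRING(file_path, STRPOS(file_path, '/') + 1) AS file_name
FROM file_paths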
In this example, the STRPOS(file_path, '/') function finds the position of the first occurrence of / in the
file_path column.
We add 1 to this position to start from the character immediately after the /.
Finally, we use the SUBSTRING() function to extract the substring starting at the calculated position until the end of the string.
Boom!
Tip #5: Understanding SQL’s Order of Execution
A common confusion for new analysts is that SQL doesn’t execute in the same top-to-bottom order you see in your query.
So if you’re ever surprised that you can’t reference a SELECT alias in WHERE, or run into issues with a window function, it’s probably due to how SQL is actually evaluated under the hood.
1. FROM: Identify which table you’re pulling data from.
2. JOIN: Combine the data from multiple tables if needed.
3. WHERE: Filter out rows that don’t meet your conditions.
4. GROUP BY: Group rows for aggregation.
5. HAVING: Filter the grouped results.
6. SELECT: Choose which columns or expressions to display.
7. ORDER BY: Sort the results.
8. LIMIT: Return only a subset of rows (e.g., top 10).
Here’s an example to illustrate why this matters. Let’s say you write:
SELECT
  customer_id,
  COUNT(*) AS order_count
FROM orders
WHERE order_count > 5 -- This won't work
GROUP BY customer_id
You’d need:
SELECT
  customer_id,
  COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
Here, we use HAVING to filter after the grouping and counting is done. This makes sense because when the WHERE clause is evaluated, the SELECT clause hasn’t run yet, so the order_count alias doesn’t exist.
SQL runs in a specific logical order. Knowing this helps you avoid those “Why won’t this work?” moments. 🤔
Tip #6: Proper Use of the ORDER BY Clause
When you're working with SQL, it's crucial to write efficient queries to retrieve the data you need as quickly as possible.
One way to optimize performance is by ensuring that you use the ORDER BY clause only at the end of your query, rather than in subqueries or derived tables.
The reason behind this is that sorting data can be computationally expensive, especially when dealing with large datasets.
Below is an example of an inefficient ORDER BY inside a subquery vs. an efficient ORDER BY at the end.
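A sketch of the contrast, using a hypothetical orders table:

-- Less efficient: sorting inside the subquery does extra work the final result doesn't need
SELECT
  customer_id,
  SUM(amount) AS total_spend
FROM (
  SELECT customer_id, amount
  FROM orders
  ORDER BY amount DESC
) o
GROUP BY customer_id;

-- More efficient: aggregate first, then sort only the final result
SELECT
  customer_id,
  SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
ORDER BY total_spend DESC;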
By using ORDER BY at the end of your query, you give the database engine the opportunity to filter, join, or aggregate the data before sorting it.
This reduces the amount of data that needs to be sorted, making the process more efficient and faster.
Tip #7: A SQL Formatting Tip
Learning good SQL formatting is one of the highest leverage things you can do as a new data analyst.
That's because it's easy to learn but can be detrimental to your interviews if you don't do it.
Learning to format your SQL is the EASIEST and FASTEST way to make you look like you know what you’re doing.
SELECT name, age, country, occupation, hobby
FROM users
WHERE country = 'USA' AND age > 21
ORDER BY name DESC

SELECT
    name,
    age,
    country,
    occupation,
    hobby
FROM users
WHERE country = 'USA'
    AND age > 21
ORDER BY name DESC
Formatting styles vary, but all of them follow some basic best practices, like using a lot of vertical space.
It's about writing queries that others can easily understand and maintain.
Tip #8: Writing More Readable, Explicit SQL
Just because two SQL queries do the same thing, doesn’t mean they are equal.
Here’s an example of how a senior analyst vs. a more junior analyst might write the same query.
Junior Analyst:
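A hedged sketch of the implicit version; the orders table and the seven-hour shift to a local timezone are assumptions based on the discussion below:

SELECT
  order_id,
  TIMESTAMP_SUB(created_at, INTERVAL 7 HOUR) AS created_at
FROM orders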
Senior analyst:
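And a sketch of the explicit version of the same query:

SELECT
  order_id,
  -- Convert created_at from UTC to local reporting time (UTC-7)
  TIMESTAMP_SUB(created_at, INTERVAL 7 HOUR) AS created_at_local
FROM orders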
Both of these queries get you the same result. But the second one is more explicit.
It articulates exactly what you’re trying to do to someone else reading your query.
In the first example, the reader has to guess why seven hours are being subtracted from the created_at
column.
Explicit queries reduce the risk of misinterpretation or errors when others work with your code.
Easily readable SQL strengthens team collaboration and eases the onboarding process for new analysts.
The goal is to write SQL that not only works but works well for everyone who reads it!
Tip #9: Working with UNIX Timestamps in SQL
Have you ever seen a timestamp that just looked like a really long number?
It’s a representation of time as the number of seconds that have elapsed since January 1, 1970,
00:00:00 Coordinated Universal Time (UTC), not counting leap seconds.
UNIX timestamps are a convenient way to store and manipulate date and time values in various programming languages and databases, including SQL.
They are especially useful for calculating time intervals and comparing dates since they represent time as a single numeric value.
Different SQL databases have their own set of functions to handle UNIX timestamps.
Here are some examples showing how PostgreSQL, MySQL, and BigQuery handle these.
Convert a UNIX timestamp to a date and time value using TIMESTAMP_SECONDS() function:
SELECT TIMESTAMP_SECONDS(1623647987) AS date_time
Convert a date and time value to a UNIX timestamp using the UNIX_SECONDS() function:
SELECT UNIX_SECONDS(TIMESTAMP "2021-06-14 12:33:07") AS unix_timestamp
Let's consider an example where we have a table named user_logins with the following data:
We want to convert the UNIX timestamps to human-readable date and time values. In PostgreSQL, we can do this using the to_timestamp() function:
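A sketch of that conversion; the user_id and login_unix_ts column names are assumptions:

SELECT
  user_id,
  login_unix_ts,
  to_timestamp(login_unix_ts) AS login_time
FROM user_logins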
Next time you see a timestamp shown as a long string of numbers you will know what to do!
Tip #10: An Alternative to CASE Statements
Sometimes CASE statements in SQL can get long and hard to read.
In certain SQL dialects (e.g. BigQuery), you can use the IF() function as a cleaner alternative.
Here’s an example...
The code below shows how to write that with both CASE and IF.
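A minimal sketch of both, assuming an orders table with an amount column:

-- CASE version
SELECT
  order_id,
  CASE WHEN amount >= 100 THEN 'Large' ELSE 'Small' END AS order_size
FROM orders;

-- IF version
SELECT
  order_id,
  IF(amount >= 100, 'Large', 'Small') AS order_size
FROM orders;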
The IF function takes three arguments:
1. a condition
2. the value to return if the condition is true
3. the value to return if the condition is false
I’m not saying you should ALWAYS use IF over CASE statements.
I think it works best to use IF when the logic is very simple like the example above.
But if you’ve never used it, try out IF next time you are about to write a CASE statement!
Tip #11: ROUND function in SQL
A common mistake data analysts make (and one I’ve made myself) is including too many decimal places in presentations to stakeholders.
SQL provides a versatile function called ROUND that helps you combat this.
It takes two arguments:
1. The number you want to round
2. The number of decimal places to round to.
If you pass a positive value as the second argument, it rounds to that many decimal places.
A negative value will round to the left of the decimal point, and passing zero will round the number to the nearest integer.
ere’s an example where we have a table named “product_prices” and we want to round the prices
H
multiple ways:
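A sketch of that query, assuming product_prices has product_name and price columns:

SELECT
  product_name,
  price,
  ROUND(price, 2) AS price_two_decimals,
  ROUND(price, 0) AS price_nearest_integer,
  ROUND(price, -1) AS price_nearest_ten
FROM product_prices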
In this example, the query uses the ROUND function to round the prices to two decimal places, the
nearest integer, and the nearest multiple of 10.
As a result, we can see the rounded prices for each product in our dataset.
Happy rounding. 🟠
Tip #12: A Special Way to Round Numbers in SQL
If SQL already has a ROUND function, why do we also need FLOOR and CEILING?
While ROUND is great for general number rounding, FLOOR and CEILING are needed to control the direction of rounding, such as when bucketing data.
Let’s say you want to analyze how much customers spend in their first 30, 60, 90, etc., days after signing up.
You need to group spending into fixed 30-day buckets, ensuring that each period starts at a multiple of 30 days.
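A sketch of that bucketing (BigQuery syntax); the transactions table and its signup_date, transaction_date, and amount columns are assumptions:

SELECT
  customer_id,
  FLOOR(DATE_DIFF(transaction_date, signup_date, DAY) / 30) * 30 AS bucket_start_day,
  SUM(amount) AS total_spend
FROM transactions
GROUP BY customer_id, bucket_start_day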
If we used ROUND, a transaction at 45 days would round to 60 instead of being correctly placed in the 30 to 60-day bucket.
Example:
SELECT ROUND(45 / 30) * 30; -- Returns 60 (incorrect for bucketing)
SELECT FLOOR(45 / 30) * 30; -- Returns 30 (correct for bucketing)
By using FLOOR, we ensure that transactions always fall into the correct 30-day range.
Key Takeaways:
- Use FLOOR when bucketing data into fixed intervals (e.g., 30-day periods).
- Use CEILING when you need to round up (e.g., determining “at most X days”).
- Use ROUND only when rounding to the nearest value is appropriate.
Understanding these differences helps ensure accuracy in SQL reporting and analysis, especially for LTV, retention, and cohort studies.
Tip #13: CTEs vs. Subqueries
When should you use CTEs (common table expressions) vs. subqueries in SQL?
To refresh your memory, I like to think of CTEs and subqueries as ways to organize the writing of your SQL code.
A CTE is kind of like making a bunch of little tables individually, and then working with them in the lower parts of your query.
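A minimal sketch of that structure, with hypothetical names:

WITH recent_orders AS (
  SELECT col1, col2
  FROM table_name
  WHERE order_date >= '2024-01-01'
)
SELECT
  col1,
  col2
FROM recent_orders

A subquery, by contrast, nests one query directly inside another: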
SELECT
  columns...
FROM (
  SELECT col1, col2, etc.
  FROM table_name
  WHERE ...
)
...
Whenever I see a big subquery inside another query it makes my mind explode. 🤯
However, I understand sometimes it's just easy to write a little subquery when you have to.
Only use subqueries when they are VERY simple and easy to understand.
If it takes more than five seconds to read and understand what the subquery is doing, make it a CTE
instead.
This will make it easier to write, understand, and debug your code.
It will also make it WAY easier for someone else who didn't write your query to read and understand it.
Follow this rule and your other team members and future self will thank you!
Tip #14: Pivot Data with CASE Statements
One useful feature of CASE statements is to pivot data dynamically.
CASE statements can transform data in ways beyond simple conditional expressions.
Pivoting data means converting rows into columns, which can be especially useful when dealing with categorical data.
Here's an example of how you can use the CASE statement to pivot data:
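A sketch built from the column names in the before/after tables below; the dogs table name is an assumption:

SELECT
  dog_breed,
  SUM(CASE WHEN age_group = 'Puppy' THEN number_of_dogs ELSE 0 END) AS puppy_count,
  SUM(CASE WHEN age_group = 'Adult' THEN number_of_dogs ELSE 0 END) AS adult_count,
  SUM(CASE WHEN age_group = 'Senior' THEN number_of_dogs ELSE 0 END) AS senior_count
FROM dogs
GROUP BY dog_breed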
In this query, we used SUM functions and CASE statements to pivot the age_group column, converting
it into separate columns for each age group's dog count.
Here is what that looks like before and after the pivot:
-- BEFORE PIVOT
dog_breed        | age_group | number_of_dogs
-----------------|-----------|---------------
Labrador         | Puppy     | 10
Labrador         | Adult     | 35
Labrador         | Senior    | 15
Golden Retriever | Puppy     | 12
Golden Retriever | Adult     | 30
Golden Retriever | Senior    | 20
Beagle           | Puppy     | 8
Beagle           | Adult     | 25
Beagle           | Senior    | 12

-- AFTER PIVOT
dog_breed        | puppy_count | adult_count | senior_count
-----------------|-------------|-------------|--------------
Labrador         | 10          | 35          | 15
Golden Retriever | 12          | 30          | 20
Beagle           | 8           | 25          | 12
📊 📊
Enjoying this e‑book?
There are many people (just like you) out there who want to learn SQL but feel lost and don’t know where to start.
All you need to do is share a quick post on LinkedIn (or your preferred social platform).
Tag me, mention how this e‑book helped you, and include a link so others can find it.
People listen to others’ recommendations, and your recommendation could guide them in the right direction.
Thanks for sharing this e-book and helping others on their path.
Tip #15: INNER JOIN vs. WHERE IN
Two paths, same destination.
When filtering data, you might see two approaches that do basically the same thing.
SELECT t1.*
FROM table1 t1
JOIN table2 t2
  ON t1.column = t2.column

SELECT *
FROM table1
WHERE column IN (SELECT column FROM table2)
In many modern SQL engines, both often end up with similar execution plans.
Some find WHERE IN more explicit, especially when using a simple subquery.
I personally gravitate toward INNER JOINs as the primary way to filter data if possible.
As you gain experience, you’ll develop preferences, but don’t let this topic confuse you.
The database optimizer usually handles both similarly, especially for reasonably sized datasets.
Ultimately,clarityis key.
Pick the syntax that best communicates the intent to you and to those who will be reading your code.
Tip #16: Introducing GROUP BY ALL
This SQL tip will change your life.
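Picture a query that groups by a long list of columns (hypothetical names):

SELECT
  product_category,
  region,
  order_month,
  customer_segment,
  SUM(amount) AS total_sales
FROM orders
GROUP BY product_category, region, order_month, customer_segment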
Phew 🥵
GROUP BY 1,2,3,4
Better... 🤨
Now, just simply:
GROUP BY ALL
Beautiful, elegant.
Tip #17: Time Series Analysis with LAG
Time series analysis is one of the most important skills you can have as a data analyst.
This is because many important insights are related to a change in some metric over time.
The LAG function in SQL is a window function that is helpful for this.
Using LAG, you can calculate period-to-period changes, such as the difference in sales between consecutive months or the growth in user signups from week to week.
It can seem a bit confusing at first since it’s a window function, but I promise you’ll get the hang of it.
Below I show how to use it to calculate a month-over-month change in the sales column.
Let's consider an example where we have a table named “monthly_sales” with the following monthly sales data:
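A sketch of the calculation, assuming monthly_sales has month and sales columns:

SELECT
  month,
  sales,
  LAG(sales) OVER (ORDER BY month) AS previous_month_sales,
  sales - LAG(sales) OVER (ORDER BY month) AS sales_change
FROM monthly_sales
ORDER BY month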
In this example, the LAG function is used with an OVER clause, which defines the order of rows to be used in the calculation.
This retrieves the sales value from the previous row, relative to the current row, when sorted by the month column (previous_month_sales).
Then, we subtract the previous month's sales from the current month's sales to get the sales change.
By using the LAG function, you can gain insights into how values change over time, helping you identify trends, growth rates, and patterns in your data.
Tip #18: Self Joins
The first time I saw a self join in SQL I thought it was so weird.
But they’re super useful for analyzing relationships within the same dataset.
et’s say you’re a data analyst working with employee data, and you want to find which employees
L
report to which managers—all from the same employees table.
SELECT
  e.employee_id,
  e.first_name AS employee_name,
  m.first_name AS manager_name
FROM employees e
JOIN employees m
  ON e.manager_id = m.employee_id
ORDER BY manager_name, employee_name
Think of self joins like comparing rows from the same table to find patterns or relationships.
A quick rule of thumb is that self joins are perfect for tables that have hierarchies (managers/employees, product categories, etc.).
If you come across a table that has hierarchies, remember you might need a self join to analyze it!
Tip #19: Cleaning Data with TRIM
Extra spaces before and after words in SQL columns can drive analysts crazy.
This can happen for a lot of reasons; a common one is when a user accidentally types a space before or after their email in an online form or survey.
There are often processes in place to ensure this doesn't happen, but that's not always the case.
I can't tell you how many times I've tried for hours to figure out what's wrong with my SQL query only to
realize that one of the fields has a sneaky space in front of it!
To combat this you can use the TRIM function in SQL.
The TRIM function is a string manipulation function that removes leading and trailing spaces from a given string.
Let's consider an example where we have a table named “users” with user information that includes first names and email addresses with extra spaces.
We want to remove the extra spaces from the first_name and email columns.
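A sketch of that cleanup:

SELECT
  TRIM(first_name) AS first_name,
  TRIM(email) AS email
FROM users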
By using the TRIM() function, you can ensure that your queries, joins, and comparisons work correctly and avoid issues caused by extra spaces in user-submitted data.
Tip #20: Avoid SQL Errors with SAFE_CAST
Oftentimes in our analytics work we will want to CAST a column to another data type.
For example, we might have a column containing numbers but the data type of that column is a STRING.
To work with these strings as numbers we use the CAST function.
But one problem with this is it can cause your whole query to error out if any single row fails to cast to another data type.
In this example, this could happen if one of the rows mistakenly had a word in it instead of a number.
SAFE_CAST is a SQL function that's available in many data warehouses like Google BigQuery.
SAFE_CAST allows you to safely cast or convert data from one data type to another.
If the conversion cannot be performed, instead of causing your query to crash or return an error, the
SAFE_CAST function will return NULL.
Using SAFE_CAST can help you avoid unexpected errors in your queries when dealing with data that may not be in the expected format or contains invalid values for the target data type.
Suppose we have a table named raw_data with the following schema and data.
We want to cast the value column to integers, but there is a record (record_id 2) with an invalid value for an integer.
If we try to cast the value column directly, we may encounter an error:
To avoid this error and handle the casting safely, we can use the SAFE_CAST function:
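A sketch of both versions, using the value column and INT64 target type described in this example:

-- This can error out if any row holds a non-numeric value:
SELECT record_id, CAST(value AS INT64) AS int_value
FROM raw_data;

-- This returns NULL for the bad row instead of erroring:
SELECT record_id, SAFE_CAST(value AS INT64) AS int_value
FROM raw_data;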
In this example, the SAFE_CAST function takes the value and the target data type (INT64) as arguments.
When the value cannot be cast to the target data type, as in the case of record_id 2, the function returns NULL instead of causing an error.
If you want to put the errors in a separate table for further analysis, you can filter the records with NULL
values in the int_value column:
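A sketch of that filter:

SELECT
  record_id,
  value
FROM raw_data
WHERE SAFE_CAST(value AS INT64) IS NULL
  AND value IS NOT NULL -- keep rows that had a value but failed to cast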
Understanding how to use functions like SAFE_CAST to handle potential data type conversion errors in your SQL queries can be a big help when you need it.
It helps you maintain the stability and reliability of your analysis, allowing you to work with data that may contain unexpected or invalid values that could otherwise cause your queries to fail.
By using SAFE_CAST, you can also identify and analyze problematic records separately, improving the quality of your data processing and analysis.
Tip #21: NULLIF Function in SQL
A while back I made a LinkedIn post about how to check for duplicates in SQL.
I received a question asking not about duplicates, but how to compare whether two columns in the
same table have the same value.
That’s where the NULLIF function comes in: it compares two expressions and returns NULL if they are equal.
Imagine you have a users table with columns for email and backup_email. You want to ensure that the
backup_email is not the same as the primary email for security reasons.
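A sketch of that comparison, assuming a user_id column alongside the two email columns:

SELECT
  user_id,
  email,
  backup_email,
  NULLIF(email, backup_email) AS email_comparison
FROM users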
In the results of this query, email_comparison will be NULL if the email and backup_email are the same.
This could be helpful to quickly identify any instances where the backup email is the same as the primary email, which could be a security concern.
In other words, any row where email_comparison is NULL is a row where the email and backup_email
are identical.
This way, NULLIF can be used to compare columns and identify identical values, which can be useful in a variety of scenarios where identical values in different columns could be a problem.
Quick tips like this can really speed up your workflow as an analyst.
Happy SQL’ing!
Tip #22: PERCENT_RANK in SQL
I thought I didn’t need this SQL window function… until I used it the other day!
At some point, you’re going to need to compare values across a dataset.
Maybe you want to rank customers by spending, students by test scores, or products by sales performance.
But instead of just assigning a rank, you want to seewhere each value falls on a scale from 0 to 1.
Let's consider an example where we have a table named “exam_scores” with the following data:
We want to calculate the percentile rank of each student's score.
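A sketch of that query, assuming exam_scores has student_name and score columns:

SELECT
  student_name,
  score,
  PERCENT_RANK() OVER (ORDER BY score) AS percentile_rank
FROM exam_scores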
In this example, PERCENT_RANK() calculates the percentile rank of each student's score based on the sorted scores.
As a result, we can see how each student's score compares to the rest of the dataset in terms of percentiles.
By using this technique, you can quickly identify trends, outliers, and other important characteristics of your dataset, which can be crucial for effective data analysis.
Tip #23: Check Divisibility with MOD
As a data analyst, you may encounter situations where you need to determine if a number is divisible by another number.
This can be useful for a lot of things like checking if a number is odd or even.
In SQL, the MOD function can be a useful tool for achieving this goal.
The MOD function returns the remainder of a division operation, making it easy to check for divisibility.
If the result is 0, it indicates that the dividend is divisible by the divisor.
Let's consider an example where we have a table named “numbers” with the following data.
We want to find the numbers in the table that are divisible by 5.
In this case, we can use the MOD function in combination with a WHERE clause.
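A sketch, assuming the column is named value:

SELECT value
FROM numbers
WHERE MOD(value, 5) = 0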
In this example, the query uses the MOD function to filter the rows where the value is divisible by 5 (i.e.,
the remainder of the division by 5 is 0).
As a result, we get a list of numbers in our dataset that meet this condition, which are 20, 25, and 30.
Using the MOD function in SQL is a handy tool for working with numerical data.
It allows you to quickly identify and filter rows based on divisibility, which can be helpful for various data
analysis tasks, such as generating reports, identifying trends, or performing calculations on specific
subsets of your data.
Tip #24: Changing Data Types in SQL
When working with data in SQL you may need to convert one data type to another.
For example, you might have a number you want to work with, but it’s stored as a text string in your database.
That means you can’t use any functions meant for numbers on it (like ROUND, SUM, etc.) until you convert the data type to something numeric. The CAST function handles that conversion, and its syntax looks like this:
CAST(expression AS data_type)
Suppose we have a table called “sales” with the following columns: sale_id, item_id, price, and discount. The price column has a data type of FLOAT64, while the discount column has a data type of STRING.
If you want to calculate the discounted price, you will first need to convert the discount column to a
numeric data type.
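A sketch of that calculation; treating discount as a dollar amount stored as text is an assumption:

SELECT
  sale_id,
  item_id,
  price,
  price - CAST(discount AS FLOAT64) AS discounted_price
FROM sales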
The CAST function allows you to easily convert the discount column to a numeric data type, so you can perform calculations with the price column.
Tip #25: Fix Data Issues at the Source
Here’s one of the biggest lessons I’ve learned working with data.
It can be tempting to write the world's most epic CASE statement to clean up the data.
If new data is continuously pouring in, you're going to have to keep adding lines to that CASE statement
every time a new condition appears.
That is unsustainable.
Solving issues at the source can be more work initially, but will make your life so much easier in the long run.
Sometimes the best SQL tips don’t involve any SQL at all!
Tip #26: DDL, DML, & DQL — What’s the Difference?
As data analysts we’re used to writing SQL queries to pull data.
SQL statements are categorized into different types based on their purpose and functionality.
In this lesson, we'll discuss the three main types of SQL statements: DDL, DML, and DQL.
DDL (Data Definition Language) statements are used to define and manage the structure of a database and its objects (such as tables, views, and indexes). The main DDL statements include CREATE, ALTER, DROP, and TRUNCATE.
DML (Data Manipulation Language) statements are used to manage and manipulate the data stored within database objects. The main DML statements include INSERT, UPDATE, and DELETE.
The main DQL (Data Query Language) statement is the SELECT statement, which allows you to retrieve data from one or more tables, apply filters, and aggregate or sort the results.
…
ou don’t need to be an expert at these to be a data analyst, but some familiarity with them can
Y
definitely be helpful!
DDL:
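A couple of hedged examples of DDL statements; the customers table and its columns are hypothetical:

CREATE TABLE customers (
  customer_id INT64,
  name STRING
);

ALTER TABLE customers ADD COLUMN email STRING;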
DML:
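And hedged examples of DML statements against that same hypothetical table:

INSERT INTO customers (customer_id, name, email)
VALUES (1, 'Ada', 'ada@example.com');

UPDATE customers SET name = 'Ada Lovelace' WHERE customer_id = 1;

DELETE FROM customers WHERE customer_id = 1;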
Tip #27: A Quick SQL Tip to Save You Time
Sometimes you will have to work with tables that have A LOT of columns.
You either have to manually type out each column you want in the SELECT statement.
Or you can manually type out the columns you don’t want with a SELECT * EXCEPT statement.
Or, better yet, you can write a SQL statement that outputs the column names for you!
SELECT column_name
FROM project_id.dataset_id.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'mytable'
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'schema name'
  AND table_name = 'table name'
These queries will output the column names in the SQL results.
Then you can copy/paste them all at once and just delete the ones you don’t want.
Tip #28: Speed Up Your Queries By Joining on This Data Type
Here’s a quick SQL tip to help your queries run faster.
There's nothing that bogs down a data analyst like a slow SQL query.
One way to improve the performance of your queries is to join tables using integer values instead of strings.
Integers take up less space in memory than strings, leading to faster joins and more efficient
processing.
When joining on strings, the database has to compare each character in the string, which can be time-consuming, especially for longer strings or large datasets.
Additionally, string comparisons are often case-sensitive, which can lead to slower performance.
If your database contains numbers stored as strings, you can convert them to integers using the CAST
or CONVERT functions, depending on your SQL dialect.
WITH string_id_table AS (
  SELECT
    CAST(id AS INT64) AS id,
    other_columns...
  FROM table
)
SELECT
  columns...
FROM string_id_table s
LEFT JOIN table2 t
  ON s.id = t.id
Don't let strings slow you down – optimize your queries by using integers whenever possible. This tip won't fix all your problems, but it should help.
And it’s good to keep things like this in mind to understand how SQL is executing under the hood.
Happy analyzing!
Tip #29: Understanding Column vs. Row Store Databases
Are you gearing up for data analyst interviews?
One key area you might encounter is the difference between column and row store databases.
Row Store Databases (RDBMS): Think of a row store database like a traditional spreadsheet where data is stored in rows.
This is the most common type of database you might have already used.
It's excellent for operations involving the entire row of data, like entering a new record or retrieving all
details of a specific entry.
It works wonders when you need to process transactions, which is why it's a go-to for online retail,
banking systems, and other transaction-oriented applications.
Column Store Databases: Now, imagine instead of storing data by rows, you store it by columns.
Each data column is stored separately, making it faster to retrieve all the information in a single column across multiple rows.
It's a star performer when dealing with analytics and data warehousing, as these operations often
require heavy reading and aggregation of specific attributes (like summing up all sales in a month).
- Performance: In row stores, if you need to access data from a few columns, you end up scanning entire rows, which can be slow. Column stores, on the other hand, can quickly fetch only the required columns, speeding up query times.
- Storage Efficiency: Column stores can compress data more effectively because the data in a
column is typically more uniform than data in a row.
- Real-world Applications: Knowing the difference can help you suggest the right database for the job. For instance, if your company deals with heavy report generation, a columnar database might be recommended.
Interview Tip: When asked about databases in an interview, show that you understand not only the technical differences but also how these affect business decisions and performance.
Mentioning scenarios like data warehousing for column stores and transaction processing for row stores can demonstrate practical knowledge.
Remember, it's not about memorizing definitions. It's about understanding concepts and their impact on the real world of data.
Tip #30: The DATE_TRUNC Function
Data analysts often work with dates in their data.
Knowing how to use the DATE_TRUNC function in SQL is important because it helps break down dates into useful time periods, like days or months.
DATE_TRUNC takes two arguments:
1. Timestamp Column: The date or timestamp column that you want to truncate.
2. Time Unit: The unit to truncate to (e.g., year, month, day, hour).
This function returns a date truncated to the beginning of the specified time unit, making it easier to group and compare data across time periods.
Let's assume we have a table “sales” with columns “sale_date” (a timestamp) and “amount” (sale amount).
If you want to analyze monthly sales, you can use DATE_TRUNC to truncate sale_date to the month
level:
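A sketch of that monthly rollup, following the function call described below:

SELECT
  DATE_TRUNC(sale_date, MONTH) AS sale_month,
  SUM(amount) AS total_sales
FROM sales
GROUP BY sale_month
ORDER BY sale_month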
In this query, DATE_TRUNC(sale_date, MONTH) truncates the sale_date to the first day of each
month, and the SUM(amount) calculates the total sales for each month.
The GROUP BY clause groups the results by the truncated date, allowing you to see the total sales for each month.
Remember, the precision of DATE_TRUNC can be changed to other units like year, day, hour, etc., depending on your needs!
Tip #31: FULL OUTER JOIN in SQL
Most people skip learning this SQL join, but that is a mistake.
It’s a lesser used but powerful tool when you need it.
FULL OUTER JOIN is a type of join that combines the results of both LEFT JOIN and RIGHT JOIN.
It returns all rows from both tables, and if there's no match between the tables, NULL values are used
for the columns where no match is found.
This join operation is useful when you want to retain all the data from both tables, even if there's no direct match between them.
FULL OUTER JOIN is different from INNER JOIN and LEFT JOIN in the following ways:
- INNER JOIN returns only the rows where there's a match between the joining columns in both
tables.
- LEFT JOIN returns all the rows from the left table and the matched rows from the right table. If
no match is found, NULL values are used for the columns from the right table.
Suppose we have an “ad_spend” table of advertising spend by ad_id, and an “orders” table that stores information about orders attributed to that ad spend.
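A hedged sketch of the join; the ad_id key comes from the example, while the spend and revenue columns are assumptions:

SELECT
  a.ad_id,
  a.spend,
  o.order_id,
  o.revenue
FROM ad_spend a
FULL OUTER JOIN orders o
  ON a.ad_id = o.ad_id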
In this result set, you can see that the FULL OUTER JOIN includes all the rows from both tables, even
when there's no match between the ad_id columns.
The unmatched row from the orders table (with order_id 5) has NULL values for the columns from the ad_spend table.
So use FULL OUTER JOIN when you want to retain all the data from both tables, even if there's no direct match between them.
This can be useful in situations where you need to combine data from separate but related tables, such as ad spend and orders.
Tip #32: Dealing with NULL Values in SQL
When working with data in SQL, you’ll inevitably encounter NULL values.
The COALESCE function is a handy tool that can help you out!
COALESCE can be particularly useful when you want to provide a default value for a column that may have NULL values.
Suppose you have a table called “products” with product_id, name, price, and discount.
In this table, the discount column contains NULL values for some products.
You want to calculate the final price of each product after applying the discount.
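A sketch of that calculation, treating discount as an amount off the price (whether it's an amount or a rate isn't specified, so this is an assumption):

SELECT
  product_id,
  name,
  price,
  price - COALESCE(discount, 0) AS final_price
FROM products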
The COALESCE function ensures that if the discount is NULL, it is treated as 0, allowing you to calculate the final price correctly.
It allows you to provide default values for NULLs when needed.
This can help you avoid unexpected results when performing calculations or displaying data, making it a useful tool for data analysts.
Tip #33: 4 Common SQL Mistakes and How to Avoid Them
SQL is hard.
It’s ok to make mistakes but you want to correct them as soon as possible. Here are four common
beginner SQL mistakes and how to correct them:
When duplicates show up in a query's results, I often see people just slap a DISTINCT in there to make sure there aren't any duplicates.
That's sloppy.
Consider alternative solutions, such as using window functions, to improve performance and customization of your deduplication.
A common mistake is using the comparison operator (=) instead of IS NULL or IS NOT NULL when comparing NULL values.
This can result in incorrect results, as NULL values are treated differently in SQL.
Understand how to handle NULL values in your queries to avoid incorrect results.
Another common mistake is using the wrong data type for your columns.
For example, using a text data type for a column that should be a number can result in errors when performing mathematical operations.
It's also not as efficient to join on text as it is on integers. Ensure you use the correct data type for each column.
This will make your code so much easier to understand when you are looking back on it three months later trying to remember what you did!
...
Avoid these common mistakes to improve your SQL skills and ensure accurate, efficient data analysis.
Tip #34: How to Rank Values in SQL
Data analysts often have to rank data.
But this can be tricky because in SQL there are three ranking functions that do nearly the same thing.
1. ROW_NUMBER
2. RANK
3. DENSE_RANK
The main difference is in how they handle a tie (i.e. equal values).
The example data, query, and output are shown below with examples of each ranking function.
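A sketch of all three side by side; the salespeople table and salesperson column are assumptions, while sales_amount comes from the explanation below:

SELECT
  salesperson,
  sales_amount,
  ROW_NUMBER() OVER (ORDER BY sales_amount DESC) AS row_number_result,
  RANK() OVER (ORDER BY sales_amount DESC) AS rank_result,
  DENSE_RANK() OVER (ORDER BY sales_amount DESC) AS dense_rank_result
FROM salespeople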
As you can see, RANK, ROW_NUMBER, and DENSE_RANK don't produce the same rankings.
ROW_NUMBER() assigns unique values to each row, even if the sales_amount is the same. It will not show any ties.
Use this if you want to make 100% sure you have a tie breaker.
RANK() assigns the same rank to the rows with the same sales_amount and leaves gaps in the ranking sequence.
This is similar to how a sporting event like golf would rank players in a tournament.
DENSE_RANK() assigns the same rank to the rows with the same sales_amount without leaving gaps in the ranking sequence.
Knowing the difference between these three can ensure you always use the right one!
Tip #35: Write Faster SELECT Statements with EXCEPT
Here’s a SQL tip to save you a future carpal tunnel diagnosis.
Ever find yourself typing out way too many column names in your SELECT statement?
SELECT EXCEPT is a useful feature in many SQL dialects that lets you select all columns in a table except for the ones you choose.
This can be super helpful when working with large datasets containing many columns.
Instead of typing out every column you do want, you can use SELECT * EXCEPT to drop just the ones you don't need.
For example, if you want to query all columns except “start_station_name” and “end_station_name,” you can use SELECT EXCEPT as follows:
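A sketch (BigQuery syntax); the table name is a placeholder:

SELECT * EXCEPT (start_station_name, end_station_name)
FROM bike_trips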
This will return a result with all columns except for the specified ones. Next time your table has more columns than you’d care to type…
Tip #36: A Cleaner Way to Filter on Window Functions
One hidden gem SQL command I love is QUALIFY.
This can be particularly useful when you want a single row for each window group based on specific criteria, without the need to write a separate Common Table Expression (CTE) or subquery.
Suppose you want to find the most recent purchase for each customer.
You can use QUALIFY along with ROW_NUMBER() to achieve this without using a CTE or subquery.
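A sketch, assuming a purchases table with customer_id, purchase_id, and purchase_date columns:

SELECT
  customer_id,
  purchase_id,
  purchase_date
FROM purchases
WHERE TRUE -- some dialects want a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY purchase_date DESC) = 1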
Remember QUALIFY next time you are working with window functions!
Tip #37: Simplify Your CASE Statements
Did you know you can simplify your CASE WHEN statements in SQL?
This can make your life a lot easier when you have a lot of conditions.
Here's how…
A typical CASE statement for classifying countries would look something like this:
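A sketch, assuming a users table with a country column:

SELECT
  country,
  CASE
    WHEN country = 'US' THEN 'North America'
    WHEN country = 'CA' THEN 'North America'
    WHEN country = 'MX' THEN 'North America'
    WHEN country = 'GB' THEN 'Europe'
    ELSE 'Other'
  END AS region
FROM users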
To reformat this in a simpler way you can write it like this:
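The same logic using the simple CASE form:

SELECT
  country,
  CASE country
    WHEN 'US' THEN 'North America'
    WHEN 'CA' THEN 'North America'
    WHEN 'MX' THEN 'North America'
    WHEN 'GB' THEN 'Europe'
    ELSE 'Other'
  END AS region
FROM users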
This simpler approach prevents you from having to write out the entire condition 'country =' for each statement.
Adding little shortcuts like this to your repertoire can make your life as a data analyst much easier!
One caveat is this will only work when your condition is a literal match like column = 'something.' It won't work with other types of comparison like greater than, LIKE, etc.
Tip #38: Appending Together Two Datasets
When working with data, a common task is appending one table of data to another.
Appending data is kind of like adding more pages to the end of a book or adding more items to the end of a shopping list.
The original data stays the same, but new data is added to the end.
For example, if you have a list of names like this: Alice, Bob, Charlie
And you append the name "Dave" to the list, the list would then look like this: Alice, Bob, Charlie, Dave.
In SQL, you append one dataset to another with UNION, which comes in two forms:
1. UNION ALL
2. UNION DISTINCT
While they both serve the purpose of merging query results, there are differences in their behavior and use cases.
UNION DISTINCT: The UNION DISTINCT operation combines the results of two or more SELECT queries and removes any duplicate rows.
Use UNION DISTINCT when you want to merge results from different queries and ensure that the output contains only unique rows, without duplicates.
UNION ALL: The UNION ALL operation combines the results of two or more SELECT queries and retains all rows, including duplicates.
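A minimal sketch, with hypothetical yearly order tables:

SELECT customer_id, order_date FROM orders_2023
UNION ALL
SELECT customer_id, order_date FROM orders_2024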
Since UNION ALL does not perform any duplicate elimination, it is generally faster than UNION DISTINCT.
Use UNION ALL when you want to merge results from different queries and keep all the rows, including duplicates, or when you know that there are no duplicates in the result sets and want to improve the query performance.
Append away!
Tip #39: Identifying Duplicates in Your Dataset
Finding duplicates in your data doesn’t have to be a headache.
Just memorize this common SQL pattern for finding dupes and you'll be set.
With SQL, you can use GROUP BY and HAVING to quickly identify rows that appear more than once.
For example, you can spot duplicate user registrations or redundant entries in any table.
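A sketch of the pattern, using a hypothetical users table and email column:

SELECT
  email,
  COUNT(*) AS occurrences
FROM users
GROUP BY email
HAVING COUNT(*) > 1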
Keep this common SQL pattern in mind next time you need to find a duplicate.
Tip #40: Working with Unstructured Data in SQL
Learning to work with data that's not in a table format is a big step as a data analyst.
ou’re probably used to seeing data formatted as rows and columns, like in a spreadsheet or SQL
Y
table.
JSON (JavaScript Object Notation) is just a different way to format data that’s not rows and columns.
hy is JSON important? In today's interconnected world, data is everywhere and it's often served to us
W
via APIs (application programming interfaces).
PIs use JSON to deliver data because it's light, easy to understand, and can be used by many
A
programming languages.
Now, let's imagine you've received a JSON file from an API and you want to work with this data in SQL.
Suppose we have a table named “user_profiles” with the following schema and data:
SQL has functions to help us handle JSON data.
One common SQL function for working with JSON is the JSON_VALUE function.
It takes two arguments:
1. The JSON data column (profile_data).
2. A JSON path expression ('$.name') that specifies the key to extract the value from.
The $.name JSON path expression refers to the "name" key in the JSON data.
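A sketch of pulling values out of the JSON column; the user_id column and the city key are assumptions for illustration:

SELECT
  user_id,
  JSON_VALUE(profile_data, '$.name') AS name,
  JSON_VALUE(profile_data, '$.city') AS city
FROM user_profiles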
Knowing how to extract specific values or manipulate JSON data can greatly enhance your ability to analyze and work with such data.
Remember, JSON and SQL might seem like different languages, but with a little practice, you'll be able to translate between them with ease. It's all part of your journey to becoming a versatile data analyst.
Embrace it, play around with it, and soon, you'll be extracting and analyzing data like a pro.
Happy analyzing!
Tip #41: The EXTRACT Function
While working in SQL you might need to extract specific parts of a date or time such as the year, month, or day.
The EXTRACT function takes two things:
1. The unit (e.g., year, month, day) you want to extract.
2. The date or time value from which to extract it.
Let's consider an example in which we have a table named "employees" with a hire_date column.
Here are examples of how to extract the YEAR, MONTH, and DAY from the hire_date column in our table.
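A minimal sketch of those three extractions (the employee_name column is assumed):

SELECT
  employee_name,
  EXTRACT(YEAR FROM hire_date) AS hire_year,
  EXTRACT(MONTH FROM hire_date) AS hire_month,
  EXTRACT(DAY FROM hire_date) AS hire_day
FROM employees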
As a data analyst, understanding how to use the EXTRACT function in SQL is essential for working with date and time data.
Tip #42: A SQL Function for Cleaning Data
The REPLACE function in SQL is a handy tool for cleaning data.
It allows you to replace any characters you want with anything else.
It's commonly used to remove characters from a string of text by replacing them with an empty string "".
REPLACE(input_string, search_string, replacement_string)
Let's consider an example where we have a table named "users" with email addresses formatted with <> around them.
We want to remove the <> characters from the email addresses.
We can use the REPLACE function in a query to achieve this.
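A minimal sketch of that query (the email column name is assumed):

SELECT
  REPLACE(REPLACE(email, '<', ''), '>', '') AS cleaned_email
FROM users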
In this example, we use nested REPLACE functions to remove both < and > characters.
The inner REPLACE function removes the < character by replacing it with an empty string (''), and the outer REPLACE function removes the > character in the same way.
This can be particularly helpful when working with datasets that contain inconsistencies, typos, or specific formatting that needs to be adjusted before analysis.
While there are more customizable solutions (like regex) for cleaning data, the REPLACE function offers a simple and efficient method for making these adjustments directly within your SQL queries, without the need for external tools or programming languages.
Tip #43: How to Deal with a Divide by Zero Error
If you’ve been an analyst for long, you have probably run into this common error: Can’t divide by zero.
SAFE_DIVIDE is a function in SQL that allows you to perform division operations while safely handling potential divide-by-zero errors.
If a divide-by-zero error is encountered, instead of causing your query to crash or return an error, the
SAFE_DIVIDE function will return NULL.
Suppose we have a table named "sales" with product_id, total_revenue, and units_sold columns, where product_id 2 has zero units sold.
We want to calculate the average revenue per unit for each product. If we try to divide total_revenue by units_sold directly, we will encounter a divide-by-zero error for product_id 2.
To avoid this error and handle the division safely, we can use the SAFE_DIVIDE function. It takes two arguments:
1. The numerator (total_revenue).
2. The denominator (units_sold).
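A minimal sketch of the query:

SELECT
  product_id,
  SAFE_DIVIDE(total_revenue, units_sold) AS revenue_per_unit
FROM sales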
When the denominator is zero, as in the case of product_id 2, the function returns NULL instead of causing an error.
As a data analyst, understanding how to use functions like SAFE_DIVIDE to handle potential errors in your SQL queries is important.
It helps you maintain the stability and reliability of your analysis, allowing you to work with data that may
contain unexpected zero values or other edge cases that could otherwise cause your queries to fail.
Note: SAFE_DIVIDE is what this function is called in BigQuery. Other SQL environments may have a different name for this function, but it is essentially the same thing. To find yours, just type "safe divide [insert platform, i.e. Snowflake, Postgres, etc.]" into Google.
Tip #44: My Most Used Data Cleaning Function
I reviewed all the SQL code I've written recently, and one data cleaning function came up more than any other.
The LOWER function in SQL is a string manipulation function that converts text to lowercase.
It is useful when you need to normalize text data, such as email addresses, to ensure consistent
formatting and avoid issues with case sensitivity in your queries.
Email addresses are often entered by users in a free-form manner, leading to inconsistent capitalization.
Some SQL environments are case-sensitive, which means that 'kyle@thequery.io' and 'Kyle@thequery.io' would be treated as different email addresses, even though they represent the same person.
By using the LOWER() function, you can standardize email addresses to lowercase, ensuring that your queries and joins work correctly regardless of the original capitalization.
Usage is straightforward.
Let's consider an example where we have a table named "users" with email addresses in various capitalizations.
We want to convert all email addresses to lowercase, so we use LOWER() to achieve this.
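A minimal sketch:

SELECT
  LOWER(email) AS normalized_email
FROM users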
As a data analyst, understanding how to use the LOWER() function in SQL is essential, as it allows you to standardize and clean text data within your queries.
Tip #45: Referencing a Column Alias in the WHERE Clause
A common SQL mistake is attempting to reference an alias in the WHERE clause.
The WHERE clause can only reference the columns in the FROM clause at the time of execution.
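A minimal sketch of the mistake and the fix (table and column names are hypothetical):

-- This fails: net_revenue doesn't exist yet when WHERE runs
SELECT
  revenue - cost AS net_revenue
FROM sales
WHERE net_revenue > 0

-- This works: repeat the expression (or wrap the query in a CTE/subquery)
SELECT
  revenue - cost AS net_revenue
FROM sales
WHERE revenue - cost > 0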
Tip #46: How to Learn Joins
If you are struggling with JOINs in SQL, here is my recommendation...
One of the key concepts in SQL is understanding the different types of relationships that can exist between tables.
1. One-to-Many
2. One-to-One
3. Many-to-Many
Just as a parent can have multiple children, a table in a one-to-many relationship can have multiple records in another table.
For example, in a database of customers and orders, each customer can have multiple orders.
This relationship is represented by a foreign key in the orders table that references the primary key of the customers table (customer id).
In a one-to-one relationship, a record in one table can only be related to one record in another table.
For example, in a database of customers and their login credentials, each customer only has one set of login credentials (username/password combo).
Just as a recipe can have multiple ingredients and an ingredient can be used in multiple recipes, a record in a many-to-many relationship can be related to multiple records in another table, and vice versa.
For example, in a database of students and classes, a student can be enrolled in multiple classes, and a class can have multiple students.
This relationship is represented by a junction table, also known as a join table, which contains the foreign keys of both tables.
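For instance, a minimal sketch of querying through a junction table (the enrollments, students, and classes names are hypothetical):

SELECT
  s.student_name,
  c.class_name
FROM enrollments e
JOIN students s ON e.student_id = s.student_id
JOIN classes c ON e.class_id = c.class_id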
In order to effectively query your data, you need to understand the types of relationships different tables
have to each other.
This allows you to create a visual map in your mind of how the data is stored and what you need to do to shape the data the way you want for analysis.
This is typically one of the first things I do when coming across new data.
Start to think in terms of data and relationships and you will have a much better understanding of the data you are working with!
Tip #47: Generating All Possible Combinations with CROSS JOIN
A join type that's often overlooked (but can be super useful for certain analytics tasks) is the CROSS JOIN.
A CROSS JOIN returns every possible combination of rows from the two tables being joined.
This might sound weird at first, but it's great for scenarios where you need all permutations of two lists, like all possible color-size combinations, or when creating date "scaffolds" for time series.
It can also be used to help with inefficient joins, but that is a lesson for another time.
-- colors table
color
---------
Red
Green
Blue

-- sizes table
size
---------
Small
Medium
Large

SELECT
  c.color,
  s.size
FROM colors c
CROSS JOIN sizes s
ORDER BY c.color, s.size
It’s a quick way to generate permutations without manually listing them out or writing a subquery.
The output would look something like this:
color   | size
--------|--------
Blue    | Small
Blue    | Medium
Blue    | Large
Green   | Small
Green   | Medium
Green   | Large
Red     | Small
Red     | Medium
Red     | Large
Just remember that CROSS JOIN can produce very large result sets if your tables are big, so use it carefully!
P.S. This tip illustrates the basics of a CROSS JOIN. There is a lot more you can do with it. I'll teach you some crazy CROSS JOIN stuff another time.
Tip #48: Traversing Hierarchical Data with Recursive CTEs
Working with hierarchies (like organization charts, folder structures, subordinates in a company, etc.) can be tricky in SQL.
You typically need to use a Recursive CTE, which can be a bit mind blowing. 🤯
A Recursive CTE is super helpful for these scenarios because it lets you write a query that references itself, naturally walking through data relationships of unlimited depth.
A recursive CTE has two parts:
- Anchor Query: This is the starting place – the "root node" or "top-level" record.
- Recursive Query: This part references the CTE itself, allowing it to progress from one level of hierarchy to the next.
Let’s say we have a table called “employees” with the following columns:
employee_id  manager_id  employee_name
-----------  ----------  -------------
1            NULL        Emily (CEO)
2            1           Sarah
3            1           Carlos
4            2           Judy
5            4           Aaron
WITH RECURSIVE employee_hierarchy AS (
  -- Anchor query: start with the top-level employee (no manager)
  SELECT
    employee_id,
    manager_id,
    employee_name,
    1 AS level
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive query: pull in the next level of the hierarchy each pass
  SELECT
    e.employee_id,
    e.manager_id,
    e.employee_name,
    eh.level + 1
  FROM employees e
  JOIN employee_hierarchy eh
    ON e.manager_id = eh.employee_id
)
SELECT
  employee_id,
  manager_id,
  employee_name,
  level
FROM employee_hierarchy
ORDER BY level, employee_id
First, the Anchor Query finds the top-level employee (manager_id IS NULL). In our case, that's the CEO Emily.
Then, the Recursive Query joins employees back to the CTE (employee_hierarchy) using manager_id = employee_id, continuously pulling in rows for the next level down.
Each time the recursion runs, we add 1 to the level column so we can see where each employee falls in the hierarchy.
They're usually more straightforward than subqueries for multi-level joins, and you can control how you step through the data (e.g., track depth, filter partial hierarchies).
That’s it!
Once you see how the anchor and recursive parts work together, recursive CTEs become a powerful tool for any analyst handling hierarchical data.
Recursive CTEs can be tricky. To learn more about them, check out the BigQuery documentation.
Happy querying!
Tip #49: Filtering Aggregated Data with HAVING (vs. WHERE)
When you first learn SQL, you quickly meet the WHERE clause, which is used to filter rows before they're grouped.
But there's another clause called HAVING that might feel a bit mysterious at first.
HAVING is specifically used to filter aggregated results (like SUM, COUNT, AVG, etc.) after you've used GROUP BY.
If you forget to use HAVING and try to filter an aggregated column with WHERE, you’ll get an error.
Suppose we have a table named “sales” that shows sales from different regions:
region  sale_id  amount
------  -------  ------
North   1001     5000
North   1002     7000
South   1003     4000
East    1004     6000
East    1005     8000
West    1006     3000
West    1007     2000
North   1008     9000
SELECT
  region,
  SUM(amount) AS total_sales
FROM sales
GROUP BY region
HAVING SUM(amount) > 10000
ORDER BY total_sales DESC
Finally, we used HAVING SUM(amount) > 10000. Now that each region's amounts are summed, we filter to show only those with totals above 10,000.
If we tried to put SUM(amount) > 10000 in a WHERE clause, we’d get an error because you can’t filter
an aggregated value at the row-level filter stage.
Mastering HAVING is essential for intermediate SQL, especially when you're building reports or dashboards that need to show only groups meeting certain criteria.
Happy querying!
Tip #50: WHERE TRUE
One thing that can be a pain in SQL is figuring out what to put in your WHERE clause.
Here’s a weird tip I discovered recently that changed the way I write queries.
So let's say you're querying and playing around with different conditions in the WHERE clause, commenting different lines in and out as you go.
But it’s a pain when you need to comment out the FIRST condition in your WHERE clause.
Then you have to rearrange what order everything is in so the query doesn't break.
But starting your WHERE clause with "WHERE TRUE" makes exploring and tweaking queries so much easier!
Instead of commenting out filters one by one, you can simply toggle conditions without breaking the
query.
You can easily add or remove filters (just comment/uncomment) without worrying about syntax errors.
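A minimal sketch of the idea (table and filters are hypothetical):

SELECT *
FROM orders
WHERE TRUE
  -- AND order_date >= '2024-01-01'
  AND status = 'shipped'
  -- AND region = 'West'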
It’s a small trick, but it can save you so much time when exploring data or debugging long queries.
Plop a little “WHERE TRUE” in there and you’ll save yourself the headache.
Tip #51: An ORDER BY Trick
Did you know you can ORDER BY anything in SQL?
Let's say you only want a random group of 10 customers from your database. To do this, we can ORDER BY the RAND() function:
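A minimal sketch (BigQuery syntax; some dialects use RANDOM() instead):

SELECT *
FROM customers
ORDER BY RAND()
LIMIT 10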
Think of it like each row is running this function and getting a random number assigned to it.
Then at the end the rows are ordered by this random number that is assigned to each row.
That's how you get randomized ordering. Then we just take the first 10 results by using LIMIT.
But you can actually order by anything you want, not just RAND.
Learning little nuances like this helps you better understand how SQL works so you can write queries with confidence.
Tip #52: CTEs are Simpler Than You Think
I often see CTEs listed on “advanced SQL” lists.
Writing a CTE is actually one of the simplest ways to structure and organize your queries.
If you know how to write a SELECT statement, you’re already 90% of the way there.
- A CTE is just a temporary result set you can reference later in your query.
- It starts with the WITH keyword, followed by the name of your CTE and the query that defines it.
- If there are multiple CTEs you want to write, just separate them with a comma.
- You can think of it as a cleaner, more readable way to reuse logic in your SQL script.
Here’s an example:
WITH sales_by_region AS (
  SELECT
    region,
    SUM(sales) AS total_sales
  FROM sales_data
  GROUP BY region
)
SELECT *
FROM sales_by_region
WHERE total_sales > 10000
In this query we define a CTE called “sales_by_region” to calculate sales totals by region.
Then we use it to filter for regions with more than $10,000 in sales.
There are obviously more complex implementations than this, but this is the gist of it!
If CTEs have ever felt confusing to you, I hope this helps simplify things.
Tip #53: SQL JOINs Explained
SQL joins can feel like a mind-bending puzzle.
Don't worry!
This amazing visual by C.L Moffatt is here to clear things up for you.
It shows seven different ways you can join two tables together in SQL:
1. Inner Join
2. Left Join
3. Right Join
4. Full Join
5. Left Excluding Join
6. Right Excluding Join
7. Outer Excluding Join
Some might argue that the last three aren't technically joins, but we're going to consider them as such for simplicity.
- Inner Join: This is your go-to join, returning all matching records from both tables.
- Left Join: This fetches all records from the left table, and any matching records from the right table.
- Right Join: This is the opposite of a Left Join, bringing all records from the right table, and any matching ones from the left.
- Outer Join: This returns all records from both tables, matching wherever possible.
- Left Excluding Join: This pulls all records from the left table that don't have a match in the right table.
- Right Excluding Join: It fetches all records from the right table that don't find a match in the left table.
- Outer Excluding Join: This returns all records from both tables that do not match.
Side note: I have never actually used a RIGHT JOIN in practice, but nevertheless I still like this visual for completeness.
Trust me, once you get the hang of it, SQL Joins will no longer feel like a brain teaser!
Don't shy away from getting your hands dirty with some SQL Join exercises.
Tip #54: NULL Values in WHERE Clause
Null values in SQL can behave weirdly sometimes.
If you don't know this, it can cause unexpected query output leading to frustration.
Let's say you are trying to remove any rows from your “animals” table where the type is “dog.”
SELECT *
FROM animals
WHERE type <> 'dog'
This will remove any rows from the output with the type of “dog.”
Great.
But it will also not return any rows where the column is NULL.
To fix this, you need to add a COALESCE() function in your WHERE clause like this:
SELECT *
FROM animals
WHERE COALESCE(type, '') <> 'dog'
This will turn any null values into an empty string, which will evaluate to true when compared as not equal to "dog."
This issue caused me some major frustration with a query I was writing until I figured out what the problem was.
Tip #55: Calculating Running Totals with Window Functions
When you want to see how a metric accumulates over time (like daily sales or a running count of user signups), window functions are your go-to.
SUM(column_name) OVER (
PARTITION BY partition_column
ORDER BY ordering_column
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
This might look confusing at first, but it makes sense once you get the hang of it.
Say you have a sales table with these columns:
- Sale_date
- Product_id
- Amount
You’d like to see a running total of amount per product, day by day.
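A minimal sketch of that query (assuming the table is called sales):

SELECT
  sale_date,
  product_id,
  amount,
  SUM(amount) OVER (
    PARTITION BY product_id
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM sales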
PARTITION BY product_id groups the data so each product’s total is tracked independently.
Running totals are useful for things like trend analysis to see how a value (such as sales) grows over time, cumulative counts to track how many sign-ups or events have happened up to a certain date, and rolling metrics.
Tip #56: Generate a "Date Spine" to Avoid Missing Dates in Charts
Here’s a SQL tip I never anticipated needing.
Have you ever created a chart and noticed some dates missing?
That happens when your data table doesn't have entries for those dates – they don't show up in your query results.
A date spine solves this by ensuring every date in a range is present, whether there was any activity (sales, sign-ups, etc.) or not.
WITH date_spine AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-01-31', INTERVAL 1 DAY)) AS day
)
SELECT
  ds.day AS calendar_date,
  COALESCE(SUM(s.sales_amount), 0) AS daily_sales
FROM date_spine ds
LEFT JOIN sales_table s
  ON ds.day = s.sale_date
GROUP BY ds.day
ORDER BY ds.day
Using a LEFT JOIN ensures all dates from the spine appear, even if no matching sales row exists (which yields NULL, replaced by 0 via COALESCE).
This ensures your charts and reports have every date, preventing misleading gaps.
Omitting those days can skew averages or trends and make your charts look weird.
Understanding the technique of creating and joining on a date spine ensures you never miss a date again, even if your underlying data doesn't record an event for that day!
P.S. Another option is to create a “date dimension” table that has all the dates in it and join to that:
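A minimal sketch of that alternative (the date_dimension table and date_day column are hypothetical):

SELECT
  d.date_day,
  COALESCE(SUM(s.sales_amount), 0) AS daily_sales
FROM date_dimension d
LEFT JOIN sales_table s
  ON d.date_day = s.sale_date
GROUP BY d.date_day
ORDER BY d.date_day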
Tip #57: Validating String Length
When cleaning or validating text fields, you often need to check how long a string is.
The LENGTH function (or LEN in some SQL dialects) returns the number of characters in a string.
A common example is phone number validation — ensuring each phone number has the correct digit count.
Suppose your table “users” has a column “phone_number” with inconsistent formats.
One way you might detect which numbers are the wrong length is by using a CASE statement:
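A minimal sketch, assuming 10-digit numbers are the expected format:

SELECT
  phone_number,
  CASE
    WHEN LENGTH(phone_number) = 10 THEN 'Valid Format'
    ELSE 'Invalid Format'
  END AS phone_check
FROM users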
In this query, LENGTH(phone_number) checks how many digits the phone number has.
CASE classifies each phone number as either “Valid Format” or “Invalid Format.”
This allows you to quickly spot values that deviate from expected lengths (i.e., short or overly long strings).
It also improves quality control, preventing errors downstream by flagging problematic entries early
(such as phone numbers or ID fields).
The LENGTH function by itself might not always be that useful, but when paired with a CASE statement it can be quite powerful!
Tip #58: Combining Columns into a Unique Key with CONCAT
Here’s a trick I learned early in my data career that I’ve used dozens of times.
Using CONCAT, you can merge those columns into one unique key for easier referencing or deduping.
Say your sales table has these columns:
- Sale_date
- Store_id
- Product_id
- Amount
No single column is guaranteed unique, but (sale_date, store_id, product_id) together do form a unique record.
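A minimal sketch of building that key:

SELECT
  CONCAT(CAST(sale_date AS STRING), '-', CAST(store_id AS STRING), '-', CAST(product_id AS STRING)) AS unique_key,
  amount
FROM sales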
CONCAT(...) merges multiple columns into one string.
Casting ensures all pieces are treated as strings (important if some are dates/numbers).
Quickly spot exact duplicates by grouping or filtering on this new key, and simplify joins on multi-column matches by using the single key as a match field.
Ensure data integrity by making it clear that the combination of values is what defines each unique record.
When no natural primary key exists, building one with CONCAT can save you plenty of hassle down the road and keeps your data consistent and well-organized.
Tip #59: Optimize Your Query Performance and Cost
You can do all the query optimizations in the world.
But the single biggest thing that accounts for 99% of reducing costs and speeding up your queries is this: select only the columns you actually need (and avoid SELECT *).
Unlike traditional row-store databases (like MySQL or PostgreSQL), where the entire row is typically retrieved, column-store systems only read the columns you request.
By choosing fewer columns, you reduce the amount of data that has to be read from disk.
This directly translates to less I/O, lower processing time, and (especially in BigQuery) reduced query costs, because you're billed largely by how much data you scan.
Here’s an example:
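A minimal sketch of the two approaches (the big_table columns are assumed):

-- Not optimal: scans every column
SELECT *
FROM big_table

-- Better: scans only what you need
SELECT
  user_id,
  total_spend
FROM big_table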
In the “Not optimal” query, you scan every column from “big_table,” likely increasing cost and execution
time significantly compared to selecting only “user_id” and “total_spend.”
Bottom Line: Always aim to select only the columns you actually need in your queries.
It’s the quickest win for reducing cost and improving performance in a columnar data warehouse like
BigQuery.
By doing so, you’ll keep queries both fast and cost-effective.
Tip #60: Fact vs. Dimension Tables
One of the first SQL lessons I learned in my first job was the difference between fact and dimension tables.
At first it wasn’t obvious why it mattered — until I had to actually use it.
If you worked at a car dealership, the fact tables might be car inventory (what’s in stock) and car sales
(what’s been sold).
The dimension tables hold the descriptive attributes you slice those facts by. In this example, those could be the make of the cars (Ford, GM, Toyota, etc.), car type (SUV, sedan, truck), odometer range, and year.
If you've ever filtered a car search by price, mileage, or body type, you were interacting with dimension tables.
Typically, an analysis like this would involve something like sales and inventory turnover.
Understanding this helps you structure data efficiently, write better queries, and build dashboards that make sense.
If you’re working with analytics, it’s one of the most useful concepts to know!
Got a dataset? Try spotting which tables are the facts and which are the dimensions.
Tip #61: Aggregate Your Data at Multiple Levels with GROUPING SETS
I prefer simple analytics.
But there's always been one challenge: calculating metrics at different levels of granularity. Let's say you're tracking trial sign-ups and conversions.
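Here's a minimal sketch of the idea, assuming a hypothetical trials table with signup_date, channel, signups, and conversions columns (GROUPING SETS is supported in BigQuery and several other warehouses):

SELECT
  signup_date,
  channel,
  SUM(signups) AS total_signups,
  SUM(conversions) AS total_conversions,
  SAFE_DIVIDE(SUM(conversions), SUM(signups)) AS conversion_rate
FROM trials
GROUP BY GROUPING SETS (
  (signup_date),
  (channel),
  ()
)

Rows where a column isn't part of the active grouping set come back as NULL, and the empty set () adds an overall total row.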
Why This Works Well for Google Sheets:
- Pre-Aggregated Data: This single query precomputes all necessary granularities, allowing you to use simple SUMIFS or AVERAGEIFS in Google Sheets to reference different levels.
- Eliminates Extra Queries: Instead of running separate queries for daily, weekly, and channel-level conversion rates, this one query returns all the data you need in a single result set.
- Handles Calculated Metrics Correctly: Since conversion rate is a ratio, it must be computed at each level separately — this query ensures correct calculations rather than just summing rates incorrectly.
This method keeps everything clean, reduces query costs, and ensures that you get the correct calculations at every level.
Tip #62: Check for Multiple NULL Columns Using COALESCE
Normally, if you want rows where all three columns (x, y, z) are NULL, you might write:
WHERE x IS NULL
AND y IS NULL
AND z IS NULL
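A shorter way, sketched with the same COALESCE idea the tip is named after:

WHERE COALESCE(x, y, z) IS NULL

COALESCE returns the first non-NULL value in the list, so it only comes back NULL when x, y, and z are all NULL.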
This one-liner can make your queries simpler and easier to read when checking for columns that must all be NULL.
Tip #63: Calculate Moving Averages in SQL
A rolling average (or moving average) helps smooth out short-term fluctuations to highlight longer-term trends in time-series data.
In stock market analysis, for example, you might want a 5-day or 30-day rolling average of closing prices to see a smoothed trend rather than day-to-day noise.
You can compute a 5-day moving average of closing_price per stock symbol like this:
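A minimal sketch, assuming a stock_prices table with symbol, trade_date, and closing_price columns:

SELECT
  symbol,
  trade_date,
  closing_price,
  AVG(closing_price) OVER (
    PARTITION BY symbol
    ORDER BY trade_date
    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
  ) AS moving_avg_5_day
FROM stock_prices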
Using PARTITION BY symbol ensures that each stock (symbol) is processed independently.
ORDER BY trade_date defines the sequence of rows (from oldest to newest trades).
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW ensures that for each row, the window includes the current day plus the four previous days, totaling five days.
Here are a few more practical use cases for rolling averages:
Website Traffic or User Metrics: Monitor rolling 7-day or 30-day averages for daily active users, sign-ups, or page views.
Sales or Inventory Planning: Track product sales averages over flexible periods to better forecast inventory needs or spot seasonal effects.
You can also combine multiple moving averages (short and long-term) in the same query for richer analysis.
Whether it's financial data, user metrics, or daily sales, rolling averages help reduce noise so you can focus on real trends over time.
Happy analyzing!
Tip #64: Spot Checking and Testing
Sometimes you're not 100% sure your SQL results are correct, but you spot check a few rows, and they all look fine.
This is a good start, but it's even better to build systematic tests into your queries.
Here’s why:
Manually verifying a few rows (or edge cases) can quickly reveal mismatches between your expectations and the data.
If the small sample matches what you expect, it’s a positive sign, but it’s not proof that the entire
dataset is correct.
Adding automated tests via conditions in your queries that fail when something is wrong provides confidence.
Suppose you have an "orders" table and a related "line_items" table. Each order in "orders" has a "total_amount," and the "line_items" for that order contain a breakdown of products, quantities, and prices.
We want to ensure that the sum of the line items for each order actually matches the "total_amount" column in the "orders" table.
If there’s more than a tiny difference — say more than 1 cent — it likely indicates a data problem.
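Here's a minimal sketch of such a test query (column names are assumed from the description above):

SELECT
  o.order_id,
  o.total_amount AS expected_total,
  SUM(li.quantity * li.unit_price) AS line_item_total
FROM orders o
JOIN line_items li
  ON o.order_id = li.order_id
GROUP BY o.order_id, o.total_amount
HAVING ABS(o.total_amount - SUM(li.quantity * li.unit_price)) > 0.01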
In the query above, we sum up the extended price (quantity * unit_price) for each order and compare it
to “orders.total_amount.”
If “expected_total” in the “orders” table doesn’t match the actual sum of the line items (line_item_total)
within a 1-cent margin, we return those rows.
If any rows appear, you know exactly which order(s) need deeper investigation.
By writing targeted test queries like this, you'll build far more confidence in your data, and you'll catch subtle mistakes before they hit a dashboard or report!
Tip #65: Query Multiple Tables at Once with Wildcard Tables
When you have multiple tables with similar names and schemas (often seen in partitioned or sharded datasets), BigQuery's wildcard table feature lets you query them all in one go.
Instead of writing one query per table, you can reference them with a single wildcard pattern.
Suppose you store monthly data in separate tables "mytable_2023_01," "mytable_2023_02," and so on.
SELECT *
FROM `project_id.dataset_id.mytable_*`
WHERE _TABLE_SUFFIX BETWEEN '2023_01' AND '2023_03'
_TABLE_SUFFIX is an automatic column that BigQuery provides that indicates the matched substring from the table name (e.g., 2023_01).
WHERE _TABLE_SUFFIX BETWEEN '2023_01' AND '2023_03' filters which tables you're including in the query.
Using a wildcard table allows for convenience (one query for multiple tables, instead of multiple queries), scalability (easily handle many monthly, daily, or weekly tables without rewriting queries), and flexibility (combine it with standard SQL clauses like WHERE, GROUP BY, etc.).
Extra Tip: To limit costs, always filter by _TABLE_SUFFIX so you don't inadvertently scan every wildcard-matching table in your dataset.
Wildcard tables in BigQuery keep your SQL tidier and your process more efficient!
Tip #66: Creating Your Own SQL Functions to Reuse Logic
Sometimes, built-in SQL functions don’t cover exactly what you need.
Maybe you repeatedly filter out "test" emails from your data and hate copying the same WHERE clause everywhere.
In databases that support User-Defined Functions (UDFs), such as BigQuery, you can create your own function once and reuse it in multiple queries.
Let’s say your ecommerce table has rows with “test” email addresses you want to exclude every time.
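Here's a minimal sketch of such a UDF (the dataset, function name, and email list are hypothetical):

CREATE OR REPLACE FUNCTION mydataset.is_test_email(email STRING)
RETURNS BOOL
AS (
  LOWER(email) IN ('test@example.com', 'qa@example.com', 'demo@example.com')
);

-- Then use it in any query:
SELECT *
FROM mydataset.orders
WHERE NOT mydataset.is_test_email(customer_email)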
It returns TRUE if the passed-in email is in the specified list, FALSE otherwise.
You can update this function once to add or remove test emails.
Instead of typing a long WHERE clause for test emails in each query, you simply call the function once.
If you need to add new “test” addresses, just update the function, and all queries using it reflect the
change.
Tip #67: When to Use DISTINCT vs. GROUP BY
When you're trying to get unique records from a table, you might ask, "Should I use DISTINCT or GROUP BY?"
Both remove duplicates, but they serve different needs, and new analysts often get confused by which to use.
SELECT DISTINCT
  customer_id
FROM orders
DISTINCT removes duplicates, returning each customer_id once, ignoring any duplicates.
It’s ideal when you’re not trying to sum, average, or otherwise aggregate.
It’s good for quickly listing unique IDs or combinations of columns (e.g., (region, product_id) pairs).
SELECT
  customer_id,
  COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
Not only does GROUP BY ensure each customer_id is unique, but it also allows you to do calculations (like COUNT, SUM, etc.) for each group.
You can include multiple columns to group on and add as many aggregates as you need.
It’s good for summaries, totals, averages, or any aggregated stats across categories or groups.
Let’s say you want to count how many unique customers exist:
-- DISTINCT approach
SELECT
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
Here, DISTINCT directly creates unique rows for customer_id before counting.
Alternatively:
-- GROUP BY approach
SELECT
  COUNT(*) AS unique_customers
FROM (
  SELECT customer_id
  FROM orders
  GROUP BY customer_id
) subquery
You group by customer_id (one row per customer), and then count how many groups there are.
Use DISTINCT when you just need the unique rows themselves. Use GROUP BY when you want to summarize or aggregate data across those unique rows.
Happy querying!
Tip #68: Slowly Changing Dimensions
We've talked about fact tables vs. dimension tables.
But did you know there is a table structure that is like a combination of the two?
Slowly changing dimensions handle the scenario where dimension attributes change over time, yet you still need historical context.
For instance, a customer can move from San Francisco to Seattle, and you want to record both addresses in your history, not just overwrite it.
If you overwrote the address on every change, you’d lose historical info about where they used to live.
CREATE TABLE customer_dimension (
  customer_id INT,
  customer_name VARCHAR(100),
  address VARCHAR(200),
  city VARCHAR(100),
  state VARCHAR(2),
  start_date DATE,
  end_date DATE,
  current_flag BOOLEAN
)
When the customer's address changes, you "close" the old record by setting "end_date" to the change date and "current_flag" = FALSE.
Then you create a new record with the updated address, "start_date" = change date, "end_date" = NULL, and "current_flag" = TRUE.
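A minimal sketch of that two-step change (the customer_id and values are hypothetical):

-- Step 1: close the old record
UPDATE customer_dimension
SET end_date = '2024-06-01', current_flag = FALSE
WHERE customer_id = 123 AND current_flag = TRUE;

-- Step 2: insert the new record
INSERT INTO customer_dimension
  (customer_id, customer_name, address, city, state, start_date, end_date, current_flag)
VALUES
  (123, 'Jane Doe', '456 Pine St', 'Seattle', 'WA', '2024-06-01', NULL, TRUE);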
You can join your fact table to "customer_dimension" using the date range (start_date / end_date) to figure out which address was active at the time of a particular purchase.
You never lose the original data, and analysts can see exactly how a customer's address evolved over time.
Slowly changing dimensions ensure your data warehouse accurately reflects historical context, giving you deeper insights into how dimension attributes evolve over time!
Tip #69: Sometimes You Just Need a Giant CASE Statement
SQL can feel elegant — until you have a hundred messy conditions that need wrangling.
When your data is inconsistent or mislabeled, you might need a huge CASE statement to clean it.
Suppose you’ve got an events table with columns like “event_type,” “event_name,” and “ticket_price.”
The data is all over the place, and you need to bucket them into a few consolidated categories.
Sometimes the simplest (if not prettiest) approach is one giant CASE:
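A minimal sketch of the idea (the category labels and conditions are hypothetical):

SELECT
  event_name,
  CASE
    WHEN LOWER(event_type) IN ('concert', 'gig', 'live music') THEN 'Music'
    WHEN LOWER(event_type) IN ('game', 'match', 'tournament') THEN 'Sports'
    WHEN LOWER(event_name) LIKE '%conference%' THEN 'Business'
    WHEN ticket_price = 0 THEN 'Free Event'
    -- ...dozens more conditions as needed...
    ELSE 'Other'
  END AS event_category
FROM events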
If your data source is chaotic, a big CASE might be the cleanest interim fix.
Long-term, try standardizing your data at the source so you don't need a monstrous CASE in every query.
Another option is to create a lookup or mapping table if these categories are stable and repeated in many places. But even that can get unwieldy if conditions are extremely scattered.
There isn't always a magical SQL shortcut — huge CASE statements are sometimes inevitable.
Do your best to keep it organized, add comments, and remember that "done is better than perfect" when messy data meets tight deadlines.
Tip #70: Avoiding the “Double Counting” Pitfall
One of the trickiest mistakes for new SQL users is accidentally double-counting rows when joining or aggregating.
You might expect a certain total, then you see an inflated sum, and it’s not immediately clear why.
If you join a one-to-one table (like “customers”) to a one-to-many table (such as “orders”), your
“customers” data will replicate for every order, causing inflated totals.
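Here's a minimal sketch of the dedupe pattern described next (table and column names are assumed; QUALIFY is BigQuery syntax):

SELECT
  t.transaction_id,
  t.amount,
  u.email
FROM transactions t
JOIN (
  SELECT *
  FROM users
  WHERE TRUE
  QUALIFY ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) = 1
) u
  ON t.user_email = u.email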
QUALIFY ensures only one row per email (e.g., latest created user).
This prevents transactions from being duplicated due to multiple user rows.
If the joining table isn't guaranteed to be unique on the join key (or even if it is), I like to use QUALIFY ROW_NUMBER() to ensure it's always unique in the future.
Tip #71: Combine Multiple Rows into One with STRING_AGG
Ever needed to consolidate multiple text rows (like customer messages) into a single row for each conversation?
That's what STRING_AGG is for.
It allows you to concatenate values from multiple rows within a group, using a delimiter of your choice
(like a comma).
Say your messages table has these columns:
- Conversation_id
- Created_at
- Content
- … other columns
You can group messages by conversation and combine all content fields into one row:
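A minimal sketch (assuming the table is called messages):

SELECT
  conversation_id,
  MIN(created_at) AS first_message_at,
  STRING_AGG(content, ', ') AS full_conversation
FROM messages
GROUP BY conversation_id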
STRING_AGG(content, ', ') merges all content from the same group into a single string, separated by commas.
MIN(created_at) helps you track the earliest message if you need context on ordering.
If you only want certain message types or users, apply conditions in the WHERE clause or use a CTE.
STRING_AGG allows you to quickly see the entire conversation in one cell and provides a concise view rather than multiple rows.
You can also join or pivot your data more easily when you need aggregated text.
Tip #72: Parsing Text with REGEXP_EXTRACT: Don't Fear the Regex!
For many data analysts, regex (regular expressions) looks like a cryptic language.
But often you only need a few basic patterns to handle common tasks, like extracting a code from a messy string.
That's where REGEXP_EXTRACT (available in BigQuery and, under similar names, in other SQL dialects) comes in handy.
"Thank you for your purchase. Your order #67890 ships soon."
SELECT
  message,
  REGEXP_EXTRACT(message, r'order\s#(\d+)') AS order_id
FROM mydataset.orders_table;
- order\s# looks for the word "order" followed by whitespace and a hash sign (#).
- (\d+) is a capture group: \d means "digit."
- + means "one or more digits."
- Putting it in parentheses means we want to capture that numeric sequence.
REGEXP_EXTRACT(message, ...) returns only what's inside the first capture group, which is the digits right after #.
You don't need to master every advanced regex trick — knowing "digits \d," "wildcard .," "plus +," or "asterisk *" gets you a long way.
Instead of writing a big substring dance, REGEXP_EXTRACT zeroes in on exactly what you need.
Here are a few more regex quick tips:
- Escape Characters: Notice you often need double backslashes in some SQL dialects, like \\d+.
- Testing: Use an online regex tester tool to confirm your pattern before plugging it into SQL.
- Document: Regex can be cryptic — add comments explaining what your pattern does.
Regex can seem intimidating, but for routine text extraction, you’ll mostly rely on a few simple patterns.
With REGEXP_EXTRACT, you can slice through messy text fields like a pro, grabbing only the data you need.
Tip #73: Easily Group Data by Month with LAST_DAY
When building monthly reports or financial dashboards, you often need to bucket transactions into their month (not just by date).
The LAST_DAY function takes a date or datetime and returns the last day of that month, which you can then use as a consistent monthly identifier.
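A minimal sketch, assuming a sales table with sale_date and sales_amount columns:

SELECT
  LAST_DAY(sale_date) AS month_end,
  SUM(sales_amount) AS monthly_sales
FROM sales
GROUP BY month_end
ORDER BY month_end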
LAST_DAY(sale_date) converts each date to the last calendar day of its month (e.g., all April 2023 sales become 2023-04-30).
GROUP BY the resulting end-of-month value lumps all April sales into a single row (with the date 2023-04-30).
You get one row per month, which is perfect for monthly dashboards, trend lines, or financial summaries.
In some SQL engines, you can also group by DATE_TRUNC('month', sale_date) or something similar
— both approaches yield a clear monthly grouping.
LAST_DAY is especially nice if you prefer using an actual date to represent the month-end, which feels more intuitive in financial contexts.
LAST_DAY does the heavy lifting, turning your daily or even timestamped data into neat, month-based aggregates!
Tip #74: Anonymizing Personally Identifiable Information with a Hash Function
Sometimes you need to remove or obscure personal data (like emails, phone numbers, etc.) so that individuals can't be identified.
A hash algorithm (e.g., SHA256, MD5) takes the input (email, phone) and generates a unique fixed-length output string (the "hash").
The original data can't be reconstructed from the hash, protecting your dataset from exposing personal info.
SELECT
  user_id,
  email,
  TO_HEX(SHA256(email)) AS hashed_email
FROM mydataset.users
TO_HEX converts the binary hash output into a human-readable hexadecimal string.
- Possible Collisions: Theoretically, different inputs can produce the same hash, but strong algorithms like SHA256 make collisions extremely rare.
- Salting: To further enhance security (especially with common strings like "john@gmail.com"), you can add a random "salt" to the input before hashing, ensuring the hash is unique even if someone guesses the original.
Hashing is a simple yet powerful technique to anonymize personal data, preserving key analytical capabilities without revealing sensitive information.
Tip #75: Joining Beyond Simple IDs
Joins don’t always revolve around matching a single id column.
In real-world data, you might match on multiple conditions — including ranges and inequalities.
Sometimes, you join on a simple key, but other times, your join condition might look something like this:
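A minimal sketch of that kind of condition (the table and column names are hypothetical):

SELECT
  u.user_id,
  t.transaction_id
FROM users u
JOIN transactions t
  ON u.user_id = t.user_id
  AND t.transaction_date BETWEEN u.signup_date AND DATE_ADD(u.signup_date, INTERVAL 30 DAY)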
This isn't just a direct equality join — it confirms that the transaction happened within 30 days of sign-up.
JOIN subscriptions s
ON u.user_id = s.user_id
AND u.event_date BETWEEN s.start_date AND s.end_date
You can join on date ranges, categories, locations, and other conditions.
If you are ever stumped on a join, consider that you may need multiple criteria!
Tip #76: Parsing Dates with PARSE_DATE: Fixing Messy Date Formats
If you’ve worked with raw data, you’ve probably come across dates in all kinds of non-standard formats.
SQL needs dates in a consistent format (like YYYY-MM-DD) to perform calculations, filter ranges, and group by time.
That's where PARSE_DATE (or its sibling functions like PARSE_DATETIME) comes in — it helps convert messy date strings into proper date objects.
Let’s say you have a table with a column “raw_date” containing dates in various formats:
- 12/31/2023 (MM/DD/YYYY)
- 31-12-2023 (DD-MM-YYYY)
- 2023.12.31 (YYYY.MM.DD)
You can use PARSE_DATE to standardize them into a proper DATE column.
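One way to sketch this (the table name is assumed) is to try each of the three formats with the SAFE. prefix and keep whichever parse succeeds:

SELECT
  raw_date,
  COALESCE(
    SAFE.PARSE_DATE('%m/%d/%Y', raw_date),
    SAFE.PARSE_DATE('%d-%m-%Y', raw_date),
    SAFE.PARSE_DATE('%Y.%m.%d', raw_date)
  ) AS final_parsed_date
FROM my_table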
I am also prefixing PARSE_DATE with SAFE so it returns null if the date is not in the expected format.
The result is a clean "final_parsed_date" column that you can use for comparisons, aggregations, or reporting.
If multiple formats exist in one column, you may want to identify them with CASE statements and parse
them conditionally.
CASE
  WHEN raw_date LIKE '%/%/%' THEN PARSE_DATE('%m/%d/%Y', raw_date)
  WHEN raw_date LIKE '%-%-%' THEN PARSE_DATE('%d-%m-%Y', raw_date)
  ELSE NULL
END AS standardized_date
Data from APIs, CSVs, or external systems often use inconsistent date formats.
PARSE_DATE is your go-to for turning chaotic date strings into clean, consistent date values in SQL.
Once parsed, your dates are ready for analysis, and your queries will run smoother and produce accurate results!
Tip #77: Splitting Bad Rows into an Error Table with SAFE_CAST
When processing messy data, the last thing you want is a few invalid rows crashing your entire query or pipeline.
Let's say you have a "transactions" table, and the "transaction_amount" column is supposed to be numeric.
If it fails (e.g., 💸 200), it returns NULL instead of throwing an error.
Valid rows can then be selected below where id_status <> ‘error’ or vice versa.
This approach handles errors gracefully and keeps your process running even with bad data.
And invalid rows aren’t lost — they’re tracked for future investigation.
The same approach with SAFE_CAST can handle invalid dates, strings, or other data types.
Using SAFE_CAST to split data into valid and invalid paths ensures you can cleanly handle messy data!
Tip #78: Filtering on Boolean
Here’s a small SQL tip that makes your queries cleaner.
Let’s say you have a users table with a Boolean column called “is_active.”
If you want to find all active users, you might think you need to write:
SELECT *
FROM users
WHERE is_active = TRUE

But you can shorten it to:

SELECT *
FROM users
WHERE is_active
Since "is_active" is already a Boolean, the WHERE clause evaluates it as either TRUE or FALSE.
If you want to filter for inactive users (FALSE values), you might be tempted to write:
SELECT *
FROM users
WHERE is_active = FALSE

Instead, you can write:

SELECT *
FROM users
WHERE NOT is_active
This works the same way, but it’s shorter and easier to read.
Tip #79: Using GREATEST and LEAST to Compare Values in a Row
If you want to find the largest value in a list of columns or expressions, you might think of using MAX.
But in SQL, MAX is a reserved keyword for aggregating values across rows.
Instead, use the GREATEST (or LEAST) function to find the largest (or smallest) value across columns in a single row.
These functions work similarly but operate on individual rows, not across rows (unlike the MAX or MIN aggregate functions).
Both functions compare all the values provided in the list and return the largest or smallest, respectively.
If any value is NULL, the result will also be NULL unless you handle it with a function like COALESCE
or IFNULL.
Let’s say you have a table “student_scores” with three exam scores for each student.
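A minimal sketch (the exam column names are assumed, and 0 / 100 are placeholder defaults for missing scores):

SELECT
  student_id,
  GREATEST(IFNULL(exam_1, 0), IFNULL(exam_2, 0), IFNULL(exam_3, 0)) AS highest_score,
  LEAST(COALESCE(exam_1, 100), COALESCE(exam_2, 100), COALESCE(exam_3, 100)) AS lowest_score
FROM student_scores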
By combining GREATEST and LEAST with IF(), IFNULL(), and COALESCE(), this query robustly handles NULL values while providing meaningful comparisons within a row.
Tip #80: When to Use FIRST_VALUE and LAST_VALUE
When working with data, you frequently need to identify the first or last record in a group (or partition).
There are two common approaches:
1. MIN/MAX Aggregations
2. FIRST_VALUE/LAST_VALUE Window Functions
MIN and MAX return the smallest and largest value, respectively, in a group, so they are useful when you only need to know the extreme value (like the earliest or latest date) for each group.
They only return a single value and don't give you any additional context from the row that contains that value.
FIRST_VALUE and LAST_VALUE return the value from the first and last row, respectively, in a partition based on a specified order.
These are useful when you want the complete context (i.e., additional columns) from the first or last row — not just the extreme value.
If your only goal is to get the earliest and latest purchase dates for each customer, you can use MIN
and MAX.
But suppose you want to know not only the dates, but also the products from the first and last purchases.
MIN and MAX can’t do this because they only return a single value from the aggregated column.
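A minimal sketch of the window-function approach (assuming a purchases table with customer_id, purchase_date, and product columns):

SELECT DISTINCT
  customer_id,
  FIRST_VALUE(product) OVER (
    PARTITION BY customer_id
    ORDER BY purchase_date
  ) AS first_product,
  LAST_VALUE(product) OVER (
    PARTITION BY customer_id
    ORDER BY purchase_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_product
FROM purchases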
Choose the function that best fits the level of detail your analysis requires.
Tip #81: Demystifying Window Function Keywords
Window functions are awesome.
Window functions let you calculate metrics over partitions of your data.
At first, the basics of window functions like OVER(PARTITION BY column ORDER BY column) seem a bit confusing, but relatively simple.
But things can quickly feel overwhelming when you see keywords like ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
PARTITION BY splits the data into groups, ORDER BY sorts the rows within each group, and the window frame (specified with the ROWS clause) determines which rows in that partition are used for the calculation.
If you want a specific number of rows it might look like this:
- ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: Includes the current row and the two rows before it.
- ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING: Includes the current row and the next row.
The default behavior if you don’t specify a frame is often the entire partition.
But by defining a custom frame, you can get more granular control.
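For instance, a minimal sketch of a 3-day rolling sum (the table and columns are hypothetical):

SELECT
  sale_date,
  SUM(amount) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS rolling_3_day_sum
FROM daily_sales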
The keywords are just defining how much data to include.
Tip #82: How to Handle Missing Data
When your data contains missing or incomplete values, you have several strategies to ensure your queries return meaningful results.
1. Filtering:
For example:
SELECT *
FROM your_table
WHERE column IS NOT NULL
2. Default values: Substitute missing values with a default value using functions like COALESCE (or IFNULL in MySQL).
SELECT
  student_id,
  COALESCE(score, 0) AS score
FROM your_table
3. Imputation: Sometimes it's better to replace missing data with a calculated value (e.g., an average).
You can compute the average of the available data and then use it to fill in the gaps:
WITH avg_value AS (
  SELECT AVG(score) AS avg_score
  FROM your_table
  WHERE score IS NOT NULL
)
SELECT
  student_id,
  COALESCE(score, (SELECT avg_score FROM avg_value)) AS score
FROM your_table
Choose the strategy that best fits your data quality requirements and analysis needs.
Tip #83: Understanding the Difference Between NULL and Zero
As a beginner querying data, one thing that can trip you up is dealing with NULL vs. zeros.
They’re not the same thing, and mixing them up can lead to incorrect results in your analysis.
NULL means the data is missing or unknown: for example, a customer didn't enter their email address, so the email column is NULL. Zero, on the other hand, is an actual recorded value.
SQL also treats NULL differently from zero. For example, aggregate functions like SUM or AVG automatically ignore NULL values.
If you treat NULL as zero without thinking, you might misinterpret missing data as “nothing happened,”
when in reality, the data just isn’t available.
transaction_id | revenue
---------------|--------
1              | 100
2              | NULL
3              | 200
SELECT
  SUM(revenue) AS total_revenue
FROM sales

SELECT
  AVG(revenue) AS average_revenue
FROM sales
Only 100 and 200 are included, so the average is (100 + 200) / 2.
You don't need to explicitly filter NULLs — SQL ignores them in aggregate functions by default.
Sometimes, you may want to treat NULL as a specific value (like zero) to make the data easier to analyze.
SELECT
  SUM(COALESCE(revenue, 0)) AS total_revenue_with_defaults,
  AVG(COALESCE(revenue, 0)) AS average_revenue_with_defaults
FROM sales
Understanding the difference between NULL and zero is key to accurate analysis!
Tip #84: Understanding COUNT(*), COUNT(column), and COUNT(1)
As a data analyst, you'll frequently use the COUNT function to count rows or values.
But you might have noticed there are different ways to use COUNT, like COUNT(*), COUNT(column), and even COUNT(1).
What’s the difference between them, and when should you use each?
COUNT(*) counts every row, whether or not any column is NULL:

SELECT COUNT(*) AS total_orders
FROM orders

COUNT(column) counts only the rows where that column is not NULL:

SELECT COUNT(customer_id) AS valid_orders
FROM orders

COUNT(1) behaves the same way as COUNT(*):

SELECT COUNT(1) AS total_orders
FROM orders
Stick with COUNT(*) for total row counts and COUNT(column) when focusing on non-NULL values in a specific column.
And don’t worry — if you see COUNT(1), it’s simply another way of expressing the same idea.
Tip #85: Best Practices for Using INNER JOIN vs. LEFT JOIN and Filtering Data
When filtering data in a query that involves joins, you have two options for placing your filter conditions: in the JOIN's ON clause or in the WHERE clause.
Deciding where to put your filters depends on your join type and the result you want.
For an INNER JOIN, whether you put the filter in the JOIN's ON clause or in the WHERE clause, the results are generally the same because only matching rows are returned.
LEFT JOIN returns all rows from the left table, plus matching rows from the right table (with NULL for non-matches).
Filtering in the JOIN (ON) clause keeps all left-table rows. Conditions on the right table are applied during the join, so unmatched rows remain (with NULL in the filtered columns).
Filtering in the WHERE clause applies after the join. Any condition on right-table columns will filter out rows with NULL values, effectively turning a LEFT JOIN into an INNER JOIN for those conditions.
-- Using WHERE
SELECT
  c.customer_name,
  o.order_amount
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.order_amount > 50
Here’s an example for filtering in the LEFT JOIN:
SELECT
  c.customer_name,
  o.order_amount
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
  AND o.order_amount > 50
customer_name | order_amount
--------------|-------------
Alice         | NULL   -- (Alice's order_amount did not meet the condition)
Bob           | 75     -- (Bob's order qualifies)
Charlie       | NULL   -- (No order exists for Charlie)
If you put a right-table filter in the WHERE clause with a LEFT JOIN, rows with no match (where the
right table is NULL) will be filtered out, which might not be what you intend.
Example (Problematic):
SELECT
  c.customer_name,
  o.order_amount
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.order_amount > 50
When you place a filter on a right-table column in the WHERE clause of a LEFT JOIN, rows where that column is NULL get excluded.
As a result, the query behaves like an INNER JOIN for that condition.
This effectively means that only rows with a matching order (that meet the condition) are returned, just like with an INNER JOIN.
We want to be as explicit as possible when writing SQL, so you should change this to be an INNER JOIN with both filter conditions in the join.
- For INNER JOINs, filtering with the JOIN clause or the WHERE clause yields equivalent results.
- For LEFT JOINs, place filters for the right table in the JOIN clause to preserve all left-table rows.
- When you place a filter on a right-table column in the WHERE clause of a LEFT JOIN, that is essentially just an INNER JOIN, so that's what you should use instead to be more explicit.
Understanding where to put your filters helps ensure your query returns exactly the data you need.
It’s a detailed nuance but tips like this really help take your SQL skills beyond beginner!
Tip #86: Exploring and Validating Your Data with Simple Queries
When working with data, one of the most important steps is validating that your data is behaving as you expect.
It’s common to assume your SQL query is working perfectly, but a simple typo or misunderstanding
about your data can lead to errors in your analysis.
A great way to catch these issues is by writing small validation queries to ensure your data is consistent and your query logic is correct.
Here’s a consolidated list of different ways you can validate your data with simple SQL queries:
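For instance, a minimal sketch of a NULL check (the table and column names are assumed):

SELECT
  COUNT(*) AS total_rows,
  COUNT(email) AS valid_rows
FROM users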
A significant difference between “total_rows” and “valid_rows” indicates missing or invalid entries.
A discrepancy between "source_row_count" and "post_join_row_count" may signal issues with your join logic (dropped or duplicated rows).
There are many ways to write little queries that should return zero results if what you believe to be true about the data is actually true.
By running these small queries, you can immediately spot anomalies such as unexpected duplicates, missing values, or transformation errors.
Always validate your data and query logic with small checks along the way.
It might seem tedious, but these little queries can save you from major headaches down the line!
Tip #87: Why You Might Want to Use a View in BigQuery
When working with BigQuery, you might wonder why you'd use a view instead of just querying a table directly or creating a new table.
A view can be a powerful tool that can make your workflows more efficient, simplify complex queries, and provide real-time data without manual updates.
It doesn’t store data itself. Instead, it acts as a dynamic query that runs every time you access it.
Think of it as a window into your data — you define the query once, and it pulls the latest results every time you use it.
A view can be advantageous over a table because it provides real-time data updates.
Since views run the query each time they’re accessed, they always return the most up-to-date data.
A view that calculates total sales by region will always reflect the latest numbers without needing to recreate or update a table:
CREATE VIEW region_sales_view AS
SELECT
  region,
  SUM(sales_amount) AS total_sales
FROM sales
GROUP BY region
A lot of SQL environments (like BigQuery) also have options to "save query as view," making it easier for you.
Instead of rebuilding a summary table every hour, the view runs this query dynamically, saving time and
effort.
You can write a single query once, save it as a view, and let others use it without worrying about the underlying details.
Consider an example in which you create a query that calculates revenue by product and excludes test data.
Instead of copying and pasting this query every time, save it as a view:
CREATE VIEW clean_product_revenue AS
SELECT
  product_name,
  SUM(revenue) AS total_revenue
FROM sales
WHERE is_test = FALSE
GROUP BY product_name
The accounting team can now use this view in their queries:
SELECT *
FROM clean_product_revenue
WHERE product_name = 'Widget A'
They don’t need to know how the revenue is calculated — they just filter the results.
For example, the accounting team might want a report on monthly revenue.
SELECT *
FROM monthly_revenue
WHERE month >= '2024-01-01'
Instead of having multiple analysts write their own (potentially inconsistent) queries, a view
standardizes the approach.
Because views don't store data, they're cheaper than creating new tables.
- Performance: Since views re-run the query every time, they can be slower for large datasets compared to materialized tables.
- Complex Joins or Aggregations: If your view involves very heavy calculations, consider creating a materialized view or summary table instead for better performance.
If you want to simplify your workflow, save time, and provide others with easy access to complex
queries, consider using a view in BigQuery!
Tip #88: Using the REPEAT Function
The REPEAT function might seem like an odd one to use in SQL — why would you ever need to repeat a string multiple times?
While it's not something you'll use daily, there are real-world scenarios where this function can be surprisingly handy.
The REPEAT function takes a string (or bytes) and repeats it a specified number of times.
You’re tasked with highlighting priority levels for open support tickets.
The business team wants a simple visual indicator to represent the urgency of a ticket based on its priority level.
Instead of hardcoding this logic, you can use the REPEAT function to dynamically generate these
indicators based on the ticket priority.
You can use REPEAT to generate the visual indicators:
SELECT
  ticket_id,
  priority,
  description,
  REPEAT('*',
    CASE
      WHEN priority = 'Low' THEN 1
      WHEN priority = 'Medium' THEN 2
      WHEN priority = 'High' THEN 3
      ELSE 0
    END
  ) AS visual_indicator
FROM support_tickets
The "visual_indicator" column makes it easier for non-technical stakeholders to quickly see ticket urgency in reports or dashboards.
You can extend this logic to represent SLAs (service-level agreements), customer tiers, or any other classification that benefits from a repeated pattern.
Instead of manually coding symbols for each priority level, you can rely on the logic to dynamically
adjust as priority levels or business rules change.
- Use REPEAT to create a text-based progress bar (e.g., repeating # based on a percentage of completion).
- Use REPEAT to create fixed-length placeholders (e.g., masking credit card numbers).
- Quickly generate repetitive text patterns when testing queries or database performance.
Whether it's visual indicators, progress bars, or placeholder strings, REPEAT can make your SQL queries more functional and your reports more insightful.
Next time you need a little flair in your data, give it a try!
Tip #89: Using the SPLIT Function for String Manipulation
The SPLIT function is a powerful tool for breaking apart strings into arrays based on a delimiter.
Once you’ve split a string, you can access specific parts of the result using array indexing.
A common pattern is using SPLIT and grabbing the first element of the resulting array, which can be incredibly useful for tasks like cleaning or transforming messy data.
You can access specific parts of the resulting array using square brackets ([index]), where indexing starts at 0.
Say you’re working with a “users” table that stores email addresses, but you only need the username (the part before the @).
SPLIT(email, '@') breaks the email into an array: ['alice', 'example.com'].
1. SPLIT(column, 'delimiter')[OFFSET(0)]
2. SPLIT(column, 'delimiter')[OFFSET(1)]
If you wanted the 3rd element you would put “2”, etc.
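Putting it together for the email example, a minimal sketch (using the “users” table and “email” column described above):

SELECT
  email,
  SPLIT(email, '@')[OFFSET(0)] AS username  -- first element: everything before the @
FROM users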
Pro Tip: Pair SPLIT with other string functions like TRIM or UPPER for even more control when cleaning or transforming data.
Mastering this will make your string manipulation tasks much simpler!
Tip #90: Using the BAG_OF_WORDS Function for Text Analysis
When analyzing text data in BigQuery, you might think you need Python or specialized tools for tasks like identifying commonly used words in customer service messages or community interactions.
However, BigQuery’s BAG_OF_WORDS function can help you perform basic text analysis right inside SQL.
The BAG_OF_WORDS function takes a tokenized document (an array of words or terms) and returns an array of terms and their frequencies.
It essentially gives you a bag-of-words model (a foundational concept in text analysis) by showing how
often each word appears in a piece of text.
Imagine you work with a “customer_service” table that contains messages from your support team.
You want to find out the most common words customers are using to describe issues.
Step 1: Tokenize the Messages
You can use BigQuery’s SPLIT function to break the messages into individual words (tokens).
message_id | message_content
-----------|---------------------------------------------------------
1          | The delivery was late and the product is damaged.
2          | I need help with a refund for a damaged product.
3          | The product was not delivered on time.
SELECT
  message_id,
  SPLIT(LOWER(message_content), ' ') AS tokenized_message
FROM customer_service
message_id | tokenized_message
-----------|---------------------------------------------------------------
1          | ["the", "delivery", "was", "late", "and", "the", "product", "is", "damaged"]
2          | ["i", "need", "help", "with", "a", "refund", "for", "a", "damaged", "product"]
3          | ["the", "product", "was", "not", "delivered", "on", "time"]
Step 2: Use BAG_OF_WORDS to Count Words
Now, use the BAG_OF_WORDS function to calculate the frequency of each word in each message.
SELECT
  message_id,
  BAG_OF_WORDS(SPLIT(LOWER(message_content), ' ')) AS word_frequencies
FROM customer_service
message_id | word_frequencies
-----------|-----------------------------------------------------------------------
1          | [{"term": "the", "count": 2}, {"term": "delivery", "count": 1}, ...]
2          | [{"term": "i", "count": 1}, {"term": "need", "count": 1}, ...]
3          | [{"term": "the", "count": 1}, {"term": "product", "count": 1}, ...]
To analyze the most common words across all messages, flatten the array of terms and aggregate the counts:
SELECT
  word.term,
  SUM(word.count) AS total_count
FROM customer_service,
  UNNEST(BAG_OF_WORDS(SPLIT(LOWER(message_content), ' '))) AS word
GROUP BY word.term
ORDER BY total_count DESC
term     | total_count
---------|-------------
the      | 4
product  | 3
damaged  | 2
delivery | 2
was      | 2
Use BAG_OF_WORDS next time you want to dig into the words your customers or community are using!
Tip #91: You’re Never Going to Get the Data Exactly How You Want It—And That’s Okay
As a data analyst, one of the hardest lessons to learn isn’t about SQL syntax or complex joins — it’s about managing expectations.
Here’s the reality: in the world of modern data, you’re almost never going to get the exact data you want in the format you need.
Companies use dozens of software tools, such as CRMs, e-commerce platforms, marketing automation, analytics tools, and more.
Each of these systems stores data differently, with its own quirks, limitations, and gaps.
While APIs give us programmatic access to data, they rarely provide all the fields you want or the granularity you need.
For example, an API might provide order data but leave out refund information, or it might give you aggregated metrics but no raw data to drill into.
Many systems don’t store data indefinitely or don’t collect the data you need in the first place.
Even if you can get the data, you’ll probably spend more time cleaning it up (deduping, transforming, validating) than actually analyzing it.
If you’re banging your head against the wall trying to figure out why it’s so hard to get the perfect dataset, know this: everyone struggles with this.
It doesn’t mean you’re bad at your job or that you don’t know enough.
It’s okay to aim for 95%.
In most cases, you don’t need a perfect dataset to make meaningful insights.
If you can answer the big questions with the data you have, that’s usually good enough.
When you can’t get the exact data, think about workarounds:
It’s okay to say, “We don’t have data on X, but here’s what we can infer from Y.”
---------------
Imagine you’re tasked with analyzing customer lifetime value (LTV) for an e-commerce business, but
the API only provides total order revenue and a list of customer IDs.
While this might seem like a blocker, here’s how you can adapt:
- Focus on What You Have: Use total revenue as a proxy for LTV, knowing it’s not perfect.
- Estimate Missing Pieces: If you know the average refund rate or shipping cost, apply it as a general adjustment.
- Communicate Assumptions: Share your calculations and highlight the limitations so stakeholders understand what’s missing.
As long as you’re creative, flexible, and focused on delivering value, you’re doing exactly what a great data analyst should!
Tip #92: Why SELECT Columns Must Appear in the GROUP BY Clause (Unless Aggregated)
When working with SQL, you might run into an error saying something like:
“Column "X" must appear in the GROUP BY clause or be used in an aggregate function.”
This can be confusing at first, but it’s a fundamental rule of how SQL handles grouped data.
When you use GROUP BY in a query, every column in your SELECT clause must either appear in the GROUP BY clause or be wrapped in an aggregate function (like SUM or COUNT).
When you use GROUP BY, SQL groups rows based on the specified columns.
This is because SQL can’t return individual values for columns that aren’t part of the group — it doesn’t know which value to pick!
SELECT
  product_name,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_name
Here, “product_name” is in the GROUP BY clause because it defines how rows are grouped, and SUM(revenue) is aggregated.
If you try to include a column in the SELECT clause that isn’t in the GROUP BY or aggregated, SQL
won’t know how to handle it:
SELECT
  product_name,
  region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_name
Error: column "region" must appear in the GROUP BY clause or be used in an aggregate function.
This happens because SQL groups the rows by “product_name,” but “region” isn’t part of the grouping or aggregated.
If you want “region” in the results, you need to group by both “product_name” and “region”:
SELECT
  product_name,
  region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_name, region
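Alternatively, if you don’t need every region broken out, you can satisfy the rule by aggregating the column instead of grouping by it. A minimal sketch using ANY_VALUE (the choice of aggregate function here is just for illustration):

SELECT
  product_name,
  ANY_VALUE(region) AS sample_region,  -- picks one region per product group
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_name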
When in doubt, focus on the relationship between the SELECT clause and GROUP BY clause: they must align.
Tip #93: WHERE EXISTS
Here’s a SQL tip that most experienced analysts know, but many beginners don’t.
In your data analyst career, you might see queries that include WHERE EXISTS. It behaves a lot like WHERE IN: both check whether there’s at least one matching row in the subquery.
While they can sometimes be interchangeable, they also have subtle differences.
Understanding those differences can help you write more efficient queries.
EXISTS stops evaluating as soon as it finds the first match, making it efficient for certain use cases.
SELECT customer_id
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
)
This query returns all customers who have at least one order.
IN, on the other hand, retrieves all possible matches before filtering, which can make it less efficient with large datasets.
SELECT customer_id
FROM customers
WHERE customer_id IN (
  SELECT customer_id
  FROM orders
)
This query also returns all customers who have at least one order.
If you do spot EXISTS in someone else’s code, just remember it’s basically another way of saying,
“Does at least one row exist in this subquery?”
You’ll often get the same result, and you don’t need to change your approach unless you’re dealing with performance issues on very big tables.
Tip #94: Why You Should Always Comment Your Code
Writing SQL queries can feel like solving a puzzle.
When you’re deep into writing complex joins, subqueries, and window functions, you can see how the pieces fit together, and it all makes sense in that moment.
But what happens when you (or someone else) revisit that query six months later?
You won’t always remember why you wrote a query the way you did, especially when revisiting it weeks or months later.
Comments act as a time capsule, helping you recall your thought process.
If you work on a team, your coworkers will need to understand your queries; comments provide clarity and keep them from wasting time reverse-engineering your logic.
When something breaks or needs updating, comments make it easier to pinpoint what each part of your query is doing, saving valuable time.
- At the top of your query, briefly explain what it does.

-- This query calculates total revenue by product category and region for the current quarter.

- Use comments to break your query into logical sections, especially for complex operations.

-- Exclude test accounts (e.g., emails ending in '@test.com') and inactive users
WHERE email NOT LIKE '%@test.com'
  AND is_active = TRUE
- For quick clarifications, add inline comments next to specific lines.
SELECT
  product_id,
  SUM(revenue) AS total_revenue,  -- Aggregate revenue for each product
  COUNT(DISTINCT customer_id) AS unique_customers  -- Count unique buyers
FROM sales
-- This query calculates the monthly revenue per region and identifies
-- the top-performing region for each month.
WITH regional_revenue AS (
  -- Step 1: Aggregate monthly revenue by region
  SELECT
    DATE_TRUNC(order_date, MONTH) AS order_month,
    region,
    SUM(order_total) AS total_revenue
  FROM orders
  WHERE order_status = 'completed'  -- Exclude canceled or pending orders
  GROUP BY order_month, region
),

ranked_regions AS (
  -- Step 2: Rank regions by revenue for each month
  SELECT
    order_month,
    region,
    total_revenue,
    RANK() OVER (PARTITION BY order_month ORDER BY total_revenue DESC) AS revenue_rank
  FROM regional_revenue
)

-- Step 3: Filter to the top-performing region for each month
SELECT
  order_month,
  region AS top_region,
  total_revenue AS top_revenue
FROM ranked_regions
WHERE revenue_rank = 1
ORDER BY order_month
With comments, it’s clear what each step is doing, even for someone new to the query.
Commenting your SQL code might feel like an extra step, but it saves time, frustration, and confusion down the line.
Whether it’s for debugging, team collaboration, or just making your queries easier to revisit, clear comments are a small effort with a big payoff.
Tip #95: When to Use REGEXP_CONTAINS Instead of LIKE
Here’s one SQL lesson you won’t know you need until you are faced with it:
The LIKE operator is great for simple pattern matching, but it has its limitations.
When you need more complex or precise text filtering, REGEXP_CONTAINS (or REGEXP_LIKE in some databases) steps in as the superior tool.
- Advanced Pattern Matching: LIKE is limited to basic wildcards (% and _) and can only match one pattern at a time unless you use multiple OR conditions. REGEXP_CONTAINS can match multiple patterns and supports full regular expressions, allowing you to increase complexity.
- Case Sensitivity Control: LIKE behavior with case sensitivity depends on the database’s collation settings, while REGEXP_CONTAINS can explicitly match case-sensitive patterns.
We want to find rows where the column contains either “error” or “failed”:
-- Using REGEXP_CONTAINS
SELECT *
FROM logs
WHERE REGEXP_CONTAINS(message, r'error|failed');
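For comparison, here’s a rough sketch of the same filter written with LIKE, which needs one condition per pattern:

-- Using LIKE (requires a separate OR condition for each pattern)
SELECT *
FROM logs
WHERE message LIKE '%error%'
   OR message LIKE '%failed%';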
Also, LIKE cannot handle flexible character sets or ranges, but REGEXP_CONTAINS allows for more advanced matching.
Now let’s match rows where a column contains a 5-digit ZIP code:
-- Using REGEXP_CONTAINS
SELECT *
FROM addresses
WHERE REGEXP_CONTAINS(zip_code, r'^\d{5}$');
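For contrast, a sketch of the closest LIKE can get: five single-character wildcards, which also match letters and symbols, not just digits:

-- Using LIKE: '_' matches ANY single character, so 'ABCDE' would also pass
SELECT *
FROM addresses
WHERE zip_code LIKE '_____';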
By knowing the strengths and limitations of both, you can choose the right tool for each scenario, and save yourself a lot of frustration with complex text queries!
Tip #96: The SELECT Clause is Independent of the Query’s Processing
When writing SQL, remember that the SELECT clause only controls what you see in your results.
The logic in your WHERE, HAVING, ORDER BY, and other clauses runs regardless of whether their expressions appear in the SELECT clause.
WHERE, HAVING, ORDER BY, etc., perform filtering, grouping, and ordering of your data.
SELECT customer_id, order_date
FROM orders
WHERE order_amount > (
  SELECT AVG(order_amount)
  FROM orders
)
The WHERE clause is using a subquery to filter orders by comparing each order’s amount to the overall average order amount.
Even though the computed average isn’t shown in the SELECT results, it still influences which rows are returned.
The bottom line is you don’t have to include every part of your query’s logic in the SELECT clause.
This allows you to focus on displaying only the data you need, while the underlying logic runs independently in the background.
Tip #97: Don’t Use CASE Statements as Join Conditions
Here’s a mistake I see beginners make all the time: using a CASE statement as a join condition.
While it might work, it’s not good practice and can lead to performance issues, unnecessary complexity, and even query failures at scale.
A beginner might think to themselves, “I need to match on different conditions depending on the data, so I’ll just write a CASE statement in my JOIN clause.”
SELECT a.*, b.*
FROM table_a a
JOIN table_b b
  ON CASE
       WHEN a.column_x = 'A' THEN b.column_y = a.column_z
       WHEN a.column_x = 'B' THEN b.column_y = a.column_w
     END
At first glance, it seems like a clever way to account for complex logic in your joins.
But while SQL will allow this, it’s not the right approach.
Here’s why:
- Performance Issues: A CASE statement in a JOIN forces the database to evaluate conditions row by row, making it much harder to optimize. This can significantly slow down your query, especially on large datasets.
- Hard to Debug: It’s difficult to understand, maintain, and debug queries with conditional joins, especially as your logic grows more complex.
- Scalability Problems: What works on a small dataset might blow up on a larger one.
Complex join conditions can overwhelm the query planner, leading to compute errors or even timeouts.
The better approach is to move the conditional logic out of the JOIN; this simplifies the join and makes your query easier to read and maintain.
Here’s how to rewrite the query from above - pre-compute join keys in a CTE:
WITH precomputed AS (
  SELECT
    *,
    CASE
      WHEN column_x = 'A' THEN column_z
      WHEN column_x = 'B' THEN column_w
    END AS join_key
  FROM table_a
)

SELECT
  p.*,
  b.*
FROM precomputed p
JOIN table_b b
  ON p.join_key = b.column_y;
While putting a CASE in a JOIN condition might seem like a quick fix, it’s not good practice.
Pre-computing the join key instead will save you time, headaches, and compute resources in the long run.
Tip #98: HAVING MAX and HAVING MIN
Here’s a weird SQL tip you probably didn’t know about.
Most SQL users think of MAX() and MIN() as go-to functions for returning the largest or smallest value in a group.
But did you know you can use HAVING MAX and HAVING MIN to restrict aggregation to only the rows with the maximum or minimum value?
It’s like magic for narrowing down data to key rows without extra subqueries.
Let’s say you’re analyzing climate data and want to identify the year when the highest temperature occurred for each season.
Instead of writing a long query with subqueries, you can use HAVING MAX to directly filter for rows with
the maximum temperature:
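Here’s a sketch of what that can look like, mirroring the MIN version shown later in this tip (it assumes a “Temperatures” table with season, year, and temp columns):

SELECT
  season,
  ANY_VALUE(year HAVING MAX temp) AS year_with_max_temp,  -- year from a row with the highest temp
  MAX(temp) AS max_temperature
FROM Temperatures
GROUP BY season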
HAVING MAX temp filters the rows to only include those with the highest temperature within each season.
ANY_VALUE retrieves the year from one of the rows with the maximum temperature.
Instead of writing a separate subquery to find the maximum value and then joining it back, you can use
HAVING MAX to simplify your query.
This works well for other scenarios like tracking top performers, highest sales, or peak events.
SELECT
  season,
  ANY_VALUE(year HAVING MIN temp) AS year_with_min_temp,
  MIN(temp) AS min_temperature
FROM Temperatures
GROUP BY season
HAVING MAX and HAVING MIN are great tools for isolating rows tied to extreme values within groups, without requiring additional subqueries.
Whether you’re analyzing weather patterns, sales, or performance metrics, this trick can save you time and simplify your SQL queries.
Tip #99: Data Normalization Basics
As a data analyst, you don’t have to be a data normalization expert.
It’s a fancy term that can seem intimidating, but the concept is actually pretty simple.
In relational databases, data is typically organized into multiple related tables rather than one big table.
Normalization has two main goals:
- Reduce redundancy
- Improve data integrity (data is accurate and consistent)
The idea is to split data into related tables and use keys (like primary keys and foreign keys) to connect them.
Imagine a single orders table where every row repeats the customer’s name and email. This is inefficient (you’re wasting storage space) and error-prone (if John updates his email, you need to update it in multiple places, which could lead to mistakes).
The normalized design splits this into two tables: 1. a Customers table and 2. an Orders table.
In this normalized design, customer information is stored only once, and the orders table references the
customer via “customer_id.”
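As a rough sketch, the two tables could look like this (the column names are illustrative, not from a specific schema):

-- Customers table: each customer’s details stored exactly once
CREATE TABLE customers (
  customer_id INT64,
  name STRING,
  email STRING
);

-- Orders table: references the customer through customer_id (a foreign key)
CREATE TABLE orders (
  order_id INT64,
  customer_id INT64,  -- points back to customers.customer_id
  order_date DATE,
  order_total NUMERIC
);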
Tip #100: Using || for String Concatenation
Let’s finish with a fun one…
When you first see a double pipe symbol (“||”) in SQL, it can feel like you’re reading a foreign language.
But don’t worry, you’re not missing a secret underground SQL club.
It’s just shorthand for concatenate. It’s a clean, efficient alternative to the CONCAT() function.
Let’s say you have a table “customers,” and you want to combine their first and last names into a single column:
SELECT
  first_name || ' ' || last_name AS full_name
FROM customers
|| is faster to type and easier to read compared to CONCAT(), and it handles separators like spaces, hyphens, or custom strings directly in the query.
BigQuery allows you to use || with arrays for even more functionality.
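For instance, a quick sketch of array concatenation (the result here would be a single array, [1, 2, 3, 4]):

SELECT [1, 2] || [3, 4] AS combined_array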
If any value in the || operation is NULL, the result will also be NULL.
To avoid this, use COALESCE() to replace nulls with a default value:
SELECT
  COALESCE(first_name, 'Unknown') || ' ' || COALESCE(last_name, 'Unknown') AS full_name
FROM customers
Now if you ever see a || in someone else’s query, you’ll know what it is.
It’s not hieroglyphics. Just a simple alternative to something you already know and understand!
Conclusion
You did it!
I’ve been doing this for years, but every time I dive into SQL documentation, I find features I’ve never
used or things that make me go,“Wait, what is that?”
Instead of aiming to know all of SQL, focus on getting comfortable with learningwhat you need when
you need it.
Even experts don’t know every corner of SQL or every function out there.
But every query you write and every error you fix will build your skill set.
You never truly “arrive” in SQL, and that’s the beauty of it.
Get comfortable not knowing.
Keep practicing, stay curious, and trust that each new discovery makes you a better data analyst.
Please share a quick post on LinkedIn or send the link to someone who might benefit from it.
Mention one thing you learned or enjoyed, tag me in it, and include a link so others can find it too.
Your post may be the nudge someone else needs along their SQL journey.
Thank you for being part of this—and for helping others find the resources they need.
🥳 Lastly, I want to congratulate you for taking the time to improve your SQL skills!
If you’d like to keep learning, I regularly share SQL tips and other insights on LinkedIn every week.
I’m expanding to other platforms as well (YouTube, TikTok, Instagram, X, Threads, etc.), so you can follow me there too.
I’m also working on more in‑depth SQL courses and bootcamps that take you beyond just tips.
That’s how you can really grow your skills, so keep an eye out for those coming soon…
Until then, stay curious, keep practicing, and I’ll see you around!
Kyle