100 SQL Tips
To Take Your Skills to the Next Level
KYLE MALONE
Table of Contents
Introduction
Tip #1: The Most Common Pattern for Exploring Data
Tip #2: How to Do SUMIFS in SQL
Tip #3: My Favorite String Function
Tip #4: How to Extract Part of a String in SQL
Tip #5: Understanding SQL’s Order of Execution
Tip #6: Proper Use of the ORDER BY Clause
Tip #7: A SQL Formatting Tip
Tip #8: Writing More Readable, Explicit SQL
Tip #9: Working with UNIX Timestamps in SQL
Tip #10: An Alternative to CASE Statements
Tip #11: ROUND function in SQL
Tip #12: A Special Way to Round Numbers in SQL
Tip #13: CTEs vs. Subqueries
Tip #14: Pivot Data with CASE Statements
Tip #15: INNER JOIN vs. WHERE IN
Tip #16: Introducing GROUP BY ALL
Tip #17: Time Series Analysis with LAG
Tip #18: Self Joins
Tip #19: Cleaning Data with TRIM
Tip #20: Avoid SQL Errors with SAFE_CAST
Tip #21: NULLIF Function in SQL
Tip #22: PERCENT_RANK in SQL
Tip #23: Check Divisibility with MOD
Tip #24: Changing Data Types in SQL
Tip #25: Fix Data Issues at the Source
Tip #26: DDL, DML, & DQL — What’s the Difference?
Tip #27: A Quick SQL Tip to Save You Time
Tip #28: Speed Up Your Queries By Joining on This Data Type
Tip #29: Understanding Column vs. Row Store Databases
Tip #30: The DATE_TRUNC Function
Tip #31: FULL OUTER JOIN in SQL
Tip #32: Dealing with NULL Values in SQL
Tip #33: 4 Common SQL Mistakes and How to Avoid Them
Tip #34: How to Rank Values in SQL
Tip #35: Write Faster SELECT Statements with EXCEPT
Tip #36: A Cleaner Way to Filter on Window Functions
Tip #37: Simplify Your CASE Statements
Tip #38: Appending Together Two Datasets
Tip #39: Identifying Duplicates in Your Dataset
Tip #40: Working with Unstructured Data in SQL
Tip #41: The EXTRACT Function
Tip #42: A SQL Function for Cleaning Data
Tip #43: How to Deal with a Divide by Zero Error
Tip #44: My Most Used Data Cleaning Function
Tip #45: Referencing a Table Name Alias in the WHERE Clause
Tip #46: How to Learn Joins
Tip #47: Generating All Possible Combinations with CROSS JOIN
Tip #48: Traversing Hierarchical Data with Recursive CTEs
Tip #49: Filtering Aggregated Data with HAVING (vs. WHERE)
Tip #50: WHERE TRUE
Tip #51: An ORDER BY Trick
Tip #52: CTEs are Simpler Than You Think
Tip #53: SQL JOINs Explained
Tip #54: NULL Values in WHERE Clause
Tip #55: Calculating Running Totals with Window Functions
Tip #56: Generate a “Date Spine” to Avoid Missing Dates in Charts
Tip #57: Validating String Length
Tip #58: Combining Columns into a Unique Key with CONCAT
Tip #59: Optimize Your Query Performance and Cost
Tip #60: Fact vs. Dimension Tables
Tip #61: Aggregate Your Data at Multiple Levels with GROUPING SETS
Tip #62: Check for Multiple NULL Columns Using COALESCE
Tip #63: Calculate Moving Averages in SQL
Tip #64: Spot Checking and Testing
Tip #65: Query Multiple Tables at Once with Wildcard Tables
Tip #66: Creating Your Own SQL Functions to Reuse Logic
Tip #67: When to Use DISTINCT vs. GROUP BY
Tip #68: Slowly Changing Dimensions
Tip #69: Sometimes You Just Need a Giant CASE Statement
Tip #70: Avoiding the “Double Counting” Pitfall
Tip #71: Combine Multiple Rows into One with STRING_AGG
Tip #72: Parsing Text with REGEX_EXTRACT: Don’t Fear the Regex!
Tip #73: Easily Group Data by Month with LAST_DAY
Tip #74: Anonymizing Personally Identifiable Information with a Hash Function
Tip #75: Joining Beyond Simple IDs
Tip #76: Parsing Dates with PARSE_DATE: Fixing Messy Date Formats
Tip #77: Splitting Bad Rows into an Error Table with SAFE_CAST
Tip #78: Filtering on Boolean
Tip #79: Using GREATEST and LEAST to Compare Values in a Row
Tip #80: When to Use FIRST_VALUE and LAST_VALUE
Tip #81: Demystifying Window Function Keywords
Tip #82: How to Handle Missing Data
Tip #83: Understanding the Difference Between NULL and Zero
Tip #84: Understanding COUNT(*), COUNT(column), and COUNT(1)
Tip #85: Best Practices for Using INNER JOIN vs. LEFT JOIN and Filtering Data
Tip #86: Exploring and Validating Your Data with Simple Queries
Tip #87: Why You Might Want to Use a View in BigQuery
Tip #88: Using the REPEAT Function
Tip #89: Using the SPLIT Function for String Manipulation
Tip #90: Using the BAG_OF_WORDS Function for Text Analysis
Tip #91: You’re Never Going to Get the Data Exactly How You Want It — And That’s Okay
Tip #92: Why GROUP BY Columns Must Appear in the SELECT Clause (Unless Aggregated)
Tip #93: WHERE EXISTS
Tip #94: Why You Should Always Comment Your Code
Tip #95: When to Use REGEXP_CONTAINS Instead of LIKE
Tip #96: The SELECT Clause is Independent of the Query’s Processing
Tip #97: Don’t Use CASE Statements as Join Conditions
Tip #98: HAVING MAX and HAVING MIN
Tip #99: Data Normalization Basics
Tip #100: Using || for String Concatenation
Conclusion
Introduction
Welcome to 100 SQL Tips to Level Up Your Skills!
In this e-book, I share with you the top 100 SQL techniques that I’ve learned over the last decade of my
career in data analytics.
It’s the difference between a data analyst who’s just competent and one who stands out, gets
promotions and pay increases, and drives the business strategy of their company.
I’ve seen over and over how the level of someone’s SQL skills can be the difference-maker in
advancing their data career.
I once interviewed a hungry candidate who was eager to join our analytics team, but their SQL just wasn’t there yet.
No one had time to teach SQL from scratch with everything else on our plates.
You can have the best communication skills and business acumen, but if you can’t write SQL queries, you’ll struggle to add real value as a data analyst.
I want you to crush it in your career, get promoted, and achieve whatever you dream of.
These 100 tips condense the lessons I’ve learned (sometimes the hard way) through years of experience with SQL.
I remember earlier in my career, I didn’t know where to turn to improve my SQL skills.
And I want to help those of you in the same place that I was in years ago.
These tips are designed to close knowledge gaps and provide a solid foundation you can build on.
If you apply them, you’ll move past the basics and start writing queries that unlock real business value.
Think of this as your shortcut to mastering the most important tool in data analytics.
By the last page, you’ll be writing SQL with a level of confidence that sets you apart, whether you’re eyeing your first data job or gunning for that senior role.
P.S. The tips are not in any particular order — just an order that I thought had a nice flow.
P.P.S. Please excuse the inconsistent format of the SQL code images. I like to mix it up! 🤣
Tip #1: The Most Common Pattern for Exploring Data
As a data analyst, exploring and understanding your dataset is crucial before diving into more complex analysis.
One common pattern for exploring data is to find the most frequent values for a specific column, ranked from top to bottom.
Here's a SQL tip for finding the most common values for a column in your dataset.
Let's consider an example where we have a table named “orders” with the following data:
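Say each row is one order and the table has a product_category column (an assumption for illustration). A minimal sketch of the counting query:

SELECT
  product_category,
  COUNT(*) AS frequency
FROM orders
GROUP BY product_category
ORDER BY frequency DESC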
A query to find the most common product categories would look something like the query above.
In this example, the query returns the product categories and their respective frequencies, ordered by
frequency in descending order.
This helps us quickly identify the most common product categories in our dataset.
One other tip is you can try to GROUP BY 1 ORDER BY 2, instead of typing the column names explicitly.
Not every SQL environment supports this but it can make your query cleaner and faster to write.
As a data analyst, using this pattern for exploring your data will enable you to better understand your dataset and make more informed decisions during your analysis.
It's a simple yet effective way to quickly gain insights into the distribution of values within your data.
Tip #2: How to Do SUMIFS in SQL
SUMIFS is my favorite spreadsheet function.
In SQL, the combination of the CASE statement within the SUM function can be used to achieve
functionality similar to SUMIFS in Excel.
The first time I saw that you could put a CASE statement inside a SUM, it blew my mind. 🤯
The CASE statement allows you to apply conditional logic within the aggregation function. This lets you sum the values based on specific criteria.
We want to calculate the total quantity sold for product_id 1, but only for completed sales. In this case, we can use a CASE statement within the SUM function to apply the necessary conditions:
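A sketch of that conditional sum, assuming a sales table with product_id, sale_status, and quantity columns:

SELECT
  SUM(CASE WHEN product_id = 1 AND sale_status = 'Completed' THEN quantity ELSE 0 END) AS total_completed_quantity
FROM sales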
In this example, the CASE statement checks if the product_id is 1 and if the sale_status is 'Completed.'
If both conditions are met, it returns the quantity value; otherwise, it returns 0.
The SUM function then adds up the returned values, giving us the total quantity sold for product_id 1 with completed sales.
Using a CASE statement within a SUM function allows you to perform conditional aggregations similar to SUMIFS in Excel.
Tip #3: My Favorite String Function
Working with strings in SQL is one of the most common things you do as a data analyst.
As a part of that, you may need to combine the values of multiple columns or strings in your SQL queries.
In this lesson, we'll introduce you to the CONCAT function and explain how it can be helpful in your
SQL queries.
What is CONCAT?
The CONCAT function is used to concatenate (or join) two or more strings or column values together.
It returns a single string that is the result of combining the input strings in the order they are provided.
Suppose we have a table called “employees” with the following columns: first_name, last_name, and email.
We want to create a query that returns a full name by combining the first_name and last_name columns.
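A minimal sketch of that query:

SELECT
  CONCAT(first_name, ' ', last_name) AS full_name
FROM employees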
In this query, we use the CONCAT function to join the first_name and last_name columns, separated by
a space character.
By using this function, you can easily create new strings by joining existing ones, which can be useful for formatting.
P.S. Another cool thing CONCAT can do is create a unique key on your table. If none of the columns are unique by themselves but a combination of columns is unique, then you can just concatenate them together to create a unique key.
Tip #4: How to Extract Part of a String in SQL
Cleaning data in SQL can be one of the hardest parts of a data analyst's job.
In SQL, string functions like STRPOS() and SUBSTRING() can be used to manipulate and extract
specific parts of text data.
STRPOS() returns the position of the first occurrence of a part of a piece of text.
In combination with other string functions like SUBSTRING(), you can extract parts of a string based on
the position of a specific character.
Let's consider an example where we have a table named “file_paths” with the following data...
We want to extract the file names, which are all characters to the right of the last / in the file_path column.
We can use the STRPOS() and SUBSTRING() functions in a query to achieve this.
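A sketch of that query. Note that this version keys off the first / in the string, so it assumes each file_path contains only one /:

SELECT
  file_path,
  SUBSTRING(file_path, STRPOS(file_path, '/') + 1) AS file_name
FROM file_paths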
In this example, the STRPOS(file_path, '/') function finds the position of the first occurrence of / in the
file_path column.
We add 1 to this position to start from the character immediately after the /.
Finally, we use the SUBSTRING() function to extract the substring starting at the calculated position until the end of the string.
Boom!
Tip #5: Understanding SQL’s Order of Execution
A common confusion for new analysts is that SQL doesn’t execute in the same top-to-bottom order you see in your query.
So if you’re ever surprised that you can’t reference a SELECT alias in WHERE, or run into issues with a window function, it’s probably due to how SQL is actually evaluated under the hood.
1. FROM: Identify which table you’re pulling data from.
2. JOIN: Combine the data from multiple tables if needed.
3. WHERE: Filter out rows that don’t meet your conditions.
4. GROUP BY: Group rows for aggregation.
5. HAVING: Filter the grouped results.
6. SELECT: Choose which columns or expressions to display.
7. ORDER BY: Sort the results.
8. LIMIT: Return only a subset of rows (e.g., top 10).
Here’s an example to illustrate why this matters. Let’s say you write:
SELECT
  customer_id,
  COUNT(*) AS order_count
FROM orders
WHERE order_count > 5 -- This won't work
GROUP BY customer_id
You’d need:
SELECT
  customer_id,
  COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
Here, we use HAVING to filter after the grouping and counting is done. This makes sense because when the WHERE clause is evaluated, the SELECT clause hasn’t run yet, so the order_count alias doesn’t exist.
SQL runs in a specific logical order. Knowing this helps you avoid those “Why won’t this work?” moments. 🤔
Tip #6: Proper Use of the ORDER BY Clause
When you're working with SQL, it's crucial to write efficient queries to retrieve the data you need as quickly as possible.
One way to optimize performance is by ensuring that you use the ORDER BY clause only at the end of your query, rather than in subqueries or derived tables.
The reason behind this is that sorting data can be computationally expensive, especially when dealing with large datasets.
Below is an example of an inefficient ORDER BY inside a subquery vs. an efficient ORDER BY at the end.
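A sketch of the contrast, using a hypothetical orders table:

-- Less efficient: sorting inside the subquery does extra work the final result doesn't need
SELECT
  customer_id,
  SUM(amount) AS total_spend
FROM (
  SELECT customer_id, amount
  FROM orders
  ORDER BY amount DESC
) o
GROUP BY customer_id;

-- More efficient: aggregate first, then sort only the final result
SELECT
  customer_id,
  SUM(amount) AS total_spend
FROM orders
GROUP BY customer_id
ORDER BY total_spend DESC;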
By using ORDER BY at the end of your query, you give the database engine the opportunity to filter, join, or aggregate the data before sorting it.
This reduces the amount of data that needs to be sorted, making the process more efficient and faster.
Tip #7: A SQL Formatting Tip
Learning good SQL formatting is one of the highest leverage things you can do as a new data analyst.
That's because it's easy to learn but can be detrimental to your interviews if you don't do it.
Learning to format your SQL is the EASIEST and FASTEST way to make you look like you know what you’re doing.
SELECT name, age, country, occupation, hobby
FROM users
WHERE country = 'USA' AND age > 21
ORDER BY name DESC

SELECT
    name,
    age,
    country,
    occupation,
    hobby
FROM users
WHERE country = 'USA'
    AND age > 21
ORDER BY name DESC
Formatting styles vary, but all of them follow some basic best practices, like using a lot of vertical space.
It's about writing queries that others can easily understand and maintain.
Tip #8: Writing More Readable, Explicit SQL
Just because two SQL queries do the same thing, doesn’t mean they are equal.
Here’s an example of how a senior analyst vs. a more junior analyst might write the same query.
Junior Analyst:
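A hedged sketch of the implicit version; the orders table and the seven-hour shift to a local timezone are assumptions based on the discussion below:

SELECT
  order_id,
  TIMESTAMP_SUB(created_at, INTERVAL 7 HOUR) AS created_at
FROM orders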
Senior analyst:
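And a sketch of the explicit version of the same query:

SELECT
  order_id,
  -- Convert created_at from UTC to local reporting time (UTC-7)
  TIMESTAMP_SUB(created_at, INTERVAL 7 HOUR) AS created_at_local
FROM orders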
Both of these queries get you the same result. But the second one is more explicit.
It articulates exactly what you’re trying to do to someone else reading your query.
In the first example, the reader has to guess why seven hours are being subtracted from the created_at
column.
Explicit queries reduce the risk of misinterpretation or errors when others work with your code.
Easily readable SQL strengthens team collaboration and eases the onboarding process for new analysts.
The goal is to write SQL that not only works but works well for everyone who reads it!
Tip #9: Working with UNIX Timestamps in SQL
Have you ever seen a timestamp that just looked like a really long number?
It’s a representation of time as the number of seconds that have elapsed since January 1, 1970,
00:00:00 Coordinated Universal Time (UTC), not counting leap seconds.
UNIX timestamps are a convenient way to store and manipulate date and time values in various programming languages and databases, including SQL.
They are especially useful for calculating time intervals and comparing dates since they represent time as a single numeric value.
Different SQL databases have their own set of functions to handle UNIX timestamps.
Here are some examples showing how PostgreSQL, MySQL, and BigQuery handle these.
Convert a UNIX timestamp to a date and time value using TIMESTAMP_SECONDS() function:
SELECT TIMESTAMP_SECONDS(1623647987) AS date_time
Convert a date and time value to a UNIX timestamp using the UNIX_SECONDS() function:
SELECT UNIX_SECONDS(TIMESTAMP "2021-06-14 12:33:07") AS unix_timestamp
Let's consider an example where we have a table named user_logins with the following data:
We want to convert the UNIX timestamps to human-readable date and time values. In PostgreSQL, we can do this using the to_timestamp() function:
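A sketch of that conversion; the user_id and login_unix_ts column names are assumptions:

SELECT
  user_id,
  login_unix_ts,
  to_timestamp(login_unix_ts) AS login_time
FROM user_logins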
Next time you see a timestamp shown as a long string of numbers you will know what to do!
Tip #10: An Alternative to CASE Statements
Sometimes CASE statements in SQL can get long and hard to read.
In certain SQL dialects (e.g. BigQuery), you can use the IF() function as a cleaner alternative.
Here’s an example...
The code below shows how to write that with both CASE and IF.
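A minimal sketch of both, assuming an orders table with an amount column:

-- CASE version
SELECT
  order_id,
  CASE WHEN amount >= 100 THEN 'Large' ELSE 'Small' END AS order_size
FROM orders;

-- IF version
SELECT
  order_id,
  IF(amount >= 100, 'Large', 'Small') AS order_size
FROM orders;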
The IF function takes three arguments:
1. a condition
2. the value to return if the condition is true
3. the value to return if the condition is false
I’m not saying you should ALWAYS use IF over CASE statements.
I think it works best to use IF when the logic is very simple like the example above.
But if you’ve never used it, try out IF next time you are about to write a CASE statement!
Tip #11: ROUND function in SQL
A common mistake data analysts make (and one I’ve made myself) is including too many decimal places in presentations to stakeholders.
SQL provides a versatile function called ROUND that helps you combat this.
It takes two arguments:
1. The number you want to round
2. The number of decimal places to round to.
If you pass a positive value as the second argument, it rounds to that many decimal places.
A negative value will round to the left of the decimal point, and passing zero will round the number to the nearest integer.
ere’s an example where we have a table named “product_prices” and we want to round the prices
H
multiple ways:
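A sketch of that query, assuming product_prices has product_name and price columns:

SELECT
  product_name,
  price,
  ROUND(price, 2) AS price_two_decimals,
  ROUND(price, 0) AS price_nearest_integer,
  ROUND(price, -1) AS price_nearest_ten
FROM product_prices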
In this example, the query uses the ROUND function to round the prices to two decimal places, the
nearest integer, and the nearest multiple of 10.
As a result, we can see the rounded prices for each product in our dataset.
Happy rounding. 🟠
Tip #12: A Special Way to Round Numbers in SQL
If SQL already has a ROUND function, why do we also need FLOOR and CEILING?
While ROUND is great for general number rounding, FLOOR and CEILING are needed to control the direction of rounding, such as when bucketing data.
Let’s say you want to analyze how much customers spend in their first 30, 60, 90, etc., days after signing up.
You need to group spending into fixed 30-day buckets, ensuring that each period starts at a multiple of 30 days.
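A sketch of that bucketing (BigQuery syntax); the transactions table and its signup_date, transaction_date, and amount columns are assumptions:

SELECT
  customer_id,
  FLOOR(DATE_DIFF(transaction_date, signup_date, DAY) / 30) * 30 AS bucket_start_day,
  SUM(amount) AS total_spend
FROM transactions
GROUP BY customer_id, bucket_start_day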
If we used ROUND, a transaction at 45 days would round to 60 instead of being correctly placed in the 30 to 60-day bucket.
Example:
SELECT ROUND(45 / 30) * 30; -- Returns 60 (incorrect for bucketing)
SELECT FLOOR(45 / 30) * 30; -- Returns 30 (correct for bucketing)
By using FLOOR, we ensure that transactions always fall into the correct 30-day range.
Key Takeaways:
- Use FLOOR when bucketing data into fixed intervals (e.g., 30-day periods).
- Use CEILING when you need to round up (e.g., determining “at most X days”).
- Use ROUND only when rounding to the nearest value is appropriate.
Understanding these differences helps ensure accuracy in SQL reporting and analysis, especially for LTV, retention, and cohort studies.
Tip #13: CTEs vs. Subqueries
When should you use CTEs (common table expressions) vs. subqueries in SQL?
To refresh your memory, I like to think of CTEs and subqueries as ways to organize the writing of your SQL code.
A CTE is kind of like making a bunch of little tables individually, and then working with them in the lower parts of your query.
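A minimal sketch of that structure, with hypothetical names:

WITH recent_orders AS (
  SELECT col1, col2
  FROM table_name
  WHERE order_date >= '2024-01-01'
)
SELECT
  col1,
  col2
FROM recent_orders

A subquery, by contrast, nests one query directly inside another: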
SELECT
  columns...
FROM (
  SELECT col1, col2, etc.
  FROM table_name
  WHERE ...
)
...
Whenever I see a big subquery inside another query it makes my mind explode. 🤯
However, I understand sometimes it's just easy to write a little subquery when you have to.
Only use subqueries when they are VERY simple and easy to understand.
If it takes more than five seconds to read and understand what the subquery is doing, make it a CTE
instead.
This will make it easier to write, understand, and debug your code.
It will also make it WAY easier for someone else who didn't write your query to read and understand it.
Follow this rule and your other team members and future self will thank you!
Tip #14: Pivot Data with CASE Statements
One useful feature of CASE statements is to pivot data dynamically.
CASE statements can transform data in ways beyond simple conditional expressions.
Pivoting data means converting rows into columns, which can be especially useful when dealing with categorical data.
Here's an example of how you can use the CASE statement to pivot data:
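A sketch built from the column names in the before/after tables below; the dogs table name is an assumption:

SELECT
  dog_breed,
  SUM(CASE WHEN age_group = 'Puppy' THEN number_of_dogs ELSE 0 END) AS puppy_count,
  SUM(CASE WHEN age_group = 'Adult' THEN number_of_dogs ELSE 0 END) AS adult_count,
  SUM(CASE WHEN age_group = 'Senior' THEN number_of_dogs ELSE 0 END) AS senior_count
FROM dogs
GROUP BY dog_breed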
In this query, we used SUM functions and CASE statements to pivot the age_group column, converting
it into separate columns for each age group's dog count.
Here is what that looks like before and after the pivot:
-- BEFORE PIVOT
dog_breed        | age_group | number_of_dogs
-----------------|-----------|---------------
Labrador         | Puppy     | 10
Labrador         | Adult     | 35
Labrador         | Senior    | 15
Golden Retriever | Puppy     | 12
Golden Retriever | Adult     | 30
Golden Retriever | Senior    | 20
Beagle           | Puppy     | 8
Beagle           | Adult     | 25
Beagle           | Senior    | 12

-- AFTER PIVOT
dog_breed        | puppy_count | adult_count | senior_count
-----------------|-------------|-------------|--------------
Labrador         | 10          | 35          | 15
Golden Retriever | 12          | 30          | 20
Beagle           | 8           | 25          | 12
📊 📊
Enjoying this e‑book?
There are many people (just like you) out there who want to learn SQL but feel lost and don’t know where to start.
All you need to do is share a quick post on LinkedIn (or your preferred social platform).
Tag me, mention how this e‑book helped you, and include a link so others can find it.
People listen to others’ recommendations, and your recommendation could guide them in the right direction.
Thanks for sharing this e-book and helping others on their path.
Tip #15: INNER JOIN vs. WHERE IN
Two paths, same destination.
When filtering data, you might see two approaches that do basically the same thing.
SELECT t1.*
FROM table1 t1
JOIN table2 t2
  ON t1.column = t2.column

SELECT *
FROM table1
WHERE column IN (SELECT column FROM table2)
In many modern SQL engines, both often end up with similar execution plans.
Some find WHERE IN more explicit, especially when using a simple subquery.
I personally gravitate toward INNER JOINs as the primary way to filter data if possible.
As you gain experience, you’ll develop preferences, but don’t let this topic confuse you.
The database optimizer usually handles both similarly, especially for reasonably sized datasets.
Ultimately,clarityis key.
Pick the syntax that best communicates the intent to you and to those who will be reading your code.
Tip #16: Introducing GROUP BY ALL
This SQL tip will change your life.
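Picture a query that groups by a long list of columns (hypothetical names):

SELECT
  product_category,
  region,
  order_month,
  customer_segment,
  SUM(amount) AS total_sales
FROM orders
GROUP BY product_category, region, order_month, customer_segment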
Phew 🥵
GROUP BY 1,2,3,4
Better... 🤨
Now, just simply:
GROUP BY ALL
Beautiful, elegant.
Tip #17: Time Series Analysis with LAG
Time series analysis is one of the most important skills you can have as a data analyst.
This is because many important insights are related to a change in some metric over time.
The LAG function in SQL is a window function that is helpful for this.
Using LAG, you can calculate period-to-period changes, such as the difference in sales between consecutive months or the growth in user signups from week to week.
It can seem a bit confusing at first since it’s a window function, but I promise you’ll get the hang of it.
Below I show how to use it to calculate a month-over-month change in the sales column.
Let's consider an example where we have a table named “monthly_sales” with the following monthly sales data:
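A sketch of the calculation, assuming monthly_sales has month and sales columns:

SELECT
  month,
  sales,
  LAG(sales) OVER (ORDER BY month) AS previous_month_sales,
  sales - LAG(sales) OVER (ORDER BY month) AS sales_change
FROM monthly_sales
ORDER BY month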
In this example, the LAG function is used with an OVER clause, which defines the order of rows to be used in the calculation.
This retrieves the sales value from the previous row, relative to the current row, when sorted by the month column (previous_month_sales).
Then, we subtract the previous month's sales from the current month's sales to get the sales change.
By using the LAG function, you can gain insights into how values change over time, helping you identify trends, growth rates, and patterns in your data.
Tip #18: Self Joins
The first time I saw a self join in SQL I thought it was so weird.
But they’re super useful for analyzing relationships within the same dataset.
et’s say you’re a data analyst working with employee data, and you want to find which employees
L
report to which managers—all from the same employees table.
SELECT
  e.employee_id,
  e.first_name AS employee_name,
  m.first_name AS manager_name
FROM employees e
JOIN employees m
  ON e.manager_id = m.employee_id
ORDER BY manager_name, employee_name
Think of self joins like comparing rows from the same table to find patterns or relationships.
A quick rule of thumb is that self joins are perfect for tables that have hierarchies (managers/employees, product categories, etc.).
If you come across a table that has hierarchies, remember you might need a self join to analyze it!
Tip #19: Cleaning Data with TRIM
Extra spaces before and after words in SQL columns can drive analysts crazy.
This can happen for a lot of reasons; a common one is when a user accidentally types a space before or after their email in an online form or survey.
There are often processes in place to ensure this doesn't happen, but that's not always the case.
I can't tell you how many times I've tried for hours to figure out what's wrong with my SQL query only to
realize that one of the fields has a sneaky space in front of it!
To combat this you can use the TRIM function in SQL.
The TRIM function is a string manipulation function that removes leading and trailing spaces from a given string.
Let's consider an example where we have a table named “users” with user information that includes first names and email addresses with extra spaces.
We want to remove the extra spaces from the first_name and email columns.
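A sketch of that cleanup:

SELECT
  TRIM(first_name) AS first_name,
  TRIM(email) AS email
FROM users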
By using the TRIM() function, you can ensure that your queries, joins, and comparisons work correctly and avoid issues caused by extra spaces in user-submitted data.
Tip #20: Avoid SQL Errors with SAFE_CAST
Oftentimes in our analytics work we will want to CAST a column to another data type.
For example, we might have a column containing numbers but the data type of that column is a STRING.
To work with these strings as numbers we use the CAST function.
But one problem with this is it can cause your whole query to error out if any single row fails to cast to another data type.
In this example, this could happen if one of the rows mistakenly had a word in it instead of a number.
SAFE_CAST is a SQL function that's available in many data warehouses like Google BigQuery.
SAFE_CAST allows you to safely cast or convert data from one data type to another.
If the conversion cannot be performed, instead of causing your query to crash or return an error, the
SAFE_CAST function will return NULL.
Using SAFE_CAST can help you avoid unexpected errors in your queries when dealing with data that may not be in the expected format or contains invalid values for the target data type.
Suppose we have a table named raw_data with the following schema and data.
We want to cast the value column to integers, but there is a record (record_id 2) with an invalid value for an integer.
If we try to cast the value column directly, we may encounter an error:
To avoid this error and handle the casting safely, we can use the SAFE_CAST function:
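A sketch of both versions, using the value column and INT64 target type described in this example:

-- This can error out if any row holds a non-numeric value:
SELECT record_id, CAST(value AS INT64) AS int_value
FROM raw_data;

-- This returns NULL for the bad row instead of erroring:
SELECT record_id, SAFE_CAST(value AS INT64) AS int_value
FROM raw_data;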
In this example, the SAFE_CAST function takes the value and the target data type (INT64) as arguments.
When the value cannot be cast to the target data type, as in the case of record_id 2, the function returns NULL instead of causing an error.
If you want to put the errors in a separate table for further analysis, you can filter the records with NULL
values in the int_value column:
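A sketch of that filter:

SELECT
  record_id,
  value
FROM raw_data
WHERE SAFE_CAST(value AS INT64) IS NULL
  AND value IS NOT NULL -- keep rows that had a value but failed to cast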
Understanding how to use functions like SAFE_CAST to handle potential data type conversion errors in your SQL queries can be a big help when you need it.
It helps you maintain the stability and reliability of your analysis, allowing you to work with data that may contain unexpected or invalid values that could otherwise cause your queries to fail.
By using SAFE_CAST, you can also identify and analyze problematic records separately, improving the quality of your data processing and analysis.
Tip #21: NULLIF Function in SQL
A while back I made a LinkedIn post about how to check for duplicates in SQL.
I received a question asking not about duplicates, but how to compare whether two columns in the
same table have the same value.
That’s where the NULLIF function comes in: it compares two expressions and returns NULL if they are equal.
Imagine you have a users table with columns for email and backup_email. You want to ensure that the
backup_email is not the same as the primary email for security reasons.
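A sketch of that comparison, assuming a user_id column alongside the two email columns:

SELECT
  user_id,
  email,
  backup_email,
  NULLIF(email, backup_email) AS email_comparison
FROM users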
In the results of this query, email_comparison will be NULL if the email and backup_email are the same.
This could be helpful to quickly identify any instances where the backup email is the same as the primary email, which could be a security concern.
In other words, any row where email_comparison is NULL is a row where the email and backup_email
are identical.
This way, NULLIF can be used to compare columns and identify identical values, which can be useful in a variety of scenarios where identical values in different columns could be a problem.
Quick tips like this can really speed up your workflow as an analyst.
Happy SQL’ing!
Tip #22: PERCENT_RANK in SQL
I thought I didn’t need this SQL window function… until I used it the other day!
At some point, you’re going to need to compare values across a dataset.
Maybe you want to rank customers by spending, students by test scores, or products by sales performance.
But instead of just assigning a rank, you want to seewhere each value falls on a scale from 0 to 1.
Let's consider an example where we have a table named “exam_scores” with the following data:
We want to calculate the percentile rank of each student's score.
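A sketch of that query, assuming exam_scores has student_name and score columns:

SELECT
  student_name,
  score,
  PERCENT_RANK() OVER (ORDER BY score) AS percentile_rank
FROM exam_scores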
In this example, PERCENT_RANK() calculates the percentile rank of each student's score based on the sorted scores.
As a result, we can see how each student's score compares to the rest of the dataset in terms of percentiles.
By using this technique, you can quickly identify trends, outliers, and other important characteristics of your dataset, which can be crucial for effective data analysis.
Tip #23: Check Divisibility with MOD
As a data analyst, you may encounter situations where you need to determine if a number is divisible by another number.
This can be useful for a lot of things like checking if a number is odd or even.
In SQL, the MOD function can be a useful tool for achieving this goal.
The MOD function returns the remainder of a division operation, making it easy to check for divisibility.
If the result is 0, it indicates that the dividend is divisible by the divisor.
Let's consider an example where we have a table named “numbers” with the following data.
We want to find the numbers in the table that are divisible by 5.
In this case, we can use the MOD function in combination with a WHERE clause.
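A sketch, assuming the column is named value:

SELECT value
FROM numbers
WHERE MOD(value, 5) = 0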
In this example, the query uses the MOD function to filter the rows where the value is divisible by 5 (i.e.,
the remainder of the division by 5 is 0).
As a result, we get a list of numbers in our dataset that meet this condition, which are 20, 25, and 30.
Using the MOD function in SQL is a handy tool for working with numerical data.
It allows you to quickly identify and filter rows based on divisibility, which can be helpful for various data
analysis tasks, such as generating reports, identifying trends, or performing calculations on specific
subsets of your data.
Tip #24: Changing Data Types in SQL
When working with data in SQL you may need to convert one data type to another.
For example, you might have a number you want to work with, but it’s stored as a text string in your database.
That means you can’t use any functions meant for numbers on it (like ROUND, SUM, etc.) until you convert the data type to something numeric. The CAST function handles that conversion, and its syntax looks like this:
CAST(expression AS data_type)
Suppose we have a table called “sales” with the following columns: sale_id, item_id, price, and discount. The price column has a data type of FLOAT64, while the discount column has a data type of STRING.
If you want to calculate the discounted price, you will first need to convert the discount column to a
numeric data type.
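A sketch of that calculation; treating discount as a dollar amount stored as text is an assumption:

SELECT
  sale_id,
  item_id,
  price,
  price - CAST(discount AS FLOAT64) AS discounted_price
FROM sales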
The CAST function allows you to easily convert the discount column to a numeric data type, so you can perform calculations with the price column.
Tip #25: Fix Data Issues at the Source
Here’s one of the biggest lessons I’ve learned working with data.
It can be tempting to write the world's most epic CASE statement to clean up the data.
If new data is continuously pouring in, you're going to have to keep adding lines to that CASE statement
every time a new condition appears.
That is unsustainable.
Solving issues at the source can be more work initially, but will make your life so much easier in the long run.
Sometimes the best SQL tips don’t involve any SQL at all!
Tip #26: DDL, DML, & DQL — What’s the Difference?
As data analysts we’re used to writing SQL queries to pull data.
SQL statements are categorized into different types based on their purpose and functionality.
In this lesson, we'll discuss the three main types of SQL statements: DDL, DML, and DQL.
DDL (Data Definition Language) statements are used to define and manage the structure of a database and its objects (such as tables, views, and indexes). The main DDL statements include CREATE, ALTER, DROP, and TRUNCATE.
DML (Data Manipulation Language) statements are used to manage and manipulate the data stored within database objects. The main DML statements include INSERT, UPDATE, and DELETE.
The main DQL (Data Query Language) statement is the SELECT statement, which allows you to retrieve data from one or more tables, apply filters, and aggregate or sort the results.
…
ou don’t need to be an expert at these to be a data analyst, but some familiarity with them can
Y
definitely be helpful!
DDL:
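A couple of hedged examples of DDL statements; the customers table and its columns are hypothetical:

CREATE TABLE customers (
  customer_id INT64,
  name STRING
);

ALTER TABLE customers ADD COLUMN email STRING;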
DML:
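And hedged examples of DML statements against that same hypothetical table:

INSERT INTO customers (customer_id, name, email)
VALUES (1, 'Ada', 'ada@example.com');

UPDATE customers SET name = 'Ada Lovelace' WHERE customer_id = 1;

DELETE FROM customers WHERE customer_id = 1;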
Tip #27: A Quick SQL Tip to Save You Time
Sometimes you will have to work with tables that have A LOT of columns.
You either have to manually type out each column you want in the SELECT statement.
Or you can manually type out the columns you don’t want with a SELECT * EXCEPT statement.
Or, better yet, you can write a SQL statement that outputs the column names for you!
SELECT column_name
FROM project_id.dataset_id.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'mytable'
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'schema name'
  AND table_name = 'table name'
These queries will output the column names in the SQL results.
Then you can copy/paste them all at once and just delete the ones you don’t want.
Tip #28: Speed Up Your Queries By Joining on This Data Type
Here’s a quick SQL tip to help your queries run faster.
There's nothing that bogs down a data analyst like a slow SQL query.
One way to improve the performance of your queries is to join tables using integer values instead of strings.
Integers take up less space in memory than strings, leading to faster joins and more efficient
processing.
When joining on strings, the database has to compare each character in the string, which can be time-consuming, especially for longer strings or large datasets.
Additionally, string comparisons are often case-sensitive, which can lead to slower performance.
If your database contains numbers stored as strings, you can convert them to integers using the CAST
or CONVERT functions, depending on your SQL dialect.
WITH string_id_table AS (
  SELECT
    CAST(id AS INT64) AS id,
    other_columns...
  FROM table
)
SELECT
  columns...
FROM string_id_table s
LEFT JOIN table2 t
  ON s.id = t.id
Don't let strings slow you down – optimize your queries by using integers whenever possible. This tip won't fix all your problems, but it should help.
And it’s good to keep things like this in mind to understand how SQL is executing under the hood.
Happy analyzing!
Tip #29: Understanding Column vs. Row Store Databases
Are you gearing up for data analyst interviews?
One key area you might encounter is the difference between column and row store databases.
Row Store Databases (RDBMS): Think of a row store database like a traditional spreadsheet where data is stored in rows.
This is the most common type of database you might have already used.
It's excellent for operations involving the entire row of data, like entering a new record or retrieving all
details of a specific entry.
It works wonders when you need to process transactions, which is why it's a go-to for online retail,
banking systems, and other transaction-oriented applications.
Column Store Databases: Now, imagine instead of storing data by rows, you store it by columns.
Each data column is stored separately, making it faster to retrieve all the information in a single column across multiple rows.
It's a star performer when dealing with analytics and data warehousing, as these operations often
require heavy reading and aggregation of specific attributes (like summing up all sales in a month).
- Performance: In row stores, if you need to access data from a few columns, you end up scanning entire rows, which can be slow. Column stores, on the other hand, can quickly fetch only the required columns, speeding up query times.
- Storage Efficiency: Column stores can compress data more effectively because the data in a
column is typically more uniform than data in a row.
- Real-world Applications: Knowing the difference can help you suggest the right database for the job. For instance, if your company deals with heavy report generation, a columnar database might be recommended.
Interview Tip: When asked about databases in an interview, show that you understand not only the technical differences but also how these affect business decisions and performance.
Mentioning scenarios like data warehousing for column stores and transaction processing for row stores can demonstrate practical knowledge.
Remember, it's not about memorizing definitions. It's about understanding concepts and their impact on the real world of data.
Tip #30: The DATE_TRUNC Function
Data analysts often work with dates in their data.
Knowing how to use the DATE_TRUNC function in SQL is important because it helps break down dates into useful time periods, like days or months.
DATE_TRUNC takes two arguments:
1. Timestamp Column: The date or timestamp column that you want to truncate.
2. Time Unit: The unit to truncate to (e.g., year, month, day, hour).
This function returns a date truncated to the beginning of the specified time unit, making it easier to group and compare data across time periods.
Let's assume we have a table “sales” with columns “sale_date” (a timestamp) and “amount” (sale amount).
If you want to analyze monthly sales, you can use DATE_TRUNC to truncate sale_date to the month
level:
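A sketch of that monthly rollup, following the function call described below:

SELECT
  DATE_TRUNC(sale_date, MONTH) AS sale_month,
  SUM(amount) AS total_sales
FROM sales
GROUP BY sale_month
ORDER BY sale_month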
In this query, DATE_TRUNC(sale_date, MONTH) truncates the sale_date to the first day of each
month, and the SUM(amount) calculates the total sales for each month.
The GROUP BY clause groups the results by the truncated date, allowing you to see the total sales for each month.
Remember, the precision of DATE_TRUNC can be changed to other units like year, day, hour, etc., depending on your needs!
Tip #31: FULL OUTER JOIN in SQL
Most people skip learning this SQL join, but that is a mistake.
It’s a lesser used but powerful tool when you need it.
FULL OUTER JOIN is a type of join that combines the results of both LEFT JOIN and RIGHT JOIN.
It returns all rows from both tables, and if there's no match between the tables, NULL values are used
for the columns where no match is found.
This join operation is useful when you want to retain all the data from both tables, even if there's no direct match between them.
FULL OUTER JOIN is different from INNER JOIN and LEFT JOIN in the following ways:
- INNER JOIN returns only the rows where there's a match between the joining columns in both
tables.
- LEFT JOIN returns all the rows from the left table and the matched rows from the right table. If
no match is found, NULL values are used for the columns from the right table.
Suppose we have an “ad_spend” table of advertising spend by ad_id, and an “orders” table that stores information about orders attributed to that ad spend.
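A hedged sketch of the join; the ad_id key comes from the example, while the spend and revenue columns are assumptions:

SELECT
  a.ad_id,
  a.spend,
  o.order_id,
  o.revenue
FROM ad_spend a
FULL OUTER JOIN orders o
  ON a.ad_id = o.ad_id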
In this result set, you can see that the FULL OUTER JOIN includes all the rows from both tables, even
when there's no match between the ad_id columns.
The unmatched row from the orders table (with order_id 5) has NULL values for the columns from the ad_spend table.
So use FULL OUTER JOIN when you want to retain all the data from both tables, even if there's no direct match between them.
This can be useful in situations where you need to combine data from separate but related tables, such as ad spend and orders.
Tip #32: Dealing with NULL Values in SQL
When working with data in SQL, you’ll inevitably encounter NULL values.
The COALESCE function is a handy tool that can help you out!
COALESCE can be particularly useful when you want to provide a default value for a column that may have NULL values.
Suppose you have a table called “products” with product_id, name, price, and discount.
In this table, the discount column contains NULL values for some products.
You want to calculate the final price of each product after applying the discount.
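A sketch of that calculation, treating discount as an amount off the price (whether it's an amount or a rate isn't specified, so this is an assumption):

SELECT
  product_id,
  name,
  price,
  price - COALESCE(discount, 0) AS final_price
FROM products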
The COALESCE function ensures that if the discount is NULL, it is treated as 0, allowing you to calculate the final price correctly.
It allows you to provide default values for NULLs when needed.
This can help you avoid unexpected results when performing calculations or displaying data, making it a useful tool for data analysts.
Tip #33: 4 Common SQL Mistakes and How to Avoid Them
SQL is hard.
It’s ok to make mistakes but you want to correct them as soon as possible. Here are four common
beginner SQL mistakes and how to correct them:
When duplicates show up in a query's results, I often see people just slap a DISTINCT in there to make sure there aren't any duplicates.
That's sloppy.
Consider alternative solutions, such as using window functions, to improve performance and customization of your deduplication.
A common mistake is using the comparison operator (=) instead of IS NULL or IS NOT NULL when comparing NULL values.
This can result in incorrect results, as NULL values are treated differently in SQL.
Understand how to handle NULL values in your queries to avoid incorrect results.
Another common mistake is using the wrong data type for your columns.
For example, using a text data type for a column that should be a number can result in errors when performing mathematical operations.
It's also not as efficient to join on text as it is on integers. Ensure you use the correct data type for each column.
This will make your code so much easier to understand when you are looking back on it three months later trying to remember what you did!
...
Avoid these common mistakes to improve your SQL skills and ensure accurate, efficient data analysis.
Tip #34: How to Rank Values in SQL
Data analysts often have to rank data.
But this can be tricky because in SQL there are three ranking functions that do nearly the same thing.
1. ROW_NUMBER
2. RANK
3. DENSE_RANK
The main difference is in how they handle a tie (i.e. equal values).
The example data, query, and output are shown below with examples of each ranking function.
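A sketch of all three side by side; the salespeople table and salesperson column are assumptions, while sales_amount comes from the explanation below:

SELECT
  salesperson,
  sales_amount,
  ROW_NUMBER() OVER (ORDER BY sales_amount DESC) AS row_number_result,
  RANK() OVER (ORDER BY sales_amount DESC) AS rank_result,
  DENSE_RANK() OVER (ORDER BY sales_amount DESC) AS dense_rank_result
FROM salespeople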
As you can see, RANK, ROW_NUMBER, and DENSE_RANK don't produce the same rankings.
ROW_NUMBER() assigns unique values to each row, even if the sales_amount is the same. It will not show any ties.
Use this if you want to make 100% sure you have a tie breaker.
RANK() assigns the same rank to the rows with the same sales_amount and leaves gaps in the ranking sequence.
This is similar to how a sporting event like golf would rank players in a tournament.
DENSE_RANK() assigns the same rank to the rows with the same sales_amount without leaving gaps in the ranking sequence.
Knowing the difference between these three can ensure you always use the right one!
Tip #35: Write Faster SELECT Statements with EXCEPT
Here’s a SQL tip to save you a future carpal tunnel diagnosis.
Ever find yourself typing out way too many column names in your SELECT statement?
SELECT EXCEPT is a useful feature in many SQL dialects that lets you select all columns in a table except for the ones you choose.
This can be super helpful when working with large datasets containing many columns.
Instead of typing out every column you do want, you can use SELECT * EXCEPT to drop just the ones you don't need.
For example, if you want to query all columns except “start_station_name” and “end_station_name,” you can use SELECT EXCEPT as follows:
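A sketch (BigQuery syntax); the table name is a placeholder:

SELECT * EXCEPT (start_station_name, end_station_name)
FROM bike_trips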
This will return a result with all columns except for the specified ones. Next time your table has more columns than you’d care to type…
Tip #36: A Cleaner Way to Filter on Window Functions
One hidden gem SQL command I love is QUALIFY.
This can be particularly useful when you want a single row for each window group based on specific criteria, without the need to write a separate Common Table Expression (CTE) or subquery.
Suppose you want to find the most recent purchase for each customer.
You can use QUALIFY along with ROW_NUMBER() to achieve this without using a CTE or subquery.
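A sketch, assuming a purchases table with customer_id, purchase_id, and purchase_date columns:

SELECT
  customer_id,
  purchase_id,
  purchase_date
FROM purchases
WHERE TRUE -- some dialects want a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY purchase_date DESC) = 1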
Remember QUALIFY next time you are working with window functions!
Tip #37: Simplify Your CASE Statements
Did you know you can simplify your CASE WHEN statements in SQL?
This can make your life a lot easier when you have a lot of conditions.
Here's how…
A typical CASE statement for classifying countries would look something like this:
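A sketch, assuming a users table with a country column:

SELECT
  country,
  CASE
    WHEN country = 'US' THEN 'North America'
    WHEN country = 'CA' THEN 'North America'
    WHEN country = 'MX' THEN 'North America'
    WHEN country = 'GB' THEN 'Europe'
    ELSE 'Other'
  END AS region
FROM users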
To reformat this in a simpler way you can write it like this:
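The same logic using the simple CASE form:

SELECT
  country,
  CASE country
    WHEN 'US' THEN 'North America'
    WHEN 'CA' THEN 'North America'
    WHEN 'MX' THEN 'North America'
    WHEN 'GB' THEN 'Europe'
    ELSE 'Other'
  END AS region
FROM users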
This simpler approach prevents you from having to write out the entire condition 'country =' for each statement.
Adding little shortcuts like this to your repertoire can make your life as a data analyst much easier!
One caveat is this will only work when your condition is a literal match like column = 'something.' It won't work with other types of comparison like greater than, LIKE, etc.
Tip #38: Appending Together Two Datasets
When working with data, a common task is appending one table of data to another.
Appending data is kind of like adding more pages to the end of a book or adding more items to the end of a shopping list.
The original data stays the same, but new data is added to the end.
For example, if you have a list of names like this: Alice, Bob, Charlie
And you append the name "Dave" to the list, the list would then look like this: Alice, Bob, Charlie, Dave.
In SQL, you append one dataset to another with UNION, which comes in two forms:
1. UNION ALL
2. UNION DISTINCT
While they both serve the purpose of merging query results, there are differences in their behavior and use cases.
UNION DISTINCT: The UNION DISTINCT operation combines the results of two or more SELECT queries and removes any duplicate rows.
Use UNION DISTINCT when you want to merge results from different queries and ensure that the output contains only unique rows, without duplicates.
UNION ALL: The UNION ALL operation combines the results of two or more SELECT queries and retains all rows, including duplicates.
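A minimal sketch, with hypothetical yearly order tables:

SELECT customer_id, order_date FROM orders_2023
UNION ALL
SELECT customer_id, order_date FROM orders_2024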
Since UNION ALL does not perform any duplicate elimination, it is generally faster than UNION DISTINCT.
Use UNION ALL when you want to merge results from different queries and keep all the rows, including duplicates, or when you know that there are no duplicates in the result sets and want to improve the query performance.
Append away!
Tip #39: Identifying Duplicates in Your Dataset
Finding duplicates in your data doesn’t have to be a headache.
Just memorize this common SQL pattern for finding dupes and you'll be set.
With SQL, you can use GROUP BY and HAVING to quickly identify rows that appear more than once.
For example, you can spot duplicate user registrations or redundant entries in any table.
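A sketch of the pattern, using a hypothetical users table and email column:

SELECT
  email,
  COUNT(*) AS occurrences
FROM users
GROUP BY email
HAVING COUNT(*) > 1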
Keep this common SQL pattern in mind next time you need to find a duplicate.
Tip #40: Working with Unstructured Data in SQL
Learning to work with data that's not in a table format is a big step as a data analyst.
ou’re probably used to seeing data formatted as rows and columns, like in a spreadsheet or SQL
Y
table.
JSON (JavaScript Object Notation) is just a different way to format data that’s not rows and columns.
hy is JSON important? In today's interconnected world, data is everywhere and it's often served to us
W
via APIs (application programming interfaces).
PIs use JSON to deliver data because it's light, easy to understand, and can be used by many
A
programming languages.
Now, let's imagine you've received a JSON file from an API and you want to work with this data in SQL.
Suppose we have a table named “user_profiles” with the following schema and data:
SQL has functions to help us handle JSON data.
One common SQL function for working with JSON is the JSON_VALUE function.
It takes two arguments:
1. The JSON data column (profile_data).
2. A JSON path expression ('$.name') that specifies the key to extract the value from.
The $.name JSON path expression refers to the "name" key in the JSON data.
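A sketch of pulling values out of the JSON column; the user_id column and the city key are assumptions for illustration:

SELECT
  user_id,
  JSON_VALUE(profile_data, '$.name') AS name,
  JSON_VALUE(profile_data, '$.city') AS city
FROM user_profiles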
Knowing how to extract specific values or manipulate JSON data can greatly enhance your ability to analyze and work with such data.
Remember, JSON and SQL might seem like different languages, but with a little practice, you'll be able to translate between them with ease. It's all part of your journey to becoming a versatile data analyst.
Embrace it, play around with it, and soon, you'll be extracting and analyzing data like a pro.
Happy analyzing!
Tip #41: The EXTRACT Function
While working in SQL you might need to extract specific parts of a date or time such as the year, month, or day.
The EXTRACT function takes two things:
1. The unit (e.g., year, month, day) you want to extract.
2. The date or time value from which to extract it.
Let's consider an example in which we have a table named "employees" with a hire_date column.
Here are examples of how to extract the YEAR, MONTH, and DAY from the hire_date column in our table.
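A minimal sketch of those three extractions (the employee_name column is assumed):

SELECT
  employee_name,
  EXTRACT(YEAR FROM hire_date) AS hire_year,
  EXTRACT(MONTH FROM hire_date) AS hire_month,
  EXTRACT(DAY FROM hire_date) AS hire_day
FROM employees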
As a data analyst, understanding how to use the EXTRACT function in SQL is essential for working with date and time data.
Tip #42: A SQL Function for Cleaning Data
The REPLACE function in SQL is a handy tool for cleaning data.
It allows you to replace any characters you want with anything else.
It's commonly used to remove characters from a string of text by replacing them with an empty string "".
REPLACE(input_string, search_string, replacement_string)
Let's consider an example where we have a table named "users" with email addresses formatted with <> around them.
We want to remove the <> characters from the email addresses.
We can use the REPLACE function in a query to achieve this.
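A minimal sketch of that query (the email column name is assumed):

SELECT
  REPLACE(REPLACE(email, '<', ''), '>', '') AS cleaned_email
FROM users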
In this example, we use nested REPLACE functions to remove both < and > characters.
The inner REPLACE function removes the < character by replacing it with an empty string (''), and the outer REPLACE function removes the > character in the same way.
This can be particularly helpful when working with datasets that contain inconsistencies, typos, or specific formatting that needs to be adjusted before analysis.
While there are more customizable solutions (like regex) for cleaning data, the REPLACE function offers a simple and efficient method for making these adjustments directly within your SQL queries, without the need for external tools or programming languages.
Tip #43: How to Deal with a Divide by Zero Error
If you’ve been an analyst for long, you have probably run into this common error: Can’t divide by zero.
SAFE_DIVIDE is a function in SQL that allows you to perform division operations while safely handling potential divide-by-zero errors.
If a divide-by-zero error is encountered, instead of causing your query to crash or return an error, the
SAFE_DIVIDE function will return NULL.
Suppose we have a table named "sales" with product_id, total_revenue, and units_sold columns, where product_id 2 has zero units sold.
We want to calculate the average revenue per unit for each product. If we try to divide total_revenue by units_sold directly, we will encounter a divide-by-zero error for product_id 2.
To avoid this error and handle the division safely, we can use the SAFE_DIVIDE function. It takes two arguments:
1. The numerator (total_revenue).
2. The denominator (units_sold).
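A minimal sketch of the query:

SELECT
  product_id,
  SAFE_DIVIDE(total_revenue, units_sold) AS revenue_per_unit
FROM sales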
When the denominator is zero, as in the case of product_id 2, the function returns NULL instead of causing an error.
As a data analyst, understanding how to use functions like SAFE_DIVIDE to handle potential errors in your SQL queries is important.
It helps you maintain the stability and reliability of your analysis, allowing you to work with data that may
contain unexpected zero values or other edge cases that could otherwise cause your queries to fail.
Note: SAFE_DIVIDE is what this function is called in BigQuery. Other SQL environments may have a different name for this function, but it is essentially the same thing. To find yours, just type "safe divide [insert platform, i.e. Snowflake, Postgres, etc.]" into Google.
Tip #44: My Most Used Data Cleaning Function
I reviewed all the SQL code I've written recently, and one data cleaning function came up more than any other.
The LOWER function in SQL is a string manipulation function that converts text to lowercase.
It is useful when you need to normalize text data, such as email addresses, to ensure consistent
formatting and avoid issues with case sensitivity in your queries.
Email addresses are often entered by users in a free-form manner, leading to inconsistent capitalization.
Some SQL environments are case-sensitive, which means that 'kyle@thequery.io' and 'Kyle@thequery.io' would be treated as different email addresses, even though they represent the same person.
By using the LOWER() function, you can standardize email addresses to lowercase, ensuring that your queries and joins work correctly regardless of the original capitalization.
Usage is straightforward.
Let's consider an example where we have a table named "users" with email addresses in various capitalizations.
We want to convert all email addresses to lowercase, so we use LOWER() to achieve this.
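A minimal sketch:

SELECT
  LOWER(email) AS normalized_email
FROM users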
As a data analyst, understanding how to use the LOWER() function in SQL is essential, as it allows you to standardize and clean text data within your queries.
Tip #45: Referencing a Column Alias in the WHERE Clause
A common SQL mistake is attempting to reference an alias in the WHERE clause.
The WHERE clause can only reference the columns in the FROM clause at the time of execution.
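A minimal sketch of the mistake and the fix (table and column names are hypothetical):

-- This fails: net_revenue doesn't exist yet when WHERE runs
SELECT
  revenue - cost AS net_revenue
FROM sales
WHERE net_revenue > 0

-- This works: repeat the expression (or wrap the query in a CTE/subquery)
SELECT
  revenue - cost AS net_revenue
FROM sales
WHERE revenue - cost > 0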
Tip #46: How to Learn Joins
If you are struggling with JOINs in SQL, here is my recommendation...
One of the key concepts in SQL is understanding the different types of relationships that can exist between tables.
1. One-to-Many
2. One-to-One
3. Many-to-Many
Just as a parent can have multiple children, a table in a one-to-many relationship can have multiple records in another table.
For example, in a database of customers and orders, each customer can have multiple orders.
This relationship is represented by a foreign key in the orders table that references the primary key of the customers table (customer id).
In a one-to-one relationship, a record in one table can only be related to one record in another table.
For example, in a database of customers and their login credentials, each customer only has one set of login credentials (username/password combo).
Just as a recipe can have multiple ingredients and an ingredient can be used in multiple recipes, a record in a many-to-many relationship can be related to multiple records in another table, and vice versa.
For example, in a database of students and classes, a student can be enrolled in multiple classes, and a class can have multiple students.
This relationship is represented by a junction table, also known as a join table, which contains the foreign keys of both tables.
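For instance, a minimal sketch of querying through a junction table (the enrollments, students, and classes names are hypothetical):

SELECT
  s.student_name,
  c.class_name
FROM enrollments e
JOIN students s ON e.student_id = s.student_id
JOIN classes c ON e.class_id = c.class_id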
In order to effectively query your data, you need to understand the types of relationships different tables
have to each other.
This allows you to create a visual map in your mind of how the data is stored and what you need to do to shape the data the way you want for analysis.
This is typically one of the first things I do when coming across new data.
Start to think in terms of data and relationships and you will have a much better understanding of the data you are working with!
Tip #47: Generating All Possible Combinations with CROSS JOIN
A join type that's often overlooked (but can be super useful for certain analytics tasks) is the CROSS JOIN.
A CROSS JOIN returns every possible combination of rows from the two tables being joined.
This might sound weird at first, but it's great for scenarios where you need all permutations of two lists, like all possible color-size combinations, or when creating date "scaffolds" for time series.
It can also be used to help with inefficient joins, but that is a lesson for another time.
-- colors table
color
---------
Red
Green
Blue

-- sizes table
size
---------
Small
Medium
Large

SELECT
  c.color,
  s.size
FROM colors c
CROSS JOIN sizes s
ORDER BY c.color, s.size
It’s a quick way to generate permutations without manually listing them out or writing a subquery.
The output would look something like this:
color   | size
--------|--------
Blue    | Small
Blue    | Medium
Blue    | Large
Green   | Small
Green   | Medium
Green   | Large
Red     | Small
Red     | Medium
Red     | Large
Just remember that CROSS JOIN can produce very large result sets if your tables are big, so use it carefully!
P.S. This tip illustrates the basics of a CROSS JOIN. There is a lot more you can do with it. I'll teach you some crazy CROSS JOIN stuff another time.
Tip #48: Traversing Hierarchical Data with Recursive CTEs
Working with hierarchies (like organization charts, folder structures, subordinates in a company, etc.) can be tricky in SQL.
You typically need to use a Recursive CTE, which can be a bit mind blowing. 🤯
A Recursive CTE is super helpful for these scenarios because it lets you write a query that references itself, naturally walking through data relationships of unlimited depth.
A recursive CTE has two parts:
- Anchor Query: This is the starting place – the "root node" or "top-level" record.
- Recursive Query: This part references the CTE itself, allowing it to progress from one level of hierarchy to the next.
Let’s say we have a table called “employees” with the following columns:
employee_id  manager_id  employee_name
-----------  ----------  -------------
1            NULL        Emily (CEO)
2            1           Sarah
3            1           Carlos
4            2           Judy
5            4           Aaron
WITH RECURSIVE employee_hierarchy AS (
  -- Anchor query: start with the top-level employee (no manager)
  SELECT
    employee_id,
    manager_id,
    employee_name,
    1 AS level
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive query: pull in the next level of the hierarchy each pass
  SELECT
    e.employee_id,
    e.manager_id,
    e.employee_name,
    eh.level + 1
  FROM employees e
  JOIN employee_hierarchy eh
    ON e.manager_id = eh.employee_id
)
SELECT
  employee_id,
  manager_id,
  employee_name,
  level
FROM employee_hierarchy
ORDER BY level, employee_id
First, the Anchor Query finds the top-level employee (manager_id IS NULL). In our case, that's the CEO Emily.
Then, the Recursive Query joins employees back to the CTE (employee_hierarchy) using manager_id = employee_id, continuously pulling in rows for the next level down.
Each time the recursion runs, we add 1 to the level column so we can see where each employee falls in the hierarchy.
They're usually more straightforward than subqueries for multi-level joins, and you can control how you step through the data (e.g., track depth, filter partial hierarchies).
That’s it!
Once you see how the anchor and recursive parts work together, recursive CTEs become a powerful tool for any analyst handling hierarchical data.
Recursive CTEs can be tricky. To learn more about them, check out the BigQuery documentation.
Happy querying!
Tip #49: Filtering Aggregated Data with HAVING (vs. WHERE)
When you first learn SQL, you quickly meet the WHERE clause, which is used to filter rows before they're grouped.
But there's another clause called HAVING that might feel a bit mysterious at first.
HAVING is specifically used to filter aggregated results (like SUM, COUNT, AVG, etc.) after you've used GROUP BY.
If you forget to use HAVING and try to filter an aggregated column with WHERE, you’ll get an error.
Suppose we have a table named “sales” that shows sales from different regions:
region  sale_id  amount
------  -------  ------
North   1001     5000
North   1002     7000
South   1003     4000
East    1004     6000
East    1005     8000
West    1006     3000
West    1007     2000
North   1008     9000
SELECT
  region,
  SUM(amount) AS total_sales
FROM sales
GROUP BY region
HAVING SUM(amount) > 10000
ORDER BY total_sales DESC
Finally, we used HAVING SUM(amount) > 10000. Now that each region's amounts are summed, we filter to show only those with totals above 10,000.
If we tried to put SUM(amount) > 10000 in a WHERE clause, we’d get an error because you can’t filter
an aggregated value at the row-level filter stage.
Mastering HAVING is essential for intermediate SQL, especially when you're building reports or dashboards that need to show only groups meeting certain criteria.
Happy querying!
Tip #50: WHERE TRUE
One thing that can be a pain in SQL is figuring out what to put in your WHERE clause.
Here’s a weird tip I discovered recently that changed the way I write queries.
So let's say you're querying and playing around with different conditions in the WHERE clause, commenting different lines in and out as you go.
But it’s a pain when you need to comment out the FIRST condition in your WHERE clause.
Then you have to rearrange what order everything is in so the query doesn't break.
But starting your WHERE clause with "WHERE TRUE" makes exploring and tweaking queries so much easier!
Instead of commenting out filters one by one, you can simply toggle conditions without breaking the
query.
You can easily add or remove filters (just comment/uncomment) without worrying about syntax errors.
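A minimal sketch of the idea (table and filters are hypothetical):

SELECT *
FROM orders
WHERE TRUE
  -- AND order_date >= '2024-01-01'
  AND status = 'shipped'
  -- AND region = 'West'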
It’s a small trick, but it can save you so much time when exploring data or debugging long queries.
Plop a little “WHERE TRUE” in there and you’ll save yourself the headache.
Tip #51: An ORDER BY Trick
Did you know you can ORDER BY anything in SQL?
Let's say you only want a random group of 10 customers from your database. To do this, we can ORDER BY the RAND() function:
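A minimal sketch (BigQuery syntax; some dialects use RANDOM() instead):

SELECT *
FROM customers
ORDER BY RAND()
LIMIT 10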
Think of it like each row is running this function and getting a random number assigned to it.
Then at the end the rows are ordered by this random number that is assigned to each row.
That's how you get randomized ordering. Then we just take the first 10 results by using LIMIT.
But you can actually order by anything you want, not just RAND.
Learning little nuances like this helps you better understand how SQL works so you can write queries with confidence.
Tip #52: CTEs are Simpler Than You Think
I often see CTEs listed on “advanced SQL” lists.
Writing a CTE is actually one of the simplest ways to structure and organize your queries.
If you know how to write a SELECT statement, you’re already 90% of the way there.
- A CTE is just a temporary result set you can reference later in your query.
- It starts with the WITH keyword, followed by the name of your CTE and the query that defines it.
- If there are multiple CTEs you want to write, just separate them with a comma.
- You can think of it as a cleaner, more readable way to reuse logic in your SQL script.
Here’s an example:
WITH sales_by_region AS (
  SELECT
    region,
    SUM(sales) AS total_sales
  FROM sales_data
  GROUP BY region
)
SELECT *
FROM sales_by_region
WHERE total_sales > 10000
In this query we define a CTE called “sales_by_region” to calculate sales totals by region.
Then we use it to filter for regions with more than $10,000 in sales.
There are obviously more complex implementations than this, but this is the gist of it!
If CTEs have ever felt confusing to you, I hope this helps simplify things.
Tip #53: SQL JOINs Explained
SQL joins can feel like a mind-bending puzzle.
Don't worry!
This amazing visual by C.L Moffatt is here to clear things up for you.
It shows seven different ways you can join two tables together in SQL:
1. Inner Join
2. Left Join
3. Right Join
4. Full Join
5. Left Excluding Join
6. Right Excluding Join
7. Outer Excluding Join
Some might argue that the last three aren't technically joins, but we're going to consider them as such for simplicity.
- Inner Join: This is your go-to join, returning all matching records from both tables.
- Left Join: This fetches all records from the left table, and any matching records from the right table.
- Right Join: This is the opposite of a Left Join, bringing all records from the right table, and any matching ones from the left.
- Outer Join: This returns all records from both tables, matching wherever possible.
- Left Excluding Join: This pulls all records from the left table that don't have a match in the right table.
- Right Excluding Join: It fetches all records from the right table that don't find a match in the left table.
- Outer Excluding Join: This returns all records from both tables that do not match.
Side note: I have never actually used a RIGHT JOIN in practice, but nevertheless I still like this visual for completeness.
Trust me, once you get the hang of it, SQL Joins will no longer feel like a brain teaser!
Don't shy away from getting your hands dirty with some SQL Join exercises.
Tip #54: NULL Values in WHERE Clause
Null values in SQL can behave weirdly sometimes.
If you don't know this, it can cause unexpected query output leading to frustration.
Let's say you are trying to remove any rows from your “animals” table where the type is “dog.”
SELECT *
FROM animals
WHERE type <> 'dog'
This will remove any rows from the output with the type of “dog.”
Great.
But it will also not return any rows where the column is NULL.
To fix this, you need to add a COALESCE() function in your WHERE clause like this:
SELECT *
FROM animals
WHERE COALESCE(type, '') <> 'dog'
This will turn any null values into an empty string, which will evaluate to true when compared as not equal to "dog."
This issue caused me some major frustration with a query I was writing until I figured out what the problem was.
Tip #55: Calculating Running Totals with Window Functions
When you want to see how a metric accumulates over time (like daily sales or a running count of user signups), window functions are your go-to.
SUM(column_name) OVER (
PARTITION BY partition_column
ORDER BY ordering_column
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
This might look confusing at first, but it makes sense once you get the hang of it.
Say you have a sales table with these columns:
- Sale_date
- Product_id
- Amount
You’d like to see a running total of amount per product, day by day.
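A minimal sketch of that query (assuming the table is called sales):

SELECT
  sale_date,
  product_id,
  amount,
  SUM(amount) OVER (
    PARTITION BY product_id
    ORDER BY sale_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM sales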
PARTITION BY product_id groups the data so each product’s total is tracked independently.
Running totals are useful for things like trend analysis to see how a value (such as sales) grows over time, cumulative counts to track how many sign-ups or events have happened up to a certain date, and rolling metrics.
Tip #56: Generate a "Date Spine" to Avoid Missing Dates in Charts
Here’s a SQL tip I never anticipated needing.
Have you ever created a chart and noticed some dates missing?
That happens when your data table doesn't have entries for those dates – they don't show up in your query results.
A date spine solves this by ensuring every date in a range is present, whether there was any activity (sales, sign-ups, etc.) or not.
WITH date_spine AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2023-01-01', '2023-01-31', INTERVAL 1 DAY)) AS day
)
SELECT
  ds.day AS calendar_date,
  COALESCE(SUM(s.sales_amount), 0) AS daily_sales
FROM date_spine ds
LEFT JOIN sales_table s
  ON ds.day = s.sale_date
GROUP BY ds.day
ORDER BY ds.day
Using a LEFT JOIN ensures all dates from the spine appear, even if no matching sales row exists (which yields NULL, replaced by 0 via COALESCE).
This ensures your charts and reports have every date, preventing misleading gaps.
Omitting those days can skew averages or trends and make your charts look weird.
Understanding the technique of creating and joining on a date spine ensures you never miss a date again, even if your underlying data doesn't record an event for that day!
P.S. Another option is to create a “date dimension” table that has all the dates in it and join to that:
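A minimal sketch of that alternative (the date_dimension table and date_day column are hypothetical):

SELECT
  d.date_day,
  COALESCE(SUM(s.sales_amount), 0) AS daily_sales
FROM date_dimension d
LEFT JOIN sales_table s
  ON d.date_day = s.sale_date
GROUP BY d.date_day
ORDER BY d.date_day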
Tip #57: Validating String Length
When cleaning or validating text fields, you often need to check how long a string is.
The LENGTH function (or LEN in some SQL dialects) returns the number of characters in a string.
A common example is phone number validation — ensuring each phone number has the correct digit count.
Suppose your table “users” has a column “phone_number” with inconsistent formats.
One way you might detect which numbers are the wrong length is by using a CASE statement:
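A minimal sketch, assuming 10-digit numbers are the expected format:

SELECT
  phone_number,
  CASE
    WHEN LENGTH(phone_number) = 10 THEN 'Valid Format'
    ELSE 'Invalid Format'
  END AS phone_check
FROM users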
In this query, LENGTH(phone_number) checks how many digits the phone number has.
CASE classifies each phone number as either “Valid Format” or “Invalid Format.”
This allows you to quickly spot values that deviate from expected lengths (i.e., short or overly long strings).
It also improves quality control, preventing errors downstream by flagging problematic entries early
(such as phone numbers or ID fields).
The LENGTH function by itself might not always be that useful, but when paired with a CASE statement it can be quite powerful!
Tip #58: Combining Columns into a Unique Key with CONCAT
Here’s a trick I learned early in my data career that I’ve used dozens of times.
Using CONCAT, you can merge those columns into one unique key for easier referencing or deduping.
Say your sales table has these columns:
- Sale_date
- Store_id
- Product_id
- Amount
No single column is guaranteed unique, but (sale_date, store_id, product_id) together do form a unique record.
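A minimal sketch of building that key:

SELECT
  CONCAT(CAST(sale_date AS STRING), '-', CAST(store_id AS STRING), '-', CAST(product_id AS STRING)) AS unique_key,
  amount
FROM sales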
CONCAT(...) merges multiple columns into one string.
Casting ensures all pieces are treated as strings (important if some are dates/numbers).
Quickly spot exact duplicates by grouping or filtering on this new key, and simplify joins on multi-column matches by using the single key as a match field.
Ensure data integrity by making it clear that the combination of values is what defines each unique record.
When no natural primary key exists, building one with CONCAT can save you plenty of hassle down the road and keeps your data consistent and well-organized.
Tip #59: Optimize Your Query Performance and Cost
You can do all the query optimizations in the world.
But the single biggest thing that accounts for 99% of reducing costs and speeding up your queries is this: select only the columns you actually need (and avoid SELECT *).
Unlike traditional row-store databases (like MySQL or PostgreSQL), where the entire row is typically retrieved, column-store systems only read the columns you request.
By choosing fewer columns, you reduce the amount of data that has to be read from disk.
This directly translates to less I/O, lower processing time, and (especially in BigQuery) reduced query costs, because you're billed largely by how much data you scan.
Here’s an example:
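A minimal sketch of the two approaches (the big_table columns are assumed):

-- Not optimal: scans every column
SELECT *
FROM big_table

-- Better: scans only what you need
SELECT
  user_id,
  total_spend
FROM big_table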
In the “Not optimal” query, you scan every column from “big_table,” likely increasing cost and execution
time significantly compared to selecting only “user_id” and “total_spend.”
Bottom Line: Always aim to select only the columns you actually need in your queries.
It’s the quickest win for reducing cost and improving performance in a columnar data warehouse like
BigQuery.
By doing so, you’ll keep queries both fast and cost-effective.
Tip #60: Fact vs. Dimension Tables
One of the first SQL lessons I learned in my first job was the difference between fact and dimension tables.
At first it wasn’t obvious why it mattered — until I had to actually use it.
If you worked at a car dealership, the fact tables might be car inventory (what’s in stock) and car sales
(what’s been sold).
The dimension tables hold the descriptive attributes you slice those facts by. In this example, those could be the make of the cars (Ford, GM, Toyota, etc.), car type (SUV, sedan, truck), odometer range, and year.
If you've ever filtered a car search by price, mileage, or body type, you were interacting with dimension tables.
Typically, an analysis like this would involve something like sales and inventory turnover.
Understanding this helps you structure data efficiently, write better queries, and build dashboards that make sense.
If you’re working with analytics, it’s one of the most useful concepts to know!
Got a dataset? Try spotting which tables are the facts and which are the dimensions.
Tip #61: Aggregate Your Data at Multiple Levels with GROUPING SETS
I prefer simple analytics.
But there's always been one challenge: calculating metrics at different levels of granularity. Let's say you're tracking trial sign-ups and conversions.
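Here's a minimal sketch of the idea, assuming a hypothetical trials table with signup_date, channel, signups, and conversions columns (GROUPING SETS is supported in BigQuery and several other warehouses):

SELECT
  signup_date,
  channel,
  SUM(signups) AS total_signups,
  SUM(conversions) AS total_conversions,
  SAFE_DIVIDE(SUM(conversions), SUM(signups)) AS conversion_rate
FROM trials
GROUP BY GROUPING SETS (
  (signup_date),
  (channel),
  ()
)

Rows where a column isn't part of the active grouping set come back as NULL, and the empty set () adds an overall total row.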
Why This Works Well for Google Sheets:
- Pre-Aggregated Data: This single query precomputes all necessary granularities, allowing you to use simple SUMIFS or AVERAGEIFS in Google Sheets to reference different levels.
- Eliminates Extra Queries: Instead of running separate queries for daily, weekly, and channel-level conversion rates, this one query returns all the data you need in a single result set.
- Handles Calculated Metrics Correctly: Since conversion rate is a ratio, it must be computed at each level separately — this query ensures correct calculations rather than just summing rates incorrectly.
This method keeps everything clean, reduces query costs, and ensures that you get the correct calculations at every level.
Tip #62: Check for Multiple NULL Columns Using COALESCE
Normally, if you want rows where all three columns (x, y, z) are NULL, you might write:
WHERE x IS NULL
AND y IS NULL
AND z IS NULL
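A shorter way, sketched with the same COALESCE idea the tip is named after:

WHERE COALESCE(x, y, z) IS NULL

COALESCE returns the first non-NULL value in the list, so it only comes back NULL when x, y, and z are all NULL.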
This one-liner can make your queries simpler and easier to read when checking for columns that must all be NULL.
Tip #63: Calculate Moving Averages in SQL
A rolling average (or moving average) helps smooth out short-term fluctuations to highlight longer-term trends in time-series data.
In stock market analysis, for example, you might want a 5-day or 30-day rolling average of closing prices to see a smoothed trend rather than day-to-day noise.
You can compute a 5-day moving average of closing_price per stock symbol like this:
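A minimal sketch, assuming a stock_prices table with symbol, trade_date, and closing_price columns:

SELECT
  symbol,
  trade_date,
  closing_price,
  AVG(closing_price) OVER (
    PARTITION BY symbol
    ORDER BY trade_date
    ROWS BETWEEN 4 PRECEDING AND CURRENT ROW
  ) AS moving_avg_5_day
FROM stock_prices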
Using PARTITION BY symbol ensures that each stock (symbol) is processed independently.
ORDER BY trade_date defines the sequence of rows (from oldest to newest trades).
ROWS BETWEEN 4 PRECEDING AND CURRENT ROW ensures that for each row, the window includes the current day plus the four previous days, totaling five days.
Here are a few more practical use cases for rolling averages:
Website Traffic or User Metrics: Monitor rolling 7-day or 30-day averages for daily active users, sign-ups, or page views.
Sales or Inventory Planning: Track product sales averages over flexible periods to better forecast inventory needs or spot seasonal effects.
You can also combine multiple moving averages (short and long-term) in the same query for richer analysis.
Whether it's financial data, user metrics, or daily sales, rolling averages help reduce noise so you can focus on real trends over time.
Happy analyzing!
Tip #64: Spot Checking and Testing
Sometimes you're not 100% sure your SQL results are correct, but you spot check a few rows, and they all look fine.
This is a good start, but it's even better to build systematic tests into your queries.
Here’s why:
Manually verifying a few rows (or edge cases) can quickly reveal mismatches between your expectations and the data.
If the small sample matches what you expect, it’s a positive sign, but it’s not proof that the entire
dataset is correct.
Adding automated tests via conditions in your queries that fail when something is wrong provides confidence.
Suppose you have an "orders" table and a related "line_items" table. Each order in "orders" has a "total_amount," and the "line_items" for that order contain a breakdown of products, quantities, and prices.
We want to ensure that the sum of the line items for each order actually matches the "total_amount" column in the "orders" table.
If there’s more than a tiny difference — say more than 1 cent — it likely indicates a data problem.
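Here's a minimal sketch of such a test query (column names are assumed from the description above):

SELECT
  o.order_id,
  o.total_amount AS expected_total,
  SUM(li.quantity * li.unit_price) AS line_item_total
FROM orders o
JOIN line_items li
  ON o.order_id = li.order_id
GROUP BY o.order_id, o.total_amount
HAVING ABS(o.total_amount - SUM(li.quantity * li.unit_price)) > 0.01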
In the query above, we sum up the extended price (quantity * unit_price) for each order and compare it
to “orders.total_amount.”
If “expected_total” in the “orders” table doesn’t match the actual sum of the line items (line_item_total)
within a 1-cent margin, we return those rows.
If any rows appear, you know exactly which order(s) need deeper investigation.
By writing targeted test queries like this, you'll build far more confidence in your data, and you'll catch subtle mistakes before they hit a dashboard or report!
Tip #65: Query Multiple Tables at Once with Wildcard Tables
When you have multiple tables with similar names and schemas (often seen in partitioned or sharded datasets), BigQuery's wildcard table feature lets you query them all in one go.
Instead of writing one query per table, you can reference them with a single wildcard pattern.
Suppose you store monthly data in separate tables "mytable_2023_01," "mytable_2023_02," and so on.
SELECT *
FROM `project_id.dataset_id.mytable_*`
WHERE _TABLE_SUFFIX BETWEEN '2023_01' AND '2023_03'
_TABLE_SUFFIX is an automatic column that BigQuery provides that indicates the matched substring from the table name (e.g., 2023_01).
WHERE _TABLE_SUFFIX BETWEEN '2023_01' AND '2023_03' filters which tables you're including in the query.
Using a wildcard table allows for convenience (one query for multiple tables, instead of multiple queries), scalability (easily handle many monthly, daily, or weekly tables without rewriting queries), and flexibility (combine it with standard SQL clauses like WHERE, GROUP BY, etc.).
Extra Tip: To limit costs, always filter by _TABLE_SUFFIX so you don't inadvertently scan every wildcard-matching table in your dataset.
Wildcard tables in BigQuery keep your SQL tidier and your process more efficient!
Tip #66: Creating Your Own SQL Functions to Reuse Logic
Sometimes, built-in SQL functions don’t cover exactly what you need.
Maybe you repeatedly filter out "test" emails from your data and hate copying the same WHERE clause everywhere.
In databases that support User-Defined Functions (UDFs), such as BigQuery, you can create your own function once and reuse it in multiple queries.
Let’s say your ecommerce table has rows with “test” email addresses you want to exclude every time.
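Here's a minimal sketch of such a UDF (the dataset, function name, and email list are hypothetical):

CREATE OR REPLACE FUNCTION mydataset.is_test_email(email STRING)
RETURNS BOOL
AS (
  LOWER(email) IN ('test@example.com', 'qa@example.com', 'demo@example.com')
);

-- Then use it in any query:
SELECT *
FROM mydataset.orders
WHERE NOT mydataset.is_test_email(customer_email)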
It returns TRUE if the passed-in email is in the specified list, FALSE otherwise.
You can update this function once to add or remove test emails.
Instead of typing a long WHERE clause for test emails in each query, you simply call the function once.
If you need to add new “test” addresses, just update the function, and all queries using it reflect the
change.
Tip #67: When to Use DISTINCT vs. GROUP BY
When you're trying to get unique records from a table, you might ask, "Should I use DISTINCT or GROUP BY?"
Both remove duplicates, but they serve different needs, and new analysts often get confused by which to use.
SELECT DISTINCT
  customer_id
FROM orders
DISTINCT removes duplicates, returning each customer_id once, ignoring any duplicates.
It’s ideal when you’re not trying to sum, average, or otherwise aggregate.
It’s good for quickly listing unique IDs or combinations of columns (e.g., (region, product_id) pairs).
SELECT
  customer_id,
  COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
Not only does GROUP BY ensure each customer_id is unique, but it also allows you to do calculations (like COUNT, SUM, etc.) for each group.
You can include multiple columns to group on and add as many aggregates as you need.
It’s good for summaries, totals, averages, or any aggregated stats across categories or groups.
Let’s say you want to count how many unique customers exist:
-- DISTINCT approach
SELECT
  COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
Here, DISTINCT directly creates unique rows for customer_id before counting.
Alternatively:
-- GROUP BY approach
SELECT
  COUNT(*) AS unique_customers
FROM (
  SELECT customer_id
  FROM orders
  GROUP BY customer_id
) subquery
You group by customer_id (one row per customer), and then count how many groups there are.
Use DISTINCT when you just need the unique rows themselves. Use GROUP BY when you want to summarize or aggregate data across those unique rows.
Happy querying!
Tip #68: Slowly Changing Dimensions
We've talked about fact tables vs. dimension tables.
But did you know there is a table structure that is like a combination of the two?
Slowly changing dimensions handle the scenario where dimension attributes change over time, yet you still need historical context.
For instance, a customer can move from San Francisco to Seattle, and you want to record both addresses in your history, not just overwrite it.
If you overwrote the address on every change, you’d lose historical info about where they used to live.
CREATE TABLE customer_dimension (
  customer_id INT,
  customer_name VARCHAR(100),
  address VARCHAR(200),
  city VARCHAR(100),
  state VARCHAR(2),
  start_date DATE,
  end_date DATE,
  current_flag BOOLEAN
)
When the customer's address changes, you "close" the old record by setting "end_date" to the change date and "current_flag" = FALSE.
Then you create a new record with the updated address, "start_date" = change date, "end_date" = NULL, and "current_flag" = TRUE.
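A minimal sketch of that two-step change (the customer_id and values are hypothetical):

-- Step 1: close the old record
UPDATE customer_dimension
SET end_date = '2024-06-01', current_flag = FALSE
WHERE customer_id = 123 AND current_flag = TRUE;

-- Step 2: insert the new record
INSERT INTO customer_dimension
  (customer_id, customer_name, address, city, state, start_date, end_date, current_flag)
VALUES
  (123, 'Jane Doe', '456 Pine St', 'Seattle', 'WA', '2024-06-01', NULL, TRUE);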
You can join your fact table to "customer_dimension" using the date range (start_date / end_date) to figure out which address was active at the time of a particular purchase.
You never lose the original data, and analysts can see exactly how a customer's address evolved over time.
Slowly changing dimensions ensure your data warehouse accurately reflects historical context, giving you deeper insights into how dimension attributes evolve over time!
Tip #69: Sometimes You Just Need a Giant CASE Statement
SQL can feel elegant — until you have a hundred messy conditions that need wrangling.
When your data is inconsistent or mislabeled, you might need a huge CASE statement to clean it.
Suppose you’ve got an events table with columns like “event_type,” “event_name,” and “ticket_price.”
The data is all over the place, and you need to bucket them into a few consolidated categories.
Sometimes the simplest (if not prettiest) approach is one giant CASE:
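A minimal sketch of the idea (the category labels and conditions are hypothetical):

SELECT
  event_name,
  CASE
    WHEN LOWER(event_type) IN ('concert', 'gig', 'live music') THEN 'Music'
    WHEN LOWER(event_type) IN ('game', 'match', 'tournament') THEN 'Sports'
    WHEN LOWER(event_name) LIKE '%conference%' THEN 'Business'
    WHEN ticket_price = 0 THEN 'Free Event'
    -- ...dozens more conditions as needed...
    ELSE 'Other'
  END AS event_category
FROM events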
If your data source is chaotic, a big CASE might be the cleanest interim fix.
Long-term, try standardizing your data at the source so you don't need a monstrous CASE in every query.
Another option is to create a lookup or mapping table if these categories are stable and repeated in many places. But even that can get unwieldy if conditions are extremely scattered.
There isn't always a magical SQL shortcut — huge CASE statements are sometimes inevitable.
Do your best to keep it organized, add comments, and remember that "done is better than perfect" when messy data meets tight deadlines.
Tip #70: Avoiding the “Double Counting” Pitfall
One of the trickiest mistakes for new SQL users is accidentally double-counting rows when joining or aggregating.
You might expect a certain total, then you see an inflated sum, and it’s not immediately clear why.
If you join a one-to-one table (like “customers”) to a one-to-many table (such as “orders”), your
“customers” data will replicate for every order, causing inflated totals.
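Here's a minimal sketch of the dedupe pattern described next (table and column names are assumed; QUALIFY is BigQuery syntax):

SELECT
  t.transaction_id,
  t.amount,
  u.email
FROM transactions t
JOIN (
  SELECT *
  FROM users
  WHERE TRUE
  QUALIFY ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) = 1
) u
  ON t.user_email = u.email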
QUALIFY ensures only one row per email (e.g., latest created user).
This prevents transactions from being duplicated due to multiple user rows.
If the joining table isn't guaranteed to be unique on the join key (or even if it is), I like to use QUALIFY ROW_NUMBER() to ensure it's always unique in the future.
Tip #71: Combine Multiple Rows into One with STRING_AGG
Ever needed to consolidate multiple text rows (like customer messages) into a single row for each conversation?
That's what STRING_AGG is for.
It allows you to concatenate values from multiple rows within a group, using a delimiter of your choice
(like a comma).
Say your messages table has these columns:
- Conversation_id
- Created_at
- Content
- … other columns
You can group messages by conversation and combine all content fields into one row:
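A minimal sketch (assuming the table is called messages):

SELECT
  conversation_id,
  MIN(created_at) AS first_message_at,
  STRING_AGG(content, ', ') AS full_conversation
FROM messages
GROUP BY conversation_id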
STRING_AGG(content, ', ') merges all content from the same group into a single string, separated by commas.
MIN(created_at) helps you track the earliest message if you need context on ordering.
If you only want certain message types or users, apply conditions in the WHERE clause or use a CTE.
STRING_AGG allows you to quickly see the entire conversation in one cell and provides a concise view rather than multiple rows.
You can also join or pivot your data more easily when you need aggregated text.
Tip #72: Parsing Text with REGEXP_EXTRACT: Don't Fear the Regex!
For many data analysts, regex (regular expressions) looks like a cryptic language.
But often you only need a few basic patterns to handle common tasks, like extracting a code from a messy string.
That's where REGEXP_EXTRACT (available in BigQuery and, under similar names, in other SQL dialects) comes in handy.
"Thank you for your purchase. Your order #67890 ships soon."
SELECT
  message,
  REGEXP_EXTRACT(message, r'order\s#(\d+)') AS order_id
FROM mydataset.orders_table;
- order\s# looks for the word "order" followed by whitespace and a hash sign (#).
- (\d+) is a capture group: \d means "digit."
- + means "one or more digits."
- Putting it in parentheses means we want to capture that numeric sequence.
REGEXP_EXTRACT(message, ...) returns only what's inside the first capture group, which is the digits right after #.
You don't need to master every advanced regex trick — knowing "digits \d," "wildcard .," "plus +," or "asterisk *" gets you a long way.
Instead of writing a big substring dance, REGEXP_EXTRACT zeroes in on exactly what you need.
Here are a few more regex quick tips:
- Escape Characters: Notice you often need double backslashes in some SQL dialects, like \\d+.
- Testing: Use an online regex tester tool to confirm your pattern before plugging it into SQL.
- Document: Regex can be cryptic — add comments explaining what your pattern does.
Regex can seem intimidating, but for routine text extraction, you’ll mostly rely on a few simple patterns.
With REGEXP_EXTRACT, you can slice through messy text fields like a pro, grabbing only the data you need.
Tip #73: Easily Group Data by Month with LAST_DAY
When building monthly reports or financial dashboards, you often need to bucket transactions into their month (not just by date).
The LAST_DAY function takes a date or datetime and returns the last day of that month, which you can then use as a consistent monthly identifier.
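A minimal sketch, assuming a sales table with sale_date and sales_amount columns:

SELECT
  LAST_DAY(sale_date) AS month_end,
  SUM(sales_amount) AS monthly_sales
FROM sales
GROUP BY month_end
ORDER BY month_end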
LAST_DAY(sale_date) converts each date to the last calendar day of its month (e.g., all April 2023 sales become 2023-04-30).
GROUP BY the resulting end-of-month value lumps all April sales into a single row (with the date 2023-04-30).
You get one row per month, which is perfect for monthly dashboards, trend lines, or financial summaries.
In some SQL engines, you can also group by DATE_TRUNC('month', sale_date) or something similar
— both approaches yield a clear monthly grouping.
LAST_DAY is especially nice if you prefer using an actual date to represent the month-end, which feels more intuitive in financial contexts.
LAST_DAY does the heavy lifting, turning your daily or even timestamped data into neat, month-based aggregates!
Tip #74: Anonymizing Personally Identifiable Information with a Hash Function
Sometimes you need to remove or obscure personal data (like emails, phone numbers, etc.) so that individuals can't be identified.
A hash algorithm (e.g., SHA256, MD5) takes the input (email, phone) and generates a unique fixed-length output string (the "hash").
The original data can't be reconstructed from the hash, protecting your dataset from exposing personal info.
SELECT
  user_id,
  email,
  TO_HEX(SHA256(email)) AS hashed_email
FROM mydataset.users
TO_HEX converts the binary hash output into a human-readable hexadecimal string.
- Possible Collisions: Theoretically, different inputs can produce the same hash, but strong algorithms like SHA256 make collisions extremely rare.
- Salting: To further enhance security (especially with common strings like "john@gmail.com"), you can add a random "salt" to the input before hashing, ensuring the hash is unique even if someone guesses the original.
Hashing is a simple yet powerful technique to anonymize personal data, preserving key analytical capabilities without revealing sensitive information.
Tip #75: Joining Beyond Simple IDs
Joins don’t always revolve around matching a single id column.
In real-world data, you might match on multiple conditions — including ranges and inequalities.
Sometimes, you join on a simple key, but other times, your join condition might look something like this:
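A minimal sketch of that kind of condition (the table and column names are hypothetical):

SELECT
  u.user_id,
  t.transaction_id
FROM users u
JOIN transactions t
  ON u.user_id = t.user_id
  AND t.transaction_date BETWEEN u.signup_date AND DATE_ADD(u.signup_date, INTERVAL 30 DAY)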
This isn't just a direct equality join — it confirms that the transaction happened within 30 days of sign-up.
JOIN subscriptions s
ON u.user_id = s.user_id
AND u.event_date BETWEEN s.start_date AND s.end_date
You can join on date ranges, categories, locations, and other conditions.
If you are ever stumped on a join, consider that you may need multiple criteria!
Tip #76: Parsing Dates with PARSE_DATE: Fixing Messy Date Formats
If you’ve worked with raw data, you’ve probably come across dates in all kinds of non-standard formats.
SQL needs dates in a consistent format (like YYYY-MM-DD) to perform calculations, filter ranges, and group by time.
That's where PARSE_DATE (or its sibling functions like PARSE_DATETIME) comes in — it helps convert messy date strings into proper date objects.
Let’s say you have a table with a column “raw_date” containing dates in various formats:
- 12/31/2023 (MM/DD/YYYY)
- 31-12-2023 (DD-MM-YYYY)
- 2023.12.31 (YYYY.MM.DD)
You can use PARSE_DATE to standardize them into a proper DATE column.
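One way to sketch this (the table name is assumed) is to try each of the three formats with the SAFE. prefix and keep whichever parse succeeds:

SELECT
  raw_date,
  COALESCE(
    SAFE.PARSE_DATE('%m/%d/%Y', raw_date),
    SAFE.PARSE_DATE('%d-%m-%Y', raw_date),
    SAFE.PARSE_DATE('%Y.%m.%d', raw_date)
  ) AS final_parsed_date
FROM my_table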
I am also prefixing PARSE_DATE with SAFE so it returns null if the date is not in the expected format.
The result is a clean "final_parsed_date" column that you can use for comparisons, aggregations, or reporting.
If multiple formats exist in one column, you may want to identify them with CASE statements and parse
them conditionally.
CASE
  WHEN raw_date LIKE '%/%/%' THEN PARSE_DATE('%m/%d/%Y', raw_date)
  WHEN raw_date LIKE '%-%-%' THEN PARSE_DATE('%d-%m-%Y', raw_date)
  ELSE NULL
END AS standardized_date
Data from APIs, CSVs, or external systems often use inconsistent date formats.
PARSE_DATE is your go-to for turning chaotic date strings into clean, consistent date values in SQL.
Once parsed, your dates are ready for analysis, and your queries will run smoother and produce accurate results!
Tip #77: Splitting Bad Rows into an Error Table with SAFE_CAST
When processing messy data, the last thing you want is a few invalid rows crashing your entire query or pipeline.
Let's say you have a "transactions" table, and the "transaction_amount" column is supposed to be numeric.
If it fails (e.g., 💸 200), it returns NULL instead of throwing an error.
Valid rows can then be selected below where id_status <> ‘error’ or vice versa.
This approach handles errors gracefully and keeps your process running even with bad data.
And invalid rows aren’t lost — they’re tracked for future investigation.
The same approach with SAFE_CAST can handle invalid dates, strings, or other data types.
Using SAFE_CAST to split data into valid and invalid paths ensures you can cleanly handle messy data!
Tip #78: Filtering on Boolean
Here’s a small SQL tip that makes your queries cleaner.
Let’s say you have a users table with a Boolean column called “is_active.”
If you want to find all active users, you might think you need to write:
SELECT *
FROM users
WHERE is_active = TRUE

But you can shorten it to:

SELECT *
FROM users
WHERE is_active
Since "is_active" is already a Boolean, the WHERE clause evaluates it as either TRUE or FALSE.
If you want to filter for inactive users (FALSE values), you might be tempted to write:
SELECT *
FROM users
WHERE is_active = FALSE

Instead, you can write:

SELECT *
FROM users
WHERE NOT is_active
This works the same way, but it’s shorter and easier to read.
Tip #79: Using GREATEST and LEAST to Compare Values in a Row
If you want to find the largest value in a list of columns or expressions, you might think of using MAX.
But in SQL, MAX is a reserved keyword for aggregating values across rows.
Instead, use the GREATEST (or LEAST) function to find the largest (or smallest) value across columns in a single row.
These functions work similarly but operate on individual rows, not across rows (unlike the MAX or MIN aggregate functions).
Both functions compare all the values provided in the list and return the largest or smallest, respectively.
If any value is NULL, the result will also be NULL unless you handle it with a function like COALESCE
or IFNULL.
Let’s say you have a table “student_scores” with three exam scores for each student.
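A minimal sketch (the exam column names are assumed, and 0 / 100 are placeholder defaults for missing scores):

SELECT
  student_id,
  GREATEST(IFNULL(exam_1, 0), IFNULL(exam_2, 0), IFNULL(exam_3, 0)) AS highest_score,
  LEAST(COALESCE(exam_1, 100), COALESCE(exam_2, 100), COALESCE(exam_3, 100)) AS lowest_score
FROM student_scores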
By combining GREATEST and LEAST with IF(), IFNULL(), and COALESCE(), this query robustly handles NULL values while providing meaningful comparisons within a row.
Tip #80: When to Use FIRST_VALUE and LAST_VALUE
When working with data, you frequently need to identify the first or last record in a group (or partition).
There are two common approaches:
1. MIN/MAX Aggregations
2. FIRST_VALUE/LAST_VALUE Window Functions
MIN and MAX return the smallest and largest value, respectively, in a group, so they are useful when you only need to know the extreme value (like the earliest or latest date) for each group.
They only return a single value and don't give you any additional context from the row that contains that value.
FIRST_VALUE and LAST_VALUE return the value from the first and last row, respectively, in a partition based on a specified order.
These are useful when you want the complete context (i.e., additional columns) from the first or last row — not just the extreme value.
If your only goal is to get the earliest and latest purchase dates for each customer, you can use MIN
and MAX.
But suppose you want to know not only the dates, but also the products from the first and last purchases.
MIN and MAX can’t do this because they only return a single value from the aggregated column.
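A minimal sketch of the window-function approach (assuming a purchases table with customer_id, purchase_date, and product columns):

SELECT DISTINCT
  customer_id,
  FIRST_VALUE(product) OVER (
    PARTITION BY customer_id
    ORDER BY purchase_date
  ) AS first_product,
  LAST_VALUE(product) OVER (
    PARTITION BY customer_id
    ORDER BY purchase_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_product
FROM purchases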
Choose the function that best fits the level of detail your analysis requires.
Tip #81: Demystifying Window Function Keywords
Window functions are awesome.
Window functions let you calculate metrics over partitions of your data.
At first, the basics of window functions like OVER(PARTITION BY column ORDER BY column) seem a bit confusing, but relatively simple.
But things can quickly feel overwhelming when you see keywords like ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
PARTITION BY splits the data into groups, ORDER BY sorts the rows within each group, and the window frame (specified with the ROWS clause) determines which rows in that partition are used for the calculation.
If you want a specific number of rows it might look like this:
- ROWS BETWEEN 2 PRECEDING AND CURRENT ROW: Includes the current row and the two rows before it.
- ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING: Includes the current row and the next row.
The default behavior if you don’t specify a frame is often the entire partition.
But by defining a custom frame, you can get more granular control.
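For instance, a minimal sketch of a 3-day rolling sum (the table and columns are hypothetical):

SELECT
  sale_date,
  SUM(amount) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS rolling_3_day_sum
FROM daily_sales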
The keywords are just defining how much data to include.
Tip #82: How to Handle Missing Data
When your data contains missing or incomplete values, you have several strategies to ensure your queries return meaningful results.
1. Filtering:
For example:
SELECT *
FROM your_table
WHERE column IS NOT NULL
2. Default values: Substitute missing values with a default value using functions like COALESCE (or IFNULL in MySQL).
SELECT
  student_id,
  COALESCE(score, 0) AS score
FROM your_table
3. Imputation: Sometimes it's better to replace missing data with a calculated value (e.g., an average).
You can compute the average of the available data and then use it to fill in the gaps:
WITH avg_value AS (
  SELECT AVG(score) AS avg_score
  FROM your_table
  WHERE score IS NOT NULL
)
SELECT
  student_id,
  COALESCE(score, (SELECT avg_score FROM avg_value)) AS score
FROM your_table
Choose the strategy that best fits your data quality requirements and analysis needs.
Tip #83: Understanding the Difference Between NULL and Zero
As a beginner querying data, one thing that can trip you up is dealing with NULL vs. zeros.
They’re not the same thing, and mixing them up can lead to incorrect results in your analysis.
NULL means the data is missing or unknown: for example, a customer didn't enter their email address, so the email column is NULL. Zero, on the other hand, is an actual recorded value.
SQL also treats NULL differently from zero. For example, aggregate functions like SUM or AVG automatically ignore NULL values.
If you treat NULL as zero without thinking, you might misinterpret missing data as “nothing happened,”
when in reality, the data just isn’t available.
transaction_id | revenue
---------------|--------
1              | 100
2              | NULL
3              | 200
SELECT
  SUM(revenue) AS total_revenue
FROM sales

SELECT
  AVG(revenue) AS average_revenue
FROM sales
Only 100 and 200 are included, so the average is (100 + 200) / 2.
You don't need to explicitly filter NULLs — SQL ignores them in aggregate functions by default.
Sometimes, you may want to treat NULL as a specific value (like zero) to make the data easier to analyze.
SELECT
  SUM(COALESCE(revenue, 0)) AS total_revenue_with_defaults,
  AVG(COALESCE(revenue, 0)) AS average_revenue_with_defaults
FROM sales
Understanding the difference between NULL and zero is key to accurate analysis!
Tip #84: Understanding COUNT(*), COUNT(column), and COUNT(1)
As a data analyst, you'll frequently use the COUNT function to count rows or values.
But you might have noticed there are different ways to use COUNT, like COUNT(*), COUNT(column), and even COUNT(1).
What’s the difference between them, and when should you use each?
COUNT(*) counts every row, whether or not any column is NULL:

SELECT COUNT(*) AS total_orders
FROM orders

COUNT(column) counts only the rows where that column is not NULL:

SELECT COUNT(customer_id) AS valid_orders
FROM orders

COUNT(1) behaves the same way as COUNT(*):

SELECT COUNT(1) AS total_orders
FROM orders
Stick with COUNT(*) for total row counts and COUNT(column) when focusing on non-NULL values in a specific column.
And don’t worry — if you see COUNT(1), it’s simply another way of expressing the same idea.
Tip #85: Best Practices for Using INNER JOIN vs. LEFT JOIN and Filtering Data
When filtering data in a query that involves joins, you have two options for placing your filter conditions: in the JOIN's ON clause or in the WHERE clause.
Deciding where to put your filters depends on your join type and the result you want.
For an INNER JOIN, whether you put the filter in the JOIN's ON clause or in the WHERE clause, the results are generally the same because only matching rows are returned.
LEFT JOIN returns all rows from the left table, plus matching rows from the right table (with NULL for non-matches).
Filtering in the JOIN (ON) clause keeps all left-table rows. Conditions on the right table are applied during the join, so unmatched rows remain (with NULL in the filtered columns).
Filtering in the WHERE clause applies after the join. Any condition on right-table columns will filter out rows with NULL values, effectively turning a LEFT JOIN into an INNER JOIN for those conditions.
-- Using WHERE
SELECT
  c.customer_name,
  o.order_amount
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.order_amount > 50
Here’s an example for filtering in the LEFT JOIN:
SELECT
  c.customer_name,
  o.order_amount
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
  AND o.order_amount > 50
customer_name | order_amount
--------------|-------------
Alice         | NULL   -- (Alice's order_amount did not meet the condition)
Bob           | 75     -- (Bob's order qualifies)
Charlie       | NULL   -- (No order exists for Charlie)
If you put a right-table filter in the WHERE clause with a LEFT JOIN, rows with no match (where the
right table is NULL) will be filtered out, which might not be what you intend.
Example (Problematic):
SELECT
  c.customer_name,
  o.order_amount
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.order_amount > 50
When you place a filter on a right-table column in the WHERE clause of a LEFT JOIN, rows where that column is NULL get excluded.
As a result, the query behaves like an INNER JOIN for that condition.
This effectively means that only rows with a matching order (that meet the condition) are returned, just like with an INNER JOIN.
We want to be as explicit as possible when writing SQL, so you should change this to be an INNER JOIN with both filter conditions in the join.
- For INNER JOINs, filtering with the JOIN clause or the WHERE clause yields equivalent results.
- For LEFT JOINs, place filters for the right table in the JOIN clause to preserve all left-table rows.
- When you place a filter on a right-table column in the WHERE clause of a LEFT JOIN, that is essentially just an INNER JOIN, so that's what you should use instead to be more explicit.
Understanding where to put your filters helps ensure your query returns exactly the data you need.
It’s a detailed nuance but tips like this really help take your SQL skills beyond beginner!
Tip #86: Exploring and Validating Your Data with Simple Queries
When working with data, one of the most important steps is validating that your data is behaving as you expect.
It’s common to assume your SQL query is working perfectly, but a simple typo or misunderstanding
about your data can lead to errors in your analysis.
A great way to catch these issues is by writing small validation queries to ensure your data is consistent and your query logic is correct.
Here’s a consolidated list of different ways you can validate your data with simple SQL queries:
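For instance, a minimal sketch of a NULL check (the table and column names are assumed):

SELECT
  COUNT(*) AS total_rows,
  COUNT(email) AS valid_rows
FROM users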
A significant difference between “total_rows” and “valid_rows” indicates missing or invalid entries.
A discrepancy between "source_row_count" and "post_join_row_count" may signal issues with your join logic (dropped or duplicated rows).
There are many ways to write little queries that should return zero results if what you believe to be true about the data is actually true.
By running these small queries, you can immediately spot anomalies such as unexpected duplicates, missing values, or transformation errors.
Always validate your data and query logic with small checks along the way.
It might seem tedious, but these little queries can save you from major headaches down the line!
Tip #87: Why You Might Want to Use a View in BigQuery
When working with BigQuery, you might wonder why you'd use a view instead of just querying a table directly or creating a new table.
A view can be a powerful tool that can make your workflows more efficient, simplify complex queries, and provide real-time data without manual updates.
It doesn’t store data itself. Instead, it acts as a dynamic query that runs every time you access it.
Think of it as a window into your data — you define the query once, and it pulls the latest results every time you use it.
A view can be advantageous over a table because it provides real-time data updates.
Since views run the query each time they’re accessed, they always return the most up-to-date data.
A view that calculates total sales by region will always reflect the latest numbers without needing to recreate or update a table:
CREATE VIEW region_sales_view AS
SELECT
  region,
  SUM(sales_amount) AS total_sales
FROM sales
GROUP BY region
A lot of SQL environments (like BigQuery) also have options to "save query as view," making it easier for you.
Instead of rebuilding a summary table every hour, the view runs this query dynamically, saving time and
effort.
You can write a single query once, save it as a view, and let others use it without worrying about the underlying details.
Consider an example in which you create a query that calculates revenue by product and excludes test data.
Instead of copying and pasting this query every time, save it as a view:
CREATE VIEW clean_product_revenue AS
SELECT
  product_name,
  SUM(revenue) AS total_revenue
FROM sales
WHERE is_test = FALSE
GROUP BY product_name
The accounting team can now use this view in their queries:
SELECT *
FROM clean_product_revenue
WHERE product_name = 'Widget A'
They don’t need to know how the revenue is calculated — they just filter the results.
For example, the accounting team might want a report on monthly revenue.
SELECT *
FROM monthly_revenue
WHERE month >= '2024-01-01'
Instead of having multiple analysts write their own (potentially inconsistent) queries, a view
standardizes the approach.
Because views don't store data, they're cheaper than creating new tables.
- Performance: Since views re-run the query every time, they can be slower for large datasets compared to materialized tables.
- Complex Joins or Aggregations: If your view involves very heavy calculations, consider creating a materialized view or summary table instead for better performance.
If you want to simplify your workflow, save time, and provide others with easy access to complex
queries, consider using a view in BigQuery!
Tip #88: Using the REPEAT Function
The REPEAT function might seem like an odd one to use in SQL — why would you ever need to repeat a string multiple times?
While it's not something you'll use daily, there are real-world scenarios where this function can be surprisingly handy.
The REPEAT function takes a string (or bytes) and repeats it a specified number of times.
You’re tasked with highlighting priority levels for open support tickets.
The business team wants a simple visual indicator to represent the urgency of a ticket based on its priority level.
Instead of hardcoding this logic, you can use the REPEAT function to dynamically generate these
indicators based on the ticket priority.
You can use REPEAT to generate the visual indicators:
SELECT
  ticket_id,
  priority,
  description,
  REPEAT('*',
    CASE
      WHEN priority = 'Low' THEN 1
      WHEN priority = 'Medium' THEN 2
      WHEN priority = 'High' THEN 3
      ELSE 0
    END
  ) AS visual_indicator
FROM support_tickets
The "visual_indicator" column makes it easier for non-technical stakeholders to quickly see ticket urgency in reports or dashboards.
You can extend this logic to represent SLAs (service-level agreements), customer tiers, or any other classification that benefits from a repeated pattern.
Instead of manually coding symbols for each priority level, you can rely on the logic to dynamically
adjust as priority levels or business rules change.
- Use REPEAT to create a text-based progress bar (e.g., repeating # based on a percentage of completion).
- Use REPEAT to create fixed-length placeholders (e.g., masking credit card numbers).
- Quickly generate repetitive text patterns when testing queries or database performance.
Whether it's visual indicators, progress bars, or placeholder strings, REPEAT can make your SQL queries more functional and your reports more insightful.
Next time you need a little flair in your data, give it a try!
Tip #89: Using the SPLIT Function for String Manipulation
The SPLIT function is a powerful tool for breaking apart strings into arrays based on a delimiter.
Once you’ve split a string, you can access specific parts of the result using array indexing.
A common pattern is using SPLIT and grabbing the first element of the resulting array, which can be incredibly useful for tasks like cleaning or transforming messy data.
You can access specific parts of the resulting array using square brackets ([index]), where indexing starts at 0.
Say you’re working with a “users” table that stores email addresses, but you only need the username (the part before the @).
SPLIT(email, '@') breaks the email into an array: ['alice', 'example.com'].
1. SPLIT(column, 'delimiter')[OFFSET(0)]
2. SPLIT(column, 'delimiter')[OFFSET(1)]
If you wanted the 3rd element you would put “2”, etc.
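Putting it together for the email example, a minimal sketch (using the “users” table and “email” column described above):

SELECT
  email,
  SPLIT(email, '@')[OFFSET(0)] AS username  -- first element: everything before the @
FROM users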
Pro Tip: Pair SPLIT with other string functions like TRIM or UPPER for even more control when cleaning or transforming data.
Mastering this will make your string manipulation tasks much simpler!
Tip #90: Using the BAG_OF_WORDS Function for Text Analysis
When analyzing text data in BigQuery, you might think you need Python or specialized tools for tasks like identifying commonly used words in customer service messages or community interactions.
However, BigQuery’s BAG_OF_WORDS function can help you perform basic text analysis right inside SQL.
The BAG_OF_WORDS function takes a tokenized document (an array of words or terms) and returns an array of terms and their frequencies.
It essentially gives you a bag-of-words model (a foundational concept in text analysis) by showing how
often each word appears in a piece of text.
Imagine you work with a “customer_service” table that contains messages from your support team.
You want to find out the most common words customers are using to describe issues.
Step 1: Tokenize the Messages
You can use BigQuery’s SPLIT function to break the messages into individual words (tokens).
message_id | message_content
-----------|---------------------------------------------------------
1          | The delivery was late and the product is damaged.
2          | I need help with a refund for a damaged product.
3          | The product was not delivered on time.
SELECT
  message_id,
  SPLIT(LOWER(message_content), ' ') AS tokenized_message
FROM customer_service
message_id | tokenized_message
-----------|---------------------------------------------------------------
1          | ["the", "delivery", "was", "late", "and", "the", "product", "is", "damaged"]
2          | ["i", "need", "help", "with", "a", "refund", "for", "a", "damaged", "product"]
3          | ["the", "product", "was", "not", "delivered", "on", "time"]
Step 2: Use BAG_OF_WORDS to Count Words
Now, use the BAG_OF_WORDS function to calculate the frequency of each word in each message.
SELECT
  message_id,
  BAG_OF_WORDS(SPLIT(LOWER(message_content), ' ')) AS word_frequencies
FROM customer_service
message_id | word_frequencies
-----------|-----------------------------------------------------------------------
1          | [{"term": "the", "count": 2}, {"term": "delivery", "count": 1}, ...]
2          | [{"term": "i", "count": 1}, {"term": "need", "count": 1}, ...]
3          | [{"term": "the", "count": 1}, {"term": "product", "count": 1}, ...]
To analyze the most common words across all messages, flatten the array of terms and aggregate the counts:
SELECT
  word.term,
  SUM(word.count) AS total_count
FROM customer_service,
  UNNEST(BAG_OF_WORDS(SPLIT(LOWER(message_content), ' '))) AS word
GROUP BY word.term
ORDER BY total_count DESC
term     | total_count
---------|-------------
the      | 4
product  | 3
damaged  | 2
delivery | 2
was      | 2
Use BAG_OF_WORDS next time you want to dig into the words your customers or community are using!
Tip #91: You’re Never Going to Get the Data Exactly How You Want It—And That’s Okay
As a data analyst, one of the hardest lessons to learn isn’t about SQL syntax or complex joins — it’s about managing expectations.
Here’s the reality: in the world of modern data, you’re almost never going to get the exact data you want in the format you need.
Companies use dozens of software tools, such as CRMs, e-commerce platforms, marketing automation, analytics tools, and more.
Each of these systems stores data differently, with its own quirks, limitations, and gaps.
While APIs give us programmatic access to data, they rarely provide all the fields you want or the granularity you need.
For example, an API might provide order data but leave out refund information, or it might give you aggregated metrics but no raw data to drill into.
Many systems don’t store data indefinitely or don’t collect the data you need in the first place.
Even if you can get the data, you’ll probably spend more time cleaning it up (deduping, transforming, validating) than actually analyzing it.
If you’re banging your head against the wall trying to figure out why it’s so hard to get the perfect dataset, know this: everyone struggles with this.
It doesn’t mean you’re bad at your job or that you don’t know enough.
It’s okay to aim for 95%.
In most cases, you don’t need a perfect dataset to make meaningful insights.
If you can answer the big questions with the data you have, that’s usually good enough.
When you can’t get the exact data, think about workarounds:
It’s okay to say, “We don’t have data on X, but here’s what we can infer from Y.”
---------------
Imagine you’re tasked with analyzing customer lifetime value (LTV) for an e-commerce business, but
the API only provides total order revenue and a list of customer IDs.
While this might seem like a blocker, here’s how you can adapt:
- Focus on What You Have: Use total revenue as a proxy for LTV, knowing it’s not perfect.
- Estimate Missing Pieces: If you know the average refund rate or shipping cost, apply it as a general adjustment.
- Communicate Assumptions: Share your calculations and highlight the limitations so stakeholders understand what’s missing.
As long as you’re creative, flexible, and focused on delivering value, you’re doing exactly what a great data analyst should!
Tip #92: Why SELECT Columns Must Appear in the GROUP BY Clause (Unless Aggregated)
When working with SQL, you might run into an error saying something like:
“Column "X" must appear in the GROUP BY clause or be used in an aggregate function.”
This can be confusing at first, but it’s a fundamental rule of how SQL handles grouped data.
When you use GROUP BY in a query, every column in your SELECT clause must either appear in the GROUP BY clause or be wrapped in an aggregate function (like SUM or COUNT).
When you use GROUP BY, SQL groups rows based on the specified columns.
This is because SQL can’t return individual values for columns that aren’t part of the group — it doesn’t know which value to pick!
SELECT
  product_name,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_name
Here, “product_name” is in the GROUP BY clause because it defines how rows are grouped, and SUM(revenue) is aggregated.
If you try to include a column in the SELECT clause that isn’t in the GROUP BY or aggregated, SQL
won’t know how to handle it:
SELECT
  product_name,
  region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_name
Error: column "region" must appear in the GROUP BY clause or be used in an aggregate function.
This happens because SQL groups the rows by “product_name,” but “region” isn’t part of the grouping or aggregated.
If you want “region” in the results, you need to group by both “product_name” and “region”:
SELECT
  product_name,
  region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_name, region
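Alternatively, if you don’t need every region broken out, you can satisfy the rule by aggregating the column instead of grouping by it. A minimal sketch using ANY_VALUE (the choice of aggregate function here is just for illustration):

SELECT
  product_name,
  ANY_VALUE(region) AS sample_region,  -- picks one region per product group
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY product_name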
When in doubt, focus on the relationship between the SELECT clause and GROUP BY clause: they must align.
Tip #93: WHERE EXISTS
Here’s a SQL tip that most experienced analysts know, but many beginners don’t.
In your data analyst career, you might see queries that include WHERE EXISTS. It behaves a lot like WHERE IN: both check whether there’s at least one matching row in the subquery.
While they can sometimes be interchangeable, they also have subtle differences.
Understanding those differences can help you write more efficient queries.
EXISTS stops evaluating as soon as it finds the first match, making it efficient for certain use cases.
SELECT customer_id
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
)
This query returns all customers who have at least one order.
IN, on the other hand, retrieves all possible matches before filtering, which can make it less efficient with large datasets.
SELECT customer_id
FROM customers
WHERE customer_id IN (
  SELECT customer_id
  FROM orders
)
This query also returns all customers who have at least one order.
If you do spot EXISTS in someone else’s code, just remember it’s basically another way of saying,
“Does at least one row exist in this subquery?”
You’ll often get the same result, and you don’t need to change your approach unless you’re dealing with performance issues on very big tables.
Tip #94: Why You Should Always Comment Your Code
Writing SQL queries can feel like solving a puzzle.
When you’re deep into writing complex joins, subqueries, and window functions, you can see how the pieces fit together, and it all makes sense in that moment.
But what happens when you (or someone else) revisit that query six months later?
You won’t always remember why you wrote a query the way you did, especially when revisiting it weeks or months later.
Comments act as a time capsule, helping you recall your thought process.
If you work on a team, your coworkers will need to understand your queries; comments provide clarity and keep them from wasting time reverse-engineering your logic.
When something breaks or needs updating, comments make it easier to pinpoint what each part of your query is doing, saving valuable time.
- At the top of your query, briefly explain what it does.

-- This query calculates total revenue by product category and region for the current quarter.

- Use comments to break your query into logical sections, especially for complex operations.

-- Exclude test accounts (e.g., emails ending in '@test.com') and inactive users
WHERE email NOT LIKE '%@test.com'
  AND is_active = TRUE
- For quick clarifications, add inline comments next to specific lines.
SELECT
  product_id,
  SUM(revenue) AS total_revenue,  -- Aggregate revenue for each product
  COUNT(DISTINCT customer_id) AS unique_customers  -- Count unique buyers
FROM sales
-- This query calculates the monthly revenue per region and identifies
-- the top-performing region for each month.
WITH regional_revenue AS (
  -- Step 1: Aggregate monthly revenue by region
  SELECT
    DATE_TRUNC(order_date, MONTH) AS order_month,
    region,
    SUM(order_total) AS total_revenue
  FROM orders
  WHERE order_status = 'completed'  -- Exclude canceled or pending orders
  GROUP BY order_month, region
),

ranked_regions AS (
  -- Step 2: Rank regions by revenue for each month
  SELECT
    order_month,
    region,
    total_revenue,
    RANK() OVER (PARTITION BY order_month ORDER BY total_revenue DESC) AS revenue_rank
  FROM regional_revenue
)

-- Step 3: Filter to the top-performing region for each month
SELECT
  order_month,
  region AS top_region,
  total_revenue AS top_revenue
FROM ranked_regions
WHERE revenue_rank = 1
ORDER BY order_month
With comments, it’s clear what each step is doing, even for someone new to the query.
Commenting your SQL code might feel like an extra step, but it saves time, frustration, and confusion down the line.
Whether it’s for debugging, team collaboration, or just making your queries easier to revisit, clear comments are a small effort with a big payoff.
Tip #95: When to Use REGEXP_CONTAINS Instead of LIKE
Here’s one SQL lesson you won’t know you need until you are faced with it:
The LIKE operator is great for simple pattern matching, but it has its limitations.
When you need more complex or precise text filtering, REGEXP_CONTAINS (or REGEXP_LIKE in some databases) steps in as the superior tool.
- Advanced Pattern Matching: LIKE is limited to basic wildcards (% and _) and can only match one pattern at a time unless you use multiple OR conditions. REGEXP_CONTAINS can match multiple patterns and supports full regular expressions, allowing you to increase complexity.
- Case Sensitivity Control: LIKE behavior with case sensitivity depends on the database’s collation settings, while REGEXP_CONTAINS can explicitly match case-sensitive patterns.
We want to find rows where the column contains either “error” or “failed”:
-- Using REGEXP_CONTAINS
SELECT *
FROM logs
WHERE REGEXP_CONTAINS(message, r'error|failed');
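For comparison, here’s a rough sketch of the same filter written with LIKE, which needs one condition per pattern:

-- Using LIKE (requires a separate OR condition for each pattern)
SELECT *
FROM logs
WHERE message LIKE '%error%'
   OR message LIKE '%failed%';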
Also, LIKE cannot handle flexible character sets or ranges, but REGEXP_CONTAINS allows for more advanced matching.
Now let’s match rows where a column contains a 5-digit ZIP code:
-- Using REGEXP_CONTAINS
SELECT *
FROM addresses
WHERE REGEXP_CONTAINS(zip_code, r'^\d{5}$');
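For contrast, a sketch of the closest LIKE can get: five single-character wildcards, which also match letters and symbols, not just digits:

-- Using LIKE: '_' matches ANY single character, so 'ABCDE' would also pass
SELECT *
FROM addresses
WHERE zip_code LIKE '_____';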
By knowing the strengths and limitations of both, you can choose the right tool for each scenario, and save yourself a lot of frustration with complex text queries!
Tip #96: The SELECT Clause is Independent of the Query’s Processing
When writing SQL, remember that the SELECT clause only controls what you see in your results.
The logic in your WHERE, HAVING, ORDER BY, and other clauses runs regardless of whether their expressions appear in the SELECT clause.
WHERE, HAVING, ORDER BY, etc., perform filtering, grouping, and ordering of your data.
SELECT customer_id, order_date
FROM orders
WHERE order_amount > (
  SELECT AVG(order_amount)
  FROM orders
)
The WHERE clause is using a subquery to filter orders by comparing each order’s amount to the overall average order amount.
Even though the computed average isn’t shown in the SELECT results, it still influences which rows are returned.
The bottom line is you don’t have to include every part of your query’s logic in the SELECT clause.
This allows you to focus on displaying only the data you need, while the underlying logic runs independently in the background.
Tip #97: Don’t Use CASE Statements as Join Conditions
Here’s a mistake I see beginners make all the time: using a CASE statement as a join condition.
While it might work, it’s not good practice and can lead to performance issues, unnecessary complexity, and even query failures at scale.
A beginner might think to themselves, “I need to match on different conditions depending on the data, so I’ll just write a CASE statement in my JOIN clause.”
SELECT a.*, b.*
FROM table_a a
JOIN table_b b
  ON CASE
       WHEN a.column_x = 'A' THEN b.column_y = a.column_z
       WHEN a.column_x = 'B' THEN b.column_y = a.column_w
     END
At first glance, it seems like a clever way to account for complex logic in your joins.
But while SQL will allow this, it’s not the right approach.
Here’s why:
- Performance Issues: A CASE statement in a JOIN forces the database to evaluate conditions row by row, making it much harder to optimize. This can significantly slow down your query, especially on large datasets.
- Hard to Debug: It’s difficult to understand, maintain, and debug queries with conditional joins, especially as your logic grows more complex.
- Scalability Problems: What works on a small dataset might blow up on a larger one.
Complex join conditions can overwhelm the query planner, leading to compute errors or even timeouts.
The better approach is to move the conditional logic out of the JOIN; this simplifies the join and makes your query easier to read and maintain.
Here’s how to rewrite the query from above - pre-compute join keys in a CTE:
WITH precomputed AS (
  SELECT
    *,
    CASE
      WHEN column_x = 'A' THEN column_z
      WHEN column_x = 'B' THEN column_w
    END AS join_key
  FROM table_a
)

SELECT
  p.*,
  b.*
FROM precomputed p
JOIN table_b b
  ON p.join_key = b.column_y;
While putting a CASE in a JOIN condition might seem like a quick fix, it’s not good practice.
Pre-computing the join key instead will save you time, headaches, and compute resources in the long run.
Tip #98: HAVING MAX and HAVING MIN
Here’s a weird SQL tip you probably didn’t know about.
Most SQL users think of MAX() and MIN() as go-to functions for returning the largest or smallest value in a group.
But did you know you can use HAVING MAX and HAVING MIN to restrict aggregation to only the rows with the maximum or minimum value?
It’s like magic for narrowing down data to key rows without extra subqueries.
Let’s say you’re analyzing climate data and want to identify the year when the highest temperature occurred for each season.
Instead of writing a long query with subqueries, you can use HAVING MAX to directly filter for rows with
the maximum temperature:
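Here’s a sketch of what that can look like, mirroring the MIN version shown later in this tip (it assumes a “Temperatures” table with season, year, and temp columns):

SELECT
  season,
  ANY_VALUE(year HAVING MAX temp) AS year_with_max_temp,  -- year from a row with the highest temp
  MAX(temp) AS max_temperature
FROM Temperatures
GROUP BY season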
HAVING MAX temp filters the rows to only include those with the highest temperature within each season.
ANY_VALUE retrieves the year from one of the rows with the maximum temperature.
Instead of writing a separate subquery to find the maximum value and then joining it back, you can use
HAVING MAX to simplify your query.
This works well for other scenarios like tracking top performers, highest sales, or peak events.
SELECT
  season,
  ANY_VALUE(year HAVING MIN temp) AS year_with_min_temp,
  MIN(temp) AS min_temperature
FROM Temperatures
GROUP BY season
HAVING MAX and HAVING MIN are great tools for isolating rows tied to extreme values within groups, without requiring additional subqueries.
Whether you’re analyzing weather patterns, sales, or performance metrics, this trick can save you time and simplify your SQL queries.
Tip #99: Data Normalization Basics
As a data analyst, you don’t have to be a data normalization expert.
It’s a fancy term that can seem intimidating, but the concept is actually pretty simple.
In relational databases, data is typically organized into multiple related tables rather than one big table.
Normalization has two main goals:
- Reduce redundancy
- Improve data integrity (data is accurate and consistent)
The idea is to split data into related tables and use keys (like primary keys and foreign keys) to connect them.
Imagine a single orders table where every row repeats the customer’s name and email. This is inefficient (you’re wasting storage space) and error-prone (if John updates his email, you need to update it in multiple places, which could lead to mistakes).
The normalized design splits this into two tables: 1. a Customers table and 2. an Orders table.
In this normalized design, customer information is stored only once, and the orders table references the
customer via “customer_id.”
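As a rough sketch, the two tables could look like this (the column names are illustrative, not from a specific schema):

-- Customers table: each customer’s details stored exactly once
CREATE TABLE customers (
  customer_id INT64,
  name STRING,
  email STRING
);

-- Orders table: references the customer through customer_id (a foreign key)
CREATE TABLE orders (
  order_id INT64,
  customer_id INT64,  -- points back to customers.customer_id
  order_date DATE,
  order_total NUMERIC
);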
Tip #100: Using || for String Concatenation
Let’s finish with a fun one…
When you first see a double pipe symbol (“||”) in SQL, it can feel like you’re reading a foreign language.
But don’t worry, you’re not missing a secret underground SQL club.
It’s just shorthand for concatenate. It’s a clean, efficient alternative to the CONCAT() function.
Let’s say you have a table “customers,” and you want to combine their first and last names into a single column:
SELECT
  first_name || ' ' || last_name AS full_name
FROM customers
|| is faster to type and easier to read compared to CONCAT(), and it handles separators like spaces, hyphens, or custom strings directly in the query.
BigQuery allows you to use || with arrays for even more functionality.
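For instance, a quick sketch of array concatenation (the result here would be a single array, [1, 2, 3, 4]):

SELECT [1, 2] || [3, 4] AS combined_array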
If any value in the || operation is NULL, the result will also be NULL.
To avoid this, use COALESCE() to replace nulls with a default value:
SELECT
  COALESCE(first_name, 'Unknown') || ' ' || COALESCE(last_name, 'Unknown') AS full_name
FROM customers
Now if you ever see a || in someone else’s query, you’ll know what it is.
It’s not hieroglyphics. Just a simple alternative to something you already know and understand!
Conclusion
You did it!
I’ve been doing this for years, but every time I dive into SQL documentation, I find features I’ve never
used or things that make me go,“Wait, what is that?”
Instead of aiming to know all of SQL, focus on getting comfortable with learningwhat you need when
you need it.
Even experts don’t know every corner of SQL or every function out there.
But every query you write and every error you fix will build your skill set.
You never truly “arrive” in SQL, and that’s the beauty of it.
Get comfortable not knowing.
Keep practicing, stay curious, and trust that each new discovery makes you a better data analyst.
Please share a quick post on LinkedIn or send the link to someone who might benefit from it.
Mention one thing you learned or enjoyed, tag me in it, and include a link so others can find it too.
Your post may be the nudge someone else needs along their SQL journey.
Thank you for being part of this—and for helping others find the resources they need.
🥳 Lastly, I want to congratulate you for taking the time to improve your SQL skills!
If you’d like to keep learning, I regularly share SQL tips and other insights on LinkedIn every week.
I’m expanding to other platforms as well (YouTube, TikTok, Instagram, X, Threads, etc.), so you can follow me there too.
I’m also working on more in‑depth SQL courses and bootcamps that take you beyond just tips.
That’s how you can really grow your skills, so keep an eye out for those coming soon…
Until then, stay curious, keep practicing, and I’ll see you around!
Kyle