Skip to content

Commit 77d1eca

Browse files
Anshuman SinghAnshuman Singh
authored andcommitted
Notes for 4th class
1 parent 3ccfe9c commit 77d1eca

File tree

1 file changed

+353
-0
lines changed

1 file changed

+353
-0
lines changed

04-joins-2.md

Lines changed: 353 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,353 @@
1+
2+
## Agenda
3+
4+
- Self Join
5+
- More problems on Joins
6+
- Inner vs Outer joins
7+
- WHERE vs ON
8+
- Union and Union All
9+
10+
11+
## Self Join
12+
13+
Let's say at Scaler, for every student we assign a Buddy. For this we have a `students` table, which looks as follows:
14+
15+
`id | name | buddy_id`
16+
17+
This `buddy_id` will be an id of what?
18+
19+
Correct. Now, let's say we have to print for every student, their name and their buddy's name. How will we do that? Here 2 rows of which tables would we want to stitch together to get this data?
20+
21+
Correct, an SQL query for the same shall look like:
22+
23+
```sql
24+
SELECT s1.name, s2.name
25+
FROM students s1
26+
JOIN students s2
27+
ON s1.buddy_id = s2.id;
28+
```
29+
30+
This is an example of SELF join. A self join is a join where we are joining a table with itself. In the above query, we are joining the `students` table with itself. In a self joining, aliasing tables is very important. If we don't alias the tables, then SQL will not know which row of the table to match with which row of the same table (because both of them have same names as they are the same table only).
31+
32+
### SQL query as pseudocode
33+
34+
As we have been doing since the CRUD class, let's also see how Joins can be represented in terms of pseudocode.
35+
36+
Let's take this query:
37+
38+
```sql
39+
SELECT s1.name, s2.name
40+
FROM students s1
41+
JOIN students s2
42+
ON s1.buddy_id = s2.id;
43+
```
44+
45+
In pseudocode, it shall look like:
46+
47+
```python3
48+
ans = []
49+
50+
for row1 in students:
51+
for row2 in students:
52+
if row1.buddy_id == row2.id:
53+
ans.add(row1 + row2)
54+
55+
for row in ans:
56+
print(row.name, row.name)
57+
```
58+
59+
## More problems on JOIN
60+
61+
### Joining multiple tables
62+
63+
Till now, we had only joined 2 tables. But what if we want to join more than 2 tables? Let's say we want to print the name of every film, along with the name of the language and the name of the original language. How can we do that? If you have to add 3 numbers, how do you do that?
64+
65+
To get the name of the language, we would first want to combine film and language table over the `language_id` column. Then, we would want to combine the result of that with the language table again over the `original_language_id` column. This is how we can do that:
66+
67+
```sql
68+
SELECT f.title, l1.name, l2.name
69+
FROM film f
70+
JOIN language l1
71+
ON f.language_id = l1.language_id
72+
JOIN language l2
73+
ON f.original_language_id = l2.language_id;
74+
```
75+
76+
Let's see how this might work in terms of pseudocode:
77+
78+
```python3
79+
ans = []
80+
81+
for row1 in film:
82+
for row2 in language:
83+
if row1.language_id == row2.id:
84+
ans.add(row1 + row2)
85+
86+
for row in ans:
87+
for row3 in language:
88+
if row.language_id == row3.language_id:
89+
ans.add(row + row3)
90+
91+
for row in ans:
92+
print(row.name, row.language_name, row.original_language_name)
93+
```
94+
95+
### Joins with multiple conditions in ON clause
96+
97+
Till now, whenever we did a join, we joined based on only 1 condition. Like in where clause we can combine multiple conditions, in Joins as well, we can have multiple conditions.
98+
99+
Let's see an example. For every film, name all the films that were released in the range of 2 years before or after that film and there rental rate was more than the rate of the movie.
100+
101+
```sql
102+
SELECT f1.name, f2.name
103+
FROM film f1
104+
JOIN film f2
105+
ON (f2.year BETWEEN f1.year - 2 AND f1.year + 2) AND f2.rental > f1.rental;
106+
```
107+
108+
> Note:
109+
> 1. Join does not need to happen on equality of columns always.
110+
> 2. Join can also have multiple conditions.
111+
112+
A Compound Join is one where Join has multiple conditions on different columns.
113+
114+
115+
## Inner vs Outer Joins
116+
117+
While we have pretty much discussed everything that is mostly important to know about joins, there are a few nitty gritties that we should know about.
118+
119+
Let's take the join query we had written a bit earlier:
120+
121+
```sql
122+
SELECT s1.name, s2.name
123+
FROM students s1
124+
JOIN students s2
125+
ON s1.buddy_id = s2.id;
126+
```
127+
128+
Let's say there is a student that does not have a buddy, i.e., their `buddy_id` is null. What will happen in this case? Will the student be printed?
129+
130+
If you remember what we discussed about CRUD , is NULL equal to anything? Nope. Thus, the row will never match with anything and not get printed. The join that we discussed earlier is also called inner join. You could have also written that as:
131+
132+
```sql
133+
SELECT s1.name, s2.name
134+
FROM students s1
135+
INNER JOIN students s2
136+
ON s1.buddy_id = s2.id
137+
```
138+
139+
The keyword INNER is optional. By default a join is INNER join.
140+
141+
As you see, an INNER JOIN doesn't include a row that didn't match the condition for any combination.
142+
143+
Opposite of INNER JOIN is OUTER JOIN. Outer Join will include all rows, even if they don't match the condition. There are 3 types of outer joins:
144+
- Left Join
145+
- Right Join
146+
- Full Join
147+
148+
As the names convey, left join will include all rows from the left table, right join will include all rows from the right table and full join will include all rows from both the tables.
149+
150+
Let's take an example to understand these well:
151+
152+
Assume we have 2 tables: students and batches with following data:
153+
154+
155+
`batches`
156+
157+
| batch_id | batch_name |
158+
|----------|------------|
159+
| 1 | Batch A |
160+
| 2 | Batch B |
161+
| 3 | Batch C |
162+
163+
`students`
164+
165+
| student_id | name | batch_id |
166+
|------------|------------|----------|
167+
| 1 | John | 1 |
168+
| 2 | Jane | 1 |
169+
| 3 | Jim | null |
170+
| 4 | Ram | null |
171+
| 5 | Sita | 2 |
172+
173+
Now let's write queries to do each of these joins:
174+
175+
```sql
176+
SELECT s.name, b.batch_name
177+
FROM students s
178+
LEFT JOIN batches b
179+
ON s.batch_id = b.batch_id;
180+
```
181+
182+
```sql
183+
SELECT s.name, b.batch_name
184+
FROM students s
185+
RIGHT JOIN batches b
186+
ON s.batch_id = b.batch_id;
187+
```
188+
189+
```sql
190+
SELECT s.name, b.batch_name
191+
FROM students s
192+
FULL OUTER JOIN batches b
193+
ON s.batch_id = b.batch_id;
194+
```
195+
196+
197+
Now let's use different types of joins and tell me which row do you think will not be a part of the join.
198+
199+
Output of LEFT JOIN (Go row by row in left table - which is students and then look for match/matches):
200+
201+
```
202+
John batchA
203+
Jane batchA
204+
Jim NULL
205+
Ram NULL
206+
Sita batchB
207+
```
208+
209+
Output of RIGHT JOIN (Go row by row in right table - which is batches table and then look for match/matches):
210+
batchA has 2 matches - John and Jane
211+
batchB has 1 match - Sita
212+
batchC has 0 match - NULL
213+
214+
```
215+
John batchA
216+
Jane batchA
217+
Sita batchB
218+
NULL batchC
219+
```
220+
221+
Output of FULL JOIN (Do the left join. Then look at every row of right table which is `batches` and figure out rows which were not printed yet - print them with null match)
222+
223+
```
224+
John batchA
225+
Jane batchA
226+
Jim NULL
227+
Ram NULL
228+
Sita batchB
229+
NULL batchC
230+
```
231+
232+
## Join with WHERE v/s ON
233+
234+
Let's take an example to discuss this. If we consider a simple query:
235+
```sql
236+
SELECT *
237+
FROM A
238+
JOIN B
239+
ON A.id = B.id;
240+
```
241+
In pseudocode, it will look like:
242+
243+
```python3
244+
ans = []
245+
246+
for row1 in A:
247+
for row2 in B:
248+
if (ON condition matches):
249+
ans.add(row1 + row2)
250+
251+
for row in ans:
252+
print(row.id, row.id)
253+
```
254+
Here, the size of intermediary table (`ans`) will be less than `n*m` because some rows are filtered.
255+
256+
We can also write the above query in this way:
257+
258+
```sql
259+
SELECT *
260+
FROM A, B
261+
WHERE A.id = B.id;
262+
```
263+
The above query is nothing but a CROSS JOIN behind the scenes which can be written as:
264+
265+
```sql
266+
SELECT *
267+
FROM A
268+
CROSS JOIN B
269+
WHERE A.id = B.id;
270+
```
271+
Here, the intermediary table `A CROSS JOIN B` is formed before going to WHERE condition.
272+
273+
In pseudocode, it will look like:
274+
275+
```python3
276+
ans = []
277+
278+
for row1 in A:
279+
for row2 in B:
280+
ans.add(row1 + row2)
281+
282+
for row in ans:
283+
if (WHERE condition matches):
284+
print(row.id, row.id)
285+
```
286+
287+
The size of `ans` is always `n*m` because table has cross join of A and B. The filtering (WHERE condition) happens after the table is formed.
288+
289+
From this example, we can see that:
290+
1. The size of the intermediary table (`ans`) is always greater or equal when using WHERE compared to using the ON condition. Therefore, joining with ON uses less internal space.
291+
2. The number of iterations on `ans` is higher when using WHERE compared to using ON. Therefore, joining with ON is more time efficient.
292+
293+
In conclusion,
294+
1. The ON condition is applied during the creation of the intermediary table, resulting in lower memory usage and better performance.
295+
2. The WHERE condition is applied during the final printing stage, requiring additional memory and resulting in slower performance.
296+
3. Unless you want to create all possible pairs, avoid using CROSS JOINS.
297+
298+
## UNION and UNION ALL
299+
300+
Sometimes, we want to print the combination of results of multiple queries. Let's take an example of the following tables:
301+
302+
`students`
303+
| id | name |
304+
|----|------|
305+
306+
`employees`
307+
| id | name |
308+
|----|------|
309+
310+
`investors`
311+
| id | name |
312+
|----|------|
313+
314+
315+
You are asked to print the names of everyone associated with Scaler. So, in the result we will have one column with all the names.
316+
317+
We can't have 3 SELECT name queries because it will not produce this singular column. We basically need SUM of such 3 queries. Join is used to stitch or combine rows, here we need to add the rows of one query after the other to create final result.
318+
319+
UNION allows you to combine the output of multiple queries one after the other.
320+
321+
```sql
322+
SELECT name FROM students
323+
UNION
324+
SELECT name FROM employees
325+
UNION
326+
SELECT name FROM investors;
327+
```
328+
Now, as the output is added one after the other, there is a constraint: Each of these individual queries should output the same number of columns.
329+
330+
Note that, you can't use ORDER BY for the combined result because each of these queries are executed independently.
331+
332+
UNION outputs distinct values of the combined result. **It stores the output of individual queries in a set and then outputs those values in final result. Hence, we get distinct values. But if we want to keep all the values, we can use UNION ALL. It stores the output of individual queries in a list and gives the output, so we get all the duplicate values.**
333+
334+
If you want to perform any operation on the combined result, you put them in braces and give it an alias.
335+
For example,
336+
337+
```sql
338+
SELECT
339+
first_name, last_name
340+
FROM
341+
(SELECT first_name, last_name
342+
FROM customer
343+
344+
UNION
345+
346+
SELECT first_name, last_name
347+
FROM actor) AS some_alias
348+
349+
350+
ORDER BY first_name, last_name
351+
LIMIT 10
352+
```
353+

0 commit comments

Comments
 (0)