SQL Notes
SQL Notes
SQL Notes
unique keys?
The one thing that primary, unique, and foreign keys all have in common is the fact that each
type of key can consist of more than just one column from a given table. In other words,
foreign, primary, and unique keys are not restricted to having just one column from a given table
each type of key can cover multiple columns. So, that is one feature that all the different types
of keys share they can each be comprised of more than just one column, which is something
that many people in software are not aware of.
Of course, the database programmer is the one who will actually define which columns are covered
by a foreign, primary, or unique key. That is one similarity all those keys share, but there are also
some major differences that exist between primary, unique, and foreign keys. We will go over
those differences in this article. But first, we want to give a thorough explanation of why foreign
keys are necessary in some situations.
Can a unique key have NULL values? Can a primary key have
NULL values?
Unique key columns are allowed to hold NULL values. The values in a primary key column,
however, can never be NULL.
The difference between the having and where clause is best illustrated by an example.
Suppose we have a table called emp_bonus as shown below. Note that the table has multiple
entries for employees A and B.
emp_bonus
Employee Bonus
A 1000
B 2000
A 500
C 700
B 1250
If we want to calculate the total bonus that each employee received, then we would write a
SQL statement like this:
Employee Sum(Bonus)
A 1500
B 3250
C 700
Now, suppose we wanted to find the employees who received more than $1,000 in bonuses
for the year of 2007. You might think that we could write a query like this:
BAD SQL:
select employee, sum(bonus) from emp_bonus
group by employee where sum(bonus) > 1000;
The WHERE clause does not work with aggregates like SUM
The SQL above will not work, because the where clause doesnt work with aggregates like
sum, avg, max, etc.. Instead, what we will need to use is the having clause. The having clause
was added to sql just so we could compare aggregates to other values just how the where
clause can be used with non-aggregates. Now, the correct sql will look like this:
GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;
Difference between having and where clause
So we can see that the difference between the having and where clause in sql is that the where
clause can not be used with aggregates, but the having clause can. One way to think of it is
that the having clause is an additional filter to the where clause.
In the table below, how would you retrieve the unique values for the employee_location
without using the DISTINCT keyword?
employee
employee_name employee_location
Sunil India
Alex Russia
Albert Canada
Alex Russia
We can actually accomplish this with the GROUP BY keyword. Heres what the SQL would
look like:
employee_location
New York
India
Russia
Canada
So, you can see that the duplicate values for "Russia" and "Canada" are not returned in the
results.
This is a valid alternative to using the DISTINCT keyword. If you need a refresher on the
GROUP BY clause, then check out this question: Group By and Having. This question would
probably be asked just to see how good you are with coming up with alternative options for
SQL queries. Although, it probably doesnt prove much about your SQL skills.
Joins are used to combine the data from two tables, with the result being a new, temporary table.
The temporary table is created based on column(s) that the two tables share, which represent
meaningful column(s) of comparison. The goal is to extract meaningful data from the resulting
temporary table. Joins are performed based on something called a predicate, which specifies the
condition to use in order to perform a join. A join can be either an inner join or an outer join,
depending on how one wants the resulting table to look.
It is best to illustrate the differences between inner and outer joins by use of an example. Here we
have 2 tables that we will use for our example:
Employee Location
Its important to note that the very last row in the Employee table does not exist in the Employee
Location table. Also, the very last row in the Employee Location table does not exist in the
Employee table. These facts will prove to be significant in the discussion that follows.
Outer Joins
Lets start the explanation with outer joins. Outer joins can be be further divided into left outer
joins, right outer joins, and full outer joins. Here is what the SQL for a left outer join would look
like, using the tables above:
select * from employee left outer join location
on employee.empID = location.empID;
In this SQL we are joining on the condition that the employee IDs match in the rows tables. So,
we will be essentially combining 2 tables into 1, based on the condition that the employee IDs
match. Note that we can get rid of the "outer" in left outer join, which will give us the SQL below.
This is equivalent to what we have above.
A left outer join retains all of the rows of the left table, regardless of whether there is a row that
matches on the right table. The SQL above will give us the result set shown below.
Earlier we had mentioned something called a join predicate. In the SQL above, the join
predicate is "on employee.empID = location.empID". This is the heart of any type of join,
because it determines what common column between the 2 tables will be used to "join" the 2
tables. As you can see from the result set, all of the rows from the left table are returned when we
do a left outer join. The last row of the Employee table (which contains the "Johson" entry) is
displayed in the results even though there is no matching row in the Location table. As you can
see, the non-matching columns in the last row are filled with a "NULL". So, we have "NULL" as the
entry wherever there is no match.
A right outer join is pretty much the same thing as a left outer join, except that the rows that are
retained are from the right table. This is what the SQL looks like:
Using the tables presented above, we can show what the result set of a right outer join would look
like:
We can see that the last row returned in the result set contains the row that was in the Location
table, but not in the Employee table (the "Bangalore, India" entry). Because there is no matching
row in the Employee table that has an employee ID of "39", we have NULLs in the result set for
the Employee columns.
Inner Joins
Now that weve gone over outer joins, we can contrast those with the inner join. The difference
between an inner join and an outer join is that an inner join will return only the rows that actually
match based on the join predicate. Once again, this is best illustrated via an example. Heres what
the SQL for an inner join will look like:
Now, here is what the result of running that SQL would look like:
We can see that an inner join will only return rows in which there is a match based on the join
predicate. In this case, what that means is anytime the Employee and Location table share an
Employee ID, a row will be generated in the results to show the match. Looking at the original
tables, one can see that those Employee IDs that are shared by those tables are displayed in the
results. But, with a left or right outer join, the result set will retain all of the rows from either the
left or right table.
A given table may have more than just one choice for a primary key. Basically, there may be
another column (or combination of columns for a multi-column primary key) that qualify as
primary keys. Any combination of column(s) that may qualify to be a primary key are known as
candidate keys. This is because they are considered candidates for the primary key. And the
options that are not selected to be the primary key are known as secondary keys.
Example of a superkey
Suppose we have a table that holds all the managers in a company, and that table is called
Managers. The table has columns called ManagerID, Name, Title, and DepartmentID. Every
manager has his/her own ManagerID, so that value is always unique in each and every row.
This means that if we combine the ManagerID column value for any given row with any other
column value, then we will have a unique set of values. So, for the combinations of (ManagerID,
Name), (ManagerID, TItle), (ManagerID, DepartmentID), (ManagerID, Name, DepartmentID), etc
there will be no two rows in the table that share the exact same combination of values, because
the ManagerID will always be unique and different for each row. This means that pairing the
Manager ID with any other column(s) will ensure that the combination will also be unique across
all rows in the table.
And that is exactly what defines a superkey its any combination of column(s) for which that
combination of values will be unique across all rows in a table. So, all of those combinations of
columns in the Manager table that we gave earlier would be considered to be superkeys. Even the
ManagerID column is considered to be a superkey, although a special type of superkey as you can
read more about below.
What are some tips on tuning SQL Indexes for better performance?
If youve already read our article on SQL indexes, then you know that indexes are used to make
queries run faster by reducing the time taken to look up data. But, the tradeoff for that improved
performance is that indexes also take up space, and there must be some maintenance done on
indexes to ensure that they continue to run smoothly. Lets go over some tips, guidelines, and
suggestions on how to properly use and maintain indexes to improve performance.
Example of cardinality
Suppose we have a table called People which has a Sex column that has only two possible values
of Male and Female. Then, that Sex column would have a cardinality of 2, because there are
only two unique values that could possibly appear in that column Male and Female.
Difference between cardinality and selectivity
In SQL, the term selectivity is used when discussing database indexes. The selectivity of a
database index is a numeric value that is calculated using a specific formula. That formula actually
uses the cardinality value to calculate the selectivity. This means that the selectivity is calculated
using the cardinality so the terms selectivity and cardinality are very much related to each other.
Here is the formula used to calculated selectivity:
This means that if the People table we were talking about earlier has 10,000 rows, then the
selectivity would be 2/10,000, which equals .02%. This is considered to be a very low selectivity.
Note that the number of records is equivalent to the number of rows in the table.
So, you see the formula and you are thinking thats great, but what does this actually
mean? Well, lets say we have a table with a Sex column which has only two possible
values of Male and Female. Then, that Sex column would have a cardinality of 2,
because there are only two unique values that could possibly appear in that column Male
and Female. If there are 10,000 rows in the table, then this means that the selectivity of an
index on that particular column will be 2/10,000 * 100%, which is .02%.
The key with the selectivity value is that it basically measures how selective the values
within a given column are in other words how many different values are available in the
given sample set. A selectivity of .02% is considered to be really low, and means that given
the number of rows, there is a very small amount of variation in the actual values for that
column. In our example Sex column,
Why does the database actually care about the selectivity and how does it use it? Well, lets
consider what a low selectivity means. A low selectivity basically means there is not a lot of
variation in the values in a column that there is not a lot of possibilities for the values of a
column. Suppose, using the example table that we discussed earlier, that we want to find the
names of all the females in the table.
If a session ends up waiting too long for some locked data, then some databases, like DB2 from
IBM, will actually time out after a certain amount of time and return an error instead of waiting
and then updating the data as requested. But some databases, like Oracle, may handle the
situation differently Oracle can actually leave a session in a lock wait state for an indefinite
amount of time. So, there are a lot of differences between different database vendors in terms of
how they choose to deal with locks and other sessions waiting for locks to be released.
Here is a list of the usual lock levels and types supported, and more information on what each
technique means:
Because the data that can be stored in blocks/pages can be wide and varied, page/block locking is
less favored in databases today.
The SQL statement above will lock the row or rows which have a value of XYZ for the column
named some_column. The locking of the row(s) happens behind the scenes as part of the
RDBMS software, and it prevents other database user sessions from updating the same row(s) at
the same exact time as well.
When sessions do stall for an indefinite period of time, that is known as deadlock, which you can
read more about here: Database deadlock.
Lock Escalation
You should also read about the concept of Lock Escalation, which is a built in feature of many
RDBMSs today.
Some other differences between foreign, primary, and unique keys
While unique and primary keys both enforce uniqueness on the column(s) of one table, foreign
keys define a relationship between two tables. A foreign key identifies a column or group of
columns in one (referencing) table that refers to a column or group of columns in another
(referenced) table in our example above, the Employee table is the referenced table and the
Employee Salary table is the referencing table.
As we stated earlier, both unique and primary keys can be referenced by foreign keys.
Data mining is the process of finding patterns in a given data set. These patterns can often provide
meaningful and insightful data to whoever is interested in that data. Data mining is used today in
a wide variety of contexts in fraud detection, as an aid in marketing campaigns, and even
supermarkets use it to study their consumers.
Data warehousing can be said to be the process of centralizing or aggregating data from multiple
sources into one common repository.
Another interesting example of data mining is how one grocery store in the USA used the data it
collected on its shoppers to find patterns in their shopping habits. They found that when men
bought diapers on Thursdays and Saturdays, they also had a strong tendency to buy beer. The
grocery store could have used this valuable information to increase their profits. One thing they
could have done odd as it sounds is move the beer display closer to the diapers. Or, they could
have simply made sure not to give any discounts on beer on Thursdays and Saturdays. This is data
mining in action extracting meaningful data from a huge data set.
A great example of data warehousing that everyone can relate to is what Facebook does. Facebook
basically gathers all of your data your friends, your likes, who you stalk, etc and then stores
that data into one central repository. Even though Facebook most likely stores your friends, your
likes, etc, in separate databases, they do want to take the most relevant and important
information and put it into one central aggregated database. Why would they want to do this? For
many reasons they want to make sure that you see the most relevant ads that youre most likely
to click on, they want to make sure that the friends that they suggest are the most relevant to
you, etc keep in mind that this is the data mining phase, in which meaningful data and patterns
are extracted from the aggregated data. But, underlying all these motives is the main motive: to
make more money after all, Facebook is a business.
We can say that data warehousing is basically a process in which data from multiple
sources/databases is combined into one comprehensive and easily accessible database. Then this
data is readily available to any business professionals, managers, etc. who need to use the data to
create forecasts and who basically use the data for data mining.
Datawarehousing vs Datamining
Remember that data warehousing is a process that must occur before any data mining can take
place. In other words, data warehousing is the process of compiling and organizing data into one
common database, and data mining is the process of extracting meaningful data from that
database. The data mining process relies on the data compiled in the datawarehousing phase in
order to detect meaningful patterns.
In the Facebook example that we gave, the data mining will typically be done by business users
who are not engineers, but who will most likely receive assistance from engineers when they are
trying to manipulate their data. The data warehousing phase is a strictly engineering phase, where
no business users are involved. And this gives us another way of defining the 2 terms: data mining
is typically done by business users with the assistance of engineers, and data warehousing is
typically a process done exclusively by engineers.
employee
employee_name employee_location
Sunil India
Alex Russia
Albert Canada
Now, suppose we want to find out which employees are from the same location as the employee
named Joe. In this example, that location would be New York. Lets assume for the sake of our
example that we can not just directly search the table for people who live in New York with a
simple query like this (maybe because we dont want to hardcode the city name) in the SQL
query:
SELECT employee_name
FROM employee
WHERE employee_location = "New York"
So, instead of a query like that what we could do is write a nested SQL query (basically a query
within another query which more commonly called a subquery) like this:
SELECT employee_name
FROM employee
WHERE employee_location in
( SELECT employee_location
FROM employee
WHERE employee_name = "Joe")
Using a subquery for such a simple question is inefficient. Is there a more efficient and elegant
solution to this problem?
It turns out that there is a more efficient solution we can use something called a self join. A self
join is basically when a table is joined to itself. The way you should visualize a self join for a given
table is by imagining that a join is performed between two identical copies of that table. And
that is exactly why it is called a self join because of the fact that its just the same table being
joined to another copy of itself rather than being joined with a different table.
Now, lets come up with a solution to the original problem using a self join instead of a subquery.
This will help illustrate how exactly a self join works. The key question that we must ask ourselves
is what should our join predicate be in this example? Well, we want to find all the employees who
have the same location as Joe.
Referential integrity is a relational database concept in which multiple tables share a relationship
based on the data stored in the tables, and that relationship must remain consistent.
The concept of referential integrity, and one way in which its enforced, is best illustrated by an
example. Suppose company X has 2 tables, an Employee table, and an Employee Salary table. In
the Employee table we have 2 columns the employee ID and the employee name. In the
Employee Salary table, we have 2 columns the employee ID and the salary for the given ID.
Referential integrity is a relational database concept in which multiple tables share a relationship
based on the data stored in the tables, and that relationship must remain consistent.
The concept of referential integrity, and one way in which its enforced, is best illustrated by an
example. Suppose company X has 2 tables, an Employee table, and an Employee Salary table. In
the Employee table we have 2 columns the employee ID and the employee name. In the
Employee Salary table, we have 2 columns the employee ID and the salary for the given ID.
By enforcing referential integrity, we can solve that problem, so that we wouldnt have to manually
delete him from the Employee Salary table (or any others). Heres how: first we would define the
employee ID column in the Employee table to be our primary key. Then, we would define the
employee ID column in the Employee Salary table to be a foreign key that points to a primary key
that is the employee ID column in the Employee table. Once we define our foreign to primary key
relationship, we would need to add whats called a constraint to the Employee Salary table. The
constraint that we would add in particular is called a cascading delete this would mean that any
time an employee is removed from the Employee table, any entries that employee has in the
Employee Salary table would alsoautomatically be removed from the Employee Salary table.
Note in the example given above that referential integrity is something that must beenforced, and
that we enforced only one rule of referential integrity (the cascading delete). There are actually 3
rules that referential integrity enforces:
1.We may not add a record to the Employee Salary table
unless the foreign key for that record points to an existing
employee in the Employee table.
3.If the primary key for a record in the Employee table changes,
all corresponding records in the Employee Salary table must be
modified using what's called a cascading update.
Its worth noting that most RDBMSs relational databases like Oracle, DB2, Teradata, etc. can
automatically enforce referential integrity if the right settings are in place. But, a large part of the
burden of maintaining referential integrity is placed upon whoever designs the database schema
basically whoever defined the tables and their corresponding structure/relationships in the
database that you are using. Referential integrity is an important concept and you simply must
know it for any programmer interview.
In other words, the surrogate key really has no business meaning i.e., the data stored in a
surrogate key has no intrinsic meaning to it.
The difference between the having and where clause is best illustrated by an example.
Suppose we have a table called emp_bonus as shown below. Note that the table has multiple
entries for employees A and B.
emp_bonus
Employee Bonus
A 1000
B 2000
A 500
C 700
B 1250
If we want to calculate the total bonus that each employee received, then we would write a
SQL statement like this:
Employee Sum(Bonus)
A 1500
B 3250
C 700
Now, suppose we wanted to find the employees who received more than $1,000 in bonuses
for the year of 2007. You might think that we could write a query like this:
BAD SQL:
select employee, sum(bonus) from emp_bonus
group by employee where sum(bonus) > 1000;
The WHERE clause does not work with aggregates like SUM
The SQL above will not work, because the where clause doesnt work with aggregates like
sum, avg, max, etc.. Instead, what we will need to use is the having clause. The having clause
was added to sql just so we could compare aggregates to other values just how the where
clause can be used with non-aggregates. Now, the correct sql will look like this:
GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;
Now, lets say that we want to run a query to find all the details of any employees who are named
Jesus? So, we decide to run a simple query like this:
Another type of index is a bitmap index, which work well on columns that contain Boolean values
(like true and false), but many instances of those values basically columns with low selectivity.
Lets say that we create a B- tree index on the Employee_Name column This means that when we
search for employees named Jesus using the SQL we showed earlier, then the entire Employee
table does not have to be searched to find employees named Jesus. Instead, the database will
use the index to find employees named Jesus, because the index will presumably be sorted
alphabetically by the Employees name. And, because it is sorted, it means searching for a name is
a lot faster because all names starting with a J will be right next to each other in the index! Its
also important to note that the index also stores pointers to the table row so that other column
values can be retrieved read on for more details on that.
As a general rule, an index should only be created on a table if the data in the indexed column will
be queried frequently.
Purchases
Now, lets suppose that the owner of the store wants to find out, on a given date, how many of
each product was sold in the store. Then we would write this SQL in order to find that out:
Note that in the SQL we wrote, the group by statement uses multiple columns: group by item,
purchase_date;. This allows us to group the individual items for a given date so basically we are
dividing the results by the date the items are purchased, and then for a given date we are able to
find how many items were purchased for that date. This is why the group by statement with
multiple columns is so useful!
It is best to illustrate the differences between left outer joins and right outer joins by use of an
example. Here we have 2 tables that we will use for our example:
Employee Location
For the purpose of our example, it is important to note that the very last employee in the
Employee table (Johnson, who has an ID of 25) is not in the Location table. Also, no one from the
Employee table is from Bangalore (the employee with ID 39 is not in the Employee table). These
facts will be significant in the discussion that follows.
In the SQL above, we are joining on the condition that the employee IDs match in the tables
Employee and Location. So, we will be essentially combining 2 tables into 1, based on the
condition that the employee IDs match. Note that we can get rid of the "outer" in left outer join,
which will give us the SQL below. This is equivalent to what we have above.
A left outer join retains all of the rows of the left table, regardless of whether there is a row that
matches on the right table. What are the left and right tables? Thats easy the left table is
simply the table that comes first in the join statement in this case it is the Employee table, its
called the left table because it appears to the left of the keyword join. So, the right table in
this case would be Location. The SQL above will give us the result set shown below.
As you can see from the result set, all of the rows from the left table (Employee) are returned
when we do a left outer join. The last row of the Employee table (which contains the "Johson"
entry) is displayed in the results even though there is no matching row in the Location table. As
you can see, the non-matching columns in the last row are filled with a "NULL". So, we have
"NULL" as the entry wherever there is no match.
Subscribe to our newsletter on the left to receive more free interview questions!
A right outer join is pretty much the same thing as a left outer join, except that all the rows from
the right table are displayed in the result set, regardless of whether or not they have matching
values in the left table. This is what the SQL looks like for a right outer join:
Using the tables presented above, we can show what the result set of a right outer join would look
like:
We can see that the last row returned in the result set contains the row that was in the Location
table, but which had no matching empID in the Employee table (the "Bangalore, India" entry).
Because there is no row in the Employee table that has an employee ID of "39", we have NULLs in
that row for the Employee columns.
So, what is the difference between the right and left outer joins?
The difference is simple in a left outer join, all of the rows from the left table will be displayed,
regardless of whether there are any matching columns in the right table. In a right outer
join, all of the rows from the right table will be displayed, regardless of whether there are any
matching columns in the left table. Hopefully the example that we gave above help clarified this
as well.
Actually, it doesnt matter. The right outer join does not add any functionality that the left outer
join didnt already have, and vice versa. All you would have to do to get the same results from a
right outer join and a left outer join is switch the order in which the tables appear in the SQL
statement. If thats confusing, just take a closer look at the examples given above.
Its important to note that the very last row in the Employee table does not exist in the Employee
Location table. Also, the very last row in the Employee Location table does not exist in the
Employee table. These facts will prove to be significant in the discussion that follows.
In the SQL above, we actually remove the "outer" in left outer join, which will give us the SQL
below. Running the SQL with the outer keyword, would give us the exact same results as
running the SQL without the outer. Here is the SQL without the outer keyword:
A left outer join (also known as a left join) retains all of the rows of the left table, regardless of
whether there is a row that matches on the right table. The SQL above will give us the result set
shown below.
Using the tables presented above, we can show what the result set of a right outer join would look
like:
We can see that the last row returned in the result set contains the row that was in the Location
table, but not in the Employee table (the "Bangalore, India" entry). Because there is no matching
row in the Employee table that has an employee ID of "39", we have NULLs in the result set for
the Employee columns.
In SQL, the having clause and the group by statement work together when using aggregate
functions like SUM, AVG, MAX, etc. This is best illustrated by an example. Suppose we have a table
called emp_bonus as shown below. Note that the table hasmultiple entries for employees A and B
which means that both employees A and B have received multiple bonuses.
emp_bonus
Employee Bonus
A 1000
B 2000
A 500
C 700
B 1250
If we want to calculate the total bonus amount that each employee has received, then we would
write a SQL statement like this:
Employee Sum(Bonus)
A 1500
B 3250
C 700
Now, suppose we wanted to find the employees who received more than $1,000 in bonuses for the
year of 2012 this is assuming of course that the emp_bonus table contains bonuses only for the
year of 2012. This is when we need to use the HAVINGclause to add the additional check to see if
the sum of bonuses is greater than $1,000, and this is what the SQL look like:
GOOD SQL:
select employee, sum(bonus) from emp_bonus
group by employee having sum(bonus) > 1000;
And the result of running the SQL above would be this:
Employee Sum(Bonus)
A 1500
B 3250
Lets say that you are given a SQL table called Compare (the schema is shown below) with only
one column called Numbers.
Compare
{
Numbers INT(4)
}
Write a SQL query that will return the maximum value from the Numbers column, without using
a SQL aggregate like MAX or MIN.
This problem is difficult because you are forced to think outside the box, and use whatever SQL
you know to solve a problem without using the most obvious solution (doing a select MAX from
the table).
Probably the best way to start breaking this problem down is by creating a sample table with some
actual data that matches the schema given. Here is a sample table to start out with:
Compare
Numbers
30
70
-8
90
The value that we want to extract from the table above is 90, since it is the maximum value in the
table. How can we extract this value from the table in a creative way (it will have to be creative
since we cant use the max or min aggregates)? Well, what are the properties of the highest
number (90 in our example)? We could say that there are no numbers larger than 90 that
doesnt sound very promising in terms of solving this problem.
We could also say that 90 is the only number that does not have a number that is greater than it.
If we can somehow return every value that does not have a value greater than it then we would
only be returning 90. This would solve the problem. So, we should try to design a SQL statement
that would return every number that does not have another number greater than it. Sounds fun
right?
Lets start out simple by figuring out which numbers do have any numbers greater than
themselves. This is an easier query. We can start by joining the Compare table with itself this is
called a self join, which you can read more about here in case you are not familiar with self
joins: Example of self join in SQL .
Using a self join, we can create all the possible pairs for which each value in one column is greater
than the corresponding value in the other column. This is exactly what the following query does:
Now, let's use the sample table we created, and we end up with this table after running the query
above:
Smaller Larger
-8 90
30 90
70 90
-8 70
30 70
70 90
Now we have every value in the "Smaller" column except the largest value of 90. This means that
all we have to do is find the value that is not in the Smaller column (but is in the Compare table),
and that will give us the maximum value. We can easily do this using the NOT IN operator in SQL.
But before we do that we have to change the query above so that it only selects the "Smaller"
column - because that is the only column we are interested in. So, we can simply change our
query above to this in order to get the "Smaller" column:
SELECT Smaller.Numbers
FROM Compare as Larger JOIN Compare AS Smaller
ON Smaller.Numbers < Larger.Numbers
Now, all we have to do is apply the NOT IN operator to find the max value.
SELECT Numbers
FROM Compare
WHERE Numbers NOT IN (
SELECT Smaller.Numbers
FROM Compare AS Larger
JOIN Compare AS Smaller ON Smaller.Numbers < Larger.Numbers
)
This will give us what we want - the maximum value. But there is one small problem with the SQL
above - if the maximum value is repeated in the Compare table then it will return that value twice.
We can prevent that by simply using the DISTINCT keyword. So, here's what the query looks like
now:
And there we have our final answer. Of course, some of you may be saying that there is a much
simpler solution to this problem. And you would be correct. Here is a simpler answer to the
problem using the SQL Top clause along with the SQL Order By clause - this is what it would look
like in SQL Server:
select TOP 1 -- select the very top entry in result set
Numbers
from
Compare
order by
Numbers DESC
And since MySQL does not have a TOP clause this is what it would look like in MySQL using just
ORDER BY and LIMIT :
select
Numbers
from
Compare
order by
Numbers DESC - order in descending order
LIMIT 1 --retrieve only one value
So, even though there are a couple of much simpler answers it is nice to know the more
complicated answer using a self join so that you can impress your interviewer with your
knowledge.
If you are confused, some examples of deadlock should definitely help clarify what goes on during
deadlock. And, you should probably read our explanation ofdatabase locks before proceeding since
that will help your understanding as well.
Note that the crediting of customer B and debiting of customer A will be run as a single transaction
this is important for the discussion that follows.
Now, lets also say that the other database user Y works in the IT department and has to go
through the customers table and update the zip code of all customers who currently have a zip
code of 94520, because that zip code has now been changed to 94521. So, the SQL for this would
simply have a WHERE clause that would limit the update to customers with a zip code of 94520.
Also, both customers A and B currently have zip codes of 94520, which means that their
information will be updated by database user Y.
Here is a breakdown of the events in our fictitious example that lead to deadlock:
1. Database user X in the customer service department selects customer As data and
updates As bank balance to debit/decrease it by $5,000. However, whats important here
is that there is no COMMIT issued yet because database user X still has to update
customer Bs balance to increase/credit by $5,000 and those 2 separate SQL statements
will run as a single SQL transaction. Most importantly, this means that database user X
still holds a lock on the row for customer A because his transaction is not fully
committed yet (he still has to update customer A). The lock on the row for customer
A will stay until the transaction is committed.
2. Database user Y then has to run his SQL to update the zip codes for customers with zip
codes of 94520. The SQL then updates customer Bs zip code. But, because the SQL
statement from user Y must be run as a single transaction, the transaction has not
committed yet because all of the customers havent had their zip codes changed yet. So,
this means that database user Y holds a lock on the row for customer B. .
3. Now, Database user X still has to run the SQL statement that will update customer Bs
balance to increase it by $5,000. But, now the problem is that database user Y has a lock
on the row for customer B. This means that the request to update customer Bs balance
must wait for user Y to release the lock on customer B. So, database user X is waiting for
user Y to release a lock on customer B.
4. Now, the SQL statement being run by user Y tries to update the zip code for customer
A. But, this update can not happen because user X holds a lock on customer As row. So,
user Y is waiting for a lock to be released by user X.
5. Now you can see that we have user X waiting for user Y to release a lock and user Y
waiting for user X to release a lock. This is the situation of deadlock, since neither user can
make any progress, and nothing happens because they are both waiting for each other.
So, in theory, these two database sessions will be stalled forever. But, read on to see how
some DBMSs deal with this unique situation.
In either of the deadlock detection methods, one of the requests will have to be terminated to stop
the deadlock. This also means that any transaction changes which came before the request will
have to be rolled back so that the other request can make progress and finish.
Remember that an index is usually a tree data structure and leaf nodes are the nodes that are at
the very bottom of that tree. In other words, a clustered index basically contains the actual
table level data in the index itself. This is very different from most other types of indexes as
you can read about below.
Owners
Owner_Name
Owner_Age
Cars
Car_Type
Owner_Name
Lets assume that a given owner can have multiple cars so a single Owner_Name can appear
multiple times in the Cars table. Now, lets say that we create a clustered index on the
Owner_Name column in the Cars table. What does this accomplish for us? Well, because a
clustered index is stored physically on the disk in the same order as the index, it would mean that
a given Owner_Name would have all his/her car entries stored right next to each other on disk. In
other words, if there is an owner named Joe Smith or Raj Gupta, then each owner would have
all of his/her entries in the Cars table stored right next to each other on the disk.
Note that having an index on the Owner_Name would not necessarily be unique, because there are
many people who share the same name. So, you might have to add another column to the
clustered index to make sure that its unique.
What is a disadvantage to using a clustered index?
A disadvantage to using a clustered index is the fact that if a given row has a value updated in one
of its (clustered) indexed columns what typically happens is that the database will have to move
the entire row so that the table will continue to be sorted in the same order as the clustered index
column. Consider our example above to clarify this. Suppose that someone named Rafael Nadal
buys a car lets say its a Porsche from Roger Federer. Remember that our clustered index is
created on the Owner_Name column. This means that when we do a update to change the name
on that row in the Cars table, the Owner_Name will be changed from Roger Federer to Rafael
Nadal.
But, since a clustered index also tells the database in which order to physically store the rows on
disk, when the Owner_Name is changed it will have to move an updated row so that it is still in
the correct sorted order. So, now the row that used to belong to Roger Federer will have to be
moved on disk so that its grouped (or clustered) with all the car entries that belong to Rafael
Nadal. Clearly, this is a performance hit. This means that a simple UPDATE has turned into a
DELETE and then an INSERT just to maintain the order of the clustered index. For this exact
reason, clustered indexes are usually created on primary keys or foreign keys, because of the fact
that those values are less likely to change once they are already a part of a table.
This means that with a non-clustered index extra work is required to follow that pointer to the
row in the table to retrieve any other desired values, as opposed to a clustered index which can
just access the row directly since it is being stored in the same order as the clustered index itself.
So, reading from a clustered index is generally faster than reading from a non-clustered index.
A clustered index determines the order in which the rows of the table will be stored on disk
and it actually stores row level data in the leaf nodes of the index itself. A non-clustered
index has no effect on which the order of the rows will be stored.
Using a clustered index is an advantage when groups of data that can be clustered are
frequently accessed by some queries. This speeds up retrieval because the data lives close
to each other on disk. Also, if data is accessed in the same order as the clustered index,
the retrieval will be much faster because the physical data stored on disk is sorted in the
same order as the index.
A clustered index can be a disadvantage because any time a change is made to a value of
an indexed column, the subsequent possibility of re-sorting rows to maintain order is a
definite performance hit.
A table can have multiple non-clustered indexes. But, a table can have only one clustered
index.
Non clustered indexes store both a value and a pointer to the actual row that holds that
value. Clustered indexes dont need to store a pointer to the actual row because of the fact
that the rows in the table are stored on disk in the same exact order as the clustered index
and the clustered index actually stores the row-level data in its leaf nodes.
Be careful using triggers. Using triggers on specific database tables can be a nice way to
come up with a solution for a specific problem. For example, if you want to enforce a
difficult constraint, triggers can be a good solution. But, on the other hand, when a trigger
is fired (or triggered) by a given SQL statement, then all the work that trigger does
becomes a part of the SQL statement. This extra work can slow down the SQL statement,
and sometimes that can cause a big hit on performance if the trigger fires a lot.
Use the same data type for both primary and foreign key columns. This means that when
defining a primary and foreign key, its not a good idea to define the primary key with a
VARCHAR data type and a foreign key with a CHAR data type its better to define both
keys as either VARCHAR or CHAR. This is because table joins when the data types are
different, then the DBMS will have to convert one of the data types to the same type as
the other one. This process of conversion from one data type to another is extra work that
the RDBMS, and of course slows down any joins that are done between tables with
columns that have different data types.
When choosing a data type for a numeric column, you should always pick thesmallest data
type that can hold your data. Choosing the smallest data type that still fits your data can
save a lot of space. For example, if you choose to use the BIGINT data type when all of
your data for a given column could actually fit in the TINYINT or SMALLINT data types
would be a huge waste of space. Think of it this way if your table has millions of rows
then it would result in millions of entries for that BIGINT column which would have too
much extra space inside them. So, the unused and wasted space could add up very fast.
Another thing to consider is to use CHAR instead of VARCHAR when you know that the
entries will have the same length. VARCHAR columns actually have an extra 1 to 3 bytes
per entry so that they can also hold the length of the actual data. In addition to that extra
1 to 3 bytes, whenever a VARCHAR column has some change in the column data, there is
some more processing overhead because the length of the column data has to be
calculated yet again.
One possibility to help tune database performance by reducing the number of I/O
operations is to use more than just one physical disk drive. By having multiple disk drives,
you can spread out the database files on the different disk drives. By spreading out the
database files over multiple drives, you can have many I/O operations happening in
parallel, which can be a big performance advantage.
Another important approach to correctly tuning your database for better performance is to
make sure that you allocate buffers which have the right size. What exactly is a buffer?
Well, its a part of the memory (the RAM) that is used to temporarily store data that has
recently been retrieved from the database and also temporarily hold data that will
eventually be stored in the permanent storage (like a disk). If you have a buffer of the
right size, then data that has already been read from the database will remain in the buffer
for a fair amount of time, and that means that if a new query is run asking for the same
data, then it can be retrieved directly from the buffer rather than making another
expensive (performance-wise) query into the database.
Buffers can also be used to store data that is written to the buffer by the RDBMS for
temporary storage. Then, this data can later be copied to permanent storage at some
point in the future. This is also known as asynchronous I/O because the data is not written
immediately from the RDBMS to the permanent storage all major RDBMSs have some
form of it, since its quite important for performance.
What are some tips on tuning SQL Indexes for better performance?
If youve already read our article on SQL indexes, then you know that indexes are used to make
queries run faster by reducing the time taken to look up data. But, the tradeoff for that improved
performance is that indexes also take up space, and there must be some maintenance done on
indexes to ensure that they continue to run smoothly. Lets go over some tips, guidelines, and
suggestions on how to properly use and maintain indexes to improve performance.
What are some tips on tuning SQL Indexes for better performance?
If youve already read our article on SQL indexes, then you know that indexes are used to make
queries run faster by reducing the time taken to look up data. But, the tradeoff for that improved
performance is that indexes also take up space, and there must be some maintenance done on
indexes to ensure that they continue to run smoothly. Lets go over some tips, guidelines, and
suggestions on how to properly use and maintain indexes to improve performance.
If the SQL above looks scary to you, dont worry its still easy to understand for our purposes
here. The subquery portion of the SQL above begins after the NOT IN statement. The reason that
the query above is an uncorrelated subquery is that the subquery can be run independentlyof the
outer query. Basically, the subquery has no relationship with the outer query.
Now, a correlated subquery has the opposite property the subquery cannot be run
independently of the outer query. You can take a look at this example of a correlated subquery
below and easily see the difference yourself:
SELECT *
FROM Employee Emp1
WHERE (1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > Emp1.Salary)
What you will notice in the correlated subquery above is that the inner subquery uses
Emp1.Salary, but the alias Emp1 is created in the outer query. This is why it is called a correlated
subquery, because the subquery references a value in its WHERE clause (in this case, it uses a
column belonging to Emp1) that is used in the outer query.
You can also read this particular SQL problem to get more detailed information on how the
correlated query above works: SQL Find nth highest salary
Suppose that you are given the following simple database table
called Employee that has 2 columns named Employee ID and
Salary:
Employee
Employee ID Salary
3 200
4 800
7 450
Write a SQL query to get the second highest salary from the table
above. Also write a query to find the nth highest salary in SQL,
where n can be any number.
The easiest way to start with a problem like this is to ask yourself a simpler question first. So, lets
ask ourselves how can we find the highest salary in a table? Well, you probably know that is
actually really easy we can just use the MAX aggregate function:
So, if we can somehow select the highest value from a result set thatexcludes the highest value,
then we would actually be selecting the 2nd highest salary value. Think about that carefully and
see if you can come up with the actual SQL yourself before you read the answer that we provide
below. Here is a small hint to help you get started: you will have to use the NOT IN SQL
operator.
Solution to finding the 2nd highest salary in SQL
Now, here is what the SQL will look like:
Running the SQL above would return us 450, which is of course the 2nd highest salary in the
Employee table.
This now means that the highest value in this new result set will actually be the 2nd highest value
in the Employee table. So, we then select the max Salary from the new result set, and that gives
us 2nd highest Salary in the Employee table. And that is how the query above works.
How would you write a SQL query to find the Nth highest salary?
What we did above was write a query to find the 2nd highest Salary value in the Employee table.
But, another commonly asked interview question is how can we use SQL to find theNth highest
salary, where N can be any number whether its the 3rd highest, 4th highest, 5th highest, 10th
highest, etc? This is also an interesting question try to come up with an answer yourself before
reading the one below to see what you come up with.
The answer and explanation to finding the nth highest salary in SQL
Here we will present one possible answer to finding the nth highest salary first, and the
explanation of that answer after since its actually easier to understand that way. Note that the
first answer we present is actually not optimal from a performance standpoint since it uses a
subquery, but we think that it will be interesting for you to learn about because you might just
learn something new about SQL. If you want to see the more optimal solutions first, you can skip
down to the sections that says Find the nth highest salary without a subquery instead.
The SQL below will give you the correct answer but you will have to plug in an actual value for N
of course. This SQL to find the Nth highest salary should work in SQL Server, MySQL, DB2, Oracle,
Teradata, and almost any other RDBMS:
The most important thing to understand in the query above is that the subquery is evaluated each
and every time a row is processed by the outer query. In other words, the inner query can not be
processed independently of the outer query since the inner query uses the Emp1 value as well.
SELECT *
FROM Employee Emp1
WHERE (1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > Emp1.Salary)
You can probably see that Emp1 and Emp2 are just aliases for the same Employee table its like
we just created 2 separate clones of the Employee table and gave them different names.
Employee
Employee ID Salary
3 200
4 800
7 450
For the sake of our explanation, lets assume that N is 2 so the query is trying to find the 2nd
highest salary in the Employee table. The first thing that the query above does is process the very
first row of the Employee table, which has an alias of Emp1.
The salary in the first row of the Employee table is 200. Because the subquery is correlated to the
outer query through the alias Emp1, it means that when the first row is processed, the query will
essentially look like this note that all we did is replace Emp1.Salary with the value of 200:
SELECT *
FROM Employee Emp1
WHERE (1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > 200)
So, what exactly is happening when that first row is processed? Well, if you pay special attention
to the subquery you will notice that its basically searching for the count of salary entries in the
Employee table that are greater than 200. Basically, the subquery is trying to find how many
salary entries are greater than 200. Then, that count of salary entries is checked to see if it equals
1 in the outer query, and if so then everything from that particular row in Emp1 will be returned.
Note that Emp1 and Emp2 are both aliases for the same table Employee. Emp2 is only being
used in the subquery to compare all the salary values to the current salary value chosen in Emp1.
This allows us to find the number of salary entries (the count) that are greater than 200. And if
this number is equal to N-1 (which is 1 in our case) then we know that we have a winner and
that we have found our answer.
But, its clear that the subquery will return a 2 when Emp1.Salary is 200, because there are clearly
2 salaries greater than 200 in the Employee table. And since 2 is not equal to 1, the salary of 200
will clearly not be returned.
So, what happens next? Well, the SQL processor will move on to the next row which is 800, and
the resulting query looks like this:
SELECT *
FROM Employee Emp1
WHERE (1) = (
SELECT COUNT(DISTINCT(Emp2.Salary))
FROM Employee Emp2
WHERE Emp2.Salary > 800)
Since there are no salaries greater than 800, the query will move on to the last row and will of
course find the answer as 450. This is because 800 is greater than 450, and the count will be 1.
More precisely, the entire row with the desired salary would be returned, and this is what it would
look like:
EmployeeID Salary
7 450
Its also worth pointing out that the reason DISTINCT is used in the query above is because there
may be duplicate salary values in the table. In that scenario, we only want to count repeated
salaries just once, which is exactly why we use the DISTINCT operator.
Think of it this way we are looking for a pattern that will lead us to the answer. One way to look
at it is that the 2nd highest salary would have just one salary that is greater than it. The 4th
highest salary would have 3 salaries that are greater than it. In more general terms, in order to
find the Nth highest salary, we just find the salary that has exactly N-1 salaries greater
than itself. And that is exactly what the query above accomplishes it simply finds the salary
that has N-1 salaries greater than itself and returns that value as the answer.
Find the nth highest salary using the TOP keyword in SQL Server
We can also use the TOP keyword (for databases that support the TOP keyword, like SQL Server)
to find the nth highest salary. Here is some fairly simply SQL that would help us do that:
To understand the query above, first look at the subquery, which simply finds the N highest
salaries in the Employee table and arranges them in descending order. Then, the outer query will
actually rearrange those values in ascending order, which is what the very last line ORDER BY
Salary does, because of the fact that the ORDER BY Default is to sort values in ASCENDING order.
Finally, that means the Nth highest salary will be at the top of the list of salaries, which means we
just want the first row, which is exactly what SELECT TOP 1 Salary will do for us!
Find the nth highest salary without using the TOP keyword
There are many other solutions to finding the nth highest salary that do not need to use the TOP
keyword, one of which we already went over. Keep reading for more solutions.
Note that the DESC used in the query above simply arranges the salaries in descending order so
from highest salary to lowest. Then, the key part of the query to pay attention to is the LIMIT N-
1, 1. The LIMIT clause takes two arguments in that query the first argument specifies the offset
of the first row to return, and the second specifies the maximum number of rows to return. So, its
saying that the offset of the first row to return should be N-1, and the max number of rows to
return is 1. What exactly is the offset? Well, the offset is just a numerical value that represents the
number of rows from the very first row, and since the rows are arranged in descending order we
know that the row at an offset of N-1 will contain the (N-1)th highest salary.
Note that I havent personally tested the SQL above, and I believe that it will only work in SQL
Server 2012 and up. Let me know in the comments if you notice anything else about the query.
select * from (
select Emp.*,
row_number() over (order by Salary DESC) rownumb
from Employee Emp
)
where rownumb = n; /*n is nth highest salary*/
The first thing you should notice in the query above is that inside the subquery the salaries are
arranged in descending order. Then, the row_number analytic function is applied against the list of
descending salaries. Applying the row_number function against the list of descending salaries
means that each row will be assigned a row number starting from 1. And since the rows are
arranged in descending order the row with the highest salary will have a 1 for the row number.
Note that the row number is given the alias rownumb in the SQL above.
This means that in order to find the 3rd or 4th highest salary we simply look for the 3rd or 4th
row. The query above will then compare the rownumb to n, and if they are equal will return
everything in that row. And that will be our answer!
select * FROM (
select EmployeeID, Salary
,rank() over (order by Salary DESC) ranking
from Employee
)
WHERE ranking = N;
The rank function will assign a ranking to each row starting from 1. This query is actually quite
similar to the one where we used the row_number() analytic function, and works in the same way
as well.
Weve now gone through many different solutions in different database vendors like Oracle,
MySQL, and SQL Server. Hopefully now you understand how to solve a problem like this, and you
have improved your SQL skills in the process! Be sure to leave a comment if you have any
questions or observations.
Purchases
Note that in the SQL we wrote, the group by statement uses multiple columns: group by item,
purchase_date;. This allows us to group the individual items for a given date so basically we are
dividing the results by the date the items are purchased, and then for a given date we are able to
find how many items were purchased for that date. This is why the group by statement with
multiple columns is so useful!
System privileges
System privileges are privileges given to users to allow them to perform certain functions that deal
withmanaging the database and the server . Most of the different types of permissions supported
by the database vendors fall under the system privilege category. Lets go through some examples
of system privileges in Oracle and SQL Server.
Object privileges
Object privileges are privileges given to users so that they can perform certain actions upon
certain database objects where database objects are things like tables, stored procedures,
indexes, etc. Some examples of object privileges include granting a particular database user the
right to DELETE and/or SELECT from a particular table. This is done using the GRANT clause, which
you can read more about here: SQL GRANT.
Lets go over some things that you should know if you plan on using the GRANT statement.
SQL GRANT ON
The ON clause is only used to grant object privileges not system privileges. This clause specifies
which object privileges (as in which table privileges, view privileges, etc..) are being granted.
In the example above, the GRANT statement is used to give the privilege of being able to select
from SOME_TABLE to the SOME_USER user.
REVOKE privilege [, privilege ... ] /* can add more privileges here too */
[ON database object ] /*object can be tables, views, etc.. */
FROM grantee [, grantee ...] ; /*user, PUBLIC, or role*/
Lets go over some things that you should know if you plan on using the REVOKE statement.
SQL REVOKE ON
The ON clause is only used to revoke object privileges not system privileges. This clause
specifies which object privileges (as in which table privileges, view privileges, etc..) are being
revoked.
In the example above, the REVOKE statement is used to revoke the privilege of being able to
select from SOME_TABLE from the SOME_USER user.
The majority of todays RDBMSs come with predefined roles that can be assigned to any user. But,
a database user can also create his/her own role if he or she has the CREATE ROLE privilege.
An example of this is that in older versions of Oracle (before release 10.2), there is a role called
CONNECT, which included privileges like CREATE TABLE, CREATE VIEW, CREATE SESSIONS, ALTER
SESSION, and several other privileges. But, having all of these privileges is probably too much for
a normal business user. That is probably why in newer versions of Oracle (since version 10.2), the
CONNECT role has been changed so that it only has the CREATE SESSION privilege.
Here are some non-RDBMS specific things to keep in mind when tuning your SQL queries to
improve performance:
GROUP BY may be better to use than DISTINCT. In some DBMSs GROUP BY is a more efficient
way of retrieving unique rows than DISTINCT. This is because GROUP BY performs the sort that
finds duplicates earlier in the processing of a query than DISTINCT. The DISTINCT clause will
perform the sort at the very last step, and will do this against the final result set. Why does it
matter when the sort is performed? Well, if duplicate rows are eliminated earlier on in the
processing of a query, then it means that the rest of the processing of the query will be more
efficient because there will presumably be less rows to perform the rest of processing on, since the
duplicates have already been eliminated. For your particular RDBMS, you should look at the
explain plans for running a query with GROUP BY or DISTINCT to see how they compare.
Hints might help you tune your SQL queries. What is a hint? A hint is special syntax that you can
put inside your SQL. What does a hint do? Well, it tells the query optimizer to perform a certain
action, like if you want to tell the optimizer to use a certain method to join tables, or if you want to
tell the optimizer to use a certain index.
Now that weve cleared up what we mean by predicates, lets get back to the original topic.
So, we said that you should know if your optimizer takes the order of the predicates in a
WHERE clause into account, and if that order affects the order in which the predicates are
evaluated. But, why should the order of the predicates matter? Well, if you optimizer does
take the order into account, then you would want the predicate that eliminates the
higher number of rows to be evaluated first by the optimizer. So, for example, lets
say that we have a table called Websites which has columns for the website_active and the
website_subject. The website_active column is just a yes or a no entry, and lets
assume that most of the rows (something like 90%) in the table have a yes value for
website_active. But, lets also say that there are three possible subjects like technical,
self help, cooking, etc. And, the subjects are evenly distributed amongst the rows so
1/3rd of the rows are technical, 1/3rd are self help, etc.
Now, lets say we want to run a query with a where clause like this WHERE
website_subject = cooking AND website_active = NO. Which predicate of the WHERE
clause should be executed first the website_active = NO or the website_subject =
cooking? Well, think about that on your own for a second. Wouldnt it make more sense
to run the predicate which eliminates more rows first? That way, the second predicate has
less rows to process. With that in mind, lets ask ourself which predicate will eliminate
more rows? Wouldnt it be the check to see WHERE website_active = NO? Because, that
check will eliminate 90% of the rows in the table. But the check for website_subject =
cooking would only eliminate 67% of the rows, so clearly 90% is better which means
that the predicate that checks website_active = NO should be run first.
If a transaction is implicitly started, then the transaction will continue until its either fully
committed or until the transaction is rolled back. One scenario in which the transaction could
potentially be rolled back is if the user disconnects before having submitted a statement that
would end the transaction.
why its called autocommit mode. Autocommit is used by default in every connection to
SQL Server unless the implicit mode is set or an explicit transaction is started.