Ace the Data Engineer Interview PDF
Christopher Garzon
www.dataengineeracademy.com
Table Of Contents
PART 1: Introduction
PART 2: Data Modeling
PART 3: SQL
PART 4: System Design
Section 9: Compute
Section 10: System Design – Case Studies
PART 5: Algorithms
PART 6: Behavioral
PART 1: Introduction
● building datasets
● cleaning up data
● monitoring data quality
● creating real time processes
● scaling data to make sure the product can handle large amounts of data
● some fun SQL and Python exercises
● and much more!
If after reading this book you feel like you could benefit from 1-1 mentorship with a FAANG
engineer, visit Data Engineer Academy.
Remember that A LOT of companies have some sort of educational budget, so you can most
likely expense the mentorship package (or part of it) through your company. If you need me to
talk with someone at your company or give you a sample invoice, just email me.
PART 2: Data Modeling
Table 1: Fact_daily_sales

Date       | Sales       | Number_of_units_sold
2022-01-01 | $100,150.65 | 42
2022-01-02 | $60,250.40  | 35
2022-01-03 | $82,325.35  | 39
…          | …           | …
2022-09-01 | $142,580.12 | 95
We can see that the Fact_daily_sales table is aggregating the total amount of sales and units per
day. The date column in table 1 is the primary key. This means that there should not be two
rows with the same date value.
What if you wanted to see how much each buyer at AMZ bought every day?
Table 2: Fact_daily_activity

Date       | User_Id | Sales   | Number_of_units_sold
2022-01-01 | 123     | $50.65  | 1
2022-01-01 | 456     | $150.40 | 15
2022-01-01 | 349     | $83.35  | 2
2022-01-02 | 123     | $346.12 | 3
…          | …       | …       | …
2022-09-01 | 123     | $36.12  | 1
Fact_daily_activity is aggregating the total amount of sales per day and user. The date
and user_id columns together in table 2 are what’s known as the compound primary key. This
time we see that a date OR a user_id can repeat more than once, but together they should not.
The user_id is also what is known as a foreign key, meaning that it can be used to join to another
table where the user_id column is the primary key. Finally, what if you wanted to break this table
into each order?
Table 3: Fact_daily_orders

Date       | User_Id | Order_id | Sales   | Number_of_units_sold
2022-01-01 | 123     | xydhi    | $50.65  | 1
2022-01-01 | 456     | flma     | $100.40 | 10
2022-01-01 | 456     | ture     | $50.00  | 5
2022-01-02 | 123     | tore     | $146.12 | 1
…          | …       | …        | …       | …
2022-09-01 | 123     | tgrw     | $36.12  | 1
Fact_daily_orders is aggregating the total amount of sales and units per day, user, and
order_id. Finally, the date, user_id, and order_id columns in table 3 are also the compound
primary key and the user_id and order_id can both be foreign keys depending on how the
database is set up.
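To make the key relationships concrete, here is a minimal sketch of how those keys might be declared in SQL (the exact column types are assumptions, not the book's schema):

CREATE TABLE fact_daily_orders (
    date                  DATE         NOT NULL,
    user_id               INTEGER      NOT NULL,
    order_id              VARCHAR(20)  NOT NULL,
    sales                 DECIMAL(12,2),
    number_of_units_sold  INTEGER,
    -- the grain: one row per date, user, and order
    PRIMARY KEY (date, user_id, order_id),
    -- user_id points back to the table where it is the primary key
    FOREIGN KEY (user_id) REFERENCES dim_users (user_id)
);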
In every scenario there is a grain and there is a measurement. In this case the
measurement is in the form of an aggregation. Don’t worry if you don’t quite fully understand
the part about the keys - we will go in more depth in the database modeling section. For now,
let’s continue focusing on fact tables.
The 3 main types of fact tables are transaction, periodic snapshot, and accumulating snapshot
tables.
● Accumulating snapshot tables: The accumulating snapshot table is less common but
still important. This table means that 1 row is getting updated along a certain process or
time period. Here is a great article behind the history of the accumulating snapshot table.
A good example is thinking about a weekly periodic snapshot table. Table 1 is at a daily
grain. Let's say we wanted a weekly grain. For a periodic snapshot, we can update the
table once a week, every Monday as an example. What if we wanted to track the weekly
progress before the week was up though? We can insert a row on the Tuesday before and
then update that row every day by adding that day's sales onto that single row
representing the weekly total (see the sketch below).
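A hedged sketch of that insert-then-update pattern (the table and column names are made up for illustration):

-- Tuesday: start the accumulating row for the week
INSERT INTO fact_weekly_sales (week_start_date, sales, number_of_units_sold)
VALUES ('2022-08-29', 0, 0);

-- Every day after that: fold the day's totals into that same row
UPDATE fact_weekly_sales
SET sales = sales + 142580.12,
    number_of_units_sold = number_of_units_sold + 95
WHERE week_start_date = '2022-08-29';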
Stick with me, you will see in section 3 how this relates to passing 1 part of the data engineering
interview.
Section 3: Dim Tables
In the previous section we learned about the three different types of fact tables. We also
went over some keys, but we will save that for the next section. For this part, we will learn about
the other common type of table – dim (dimensional) tables.
A dim table is a table that organizes descriptive attributes into columns. We will discuss
the most common type of dimensional table: SCD (Slowly Changing Dimensions). As you can
tell by the name, an SCD is a table where the values won't change much over time. Within SCD,
there are three different types – Type1, Type2, Type3.
Type1
A Type1 table is the most common dimensional table and involves simply updating a
row. Keeping with the example in the Fact Tables section, let’s zoom in on User_id 123 and
pretend that we have another table called dim_users.
Before:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

After:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | Seattle
One can probably already guess that dim_users is a table that won't change much. After all, how
often does someone change their name? However, you can see above that Neff Nezos decided to
change his primary location. Therefore, we decided to simply change the value of New York to
Seattle.
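In SQL, a Type1 change is just an in-place update along these lines (a sketch that follows the dim_users example above):

UPDATE dim_users
SET primary_residence_location = 'Seattle'
WHERE user_id = 123;   -- the old value, New York, is overwritten and lost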
Type2
A Type2 table is one where the before does not get erased. As you can see below, we
simply INSERT a new row using a SQL command. The benefit here is that we get to keep the
history and do analysis to see if there were any changes.
Before:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

After:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York
123     | Neff       | Nezos     | Seattle
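A minimal sketch of the Type2 pattern (in practice you would usually also carry effective-date or is_current columns, which are omitted here):

INSERT INTO dim_users (user_id, first_name, last_name, primary_residence_location)
VALUES (123, 'Neff', 'Nezos', 'Seattle');   -- the New York row stays, so history is preserved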
Type3
A Type3 table is one where we simply add a new column. “Old Residence Location”
now exists in the after table. This is not as common and probably will not get asked during an
interview, so I will not spend too much time on this.
Before:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

After:

User_Id | First Name | Last Name | Primary Residence Location | Old Residence Location
123     | Neff       | Nezos     | Seattle                    | New York
My personal preference is that you should avoid Type3 if you can, because it
becomes hard to code a table schema (the skeleton of the table) if you don't know what the table
columns are going to be. For example, what if Neff Nezos moves 10 times? Will your dim_users
table suddenly have 10 columns for this one user_id while the other user_ids have only 1 or 2
applicable, non-null columns?
This leads us to our next section: database modeling. We will discuss why fact and dim
tables are the way they are and the many benefits there are to constructing tables this way. This
will help you pass the data modeling section of the interview and after that, you will be ready to
run some SQL queries.
Data Modeling is when you model out what your data warehouse is going to look like. A
data warehouse can contain one or more schemas and a schema can contain multiple tables. Use
this analogy: think of a data warehouse as a physical warehouse, a schema as a section or corner
of the warehouse, and in that corner, there are boxes (the tables) that typically belong to one part
of the company. For example, let’s say you work for a hedge fund and are tasked with building
an entire data warehouse from scratch. What are some examples of schemas you might build?
You could build a sales schema only available to the sales team, a marketing schema only
available to the marketing team, and an investments schema only available to the investments
team. The reason for doing this is for a) organization and b) security and privacy. As a data
engineer, you may be tasked with database admin tasks such as controlling who has access to
what. The investments team won’t need access to marketing data that might contain customer’s
personal information. Therefore, it is a good idea to separate these tables into different schemas.
Now let’s talk about tables and continue using the examples from section 1 and 2. Did
you ever wonder why exactly dim and fact tables exist? Let’s look at dim_users and
fact_daily_activity again.
core.dim_users

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

core.fact_daily_activity

Date       | User_Id | Sales   | Number_of_units_sold
2022-01-01 | 123     | $50.65  | 1
2022-01-01 | 456     | $150.40 | 15
2022-01-01 | 349     | $83.35  | 2
2022-01-02 | 123     | $346.12 | 3
…          | …       | …       | …
2022-09-01 | 123     | $36.12  | 1
You can see the “core” at the beginning of the table name. That’s because now we have
placed our two tables in the “core” schema. What is “core” exactly? At big tech companies it is
common for multiple teams to have access to the main type of data of the company. Therefore,
the main tables just go into the “core” schema or the “public” schema. The name doesn’t really
matter – it just identifies that it is available to anyone for use.
Back to the table design; why did we need a dim table to begin with when we could just
add the "first_name" and "last_name" columns from core.dim_users to core.fact_daily_activity?
Two reasons:
1) Notice that the user_id value (123) doesn’t change. Even if Neff decided to change
his name to Jeff, user_id is still 123 in both tables, and we can JOIN these two tables
on the value.
2) There is no need to store a column like first_name in core.fact_daily_activity, because this
column will (most likely) always be the same for a given user and would therefore take up A LOT
of unnecessary space, which leads to a lot of $$ that can be saved. This is one way that a
knowledgeable Data Engineer can help his/her company save money.
Star Schemas
If you recall from section 1, we mentioned how these tables had columns that are
primary and/or foreign keys. Looking at core.dim_users above, we can see that the user_id
column fits the definition of a primary key: no two rows will have the same user_id value of
"123". Looking at the core.fact_daily_activity table above, we see that the date and user_id
columns form a compound primary key, but the user_id column stands out. Why? Because in this
case, the user_id is also a foreign key. In other words, the foreign key (the user_id in the
core.fact_daily_activity table) can be joined to the primary key of another table (the user_id in the
core.dim_users table).
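For example, one way to write that foreign-key-to-primary-key join against the two tables above:

SELECT
    f.date
    ,d.first_name
    ,d.last_name
    ,f.sales
FROM core.fact_daily_activity f
JOIN core.dim_users d          -- foreign key on the fact side ...
    ON f.user_id = d.user_id   -- ... meets the primary key on the dim side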
In the data modeling section of the interview, you will most likely get asked to model
out a schema and design tables in that schema. These concepts are the essence of what makes
up a star schema. A star schema, which is by far the most popular, is the idea that a fact table is
in the middle of the schema surrounded by a bunch of dim tables. See the tables below.
core.dim_users

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

core.dim_regions

Region_id | Region | … | Currency
1         | US     | … | USD
7         | CA     | … | CAD

core.fact_daily_orders

Date       | User_Id | Region_id | Sales   | Number_of_units_sold
2022-01-01 | 123     | 1         | $50.65  | 1
2022-01-01 | 456     | 7         | $100.40 | 10
2022-01-01 | 456     | 1         | $50.00  | 5
2022-01-02 | 123     | 1         | $146.12 | 1
…          | …       | …         | …       | …
2022-09-01 | 123     | 1         | $36.12  | 1

core.dim_dates

Date       | Week_Number | Month | Day_of_Month
2022-01-01 | 1           | 1     | 1
…          | …           | …     | …
2022-04-06 | 13          | 4     | 6
You may get asked to design something like this and explain your reasoning. This is where you
think back to the warehouse analogy. What makes sense to put in a dim table vs a fact table?
What makes sense to separate into two separate dim tables or two separate fact tables? Note: One
schema can have multiple fact tables.
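To see the star schema pay off, here is a hedged example of the kind of query an analyst might run against it, joining the fact table to two of the dim tables above:

SELECT
    r.region
    ,dd.week_number
    ,SUM(f.sales) AS weekly_sales
FROM core.fact_daily_orders f
JOIN core.dim_regions r ON f.region_id = r.region_id
JOIN core.dim_dates dd ON f.date = dd.date
GROUP BY 1,2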
Pro Tip
Before going into any interview, study the business!! Knowing how to code all the above
is great but understanding the business and how the data is related to each other (like a star
schema) will help you design beautiful data models. This is what will set you apart from the rest.
It might even be helpful to think about food delivery services that you use and start from there.
Solution: Food Delivery Company Data Modeling
You will notice that I did not draw out the tables and have just bullet pointed the solution.
This is for two reasons - A) in an interview you do not want to waste valuable time creating
tables and B) frankly, it saves space ☺. The problem here is a very complex one with a lot of
moving parts and during an interview you will want to focus on quality and not focus too much
on making it look pretty.
There is a lot to digest, so take a look at the schema outline solution below and then let’s
break it into pieces in the explanation section following that.
Schema Outline
1) Let’s start with the dim tables. What are the distinct objects? Those are your entities and
should be your dimension tables? This includes:
o Dim_deliverer
▪ Deliverer_id
▪ Full Name
▪ Region_Id
▪ (Other Relevant Demographic_info_cols)
▪ Account_creation_date - life of the Deliverer
▪ Last_login_date
▪ Number_of_trips
▪ L7_Number_of_trips (last 7 days)
▪ L28_Number_of_trips (last 28 days)
o Dim_customer
▪ Customer_id
▪ Full_Name
▪ Region_Id
▪ (Other Relevant Demographic_info_cols, like address)
▪ Account_creation_date - life of the customer
▪ Last_login_date
▪ Number_of_orders
▪ L7_Number_of_orders (last 7 days)
▪ L28_Number_of_orders (last 28 days)
▪ Premium_user (Y or N)
▪ Premium_user_start_date
o Dim_restaurant
▪ Restaurant_id
▪ Restaurant_Name
▪ Region_Id
▪ (Other Relevant Demographic_info_cols)
▪ Account_creation_date - life of the restaurant
▪ Number_of_orders
▪ L7_Number_of_orders (last 7 days)
▪ L28_Number_of_orders (last 28 days)
▪ Menu_items (a json)
▪ Opening_Hour
▪ Closing_Hour
▪ Food_Category (fast_food, healthy, etc)
2) Next, what are the fact tables? Recall that there could be multiple fact tables, and this
problem is the perfect place for that. Activity based tables deserve their own fact tables,
such as the following:
o Fact_deliverer_activity
▪ Session_id
▪ Deliverer_id
▪ Region_id
▪ Timestamp
▪ Activity_type -
login/update_account_info/available_to_pickup/accept_order
/cancel_order/arrive_to_restaurant/picks_up_order/drops_off_order
o Fact_customer_activity
▪ Session_id
▪ Customer_id
▪ Region_id
▪ Timestamp
▪ Activity_type -
login/update_account_info/available_to_order/places_order
/Cancels_order/receives_order/rates_driver/tips_driver
3) Are there any entities that interact often that you can then create a fact table from? We
know that the two fact tables above interact, so let's create one more fact table:
o Fact_order_activity
▪ Order_id
▪ Deliverer_id
▪ Customer_id
▪ Region_id
▪ Timestamp
▪ Activity_type - (places_order/driver_picks_up_order/
driver_drops_off_order)
4) Finally, is there any Dimension table needed for the fact table above? Yes. This concept
is called “creating a wide table” or “flattening out” the fact_order_activity table we see in
part 3.
o Dim_orders – have ONE RECORD per interaction (one record per order)
▪ Order_id
▪ Deliverer_id
▪ Customer_id
▪ Region_id
▪ places_order_time
▪ deliverer_accepted_time
▪ deliverer_picks_up_order_time
▪ deliverer_drops_off_order_time
Explanation
1) This section is related to dim tables. The way we go about deciding which dim tables to
create is by first thinking which “objects” exist. An object here is a part of the business flow that
will exist regardless of changes to the business. For example, if this food delivery company
decides to charge more, it won’t change the objects needed to run the business. You can see
that the fields are similar.
One field that stands out in the Dim_customer table is Premium_user. This is just a
Boolean field which will indicate whether the user is a premium user (Y) or not (N). Another field
to point out is Menu_items in the Dim_restaurant table. Note that I wrote that this field is a
JSON value. For example, restaurant A may have 4 menu items. Therefore, the value of this row
might look something like {menu_item_ids: 242,256,2457,823}. Can you think of another dim
table that can be created knowing this fact?
2) Each object can theoretically have its own fact table. I did not include one for the restaurant
object, but the concepts remain the same as for the Fact_deliverer_activity and
Fact_customer_activity tables. The one important thing to point out is that these tables are at a
Session_id, Customer_id (or Deliverer_id), Region_id granularity. Note that each value in the activity_type
column has its own timestamp. This will be important for the fourth part.
3) You might be wondering why a table like Fact_order_activity is necessary if in part two
we already have Fact_deliverer_activity and Fact_customer_activity. The concept is simple:
storage is cheaper than compute! Duplicating data is better than doing JOINs, and there is less
room for error from analysts and scientists incorrectly joining columns or writing expensive queries.
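The "flattening out" idea from point 4 can be sketched as a conditional aggregation over Fact_order_activity - one row per order, with each activity's timestamp pivoted into its own column (a sketch, not the only way to build Dim_orders):

SELECT
    order_id
    ,MAX(deliverer_id) AS deliverer_id
    ,MAX(customer_id)  AS customer_id
    ,MAX(region_id)    AS region_id
    -- pivot each activity's timestamp into its own column
    ,MAX(CASE WHEN activity_type = 'places_order'           THEN timestamp END) AS places_order_time
    ,MAX(CASE WHEN activity_type = 'driver_picks_up_order'  THEN timestamp END) AS deliverer_picks_up_order_time
    ,MAX(CASE WHEN activity_type = 'driver_drops_off_order' THEN timestamp END) AS deliverer_drops_off_order_time
FROM fact_order_activity
GROUP BY order_id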
PART 3: SQL
Set Up:
Go to coderpad.io and create an account – the free account is more than enough. It is also
good to familiarize yourself with coderpad for future interviews. For the first half of the metrics
via SQL section, you can use coderpad to test your code and get back actual results. The second
half will test for more advanced knowledge of SQL and won’t require coderpad.
Some tips - write down your answers before viewing the final solution. There are many
ways to answer the same question so make sure to jot down any solutions that come to mind. It
might surprise you to see those two potential solutions you thought would give you the same
answer result in different outputs.
Interview tips:
The following tips apply to both SQL and coding questions in general (python algorithms). I
cannot stress how important these tips are. Getting the right answer is important. But in the case
where your final answer is slightly off, the interviewer will weigh how you did with the
following other aspects of the interview. Pro Tip: This is where mock interviews are very
impactful.
● Ask Questions. Ask for Clarifications. Never Assume!
A lot of the time the interviewer will ask a question that is vague. They are testing to see if
you have the product and business sense to take a step back and ask clarifying
questions. The WORST thing you can do is start coding immediately. You should spend
around 5 minutes on this part.
● Ask for feedback on your approach before coding. Most importantly, listen to the
feedback.
For example, if you say, "this problem requires an OUTER JOIN because xyz", and the
interviewer asks why an outer join and continues to push you in a different
direction…TAKE THE HINT! These exercises are meant to be a team effort between you
and the interviewer. They want you to succeed, and if they start pushing you in a different
direction then trust them, take a breather, pause for 2 minutes, and think about another potential solution.
● Code Cleanliness and Clarity
As a tech interviewer I can assure you that the hardest part is trying to follow someone's
logic in real time. Whether it be SQL or Python, seeing someone else write code and
trying to figure out what they are doing and why is tough. However, it becomes even
tougher when students don't follow standard code etiquette. Make no mistake, you WILL
get points deducted if your code is not clean/commented/easy to read.
Average: Department’s Average Salary
Goal
Using coderpad, return the employee count of every department along with the average
salary in each department, like the table below. This question will be one of the simplest, but
don’t forget to read over the tips at the beginning of this section.
Column Name                  | Value       | Description
department_name              | Engineering | The name of the department
average_salary_by_department | $60,250     | The average salary per each department

* The numbers here are made up – do not expect them in the final solution.
Solution
SELECT
d.name as department_name
,SUM(CASE WHEN e.first_name IS NOT NULL THEN 1 ELSE 0 END) AS number_of_employees
,AVG(e.salary) average_salary_by_department
FROM departments d
LEFT JOIN employees e
ON d.id = e.department_id
GROUP BY 1
ORDER BY average_salary_by_department DESC
Output
Common Mistakes
● Doing a join instead of a left join. Remember the advice in the tips section at the
beginning of this chapter? Test the data and ask clarifying questions! The question
states "every department" and this is something that you may need to clarify – do they
also mean to include departments with 0 employees? Even if this question didn't come to
mind, seeing what the data looks like would have revealed that the biz dev department
has no employees. That would have allowed you to ask a follow-up question about
whether they want it to be included or not.
● A lot of people might do a "count(*)" instead of "sum(case…" to get the number of
employees. However, you get different answers because the left join still creates a row
and COUNT(*) will count the rows regardless of whether the employee columns are null or not. Test it out!
● For this exercise I am going to add what unclean code would look like. Note that even
though the code gets the correct answer, the below answer as presented will still result in
points getting deducted. Notice how much more difficult the below code is to read?
- SELECT is not capitalized
- The first column is on the same line as the select statement
- The second column is called x
- The main SQL syntax is not capitalized
- The departments table has an alias of z and the employees table has an alias of t,
which are just irrelevant
- The left join statement is all on 1 line
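Based on those bullets, the unclean version might look something like this (a hedged illustration – do not write code like this):

select z.name as department_name, sum(case when t.first_name is not null then 1 else 0 end) as x
,avg(t.salary) y from departments z left join employees t on z.id = t.department_id
group by 1 order by y desc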
Subqueries: Project Over Budget
Goal
Using coderpad again, return projects that are over budget based on the employees'
salaries. Return the full name of each employee in 1 column.
Column Name | Value        | Description
Full Name   | Chris Garzon | Full name of the employee in 1 column

* The numbers here are made up – do not expect them in the final solution.
Solution
WITH project_over_budget AS (
SELECT *
FROM (
SELECT
p.id
, title
, budget
,SUM(salary) total_salary
FROM employees_projects ep
JOIN employees e
ON ep.employee_id = e.id
JOIN projects p
ON ep.project_id = p.id
GROUP BY 1,2,3
) sub
WHERE total_salary > budget
)
SELECT
p.id
,p.title
,p.budget
,p.total_salary
,CONCAT(e.first_name,' ' , e.last_name) AS full_name
FROM employees_projects ep
JOIN employees e
ON ep.employee_id = e.id
JOIN project_over_budget p
ON ep.project_id = p.id
Output
Goal
Using coderpad once again, use a window function to return the highest and lowest
earner per project. Note: if a project only has 1 employee, only 1 employee should be returned.
Create a column called "rank" to specify whether this employee was the highest earner, lowest earner,
or both. Hint: Do not use the rank window function.
Column Name        | Value          | Description
title              | Geo Clustering | The name of the project
employee_full_name | Ana Garzon     | The full name of the employee
Output
Bonus Question: Why did I specify not to use the rank window function? Can you think of a
scenario where rank would have failed us?
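One possible approach (a sketch using MAX/MIN window functions instead of rank, and the same employees/projects tables as the previous problem – not necessarily the intended solution):

SELECT title, employee_full_name, "rank"
FROM (
    SELECT
        p.title
        ,CONCAT(e.first_name, ' ', e.last_name) AS employee_full_name
        ,CASE
            WHEN e.salary = MAX(e.salary) OVER (PARTITION BY p.id)
             AND e.salary = MIN(e.salary) OVER (PARTITION BY p.id) THEN 'both'
            WHEN e.salary = MAX(e.salary) OVER (PARTITION BY p.id) THEN 'highest'
            WHEN e.salary = MIN(e.salary) OVER (PARTITION BY p.id) THEN 'lowest'
        END AS "rank"
    FROM employees_projects ep
    JOIN employees e ON ep.employee_id = e.id
    JOIN projects p  ON ep.project_id = p.id
) sub
WHERE "rank" IS NOT NULL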
Goal
Using coderpad once again, for each project id return the start date and end date. Then,
for each project, return 3 new results: 1) the days passed between that project and the next
project's start date, 2) the days passed between that project and the next project's end date,
and 3) the days passed between that project's end date and the next project's start date. In
this case, define "next" by ordering by start date. As a challenge, do not use any subqueries.
Solution
SELECT
id
,start_date
,end_date
,start_date - LAG(start_date, 1) OVER (ORDER BY start_date) as start_start
,end_date - LAG(end_date, 1) OVER (ORDER BY start_date) as end_end
,start_date - LAG(end_date, 1) OVER (ORDER BY start_date) as end_start
FROM projects
Output
Bonus: Can you do it using whichever function you did not use? (lag or lead)
Joins: Segment by Delivery Membership
Goal
For this question, coderpad is not needed and no datasets are available - simply write out
your solution. Imagine you have two datasets: grubfub_dataset and doorfash_dataset. Each
dataset contains a column titled user_id and the values are a string of numbers to identify a user.
If user_id 1234 is in the grubfub dataset, it means that this person has a grubfub membership. If
the user is in both datasets, it means the user has both memberships. Write a query to return
something like the output below - the percent of users that have only a grubfub membership, the percent
that have only doorfash, and the percent that have both.
Output
Column Name | Value | Description
Solution
SELECT
    100*SUM(CASE WHEN d.user_id IS NULL THEN 1 ELSE 0 END)/COUNT(*) AS g_only,
    100*SUM(CASE WHEN g.user_id IS NULL THEN 1 ELSE 0 END)/COUNT(*) AS d_only,
    100*SUM(CASE WHEN g.user_id IS NOT NULL AND d.user_id IS NOT NULL THEN 1 ELSE 0
    END)/COUNT(*) AS both_memberships
FROM
    (SELECT DISTINCT user_id FROM grubfub_dataset) g
FULL OUTER JOIN
    (SELECT DISTINCT user_id FROM doorfash_dataset) d
    ON g.user_id = d.user_id;
Business Need: Create Histogram
Goal
Similarly, no need for coderpad here – just write out your solution. You work at TikTok
and you are tasked with making a histogram and representing it in Tableau. However, Tableau
can only hold so much data. You decided to use your SQL skills to preprocess the data and save
space.
Imagine the dataset below has millions of rows. The goal is to write a query that returns a
table that mimics a histogram: in the last 30 days, what does the distribution curve (a
histogram) look like for how many videos a user posts? Your result should look something
like the output dataset.
Dataset

Column Name | Value | Description

Output

Column Name | Value | Description
Solution
SELECT
video_count
, COUNT(video_count) count_frequency
FROM
(
SELECT
user_id
,COUNT(video_id) video_count
FROM fact_daily_tiktok_videos
WHERE date >= CURRENT_DATE - 30
GROUP BY 1
) user_videos
GROUP BY 1
Before Covid it was common to have whiteboarding sessions at Big Tech companies. A
whiteboard session is when you would show up in person, write out your code, and explain your
thought process out loud, all while the other person was in the room following along. Other than
being unnatural and not representative of how you normally code, the biggest drawback to this
method was that candidates couldn't test their code iteratively like you do on a laptop.
In a way, Covid made coding interviews easier by allowing candidates to test their code
against a database. Recently, however, some companies have started doing “virtual whiteboard”
sessions (25% of them, including Amazon). The way it works is they send you a screenshot of an
input and output sample dataset and ask you to type your solution on a virtual whiteboard. There
is no database attached, and no result to test against.
If you completed the Joins: Segment by Delivery Membership and Business Need:
Create Histogram problems, then you will notice that those cannot be tested against an actual
dataset on coderpad. Below you will find more questions like the above, but with more detailed
sample input and output datasets.
1) Dataset Input/Output #1
Solution
WITH T_ORDERS AS (
SELECT
order_date
, product_code
, SUM (quantity) order_quantity
FROM fact_daily_orders
GROUP BY 1,2
)
SELECT
order_date
, o.product_code
, product_category_code
, order_quantity
FROM T_ORDERS o
LEFT JOIN dim_product p
ON o.product_code = p.product_code
2) Dataset Input/Output #2
Solution
There are a few ways to do this with one very optimal method. If you bought the premium
version, please go ahead and email me your solution and I can provide some feedback.
Section 6: Table Creations
Below is the code for how you would create a fact_daily_order table. Don't worry, you
won't be asked to create tables during the interview, but just in case, it's helpful to see the code at
least once.
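Something along these lines (a sketch in Redshift syntax – the exact column names and types are assumptions):

CREATE TABLE core.fact_daily_orders (
    date                  DATE         NOT NULL,
    user_id               INTEGER      NOT NULL,
    order_id              VARCHAR(32)  NOT NULL,
    sales                 DECIMAL(12,2),
    number_of_units_sold  INTEGER
)
DISTKEY (user_id)   -- how rows get spread across servers
SORTKEY (date);     -- how rows get ordered within each server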
Notice that you need to specify the data types. This helps in avoiding getting different
data types in the same column which would be a nightmare – imagine trying to sum up a
date…Since you most likely won’t get tested on this, let’s move on. In section 8 we will go over
those last two lines you see – the distkey and sortkey identifiers. For now, all you need to know
is that those are ways to organize your data in an optimal manner for quicker queries.
PART 4: System Design
Section 7: Data Storage: S3
System Design is an important part of the Data Engineering job. If you are a recent
college grad (or grad intern), chances are that companies will weigh your technical skills more –
i.e. SQL and Python. However, if you show that you are familiar and comfortable with the
different aspects of the System Design skill, then it will stand out big time. Section 10 will go
over System Design examples and section 11 has some system design practice problems. But
first, let's go over some important parts in the next 3 sections - let's begin!
Sticking with the Amazon Web Services (AWS) world since it is the most popular, let's
discuss S3! S3 is the data lake service in AWS. Anyone can create an AWS account and use the
tools within their account. Each tool has a different use case and cost structure, but S3 is by far
the most popular. What is a data lake? Think of a data lake as a cabinet with virtually unlimited
size for anyone to store their data. The data can be structured (an excel file) or unstructured (a music
file), and a single object can go up to 5 TB in size. As of 2021, Amazon holds over 100 trillion
objects in S3!
If you are not familiar with data sizes, you should familiarize yourself with the basic
concepts. Everything on your computer is of a certain size, which can be seen next to your file in
Finder. You can also click on the Apple logo on the top left, click on the "About This Mac"
button, and click on Storage.
You can see that there is about 405 GB left on this computer for me to store files. The
orders of magnitude are 1 Byte, Kilobyte (1,000 Bytes), Megabyte (1,000 Kilobytes), Gigabyte
(1,000 Megabytes), Terabyte (1,000 Gigabytes), Petabyte (1,000 Terabytes). Big Tech companies
operate at the petabyte level.
Why is this important? Because as a data engineer that needs to manage S3 buckets, you
will be responsible for forecasting costs before launching a product. The important things you will
need to know for the interview are as follows: 1) S3 is normally the first stop for your data, and
loading data into it is normally done with Python 2) it is typically good practice to store raw data in S3
regardless of use as it is extremely cheap to do so 3) the cost and different cost structures. Frankly, you will
never be asked for any specifics regarding costs, but it's good to familiarize yourself with the
basic structure and keep certain tiers in mind. For example, you can see that S3 Standard charges
$0.023 per GB per month for data storage. However, they only charge $0.004 per GB per month
for Glacier, which is S3's long-term storage solution where you can place files that don't require
immediate access (access once a quarter, say). The tiers are interesting as they can dramatically reduce
costs if you are familiar with your business use case and know you won't be needing your data
immediately. The full cost outline can be found on the AWS S3 pricing page.
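Since landing raw data in S3 is "normally done with Python", here is a minimal sketch using the boto3 library (the bucket and key names are made up):

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="daily_orders_2022-09-01.csv",    # local raw extract
    Bucket="my-company-raw-data",              # hypothetical bucket
    Key="orders/2022/09/01/daily_orders.csv",  # date-partitioned prefix keeps the lake organized
)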
Section 8: Data Warehouse: Redshift
In the database modeling section of this book, we learned about fact and dim tables and
how to model schemas. Those schemas exist in a database, and it is good to get familiar with the
different types of databases. A database is simply a way that data is organized and set up for a
user to retrieve. Most individuals are familiar with Excel (rows and columns), but in the database
world there are many different types of databases and two classes of databases – SQL and
NoSQL.
NoSQL databases are for unstructured data. When you think "unstructured data", it's
a good idea to think of JSON code like the example below.
{
  "id": "0001",
  "type": "donut",
  "name": "Coco Donut",
  "calories": 455,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      { "id": "1003", "type": "Blueberry" }
    ]
  },
  "topping": [
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5007", "type": "Powdered Sugar" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" }
  ]
}
Unlike the queries you were running in the Metrics Via SQL section, you would pull values here
via the key-value method. In other words, requesting "id" would get you "0001". A good example is
DynamoDB – a key-value NoSQL database. DynamoDB is good for "horizontal scaling" and fits
well with real-time applications where data is being streamed at a fast and high pace.
A SQL database is like what you saw in the database modeling section earlier in this book.
SQL is for structured data and multi-row transactions. Redshift is a very popular example
and I think it is worth going over Redshift a little more in depth as it is another tool you can add
to your system design come interview time.
We know that Redshift is for structured data, but what exactly does this entail? For starters,
we saw that we could store our unstructured data in S3. This is where your SQL and ETL
(extract, transform, and load) skills come in handy. If you have JSON code like the one above,
you can use SQL to manipulate and load (ETL) your data into Redshift where analysts and data
scientists can do their job with greater ease.* Redshift is also a columnar database. This means
that when you query a table in Redshift, your data will be optimized to do aggregates by column.
For example, summing total sales in the sales column of a table will run quickly…if you set
up the table correctly.
Let’s look at the last few lines of the create table code from section 6.
What exactly does distkey (another name for this is partitions) and sort key mean? This is
VERY essential to correctly doing your job as a data engineer and having happy coworkers.
Once again, remember that interviewers will most likely not ask you to create a table, but that
doesn’t mean you shouldn’t mention these concepts.
The analogy that I like to use is that of multiple highway lanes (which are the servers)
and the cars are the data columns. The distkey allows us to evenly split the traffic amongst
highway lanes as opposed to call cars trying to use the same 1 lane. Specifying the distkey as
user_id helps speed up querying significantly because the data for 1 user_id will be one “lane”.
Therefore, if you filter for user_id with a WHERE clause, then your query only scans that one
server instead of checking every single lane.
The sortkey sorts the data within each distkey. For example, we know every row for
user_id 123 might be on server 1. Once we apply the sortkey of date, then the way those rows
pertaining to user_id 123 get sorted is by the date (in this example). This means that if we have
data starting from the year 2012 but you also filter by date > “2022-01-01”, then you can avoid
scanning 9 years of data.
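Assuming the table was created with DISTKEY(user_id) and SORTKEY(date) as sketched in section 6, a query like this is the kind that benefits:

SELECT date, SUM(sales) AS total_sales
FROM core.fact_daily_orders
WHERE user_id = 123           -- stays on the one "lane" (slice) that holds user 123
  AND date > '2022-01-01'     -- sorted data lets Redshift skip the older blocks
GROUP BY date;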
During the interview you will be asked to justify why you chose a certain type of
database. For the most part, saying you chose Redshift because it is a SQL database, the sort and
dist keys speed up querying, and you don’t need data updated in real time (it can be updated
every 10 minutes to 24 hours) should suffice. But again, remember to talk about the tradeoffs!
*Bonus Question: Can you design what the table would look like after the JSON above gets
turned into a Redshift table?
Section 9: Compute
We have gone over Data Modeling, Storage, and Databases. Even though interviewers
will never ask you to go into specifics, I want to briefly discuss one other component that is worth
understanding the fundamentals of - compute. Compute is to a computer what the brain is to a
human. Your brain receives information every day and processes that information to then make
decisions. Your computer has limited CPU, and its usage can be seen in the Activity Monitor app, like the
screenshot below.
We can see which applications on our computer take up the most CPU percentage-wise. If
you were to run a machine learning model on your computer that takes up the remaining CPU,
then chances are the other applications on your computer will slow down. The worst-case scenario
is that your entire computer crashes. In the old days companies used to set up server (another
computer, essentially) farms manually, but thanks to the cloud every tech company can now use
virtual servers and completely avoid ever having to touch hardware. The service for virtual
servers on AWS is called Elastic Compute Cloud (EC2). Like S3, Amazon allows you to choose
different servers for your workload depending on how much compute (processing power) you
need. If you really want to get your hands dirty, you can spin up a small EC2 server that costs
around 2 cents an hour for the "on-demand" tier. Just make sure to turn it off, since it can rack up
costs if you forget!
Like the previous two sections, the specifics of which EC2 server you would need are not
important for passing a data engineer interview. Understanding the fundamental concepts is
much more important. However, I will briefly mention 3 topics that could be mentioned in the
system design interview, and they would make a strong impression as new grads are typically not
expected to understand these concepts.
● Scaling: There are two types of scaling – vertical scaling ("scaling up") in which
more compute is added to current servers, and horizontal scaling ("scaling out")
in which more servers are added to the family of servers that are processing your
application. You might require vertical scaling if you want to run 1 machine
learning model to get results 1 time and it has a lot of data to process. Horizontal
scaling is typically what tech companies do because it is not resource constrained.
For example, if you are Twitter and you suddenly start receiving a lot of traffic
due to a pandemic that has shocked the world, then as a company you don't want
your application to not load because of this unexpected traffic. Chances are, you
might even need a tool like Elastic Load Balancing (ELB) to evenly distribute this
traffic amongst your many servers. Knowing these two concepts and how they
apply to a business will be beneficial for your interviews.
● Caching: Caching keeps frequently requested data close at hand so that it does not
have to travel all the way from point B to point A. Redis is a very popular open-source
example. This ultimately provides the user a better customer experience, and the
Twitter results get returned faster.
● Abstraction/ETL Tools: One good thing to know about the AWS world is that a
bunch of tools, such as Glue, EMR, Airflow, and even Redshift, are all operating
on top of EC2 servers. At the end of the day, regardless of what tool you are using,
you will require compute! It is good to know these tools exist in case you want to
play around with them on your own. Again, an interviewer will never ask you to
demo them, but the thing that matters is that you know that some tool
must be used to process the data in your pipelines. In the next section you will see
how it all fits into the whole environment.
Section 10: System Design – Case Studies
Now that we have the foundation for data infrastructure (storage, compute, and
databases), let’s put this all together to see how they relate to one another. We will go over an
example from Robinhood and then go through an example of what a system Design question
might look like. It’s important to note that you will never be asked to code anything with regards
to this topic – there isn’t enough time and that’s what the SQL and python questions are for.
However, you might be asked to draw something out. The next section will have some practice
problems for you.
You’ll notice that Robinhood broke up their data infrastructure into multiple sections:
Ingestion, Storage, and Processing, Validation, and User Facing Layers. You can also see that
Robinhood uses a lot of the tools that we have already discussed: S3 to store their data and at one
point Redshift as the data warehouse, which you can see sits in the intersection of the Storage
and Query Layer. Let’s go over each layer to discuss the tools we haven’t discussed yet. More
importantly, pay extra attention to what might be asked in an interview that isn’t included in this
article.
Ingestion: This section is important because it contains a lot of new information, and how you
answer will be very dependent on the question that your interviewer asks. Robinhood points out
Kafka. Kafka is an open-source tool that lets you stream data in real time. Think of this as the
water pipeline in your home. When you turn on your faucet you expect water to stream through
immediately. This may not be a necessary requirement for every company, so you need to think
through your specific business use case. For example, a financial company that is running ML
models on historical data may only need to update their data once a day. For this use case, that
financial company might just get their data through an FTP server or have their data provider
drop the data directly into their S3 bucket.
The AWS equivalent of Kafka is Kinesis. Kinesis is also a real-time streaming tool, but the
difference is that it is in the AWS ecosystem and integrates well with other AWS tools (like S3).
Kinesis is quicker and easier to set up. So why did Robinhood choose Kafka? Because it's
cheaper! Sure, it takes longer and more engineering time to set up, but because Kafka is open
source and Robinhood has the budget for engineers, it was probably in Robinhood's best interest
to spend the time to benefit from the cost savings in the long run.
This is what the interviewer is looking for from you: they want to see you discuss your
rationale for using real time (business use case) and the tradeoffs between what you did
(Kafka) and didn't do (Kinesis).
Storage: This one is straightforward. When you think storage for raw data, think S3. Notice how
Robinhood even mentions Glacier, which we previously discussed. Robinhood also placed
Redshift in the intersection of Storage and Query, which is correct because if you recall, Redshift
has a storage and compute layer to it.
Processing: This is also an important section containing new things. Glue and Airflow are two
tools available on AWS that allow you to do ETL and create your pipelines. So, when you think ETL, most
likely there is some "processing" that is involved. To transfer data from your S3 bucket into
Redshift, you might need to get the data into the right format, as we discussed in the Compute
section. While Airflow and Glue are different tools, Amazon uses EC2 as the servers in the
background! Like the notes about the ingestion layer, your job is to discuss the tradeoffs between the
two (Robinhood happens to use both).
Glue is cheaper and "serverless", meaning you don't need to do any work configuring
how much compute power you are going to use. Airflow is an orchestration tool that visualizes
your entire pipeline. For example, if your pipeline has 10 steps and it fails on step 6, you can first
see where the error happened, then fix the error and rerun the rest of the pipeline starting
from the point of failure. Airflow is more expensive and requires a little more configuration. Either
way, chances are you will need to do some sort of ETL, and the important thing here is that you
mention this during the interview.
Query: The query layer is not something you need to worry about. It is simply setting up a tool so
that users can query databases. SQL Workbench is one example of an open-source tool, but most
likely your company will already have a tool.
Validation: The Validation Layer can occur in a lot of tools. Airflow, which was discussed in the
processing layer, is a common one. It's an important concept as we have discussed before – you want
to make sure you have checks to identify any problems with the data. You should mention that
somewhere in your design, but it is not necessarily its own section. It just happens along the way.
User Facing Layers: If you are not familiar with JupyterHub or Looker, don't worry. Like the
query layer, these are tools that are easy to set up and are simply ways for your coworkers (data
analysts and data scientists) to interact with your data warehouse. Interviewers probably won't ask
you about this, but it's good to know in case you want to mention it. Tableau dashboards are
another way for data scientists to show their work and analysis to the rest of the team.
2) Outline of a System Design Question
Not all companies do what I am about to show you, but I think it’s important that you get
familiar with at least the format of what a system design question can look like.
In the next section, there will be 2 system design questions. I want you to imagine that
you have a blank word doc with 4 quadrants and each quadrant has one of the following
categories: Notes, Schema Design, System Design, API.
For this book, don't worry too much about the API section. Data Engineers won't have to
worry much about it. It is essentially just "if the user asks us for x, what do we return to them?" In
the Robinhood example, a user might ask us for the price of Amazon stock, and we would return
them $130, for example.
What is important are the other three sections. When an interviewer asks you to design a
system (like Robinhood's), the first place you might head to is the Notes section. This is where
you ask your questions, take notes, write down the question that is asked, etc. Remember the tips
from section 5? Apply those tips! Make sure you understand the requirements before you start
drawing arrows like you see in the Robinhood article.
Once you have a clear understanding of the question being asked, head over to the
Schema Design section and start drawing out the tables you would want to include in your
system. Whether virtual or on a whiteboard, you won't need to draw out an entire star schema like
you did in section 4, but including some tables (both fact and dim) with the appropriate columns
and the logic to justify them will be more than sufficient.
Next is where you head into the System Design quadrant. This is where you draw
something along the lines of what you saw in the Robinhood example. Explain what data goes
into S3, how you used a real-time tool to get it there (if you needed to), and how it eventually got
into Redshift to create the tables that are written out in the Schema Design quadrant. Again,
don't worry about making it pretty! As long as you can explain your rationale for each part, the
interviewer will like what they see!
A) A very common question is the tinyURL problem. A tinyURL service is one where a user
submits a URL and gets back a shortened version of that URL. How would you design that
tinyURL service?
B) You work at a music streaming company called Stopify. You have access to all the data that
any music streaming service would have. With every song that a user listens to, the data scientists
on the team can use machine learning to learn quickly and give song recommendations.
However, they need the proper data and infrastructure to make their ML work. How would you
design a recommendation engine?
Note: if you bought premium package and do not have experience with AWS, contact the author
to receive an ETL project and get some hands-on experience with a bunch of AWS tools.
Section 12: Product Improvement Via Data Engineers
This section will contain generalized, real-life examples of how a good data
engineer can add tremendous value to the business. I want to stay away from specific companies
because the important thing is to see the patterns for how data engineers are crucial to
a company's mission, regardless of the industry.
Company TLQ is a manufacturing company for industrial tools that collect data and send it
back to a central server. TLQ has created these machines to also have sensors that collect data
such as voltage. When dealing with something like voltage fluctuations, you can imagine that
it is important this company delivers machines that are reliable. This company's dashboards
get updated every 3 days due to it not being able to process all the data quickly enough, but
management is worried that this isn't fast enough. The potential cons of not speeding this process
up are:
- Product recall of its machines, at a potential cost of millions
- A safety hazard for customers
- Employees having to go to multiple dashboards to track different factory productivities.
As a data engineer, you realize there is a better way! Using your knowledge from the system
design section and seeing that TLQ doesn't quite need real-time data, you realize that you can send
the data to S3 every hour and then use Glue to manipulate the data and ingest it into Redshift every
hour. Now that the data exists in Redshift and is partitioned and sorted by day AND hour, you can
write efficient queries to update your dashboards every hour. The company went from
having dashboards updated every 72 hours to every 1 hour, a 90%+ improvement! Now company
TLQ can catch product malfunctions before shipping their machines.
Whether it is Tableau, Looker, or QuickSight, companies rely on dashboards to be able to
visualize trends and patterns in real or near real time. Teams can then quickly make decisions
based on data. The more frequently the data gets updated, the better. With that said, real-time updates
are more expensive and not every company requires them. TLQ probably doesn't need to receive
data every minute stating "everything is functioning as normal". Every hour will probably suffice
since the machines might get shipped out once or twice a day. And this small change of updating
the dashboards every hour instead of every minute will end up being 60 times cheaper! (60
minutes every hour)
Automation (Data Quality)
A company called Auto has launched worldwide and, with that decision, now maintains
multiple websites (.com, .eu, .uk, etc.). Before launching worldwide, Auto was only in the US and
employee X was manually gathering the data related to the performance of the website and the
search engine. Using Excel, employee X was able to spend 30 minutes manually creating charts
and updating the data. However, after launching worldwide, employee X realized that this process
would now take him 2 hours a day. At first, employee X was able to focus extra hard and get this
done in closer to 1 hour. But due to the speed at which this employee was operating, a lot of
mistakes were being discovered. Sometimes files for a region didn't contain data. Other times,
there were duplicates, and a session_id (the session of the user browsing a website) appeared in
more than 1 region when in reality each region_id and session_id should be 1-1. Additionally,
after 1 month this was taking closer to 3 hours due to an increase in the website's popularity.
As a data engineer, you realized you could solve this problem using Airflow. Using
Python and APIs, you placed the raw data for each website in its own S3 bucket and then used
Airflow to create pipelines to merge all the data into 1 Redshift table. Using Airflow and Python
scripts, you created SQL-like checks to ensure the data quality is accurate. The following checks
ensure that A) there are actual values getting returned and B) there are no duplicates.
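Checks along these lines would do the job (a sketch – the table and column names are assumptions):

-- A) Actual values are getting returned for yesterday's load
SELECT COUNT(*) > 0 AS has_rows
FROM web_performance
WHERE load_date = CURRENT_DATE - 1;

-- B) No duplicates: each session_id appears exactly once
SELECT COUNT(*) = COUNT(DISTINCT session_id) AS no_duplicates
FROM web_performance
WHERE load_date = CURRENT_DATE - 1;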
The 2 queries above should always return True. Using Airflow, you can set up alerts that
get triggered and send the on-call engineer a notification if False was returned. Moreover, you
have written the ETL pipelines to account for any future regions that your company decides to
launch in. Thanks to your efforts, you have now helped your company automate hours of manual
reporting, catch data quality issues automatically, and scale the process to any new region.
Caching (Scaling)
A sports company, Jike, occasionally releases limited edition sports gear – anything
from shoes to jerseys. When they launch new releases, the website tends to lag in the first ten
minutes due to the immediate influx of superfans going to visit the page. Due to the popularity of
certain players at any given time of the year, it is incredibly hard to predict exactly how much
traffic will come to the site. All that is known is that there will be a spike in traffic.
The data engineer on the team decides that caching the data is the best solution. Using
Redis and caching the data related to the product being released, the data engineer can account
for 98% of the traffic on any given day. The outcome is that Jike realizes that they have been
paying for unutilized resources, which leads them to consolidate resources and in turn:
- Save money on resources overall
- Scale the website without having to predict traffic spikes
- Provide an incredible customer experience.
Being the good data engineer that you are, you not only have set up a caching mechanism for
the next popular item to go on sale, but you have also set up a way for software engineers to handle
future cases in code. That way, if a basketball player has a record night during a professional game
and their sales unexpectedly start to spike – code can automatically cache that webpage using the
Redis system you have set up for them.
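A minimal sketch of what that caching layer might look like with the redis-py client (the key names and 10-minute TTL are assumptions):

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def get_product(product_id, fetch_from_db):
    """Return product data from the cache if present, otherwise fall back to the database."""
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit - no database trip
    product = fetch_from_db(product_id)      # cache miss - hit the database once
    r.setex(key, 600, json.dumps(product))   # keep it warm for 10 minutes
    return product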
A restaurant company called Johnny’s has been recently scaling and opening multiple
stores across the country. However, management is worried that as they scale, the quality of their
food will decline. Executives have decided to invest in some machine learning to ensure the
quality of this good. Specifically, they want to use Machine vision to see if the food is of the
highest quality. They tested the product in 1 store and found that the technology worked! The
models were streaming data into S3 and occasionally, an engineer in the office would manually
run the python script to train the model to become smarter.
After launching however, executives realized that they made a crucial mistake – they
didn’t account for scale. They thought that since it worked for 1 store, it would work for all of
them. Instead what they found was that there was an increased cost to run an ML for every store,
a lot of technical debt from the fact that this ML model was very scrappily put together for one
store, and a higher security risk and each store represented a different engineer potentially
overwriting data in the S3 buckets.
You decide to come in and use your knowledge of S3, Airflow, and SageMaker to fully
automate this process. SageMaker is a plug-and-play environment that allows you to
productionize your machine learning models and deploy them at scale. With your knowledge of S3
and data lakes, you can ensure via Python script labeling that each store will get its own store_id
and no overwriting will occur. Finally, using Airflow, you can schedule the machine learning to
run as often as you want, and now the ML model is returning results faster and with more
accuracy. You have now saved engineers from doing manual runs, helped the model learn faster
(ultimately saving on tech debt), and ensured that there is data integrity with regards to
storing the data in the data lake.
PART 5: Algorithms
Section 13: Algorithms Via Python
Another important part of the data engineer interview is your Python skills. For this
section, go back to coderpad and this time select Python from the drop-down menu. Typically,
companies don't care whether you use Java, Python, or some other coding language, but for data
scientists and data engineers, Python is normally the most popular one. The Algorithms section
is like the SQL section in that they are testing your Python knowledge, thought process,
communication, and computer science fundamentals.
If you are unfamiliar with Python, I recommend Codecademy as they have plenty of free
material. You may have also heard from your fellow engineers about the holy grail of
Software Engineering books – Cracking the Coding Interview. I think these resources are
amazing, but remember that there is a lot of material in these resources that was meant for
software engineers and frankly is a bit overkill when it comes to studying for a Data
Engineering interview. There is nothing wrong with learning a bit of Python now, and then
mastering it on the job. Most tech companies are understaffed when it comes to junior engineers,
so they do not mind taking on this risk as long as you show you have some knowledge and
understanding of how coding works.
Another thing to keep in mind is that for data engineers, companies will typically ask
"easy" Python questions and very rarely ask "medium" difficulty level questions. The medium
level questions are normally for more senior level engineers, and "hard" is almost always for
software engineers. This is another reason I do not want you to just search for Python problems
online – you will just end up overstudying. With that, let's go over some tips, some computer
science fundamentals, and finally some practice problems.
Tips
The same tips from the beginning of Section 5: Metrics Via SQL absolutely apply. I will briefly
go over them again here, but reread that section.
● Never assume! Ask questions and ask for clarifications on what they are asking for – they
are normally vague on purpose.
● Talk out loud! The same concept applies – the interviewer is your friend and wants you to
do well, but he/she cannot read your mind.
● Test iteratively! The worst thing you can do is write out your entire code only to figure
out there are 10 syntax errors and then have to spend half your time finding them; avoid this by
testing bit by bit.
● Clarity! Like the SQL section, make sure your code follows common Python
standards and is easy to read. Make the life of the interviewer easier and slow down to
make sure the code is readable and co-worker friendly. When in doubt, use comments.
There is one part of the Python interview that differs from the SQL interview questions –
edge cases. Edge cases are scenarios in your Python questions that are rare but still essential to
testing your logic and coding skills. My advice here is to acknowledge edge cases in the "never
assume" portion of your interview, and then come back to them at the end after you have a
working solution for most of the scenarios. Sometimes time might run out, and while the
interviewer might deduct some points for not getting to the edge case, it is not nearly as bad as if
you had started with the edge case.
Time Complexities: In computer science, the time complexity of an algorithm (also known as its computational complexity) describes the amount of computer time it takes to run. In other words, it is a way to measure an algorithm's efficiency, and it is written in "Big O notation". Some of the common time complexities are O(n!), O(n^2), O(n), O(log n), and O(1). Look at the image below; if there was an algorithm that ran at each complexity shown, which one do you think would be considered the most efficient?
If you answered O(1), that is correct. This one is known as constant time. In other words, whether your input is 1 number long [1] or 100 numbers long [1, 2, …, 100], the compute needed is still the same. Don't worry, you will see examples in the following section.
Typically, software engineers or data scientists are the ones asked to optimize for time complexity, but as a data engineer you will stand out if you are able to answer these questions as a bonus part of your Python interview (and as a regular part of your interview if you are applying for a more senior role).
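To make this concrete, here is a small sketch (my own example, not one of the book's exercises) contrasting the two extremes. Checking membership in a Python set is O(1) on average no matter how large the input gets, while scanning a list is O(n) because Python walks it element by element.
numbers_list = list(range(1_000_000))
numbers_set = set(numbers_list)

# O(n): walks the list until it finds a match (or reaches the end)
print(999_999 in numbers_list)

# O(1) on average: hashes the value and jumps straight to it
print(999_999 in numbers_set)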
Data Structures
List
Lists are mutable data structures in Python that are defined by brackets. ['tree', 'dog', 7, '2345'] is a list – notice that a list can contain different data types.
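Here is a quick sketch of what "mutable" means in practice – you can change a list in place after it has been created:
my_list = ['tree', 'dog', 7, '2345']
my_list.append('cat')    # add a new element to the end
my_list[1] = 'fish'      # replace an existing element in place
print(my_list)           # ['tree', 'fish', 7, '2345', 'cat']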
Array
An array is like a list, but it contains values of the same data type. You can also index into arrays and do math with the values, as in the example below.
array1 = [1, 2, 3, 4]
array2 = [1, 2, 3, 4]
print(array1[1] + array2[2])
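Note that the snippet above adds two individual elements (array1[1] is 2 and array2[2] is 3, so it prints 5). Plain Python lists do not support element-wise math – adding two lists concatenates them. If you want true array math, the usual tool is NumPy; a minimal sketch, assuming NumPy is installed:
import numpy as np

array1 = np.array([1, 2, 3, 4])
array2 = np.array([1, 2, 3, 4])
print(array1 + array2)   # element-wise addition: [2 4 6 8]
print(array1 * 2)        # multiply every element: [2 4 6 8]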
Dictionary
When you think dictionary, think of "key: value" pairs. A dictionary can be created using the code below. If I wanted to get the year, I would run print(thisdict['year']).
thisdict = {
"item": "home",
"type": "condo",
"year": 1964
}
Set
A set is an unordered collection of unique values, defined with curly braces – for example, {'tree', 'dog', 7}. Duplicates are automatically removed, and membership checks (value in my_set) are fast.
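A short example of set behavior:
my_set = {'tree', 'dog', 'tree', 7}
print(my_set)            # the duplicate 'tree' is dropped (order is not guaranteed)
print('dog' in my_set)   # fast membership check -> True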
Stack
When you hear the term stack, think LIFO (Last In, First Out) and a stack of plates. Like with a stack of plates, the idea is that you can push a value onto the stack (place a plate on top) and pop a value off the stack (remove the top plate) – LIFO!
stack = []
stack.append('a')
stack.append('b')
stack.append('c')
print('Initial stack')
print(stack)
# pop the value
print(stack.pop())
print('Stack after pop')
print(stack)
Queues
When you hear the term queue, think FIFO (First In, First Out) and a line of people. Like with a line of people, the idea is that you can enqueue a value (a person joins the end of the line) and dequeue a value (the person at the front of the line leaves) – FIFO!
queue = []
# Add elements to the queue
queue.append('1')
queue.append('2')
queue.append('3')
print("Initial queue")
print(queue)
# Dequeue the first element (FIFO)
print(queue.pop(0))
print(queue)
Hash Table
A hash table is a data structure that stores values with keys associated with them. Because of this, lookups are efficient if we know the key associated with the value (for example, using Python's in operator, since Python dictionaries are implemented as hash tables). You will see a good example in the following section.
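Since Python's dict is backed by a hash table, a tiny sketch of that fast lookup looks like this (the values here are made up purely for illustration):
seen = {}          # key -> value
seen[6] = 6
seen[5] = 5
print(5 in seen)               # average O(1) lookup by key -> True
print(seen.get(9, 'missing'))  # safe lookup with a default -> 'missing'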
Exercises
Given a string, write a function to count the occurrences of each word in that string.
Return the answer in the form of a dictionary. Don’t worry about time complexity for this one.
Input: 'I hope you are enjoying this book. Feel free to email me any suggestions for how to
improve this experience.'
Output: {'free': 1, 'Feel': 1, 'for': 1, 'to': 2, 'I': 1, 'experience.': 1, 'suggestions': 1, 'book.': 1, 'any':
1, 'this': 2, 'are': 1, 'how': 1, 'me': 1, 'you': 1, 'enjoying': 1, 'email': 1, 'hope': 1, 'improve': 1}
Solution #1:
def word_counter(text):
    counts = dict()
    words = text.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

print(word_counter('I hope you are enjoying this book. Feel free to email me any suggestions for how to improve this experience.'))
If you haven't already, run this code on CoderPad and look at the output. An important thing to note here is that we were asked to return a dictionary, which is why dict() was used on line 2.
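As a side note – not something the interviewer will require, but worth knowing – Python's standard library has collections.Counter, which does the same word counting in one line:
from collections import Counter

sentence = ('I hope you are enjoying this book. Feel free to email me any '
            'suggestions for how to improve this experience.')
print(dict(Counter(sentence.split())))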
Question 2: Two Sum
Given an array of integers and a target integer, return the pairs of numbers that add up to the target integer.
You may not use the same element twice in one solution – i.e., if the target is 4 and there is only one '2' in the array, then [2, 2] is not a valid solution.
Input:
Array: [6, 5, 9, -2, 13, 10]
Target: 11
Output:
(6, 5)
(-2, 13)
Solution #2a: Two Sum
Right off the bat you should have noticed something nice I did for you…I answered an
assumption for you when asking the question! Remember, it is not guaranteed that the
interviewer will tell you not to assume something. In fact, chances are they won’t because they
want to see if you can stop, take a breath, and ask that question yourself.
The solution below is probably closer to what you initially did. This gets you the right
results! However, this function is at a O(n^2) complexity. Why? Because as we can see, for the
length of the array [len(array)] we are iterating through the array twice. For each i, we iterate
through a j and do the sum. This grows at a power of 2 as the length of the array gets longer. Is
there a way to do this at O(1) time complexity?
for i in range(len(array)):
    for j in range(i + 1, len(array)):
        # if both elements add up to target, print both elements
        if array[i] + array[j] == target:
            print(array[i], array[j])
The solution below runs in O(n) time – we only iterate through the array once, taking advantage of a hash table (whose lookups are O(1)) to store the values we have already seen, so we don't have to iterate through the array again. Test it out and print the hash table along the way.
array = [6, 5, 9, -2, 13, 10]
target = 11
hashTable = {}
for i in range(len(array)):
    complement = target - array[i]
    if complement in hashTable:
        print(array[i], complement)
    hashTable[array[i]] = array[i]
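If you want to present this as a reusable function, one possible way to package it is shown below (my sketch, not the book's official solution). Note that a plain set is enough here, since we only ever check membership and never read the stored value:
def two_sum_pairs(array, target):
    seen = set()
    for value in array:
        complement = target - value
        if complement in seen:
            print(value, complement)
        seen.add(value)

two_sum_pairs([6, 5, 9, -2, 13, 10], 11)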
Python-SQL Like Prep
If you went through the problems at the beginning of this chapter, you can see that algorithms are important to study because companies want to know you understand the fundamentals. However, if you are early in your career, companies might mainly care whether you know Python. The way they test this is by asking SQL-like problems that you need to answer in Python. This means being able to do group bys, joins, unions, and basic arithmetic using pandas. Remember, the tips at the beginning of this section still apply.
Exercise
1) File order_data_2019
columns: order_date, customer_order_id, order_quantity, order_price, product_code
2) File order_data_2020
columns: order_date, customer_order_id, order_quantity, order_price, product_code,
product_category_code
3) File order_data_2021
columns: order_date, customer_order_id, order_quantity, order_price, product_code,
product_category_code, mobile_device
Use Python (or a programming language you prefer) to combine the 3 files and do the following:
First, return the following columns in 1 dataset: order_date, order_quantity, order_price
Next, get the total order_quantity and total order_revenue (quantity * price) at a quarterly level
Solution
import pandas as pd

# Read in each csv file and make sure to use your specific path
data_2019 = pd.read_csv('/Users/chrisgarzon/Documents/order_data_2019.csv')
data_2020 = pd.read_csv('/Users/chrisgarzon/Documents/order_data_2020.csv')
data_2021 = pd.read_csv('/Users/chrisgarzon/Documents/order_data_2021.csv')

# Union the 3 files and keep only the columns we need
data = pd.concat([data_2019, data_2020, data_2021], ignore_index=True)
data = data[['order_date', 'order_quantity', 'order_price']]

# Aggregate quantity and revenue at a quarterly level
data['order_date'] = pd.to_datetime(data['order_date'])
data['quarter'] = data['order_date'].dt.to_period('Q')
data['order_revenue'] = data['order_quantity'] * data['order_price']
df_rev = data.groupby(['quarter']).agg({'order_quantity': 'sum', 'order_revenue': 'sum'})
print(df_rev)
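If you prefer not to repeat the read_csv call three times, a small variation (the file paths are just examples) reads the files in a loop before concatenating. It is also worth pointing out to the interviewer that the three files do not share identical columns – pd.concat simply fills the missing ones with NaN, which is fine since we only keep three columns anyway:
import pandas as pd

paths = ['/Users/chrisgarzon/Documents/order_data_{}.csv'.format(year)
         for year in (2019, 2020, 2021)]
data = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)
data = data[['order_date', 'order_quantity', 'order_price']]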
For More Problems Like The Ones Above and 1-1
Help Solving Them, Sign Up For Our Coaching
Package
PART 6: Behavioral
Section 14: Most Common Behavioral Questions
Companies ask a variety of behavioral questions. However, they all tend to fall under the same umbrella: companies gauge your responses to see if you would be a good cultural fit. A "good cultural fit" is someone who aligns with the values and principles that a company has made public. Amazon is a really good example of a company that has shared its leadership principles and does a really good job of asking questions related to them. For this section we will go over some of the questions typically asked and how to best answer them – don't forget that these concepts apply to any company!
1. Bias for action: Tell me about a time you took a risk at work.
This question is testing to see if you take smart and well-thought-out risks. Amazon has a phrase that a lot of other companies have adopted – a "one-way door" vs. a "two-way door". A one-way door is a decision that is hard (if not impossible) to reverse, as opposed to a two-way door, which is a decision that is not written in stone. This is part of risk management, and it is where the bias for action part comes in.
Speed matters in business, and companies want to know that you can move fast but intelligently. Additionally, they will look for the reasons why you chose this risk and not another. Saying that it is a two-way door is great, but what else? Another thing they want to hear is your business knowledge – what are the inputs and outputs? For example, if you say that the input was 100 hours of engineering time and a cost of $1 million in AWS resources, but the output only yields $100k in annual revenue, the company probably won't think you assessed the risk appropriately.
2. Disagree and Commit: Tell me about a time you disagreed with a coworker/manager.
Let me start off with how not to answer this question. DO NOT talk negatively about anyone else. You can disagree with someone and still be respectful. You can say, "I understand why they thought this, and I spent time understanding their side even better, but according to my analysis on ABC, I thought this was the way to go."
What companies want to see is how you go about arguing your side. There is nothing wrong with disagreeing with someone if you do it respectfully and with logic. By logic, I mean data and research. Saying that a method "just feels right" won't cut it.
Finally, there will be times when, even if you are right, your manager might not yet trust you and might insist on doing it their way. This is where the "commit" part comes in. Think back to a time when you went ahead and committed, only to later get more information and proof that your method might have been right. Ideally, you used this opportunity to revisit your initial suggestion and pivot to try something new.
3. Ownership: Tell me about a time when you went above and beyond your job responsibilities to help the company.
Leaders lead by example. Talk about a time when someone on your team couldn't accomplish something. Maybe they were sick, had a lot on their plate, or simply didn't know how. You stepped up and got the job done. Ownership is about going beyond what you are expected to do and never saying "that is not my job."
Talk about how you arrived at this decision. Did you talk to your manager? If yes, was it something you brought up? Ideally this was not something your manager had to tell you to do, but rather something where you took the initiative after seeing something that struck you as worrisome. Why this task over others? Ideally, the task you decided to pick up was either urgent or something you knew you could do rather quickly.
4. Dive Deep: Tell me about a time you re-designed a process and why.
To redesign a process, one needs to understand the intricacies of that process first. Companies LOVE to hear that you dove deep. Don't be afraid to talk about times when you had to spend weeks reading through boring queries or documentation to figure out something that no one else on the team had noticed. Mention what the old design was and how your newly proposed design was going to be better, and be ready to go into details if needed.
5. Prioritization: Tell me about a time when you had two deadlines at the same time. How did you manage the situation?
This is a classic prioritization question. How does this candidate manage his/her time? How does the candidate decide which task should be completed first? What interviewers want to see here is that you weighed the tasks given many inputs – maybe one input was the urgency management was showing. Maybe another factor was that one task was something you already knew how to do quickly, and you knew you could ask a co-worker for help on the other task because they could do it relatively quickly. The trick here is to mention that you took 10 minutes to really think about which task you should spend your time on. Did you hold a prioritization meeting to hear other people's input? Was one task more short-term while the other was long-term? Was it possible to push back the deadline for one of the tasks?
The way NOT to answer this question is to say that you pulled an all-nighter to finish both. While this isn't the worst answer in the world, it is also what leads to burnout. Companies want to know that they're hiring someone who can manage their time well enough to not get burnt out and end up as a churned employee. Pulling an all-nighter is not sustainable, and frankly companies don't want to be known as a place where all-nighters are expected – otherwise they will be labeled as toxic.
6. Why do you want to work at this company?
This last one is simple – be authentic! Tell them why you want to work there. It helps if you use the product, but also do some research on the culture. What excited you about the company when you read about it online? Maybe mention something about your personality – "I want to join this company because I am into startups but also want to be surrounded by experienced engineers. This company seems to have the best of both worlds; the team I am applying for has a startup vibe, but company XYZ has the resources of a much bigger company."
This question is potentially tricky in that you can absolutely say the wrong thing – similar to how you shouldn't talk about your co-workers in a bad way, don't talk about your previous employer or past internship in a negative light either. Don't say it's because of the money either – companies already know this! Find something that is truly you and don't be afraid to stand out!
STAR Method
As you reread the questions and advice above, another thing to keep in mind is the STAR method. Aside from answering the question in the best way you can, your communication matters. The STAR method is a tactic that helps you streamline your thoughts and give your interviewer a high-level response. It stands for:
Situation: Describe the situation at a high level. This is like developing the plot of a movie. Describe who is involved, what the dilemma is, and what your team is trying to accomplish.
Task: What goal were you working towards, and what was your individual task?
Action: Describe the actions that YOU personally took. This is no time to be humble; make sure you use the word "I" to describe the initiative you took to accomplish the task at hand.
Result: Describe the outcome of your actions. If you can quantify it, even better! "Because of my ETL pipeline automation skills, I was able to reduce the amount of time my co-worker spent on manual tasks by 90%."
Each section should be 2-3 sentences long. Let the interviewer digest the story you're telling and let them ask you follow-up questions from there. As someone who has interviewed thousands of candidates, I can tell you firsthand that there is nothing more frustrating than hearing someone ramble on for 20 minutes, only to then have to ask 20 clarifying questions.
Practice writing your answers to some of the questions above. Remember, each should come out to about 8-12 lines. If you bought the premium package and want me to take a look, feel free to email your answers to me!
Sample Questions and Answers
Below are some sample questions and answers. See if you can come up with your own examples following the STAR method, and notice the patterns – concise and quantifiable stories with a brief overview at the beginning.
Tell me about your most challenging project ////// What did you do when you needed to motivate a group of individuals or promote collaboration on a particular project?
[S] At XYZ my most recent project was evaluating churn rates in our Pink Program. I did a deep dive and realized that there was a high number of payment failures causing users' subscriptions to fail when they were supposed to auto-renew.
[T] From there, the goal was to build a dataset to get more insight and eventually predict ahead of time who was going to have a payment failure.
[A] This proved harder than I thought because a) it seemed like no one knew where to get payment-related data, and b) the data that I did find seemed very unreliable. I eventually found a few people, by going to random office hours, that I thought could help. I organized a meeting between 3 different teams to align on a source of truth and to update stakeholders on this payment failures dataset.
[R] Finally, the result was that we were able to build a dataset that allowed the science team to a) figure out what the issue was and b) predict failures and build products to help alleviate the payments issue. This led to a massive increase in first-month retention, from 65% to 77%.
Tell me about a time in which you worked under a tight deadline // How have you leveraged data to develop a strategy?
[Situation] At XYZ we were in the process of launching Growth Planner Score – our ML model that would generate a 90-day sales prediction for every ASIN on a weekly basis.
[Task] As the DE, my task was to build the entire pipeline, create the features for these models, and set up jobs to run at worldwide scale – scale meaning 2 billion+ ASINs. For me, success meant 1) making sure the jobs ran in under 8 hours and 2) keeping the costs under our budget of $3k a month. As a team, success meant 1) getting high ASIN coverage – being able to generate a score for 90% of all active ASINs – and 2) having high accuracy, precision, and recall when it came to our predictions.
[Action] Our ML model took into account variables such as whether an ASIN had adopted a marketing program such as ads, deals, and A+. I needed to understand what the source of truth was for each of these teams so we could ingest their data, but one of the teams we were working with, the coupons team, wasn't moving as quickly as we would have hoped. I found out their only DE had just taken off for 3 weeks. So my ownership kicked in, and I reached out to a friend who knew someone who used to work on the Coupons team and asked him to point me in the right direction.
[Result] As a result, I was able to get the data I needed through a combination of our data catalog, S3 buckets, and our Redshift cluster. The pipeline runs in 50% of the time allotted and under budget. Specifically, the coupons team was able to use our score to recommend which items NOT to put a coupon on – this reduced their rate of coupons with $0 in sales from 50% to 30% in just one month – it was amazing.
When did you take a risk, make a mistake, or fail? How did you respond, and how did you grow from that experience?
[S] I was working at the startup, and one of the portfolio managers wanted bond data – data that was very expensive to buy from Bloomberg.
[T] My task was to talk to the Bloomberg reps and figure out the costs, how the data transfer would work, what we would get, etc.
[A] That said, I knew the data was publicly available online, and I knew I had the skills to just scrape it. Now, I had other tasks to complete, so I didn't want to waste too much time doing this, but I figured it would only take me half a day.
[R] It ended up taking me 3 days; on top of that, the scraper was very buggy, the data was more unreliable than I thought, and the website metadata kept changing, which would break my code. Just trying to fix this would have meant spending 1, maybe even 2, more weeks – something that would have pushed back our other tasks.
I learned that next time I should spend 10-20 more minutes scoping, and maybe quickly run the idea by my manager first. The good news is that this bias for action didn't cost us anything financially and didn't derail any existing projects, but I could have done a better job of scoping.
[S] One time I had a 1-1 with a Portfolio Manager, just to get to know each other better, but she started telling me about her day-to-day, and I specifically remember one task she was doing manually every day that would take her 30-60 minutes and was getting frustrating. I scheduled a follow-up to learn more about this effort.
[T] At the time I had other work scheduled on my sprint, but from first impressions it sounded like something I could help her with relatively quickly.
[A] I quickly ran it by my manager and asked him to give me 2 hours to help her automate this reconciliation process – I created an S3 bucket where she could drop off a file, which then kicked off a Glue pipeline that would ETL her work.
[R] This got her work 90% automated, but I went one step further by figuring out how to automate the final 10% – pulling attachments from emails and sending an email automatically when a file dropped into S3. This resulted in that process being 100% automated.
Tell me about a time when you were faced with a problem that had a number of possible solutions. What was the problem, and how did you determine the course of action? What was the outcome of that choice? /// Something you're proud of?
I have a lot of experience with AWS – I was responsible for building the entire data infrastructure at the startup.
[S] When I first joined the startup, the team was simply running scripts manually on their computers and saving files onto an EC2 server. This needed to be ETLed into a data warehouse – Redshift – but I needed to decide what the best tools to use were.
[T] I took on the role of moving all of that to the cloud and creating the pipelines.
[A] I dug into all the systems and processes the startup was running and automated everything – worked with third-party data vendors to receive data on an FTP server, used Python scripts within Lambda functions to pull that data into our data lake, decided whether to use Airflow or Glue jobs for our ETL process (and Spark to speed up jobs the quants were running), and finally got everything neat and organized in our data warehouse.
[R] I was most proud of the fact that I was able to get 100+ tables into our data warehouse from 20 different pipelines in about 6 months – those tables were utilized every week and drove a lot of investment decisions that improved our strategy by 20%, all while keeping our costs close to 0, whereas Airflow may have cost us $5k a month.
If after reading this book you feel like you could benefit from 1-1 mentorship with a FAANG
engineer, visit Data Engineer Academy
Remember that A LOT of companies have some sort of educational budget, so you can most
likely expense the mentorship package (or part of it) through your company. If you need me to
talk with someone at your company or give you a sample invoice, just email me. Even if the price is too steep, reach out to me about payment plans and/or a "pay after you get the job" plan.
© 2024 Data Education Holdings LLC. No part of this book may be reproduced or shared, in any form or by any means, without the prior written permission of the author. All rights reserved.