Ace the Data Engineer Interview PDF
Christopher Garzon
www.dataengineeracademy.com
Table Of Contents
PART 1: Introduction
PART 2: Data Modeling
PART 3: SQL
PART 4: System Design
Section 9: Compute
Section 10: System Design – Case Studies
PART 5: Algorithms
PART 6: Behavioral
PART 1: Introduction
● building datasets
● cleaning up data
● monitoring data quality
● creating real time processes
● scaling data to make sure the product can handle large amounts of data
● some fun SQL and Python exercises
● and much more!
If after reading this book you feel like you could benefit from 1-1 mentorship with a FAANG
engineer, visit Data Engineer Academy.
Remember that A LOT of companies have some sort of educational budget, so you can most
likely expense the mentorship package (or part of it) through your company. If you need me to
talk with someone at your company or give you a sample invoice, just email me.
PART 2: Data Modeling
Table 1: Fact_daily_sales

Date       | Sales       | Number_of_units_sold
2022-01-01 | $100,150.65 | 42
2022-01-02 | $60,250.40  | 35
2022-01-03 | $82,325.35  | 39
…          | …           | …
2022-09-01 | $142,580.12 | 95
We can see that the Fact_daily_sales table is aggregating the total amount of sales and units per
day. The date column in table 1 is the primary key. This means that there should not be two
rows with the same date value.
What if you wanted to see how much each buyer at AMZ bought every day?
Table 2: Fact_daily_activity

Date       | User_Id | Sales   | Number_of_units_sold
2022-01-01 | 123     | $50.65  | 1
2022-01-01 | 456     | $150.40 | 15
2022-01-01 | 349     | $83.35  | 2
2022-01-02 | 123     | $346.12 | 3
…          | …       | …       | …
2022-09-01 | 123     | $36.12  | 1
Fact_daily_activity is aggregating the total amount of sales per day and user. The date
and user_id columns together in table 2 are what’s known as the compound primary key. This
time we see that a date OR a user_id can repeat more than once, but together they should not.
The user_id is also what is known as a foreign key, meaning that it can be used to join to another
table where the user_id column is the primary key. Finally, what if you wanted to break this table
into each order?
Table 3: Fact_daily_orders

Date       | User_Id | Order_id | Sales   | Number_of_units_sold
2022-01-01 | 123     | xydhi    | $50.65  | 1
2022-01-01 | 456     | flma     | $100.40 | 10
2022-01-01 | 456     | ture     | $50.00  | 5
2022-01-02 | 123     | tore     | $146.12 | 1
…          | …       | …        | …       | …
2022-09-01 | 123     | tgrw     | $36.12  | 1
Fact_daily_orders is aggregating the total amount of sales and units per day, user, and
order_id. Finally, the date, user_id, and order_id columns in table 3 are also the compound
primary key and the user_id and order_id can both be foreign keys depending on how the
database is set up.
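To make the key relationships concrete, here is a minimal sketch of how those keys might be declared in SQL (the exact column types are assumptions, not the book's schema):

CREATE TABLE fact_daily_orders (
    date                  DATE         NOT NULL,
    user_id               INTEGER      NOT NULL,
    order_id              VARCHAR(20)  NOT NULL,
    sales                 DECIMAL(12,2),
    number_of_units_sold  INTEGER,
    -- the grain: one row per date, user, and order
    PRIMARY KEY (date, user_id, order_id),
    -- user_id points back to the table where it is the primary key
    FOREIGN KEY (user_id) REFERENCES dim_users (user_id)
);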
In every scenario there is a grain and there is a measurement. In this case the
measurement is in the form of an aggregation. Don’t worry if you don’t quite fully understand
the part about the keys - we will go in more depth in the database modeling section. For now,
let’s continue focusing on fact tables.
The 3 main types of fact tables are transaction, periodic snapshot, and accumulating snapshot
tables.
● Accumulating snapshot tables: The accumulating snapshot table is less common but
still important. This table means that 1 row is getting updated along a certain process or
time period. Here is a great article behind the history of the accumulating snapshot table.
A good example is thinking about a weekly periodic snapshot table. Table 1 is at a daily
grain. Let's say we wanted a weekly grain. For a periodic snapshot, we can update the
table once a week, every Monday as an example. What if we wanted to track the weekly
progress before the week was up though? We can insert a row on the Tuesday before and
then update that row every day by adding that day's sales onto that single row
representing the weekly total (see the sketch below).
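A hedged sketch of that insert-then-update pattern (the table and column names are made up for illustration):

-- Tuesday: start the accumulating row for the week
INSERT INTO fact_weekly_sales (week_start_date, sales, number_of_units_sold)
VALUES ('2022-08-29', 0, 0);

-- Every day after that: fold the day's totals into that same row
UPDATE fact_weekly_sales
SET sales = sales + 142580.12,
    number_of_units_sold = number_of_units_sold + 95
WHERE week_start_date = '2022-08-29';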
Stick with me, you will see in section 3 how this relates to passing 1 part of the data engineering
interview.
Section 3: Dim Tables
In the previous section we learned about the three different types of fact tables. We also
went over some keys, but we will save that for the next section. For this part, we will learn about
the other common type of table – dim (dimensional) tables.
A dim table is a table that organizes descriptive attributes into columns. We will discuss
the most common type of dimensional table: SCD (Slowly Changing Dimensions). As you can
tell by the name, an SCD is a table where the values won't change much over time. Within SCD,
there are three different types – Type1, Type2, Type3.
Type1
A Type1 table is the most common dimensional table and involves simply updating a
row. Keeping with the example in the Fact Tables section, let’s zoom in on User_id 123 and
pretend that we have another table called dim_users.
Before:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

After:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | Seattle
One can probably already guess that dim_users is a table that won't change much. After all, how
often does someone change their name? However, you can see above that Neff Nezos decided to
change his primary location. Therefore, we decided to simply change the value of New York to
Seattle.
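In SQL, a Type1 change is just an in-place update along these lines (a sketch that follows the dim_users example above):

UPDATE dim_users
SET primary_residence_location = 'Seattle'
WHERE user_id = 123;   -- the old value, New York, is overwritten and lost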
Type2
A Type2 table is one where the before does not get erased. As you can see below, we
simply INSERT a new row using a SQL command. The benefit here is that we get to keep the
history and do analysis to see if there were any changes.
Before:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

After:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York
123     | Neff       | Nezos     | Seattle
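A minimal sketch of the Type2 pattern (in practice you would usually also carry effective-date or is_current columns, which are omitted here):

INSERT INTO dim_users (user_id, first_name, last_name, primary_residence_location)
VALUES (123, 'Neff', 'Nezos', 'Seattle');   -- the New York row stays, so history is preserved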
Type3
A Type3 table is one where we simply add a new column. “Old Residence Location”
now exists in the after table. This is not as common and probably will not get asked during an
interview, so I will not spend too much time on this.
Before:

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

After:

User_Id | First Name | Last Name | Primary Residence Location | Old Residence Location
123     | Neff       | Nezos     | Seattle                    | New York
My personal preference is that you should avoid Type3 if you can, because it
becomes hard to code a table schema (the skeleton of the table) if you don't know what the table
columns are going to be. For example, what if Neff Nezos moves 10 times? Will your dim_users
table suddenly have 10 columns for this one user_id while the other user_ids have only 1 or 2
applicable, non-null columns?
This leads us to our next section: database modeling. We will discuss why fact and dim
tables are the way they are and the many benefits there are to constructing tables this way. This
will help you pass the data modeling section of the interview and after that, you will be ready to
run some SQL queries.
Data Modeling is when you model out what your data warehouse is going to look like. A
data warehouse can contain one or more schemas and a schema can contain multiple tables. Use
this analogy: think of a data warehouse as a physical warehouse, a schema as a section or corner
of the warehouse, and in that corner, there are boxes (the tables) that typically belong to one part
of the company. For example, let’s say you work for a hedge fund and are tasked with building
an entire data warehouse from scratch. What are some examples of schemas you might build?
You could build a sales schema only available to the sales team, a marketing schema only
available to the marketing team, and an investments schema only available to the investments
team. The reason for doing this is for a) organization and b) security and privacy. As a data
engineer, you may be tasked with database admin tasks such as controlling who has access to
what. The investments team won’t need access to marketing data that might contain customer’s
personal information. Therefore, it is a good idea to separate these tables into different schemas.
Now let’s talk about tables and continue using the examples from section 1 and 2. Did
you ever wonder why exactly dim and fact tables exist? Let’s look at dim_users and
fact_daily_activity again.
core.dim_users

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

core.fact_daily_activity

Date       | User_Id | Sales   | Number_of_units_sold
2022-01-01 | 123     | $50.65  | 1
2022-01-01 | 456     | $150.40 | 15
2022-01-01 | 349     | $83.35  | 2
2022-01-02 | 123     | $346.12 | 3
…          | …       | …       | …
2022-09-01 | 123     | $36.12  | 1
You can see the “core” at the beginning of the table name. That’s because now we have
placed our two tables in the “core” schema. What is “core” exactly? At big tech companies it is
common for multiple teams to have access to the main type of data of the company. Therefore,
the main tables just go into the “core” schema or the “public” schema. The name doesn’t really
matter – it just identifies that it is available to anyone for use.
Back to the table design; why did we need a dim table to begin with when we could just
add the "first_name" and "last_name" columns from core.dim_users to core.fact_daily_activity?
Two reasons:
1) Notice that the user_id value (123) doesn’t change. Even if Neff decided to change
his name to Jeff, user_id is still 123 in both tables, and we can JOIN these two tables
on the value.
2) There is no need to store a column like first_name in core.fact_daily_activity, because this
column will (most likely) always be the same for a given user and would therefore take up A LOT
of unnecessary space, which leads to a lot of $$ that can be saved. This is one way that a
knowledgeable Data Engineer can help his/her company save money.
Star Schemas
If you recall from section 1, we mentioned how these tables had columns that are
primary and/or foreign keys. Looking at core.dim_users above, we can see that the user_id
column fits the definition of a primary key: no two rows will have the same user_id value of
"123". Looking at the core.fact_daily_activity table above, we see that the date and user_id
columns form a compound primary key, but the user_id column stands out. Why? Because in this
case, the user_id is also a foreign key. In other words, the foreign key (the user_id in the
core.fact_daily_activity table) can be joined to the primary key of another table (the user_id in the
core.dim_users table).
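For example, one way to write that foreign-key-to-primary-key join against the two tables above:

SELECT
    f.date
    ,d.first_name
    ,d.last_name
    ,f.sales
FROM core.fact_daily_activity f
JOIN core.dim_users d          -- foreign key on the fact side ...
    ON f.user_id = d.user_id   -- ... meets the primary key on the dim side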
In the data modeling section of the interview, you will most likely get asked to model
out a schema and design tables in that schema. These concepts are the essence of what makes
up a star schema. A star schema, which is by far the most popular, is the idea that a fact table is
in the middle of the schema surrounded by a bunch of dim tables. See the tables below.
core.dim_users

User_Id | First Name | Last Name | Primary Residence Location
123     | Neff       | Nezos     | New York

core.dim_regions

Region_id | Region | … | Currency
1         | US     | … | USD
7         | CA     | … | CAD

core.fact_daily_orders

Date       | User_Id | Region_id | Sales   | Number_of_units_sold
2022-01-01 | 123     | 1         | $50.65  | 1
2022-01-01 | 456     | 7         | $100.40 | 10
2022-01-01 | 456     | 1         | $50.00  | 5
2022-01-02 | 123     | 1         | $146.12 | 1
…          | …       | …         | …       | …
2022-09-01 | 123     | 1         | $36.12  | 1

core.dim_dates

Date       | Week_Number | Month | Day_of_Month
2022-01-01 | 1           | 1     | 1
…          | …           | …     | …
2022-04-06 | 13          | 4     | 6
You may get asked to design something like this and explain your reasoning. This is where you
think back to the warehouse analogy. What makes sense to put in a dim table vs a fact table?
What makes sense to separate into two separate dim tables or two separate fact tables? Note: One
schema can have multiple fact tables.
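To see the star schema pay off, here is a hedged example of the kind of query an analyst might run against it, joining the fact table to two of the dim tables above:

SELECT
    r.region
    ,dd.week_number
    ,SUM(f.sales) AS weekly_sales
FROM core.fact_daily_orders f
JOIN core.dim_regions r ON f.region_id = r.region_id
JOIN core.dim_dates dd ON f.date = dd.date
GROUP BY 1,2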
Pro Tip
Before going into any interview, study the business!! Knowing how to code all the above
is great but understanding the business and how the data is related to each other (like a star
schema) will help you design beautiful data models. This is what will set you apart from the rest.
It might even be helpful to think about food delivery services that you use and start from there.
Solution: Food Delivery Company Data Modeling
You will notice that I did not draw out the tables and have just bullet pointed the solution.
This is for two reasons - A) in an interview you do not want to waste valuable time creating
tables and B) frankly, it saves space ☺. The problem here is a very complex one with a lot of
moving parts and during an interview you will want to focus on quality and not focus too much
on making it look pretty.
There is a lot to digest, so take a look at the schema outline solution below and then let’s
break it into pieces in the explanation section following that.
Schema Outline
1) Let’s start with the dim tables. What are the distinct objects? Those are your entities and
should be your dimension tables? This includes:
o Dim_deliverer
▪ Deliverer_id
▪ Full Name
▪ Region_Id
▪ (Other Relevant Demographic_info_cols)
▪ Account_creation_date - life of the Deliverer
▪ Last_login_date
▪ Number_of_trips
▪ L7_Number_of_trips (last 7 days)
▪ L28_Number_of_trips (last 28 days)
o Dim_customer
▪ Customer_id
▪ Full_Name
▪ Region_Id
▪ (Other Relevant Demographic_info_cols, like address)
▪ Account_creation_date - life of the customer
▪ Last_login_date
▪ Number_of_orders
▪ L7_Number_of_orders (last 7 days)
▪ L28_Number_of_orders (last 28 days)
▪ Premium_user (Y or N)
▪ Premium_user_start_date
o Dim_restaurant
▪ Restaurant_id
▪ Restaurant_Name
▪ Region_Id
▪ (Other Relevant Demographic_info_cols)
▪ Account_creation_date - life of the restaurant
▪ Number_of_orders
▪ L7_Number_of_orders (last 7 days)
▪ L28_Number_of_orders (last 28 days)
▪ Menu_items (a json)
▪ Opening_Hour
▪ Closing_Hour
▪ Food_Category (fast_food, healthy, etc)
2) Next, what are the fact tables? Recall that there could be multiple fact tables, and this
problem is the perfect place for that. Activity based tables deserve their own fact tables,
such as the following:
o Fact_deliverer_activity
▪ Session_id
▪ Deliverer_id
▪ Region_id
▪ Timestamp
▪ Activity_type -
login/update_account_info/available_to_pickup/accept_order
/cancel_order/arrive_to_restaurant/picks_up_order/drops_off_order
o Fact_customer_activity
▪ Session_id
▪ Customer_id
▪ Region_id
▪ Timestamp
▪ Activity_type -
login/update_account_info/available_to_order/places_order
/Cancels_order/receives_order/rates_driver/tips_driver
3) Are there any entities that interact often that you can then create a fact table from? We
know that the two fact tables above interact, so let's create one more fact table:
o Fact_order_activity
▪ Order_id
▪ Deliverer_id
▪ Customer_id
▪ Region_id
▪ Timestamp
▪ Activity_type - (places_order/driver_picks_up_order/
driver_drops_off_order)
4) Finally, is there any Dimension table needed for the fact table above? Yes. This concept
is called “creating a wide table” or “flattening out” the fact_order_activity table we see in
part 3.
o Dim_orders – have ONE RECORD per interaction (one record per order)
▪ Order_id
▪ Deliverer_id
▪ Customer_id
▪ Region_id
▪ places_order_time
▪ deliverer_accepted_time
▪ deliverer_picks_up_order_time
▪ deliverer_drops_off_order_time
Explanation
1) This section is related to dim tables. The way we go about deciding which dim tables to
create is by first thinking which “objects” exist. An object here is a part of the business flow that
will exist regardless of changes to the business. For example, if this food delivery company
decides to charge more, it won’t change the objects needed to run the business. You can see
that the fields are similar.
One field that stands out in the Dim_customer table is Premium_user. This is just a
Boolean field which will indicate whether the user is a premium user (Y) or not (N). Another field
to point out is Menu_items in the Dim_restaurant table. Note that I wrote that this field is a
JSON value. For example, restaurant A may have 4 menu items. Therefore, the value of this row
might look something like {menu_item_ids: 242,256,2457,823}. Can you think of another dim
table that can be created knowing this fact?
2) Each object can theoretically have its own fact table. I did not include one for the restaurant
object, but the concepts remain the same as for the Fact_deliverer_activity and
Fact_customer_activity tables. The one important thing to point out is that these tables are at a
Session_id, Customer_id (or Deliverer_id), Region_id granularity. Note that each value in the activity_type
column has its own timestamp. This will be important for the fourth part.
3) You might be wondering why a table like Fact_order_activity is necessary if in part two
we already have Fact_deliverer_activity and Fact_customer_activity. The concept is simple:
storage is cheaper than compute! Duplicating data is better than doing JOINs, and there is less
room for error from analysts and scientists incorrectly joining columns or writing expensive queries.
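The "flattening out" idea from point 4 can be sketched as a conditional aggregation over Fact_order_activity - one row per order, with each activity's timestamp pivoted into its own column (a sketch, not the only way to build Dim_orders):

SELECT
    order_id
    ,MAX(deliverer_id) AS deliverer_id
    ,MAX(customer_id)  AS customer_id
    ,MAX(region_id)    AS region_id
    -- pivot each activity's timestamp into its own column
    ,MAX(CASE WHEN activity_type = 'places_order'           THEN timestamp END) AS places_order_time
    ,MAX(CASE WHEN activity_type = 'driver_picks_up_order'  THEN timestamp END) AS deliverer_picks_up_order_time
    ,MAX(CASE WHEN activity_type = 'driver_drops_off_order' THEN timestamp END) AS deliverer_drops_off_order_time
FROM fact_order_activity
GROUP BY order_id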
PART 3: SQL
Set Up:
Go to coderpad.io and create an account – the free account is more than enough. It is also
good to familiarize yourself with coderpad for future interviews. For the first half of the metrics
via SQL section, you can use coderpad to test your code and get back actual results. The second
half will test for more advanced knowledge of SQL and won’t require coderpad.
Some tips - write down your answers before viewing the final solution. There are many
ways to answer the same question so make sure to jot down any solutions that come to mind. It
might surprise you to see those two potential solutions you thought would give you the same
answer result in different outputs.
Interview tips:
The following tips apply to both SQL and coding questions in general (python algorithms). I
cannot stress how important these tips are. Getting the right answer is important. But in the case
where your final answer is slightly off, the interviewer will weigh how you did with the
following other aspects of the interview. Pro Tip: This is where mock interviews are very
impactful.
● Ask Questions. Ask for Clarifications. Never Assume!
A lot of the time the interviewer will ask a question that is vague. They are testing to see if
you have the product and business sense to take a step back and ask clarifying
questions. The WORST thing you can do is start coding immediately. You should spend
around 5 minutes on this part.
● Ask for feedback on your approach before coding. Most importantly, listen to the
feedback.
For example, if you say, "this problem requires an OUTER JOIN because xyz", and the
interviewer asks why an outer join and continues to push you in a different
direction…TAKE THE HINT! These exercises are meant to be a team effort between you
and the interviewer. They want you to succeed, and if they start pushing you in a different
direction then trust them, take a breather, pause for 2 minutes, and think about another potential solution.
● Code Cleanliness and Clarity
As a tech interviewer I can assure you that the hardest part is trying to follow someone's
logic in real time. Whether it be SQL or Python, seeing someone else write code and
trying to figure out what they are doing and why is tough. However, it becomes even
tougher when students don't follow standard code etiquette. Make no mistake, you WILL
get points deducted if your code is not clean/commented/easy to read.
Average: Department’s Average Salary
Goal
Using coderpad, return the employee count of every department along with the average
salary in each department, like the table below. This question will be one of the simplest, but
don’t forget to read over the tips at the beginning of this section.
Column Name                  | Value       | Description
department_name              | Engineering | The name of the department
average_salary_by_department | $60,250     | The average salary per each department

* The numbers here are made up – do not expect them in the final solution.
Solution
SELECT
d.name as department_name
,SUM(CASE WHEN e.first_name IS NOT NULL THEN 1 ELSE 0 END) AS number_of_employees
,AVG(e.salary) average_salary_by_department
FROM departments d
LEFT JOIN employees e
ON d.id = e.department_id
GROUP BY 1
ORDER BY average_salary_by_department DESC
Output
Common Mistakes
● Doing a join instead of a left join. Remember the advice in the tips section at the
beginning of this chapter? Test the data and ask clarifying questions! The question
states "every department" and this is something that you may need to clarify – do they
also mean to include departments with 0 employees? Even if this question didn't come to
mind, seeing what the data looks like would have revealed that the biz dev department
has no employees. That would have allowed you to ask a follow-up question about
whether they want it to be included or not.
● A lot of people might do a "count(*)" instead of "sum(case…" to get the number of
employees. However, you get different answers because the left join still creates a row
and COUNT(*) will count the rows regardless of whether the employee columns are null or not. Test it out!
● For this exercise I am going to add what unclean code would look like. Note that even
though the code gets the correct answer, the below answer as presented will still result in
points getting deducted. Notice how much more difficult the below code is to read?
- SELECT is not capitalized
- The first column is on the same line as the select statement
- The second column is called x
- The main SQL syntax is not capitalized
- The departments table has an alias of z and the employees table has an alias of t,
which are just irrelevant
- The left join statement is all on 1 line
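Based on those bullets, the unclean version might look something like this (a hedged illustration – do not write code like this):

select z.name as department_name, sum(case when t.first_name is not null then 1 else 0 end) as x
,avg(t.salary) y from departments z left join employees t on z.id = t.department_id
group by 1 order by y desc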
Subqueries: Project Over Budget
Goal
Using coderpad again, return projects that are over budget based on the employees'
salaries. Return the full name of each employee in 1 column.
Column Name | Value        | Description
Full Name   | Chris Garzon | Full name of the employee in 1 column

* The numbers here are made up – do not expect them in the final solution.
Solution
WITH project_over_budget AS (
SELECT *
FROM (
SELECT
p.id
, title
, budget
,SUM(salary) total_salary
FROM employees_projects ep
JOIN employees e
ON ep.employee_id = e.id
JOIN projects p
ON ep.project_id = p.id
GROUP BY 1,2,3
) sub
WHERE total_salary > budget
)
SELECT
p.id
,p.title
,p.budget
,p.total_salary
,CONCAT(e.first_name,' ' , e.last_name) AS full_name
FROM employees_projects ep
JOIN employees e
ON ep.employee_id = e.id
JOIN project_over_budget p
ON ep.project_id = p.id
Output
Goal
Using coderpad once again, use a window function to return the highest and lowest
earner per project. Note: if a project only has 1 employee, only 1 employee should be returned.
Create a column called "rank" to specify whether this employee was the highest earner, lowest earner,
or both. Hint: Do not use the rank window function.
Column Name        | Value          | Description
title              | Geo Clustering | The name of the project
employee_full_name | Ana Garzon     | The full name of the employee
Output
Bonus Question: Why did I specify not to use the rank window function? Can you think of a
scenario where rank would have failed us?
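One possible approach (a sketch using MAX/MIN window functions instead of rank, and the same employees/projects tables as the previous problem – not necessarily the intended solution):

SELECT title, employee_full_name, "rank"
FROM (
    SELECT
        p.title
        ,CONCAT(e.first_name, ' ', e.last_name) AS employee_full_name
        ,CASE
            WHEN e.salary = MAX(e.salary) OVER (PARTITION BY p.id)
             AND e.salary = MIN(e.salary) OVER (PARTITION BY p.id) THEN 'both'
            WHEN e.salary = MAX(e.salary) OVER (PARTITION BY p.id) THEN 'highest'
            WHEN e.salary = MIN(e.salary) OVER (PARTITION BY p.id) THEN 'lowest'
        END AS "rank"
    FROM employees_projects ep
    JOIN employees e ON ep.employee_id = e.id
    JOIN projects p  ON ep.project_id = p.id
) sub
WHERE "rank" IS NOT NULL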
Goal
Using coderpad once again, for each project id return the start date and end date. Then,
for each project, return 3 new results: 1) the days passed between that project and the next
project's start date, 2) the days passed between that project and the next project's end date,
and 3) the days passed between that project's end date and the next project's start date. In
this case, define "next" by ordering by start date. As a challenge, do not use any subqueries.
Solution
SELECT
id
,start_date
,end_date
,start_date - LAG(start_date, 1) OVER (ORDER BY start_date) as start_start
,end_date - LAG(end_date, 1) OVER (ORDER BY start_date) as end_end
,start_date - LAG(end_date, 1) OVER (ORDER BY start_date) as end_start
FROM projects
Output
Bonus: Can you do it using whichever function you did not use? (lag or lead)
Joins: Segment by Delivery Membership
Goal
For this question, coderpad is not needed and no datasets are available - simply write out
your solution. Imagine you have two datasets: grubfub_dataset and doorfash_dataset. Each
dataset contains a column titled user_id and the values are a string of numbers to identify a user.
If user_id 1234 is in the grubfub dataset, it means that this person has a grubfub membership. If
the user is in both datasets, it means the user has both memberships. Write a query to return
something like the output below - the percent of users that have only a grubfub membership, the percent
that have only doorfash, and the percent that have both.
Output
Column Name | Value | Description
Solution
SELECT
    100*SUM(CASE WHEN d.user_id IS NULL THEN 1 ELSE 0 END)/COUNT(*) AS g_only,
    100*SUM(CASE WHEN g.user_id IS NULL THEN 1 ELSE 0 END)/COUNT(*) AS d_only,
    100*SUM(CASE WHEN g.user_id IS NOT NULL AND d.user_id IS NOT NULL THEN 1 ELSE 0
    END)/COUNT(*) AS both_memberships
FROM
    (SELECT DISTINCT user_id FROM grubfub_dataset) g
FULL OUTER JOIN
    (SELECT DISTINCT user_id FROM doorfash_dataset) d
    ON g.user_id = d.user_id;
Business Need: Create Histogram
Goal
Similarly, no need for coderpad here – just write out your solution. You work at TikTok
and you are tasked with making a histogram and representing it in Tableau. However, Tableau
can only hold so much data. You decided to use your SQL skills to preprocess the data and save
space.
Imagine the dataset below has millions of rows. The goal is to write a query that returns a
table that mimics a histogram: in the last 30 days, what does the distribution curve (a
histogram) look like for how many videos a user posts? Your result should look something
like the output dataset.
Dataset

Column Name | Value | Description

Output

Column Name | Value | Description
Solution
SELECT
video_count
, COUNT(video_count) count_frequency
FROM
(
SELECT
user_id
,COUNT(video_id) video_count
FROM fact_daily_tiktok_videos
WHERE date >= CURRENT_DATE - 30
GROUP BY 1
) user_videos
GROUP BY 1
Before Covid it was common to have whiteboarding sessions at Big Tech companies. A
whiteboard session is when you would show up in person, write out your code, and explain your
thought process out loud, all while the other person was in the room following along. Other than
being unnatural and not representative of how you normally code, the biggest drawback to this
method was that candidates couldn't test their code iteratively like you do on a laptop.
In a way, Covid made coding interviews easier by allowing candidates to test their code
against a database. Recently, however, some companies have started doing “virtual whiteboard”
sessions (25% of them, including Amazon). The way it works is they send you a screenshot of an
input and output sample dataset and ask you to type your solution on a virtual whiteboard. There
is no database attached, and no result to test against.
If you completed the Joins: Segment by Delivery Membership and Business Need:
Create Histogram problems, then you will notice that those cannot be tested against an actual
dataset on coderpad. Below you will find more questions like the above, but with more detailed
sample input and output datasets.
1) Dataset Input/Output #1
Solution
WITH T_ORDERS AS (
SELECT
order_date
, product_code
, SUM (quantity) order_quantity
FROM fact_daily_orders
GROUP BY 1,2
)
SELECT
order_date
, o.product_code
, product_category_code
, order_quantity
FROM T_ORDERS o
LEFT JOIN dim_product p
ON o.product_code = p.product_code
2) Dataset Input/Output #2
Solution
There are a few ways to do this with one very optimal method. If you bought the premium
version, please go ahead and email me your solution and I can provide some feedback.
Section 6: Table Creations
Below is the code for how you would create a fact_daily_order table. Don't worry, you
won't be asked to create tables during the interview, but just in case, it's helpful to see the code at
least once.
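Something along these lines (a sketch in Redshift syntax – the exact column names and types are assumptions):

CREATE TABLE core.fact_daily_orders (
    date                  DATE         NOT NULL,
    user_id               INTEGER      NOT NULL,
    order_id              VARCHAR(32)  NOT NULL,
    sales                 DECIMAL(12,2),
    number_of_units_sold  INTEGER
)
DISTKEY (user_id)   -- how rows get spread across servers
SORTKEY (date);     -- how rows get ordered within each server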
Notice that you need to specify the data types. This helps in avoiding getting different
data types in the same column which would be a nightmare – imagine trying to sum up a
date…Since you most likely won’t get tested on this, let’s move on. In section 8 we will go over
those last two lines you see – the distkey and sortkey identifiers. For now, all you need to know
is that those are ways to organize your data in an optimal manner for quicker queries.
PART 4: System Design
Section 7: Data Storage: S3
System Design is an important part of the Data Engineering job. If you are a recent
college grad (or grad intern), chances are that companies will weigh your technical skills more –
i.e. SQL and Python. However, if you show that you are familiar and comfortable with the
different aspects of the System Design skill, then it will stand out big time. Section 10 will go
over System Design examples and section 11 has some system design practice problems. But
first, let's go over some important parts in the next 3 sections - let's begin!
Sticking with the Amazon Web Services (AWS) world since it is the most popular, let's
discuss S3! S3 is the data lake service in AWS. Anyone can create an AWS account and use the
tools within their account. Each tool has a different use case and cost structure, but S3 is by far
the most popular. What is a data lake? Think of a data lake as a cabinet with virtually unlimited
size for anyone to store their data. The data can be structured (an excel file) or unstructured (a music
file), and a single object can go up to 5 TB in size. As of 2021, Amazon holds over 100 trillion
objects in S3!
If you are not familiar with data sizes, you should familiarize yourself with the basic
concepts. Everything on your computer is of a certain size, which can be seen next to your file in
Finder. You can also click on the Apple logo on the top left, click on the "About This Mac"
button, and click on Storage.
You can see that there is about 405 GB left on this computer for me to store files. The
orders of magnitude are 1 Byte, Kilobyte (1,000 Bytes), Megabyte (1,000 Kilobytes), Gigabyte
(1,000 Megabytes), Terabyte (1,000 Gigabytes), Petabyte (1,000 Terabytes). Big Tech companies
operate at the petabyte level.
Why is this important? Because as a data engineer that needs to manage S3 buckets, you
will be responsible for forecasting costs before launching a product. The important things you will
need to know for the interview are as follows: 1) S3 is normally the first stop for your data, and
loading data into it is normally done with Python 2) it is typically good practice to store raw data in S3
regardless of use as it is extremely cheap to do so 3) the cost and different cost structures. Frankly, you will
never be asked for any specifics regarding costs, but it's good to familiarize yourself with the
basic structure and keep certain tiers in mind. For example, you can see that S3 Standard charges
$0.023 per GB per month for data storage. However, they only charge $0.004 per GB per month
for Glacier, which is S3's long-term storage solution where you can place files that don't require
immediate access (access once a quarter, say). The tiers are interesting as they can dramatically reduce
costs if you are familiar with your business use case and know you won't be needing your data
immediately. The full cost outline can be found on the AWS S3 pricing page.
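Since landing raw data in S3 is "normally done with Python", here is a minimal sketch using the boto3 library (the bucket and key names are made up):

import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="daily_orders_2022-09-01.csv",    # local raw extract
    Bucket="my-company-raw-data",              # hypothetical bucket
    Key="orders/2022/09/01/daily_orders.csv",  # date-partitioned prefix keeps the lake organized
)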
Section 8: Data Warehouse: Redshift
In the database modeling section of this book, we learned about fact and dim tables and
how to model schemas. Those schemas exist in a database, and it is good to get familiar with the
different types of databases. A database is simply a way that data is organized and set up for a
user to retrieve. Most individuals are familiar with Excel (rows and columns), but in the database
world there are many different types of databases and two classes of databases – SQL and
NoSQL.
NoSQL databases are for unstructured data. When you think "unstructured data", it's
a good idea to think of JSON code like the example below.
{
  "id": "0001",
  "type": "donut",
  "name": "Coco Donut",
  "calories": 455,
  "batters": {
    "batter": [
      { "id": "1001", "type": "Regular" },
      { "id": "1002", "type": "Chocolate" },
      { "id": "1003", "type": "Blueberry" }
    ]
  },
  "topping": [
    { "id": "5002", "type": "Glazed" },
    { "id": "5005", "type": "Sugar" },
    { "id": "5007", "type": "Powdered Sugar" },
    { "id": "5006", "type": "Chocolate with Sprinkles" },
    { "id": "5003", "type": "Chocolate" }
  ]
}
Unlike the queries you were running in the Metrics Via SQL section, you would pull values here
via the key-value method. In other words, requesting "id" would get you "0001". A good example is
DynamoDB – a key-value NoSQL database. DynamoDB is good for "horizontal scaling" and fits
well with real-time applications where data is being streamed at a fast and high pace.
A SQL database is like what you saw in the database modeling section earlier in this book.
SQL is for structured data and multi-row transactions. Redshift is a very popular example
and I think it is worth going over Redshift a little more in depth as it is another tool you can add
to your system design come interview time.
We know that Redshift is for structured data, but what exactly does this entail? For starters,
we saw that we could store our unstructured data in S3. This is where your SQL and ETL
(extract, transform, and load) skills come in handy. If you have JSON code like the one above,
you can use SQL to manipulate and load (ETL) your data into Redshift where analysts and data
scientists can do their job with greater ease.* Redshift is also a columnar database. This means
that when you query a table in Redshift, your data will be optimized to do aggregates by column.
For example, summing total sales in the sales column of a table will run quickly…if you set
up the table correctly.
Let’s look at the last few lines of the create table code from section 6.
What exactly does distkey (another name for this is partitions) and sort key mean? This is
VERY essential to correctly doing your job as a data engineer and having happy coworkers.
Once again, remember that interviewers will most likely not ask you to create a table, but that
doesn’t mean you shouldn’t mention these concepts.
The analogy that I like to use is that of multiple highway lanes (which are the servers)
and the cars are the data columns. The distkey allows us to evenly split the traffic amongst
highway lanes as opposed to call cars trying to use the same 1 lane. Specifying the distkey as
user_id helps speed up querying significantly because the data for 1 user_id will be one “lane”.
Therefore, if you filter for user_id with a WHERE clause, then your query only scans that one
server instead of checking every single lane.
The sortkey sorts the data within each distkey. For example, we know every row for
user_id 123 might be on server 1. Once we apply the sortkey of date, then the way those rows
pertaining to user_id 123 get sorted is by the date (in this example). This means that if we have
data starting from the year 2012 but you also filter by date > “2022-01-01”, then you can avoid
scanning 9 years of data.
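Assuming the table was created with DISTKEY(user_id) and SORTKEY(date) as sketched in section 6, a query like this is the kind that benefits:

SELECT date, SUM(sales) AS total_sales
FROM core.fact_daily_orders
WHERE user_id = 123           -- stays on the one "lane" (slice) that holds user 123
  AND date > '2022-01-01'     -- sorted data lets Redshift skip the older blocks
GROUP BY date;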
During the interview you will be asked to justify why you chose a certain type of
database. For the most part, saying you chose Redshift because it is a SQL database, the sort and
dist keys speed up querying, and you don’t need data updated in real time (it can be updated
every 10 minutes to 24 hours) should suffice. But again, remember to talk about the tradeoffs!
*Bonus Question: Can you design what the table would look like after the JSON above gets
turned into a Redshift table?
Section 9: Compute
We have gone over Data Modeling, Storage, and Databases. Even though interviewers
will never ask you to go into specifics, I want to briefly discuss one other component that is worth
understanding the fundamentals of - compute. Compute is to a computer what the brain is to a
human. Your brain receives information every day and processes that information to then make
decisions. Your computer has limited CPU, and its usage can be seen in the Activity Monitor app, like the
screenshot below.
We can see which applications on our computer take up the most CPU percentage-wise. If
you were to run a machine learning model on your computer that takes up the remaining CPU,
then chances are the other applications on your computer will slow down. The worst-case scenario
is that your entire computer crashes. In the old days companies used to set up server (another
computer, essentially) farms manually, but thanks to the cloud every tech company can now use
virtual servers and completely avoid ever having to touch hardware. The service for virtual
servers on AWS is called Elastic Compute Cloud (EC2). Like S3, Amazon allows you to choose
different servers for your workload depending on how much compute (processing power) you
need. If you really want to get your hands dirty, you can spin up a small EC2 server that costs
around 2 cents an hour for the "on-demand" tier. Just make sure to turn it off, since it can rack up
costs if you forget!
Like the previous two sections, the specifics of which EC2 server you would need are not
important for passing a data engineer interview. Understanding the fundamental concepts is
much more important. However, I will briefly mention 3 topics that could be mentioned in the
system design interview, and they would make a strong impression as new grads are typically not
expected to understand these concepts.
● Scaling: There are two types of scaling – vertical scaling ("scaling up") in which
more compute is added to current servers, and horizontal scaling ("scaling out")
in which more servers are added to the family of servers that are processing your
application. You might require vertical scaling if you want to run 1 machine
learning model to get results 1 time and it has a lot of data to process. Horizontal
scaling is typically what tech companies do because it is not resource constrained.
For example, if you are Twitter and you suddenly start receiving a lot of traffic
due to a pandemic that has shocked the world, then as a company you don't want
your application to not load because of this unexpected traffic. Chances are, you
might even need a tool like Elastic Load Balancing (ELB) to evenly distribute this
traffic amongst your many servers. Knowing these two concepts and how they
apply to a business will be beneficial for your interviews.
● Caching: Caching keeps frequently requested data close at hand so that it does not
have to travel all the way from point B to point A. Redis is a very popular open-source
example. This ultimately provides the user a better customer experience, and the
Twitter results get returned faster.
● Abstraction/ETL Tools: One good thing to know about the AWS world is that a
bunch of tools, such as Glue, EMR, Airflow, and even Redshift, are all operating
on top of EC2 servers. At the end of the day, regardless of what tool you are using,
you will require compute! It is good to know these tools exist in case you want to
play around with them on your own. Again, an interviewer will never ask you to
demo them, but the thing that matters is that you know that some tool
must be used to process the data in your pipelines. In the next section you will see
how it all fits into the whole environment.
Section 10: System Design – Case Studies
Now that we have the foundation for data infrastructure (storage, compute, and
databases), let’s put this all together to see how they relate to one another. We will go over an
example from Robinhood and then go through an example of what a system Design question
might look like. It’s important to note that you will never be asked to code anything with regards
to this topic – there isn’t enough time and that’s what the SQL and python questions are for.
However, you might be asked to draw something out. The next section will have some practice
problems for you.
You’ll notice that Robinhood broke up their data infrastructure into multiple sections:
Ingestion, Storage, and Processing, Validation, and User Facing Layers. You can also see that
Robinhood uses a lot of the tools that we have already discussed: S3 to store their data and at one
point Redshift as the data warehouse, which you can see sits in the intersection of the Storage
and Query Layer. Let’s go over each layer to discuss the tools we haven’t discussed yet. More
importantly, pay extra attention to what might be asked in an interview that isn’t included in this
article.
Ingestion: This section is important because it contains a lot of new information, and how you
answer will be very dependent on the question that your interviewer asks. Robinhood points out
Kafka. Kafka is an open-source tool that lets you stream data in real time. Think of this as the
water pipeline in your home. When you turn on your faucet you expect water to stream through
immediately. This may not be a necessary requirement for every company, so you need to think
through your specific business use case. For example, a financial company that is running ML
models on historical data may only need to update their data once a day. For this use case, that
financial company might just get their data through an FTP server or have their data provider
drop the data directly into their S3 bucket.
The AWS equivalent of Kafka is Kinesis. Kinesis is also a real-time streaming tool, but the
difference is that it is in the AWS ecosystem and integrates well with other AWS tools (like S3).
Kinesis is quicker and easier to set up. So why did Robinhood choose Kafka? Because it's
cheaper! Sure, it takes longer and more engineering time to set up, but because Kafka is open
source and Robinhood has the budget for engineers, it was probably in Robinhood's best interest
to spend the time to benefit from the cost savings in the long run.
This is what the interviewer is looking for from you: they want to see you discuss your
rationale for using real time (business use case) and the tradeoffs between what you did
(Kafka) and didn't do (Kinesis).
Storage: This one is straightforward. When you think storage for raw data, think S3. Notice how
Robinhood even mentions Glacier, which we previously discussed. Robinhood also placed
Redshift in the intersection of Storage and Query, which is correct because if you recall, Redshift
has a storage and compute layer to it.
Processing: This is also an important section containing new things. Glue and Airflow are two
tools available on AWS that allow you to do ETL and create your pipelines. So, when you think ETL, most
likely there is some "processing" that is involved. To transfer data from your S3 bucket into
Redshift, you might need to get the data into the right format, as we discussed in the Compute
section. While Airflow and Glue are different tools, Amazon uses EC2 as the servers in the
background! Like the notes about the ingestion layer, your job is to discuss the tradeoffs between the
two (Robinhood happens to use both).
Glue is cheaper and "serverless", meaning you don't need to do any work configuring
how much compute power you are going to use. Airflow is an orchestration tool that visualizes
your entire pipeline. For example, if your pipeline has 10 steps and it fails on step 6, you can first
see where the error happened, then fix the error and rerun the rest of the pipeline starting
from the point of failure. Airflow is more expensive and requires a little more configuration. Either
way, chances are you will need to do some sort of ETL, and the important thing here is that you
mention this during the interview.
Query: The query layer is not something you need to worry about. It is simply setting up a tool so
that users can query databases. SQL Workbench is one example of an open-source tool, but most
likely your company will already have a tool.
Validation: The Validation Layer can occur in a lot of tools. Airflow, which was discussed in the
processing layer, is a common one. It's an important concept as we have discussed before – you want
to make sure you have checks to identify any problems with the data. You should mention that
somewhere in your design, but it is not necessarily its own section. It just happens along the way.
User Facing Layers: If you are not familiar with JupyterHub or Looker, don't worry. Like the
query layer, these are tools that are easy to set up and are simply ways for your coworkers (data
analysts and data scientists) to interact with your data warehouse. Interviewers probably won't ask
you about this, but it's good to know in case you want to mention it. Tableau dashboards are
another way for data scientists to show their work and analysis to the rest of the team.
2) Outline of a System Design Question
Not all companies do what I am about to show you, but I think it’s important that you get
familiar with at least the format of what a system design question can look like.
In the next section, there will be 2 system design questions. I want you to imagine that
you have a blank word doc with 4 quadrants and each quadrant has one of the following
categories: Notes, Schema Design, System Design, API.
For this book, don't worry too much about the API section. Data Engineers won't have to
worry much about it. It is essentially just "if the user asks us for x, what do we return to them?" In
the Robinhood example, a user might ask us for the price of Amazon stock, and we would return
them $130, for example.
What is important are the other three sections. When an interviewer asks you to design a
system (like Robinhood's), the first place you might head to is the Notes section. This is where
you ask your questions, take notes, write down the question that is asked, etc. Remember the tips
from section 5? Apply those tips! Make sure you understand the requirements before you start
drawing arrows like you see in the Robinhood article.
Once you have a clear understanding of the question being asked, head over to the
Schema Design section and start drawing out the tables you would want to include in your
system. Whether virtual or on a whiteboard, you won't need to draw out an entire star schema like
you did in section 4, but including some tables (both fact and dim) with the appropriate columns
and the logic to justify them will be more than sufficient.
Next is where you head into the System Design quadrant. This is where you draw
something along the lines of what you saw in the Robinhood example. Explain what data goes
into S3, how you used a real-time tool to get it there (if you needed to), and how it eventually got
into Redshift to create the tables that are written out in the Schema Design quadrant. Again,
don't worry about making it pretty! As long as you can explain your rationale for each part, the
interviewer will like what they see!
A) A very common question is the tinyURL problem. A tinyURL service is one where a user
submits a URL and gets back a shortened version of that URL. How would you design that
tinyURL service?
B) You work at a music streaming company called Stopify. You have access to all the data that
any music streaming service would have. With every song that a user listens to, the data scientists
on the team can use machine learning to learn quickly and give song recommendations.
However, they need the proper data and infrastructure to make their ML work. How would you
design a recommendation engine?
Note: if you bought premium package and do not have experience with AWS, contact the author
to receive an ETL project and get some hands-on experience with a bunch of AWS tools.
Section 12: Product Improvement Via Data Engineers
This section will contain generalized, real-life examples of how a good data
engineer can add tremendous value to the business. I want to stay away from specific companies
because the important thing is to see the patterns for how data engineers are crucial to
a company's mission, regardless of the industry.
Company TLQ is a manufacturing company for industrial tools that collect data and send it
back to a central server. TLQ has created these machines to also have sensors that collect data
such as voltage. When dealing with something like voltage fluctuations, you can imagine that
it is important this company delivers machines that are reliable. This company's dashboards
get updated every 3 days due to it not being able to process all the data quickly enough, but
management is worried that this isn't fast enough. The potential cons of not speeding this process
up are:
- Product recall of its machines, at a potential cost of millions
- A safety hazard for customers
- Employees having to go to multiple dashboards to track different factory productivities.
As a data engineer, you realize there is a better way! Using your knowledge from the system
design section and seeing that TLQ doesn't quite need real-time data, you realize that you can send
the data to S3 every hour and then use Glue to manipulate the data and ingest it into Redshift every
hour. Now that the data exists in Redshift and is partitioned and sorted by day AND hour, you can
write efficient queries to update your dashboards every hour. The company went from
having dashboards updated every 72 hours to every 1 hour, a 90%+ improvement! Now company
TLQ can catch product malfunctions before shipping their machines.
Whether it is Tableau, Looker, or QuickSight, companies rely on dashboards to be able to
visualize trends and patterns in real or near real time. Teams can then quickly make decisions
based on data. The more frequently the data gets updated, the better. With that said, real-time updates
are more expensive and not every company requires them. TLQ probably doesn't need to receive
data every minute stating "everything is functioning as normal". Every hour will probably suffice
since the machines might get shipped out once or twice a day. And this small change of updating
the dashboards every hour instead of every minute will end up being 60 times cheaper! (60
minutes every hour)
Automation (Data Quality)
A company called Auto has launched worldwide and, with that decision, now maintains
multiple websites (.com, .eu, .uk, etc.). Before launching worldwide, Auto was only in the US and
employee X was manually gathering the data related to the performance of the website and the
search engine. Using Excel, employee X was able to spend 30 minutes manually creating charts
and updating the data. However, after launching worldwide, employee X realized that this process
would now take him 2 hours a day. At first, employee X was able to focus extra hard and get this
done in closer to 1 hour. But due to the speed at which this employee was operating, a lot of
mistakes were being discovered. Sometimes files for a region didn't contain data. Other times,
there were duplicates, and a session_id (the session of the user browsing a website) appeared in
more than 1 region when in reality each region_id and session_id should be 1-1. Additionally,
after 1 month this was taking closer to 3 hours due to an increase in the website's popularity.
As a data engineer, you realized you could solve this problem using Airflow. Using
Python and APIs, you placed the raw data for each website in its own S3 bucket and then used
Airflow to create pipelines to merge all the data into 1 Redshift table. Using Airflow and Python
scripts, you created SQL-like checks to ensure the data quality is accurate. The following checks
ensure that A) there are actual values getting returned and B) there are no duplicates.
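Checks along these lines would do the job (a sketch – the table and column names are assumptions):

-- A) Actual values are getting returned for yesterday's load
SELECT COUNT(*) > 0 AS has_rows
FROM web_performance
WHERE load_date = CURRENT_DATE - 1;

-- B) No duplicates: each session_id appears exactly once
SELECT COUNT(*) = COUNT(DISTINCT session_id) AS no_duplicates
FROM web_performance
WHERE load_date = CURRENT_DATE - 1;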
The 2 queries above should always return True. Using Airflow, you can set up alerts that
get triggered and send the on-call engineer a notification if False was returned. Moreover, you
have written the ETL pipelines to account for any future regions that your company decides to
launch in. Thanks to your efforts, you have now helped your company automate hours of manual
reporting, catch data quality issues automatically, and scale the process to any new region.
Caching (Scaling)
A sports company, Jike, occasionally releases limited edition sports gear – anything
from shoes to jerseys. When they launch new releases, the website tends to lag in the first ten
minutes due to the immediate influx of superfans going to visit the page. Due to the popularity of
certain players at any given time of the year, it is incredibly hard to predict exactly how much
traffic will come to the site. All that is known is that there will be a spike in traffic.
The data engineer on the team decides that caching the data is the best solution. Using
Redis and caching the data related to the product being released, the data engineer can account
for 98% of the traffic on any given day. The outcome is that Jike realizes that they have been
paying for unutilized resources, which leads them to consolidate resources and in turn:
- Save money on resources overall
- Scale the website without having to predict traffic spikes
- Provide an incredible customer experience.
Being the good data engineer that you are, you not only have set up a caching mechanism for
the next popular item to go on sale, but you have also set up a way for software engineers to handle
future cases in code. That way, if a basketball player has a record night during a professional game
and their sales unexpectedly start to spike – code can automatically cache that webpage using the
Redis system you have set up for them.
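A minimal sketch of what that caching layer might look like with the redis-py client (the key names and 10-minute TTL are assumptions):

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def get_product(product_id, fetch_from_db):
    """Return product data from the cache if present, otherwise fall back to the database."""
    key = f"product:{product_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit - no database trip
    product = fetch_from_db(product_id)      # cache miss - hit the database once
    r.setex(key, 600, json.dumps(product))   # keep it warm for 10 minutes
    return product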
A restaurant company called Johnny’s has been recently scaling and opening multiple
stores across the country. However, management is worried that as they scale, the quality of their
food will decline. Executives have decided to invest in some machine learning to ensure the
quality of this good. Specifically, they want to use Machine vision to see if the food is of the
highest quality. They tested the product in 1 store and found that the technology worked! The
models were streaming data into S3 and occasionally, an engineer in the office would manually
run the python script to train the model to become smarter.
After launching however, executives realized that they made a crucial mistake – they
didn’t account for scale. They thought that since it worked for 1 store, it would work for all of
them. Instead what they found was that there was an increased cost to run an ML for every store,
a lot of technical debt from the fact that this ML model was very scrappily put together for one
store, and a higher security risk and each store represented a different engineer potentially
overwriting data in the S3 buckets.
You decide to come in and use your knowledge of S3, Airflow, and SageMaker to fully
automate this process. SageMaker is a plug-and-play environment that allows you to
productionize your machine learning models and deploy them at scale. With your knowledge of S3
and data lakes, you can ensure via Python script labeling that each store will get its own store_id
and no overwriting will occur. Finally, using Airflow, you can schedule the machine learning to
run as often as you want, and now the ML model is returning results faster and with more
accuracy. You have now saved engineers from doing manual runs, helped the model learn faster
(ultimately saving on tech debt), and ensured that there is data integrity with regards to
storing the data in the data lake.
PART 5: Algorithms
Section 13: Algorithms Via Python
Another important part of the data engineer interview is your Python skills. For this
section, go back to coderpad and this time select Python from the drop-down menu. Typically,
companies don't care whether you use Java, Python, or some other coding language, but for data
scientists and data engineers, Python is normally the most popular one. The Algorithms section
is like the SQL section in that they are testing your Python knowledge, thought process,
communication, and computer science fundamentals.
If you are unfamiliar with Python, I recommend Codecademy as they have plenty of free
material. You may have also heard from your fellow engineers about the holy grail of
Software Engineering books – Cracking the Coding Interview. I think these resources are
amazing, but remember that there is a lot of material in these resources that was meant for
software engineers and frankly is a bit overkill when it comes to studying for a Data
Engineering interview. There is nothing wrong with learning a bit of Python now, and then
mastering it on the job. Most tech companies are understaffed when it comes to junior engineers,
so they do not mind taking on this risk as long as you show you have some knowledge and
understanding of how coding works.
Another thing to keep in mind is that for data engineers, companies will typically ask
"easy" Python questions and very rarely ask "medium" difficulty level questions. The medium
level questions are normally for more senior level engineers, and "hard" is almost always for
software engineers. This is another reason I do not want you to just search for Python problems
online – you will just end up overstudying. With that, let's go over some tips, some computer
science fundamentals, and finally some practice problems.
Tips
The same tips from the beginning of Section 5: Metrics Via SQL absolutely apply. I will briefly
go over them again here, but reread that section.
● Never assume! Ask questions and ask for clarifications on what they are asking for – they
are normally vague on purpose.
● Talk out loud! The same concept applies – the interviewer is your friend and wants you to
do well, but he/she cannot read your mind.
● Test iteratively! The worst thing you can do is write out your entire code only to figure
out there are 10 syntax errors and then have to spend half your time finding them; avoid this by
testing bit by bit.
● Clarity! Like the SQL section, make sure your code follows common Python
standards and is easy to read. Make the life of the interviewer easier and slow down to
make sure the code is readable and co-worker friendly. When in doubt, use comments.
There is one part of the Python interview that differs from the SQL interview questions –
edge cases. Edge cases are scenarios in your Python questions that are rare but still essential to
testing your logic and coding skills. My advice here is to acknowledge edge cases in the "never
assume" portion of your interview, and then come back to them at the end after you have a
working solution for most of the scenarios. Sometimes time might run out, and while the
interviewer might deduct some points for not getting to the edge case, it is not nearly as bad as if
you had started with the edge case.
Time Complexities: In computer science, the time complexity of an algorithm (also known as its computational complexity) describes the amount of computer time it takes to run. In other words, it is a way to measure an algorithm's efficiency, and it is written in "Big O notation". Some of the common time complexities are O(n!), O(n^2), O(n), O(log n), and O(1). Look at the image below; if there was an algorithm that ran at each complexity shown, which one do you think would be considered the most efficient?
If you answered O(1), that is correct. This one is known as constant time. In other words, whether your input is 1 number long [1] or 100 numbers long [1, 2, …, 100], the compute needed is still the same. Don't worry, you will see examples in the following section.
Typically, software engineers or data scientists are the ones asked to optimize for time complexity, but as a data engineer you will stand out if you are able to answer these questions as a bonus part of your Python interview (and as a regular part of your interview if you are applying for a more senior role).
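To make this concrete, here is a small sketch (my own example, not one of the book's exercises) contrasting the two extremes. Checking membership in a Python set is O(1) on average no matter how large the input gets, while scanning a list is O(n) because Python walks it element by element.
numbers_list = list(range(1_000_000))
numbers_set = set(numbers_list)

# O(n): walks the list until it finds a match (or reaches the end)
print(999_999 in numbers_list)

# O(1) on average: hashes the value and jumps straight to it
print(999_999 in numbers_set)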
Data Structures
List
Lists are mutable data structures in Python that are defined by brackets. ['tree', 'dog', 7, '2345'] is a list – notice that a list can contain different data types.
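Here is a quick sketch of what "mutable" means in practice – you can change a list in place after it has been created:
my_list = ['tree', 'dog', 7, '2345']
my_list.append('cat')    # add a new element to the end
my_list[1] = 'fish'      # replace an existing element in place
print(my_list)           # ['tree', 'fish', 7, '2345', 'cat']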
Array
An array is like a list, but it contains values of the same data type. You can also index into arrays and do math with the values, as in the example below.
array1 = [1, 2, 3, 4]
array2 = [1, 2, 3, 4]
print(array1[1] + array2[2])
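Note that the snippet above adds two individual elements (array1[1] is 2 and array2[2] is 3, so it prints 5). Plain Python lists do not support element-wise math – adding two lists concatenates them. If you want true array math, the usual tool is NumPy; a minimal sketch, assuming NumPy is installed:
import numpy as np

array1 = np.array([1, 2, 3, 4])
array2 = np.array([1, 2, 3, 4])
print(array1 + array2)   # element-wise addition: [2 4 6 8]
print(array1 * 2)        # multiply every element: [2 4 6 8]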
Dictionary
When you think dictionary, think of "key: value" pairs. A dictionary can be created using the code below. If I wanted to get the year, I would run print(thisdict['year']).
thisdict = {
"item": "home",
"type": "condo",
"year": 1964
}
Set
A set is an unordered collection of unique values, defined with curly braces – for example, {'tree', 'dog', 7}. Duplicates are automatically removed, and membership checks (value in my_set) are fast.
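A short example of set behavior:
my_set = {'tree', 'dog', 'tree', 7}
print(my_set)            # the duplicate 'tree' is dropped (order is not guaranteed)
print('dog' in my_set)   # fast membership check -> True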
Stack
When you hear the term stack, think LIFO (Last In, First Out) and a stack of plates. Like with a stack of plates, the idea is that you can push a value onto the stack (place a plate on top) and pop a value off the stack (remove the top plate) – LIFO!
stack = []
stack.append('a')
stack.append('b')
stack.append('c')
print('Initial stack')
print(stack)
# pop the value
print(stack.pop())
print('Stack after pop')
print(stack)
Queues
When you hear the term queue, think FIFO (First In, First Out) and a line of people. Like with a line of people, the idea is that you can enqueue a value (a person joins the end of the line) and dequeue a value (the person at the front of the line leaves) – FIFO!
queue = []
# Add elements to the queue
queue.append('1')
queue.append('2')
queue.append('3')
print("Initial queue")
print(queue)
# Dequeue the first element (FIFO)
print(queue.pop(0))
print(queue)
Hash Table
A hash table is a data structure that stores values with keys associated with them. Because of this, lookups are efficient if we know the key associated with the value (for example, using Python's in operator, since Python dictionaries are implemented as hash tables). You will see a good example in the following section.
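Since Python's dict is backed by a hash table, a tiny sketch of that fast lookup looks like this (the values here are made up purely for illustration):
seen = {}          # key -> value
seen[6] = 6
seen[5] = 5
print(5 in seen)               # average O(1) lookup by key -> True
print(seen.get(9, 'missing'))  # safe lookup with a default -> 'missing'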
Exercises
Given a string, write a function to count the occurrences of each word in that string.
Return the answer in the form of a dictionary. Don’t worry about time complexity for this one.
Input: 'I hope you are enjoying this book. Feel free to email me any suggestions for how to
improve this experience.'
Output: {'free': 1, 'Feel': 1, 'for': 1, 'to': 2, 'I': 1, 'experience.': 1, 'suggestions': 1, 'book.': 1, 'any':
1, 'this': 2, 'are': 1, 'how': 1, 'me': 1, 'you': 1, 'enjoying': 1, 'email': 1, 'hope': 1, 'improve': 1}
Solution #1:
def word_counter(text):
    counts = dict()
    words = text.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

print(word_counter('I hope you are enjoying this book. Feel free to email me any suggestions for how to improve this experience.'))
If you haven't already, run this code on CoderPad and look at the output. An important thing to note here is that we were asked to return a dictionary, which is why dict() was used on line 2.
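As a side note – not something the interviewer will require, but worth knowing – Python's standard library has collections.Counter, which does the same word counting in one line:
from collections import Counter

sentence = ('I hope you are enjoying this book. Feel free to email me any '
            'suggestions for how to improve this experience.')
print(dict(Counter(sentence.split())))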
Question 2: Two Sum
Given an array of integers and a target integer, return the pairs of numbers that add up to the target integer.
You may not use the same element twice in one solution – i.e., if the target is 4 and there is only one '2' in the array, then [2, 2] is not a valid solution.
Input:
Array: [6, 5, 9, -2, 13, 10]
Target: 11
Output:
(6, 5)
(-2, 13)
Solution #2a: Two Sum
Right off the bat you should have noticed something nice I did for you…I answered an
assumption for you when asking the question! Remember, it is not guaranteed that the
interviewer will tell you not to assume something. In fact, chances are they won’t because they
want to see if you can stop, take a breath, and ask that question yourself.
The solution below is probably closer to what you initially did. This gets you the right
results! However, this function is at a O(n^2) complexity. Why? Because as we can see, for the
length of the array [len(array)] we are iterating through the array twice. For each i, we iterate
through a j and do the sum. This grows at a power of 2 as the length of the array gets longer. Is
there a way to do this at O(1) time complexity?
for i in range(len(array)):
    for j in range(i + 1, len(array)):
        # if both elements add up to target, print both elements
        if array[i] + array[j] == target:
            print(array[i], array[j])
The solution below runs in O(n) time – we only iterate through the array once, taking advantage of a hash table (whose lookups are O(1)) to store the values we have already seen, so we don't have to iterate through the array again. Test it out and print the hash table along the way.
array = [6, 5, 9, -2, 13, 10]
target = 11
hashTable = {}
for i in range(len(array)):
    complement = target - array[i]
    if complement in hashTable:
        print(array[i], complement)
    hashTable[array[i]] = array[i]
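If you want to present this as a reusable function, one possible way to package it is shown below (my sketch, not the book's official solution). Note that a plain set is enough here, since we only ever check membership and never read the stored value:
def two_sum_pairs(array, target):
    seen = set()
    for value in array:
        complement = target - value
        if complement in seen:
            print(value, complement)
        seen.add(value)

two_sum_pairs([6, 5, 9, -2, 13, 10], 11)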
Python-SQL Like Prep
If you went through the problems at the beginning of this chapter, you can see that algorithms are important to study because companies want to know you understand the fundamentals. However, if you are early in your career, companies might mainly care whether you know Python. The way they test this is by asking SQL-like problems that you need to answer in Python. This means being able to do group bys, joins, unions, and basic arithmetic using pandas. Remember, the tips at the beginning of this section still apply.
Exercise
1) File order_data_2019
columns: order_date, customer_order_id, order_quantity, order_price, product_code
2) File order_data_2020
columns: order_date, customer_order_id, order_quantity, order_price, product_code,
product_category_code
3) File order_data_2021
columns: order_date, customer_order_id, order_quantity, order_price, product_code,
product_category_code, mobile_device
Use Python (or a programming language you prefer) to combine the 3 files and do the following:
First, return the following columns in 1 dataset: order_date, order_quantity, order_price
Next, get the total order_quantity and total order_revenue (quantity * price) at a quarterly level
Solution
import pandas as pd

# Read in each csv file and make sure to use your specific path
data_2019 = pd.read_csv('/Users/chrisgarzon/Documents/order_data_2019.csv')
data_2020 = pd.read_csv('/Users/chrisgarzon/Documents/order_data_2020.csv')
data_2021 = pd.read_csv('/Users/chrisgarzon/Documents/order_data_2021.csv')

# Union the 3 files and keep only the columns we need
data = pd.concat([data_2019, data_2020, data_2021], ignore_index=True)
data = data[['order_date', 'order_quantity', 'order_price']]

# Aggregate quantity and revenue at a quarterly level
data['order_date'] = pd.to_datetime(data['order_date'])
data['quarter'] = data['order_date'].dt.to_period('Q')
data['order_revenue'] = data['order_quantity'] * data['order_price']
df_rev = data.groupby(['quarter']).agg({'order_quantity': 'sum', 'order_revenue': 'sum'})
print(df_rev)
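If you prefer not to repeat the read_csv call three times, a small variation (the file paths are just examples) reads the files in a loop before concatenating. It is also worth pointing out to the interviewer that the three files do not share identical columns – pd.concat simply fills the missing ones with NaN, which is fine since we only keep three columns anyway:
import pandas as pd

paths = ['/Users/chrisgarzon/Documents/order_data_{}.csv'.format(year)
         for year in (2019, 2020, 2021)]
data = pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)
data = data[['order_date', 'order_quantity', 'order_price']]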
For More Problems Like The Ones Above and 1-1
Help Solving Them, Sign Up For Our Coaching
Package
PART 6: Behavioral
Section 14: Most Common Behavioral Questions
Companies ask a variety of behavioral questions. However, they all tend to fall under the same umbrella: companies gauge your responses to see if you would be a good cultural fit. A "good cultural fit" is someone who aligns with the values and principles that a company has made public. Amazon is a really good example of a company that has shared its leadership principles and does a really good job of asking questions related to them. For this section we will go over some of the questions typically asked and how to best answer them – don't forget that these concepts apply to any company!
1. Bias for action: Tell me about a time you took a risk at work.
This question is testing to see if you take smart and well-thought-out risks. Amazon has a phrase that a lot of other companies have adopted – a "one-way door" vs. a "two-way door". A one-way door is a decision that is hard (if not impossible) to reverse, as opposed to a two-way door, which is a decision that is not written in stone. This is part of risk management, and it is where the bias for action part comes in.
Speed matters in business, and companies want to know that you can move fast but intelligently. Additionally, they will look for the reasons why you chose this risk and not another. Saying that it is a two-way door is great, but what else? Another thing they want to hear is your business knowledge – what are the inputs and outputs? For example, if you say that the input was 100 hours of engineering time and a cost of $1 million in AWS resources, but the output only yields $100k in annual revenue, the company probably won't think you assessed the risk appropriately.
2. Disagree and Commit: Tell me about a time you disagreed with a coworker/manager.
Let me start off with how not to answer this question. DO NOT talk negatively about anyone else. You can disagree with someone and still be respectful. You can say, "I understand why they thought this, and I spent time understanding their side even better, but according to my analysis on ABC, I thought this was the way to go."
What companies want to see is how you go about arguing your side. There is nothing wrong with disagreeing with someone if you do it respectfully and with logic. By logic, I mean data and research. Saying that a method "just feels right" won't cut it.
Finally, there will be times when, even if you are right, your manager might not yet trust you and might insist on doing it their way. This is where the "commit" part comes in. Think back to a time when you went ahead and committed, only to later get more information and proof that your method might have been right. Ideally, you used this opportunity to revisit your initial suggestion and pivot to try something new.
3. Ownership: Tell me about a time when you went above and beyond your job responsibilities to help the company.
Leaders lead by example. Talk about a time when someone on your team couldn't accomplish something. Maybe they were sick, had a lot on their plate, or simply didn't know how. You stepped up and got the job done. Ownership is about going beyond what you are expected to do and never saying "that is not my job."
Talk about how you arrived at this decision. Did you talk to your manager? If yes, was it something you brought up? Ideally this was not something your manager had to tell you to do, but rather something where you took the initiative after seeing something that struck you as worrisome. Why this task over others? Ideally, the task you decided to pick up was either urgent or something you knew you could do rather quickly.
4. Dive Deep: Tell me about a time you re-designed a process and why.
To redesign a process, one needs to understand the intricacies of that process first. Companies LOVE to hear that you dove deep. Don't be afraid to talk about times when you had to spend weeks reading through boring queries or documentation to figure out something that no one else on the team had noticed. Mention what the old design was and how your newly proposed design was going to be better, and be ready to go into details if needed.
5. Prioritization: Tell me about a time when you had two deadlines at the same time. How did you manage the situation?
This is a classic prioritization question. How does this candidate manage his/her time? How does the candidate decide which task should be completed first? What interviewers want to see here is that you weighed the tasks given many inputs – maybe one input was the urgency management was showing. Maybe another factor was that one task was something you already knew how to do quickly, and you knew you could ask a co-worker for help on the other task because they could do it relatively quickly. The trick here is to mention that you took 10 minutes to really think about which task you should spend your time on. Did you hold a prioritization meeting to hear other people's input? Was one task more short-term while the other was long-term? Was it possible to push back the deadline for one of the tasks?
The way NOT to answer this question is to say that you pulled an all-nighter to finish both. While this isn't the worst answer in the world, it is also what leads to burnout. Companies want to know that they're hiring someone who can manage their time well enough to not get burnt out and end up as a churned employee. Pulling an all-nighter is not sustainable, and frankly companies don't want to be known as a place where all-nighters are expected – otherwise they will be labeled as toxic.
6. Why do you want to work at this company?
This last one is simple – be authentic! Tell them why you want to work there. It helps if you use the product, but also do some research on the culture. What excited you about the company when you read about it online? Maybe mention something about your personality – "I want to join this company because I am into startups but also want to be surrounded by experienced engineers. This company seems to have the best of both worlds; the team I am applying for has a startup vibe, but company XYZ has the resources of a much bigger company."
This question is potentially tricky in that you can absolutely say the wrong thing – similar to how you shouldn't talk about your co-workers in a bad way, don't talk about your previous employer or past internship in a negative light either. Don't say it's because of the money either – companies already know this! Find something that is truly you and don't be afraid to stand out!
STAR Method
As you reread the questions and advice above, another thing to keep in mind is the STAR method. Aside from answering the question in the best way you can, your communication matters. The STAR method is a tactic that helps you streamline your thoughts and give your interviewer a high-level response. It stands for:
Situation: Describe the situation at a high level. This is like developing the plot of a movie. Describe who is involved, what the dilemma is, and what your team is trying to accomplish.
Task: What goal were you working towards, and what was your individual task?
Action: Describe the actions that YOU personally took. This is no time to be humble; make sure you use the word "I" to describe the initiative you took to accomplish the task at hand.
Result: Describe the outcome of your actions. If you can quantify it, even better! "Because of my ETL pipeline automation skills, I was able to reduce the amount of time my co-worker spent on manual tasks by 90%."
Each section should be 2-3 sentences long. Let the interviewer digest the story you're telling and let them ask you follow-up questions from there. As someone who has interviewed thousands of candidates, I can tell you firsthand that there is nothing more frustrating than hearing someone ramble on for 20 minutes, only to then have to ask 20 clarifying questions.
Practice writing your answers to some of the questions above. Remember, each should come out to about 8-12 lines. If you bought the premium package and want me to take a look, feel free to email your answers to me!
Sample Questions and Answers
Below are some sample questions and answers. See if you can come up with your own examples following the STAR method, and notice the patterns – concise and quantifiable stories with a brief overview at the beginning.
Tell me about your most challenging project ////// What did you do when you needed to motivate a group of individuals or promote collaboration on a particular project?
[S] At XYZ my most recent project was evaluating churn rates in our Pink Program. I did a deep dive and realized that there was a high number of payment failures causing users' subscriptions to fail when they were supposed to auto-renew.
[T] From there, the goal was to build a dataset to get more insight and eventually predict ahead of time who was going to have a payment failure.
[A] This proved harder than I thought because a) it seemed like no one knew where to get payment-related data, and b) the data that I did find seemed very unreliable. I eventually found a few people, by going to random office hours, that I thought could help. I organized a meeting between 3 different teams to align on a source of truth and to update stakeholders on this payment failures dataset.
[R] Finally, the result was that we were able to build a dataset that allowed the science team to a) figure out what the issue was and b) predict failures and build products to help alleviate the payments issue. This led to a massive increase in first-month retention, from 65% to 77%.
Tell me about a time in which you worked under a tight deadline // How have you leveraged data to develop a strategy?
[Situation] At XYZ we were in the process of launching Growth Planner Score – our ML model that would generate a 90-day sales prediction for every ASIN on a weekly basis.
[Task] As the DE, my task was to build the entire pipeline, create the features for these models, and set up jobs to run at worldwide scale – scale meaning 2 billion+ ASINs. For me, success meant 1) making sure the jobs ran in under 8 hours and 2) keeping the costs under our budget of $3k a month. As a team, success meant 1) getting high ASIN coverage – being able to generate a score for 90% of all active ASINs – and 2) having high accuracy, precision, and recall when it came to our predictions.
[Action] Our ML model took into account variables such as whether an ASIN had adopted a marketing program such as ads, deals, and A+. I needed to understand what the source of truth was for each of these teams so we could ingest their data, but one of the teams we were working with, the coupons team, wasn't moving as quickly as we would have hoped. I found out their only DE had just taken off for 3 weeks. So my ownership kicked in, and I reached out to a friend who knew someone who used to work on the Coupons team and asked him to point me in the right direction.
[Result] As a result, I was able to get the data I needed through a combination of our data catalog, S3 buckets, and our Redshift cluster. The pipeline runs in 50% of the time allotted and under budget. Specifically, the coupons team was able to use our score to recommend which items NOT to put a coupon on – this reduced their rate of coupons with $0 in sales from 50% to 30% in just one month – it was amazing.
When did you take a risk, make a mistake, or fail? How did you respond, and how did you grow from that experience?
[S] I was working at the startup, and one of the portfolio managers wanted bond data – data that was very expensive to buy from Bloomberg.
[T] My task was to talk to the Bloomberg reps and figure out the costs, how the data transfer would work, what we would get, etc.
[A] That said, I knew the data was publicly available online, and I knew I had the skills to just scrape it. Now, I had other tasks to complete, so I didn't want to waste too much time doing this, but I figured it would only take me half a day.
[R] It ended up taking me 3 days; on top of that, the scraper was very buggy, the data was more unreliable than I thought, and the website metadata kept changing, which would break my code. Just trying to fix this would have meant spending 1, maybe even 2, more weeks – something that would have pushed back our other tasks.
I learned that next time I should spend 10-20 more minutes scoping, and maybe quickly run the idea by my manager first. The good news is that this bias for action didn't cost us anything financially and didn't derail any existing projects, but I could have done a better job of scoping.
[S] One time I had a 1-1 with a Portfolio Manager, just to get to know each other better, but she started telling me about her day-to-day, and I specifically remember one task she was doing manually every day that would take her 30-60 minutes and was getting frustrating. I scheduled a follow-up to learn more about this effort.
[T] At the time I had other work scheduled on my sprint, but from first impressions it sounded like something I could help her with relatively quickly.
[A] I quickly ran it by my manager and asked him to give me 2 hours to help her automate this reconciliation process – I created an S3 bucket where she could drop off a file, which then kicked off a Glue pipeline that would ETL her work.
[R] This got her work 90% automated, but I went one step further by figuring out how to automate the final 10% – pulling attachments from emails and sending an email automatically when a file dropped into S3. This resulted in that process being 100% automated.
Tell me about a time when you were faced with a problem that had a number of possible solutions. What was the problem, and how did you determine the course of action? What was the outcome of that choice? /// Something you're proud of?
I have a lot of experience with AWS – I was responsible for building the entire data infrastructure at the startup.
[S] When I first joined the startup, the team was simply running scripts manually on their computers and saving files onto an EC2 server. This needed to be ETLed into a data warehouse – Redshift – but I needed to decide what the best tools to use were.
[T] I took on the role of moving all of that to the cloud and creating the pipelines.
[A] I dug into all the systems and processes the startup was running and automated everything – worked with third-party data vendors to receive data on an FTP server, used Python scripts within Lambda functions to pull that data into our data lake, decided whether to use Airflow or Glue jobs for our ETL process (and Spark to speed up jobs the quants were running), and finally got everything neat and organized in our data warehouse.
[R] I was most proud of the fact that I was able to get 100+ tables into our data warehouse from 20 different pipelines in about 6 months – those tables were utilized every week and drove a lot of investment decisions that improved our strategy by 20%, all while keeping our costs close to 0, whereas Airflow may have cost us $5k a month.
If after reading this book you feel like you could benefit from 1-1 mentorship with a FAANG
engineer, visit Data Engineer Academy
Remember that A LOT of companies have some sort of educational budget, so you can most
likely expense the mentorship package (or part of it) through your company. If you need me to
talk with someone at your company or give you a sample invoice, just email me. Even if the price is too steep, reach out to me about payment plans and/or a "pay after you get the job" plan.
© 2024 Data Education Holdings LLC. No part of this book may be reproduced or shared, in any form or by any means, without the prior written permission of the author. All rights reserved.