Practice Use Case

Practice Use Case
In this Chapter you will practice applying what you learned in the previous lessons in the context of a use
case working with customer data. Before you begin working on this lesson, read the following background
information about the Customers dataset and use case. Once you have completed the use case, you can
find the steps we took and the results produced with in the following pages. This example is meant to test
what you have learned thus far. When you feel comfortable with this use case example, we recommend
you challenge your wrangling skills by taking our Certification Exam.
Before getting started, download the following datasets:

• Use Case Data
Example
You work for a technical services company that sells three monthly subscription products:
• Silver (price: $9.99/month)

• Gold (price: $14.99/month)
• Platinum (price: $29.99/month)
The company occasionally offers promotional discounts – therefore, some product prices may be slightly
lower than those listed above. Your overall goal is to provide an analysis of sales activity by zip code.
In order to do this, you’ll need to join your Lab_customers dataset (where zip code information resides)
with sales data from your transactions datasets. Once you’ve joined the data, you’ll need to aggregate the
results.
Create a new flow

Add Following Datasets:
• lab_customers.csv
• lab_2010_transactions.csv
• zip_to_state_map.csv
Add recipe to Lab_Customers

Add a Recipe to the Lab_Customers dataset. Edit the Recipe to enter the Transformer Grid
Filter out faulty data

Remove the contacts who are not actual customers: We identify actual customers by the start_date
column.
Provide a default end_date value for active customers

For those current customers without an end date, set the default end date to January 01, 2050.
Fix the start_date data type mismatches

Notice that the start_date column contains in two different formats. You want all of the dates in this
column to conform to the format ‘yyyy/MM/dd’.
Lab_Transactions
Add a Recipe to the Lab_2010_transactions dataset. Edit that Recipe to enter the grid
Remove dummy data

Since this company sells online subscription products, the website sends a test transaction of $0.01 to
validate that a customer’s credit card is valid. Test transactions are recorded in the database and appear in
the transactions file. You need to remove these transactions so they don’t distort your analysis.
Clean up Discount Column

Remove the % from the Discount column, then convert the column to a decimal. Fill in the missing values
with 0.
Standardize product names

After 2011, the company renamed the subscription products. To maintain consistency between the
transactions files from each year, you need to replace the old product names with the new product names.
• Change ‘basic’ to ‘silver’

• Change ‘deluxe’ to ‘gold’
Determine the real price for each product

The real price for each product is a combination of the list price and the discount. Create a new column
that contains the actual charge for each transaction.
Union transactions datasets

Union 2010 – 2015 datasets. We want to apply the above transforms to all data, make sure your recipe step
is added to the beginning of your recipe, so that all the steps in your recipe are applied to all of the
unioned datasets.
• 2011_Transactions
Join with the Customers dataset

Join transaction data with the customer data, keep the following columns.
• Customer_id (2010_transactions)
• transaction_date (2010 Transactions)
• adjusted_price (2010 Transactions)
• product (2010 Transactions)
• address_state (Customers)
• address_zip (Customers)
• region (Customers)
• start_date (Customers)
• end_date (Customers)
Lookup state by zip

Scroll to the address_state column and notice that there are some blank records. Perform a lookup on the
address_zip column to zip_to_state_map.csv. This Lookup file contains a state mapping for each zip code.
Rename column3 to State and delete Address_state
Filter transactions based on date

Only keep transactions that have occurred during the course of a customer’s active subscription (i.e. the
transaction_date is before the end_date
Create a new table with aggregated results

Create a table that lists the count, sum, min and max of adjusted price, grouped by address zip.
Run the job at scale

Run job and view job results.
Solution
Solution Guide

Practice Use Case

Uploaded by

Copyright:

Available Formats

Practice Use Case

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Practice Use Case

Uploaded by

Copyright:

Available Formats

What is the overall goal of the analysis?

What is the overall goal of the analysis?

What steps are taken to clean up and standardize the transaction data?

What steps are taken to clean up and standardize the transaction data?

Practice Use Case

Before getting started, download the following datasets:

• Silver (price: $9.99/month)

Create a new flow

Add recipe to Lab_Customers

Filter out faulty data

Provide a default end_date value for active customers

Fix the start_date data type mismatches

Remove dummy data

Clean up Discount Column

Standardize product names

• Change ‘basic’ to ‘silver’

Determine the real price for each product

Union transactions datasets

Join with the Customers dataset

Lookup state by zip

Filter transactions based on date

Create a new table with aggregated results

Run the job at scale

You might also like