Practice Use Case

Download as pdf or txt
Download as pdf or txt
You are on page 1of 3
At a glance
Powered by AI
The key takeaways are to practice data wrangling skills by cleaning customer and transactional data, joining the datasets, and creating an aggregated results table to analyze sales activity by zip code.

The overall goal is to provide an analysis of sales activity by zip code by joining customer data with transaction data from multiple years and aggregating the results.

The steps taken to clean up and standardize the transaction data include removing test transactions, standardizing product names, calculating the real price after discounts, and filtering transactions within customers' active subscription periods.

Practice Use Case

In this Chapter you will practice applying what you learned in the previous lessons in the context of a use
case working with customer data. Before you begin working on this lesson, read the following background
information about the Customers dataset and use case. Once you have completed the use case, you can
find the steps we took and the results produced with in the following pages. This example is meant to test
what you have learned thus far. When you feel comfortable with this use case example, we recommend
you challenge your wrangling skills by taking our Certification Exam.

Before getting started, download the following datasets:


• Use Case Data

Example
You work for a technical services company that sells three monthly subscription products:

• Silver (price: $9.99/month)


• Gold (price: $14.99/month)
• Platinum (price: $29.99/month)

The company occasionally offers promotional discounts – therefore, some product prices may be slightly
lower than those listed above. Your overall goal is to provide an analysis of sales activity by zip code.

In order to do this, you’ll need to join your Lab_customers dataset (where zip code information resides)
with sales data from your transactions datasets. Once you’ve joined the data, you’ll need to aggregate the
results.

Create a new flow


Add Following Datasets:

• lab_customers.csv
• lab_2010_transactions.csv
• lab_2011_transactions.csv
• lab_2012_transactions.csv
• lab_2013_transactions.csv
• lab_2014_transactions.csv
• lab_2015_transactions.csv
• zip_to_state_map.csv

Add recipe to Lab_Customers


Add a Recipe to the Lab_Customers dataset. Edit the Recipe to enter the Transformer Grid

Filter out faulty data


Remove the contacts who are not actual customers: We identify actual customers by the start_date
column.

Provide a default end_date value for active customers


For those current customers without an end date, set the default end date to January 01, 2050.

Fix the start_date data type mismatches


Notice that the start_date column contains in two different formats. You want all of the dates in this
column to conform to the format ‘yyyy/MM/dd’.

Lab_Transactions
Add a Recipe to the Lab_2010_transactions dataset. Edit that Recipe to enter the grid

Remove dummy data


Since this company sells online subscription products, the website sends a test transaction of $0.01 to
validate that a customer’s credit card is valid. Test transactions are recorded in the database and appear in
the transactions file. You need to remove these transactions so they don’t distort your analysis.

Clean up Discount Column


Remove the % from the Discount column, then convert the column to a decimal. Fill in the missing values
with 0.

Standardize product names


After 2011, the company renamed the subscription products. To maintain consistency between the
transactions files from each year, you need to replace the old product names with the new product names.

• Change ‘basic’ to ‘silver’


• Change ‘deluxe’ to ‘gold’

Determine the real price for each product


The real price for each product is a combination of the list price and the discount. Create a new column
that contains the actual charge for each transaction.

Union transactions datasets


Union 2010 – 2015 datasets. We want to apply the above transforms to all data, make sure your recipe step
is added to the beginning of your recipe, so that all the steps in your recipe are applied to all of the
unioned datasets.

• 2011_Transactions
• 2012_Transactions
• 2013_Transactions
• 2014_Transactions
• 2015_Transactions

Join with the Customers dataset


Join transaction data with the customer data, keep the following columns.

• Customer_id (2010_transactions)
• transaction_date (2010 Transactions)
• adjusted_price (2010 Transactions)
• product (2010 Transactions)
• address_state (Customers)
• address_zip (Customers)
• region (Customers)
• start_date (Customers)
• end_date (Customers)

Lookup state by zip


Scroll to the address_state column and notice that there are some blank records. Perform a lookup on the
address_zip column to zip_to_state_map.csv. This Lookup file contains a state mapping for each zip code.
Rename column3 to State and delete Address_state

Filter transactions based on date


Only keep transactions that have occurred during the course of a customer’s active subscription (i.e. the
transaction_date is before the end_date

Create a new table with aggregated results


Create a table that lists the count, sum, min and max of adjusted price, grouped by address zip.

Run the job at scale


Run job and view job results.

Solution

Solution Guide

You might also like