Singapore dbt Meetup


Welcome

Singapore dbt Meetup


Hosts: Jing Yu Lim & Michael Han

Join dbt community Slack


getdbt.com/community
#local-singapore
MC/Host: Jing Yu Lim
Spenmo
She/Her

“I love potatoes and leading commas.”

jingyu.lim@spenmo.com
linkedin.com/limjingyu
Join dbt Community Slack
getdbt.com/community

#local-singapore
1. Go to getdbt.com/community & add your info
2. Click on the invite link and follow the instructions
3. Create a Slack account! (Available for Android & iOS)
4. Review and acknowledge the Community rules

Join the #local-singapore channel to chat with us!

If you are unable to join - please email anna.filippova@dbtlabs.com for an invite


New to dbt Slack?

Got a question?
Use a ❓ emoji before your message, e.g. ❓ Why is the sky blue?

I have the same question!


If you see another question that you want answered, add a ➕
reaction — this helps us prioritize questions.
We 🧡 threads!
My question got answered:
If your question gets solved (e.g. someone in the thread answers it
for you), react with a ✅ emoji.

I’m ready to interact!


Awesome! Non-question banter is strongly encouraged — use all
the reactions and gifs you want!

KEY THINGS TO REMEMBER:


Be Respectful and use Slack threads
Icebreaker - What’s your favourite data meme/ joke?

Share your answer in #local-singapore!
Agenda

● 18:00 - Kick-off, food and networking


● 19:15 - Presentation by Chin Hwee
Your data is not static: Snapshots and incremental models for change
data capture

● 19:45 - Presentation by Pia and Michael


Implementing a Data Observability Initiative: a practical case with dbt

● 20:15 - Presentation by Thanh


What would the ideal Semantic Layer look like?

Register for Coalesce 2023
The Analytics Engineering Conference
October 16-20

Coalesce Online · Coalesce London · Coalesce San Diego · Coalesce Sydney · Coalesce Berlin

Come share your ideas with us!


On March 13th, 2023, we are opening our Call for Proposals for
Coalesce 2023. Submit your ideas and join us at Coalesce to speak
among peers in the community.

Need help brainstorming and workshopping ideas?


Join the #Coalesce-Pitch-Party-2023 channel! Bounce ideas off
others, upvote ideas you want to hear more about, and nominate people
youʼd like to hear from. Itʼs a party, and youʼre all invited!

Visit coalesce.getdbt.com/register
for 15% off, use code
dbtmeetup15
Singapore dbt Meetup
29 March 2023
27 years of causing ripple effects:
● 10 million people impacted
● 15,000+ alumni
● 1,870 organisations
● 500+ active collaborators
New York · Singapore · London · Stockholm · Karlskrona · Manchester · São Paulo
WHO WE ARE

WE ARE A SCHOOL BUILT TO RESPOND TO A WORLD IN CONSTANT FLUX.
WE BUILD PEOPLE FOR THE UNKNOWN.
Hyper Island brings value to people
and organisations by empowering
them to make change happen.
WHAT WE OFFER

Our Most Impactful Learning Experiences.

Master's Programme in Digital Management
18 months (Part-Time)
Next run starts July 2023

Digital Management Acceleration Programme
12 weeks (Part-Time)
Next run starts June 2023
THE PART-TIME MA IN NUMBERS

18 months: 12 months of taught modules + 6 months for the master's thesis
8 modules: 4 core modules, 3 elective modules (your choice), 1 master's thesis
SGD$27,000 with up to 70% funding
1 transformational experience
THREE FOUNDATIONS

● Transdisciplinary Thinking
● Systems & Complexity Thinking
● Self-Led Learning
(Sources: Cambridge Press; EU's Violence Lab; Inter-American Development Bank)

● Self-directed learning
● Applying Design Thinking to understand human/customer needs & unintended consequences at a systems level
● Adopting a Maker Mindset toward emerging technologies
● Leading business transformation & innovation in organisations and society
● Activating individual agency in Futures Thinking for a preferred future
● Forming & leading high-performing teams in a VUCA environment
● Cultivating transdisciplinary thinking and learning how to rapidly learn across disciplinary lines
● Reducing the cost of change to increase agility in fast-changing contexts
COURSE FEE - SELF-SPONSORED (NON-FINANCIAL SECTOR)

MA DIGITAL MANAGEMENT
Full course fee: S$27,900.36 (inclusive of 8% GST in 2023 & 9% GST in 2024)
With IBF funding: from S$16,124.64*
IBF subsidy: 70% for Singapore Citizens (SC) aged 40 and above; 50% for SC below 40 years old or SPR

PAYMENT SCHEDULE
Upon signing - S$648.00
Upon signing - S$9,084.12
6 Jan 2023 - S$9,084.12
6 Jul 2023 - S$9,084.12

MODULES (fee with 70% subsidy* / fee with 50% subsidy)
Registration fee | upon signing | S$648.00* / S$648.00
Hyper Island Way Week (Core) | - | S$0.00* / S$0.00
IRP Humans & Customers (Core) | 17 July 2023 | S$1,598.13* / S$2,439.25
Intelligent Machines (Core) | 5 September 2023 | S$1,598.13* / S$2,439.25
Leading Teams (Core) | 23 October 2023 | S$4,542.06* / S$4,542.06
Radical Perspective (Elective) | 18 December 2023 | S$4,542.06* / S$4,542.06
Future Scenarios (Elective) | 22 January 2024 | S$1,598.13* / S$2,439.25
Agile Making & Prototyping (Elective) | 4 March 2024 | S$4,542.06 / S$4,542.06
Biz. Transformation & Innovation (Elective) | 9 April 2024 | S$1,598.13* / S$2,439.25
Short Specialisation Project (Elective) | 10 May 2024 | S$4,542.06 / S$4,542.06

The full course fee for each module is S$4,542.06 (inclusive of prevailing 8% GST in 2023 & 9% GST in 2024).

IBF funding criteria: (i) eligible individuals must be Singapore Citizens (SC) or Singapore Permanent Residents (SPR), physically based in Singapore; (ii) successful completion of the IBF-STS Programme, including passing all the relevant assessments/assignments; and (iii) at least 75% attendance.
Self-sponsored participants can utilise their SkillsFuture Credit.
FIs and FinTech firms can sponsor their SC or SPR staff (funding criteria apply).
Fees are subject to change, based on the prevailing IBF-STS Funding Quantum and Criteria. For further details please see: IBF-STS Funding Quantum & Criteria.
*Total course fee will be calculated based on the combination of elective modules selected.
FOR COMPANY-SPONSORED PARTICIPANTS, PLEASE GET IN TOUCH WITH zainon.samsudin@hyperisland.com
COURSE FEE - SELF (FINANCIAL SECTOR) / COMPANY SPONSORED

MA DIGITAL MANAGEMENT
Full course fee: S$27,900.36 (inclusive of 8% GST in 2023 & 9% GST in 2024)
With IBF funding: from S$10,236.78*
IBF subsidy: 70% for Singapore Citizens (SC) aged 40 and above; 50% for SC below 40 years old or SPR

PAYMENT SCHEDULE
Upon signing - S$648.00
Upon signing - S$9,084.12
6 Jan 2023 - S$9,084.12
6 Jul 2023 - S$9,084.12

MODULES (fee with 70% subsidy* / fee with 50% subsidy)
Registration fee | upon signing | S$648.00* / S$648.00
Hyper Island Way Week (Core) | - | S$0.00* / S$0.00
IRP Humans & Customers (Core) | 17 July 2023 | S$1,598.13* / S$2,439.25
Intelligent Machines (Core) | 5 September 2023 | S$1,598.13* / S$2,439.25
Leading Teams (Core) | 23 October 2023 | S$1,598.13* / S$2,439.25
Radical Perspective (Elective) | 18 December 2023 | S$1,598.13* / S$2,439.25
Future Scenarios (Elective) | 22 January 2024 | S$1,598.13* / S$2,439.25
Agile Making & Prototyping (Elective) | 4 March 2024 | S$4,542.06 / S$4,542.06
Biz. Transformation & Innovation (Elective) | 9 April 2024 | S$1,598.13* / S$2,439.25
Short Specialisation Project (Elective) | 10 May 2024 | S$4,542.06 / S$4,542.06

The full course fee for each module is S$4,542.06 (inclusive of prevailing 8% GST in 2023 & 9% GST in 2024).

IBF funding criteria: (i) eligible individuals must be Singapore Citizens (SC) or Singapore Permanent Residents (SPR), physically based in Singapore; (ii) successful completion of the IBF-STS Programme, including passing all the relevant assessments/assignments; and (iii) at least 75% attendance.
Self-sponsored participants can utilise their SkillsFuture Credit.
FIs and FinTech firms can sponsor their SC or SPR staff (funding criteria apply).
Fees are subject to change, based on the prevailing IBF-STS Funding Quantum and Criteria. For further details please see: IBF-STS Funding Quantum & Criteria.
*Total course fee will be calculated based on the combination of elective modules selected.
FOR COMPANY-SPONSORED PARTICIPANTS, PLEASE GET IN TOUCH WITH zainon.samsudin@hyperisland.com
HOW WILL YOU
BUILD YOURSELF
FOR THE
UNKNOWN?
WE’D LIKE TO HELP YOU ON THAT JOURNEY.
Icebreaker - What’s your favourite data meme/ joke?

Your data is not static: Snapshots and
incremental models for change data capture
Speaker(s): Chin Hwee Ong
Your data is not static:
Snapshots and incremental models
for change data capture

By: Chin Hwee Ong (@ongchinhwee)


30 March 2023
Singapore dbt Meetup
About me
Ong Chin Hwee 王敬惠
● Senior Data Engineer @ Grab
● Speaker and (occasional) writer on
data engineering topics
● 90% self-taught programmer (+10%
from my engineering degrees)

@ongchinhwee
Get my slides here!

Slides link: bit.ly/dbtsg2023-change-aware-data


@ongchinhwee
Before we begin, some disclaimers:

● While my team uses dbt in our daily work, the insights in this
talk are based on my own technical expertise and do not in
any way represent the views of my team or employer.
● Scope is limited to structured tabular data only
○ Similar concept with different implementation for unstructured data, but that’s
outside the scope of this talk

@ongchinhwee
Imagine this scenario in your data team:

● Finance requests a regeneration of a business-critical report


dated 11 months ago for Client A
● Client A was upgraded from Tier 2 to Tier 1 three months ago,
which increased the rebate earn rate
● However, you need to generate the report based on the Tier 2
earn rate (as Client A was a Tier 2 customer 11 months ago)

@ongchinhwee
Imagine this scenario in your data team:

@ongchinhwee
Time is an important dimension in your data

● Data is usually not static


○ State transitions during a business process

○ Attributes can change over time e.g. age, income, status

@ongchinhwee
What is change data capture?

● Design patterns for capturing and tracking changes in the state


of the data over time
● Data warehouse needs to capture changes in data from
upstream source systems at a point in time

@ongchinhwee
Why does change data capture matter?

● Data governance and audit purposes


● Build data history with backward and forward compatibility
○ What if there is a change in transformation logic that only applies
to a specific time range?
● When you need to use point-in-time values to track business
metrics over time

@ongchinhwee
Design Patterns for Change Data Capture

● Data versioning based on a combination of:


○ Data version identifiers
○ Timestamps
○ Status indicators
● Log trigger / tuple versioning
○ Also known as Type 2 Slowly Changing Dimensions (SCDs)
● Transaction logs

@ongchinhwee
What do we mean by the “Modern Data Stack”?
● Cloud-based
● Built around the cloud data warehouse / lake
● Modular and customizable - “choose the best tool for a specific job”

@ongchinhwee
@ongchinhwee
Data Warehousing in the Modern Data Stack
● Cloud-based compute and storage = more scalable
● From ETL to ELT - transformation within the data warehouse
● Possible to store “snapshots” of data in a cloud data
warehouse to capture historical changes

@ongchinhwee
Data Warehousing in the Modern Data Stack
“Since storage and compute are dirt cheap, engineering time is
expensive, why not snapshot all your data (and append new
partitions for each ETL schedule)?”

● Does not apply for very large dimensions


● Does not remove the need for dimensional data modelling

Related reading: Functional Data Engineering - a modern paradigm for batch data processing (and related
talks) by Maxime Beauchemin, creator of Airflow and Superset

@ongchinhwee
Change data capture in the Modern Data Stack

● Some implementations of change data capture


○ Modern approaches
■ Data snapshots
■ Incremental models
○ Traditional (Kimball’s) Dimensional Modelling techniques
■ Slowly Changing Dimensions (SCDs)

@ongchinhwee
Data Snapshots

● Read-only (immutable) copies of the state of a data source at a


particular point in time
● Stored at the staging area of a data warehouse

@ongchinhwee
Incremental models

● Load changes to data incrementally based on latest version of data


● Incremental data load typically determined based on latest source
timestamp of existing data stored in target table

@ongchinhwee
Side track: What is Kimball’s Dimensional Data Modelling?

● Fundamental Concepts: Facts vs Dimensions


○ Fact Tables
■ Contain metrics and facts about a business process, e.g. process
time, transaction amount
■ Are typically fast-evolving during a business process event
■ Eventually reach a final state at a point in time upon completion

@ongchinhwee
Side track: What is Kimball’s Dimensional Data Modelling?

● Fundamental Concepts: Facts vs Dimensions


○ Dimension Tables
■ Describe the attributes of a business process, e.g. customer details
■ Are typically slow-changing and updated over a longer period of time
■ Do not have a “final state”

→ problem of Slowly Changing Dimensions (SCDs)

@ongchinhwee
What are Slowly Changing Dimensions (SCDs)?

● Change tracking techniques to handle state changes in


dimensions
SCD Type | How state changes are handled
0 | Ignore changes
1 | Overwrite
2 | Row versioning (w/ “validity period”)
3 | Previous value column
4 | History table (“mini-dimension”)

@ongchinhwee
What are Slowly Changing Dimensions (SCDs)?

● Type 0 Slowly Changing Dimensions (Type 0 SCD)


○ Fixed dimension e.g. account opening date
○ Ignores any changes
○ Assumes that the attribute will never change

@ongchinhwee
What are Slowly Changing Dimensions (SCDs)?

● Type 1 Slowly Changing Dimensions (Type 1 SCD)


○ Reflects the latest version of dimension attributes
○ Previous version of the value in the dimension row is overwritten
with new value
○ Destroys history - not possible to rewind back to previous versions
of the data

@ongchinhwee
What are Slowly Changing Dimensions (SCDs)?

● Type 2 Slowly Changing Dimensions (Type 2 SCD)


○ Implements row versioning for each dimension attribute
○ Concept of “validity period” for each version of the data
■ Row effective date / timestamp
■ Row expiration date / timestamp
■ Current row indicator

@ongchinhwee
What are Slowly Changing Dimensions (SCDs)?

● Type 2 Slowly Changing Dimensions (Type 2 SCD)


○ When a change is detected in a data record:
■ A new dimension row is added with updated attribute values for the
data record
■ A new surrogate primary key is assigned to the new dimension row
■ Previous version of the attribute is updated with row expiration
timestamp

@ongchinhwee
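As a worked example tied to the Client A scenario from the start of the talk (keys and dates below are illustrative, not from the slides): after the upgrade from Tier 2 to Tier 1, a Type 2 customer dimension holds two rows, and the 11-month-old report can be rebuilt from the row whose validity period covers the report date.

    customer_sk | customer_id | tier   | valid_from | valid_to   | is_current
    1001        | A           | Tier 2 | 2022-04-01 | 2023-01-01 | false
    2045        | A           | Tier 1 | 2023-01-01 | (null)     | true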
What are Slowly Changing Dimensions (SCDs)?

● Slowly Changing Dimensions (SCDs)


○ Type 3, Type 4 etc.
■ Less commonly used due to additional complexity without
significant performance benefits in cloud data warehouses
● Type 3: add new column to store previous attribute value
● Type 4: add history table to keep record of state changes

More information on Slowly Changing Dimension Techniques: Chapter 5 of The Data Warehouse Toolkit:
The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross

@ongchinhwee
Is Kimball’s Dimensional Data Modelling still relevant?

● Yes, in some ways.


● Dimensional data modelling is still needed for these use cases:
○ Aggregation of facts
○ Metrics drill-down based on dimensions
■ e.g. how many customers in Singapore spent > US$400 per month
○ Financial reporting and audit

@ongchinhwee
dbt snapshots for historical data

@ongchinhwee
What is a dbt snapshot?

● Change data capture mechanism that records changes to a mutable


upstream table over time
● Implements Type-2 Slowly Changing Dimensions on mutable table
sources
● Two available strategies for change data capture:
○ Timestamp: detect changes based on `updated_at` timestamp
○ Check: detect changes based on columns specified in `check_cols`

Ref: Snapshots - dbt Developer Hub

@ongchinhwee
Anatomy of a dbt snapshot

Ref: dbt-dummy project (https://github.com/gmyrianthous/dbt-dummy/)

@ongchinhwee
Anatomy of a dbt snapshot
● Target schema
● Strategy (timestamp or check)
● Columns to track changes
○ For timestamp strategy: `updated_at` column for timestamp-based tracking
○ For check strategy: list of columns `check_cols` for column-based change tracking
● Primary key for record
● Whether to invalidate records no longer in source

@ongchinhwee
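To make the anatomy above concrete, here is a minimal sketch of a dbt snapshot using the timestamp strategy; the source, key and column names are illustrative, not taken from the talk.

```sql
-- snapshots/customers_snapshot.sql (illustrative names)
{% snapshot customers_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

select * from {{ source('crm', 'customers') }}

{% endsnapshot %}
```

For the check strategy you would instead set strategy='check' and list the columns to compare, e.g. check_cols=['tier', 'status'].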
How does a dbt snapshot work?

@ongchinhwee
How does a dbt snapshot work?

@ongchinhwee
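The slides above walk through this with diagrams; in short, dbt materialises the snapshot as a Type 2 table with metadata columns (dbt_valid_from, dbt_valid_to, dbt_updated_at, dbt_scd_id). A sketch of a point-in-time lookup against such a table, reusing the Client A scenario with illustrative values:

```sql
-- Which tier did Client A have on the report date 11 months ago? (illustrative)
select customer_id, tier
from {{ ref('customers_snapshot') }}
where customer_id = 'A'
  and dbt_valid_from <= '2022-04-30'
  and (dbt_valid_to > '2022-04-30' or dbt_valid_to is null)
```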
Incremental models and strategies for
fast-evolving data

@ongchinhwee
Incremental models

● Limiting the data transformation to a specified subset of


source data
○ Usually on rows in source data that have been created or updated
since the last scheduled run
● Significantly optimises runtime of transformations on large data +
reduces compute cost

@ongchinhwee
Anatomy of incremental
materialization on dbt
Adapted from: dbt-dummy project
(https://github.com/gmyrianthous/dbt-dummy/)

@ongchinhwee
Incremental materialization on dbt

@ongchinhwee
Filtering rows on incremental run: is_incremental()

@ongchinhwee
Anatomy of incremental models (on dbt)
● Whether destination table already exists in data warehouse
● Full-refresh?
● Incremental strategy
○ “Data warehouse”- like strategies: merge, delete+insert
○ “Data lake”-like strategies: append, insert_overwrite
● Columns to track changes
○ For merge or delete+insert: unique record key to overwrite
○ For merge: columns to update (partial overwrite)
● Query filter for incremental run

@ongchinhwee
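A minimal sketch of the pieces listed above, assuming a merge strategy and an updated_at column on the source (all table and column names are illustrative):

```sql
-- models/fct_orders.sql (illustrative)
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge'
    )
}}

select
    order_id,
    customer_id,
    order_status,
    order_amount,
    updated_at
from {{ source('payments', 'orders') }}

{% if is_incremental() %}
-- on incremental runs, only transform rows newer than what is already in the target
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

Running dbt with --full-refresh rebuilds the table from scratch when the logic or schema changes.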
Strategies on designing incremental models

● Design with upstream data update mechanism in mind


○ Affects how the incremental load is applied to the target model
● Incremental strategy depends on data warehouse
○ For merge/delete+insert strategies, specify a unique key to
update existing records
● Performance tip: Filter rows early!

@ongchinhwee
Tips and tricks on change-aware data modelling

@ongchinhwee
Snapshot all your upstream source data!

● What if assumptions about your data change?


○ Source data update mechanism (append-only vs overwrite)
○ Source table schema (columns added/altered/dropped)
○ Business logic
● Store data snapshots in staging area
○ Build SCDs (dbt snapshots) from data snapshots

@ongchinhwee
Source-target schema reconciliation

● Detect schema diffs between the upstream source system and the data warehouse
● Useful for detecting schema changes
@ongchinhwee
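One way to sketch such a reconciliation check, assuming the raw landed copy of the source table and the curated model live in the same warehouse (schema and table names below are made up, and information_schema details vary by warehouse):

```sql
-- Columns that exist on one side but not the other (illustrative)
with raw_cols as (
    select lower(column_name) as column_name
    from information_schema.columns
    where table_schema = 'raw' and table_name = 'customers'
),

curated_cols as (
    select lower(column_name) as column_name
    from information_schema.columns
    where table_schema = 'curated' and table_name = 'stg_customers'
)

select
    coalesce(r.column_name, c.column_name) as column_name,
    case
        when c.column_name is null then 'present in raw only (new upstream column?)'
        when r.column_name is null then 'present in curated only (dropped upstream?)'
    end as schema_diff
from raw_cols r
full outer join curated_cols c
    on r.column_name = c.column_name
where r.column_name is null or c.column_name is null
```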
Use Type 2 SCDs + incremental model (in most cases)

● In most cases, Type 2 SCDs are sufficient for tracking and


capturing data changes
● Incremental models may be slightly complex, but pay off in
efficiency and cost when data scales quickly and changes (quickly
and slowly) over time

@ongchinhwee
Key Takeaways

● Dimensional data modeling techniques are still relevant in the


Modern Data Stack
● Adopt a mixture of dbt snapshots (SCDs) and incremental models
when designing change-aware data models for non-retroactive
business logic
● Data snapshots are useful in cases where re-modelling is
required due to changes in business logic
@ongchinhwee
Reach out to me!
● ongchinhwee
● ongchinhwee@data-folks.masto.host
● @ongchinhwee
● hweecat
● https://ongchinhwee.me

And check out my upcoming conference session:
Event: FOSSASIA Summit 2023
Date: 13 - 15 April 2023
Venue: Lifelong Learning Institute
@ongchinhwee
Implementing a Data Observability Initiative: a
practical case with dbt
Speaker(s): Pia Delos Reyes, Michael Han
What is Data Observability
Primer

Building on the engineering/DevOps observability…

“Observability… refers to the monitoring, tracking, and triaging of incidents to


prevent downtime”

Data Observability generally refers to an organisation's ability to fully understand the
quality of its data. Broadly speaking, this can mean:

● Monitoring and logging pipeline quality and performance metrics


● Identifying data incidents at source for rapid triage
● Tracking data changes and drift
● etc…
What is the (general) use-case?
Primer

State of Analytics Engineering 2023: the latest dbt industry survey asked,
"What do you find most challenging while preparing data for your analysis?"
and found that "unclear data ownership and poor source data quality are the
top-ranked data preparation pain points for all participants".

This clearly states both the problem statement and
therefore the goal of an observability initiative, i.e.

● Identify source data quality issues
● Establish clear data ownership
● Consequently, improve source data quality
Engineering vs Organisational Problem
Primer

It is important to stress that data observability is not just an engineering problem, but also an
organisational one:

Engineering | Organisational
Identifying schema drift at source | Notifying data consumer teams to update systems
Presenting data quality metrics | Holding teams accountable for metric outcomes
Providing a framework to create portable documentation | Keeping documentation up to date
Case Study:
Implementing data
observability at Spenmo
Overview of Data Models
Initial Data Warehouse Implementation

Raw Data: Data is loaded as-is from data sources

Curated Data: The “intermediate” data layer where


data is standardised, but otherwise identical to
source data

Modeled Data: Data is organised by domain or


logic, and is transformed into “dims” and “facts” by
joining multiple sources.
Bottlenecks
Initial Data Warehouse Implementation

Initial implementation focused on dbt test coverage on modeled data, which had advantages but then
quickly ran into scalability issues:

Advantages:
● Fewer tables to cover
● Clear data team ownership of tables
● More direct association to outcomes

Pain points:
● Drain on data team bandwidth to diagnose issues
● Lack of engagement from engineering
● Time to resolution is high for data integrity issues, with no ownership of the data incident fix

Our solution:
Focus on building frameworks and pass ownership of data sources to engineering owners
Solution Overview
Spinning off the observability project

1-2. Fivetran checks and loads data as before

3. Trigger dbt models via Fivetran Transform


a. dbt runs column- and row-based tests on the table
b. dbt runs cross-schema tests on custom views
c. if there are any errors, dbt notifies via Slack

4. dbt publishes and serves documentation (via dbt docs) and metrics (via Elementary) as static web pages in S3
Generate template .yml files
Enable engineering stakeholders to onboard quickly

● To generate your source.yml template file

Reference: generate_source macro

● To generate your base model .sql template files (based on the source file that you've
generated above), as sketched below
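The commands themselves were shown as screenshots in the original slides. As a hedged sketch: the generate_source and generate_base_model macros come from the dbt-codegen package, typically invoked with dbt run-operation (check the dbt-codegen docs for the exact arguments). The generated sources file looks roughly like this, with made-up names:

```yaml
# models/staging/_sources.yml (illustrative output of generate_source)
version: 2

sources:
  - name: raw_payments
    schema: raw_payments
    tables:
      - name: transactions
      - name: cards
```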
Writing dbt document using doc block
Enable engineering stakeholders to onboard quickly

Problem

● Documentation becomes a bit more complex


● Re-writing the exact definition repeatedly

Solution

● Write the definition outside of your .yml file as markdown (.md) using dbt doc blocks, as sketched below
Writing dbt document using doc block
Enable engineering stakeholders to onboard quickly

models/docs/__common_doc.md models/staging/__demo_schema.yml
Writing dbt document using doc block
Enable engineering stakeholders to onboard quickly
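The two files named above were shown as screenshots. A minimal sketch of the pattern (doc-block, model and column names are illustrative): define the text once in a markdown file inside a docs block, then reference it from any .yml file with the doc() function.

```jinja
{# models/docs/__common_doc.md (illustrative) #}
{% docs user_id %}
Unique identifier of a user, as assigned by the source system.
{% enddocs %}
```

```yaml
# models/staging/__demo_schema.yml (illustrative)
version: 2

models:
  - name: stg_users
    columns:
      - name: user_id
        description: '{{ doc("user_id") }}'
```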
dbt-checkpoint on .pre-commit-config.yml
Enable engineering stakeholders to onboard quickly
dbt-checkpoint on .pre-commit-config.yml
Enable engineering stakeholders to onboard quickly
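The configuration itself was shown as a screenshot. For orientation, a sketch of what a dbt-checkpoint entry in .pre-commit-config.yml can look like; the repository URL, revision and hook ids below are assumptions and should be verified against the dbt-checkpoint documentation before use.

```yaml
# .pre-commit-config.yml (illustrative; verify repo, rev and hook ids)
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v1.1.0
    hooks:
      - id: check-model-has-description
      - id: check-model-has-tests
```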
Elementary Data
Metrics and dashboard for engineering owners

Why dbt-elementary?
● Open-source, easy to integrate
● Automatically generates metadata (e.g. models, alias, tags)
● Gives us information about our project invocations, such as start/end date, status, rows affected and compiled SQL code
● Provides performance metrics to engineering stakeholders

How do we use dbt-elementary?
● Publish to S3 as a static webpage
● Hold stakeholders accountable for their project / pipeline performance
● Provide visibility of errors and data incidents at source
Elementary Data
Metrics and dashboard for engineering owners
Elementary Data
Metrics and dashboard for engineering owners
Elementary Data
Metrics and dashboard for engineering owners
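The dashboards themselves were screenshots. For reference, installing Elementary as a dbt package looks roughly like this; the dbt Hub package name and the version below are illustrative and should be verified against the Elementary documentation before use.

```yaml
# packages.yml (illustrative; pin the version recommended by Elementary)
packages:
  - package: elementary-data/elementary
    version: 0.7.2
```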
Closing Thoughts

● Don’t over-engineer
○ Piecemeal adoption of best practice is ok
○ Progressively build your DataOps culture

● Demonstrate value to get stakeholder buy-in, e.g.


○ Automate tests that are hard to run in
production
○ Produce reportable metrics to track progress

● Do the OOTB / buy options work for you?


○ Statistical methods vs explicit definitions
○ Active vs passive stakeholder engagement
What would the ideal Semantic Layer look like?
Speaker(s): Thanh Dinh Khac
What would an ideal
semantic layer look like?

Thanh Dinh, Holistics Data


dbt meetup – Singapore, March 2023
Agenda

● Why a metrics layer? A real-world example.


● Features of an ideal semantic layer
● Current landscape
● FAQs
How it started
Why do we need a
“metrics layer?”
(or semantic layer… what’s
the difference anyway?)
Real-world Example
Same-day activation

Alice is a data analyst at a fintech (e.g. insurance) company.

One day her boss asks her for the daily same-day
activation rate of users (clients).

Same-day Activation: Users who signed up and


activated their account on the same day.
(Activation is a high-effort activity: NRIC, upload photo, KYC, etc)
Context: Data structure

events (
  date,
  user_id,
  event_type,
)

Whenever a potential customer signs up for a new account, the system
creates a new user and logs an event. The account may be activated
on the same day, on a later day, or never.
Let’s help Alice build this
(with SQL and dbt)
Start: A base table
To make the data easier to work with, Alice
creates a new user_funnel table. user_funnel (
user_id,
She also creates a dbt model for this table for sign_up_date,
later reuse. activation_date,
)
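A sketch of what Alice's base model might look like (the event_type values and ref/source names are assumptions; the slide itself showed the SQL as a screenshot):

```sql
-- models/user_funnel.sql (illustrative sketch)
with signups as (
    select user_id, min(date) as sign_up_date
    from {{ ref('events') }}
    where event_type = 'sign_up'
    group by user_id
),

activations as (
    select user_id, min(date) as activation_date
    from {{ ref('events') }}
    where event_type = 'activate'
    group by user_id
)

select
    s.user_id,
    s.sign_up_date,
    a.activation_date
from signups s
left join activations a
    on s.user_id = a.user_id
```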
Same day activation

Now we can flag which users activated on the same day with a case-when clause:

same_day_activations (
  user_id,
  sign_up_date,
  activation_date,
  same_day_activated (bool),
)
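A sketch of that model, mirroring the CASE WHEN fragment shown later in the deck (names taken from the column list above):

```sql
-- models/same_day_activations.sql (illustrative sketch)
select
    user_id,
    sign_up_date,
    activation_date,
    case
        when activation_date = sign_up_date then 1
        else 0
    end as same_day_activated
from {{ ref('user_funnel') }}
```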
Now we can start counting
Output: Divide the two columns
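The counting and division steps were shown as screenshots; a sketch of both in one model, using the column names from the sequence diagram below:

```sql
-- models/activation_rates.sql (illustrative sketch)
with same_day_activated_daily as (
    select
        sign_up_date,
        count(distinct user_id) as num_users,
        sum(same_day_activated) as count_sda_users
    from {{ ref('same_day_activations') }}
    group by sign_up_date
)

select
    sign_up_date as signup_date,
    num_users,
    count_sda_users as num_activated_users,
    count_sda_users * 1.0 / num_users as same_day_conversion_rate
from same_day_activated_daily
```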
Visualizing the sequence

events (date, user_id, event_type)
  → user_funnel (user_id, sign_up_date, activation_date)
  → same_day_activations (user_id, sign_up_date, same_day_activated (bool))
  → same_day_activated_daily (sign_up_date, num_users, count_sda_users)
  → activation_rates (signup_date, num_users, num_activated_users, same_day_conversion_rate)
What if we add some slight variation?

Bob now asks for 7-day conversion rate

Instead of activating on the same day, count users who activate within 7 days from
signing up.
Same-day → 7-day activation

events (date, user_id, event_type)
  → user_funnel (user_id, sign_up_date, activation_date)

  → same_day_activations (…, same_day_activated (bool)) → same_day_activated_daily (…) → activation_rates (…, sda_conversion)

  → 7_day_activations (…, 7_day_activated (bool)) → 7_day_activated_daily (…, count_7da_users) → activation_rates (…, 7da_conversion)

Duplicated models!
Problems with this approach

● Any slight variation causes lots of rework for analysts, which


is:

○ Error-prone (maintaining duplication)

○ Time consuming (rework)

● It is impossible for business users to even self-serve this data
Is there a better way?

events (date, user_id, event_type)
  → user_funnel (user_id, sign_up_date, activation_date)
  → same_day_activated (user_id, sign_up_date, same_day_activated (bool))
  → same_day_activated_daily (sign_up_date, num_users, num_sda_users)
  → activation_rates (signup_date, num_users, num_sda_users, sda_conversion)

The logic buried in those models:
CASE WHEN activate_date = sign_up_date THEN 1 ELSE 0 END AS same_day_activated
COUNT(DISTINCT user_id) AS num_users
SUM(same_day_activated) AS num_sda_users
num_sda_users / num_users AS sda_conversion
In dbt, models are first-class citizens.

You can define models, reuse them, and track


dependencies between models through the
lineage graph.

What if we could do the same for metrics,
decoupled from dimensions?
Metric as an independent concept
Rewrite the same question
using a metric-centric query
Decouple metrics from dimensions & their
container vehicle (tables)
Model-based sequence

events (date, user_id, event_type)
  → user_funnel (user_id, sign_up_date, activation_date)
  → same_day_activations (user_id, sign_up_date, same_day_activated (bool))
  → same_day_activated_daily (sign_up_date, num_users, count_sda_users)
  → activation_rates (signup_date, num_users, num_activated_users, same_day_conversion_rate)
Metric-based sequence

events (date, user_id, event_type)
  → user_funnel (user_id, sign_up_date, activation_date)

(the metrics are defined on top of these two models; no further derived tables are needed)
Benefits of semantic layer

Not constrained by tables as the vehicle for transporting logic.

● Less mental effort to write, more intuitive, closer to the end-user’s business
language.

● Maintainability: minimizing duplication of logic means less effort for modification

● Flexibility: dimensions can be combined at run time instead of development time

It is like going from C to Python. But it doesn’t stop here.


How about seven-day activation?
Seven-day activation
Define metrics only => Less logic duplication
How about any n-day activation?
Parameterized metrics
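The speaker's examples use a semantic-layer syntax that was shown as screenshots. Purely for illustration, one way to approximate a parameterised "n-day activated" flag in plain dbt/Jinja is a macro; the names are assumptions and datediff syntax varies by warehouse.

```sql
-- macros/n_day_activated.sql (illustrative approximation, not the talk's syntax)
{% macro n_day_activated(n) %}
    case
        when datediff('day', sign_up_date, activation_date) <= {{ n }} then 1
        else 0
    end
{% endmacro %}
```

Inside a model you could then write {{ n_day_activated(7) }} as seven_day_activated over the user_funnel model.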
Changes (abs and %) of conversions over time?
Window function

0.11 = 0.71 - 0.60 (absolute change)    18.33% = 0.11 / 0.60 (relative change)
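Those deltas (going from 0.60 to 0.71 is an absolute change of 0.11 and a relative change of about 18.33%) are exactly what a window function computes; a sketch over the activation_rates model (column names assumed from the earlier diagrams):

```sql
-- Illustrative sketch: period-over-period change of the conversion rate
select
    signup_date,
    sda_conversion,
    sda_conversion - lag(sda_conversion) over (order by signup_date) as abs_change,
    (sda_conversion - lag(sda_conversion) over (order by signup_date))
        / nullif(lag(sda_conversion) over (order by signup_date), 0) as pct_change
from {{ ref('activation_rates') }}
```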


What if more tables are involved?
How do you calculate the [monthly average] of same-day activation rates [by sales agents] [for customers in Singapore] [in 2022]?

How long does it take you


to get this result in SQL?
Using dbt (dimensional modeling)
Monthly average of same-day activation rates [by sales agents] [for customers in Singapore] [in 2022]?

events (…) → user_funnel (…) → same_day_activations (…)
  → sda_daily_agent_country (sign_up_date, agent_id, country, count_users, count_sda_users, …)
  → sda_by_agent (agent_name, monthly_sda_conversion, …)
  → activation_rates (…)

Inject additional dimensional data from other tables


With metrics-centric thinking
Monthly average of same-day activation rates [by sales agents] [for
customers in Singapore] [in 2022]?
With metrics-centric thinking
Monthly average of same-day activation rates [by sales agents] [for
customers in Singapore] [in 2022]?
Solving the self-service problem for business users
Metric Exploration UI
Solving the self-service problem for business users
ChatGPT-like version (NLP)
An ideal semantic layer

Provides a higher level of abstraction than SQL, which
saves effort, reduces errors and enables self-serve interfaces
● The metrics can be decoupled from dimensions, which can be
provided during query time
● New metrics can be composed from existing ones
● Should be join-aware, able to work across multiple models instead
of only a single model
6 types of metrics (according to Thoughtspot CTO)

● Simple aggregation e.g. sum(revenue), avg(price), count(distinct users)

● Combined aggregation with scalar functions e.g. sum(revenue) - sum(cost)

● Metrics that require joins e.g. sum(sales.revenue * conversion_rates.exchange_rate)

● Metrics with window functions e.g. cumulative sum(sales.revenue) over weeks

● Metrics with multiple levels of detail e.g. metrics with dynamic grouping

● Multi-fact metrics e.g. metrics that span fact tables without direct relationships
Current landscape

● Prominent BI tools like PowerBI, Looker, Thoughtspot, GoodData, etc.
provide these capabilities to varying degrees, but they are locked inside
proprietary black boxes
● Independent and open semantic layers are the future but their
capabilities are still quite limited
● These layers will improve their capabilities and reshape the analytics
landscape
Summary

● An example that highlights the need for a semantic layer: current


approach causes duplicated effort and low reliability
● A new way of thinking about analytics logic with an idealized semantic
layer: metrics-centric thinking
● How the new way of thinking not only improves maintainability and flexibility,
but also provides a solid foundation for self-service analytics
● Current landscape and future direction
Credits & References

● Pedram Navid, for the inspiration for the conversion-rate example
(https://pedram.substack.com/p/what-is-the-metrics-layer)

● “The metrics layer has growing up to do” by Amit Prakash (Thoughtspot CTO)
(https://www.thoughtspot.com/blog/the-metrics-layer-has-growing-up-to-do)
Thank you!
Any questions?

www.holistics.io · hi@holistics.io · holistics-software · holistics.software · holistics_bi


Architecture overview
Thank you!

Stay tuned for upcoming events and other content

Singapore dbt Meetup Group


#local-singapore Slack Channel
Before you leave...
Please share your feedback with us!
