Singapore dbt Meetup
jingyu.lim@spenmo.com
linkedin.com/limjingyu
Join dbt Community Slack
getdbt.com/community
#local-singapore
Review and acknowledge Community rules
Create a Slack Account!
Available for Android & iOS
Got a question?
Use a ❓ emoji before your message, e.g. ❓ Why is the sky blue?
Agenda
Register for Coalesce 2023
The Analytics Engineering Conference
October 16-20
Coalesce Online, Coalesce London, Coalesce San Diego, Coalesce Sydney, Coalesce Berlin
Visit coalesce.getdbt.com/register
for 15% off, use code
dbtmeetup15
Singapore dbt Meetup
29 March 2023
27 years of causing ripple effects.
10 million people impacted
15,000+ alumni
1,870 organisations
500+ active collaborators
New York Singapore London Stockholm Karlskrona Manchester São Paulo
WHO WE ARE
WE ARE A SCHOOL
BUILT TO RESPOND
TO A WORLD IN
CONSTANT FLUX.
WE BUILD
PEOPLE FOR
THE UNKNOWN.
Hyper Island brings value to people
and organisations by empowering
them to make change happen.
WHAT WE OFFER
Master's Programme
in Digital Management
18 months (Part-Time)
Next run starts July 2023
Digital Management
Acceleration Programme (Part-Time)
12 weeks
Next run starts June 2023
THE PART-TIME MA IN NUMBERS
18
months
12 months – taught modules
6 months – master’s thesis
8
modules
4 core modules
3 elective modules (choice)
1 master’s thesis
SGD$27,000
with up to 70% funding
1 TRANSFORMATIONAL
EXPERIENCE
THREE FOUNDATIONS
IBF funding criteria: (i) eligible individuals must be Singapore Citizens (SC) or Singapore Permanent Residents (SPR) physically based in Singapore; (ii) successful completion of the IBF-STS Programme, including passing all relevant assessments/assignments; and (iii) at least 75% attendance.
Self-sponsored participants can utilise their SkillsFuture Credit.
FIs and FinTech firms can sponsor their SC or SPR staff (funding criteria apply).
Fees are subject to change based on the prevailing IBF-STS Funding Quantum and Criteria. For further details, see: IBF-STS Funding Quantum & Criteria.
*Total course fee will be calculated based on the combination of elective modules selected.
FOR COMPANY-SPONSORED PARTICIPANTS, PLEASE GET IN TOUCH WITH zainon.samsudin@hyperisland.com
COURSE FEE - SELF (FINANCIAL SECTOR)/ COMPANY SPONSORED
HOW WILL YOU
BUILD YOURSELF
FOR THE
UNKNOWN?
WE’D LIKE TO HELP YOU ON THAT JOURNEY.
Icebreaker - What’s your favourite data meme/joke?
Your data is not static: Snapshots and
incremental models for change data capture
Speaker(s): Chin Hwee Ong
@ongchinhwee
Get my slides here!
● While my team uses dbt in our daily work, the insights in this talk are based on my own technical expertise and do not in any way represent the views of my team or employer.
● Scope is limited to structured tabular data only
○ Similar concept with different implementation for unstructured data, but that’s
outside the scope of this talk
Imagine this scenario in your data team:
Time is an important dimension in your data
What is change data capture?
Why does change data capture matter?
Design Patterns for Change Data Capture
What do we mean by the “Modern Data Stack”?
● Cloud-based
● Built around the cloud data warehouse / lake
● Modular and customizable - “choose the best tool for a specific job”
Data Warehousing in the Modern Data Stack
● Cloud-based compute and storage = more scalable
● From ETL to ELT - transformation within the data warehouse
● Possible to store “snapshots” of data in a cloud data
warehouse to capture historical changes
Data Warehousing in the Modern Data Stack
“Since storage and compute are dirt cheap, engineering time is
expensive, why not snapshot all your data (and append new
partitions for each ETL schedule)?”
Related reading: Functional Data Engineering - a modern paradigm for batch data processing (and related talks) by Maxime Beauchemin, creator of Airflow and Superset
Change data capture in the Modern Data Stack
Data Snapshots
Incremental models
Side track: What is Kimball’s Dimensional Data Modelling?
What are Slowly Changing Dimensions (SCDs)?
Type 0 - Ignore changes
Type 1 - Overwrite
More information on Slowly Changing Dimension Techniques: Chapter 5 of The Data Warehouse Toolkit:
The Definitive Guide to Dimensional Modeling, 3rd Edition by Ralph Kimball and Margy Ross
Is Kimball’s Dimensional Data Modelling still relevant?
dbt snapshots for historical data
What is a dbt snapshot?
Anatomy of a dbt snapshot
● Target schema
● Strategy (timestamp or check)
● Columns to track changes
○ For timestamp strategy: `updated_at` column for timestamp-based tracking
○ For check strategy: list of columns `check_cols` for column-based change tracking
● Primary key for record
● Whether to invalidate records no longer in source
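Put together, a snapshot covering the options above might look like this sketch (the source, schema and column names are hypothetical; the config keys are standard dbt snapshot options):

```sql
-- snapshots/orders_snapshot.sql (hypothetical names; a sketch, not the speaker's code)
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',          -- where snapshot tables are written
        unique_key='order_id',              -- primary key for each record
        strategy='timestamp',               -- or 'check' with check_cols=[...]
        updated_at='updated_at',            -- column used to detect changes
        invalidate_hard_deletes=True        -- close out records no longer in source
    )
}}

select * from {{ source('app', 'orders') }}

{% endsnapshot %}
```

Running `dbt snapshot` compares the query result against the existing snapshot table and, whenever a tracked record has changed, closes the old row and inserts a new one with fresh validity metadata.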
How does a dbt snapshot work?
Incremental models and strategies for
fast-evolving data
Incremental models
Anatomy of incremental
materialization on dbt
Adapted from: dbt-dummy project
(https://github.com/gmyrianthous/dbt-dummy/)
Incremental materialization on dbt
Filtering rows on incremental run: is_incremental()
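A minimal sketch of how `is_incremental()` is typically used to filter rows on an incremental run (model and column names are hypothetical; `is_incremental()` and `{{ this }}` are standard dbt):

```sql
-- models/stg_events_incremental.sql (hypothetical names)
{{ config(materialized='incremental') }}

select event_id, user_id, event_type, event_timestamp
from {{ source('app', 'events') }}

{% if is_incremental() %}
-- only runs on incremental builds: process rows newer than
-- what already exists in the target table ({{ this }})
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```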
Anatomy of incremental models (on dbt)
● Whether the destination table already exists in the data warehouse
● Whether a full refresh is forced (e.g. with the --full-refresh flag)
● Incremental strategy
○ “Data warehouse”- like strategies: merge, delete+insert
○ “Data lake”-like strategies: append, insert_overwrite
● Columns to track changes
○ For merge or delete+insert: unique record key to overwrite
○ For merge: columns to update (partial overwrite)
● Query filter for incremental run
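A sketch combining these pieces with the merge strategy (model and column names are hypothetical; `incremental_strategy`, `unique_key` and `merge_update_columns` are standard dbt configs):

```sql
-- models/orders_current.sql (hypothetical names; a sketch, not the speaker's code)
{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='order_id',                         -- record key to match on
        merge_update_columns=['status', 'updated_at']  -- partial overwrite on match
    )
}}

select order_id, status, updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
where updated_at >= (select max(updated_at) from {{ this }})
{% endif %}
```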
Strategies on designing incremental models
Tips and tricks on change-aware data modelling
Snapshot all your upstream source data!
Source-target schema reconciliation
Use Type 2 SCDs + incremental model (in most cases)
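dbt snapshots implement Type 2 SCDs by adding validity metadata columns (`dbt_valid_from`, `dbt_valid_to`) automatically. Two common queries against such a snapshot (the snapshot name is hypothetical):

```sql
-- current version of each record
select *
from {{ ref('orders_snapshot') }}
where dbt_valid_to is null;

-- state of each record as of a given point in time
select *
from {{ ref('orders_snapshot') }}
where dbt_valid_from <= '2023-01-01'
  and (dbt_valid_to > '2023-01-01' or dbt_valid_to is null);
```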
Key Takeaways
: hweecat
: https://ongchinhwee.me
@ongchinhwee
Implementing a Data Observability Initiative: a
practical case with dbt
Speaker(s): Pia Delos Reyes, Michael Han
What is Data Observability?
Primer
Data Observability generally refers to an organisation's ability to fully understand the quality of its data. Broadly speaking, this can mean:
State of Analytics Engineering 2023: As identified in the latest dbt industry survey, in response to "What do you find most challenging while preparing data for your analysis?", "unclear data ownership and poor source data quality are the top-ranked data preparation pain points for all participants".
It is important to stress that data observability is not just an engineering problem, but also an
organisational one:
Engineering / Organisational
Initial implementation focused on dbt test coverage on modeled data, which had advantages but then
quickly ran into scalability issues:
Our solution:
Focus on building frameworks and pass ownership of data sources to their engineering owners
Solution Overview
Spinning off the observability project
● To generate your base model .sql template files (based on the source file you've generated above)
Writing dbt documentation using doc blocks
Enable engineering stakeholders to onboard quickly
Problem
Solution
models/docs/__common_doc.md
models/staging/__demo_schema.yml
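The slide shows two files; a hedged sketch of what they might contain (the content is invented for illustration, but the `docs`/`enddocs` and `doc()` syntax is standard dbt):

```markdown
<!-- models/docs/__common_doc.md (hypothetical content) -->
{% docs user_id %}
Unique identifier of the user, assigned at sign-up.
{% enddocs %}
```

```yaml
# models/staging/__demo_schema.yml (hypothetical content)
models:
  - name: stg_users
    columns:
      - name: user_id
        description: '{{ doc("user_id") }}'
```

Defining the description once in a doc block lets every model that exposes the column reuse it, so engineering owners only maintain one copy.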
dbt-checkpoint on .pre-commit-config.yml
Enable engineering stakeholders to onboard quickly
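The dbt-checkpoint hooks mentioned above might be wired up like this (the `rev` and hook ids below are common examples and should be checked against the dbt-checkpoint README):

```yaml
# .pre-commit-config.yml (sketch; pin rev to a real release)
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v1.1.0
    hooks:
      - id: check-model-has-description     # every model must be documented
      - id: check-model-has-properties-file # every model must appear in a .yml
```

Running these checks at commit time lets engineering stakeholders get immediate feedback before a PR is even opened.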
Elementary Data
Metrics and dashboard for engineering owners
● Don’t over-engineer
○ Piecemeal adoption of best practice is ok
○ Progressively build your DataOps culture
Now we can flag which users activate on the same day with a CASE WHEN clause
same_day_activations (
    user_id,
    sign_up_date,
    activation_date,
    same_day_activated (bool)
)
Now we can start counting
Output: Divide the two columns
Visualizing the sequence
same_day_activated_daily (
    sign_up_date,
    num_users,
    count_sda_users
)

activation_rates (
    signup_date,
    num_users,
    num_activated_users,
    same_day_conversion_rate
)
What if we add some slight variation?
Instead of activating on the same day, count users who activate within 7 days of signing up.
Same-day activation:

same_day_activated_daily (
    sign_up_date,
    count_tenants,
    count_sda_tenants
)

activation_rates (
    signup_date,
    count_tenants,
    activated_tenants,
    sda_conversion
)

7-day activation:

7_day_activated_daily (
    sign_up_date,
    num_users,
    count_7da_users
)

activation_rates (
    signup_date,
    num_users,
    num_activated_users,
    7da_conversion
)
Duplicated models!
Problems with this approach
same_day_activated_daily (
    sign_up_date,
    num_users,
    num_sda_users
)

activation_rates (
    signup_date,
    num_users,
    num_sda_users,
    sda_conversion
)
CASE WHEN activation_date = sign_up_date
     THEN 1 ELSE 0
END AS same_day_activated

COUNT(DISTINCT user_id) AS num_users

SUM(same_day_activated) AS num_sda_users

num_sda_users / num_users AS sda_conversion
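Put together as a dbt model, the computation above might look like the following sketch (model and column names follow the slides; this is illustrative, not the speakers' actual code):

```sql
-- models/same_day_activated_daily.sql (hypothetical sketch)
select
    sign_up_date,
    count(distinct user_id)                        as num_users,
    sum(case when activation_date = sign_up_date
             then 1 else 0 end)                    as num_sda_users
from {{ ref('user_funnel') }}
group by sign_up_date
```

The conversion rate is then a simple division of the two aggregates in a downstream `activation_rates` model.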
In dbt, models are first-class citizens.
same_day_activated_daily (
    sign_up_date,
    num_users,
    count_sda_users
)

activation_rates (
    signup_date,
    num_users,
    num_activated_users,
    same_day_conversion_rate
)
Metric-based sequence
events (
    date,
    user_id,
    event_type
)

user_funnel (
    user_id,
    sign_up_date,
    activation_date
)
Benefits of a semantic layer
● Less mental effort to write, more intuitive, closer to the end-user’s business
language.
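Under the dbt metrics spec current at the time of this talk (pre-MetricFlow, dbt v1.3/v1.4), the conversion rate could be declared once and queried at any grain. A hedged sketch (metric and column names are hypothetical; check the dbt metrics docs for the exact spec):

```yaml
# models/metrics.yml (hypothetical sketch of the pre-MetricFlow dbt metrics spec)
metrics:
  - name: num_users
    label: Signed-up users
    model: ref('user_funnel')
    calculation_method: count_distinct
    expression: user_id
    timestamp: sign_up_date
    time_grains: [day, week, month]
    dimensions: [agent_id, country]

  - name: num_sda_users
    label: Same-day activated users
    model: ref('user_funnel')
    calculation_method: sum
    expression: "case when activation_date = sign_up_date then 1 else 0 end"
    timestamp: sign_up_date
    time_grains: [day, week, month]
    dimensions: [agent_id, country]

  - name: sda_conversion
    label: Same-day activation rate
    calculation_method: derived
    expression: "{{ metric('num_sda_users') }} / {{ metric('num_users') }}"
    timestamp: sign_up_date
    time_grains: [day, week, month]
```

Each grouping (by agent, by country, by month) then becomes a query-time parameter instead of a separate duplicated model.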
activation_rates (
same_day_activated_daily ( sda_by_agent (
signup_date,
sda_daily_agent_country (
sign_up_date, agent_name,
count_tenants,
count_tenants,
sign_up_date, monthly_sda_conversion,
activated_tenants,
agent_id,
count_sda_tenants, )
sda_conversion
) country, )
count_users,
count_sda_users,
)
● Metrics with multiple levels of detail, e.g. metrics with dynamic grouping
● Multi-fact metrics, e.g. metrics that span fact tables without direct relationships
Current landscape
● “The metrics layer has growing up to do” by Amit Prakash (Thoughtspot CTO)
(https://www.thoughtspot.com/blog/the-metrics-layer-has-growing-up-to-do)
Thank you!
Any questions?