Big Book of Data Warehousing and BI
3.2 Dimensional Modeling Best Practice and Implementation on a Modern Lakehouse Architecture
3.3 Loading a Data Warehouse Data Model in Real Time With the Databricks Data Intelligence Platform
3.4 What's New With Databricks SQL?
3.5 Distributed Data Governance and Isolated Environments With Unity Catalog
3.6 The Hitchhiker's Guide to Data Privilege Model and Access Control in Unity Catalog
4.1 How to Build a Marketing Analytics Solution Using Fivetran and dbt on Databricks
5.1 InMobi: Driving Meaningful Connections Between Customers and Brands
Data is vital to the success of every company. As organizations increasingly rely on data, maintaining efficient data management and governance becomes more crucial. Addressing this, the Databricks Data Intelligence Platform enables effective data management, utilization and access to data and AI. Built on the lakehouse architecture, it merges the best features of data lakes and data warehouses, reducing costs and speeding up data and AI initiatives. The platform provides unified governance for data and AI, along with a versatile query engine for ETL, SQL, machine learning and BI.

Introduction to the Data Intelligence Platform

The Databricks Data Intelligence Platform enables organizations to effectively manage, utilize and access all their data and AI. The platform, built on the lakehouse architecture with a unified governance layer across data and AI and a single unified query engine that spans ETL, SQL, AI and BI, combines the best elements of data lakes and data warehouses to help reduce costs and deliver faster data and AI initiatives.

By combining generative AI with the unification benefits of a lakehouse, the Data Intelligence Platform offers a Data Intelligence Engine called DatabricksIQ that understands the unique semantics of your enterprise data. DatabricksIQ automatically analyzes all aspects of data, including its content, metadata and usage patterns (such as queries, reports and lineage). This comprehensive analysis enables the platform to continually learn, enhance and add new capabilities, optimizing data management and AI applications. Through this deep understanding of data, the Databricks Data Intelligence Platform enables:
• Enhanced Governance and Privacy: Data Intelligence Platforms can automatically detect, classify and prevent misuse of sensitive data, while simplifying management using natural language.
• First-Class Support for AI Workloads: Data Intelligence Platforms can enhance any enterprise AI application by allowing it to connect to the relevant business data and leverage the semantics learned by the platform (metrics, KPIs, etc.) to deliver accurate results. AI application developers no longer have to "hack" intelligence together through brittle prompt engineering.

The Databricks Platform also simplifies the development of enterprise AI applications. The platform makes it easy for enterprises to build AI applications that understand their data, and offers multiple capabilities to directly integrate enterprise data into AI systems, including:

• End-to-end RAG (retrieval augmented generation) to build high-quality conversational agents on your custom data
• Training custom models either from scratch on an organization's data, or by continued pretraining of existing models such as MPT and Llama 2, to further enhance AI applications with deep understanding of a target domain
• Efficient and secure serverless inference on your enterprise data, with unified governance and quality monitoring functionality
• End-to-end MLOps based on the popular MLflow open source project, with all produced data automatically actionable, tracked and monitorable in the lakehouse
To learn more about Databricks SQL, read our eBook Why the Data Lakehouse
Is Your Next Data Warehouse.
SECTION 02
Success in lakehouse-based data and AI initiatives hinges on a simplified data architecture, data quality and governance, and scalability and performance. These pillars collectively provide the foundation upon which organizations build their data strategies, guiding them through the intricate maze of modern data management and analytics.

The lakehouse architecture's holistic approach encompasses the entire data lifecycle, transformation and impact across various analytics and AI workloads. Because all workloads share the same data while adhering to uniform security and governance policies, you can feel confident knowing you can rely on the accuracy of the data. Functional silos diminish, paving the way for seamless collaboration and, consequently, enhanced productivity in delivering data products.
• Simplify access management through a unified interface, ensuring consistent and secure access across clouds and platforms, with enhanced fine-grained control and scalable low-code policies
• Harness AI to automate data and ML model monitoring, receive proactive alerts for issues, streamline debugging, and achieve holistic observability of your lakehouse operations using built-in system tables
• Efficiently share data and AI assets across clouds, regions and platforms using open source Delta Sharing in Unity Catalog, enabling secure collaboration and value creation without the need for complex processes or costly replication

Leveraging serverless compute and AI-driven optimizations, the Databricks Platform facilitates concurrent data processing and query execution. This ensures that multiple users and teams can undertake analytics tasks concurrently without performance constraints.

STORAGE

The platform seamlessly integrates with data lakes, facilitating the cost-effective storage of extensive data volumes while ensuring data availability and reliability. It also optimizes data storage for enhanced performance, reducing storage expenses.
DELTA LAKE
Delta Lake with Unity Catalog and Photon offers the best price/performance
out of the box without manual tuning. The Databricks Platform uses AI models
to solve common challenges with data storage, so you get faster performance
without having to manually manage tables, even as they change over time.
Databricks SQL distinguishes itself with its ability to handle massive data sets with speed and efficiency. Databricks' next-gen engine, Photon, together with AI-driven optimizations, ensures rapid data processing and analysis, notably decreasing query execution durations. High performance is crucial for organizations facing data challenges, guaranteeing insights from an extensive variety of data sets. Moreover, Databricks SQL champions collaboration, providing a workspace where data professionals can instantaneously share queries, outcomes and understandings. This shared setting promotes knowledge exchange and hastens resolution, allowing organizations to capitalize on their teams' collective intelligence.

• Best Price/Performance: Serverless compute combined with AI-optimized processing achieves top-tier performance and scale at lower costs, without the need for cloud infrastructure management
• Unified Governance: Establish one unified governance layer across all data and AI assets no matter where they live
• Reduce Complexity: Unify all your data, analytics and AI on one platform that supports SQL and Python, notebooks and IDEs, batch and streaming, and all major cloud providers
• Rich Ecosystem: Utilize SQL and favorite tools such as Power BI, Tableau, dbt and Fivetran with Databricks for BI, data ingestion and transformation
SECTION 3.1
The lakehouse is a new data platform paradigm that combines the best features of data lakes and data warehouses. It is designed as a large-scale enterprise-level data platform that can house many use cases and data products. It can serve as a single unified enterprise data repository for all of your:

• data domains,
• real-time streaming use cases,
• data marts,
• disparate data warehouses,
• data science feature stores and data science sandboxes, and
• departmental self-service analytics sandboxes.

Given the variety of the use cases, different data organizing principles and modeling techniques may apply to different projects on a lakehouse. Technically, the lakehouse architecture can support many different data modeling styles. In this article, we aim to explain the implementation of the Bronze/Silver/Gold data organizing principles of the lakehouse and how different data modeling techniques fit in each layer.

What is a Data Vault?

A Data Vault is a more recent data modeling design pattern used to build data warehouses for enterprise-scale analytics, compared to the Kimball and Inmon methods.

Data Vaults organize data into three different types: hubs, links, and satellites. Hubs represent core business entities, links represent relationships between hubs, and satellites store attributes about hubs or links.

Data Vault focuses on agile data warehouse development where scalability, data integration/ETL and development speed are important. Most customers have a landing zone, Vault zone and a data mart zone, which correspond to the Databricks organizational paradigms of Bronze, Silver and Gold layers. The Data Vault modeling style of hub, link and satellite tables typically fits well in the Silver layer of the lakehouse architecture.

Learn more about Data Vault modeling at Data Vault Alliance.
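As a minimal sketch of what the hub, link and satellite tables described above could look like as Delta tables in the Silver layer, consider the following DDL; the schema name (silver) and all table and column names are illustrative assumptions, not taken from the article.

-- Hub: unique business keys for the Customer entity (illustrative names)
CREATE TABLE IF NOT EXISTS silver.hub_customer (
  customer_hash_key     STRING NOT NULL,
  customer_business_key STRING NOT NULL,
  load_ts               TIMESTAMP,
  record_source         STRING
);

-- Link: the relationship between the Customer and Order hubs
CREATE TABLE IF NOT EXISTS silver.link_customer_order (
  customer_order_hash_key STRING NOT NULL,
  customer_hash_key       STRING NOT NULL,
  order_hash_key          STRING NOT NULL,
  load_ts                 TIMESTAMP,
  record_source           STRING
);

-- Satellite: descriptive attributes for the Customer hub
CREATE TABLE IF NOT EXISTS silver.sat_customer_details (
  customer_hash_key STRING NOT NULL,
  name              STRING,
  email             STRING,
  address           STRING,
  load_ts           TIMESTAMP,
  record_source     STRING
);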
[Figure: Data Vault modeling example with Customer, Product and Order hubs, links that record a history of the interactions, and satellites attached to the hubs and links.]

Data Vault Elements:
• Hubs = unique business keys
• Links = relationships and associations
• Satellites = descriptive data

What is Dimensional Modeling?

Dimensional modeling is a bottom-up approach to designing data warehouses in order to optimize them for analytics. Dimensional models are used to denormalize business data into dimensions (like time and product) and facts (like transactions in amounts and quantities), and different subject areas are connected via conformed dimensions to navigate to different fact tables.

The most common form of dimensional modeling is the star schema. A star schema is a multi-dimensional data model used to organize data so that it is easy to understand and analyze, and very easy and intuitive to run reports on. Kimball-style star schemas or dimensional models are pretty much the gold standard for the presentation layer in data warehouses and data marts, and even semantic and reporting layers. The star schema design is optimized for querying large data sets.

[Figure: A star schema with a central fact table (foreign keys such as Customer ID and Promotion ID, and measures such as Revenue and Units Sold) surrounded by Product, Time, Customer and Promotion dimension tables.]
Storing raw data in the original source data format in a landing zone also helps with consistency when you ingest data via ingestion tools that don't support Delta as a native sink, or where source systems dump data onto object stores directly. This pattern also aligns well with the Auto Loader ingestion framework, wherein sources land the data in the landing zone as raw files and Databricks Auto Loader then converts the data to the Staging layer in Delta format.
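As a rough sketch of this pattern, the following Delta Live Tables SQL uses Auto Loader (cloud_files) to incrementally pick up raw JSON files from a hypothetical landing-zone path and materialize them as a staging Delta table; the path and table name are assumptions, not from the article.

-- Incrementally ingest raw landing-zone files into a staging Delta table
CREATE OR REFRESH STREAMING LIVE TABLE staging_sales_raw
COMMENT "Raw sales files picked up from the landing zone by Auto Loader"
AS SELECT
  *,
  current_timestamp() AS ingest_ts  -- audit column recording when the row was ingested
FROM cloud_files("/mnt/landing/sales", "json");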
A diagram showing characteristics of the Bronze, Silver and Gold layers of the data lakehouse architecture.
Silver Layer: the Enterprise Central Repository

In the Silver layer of the lakehouse architecture, the data from the Bronze layer is matched, merged, conformed and cleaned ("just enough") so that the Silver layer can provide an "enterprise view" of all its key business entities, concepts and transactions. This is akin to an Enterprise Operational Data Store (ODS), a Central Repository, or the data domains of a Data Mesh (e.g. master customers, products, non-duplicated transactions and cross-reference tables). This enterprise view brings the data from different sources together, and enables self-service analytics for ad hoc reporting, advanced analytics and ML. It also serves as a source for departmental analysts, data engineers and data scientists to further create data projects and analysis to answer business problems via enterprise and departmental data projects in the Gold layer.

In the lakehouse data engineering paradigm, typically the Extract-Load-Transform (ELT) methodology is followed vs. traditional Extract-Transform-Load (ETL). The ELT approach means only minimal or "just enough" transformations and data cleansing rules are applied while loading the Silver layer. All the "enterprise level" rules are applied in the Silver layer, vs. project-specific transformational rules, which are applied in the Gold layer. Speed and agility to ingest and deliver the data in the lakehouse are prioritized here.

From a data modeling perspective, the Silver layer has more 3rd-Normal-Form-like data models. Data Vault-like write-performant data architectures and data models can be used in this layer. If using a Data Vault methodology, both the Raw Data Vault and Business Vault will fit in the logical Silver layer of the lake, and the Point-In-Time (PIT) presentation views or materialized views will be presented in the Gold layer.

Gold Layer: the Presentation Layer

In the Gold layer, multiple data marts or warehouses can be built as per dimensional modeling/Kimball methodology. As discussed earlier, the Gold layer is for reporting and uses more denormalized and read-optimized data models with fewer joins compared to the Silver layer. Sometimes tables in the Gold layer can be completely denormalized, typically if the data scientists want it that way to feed their algorithms for feature engineering.

ETL and data quality rules that are "project-specific" are applied when transforming data from the Silver layer to the Gold layer. Final presentation layers such as data warehouses, data marts or data products like customer analytics, product/quality analytics, inventory analytics, customer segmentation, product recommendations, marketing/sales analytics, etc., are delivered in this layer. Kimball-style star-schema-based data models or Inmon-style data marts fit in this Gold layer of the lakehouse. Data Science Laboratories and Departmental Sandboxes for self-service analytics also belong in the Gold layer.
• The Bronze layer uses the data models of source systems. If data is landed in raw formats, it is converted to Delta Lake format within this layer.
• The Silver layer for the first time brings the data from different sources together and conforms it to create an enterprise view of the data, typically using more normalized, write-optimized data models that are 3rd-Normal-Form-like or Data Vault-like.
• The Gold layer is the presentation layer with more denormalized or flattened data models than the Silver layer, typically using Kimball-style dimensional models or star schemas. The Gold layer also houses departmental and data science sandboxes to enable self-service analytics and data science across the enterprise. Providing these sandboxes and their own separate compute clusters prevents the business teams from creating their own copies of data outside of the lakehouse.
This lakehouse data organization approach is meant to break data silos, bring teams together, and
empower them to do ETL, streaming, and BI and AI on one platform with proper governance. Central
data teams should be the enablers of innovation in the organization, speeding up the onboarding of
new self-service users, as well as the development of many data projects in parallel — rather than the
data modeling process becoming the bottleneck. The Databricks Unity Catalog provides search and
discovery, governance and lineage on the lakehouse to ensure good data governance cadence.
Build your Data Vaults and star schema data warehouses with Databricks SQL today.
How data is curated as it moves through the various lakehouse architecture layers.
SECTION 3.2
Dimensional Modeling Best Practice and Implementation on a Modern Lakehouse Architecture
A large number of our customers are migrating their legacy data warehouses to Databricks Lakehouse, as it enables them to modernize not only their data warehouse but also instantly get access to a mature streaming and advanced analytics platform. The lakehouse can do it all, as it is one platform for all your streaming, ETL, BI and AI needs, and it helps your business and data teams collaborate on one platform.

As we help customers in the field, we find that many are looking for best practices around proper data modeling and physical data model implementations in Databricks.

In this article, we aim to dive deeper into the best practice of dimensional modeling on the Databricks Data Intelligence Platform and provide a live example of a physical data model implementation using our table creation and DDL best practices.

Here are the high-level topics we will cover in this blog:

• The Importance of Data Modeling
• Common Data Modeling Techniques
• Data Warehouse Modeling DDL Implementation
• Best Practice and Recommendation for Data Modeling on the Lakehouse

The importance of Data Modeling for Data Warehouse

Data models are front and center of building a data warehouse. Typically the process starts with defining the Semantic Business Information Model, then a Logical Data Model, and finally a Physical Data Model (PDM). It all starts with a proper Systems Analysis and Design phase, where a Business Information Model and process flows are created first and key business entities, attributes and their interactions are captured as per the business processes within the organization. The Logical Data Model is then created depicting how the entities are related to each other, and this is a technology-agnostic model. Finally, a PDM is created based on the underlying technology platform to ensure that the writes and reads can be performed efficiently. As we all know, for data warehousing, analytics-friendly modeling styles like star schema and Data Vault are quite popular.
[Figure: The physical data model, showing the fact table and its dimension tables.]
Note each dimension table has __START_AT and __END_AT columns to support SCD Type 2, which are not displayed here because of limited space.
Unity Catalog is a Databricks governance layer which lets Databricks admins and data stewards manage users and their access to data centrally across all of the workspaces in a Databricks account using one metastore. Users in different workspaces can share access to the same data, depending on privileges granted centrally in Unity Catalog. Unity Catalog has a three-level namespace (catalog.schema(database).table) that organizes your data. Learn more about Unity Catalog here.

Here is an example of querying the fact_sales table with the three-level namespace.
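The query itself appears in the original as a screenshot; a simple equivalent using the three-level namespace (with the US_Stores catalog and Sales_DW schema used in the later examples of this section) would be:

-- Query the fact table through catalog.schema.table
SELECT *
FROM US_Stores.Sales_DW.fact_sales
LIMIT 10;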
2. Primary Key, Foreign Key definitions

Primary and Foreign Key definitions are very important when creating a data model. Having the ability to support the PK/FK definition makes defining the data model super easy in Databricks. It also helps analysts quickly figure out the join relationships in Databricks SQL Warehouse so that they can effectively write queries. Like most other Massively Parallel Processing (MPP), EDW and cloud data warehouses, the PK/FK constraints are informational only. Databricks does not support enforcement of the PK/FK relationship, but gives you the ability to define it to make designing the semantic data model easy.

-- Store dimension
CREATE OR REPLACE TABLE dim_store(
  store_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  business_key STRING,
  name STRING,
  email STRING,
  city STRING,
  address STRING,
  phone_number STRING,
  created_date TIMESTAMP,
  updated_date TIMESTAMP,
  start_at TIMESTAMP,
  end_at TIMESTAMP
);

DDL Implementation for creating store dimension with Primary Key Definitions

After the table is created, we can see that the primary key (store_id) is created as a constraint in the table definition below.

Primary Key store_id shows as table constraint

Here is an example of creating the fact_sales table with transaction_id as a Primary Key, as well as foreign keys that reference the dimension tables.

-- Fact Sales
CREATE OR REPLACE TABLE fact_sales(
  transaction_id BIGINT PRIMARY KEY,
  date_id BIGINT NOT NULL CONSTRAINT dim_date_fk FOREIGN KEY REFERENCES dim_date,
  customer_id BIGINT NOT NULL CONSTRAINT dim_customer_fk FOREIGN KEY REFERENCES dim_customer,
  product_id BIGINT NOT NULL CONSTRAINT dim_product_fk FOREIGN KEY REFERENCES dim_product,
  store_id BIGINT NOT NULL CONSTRAINT dim_store_fk FOREIGN KEY REFERENCES dim_store,
  store_business_key STRING,
  sales_amount DOUBLE
);

DDL Implementation for creating sales fact with Foreign Key definitions
After the fact table is created, we can see that the primary key (transaction_id) and foreign keys are created as constraints in the table definition below.

Fact table definition with primary key and foreign keys referencing dimensions

3. Identity columns for Surrogate keys

An identity column is a column in a database that automatically generates a unique ID number for each new row of data. These are commonly used to create surrogate keys in data warehouses. Surrogate keys are system-generated, meaningless keys, so that we don't have to rely on various Natural Primary Keys and concatenations of several fields to identify the uniqueness of the row. Typically these surrogate keys are used as Primary and Foreign keys in data warehouses. Details on identity columns are discussed in this blog. Below is an example of creating an identity column customer_id, with automatically assigned values starting with 1 and incrementing by 1.

-- Customer dimension
CREATE OR REPLACE TABLE dim_customer(
  customer_id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1) PRIMARY KEY,
  name STRING,
  email STRING,
  address STRING,
  created_date TIMESTAMP,
  updated_date TIMESTAMP,
  start_at TIMESTAMP,
  end_at TIMESTAMP
);
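To illustrate how the identity column behaves (the values below are hypothetical), you insert only the business attributes and Databricks generates the customer_id surrogate key automatically:

-- customer_id is omitted from the column list; it is generated by the identity column
INSERT INTO dim_customer (name, email, address, created_date, updated_date, start_at, end_at)
VALUES ('Jane Doe', 'jane.doe@example.com', '123 Main St', current_timestamp(), current_timestamp(), current_timestamp(), NULL);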
In addition to Primary and Foreign Key informational constraints, Databricks also supports column-level data quality check constraints, which are enforced to ensure the quality and integrity of data added to a table. The constraints are automatically verified. Good examples of these are NOT NULL constraints and column value constraints. Unlike other cloud data warehouses, Databricks went further to provide column value check constraints, which are very useful to ensure the data quality of a given column. As we can see below, the valid_sales_amount check constraint will verify that all existing rows satisfy the constraint (i.e. sales amount > 0) before adding it to the table. More information can be found here.

Here are examples of adding constraints to dim_store and fact_sales respectively, to make sure store_id and sales_amount have valid values.

-- Add constraint to dim_store to make sure column store_id is between 1 and 9998
ALTER TABLE US_Stores.Sales_DW.dim_store ADD CONSTRAINT valid_store_id CHECK (store_id > 0 and store_id < 9999);

-- Add constraint to fact_sales to make sure column sales_amount has a valid value
ALTER TABLE US_Stores.Sales_DW.fact_sales ADD CONSTRAINT valid_sales_amount CHECK (sales_amount > 0);

Add column constraint to existing tables to ensure data quality

Traditional databases have b-tree and bitmap indexes; Databricks has a much more advanced form of indexing, multi-dimensional Z-order clustered indexing, and we also support Bloom filter indexing. First of all, the Delta file format uses the Parquet file format, which is a columnar compressed file format, so it's already very efficient in column pruning; on top of it, Z-order indexing gives you the ability to sift through petabyte-scale data in seconds. Both Z-order and Bloom filter indexing dramatically reduce the amount of data that needs to be scanned in order to answer highly selective queries against large Delta tables, which typically translates into orders-of-magnitude runtime improvements and cost savings. Use Z-order on your primary keys and foreign keys that are used for the most frequent joins, and use additional Bloom filter indexing as needed.

-- Optimize fact_sales table by customer_id and product_id for better query and join performance
OPTIMIZE US_Stores.Sales_DW.fact_sales
ZORDER BY (customer_id, product_id);

Optimize fact_sales on customer_id and product_id for better performance

-- Create a bloomfilter index to enable data skipping on store_business_key
CREATE BLOOMFILTER INDEX
ON TABLE US_Stores.Sales_DW.fact_sales
FOR COLUMNS(store_business_key)
And just like any other data warehouse, you can run ANALYZE TABLE to update statistics so the query optimizer has the best statistics to create the best query plan.

-- collect stats for all columns for better performance
ANALYZE TABLE US_Stores.Sales_DW.fact_sales COMPUTE STATISTICS FOR ALL COLUMNS;

Collect stats for all the columns for better query execution plan

6. Advanced Techniques

Are you looking to leverage a visual data modeling tool? Our partner erwin Data Modeler by Quest can be used to reverse engineer, create and implement star schemas, Data Vaults, and any industry data models in Databricks with just a few clicks.

Databricks Notebook Example

With the Databricks Platform, one can easily design and implement various data models. To see all of the above examples in a complete workflow, please look at this example.

Please also check out our related blog: Five Simple Steps for Implementing a Star Schema in Databricks With Delta Lake.
SECTION 3.3
Loading a Data Warehouse Data Model in Real Time With the Databricks Data Intelligence Platform
Dimensional modeling is one of the most popular data modeling techniques for building a modern data warehouse. It allows customers to quickly develop facts and dimensions based on business needs for an enterprise. When helping customers in the field, we found many are looking for best practices and implementation reference architecture from Databricks.

In this article, we aim to dive deeper into the best practice of dimensional modeling on the lakehouse architecture and provide a live example to load an EDW dimensional model in real time using Delta Live Tables.

Here are the high-level steps we will cover in this blog:

• Define a business problem
• Design a dimensional model
• Best practices and recommendations for dimensional modeling
• Implementing a dimensional model on a lakehouse architecture
• Conclusion
In a Business Matrix, the columns represent the shared (conformed) dimensions and the rows represent business processes. The defined business problem determines the grain of the fact data and required dimensions. The key idea here is that we could incrementally build additional data assets with ease based on the Business Matrix and its shared or conformed dimensions.

[Figure: A Business Matrix with Shared Dimensions and Business Processes such as Sales Plan, Inventory, Customer Surveys and Customer Service Calls.]

Here we assume that the business sponsor would like the team to build a report to give insights on:

• What are the top selling products, so they can understand product popularity
• What are the best performing stores, to learn good store practices

The design should be easy to understand and efficient with different query patterns on the data. From the model, we designed the sales fact table to answer our business questions; as you can see, other than the foreign keys (FKs) to the dimensions, it only contains the numeric metrics used to measure the business, e.g. sales_amount.

We also designed dimension tables such as Product, Store, Customer, Date that provide contextual information on the fact data. Dimension tables are typically joined with fact tables to answer specific business questions, such as the most popular products for a given month, which stores are the best-performing ones for the quarter, etc.

[Figure: The dimensional model, with the fact_sales table (store_id, customer_id, date_id and product_id foreign keys plus the sales_amount measure) joined to dimension tables such as dim_store and dim_date.]
• Enforced CHECK constraints to never worry about data quality or data correctness issues sneaking up on you.
The Delta Live Tables pipeline example can be found here. Please refer to the Delta Live Tables quickstart on how to create a Delta Live Tables pipeline. As seen below, DLT offers full visibility of the ETL pipeline and dependencies between different objects across the Bronze, Silver and Gold layers, following the lakehouse medallion architecture.

Here is an example of how the dimension table dim_store gets updated based on the incoming changes. Below, the store Brisbane Airport was updated to Brisbane Airport V2, and with the out-of-box SCD Type 2 support, the original record was ended on Jan 07, 2022, and a new record was created which starts on the same day with an open end date (NULL), indicating the latest record for the Brisbane airport.

For more implementation details, please refer here to the full notebook example.
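A minimal Delta Live Tables SQL sketch of the SCD Type 2 behavior described above is shown below; the source table (store_updates) and sequencing column (updated_date) are illustrative assumptions, not taken from the original pipeline.

-- Target dimension maintained by DLT
CREATE OR REFRESH STREAMING LIVE TABLE dim_store;

-- Apply incoming changes as SCD Type 2; DLT manages the __START_AT / __END_AT columns
APPLY CHANGES INTO live.dim_store
FROM stream(live.store_updates)
KEYS (store_id)
SEQUENCE BY updated_date
STORED AS SCD TYPE 2;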
Conclusion
In this blog, we learned about dimensional modeling concepts in detail,
best practices, and how to implement them using Delta Live Tables.
SECTION 3.4
What's New With Databricks SQL?
At this year's Data + AI Summit, Databricks SQL continued to push the boundaries of what a data warehouse can be, leveraging AI across the entire product surface to extend our leadership in performance and efficiency, while still simplifying the experience and unlocking new opportunities for our customers. In parallel, we continue to deliver improvements to our core data warehousing capabilities to help you unify your data stack under the lakehouse architecture on the Databricks Data Intelligence Platform.

In this blog post, we are thrilled to share the highlights of what's new and coming next in Databricks SQL:

• AI-driven performance optimizations like Predictive I/O that deliver leading performance and cost savings, without manual tuning required
• New user experiences with AI Functions and SQL Warehouses in Notebooks
• New ways to access your data with the SQL Statement Execution API
• Simple and efficient data processing with Streaming Tables, Materialized Views and Workflows integration
• Intelligent assistance powered by DatabricksIQ, our knowledge engine
• Enhanced administration tools with Databricks SQL System Tables and Live Query Profile
• Additional features for our partner integrations with Fivetran, dbt Labs and PowerBI

The AI-optimized warehouse: Ready for all your workloads, no tuning required

We believe that the best data warehouse is a lakehouse; therefore, we continue to extend our leadership in ETL workloads and harnessing the power of AI. Databricks SQL now also delivers industry-leading performance for your EDA and BI workloads, while improving cost savings, with no manual tuning.

[Figure: Best TCO & Performance across use cases: up to 2-3x faster, up to 5-10% faster, and up to 10x more queries per minute across ETL and BI workloads.]

Say goodbye to manually creating indexes. With Predictive I/O for reads (GA) and updates (Public Preview), Databricks SQL now analyzes historical read and write patterns to intelligently build indexes and optimize workloads. Early customers have benefited from a remarkable 35x improvement in point lookup efficiency, and impressive performance boosts of 2-6x for MERGE operations and 2-10x for DELETE operations.
With Predictive Optimizations (Public Preview), Databricks will seamlessly optimize file sizes and clustering by running OPTIMIZE, VACUUM, ANALYZE and CLUSTERING commands for you. With this feature, Anker Innovations benefited from a 2.2x boost to query performance while delivering 50% savings on storage costs.

Tired of managing different warehouses for smaller and larger workloads or fine-tuning scaling parameters? Intelligent Workload Management is a suite of features that keeps queries fast while keeping cost low. By analyzing real-time patterns, Intelligent Workload Management ensures that your workloads have the optimal amount of compute to execute incoming SQL statements without disrupting already running queries.

Unlock siloed data with Lakehouse Federation

Today's organizations face challenges in discovering, governing and querying siloed data sources across fragmented systems. With Lakehouse Federation, data teams can use Databricks SQL to discover, query and manage data in external platforms including MySQL, PostgreSQL, Amazon Redshift, Snowflake, Azure SQL Database, Azure Synapse, Google's BigQuery (coming soon) and more.

Finally, to accelerate complicated transformations or cross-joins on federated sources, Lakehouse Federation supports Materialized Views for better query latencies.

For more details, watch our dedicated session Lakehouse Federation: Access and Governance of External Data Sources from Unity Catalog from the Data+AI Summit.
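As a rough sketch of how a federated source is registered, Lakehouse Federation uses a connection plus a foreign catalog; the connection name, host, secret scope and database below are placeholders, not values from the article.

-- Register the external PostgreSQL instance as a Unity Catalog connection
CREATE CONNECTION postgres_marketing TYPE postgresql
OPTIONS (
  host 'pg.example.com',
  port '5432',
  user 'federation_user',
  password secret('federation', 'pg_password')  -- stored as a Databricks secret
);

-- Expose one of its databases as a foreign catalog
CREATE FOREIGN CATALOG marketing_pg USING CONNECTION postgres_marketing
OPTIONS (database 'marketing');

-- Federated tables can then be queried with the usual three-level namespace
SELECT * FROM marketing_pg.public.campaigns LIMIT 10;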
With Streaming Tables, you can use a single SQL statement to scalably ingest data from various sources such as cloud storage (S3, ADLS, GCS), message buses (EventHub, Kafka, Kinesis), and more. This ingestion occurs incrementally, enabling low-latency and cost-effective pipelines, without the need for managing complex infrastructure.

Need orchestration with DB SQL? Workflows now allows you to schedule SQL queries, dashboards and alerts. Easily manage complex dependencies between tasks and monitor past job executions with the intuitive Workflows UI or via API.

Streaming Tables and Materialized Views are now in public preview. To learn more, read our dedicated blog post. To enroll in the public preview for both, enroll in this form. Workflows in DB SQL is now generally available, and you can learn more by reading the documentation (AWS | Azure).
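For illustration, a Streaming Table ingesting JSON files from a hypothetical cloud storage path might look like the following sketch; the path, table name and format are assumptions.

-- Incrementally ingest new files as they arrive in cloud storage
CREATE OR REFRESH STREAMING TABLE raw_web_events
AS SELECT *
FROM STREAM read_files(
  's3://my-bucket/web-events/',  -- hypothetical landing path
  format => 'json'
);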
Databricks Assistant: Write better and faster SQL with natural language

Databricks Assistant is a context-aware AI assistant embedded inside Databricks Notebooks and the SQL Editor. Databricks Assistant can take a natural language question and suggest a SQL query to answer that question. When trying to understand a complex query, users can ask the Assistant to explain it using natural language, enabling anyone to understand the logic behind query results.

Databricks Assistant uses a number of signals to provide more accurate, relevant results. It uses context from code cells, libraries, popular tables, Unity Catalog schemas and tags to map natural language questions into queries and code. In the future, we will be adding integration with DatabricksIQ to provide even more context for your requests.

Manage your data warehouse with confidence
Administrators and IT teams need the tools to understand data warehouse
usage. With System Tables, Live Query Profile, and Statement Timeouts,
admins can monitor and fix problems when they occur, ensuring that your data
warehouse runs efficiently.
Gain deeper visibility and insights into your SQL environment with System
Tables. System Tables are Databricks-provided tables that contain information
about past statement executions, costs, lineage, and more. Explore metadata
and usage metrics to answer questions like “What statements were run and by
whom?”, “How and when did my warehouses scale?” and “What was I billed for?”.
Since System Tables are integrated within Databricks, you have access to native
capabilities such as SQL alerts and SQL dashboards to automate the monitoring
and alerting process.
As of today, there are three System Tables currently in public preview: Audit
Logs, Billable Usage System Table, and Lineage System Table (AWS | Azure).
Additional system tables for warehouse events and statement history are
coming soon.
For example, to compute the monthly DBUs used per SKU, you can query the Billable Usage System Table.

SELECT sku_name, usage_date, sum(usage_quantity) as `DBUs`
FROM system.billing.usage
WHERE
  month(usage_date) = month(NOW())
  AND year(usage_date) = year(NOW())
GROUP BY sku_name, usage_date

With Live Query Profile, users gain real-time insights into query performance to help optimize workloads on the fly. Visualize query execution plans and assess live query task executions to fix common SQL mistakes like exploding joins or full table scans. Live Query Profile allows you to ensure that running queries on your data warehouse are optimized and running efficiently. Learn more by reading the documentation (AWS | Azure).

Looking for automated controls? Statement Timeouts allow you to set a custom workspace or query level timeout. If a query's execution time exceeds the timeout threshold, the query will be automatically halted. Learn more by reading the documentation (AWS | Azure).

Democratize unstructured data analysis with AI Functions

With AI Functions, DB SQL is bringing the power of AI into the SQL warehouse. Effortlessly harness the potential of unstructured data by performing tasks such as sentiment analysis, text classification, summarization, translation and more. Data analysts can apply AI models via self-service, while data engineers can independently build AI-enabled pipelines.

Using AI Functions is quite simple. For example, consider a scenario where a user wants to classify the sentiment of some articles into Frustrated, Happy, Neutral, or Satisfied.

-- create a udf for sentiment classification
CREATE FUNCTION classify_sentiment(text STRING)
RETURNS STRING
RETURN ai_query(
  'Dolly', -- the name of the model serving endpoint
  named_struct(
    'prompt',
    CONCAT('Classify the following text into one of four categories [Frustrated, Happy, Neutral, Satisfied]:\n', text),
    'temperature', 0.5),
  'returnType', 'STRING');
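Once defined, the function can be applied like any other SQL function; the articles table and its article_text column below are hypothetical.

SELECT article_text,
       classify_sentiment(article_text) AS sentiment
FROM articles
LIMIT 100;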
Bring the power of SQL warehouses to notebooks

Databricks SQL warehouses are now in public preview in notebooks, combining the flexibility of notebooks with the performance and TCO of Databricks SQL Serverless and Pro warehouses. To enable SQL warehouses in notebooks, simply select an available SQL warehouse from the notebooks compute dropdown.

Connecting serverless SQL warehouses from Databricks notebooks

The new Databricks SQL Dashboards experience simplifies the permissions and management process. Finally, publish a dashboard to your entire organization, so that any authenticated user in your identity provider can access the dashboard via a secure web link, even if they don't have Databricks access.

New Databricks SQL Dashboards are currently in Private Preview. Contact your account team to learn more.

Bring the flexibility of Python into Databricks SQL with Python user-defined functions (UDFs). Integrate machine learning models or apply custom redaction logic for data processing and analysis by calling custom Python functions directly from your SQL query. UDFs are reusable functions, enabling you to apply consistent processing to your data pipelines and analysis. For instance, to redact email and phone numbers from a field, consider the following CREATE FUNCTION statement.

CREATE FUNCTION redact(a STRING)
RETURNS STRING
LANGUAGE PYTHON
AS $$
# Illustrative body: the original snippet is truncated at this point
import re
return re.sub(r'[\w.+-]+@[\w.-]+|\+?\d[\d\s().-]{6,}\d', '[REDACTED]', a)
$$
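A hypothetical call of the UDF from SQL (the customer_feedback table and comment column are illustrative):

SELECT comment, redact(comment) AS redacted_comment
FROM customer_feedback;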
At Data+AI Summit, Databricks SQL announced new integrations for a seamless experience with your tools of choice.

Databricks + Fivetran

We're thrilled to announce the general availability of Fivetran access in Partner Connect for all users, including non-admins with sufficient privileges to a catalog. This innovation makes it 10x easier for all users to ingest data into Databricks using Fivetran. This is a huge win for all Databricks customers as they can now bring data into the lakehouse architecture from the hundreds of connectors Fivetran offers, like Salesforce and PostgreSQL. Fivetran now fully supports serverless warehouses as well!

Databricks + dbt Labs

Simplify real-time analytics engineering on the lakehouse architecture with Databricks and dbt Labs. The combination of dbt's highly popular analytics engineering framework with the Databricks Platform provides powerful capabilities:

• dbt + Streaming Tables: Streaming ingestion from any source is now built in to dbt projects. Using SQL, analytics engineers can define and ingest cloud/streaming data directly within their dbt pipelines.
• dbt + Materialized Views: Building efficient pipelines becomes easier with dbt, leveraging Databricks' powerful incremental refresh capabilities. Users can use dbt to build and run pipelines backed by MVs, reducing infrastructure costs with efficient, incremental computation.

Learn more by reading the blog post here.
Databricks + PowerBI: Publish to PowerBI Workspaces

Publish datasets from your Databricks workspace to PowerBI Online workspace with a few clicks! No more managing ODBC/JDBC connections: simply select the datasets or schema you want to publish and select your PBI workspace! This makes it easier for BI admins and report creators to support PowerBI workspaces without also having to use Power BI Desktop.

Getting Started with Databricks SQL

Follow the guide (AWS | Azure | GCP) on how to set up a SQL warehouse to get started with Databricks SQL today! Databricks SQL Serverless is currently available with a 20%+ promotional discount; visit our pricing page to learn more.

You can also watch Databricks SQL: Why the Best Serverless Data Warehouse Is a Lakehouse and What's New in Databricks SQL - With Live Demos for a complete overview.
SECTION 3.5
Distributed Data Governance and Isolated Environments With Unity Catalog
Effective data governance is essential for any organization that relies on data, analytics and AI for its operations. In many organizations, there is a growing recognition of the value proposition of centralized data governance. However, even with the best intentions, implementing centralized governance can be challenging without the proper organizational processes and resources. The role of Chief Data Officer (CDO) is still emerging in many organizations, leaving questions about who will define and execute data governance policies across the organization.

As a result, the responsibility for defining and executing data governance policies across the organization is often not centralized, leading to policy variations or governing bodies across lines of business, sub-units, and other divisions within an organization. For simplicity, we can call this pattern distributed governance, where there is a general agreement on the distinctions between these governing units but not necessarily a central data governance function.

In this blog, we'll explore implementing a distributed governance model using Databricks Unity Catalog, which provides a unified governance solution for data, analytics, and AI in the lakehouse.

Evolution of Data Governance in Databricks

Before the introduction of Unity Catalog, the concept of a workspace was monolithic, with each workspace having its own metastore, user management, and Table ACL store. This led to intrinsic data and governance isolation boundaries between workspaces and duplication of effort to address consistency across them.

To handle this, some customers resorted to running pipelines or code to synchronize their metastores and ACLs, while others set up their own self-managed metastores to use across workspaces. However, these solutions added more overhead and maintenance costs, forcing upfront architecture decisions on how to partition data across the organization and creating data silos.

Data Governance with Unity Catalog

To overcome these limitations, Databricks developed Unity Catalog, which aims to make it easy to implement data governance while maximizing the ability to collaborate on and share data. The first step in achieving this was implementing a common namespace that permits access to any data within an organization. This approach may seem like a challenge to the distributed governance pattern mentioned earlier, but Unity Catalog offers new isolation mechanisms within the namespace that organizations have traditionally addressed using multiple Hive metastores. These isolation mechanisms enable groups to operate independently with minimal or no interaction and also allow them to achieve isolation in other scenarios, such as production vs development environments.
Hive Metastore versus Unity Catalog in Databricks

With Hive, a metastore was a service boundary, meaning that having different metastores meant different hosted underlying Hive services and different underlying databases. Unity Catalog is a platform service within the Databricks Data Intelligence Platform, so there are no service boundaries to consider. Unity Catalog provides a common namespace that allows you to govern and audit your data in one place.

When using Hive, it was common to use multiple metastores, each with its own namespace, to achieve isolation between development and production environments, or to allow for the separation of data between operating units. In Unity Catalog, these requirements are solved through dynamic isolation mechanisms on namespaces that don't compromise the ability to share and collaborate on data and don't require hard one-way upfront architecture decisions.

Working across different teams and environments

When using a data platform, there is often a strong need to have isolation boundaries between environments like dev/prod and between business groups, teams, or operating units of your organization. Let's begin by defining isolation boundaries in a data platform such as Databricks:

• Users should only gain access to data based on agreed access rules
• Data can be managed by designated people or teams
• Data should be physically separated in storage
• Data should only be accessed in designated environments

Users should only gain access to data based on agreed access rules

Organizations usually have strict requirements around data access based on organizational or regulatory requirements, which is fundamental to keeping data secure. Typical examples include employee salary information or credit card payment information. Access to this type of information is typically tightly controlled and audited periodically. Unity Catalog provides organizations granular control over data assets within the catalog to meet these industry standards. With the controls Unity Catalog provides, users will only see and query the data they are entitled to see and query.

Data can be managed by designated people or teams

Unity Catalog gives you the ability to choose from centralized governance or distributed governance models.
In a distributed governance model, you would consider a catalog or set of catalogs to be a data domain. The owner of that catalog can create and own all assets and manage governance within that domain. Therefore the owners of domains can operate independently of other owners in other domains. We strongly recommend setting a group or service principal to be the owner for both of these options if management is done through tooling.

[Figure: The metastore owner has full access over the metastore, but no access to data unless granted.]

Data should be physically separated in storage

Unity Catalog allows you to choose the defaults for how data is separated in storage. By default, all data is stored at the metastore, which provides a single cloud storage location and credential as the default location for managed tables. With feature support for managed data sources on catalogs and schemas, you can physically isolate data storage and access, helping your organization achieve their governance and data management requirements. When creating managed tables, the data will then be stored using the schema location (if present), followed by the catalog location (if present), and will only use the metastore location if the prior two locations have not been set.
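A hedged sketch of how storage isolation can be declared at the catalog and schema level is shown below; the names and cloud paths are placeholders, and the paths must be covered by an existing Unity Catalog external location.

-- Managed tables in this catalog default to the business unit's own bucket
CREATE CATALOG IF NOT EXISTS bu_1_prod
MANAGED LOCATION 's3://bu-1-prod-bucket/managed';

-- A schema can narrow the default location further
CREATE SCHEMA IF NOT EXISTS bu_1_prod.sales
MANAGED LOCATION 's3://bu-1-prod-bucket/managed/sales';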
Data should only be accessed in designated environments

Oftentimes, organizational and compliance requirements maintain that you need designated environments that contain PII data for analysis, with special access rules around who can access the data and the environments that allow access to that data. Sometimes requirements dictate that certain data sets or domains cannot be crossed or combined together.

In Databricks, we consider a workspace to be an environment. Unity Catalog has the ability to limit the workspaces in which a catalog is available, regardless of a user's individual ACLs. This means that the metastore admin, or the catalog owner, can define the workspaces that a data catalog can be accessed from. This can be controlled via our UI or via API/terraform for easy integrations. We even recently published a blog on how to control Unity Catalog via terraform to help fit your specific governance model.

[Figure: Access to data and availability of data can be isolated across workspaces and groups; for example, a prod catalog bound to a prod workspace for Analysts, a bu_1_staging catalog bound to a bu_dev_stg_ws workspace for BU Testers, a bu_1_prod catalog bound to a bu_prod_ws workspace for BU Users, and a team_x_sandbox catalog bound to a team_x_ws workspace for Team X. Users can only access certain catalogs in certain environments.]

Conclusion
With Unity Catalog at the center of your lakehouse architecture, you can
achieve a flexible and scalable governance implementation without sacrificing
your ability to manage and share data effectively. With Unity Catalog, you can
overcome the limitations and constraints of your existing Hive metastore,
enabling you to more easily isolate and collaborate on data according to your
specific business needs. Follow the Unity Catalog guides (AWS, Azure) to get
started. Download this free ebook on Data, analytics and AI governance to learn
more about best practices to build an effective governance strategy for your
data lakehouse.
SECTION 3.6
The Hitchhiker's Guide to Data Privilege Model and Access Control in Unity Catalog
As the volume, velocity and variety of data grows, organizations are increasingly relying on staunch data governance practices to ensure their core business outcomes are adequately met. Unity Catalog is a fine-grained governance solution for data and AI powering the Databricks Data Intelligence Platform. It helps simplify the security and governance of your enterprise data assets by providing a centralized mechanism to administer and audit data access.

Taking a journey down memory lane, before Unity Catalog unified the permission model for files and tables and added support for all languages, customers were implementing fine-grained data access control on Databricks using the legacy workspace-level Table ACLs (TACL), which were essentially restricted to certain cluster configurations and worked only for Python and SQL. Both Unity Catalog and TACL let you control access to securable objects like catalogs, schemas (databases), tables and views, but there are some nuances in how each access model works.

A good understanding of the object access model is essential for implementing data governance at scale using Unity Catalog. Even more so if you have already implemented the Table ACL model and are looking to upgrade to Unity Catalog to take advantage of all the newest features, such as multi-language support, centralized access control and data lineage.

The Axioms of the Unity Catalog access model

• Unity Catalog privileges are defined at the metastore: Unity Catalog permissions always refer to account-level identities, while TACL permissions defined within the hive_metastore catalog always refer to the local identities in the workspace.
• Privilege inheritance: Objects in Unity Catalog are hierarchical and privileges are inherited downward. The highest level object that privileges are inherited from is the catalog.
• Object ownership is important: Privileges can only be granted by a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Only the owner of an object, or the owner of the catalog or schema that contains it, can drop the object.
• USE privileges for boundaries: USE CATALOG/SCHEMA is required to interact with objects within a catalog/schema. However, USE privilege does not allow one to browse the object metadata that is housed within the catalog/schema.
• Permissions on derived objects are simplified: Unity Catalog only requires the owner of a view to have SELECT privilege, along with USE SCHEMA on the view's parent schema and USE CATALOG on the parent catalog. In contrast, with TACL, a view's owner needs to be an owner of all referenced tables and views.
Some more complex axioms

• Secure by default: only clusters with Unity Catalog-specific access modes (shared or single-user) can access Unity Catalog data. With TACL, all users have access to all data on non-shared clusters.

Interesting patterns

There are many governance patterns that can be achieved using the Unity Catalog access model.

Example 2 - Setting boundary for data sharing

Axiom 2 allows catalog/schema owners to set up default access rules for their data. For example, the following commands enable the machine learning team to create tables within a schema and read each other's tables:
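The commands themselves are not reproduced in this extract; a hedged equivalent using the sample_catalog.sample_schema objects referenced below and a hypothetical machine-learning group would be:

-- Allow the group to reach the catalog and schema
GRANT USE CATALOG ON CATALOG sample_catalog TO `machine-learning`;
GRANT USE SCHEMA ON SCHEMA sample_catalog.sample_schema TO `machine-learning`;

-- Allow the group to create tables and read each other's tables in the schema
GRANT CREATE TABLE ON SCHEMA sample_catalog.sample_schema TO `machine-learning`;
GRANT SELECT ON SCHEMA sample_catalog.sample_schema TO `machine-learning`;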
More interestingly, axiom 4 now allows catalog/schema owners to limit how far individual schema and table owners can share data they produce. A table owner granting SELECT to another user does not allow that user read access to the table unless they also have been granted USE CATALOG privileges on its parent catalog as well as USE SCHEMA privileges on its parent schema.

Permission page showing analysts group having SELECT and USE SCHEMA permission on sample_catalog.sample_schema
Axiom 5 unlocks the ability for data consumers to do this seamlessly, without
requiring manual back and forth with the table owners.
Permission page showing a view owned by analysts group, and account users group having SELECT permission
Query page showing the result of the answer to everything

Thanks to axiom 6, data owners can be certain that there will be no unauthorized access to their data due to cluster misconfiguration. Any cluster that is not configured with the correct access mode will not be able to access data in Unity Catalog.

Users can check that their clusters can access Unity Catalog data thanks to this handy tooltip on the Create Clusters page.

Cluster summary indicating Unity Catalog support

Now that data owners can understand the data privilege model and access control, they can leverage Unity Catalog to simplify access policy management at scale.

There are upcoming features that will further empower data administrators and owners to author even more complex access policy:

• Row filtering and column masking: Use standard SQL functions to define row filters and column masks, allowing fine-grained access controls on rows and columns.
• Attribute Based Access Controls: Define access policies based on tags (attributes) of your data assets.
SECTION 04
SECTION 4.1
How to Build a Marketing Analytics Solution Using Fivetran and dbt on Databricks
Fivetran allows you to easily ingest data from 50+ marketing platforms into
Delta Lake without the need for building and maintaining complex pipelines.
If any of the marketing platforms’ APIs change or break, Fivetran will take care
of updating and fixing the integrations so your marketing data keeps flowing in.
Fivetran and dbt can read and write to Delta Lake using a Databricks cluster or Databricks SQL warehouse

dbt is a popular open source framework that lets lakehouse users build data pipelines using simple SQL. Everything is organized within directories, as plain text, making version control, deployment, and testability simple. Once the data is ingested into Delta Lake, we use dbt to transform, test and document the data. The transformed marketing analytics data mart built on top of the ingested data is then ready to be used to help drive new marketing campaigns and initiatives.

Both Fivetran and dbt are a part of Databricks Partner Connect, a one-stop portal to discover and securely connect data, analytics and AI tools directly within the Databricks Platform. In just a few clicks you can configure and connect these tools (and many more) directly from within your Databricks workspace.
All the code for the demo is available on Github in the workflows-examples
repository.
The final dbt model lineage graph will look like this. The Fivetran source tables
are in green on the left and the final marketing analytics models are on the right.
By selecting a model, you can see the corresponding dependencies with the
different models highlighted in purple.
Create new Salesforce and Marketo connections in Fivetran to start ingesting the marketing data into Delta Lake. When creating the connections, Fivetran will also automatically create and manage a schema for each data source in Delta Lake. We will later use dbt to transform, clean and aggregate this data.

For the demo, name the schemas that will be created in Delta Lake marketing_salesforce and marketing_marketo. If the schemas do not exist, Fivetran will create them as part of the initial ingestion load.

Define a destination schema in Delta Lake for the Salesforce data source
You can then choose which objects to sync to Delta Lake, where each object will be saved as an individual table. Fivetran also makes it simple to manage and view which columns are being synchronized for each table:

Additionally, Fivetran provides a monitoring dashboard to analyze how many monthly active rows of data are synchronized daily and monthly for each table, among other useful statistics and logs.

Fivetran monitoring dashboard to monitor monthly active rows synchronized

Data modeling using dbt

Now that all the marketing data is in Delta Lake, you can use dbt to create your data model by following these steps.

Set up your local dbt development environment in your chosen IDE by following the setup instructions for dbt Core and dbt-databricks.

Scaffold a new dbt project and connect to a Databricks SQL warehouse using dbt init, which will ask for the following information:

$ dbt init
Enter a name for your project (letters, digits, underscore):
Which database would you like to use?
[1] databricks
[2] spark

Enter a number: 1
host (yourorg.databricks.com):
http_path (HTTP Path):
token (dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX):
schema (default schema that dbt will build objects in):
threads (1 or more) [1]:

Once you have configured the profile you can test the connection using:

$ dbt debug
Install Fivetran dbt model packages for staging

The first step in using the Marketo and Salesforce data is to create the tables as sources for our model. Luckily, Fivetran has made this easy to get up and running with their pre-built Fivetran dbt model packages. For this demo, let's make use of the marketo_source and salesforce_source packages.

To install the packages, just add a packages.yml file to the root of your dbt project and add the marketo_source, salesforce_source and fivetran_utils packages:

packages:
  - package: dbt-labs/spark_utils
    version: 0.3.0
  - package: fivetran/marketo_source
    version: [">=0.7.0", ...]
  - package: fivetran/salesforce_source
    version: [">=0.4.0", ...]

To download and use the packages, run:

$ dbt deps

You should now see the Fivetran packages installed in the packages folder.

Update dbt_project.yml for Fivetran dbt models

There are a few configs in the dbt_project.yml file that you need to modify to make sure the Fivetran packages work correctly for Databricks. The dbt_project.yml file can be found in the root folder of your dbt project.

spark_utils overriding dbt_utils macros

The Fivetran dbt models make use of macros in the dbt_utils package, but some of these macros need to be modified to work with Databricks, which is easily done using the spark_utils package.

It works by providing shims for certain dbt_utils macros, which you can set using the dispatch config in the dbt_project.yml file. With this, dbt will first search for macros in the spark_utils package when resolving macros from the dbt_utils namespace.

dispatch:
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']

Variables for the marketo_source and salesforce_source schemas

The Fivetran packages require you to define the catalog (referred to as database in dbt) and schema of where the data lands when being ingested by Fivetran. Add these variables to the dbt_project.yml file with the correct catalog and schema names. The default catalog is hive_metastore, which will be used if the *_database variables are left blank. The schema names will be what you defined when creating the connections in Fivetran.

vars:
  marketo_source:
    marketo_database: # leave blank to use the default hive_metastore catalog
    marketo_schema: marketing_marketo
  salesforce_source:
    salesforce_database: # leave blank to use the default hive_metastore catalog
    salesforce_schema: marketing_salesforce
To avoid all the staging tables that are created by the Fivetran source models being created in the default target schema, it can be useful to define a separate staging schema. In the dbt_project.yml file, add the staging schema name, and it will then be used when creating the staging tables.

models:
  marketo_source:
    +schema: your_staging_name
    tmp:
      stg_marketo__activity_email_bounced_tmp:
        +enabled: false
      stg_marketo__email_template_history_tmp:
        +enabled: false
    stg_marketo__activity_email_bounced:
      +enabled: false
    stg_marketo__email_template_history:
      +enabled: false

At this stage you can run the Fivetran model packages to test that they work correctly:

dbt run --select marketo_source
Now that the Fivetran packages have taken care of creating and testing the staging models, you can begin to develop the data models for your marketing analytics use cases, which will be a star schema data model along with materialized aggregate tables.

For example, for the first marketing analytics dashboard, you may want to see how engaged certain companies and sales regions are by the number of email campaigns they have opened and clicked.

To do so, you can join Salesforce and Marketo tables using the Salesforce user email, Salesforce account_id and Marketo lead_id.

dbt lineage graph showing the star schema and aggregate tables data model

The models will be structured under the mart folder in the following way. You can view the code for all the models on GitHub in the /models/mart directory; the following describes what is in each folder, along with an example.

Core models

The core models are the facts and dimensions tables that will be used by all downstream models to build upon.

The dbt SQL code for the dim_user model:

with salesforce_users as (
    select
        account_id,
        email
    from {{ ref('stg_salesforce__user') }}
    where email is not null and account_id is not null
),

marketo_users as (
    select
        lead_id,
        email
    from {{ ref('stg_marketo__lead') }}
),

joined as (
    select
        lead_id,
        account_id
    from salesforce_users
    left join marketo_users
        on salesforce_users.email = marketo_users.email
)

select * from joined
You can also add documentation and tests for the models using a YAML file in the folder. There are 2 simple tests in the core.yml file that have been added.

Intermediate models

Some of the final downstream models may rely on the same calculated metrics, and so to avoid repeating SQL you can create intermediate models that can be reused (see the sketch below).

These are the final marketing analytics models that will be used to power the dashboards and reports used by marketing and sales teams.

To run and test the models:

dbt run
dbt test
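A rough sketch of the kind of reusable intermediate model described above, assuming hypothetical model, column and staging-table names that are not part of the demo project:

-- models/mart/intermediate/int_email_engagement.sql (hypothetical path and names)
-- Aggregates per-lead email activity once so several downstream models can reuse it
with activities as (
    select
        lead_id,
        campaign_id,
        activity_type
    from {{ ref('stg_marketo__activity_email') }}  -- staging model name is illustrative
)

select
    lead_id,
    campaign_id,
    count_if(activity_type = 'open') as emails_opened,
    count_if(activity_type = 'click') as emails_clicked
from activities
group by lead_id, campaign_id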
Your dbt project should be managed and version controlled in a Git repository.
You can create a dbt task in your Databricks Workflows job pointing to the
Git repository.
For each run, you can see the logs for each dbt command, helping you debug and fix any issues.
To get started with building your own solution, visit the documentation for integrating Fivetran and dbt with Databricks, and reuse the marketing_analytics_demo project example as a starting point.
The dbt task type in Databricks Workflows is in private preview. To try the dbt
task type, please reach out to your Databricks account executive.
SECTION 4.2
According to the latest reports from global consultancy EY, the future of insurance will become increasingly data-driven and analytics-enabled. The recent focus on the cloud has improved access to advanced technological infrastructure, but most organizations still need help implementing and leveraging these capabilities. It's time to shift the focus to operationalizing services to realize value.

In today's economic circumstances, insurance companies face an ever-increasing number of challenges. Insurers are being forced to leverage data to their advantage and innovate at an accelerated pace. For personal P&C insurers, this means an increased focus on personalization and customer retention. Brand loyalty is at an all-time low, with customers continuously shopping for more competitive rates and better overall experiences, which increases the risk of churn. An increase in fraudulent claims further erodes profit margins. Insurers need to find additional ways to reduce costs and better manage risks.

Automating and optimizing the claims-handling process is one area that can significantly reduce costs through time saved and a lesser reliance on human capital. Furthermore, effectively leveraging insights from data and advanced analytics can substantially reduce the overall exposure to risk.

The motivation behind the "Smart Claims" solution accelerator is simple: improve the claims handling process to enable faster settlement, lower processing costs, and deliver quicker insights about potentially fraudulent claims, with the lakehouse. Implementing the lakehouse paradigm simplifies the current architectural landscape and sets the scene for future expansion across the organization. The accompanying assets can be found here.
Reference Architecture and Workflow

A typical claims workflow involves some level of orchestration between operational systems such as Guidewire and analytical systems like Databricks. The diagram below shows an example of such a workflow for an automotive insurer.

Smart Claims Reference Architecture and Workflow

Automating and optimizing the claims handling process requires a deep understanding of customer interaction with operational systems and the various sources of information available for analysis.

In this example, we assume that customers primarily interact through a mobile application, and from there, they can submit claims and monitor the status of existing cases. This touch point offers vital information about customer behavior. Another important source of information is IoT devices installed in customer vehicles. Telematics data can be streamed to the operational and analytical systems, providing valuable insights into customer driving behavior and patterns. Other external data sources may include weather and road conditions data that supplement the traditional data categories such as vehicle characteristics (make, model, year), driver characteristics and exposure/coverage (limits, deductibles).

Access to additional data sources can become increasingly important, especially in the absence of data from traditional sources such as credit bureaus. Credit scores from bureaus usually form the basis for risk modeling, assessing the exposure for drivers, which ultimately impacts their premiums. On the other hand, data from mobile applications and IoT devices provide a more personalized view of customer behavior, which could be used to create a more accurate indicator of the risk associated with a given party. This alternative, behavioral-based approach to risk modeling and pricing is essential for delivering a hyper-personalized customer experience.
The following steps capture the overall flow:

Policy data is ingested.
Telematics data is continuously ingested from IoT sensors. A claimant submits claims data via a mobile app.
All the operational data is ingested into cloud storage.
This is incrementally loaded as 'Raw data' into Delta Bronze tables.
The data is wrangled and refined via various data transformations.
Data is scored using the trained model.
The predictions are loaded to a Gold table.
The Claims Dashboard is refreshed for visualization.
The resulting insights are fed back to the operational system. This provides the feedback loop of pulling data from Guidewire and passing 'Next Best Action' back to Guidewire in real time to know which claims should be prioritized.
The Claims Decisioning workflows use these generated insights to route the case appropriately (e.g., approve repair expenses, rental reimbursement, or alert authorities).

How the lakehouse paradigm aids Smart Claims

The lakehouse architecture enables all data personas (data engineers, data scientists, analytic engineers, and BI analysts) to work collaboratively on a single platform. Supporting all big-data workloads and paradigms (e.g., batch processing, streaming, DataOps, ML, MLOps, and BI) in a single, collaborative platform greatly simplifies the overall architecture, improves stability, and reduces cost significantly.

Databricks Delta Live Tables (DLT) pipelines offer a simple, declarative framework to develop and implement workloads quickly. DLT also provides native support for data quality management, with granular constraints to guarantee the integrity of outputs.

ML and AI workloads can easily be created and managed with MLflow for reproducibility and auditability. MLflow simplifies the entire model lifecycle, from experimentation through model deployment, serving, and archiving. ML can be run on all types of data, including unstructured data beyond text (images, audio, video, etc.). In this solution, we will use computer vision capabilities to assess damage to the vehicle.

Finally, Databricks SQL provides a fast and efficient engine to query curated and aggregated data. These insights can then be packaged and served through interactive dashboards within minutes.
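To make the declarative style and inline data quality constraints concrete, here is a minimal DLT SQL sketch; the storage path and table names are hypothetical, and the actual accelerator may structure its pipeline differently.

-- Incrementally load raw claims files from cloud storage into a Bronze table
CREATE OR REFRESH STREAMING LIVE TABLE bronze_claims
AS SELECT * FROM cloud_files("/mnt/smart-claims/raw/claims", "json");

-- Refine into a Silver table, dropping any record without a claim number
CREATE OR REFRESH STREAMING LIVE TABLE silver_claims (
  CONSTRAINT valid_claim_number EXPECT (claim_number IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(LIVE.bronze_claims);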
The diagram below shows a reference architecture for the lakehouse in the
context of typical insurance use cases:
Insurance Reference Architecture
Data Ingestion using DLT and Multitask Workflows

Automating the claims-handling process starts with optimizing the ingestion and data engineering workflow. The figure below offers a summary of the typical data sources encountered, including structured, semi-structured and unstructured. Some sources are slower-moving, while others update more rapidly. Additionally, some sources might be additive, requiring appending, while others offer incremental updates and must be treated as slowly changing dimensions.

DLT can simplify and operationalize the data processing pipeline. The framework offers support for Auto Loader to facilitate ingestion from streaming sources, efficient auto-scaling to handle sudden changes in data volumes, and resiliency via restart on task failure.

Databricks Workflows can accommodate multiple tasks and workloads (e.g., notebooks, DLT, ML, SQL). Workflows support repair-and-run and compute sharing across tasks, enabling robust, scalable, cost-effective workloads. Additionally, Workflows can easily be automated through schedules or programmatic invocation via REST APIs.
Insight Generation using ML and Dynamic Rules Engine
On the other hand, rules engines offer flexible ways of defining known operational characteristics and statistical checks, which can be automated and applied without requiring human interaction. Flags are raised whenever data does not comply with preset expectations and are sent for human review and investigation. Incorporating such an approach with ML-based workflows offers additional oversight and significantly reduces the time claims investigators require to dissect and review flagged cases. Analyzing recent trends can further aid in reviewing similar cases, such as recreating the scene of an incident from telematics data (location of accident, speed assessment).

Insight visualization using Dashboards

In this example, we created two dashboards to capture critical business insights. These include a Loss Summary dashboard for a high-level view of the overall business operations, which covers:

Loss Ratio, computed as insurance claims paid plus adjustment expenses divided by total earned premiums; e.g., the typical average Loss Ratio (all coverages combined, Bodily Injury and Physical Damage) for personal auto should be around 65%

A summary visualization capturing the count of incident type by damage severity

Trend lines over various features/dimensions

Geographic distribution of policies
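Expressed in SQL, the Loss Ratio metric above boils down to a simple aggregate; the table and column names in this sketch are hypothetical.

SELECT
  coverage_type,
  SUM(claims_paid + adjustment_expenses) / SUM(earned_premium) AS loss_ratio
FROM gold_claims_policy
GROUP BY coverage_type;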
Summary
Innovation and personalization are essential for insurance firms to differentiate
themselves from the competition. Databricks provides a data intelligence
platform for insurers to enable and accelerate innovation with an open, secure,
extensible architecture that easily integrates with third-party tools and services.
This Solution Accelerator demonstrates how the paradigm can be applied to
claims handling. Further, the Databricks ecosystem offers a range of capabilities
to enable data teams and business stakeholders to collaborate and generate
and share insights that support business decisions and drive tangible value to
the bottom line.
The technical assets, including pipeline configurations, models, and sample data
used in this example, can be accessed here or directly on Git.
SECTION 4.3
Architecting an ecosystem of services able to support the plethora of data-driven use cases in this digitally transformed business can, however, seem to be an impossible task. This section will focus on one crucial aspect of the modern data stack: batch processing. Although it may seem an outdated paradigm, we'll see why batch processing remains a vital and highly viable component of the data architecture. And we'll see how Databricks can help FSIs navigate some of the crucial challenges faced when building infrastructure to support these scheduled or periodic workflows.

The world of finance is changing, but across the board incumbents and startups continue to rely heavily on batch processing to power core business functions. Whether for reporting and risk management or anomaly detection and surveillance, FSIs require batch processing to reduce human error, increase the speed of delivery, and reduce operating costs.

While real-time systems and streaming services can help FSIs remain agile in addressing the volatile market conditions at the edge, they do not typically meet the requirements of back-office functions. Most business decisions are not reactive but rather require considered, strategic reasoning. By definition, this approach requires a systematic review of aggregate data collected over a period of time. Batch processing in this context still provides the most efficient and cost-effective method for processing large, aggregate volumes of data. Additionally, batch processing can be done offline, reducing operating costs and providing greater control over the end-to-end process.
Getting started
Starting with a 30,000-ft view, most FSIs will have a multitude of data sources scattered across on-premises systems, cloud-based services and even third-party applications. Building a batch ingestion framework that caters for all these connections requires complex engineering and can quickly become a burden on maintenance teams. And that's even before considering things like change data capture (CDC), scheduling, and schema evolution. In this section, we will demonstrate how the lakehouse architecture for financial services and its ecosystem of partners can be leveraged to address these key challenges and greatly simplify the overall architecture.

The lakehouse architecture is designed to provide a unified platform that supports all analytical and scientific data workloads. Figure 1 shows the reference architecture for a decoupled design that allows easy integration with other platforms that support the modern data ecosystem. The lakehouse makes it easy to construct ingestion and serving layers that operate irrespective of the data's source, volume, velocity, and destination.
Figure 1 — Reference architecture for the Data Intelligence Platform for Financial Services (data sources, ingestion, serving).

To demonstrate these capabilities of the LFS, we turn to the world of insurance. We consider the basic reporting requirements for a typical claims process. For example:

Number of active policies
Number of claims
Value of claims
Total exposure
Loss ratio

and a breakdown by incident type and severity. All these metrics are easily calculable from two key sources of data: 1) the book of policies and 2) claims filed by customers. The policy and claims records are typically stored in a combination of enterprise data warehouses (EDWs) and operational databases. The main challenge becomes connecting to these sources and ingesting data into our lakehouse, where we can leverage the power of Databricks to calculate the desired outputs.

Luckily, the flexible design of the LFS makes it easy to leverage best-in-class products from a range of SaaS technologies and tools to handle specific tasks. One possible solution for our claims analytics use case would be to use Fivetran for the batch ingestion plane. Fivetran provides a simple and secure platform for connecting to numerous data sources and delivering data directly to the Databricks Data Intelligence Platform. Additionally, it offers native support for CDC, schema evolution and workload scheduling. In Figure 2, we show the technical architecture of a practical solution for this use case.

Figure 2 — Technical architecture for a simple insurance claims workflow (policy records ingested every 12 hours; accident data ingested every hour).

Once the data is ingested and delivered to the LFS, we can use Delta Live Tables (DLT) for the entire engineering workflow. DLT provides a simple, scalable, declarative framework for automating complex workflows and enforcing data quality controls. The outputs from our DLT workflow, our curated and aggregated assets, can be interrogated using Databricks SQL (DB SQL). DB SQL brings data warehousing to the LFS to power business-critical analytical workloads. Results from DB SQL queries can be packaged in easy-to-consume dashboards and served to business users.
For the next step, we move to the Fivetran interface. From here, we can
easily create and configure connections to several different source systems
(please refer to the official documentation for a complete list of all supported
connections). In our example, we consider three sources of data: 1) policy
records stored in an Operational Data Store (ODS) or Enterprise Data Warehouse
(EDW), 2) claims records stored in an operational database, and 3) external data
delivered to blob storage. As such, we require three different connections to be
configured in Fivetran. For each of these, we can follow Fivetran's simple guided
process to set up a connection with the source system. Figures 5 and 6 show
how to configure new connections to data sources.
Connections can further be configured once they have been validated. One important option to set is the frequency with which Fivetran will interrogate the source system for new data. In Figure 7, we can see how easy Fivetran has made it to set the sync frequency, with intervals ranging from 5 minutes to 24 hours.

Fivetran will immediately interrogate and ingest data from source systems once a connection is validated. Data is stored as Delta tables and can be viewed from within Databricks through the Catalog Explorer. By default, Fivetran will store all data under the Hive metastore. A new schema is created for each new connection, and each schema will contain at least two tables: one containing the data and another with logs from each attempted ingestion cycle (see Figure 8).
Figure 8 — Summary of tables created by Fivetran in the Databricks Warehouse for an example connection.
Having the data stored in Delta tables is a significant advantage. Delta Lake
natively supports granular data versioning, meaning we can time travel through
each ingestion cycle (see Figure 9). We can use DB SQL to interrogate specific
versions of the data to analyze how the source records evolved.
Figure 9 — View of the history showing changes made to the Fivetran audit table.
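For example, the history and versions shown in Figures 8 and 9 can be inspected directly with DB SQL; the schema and table names below are hypothetical.

-- Inspect the change history written by each ingestion cycle
DESCRIBE HISTORY insurance_demo.policy_records;

-- Query an earlier version of the table to see how the source records looked at that point
SELECT * FROM insurance_demo.policy_records VERSION AS OF 3;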
...

# Read the curated policy records
curated_policies = dlt.read("curated_policies")

# Evaluate the validity of the claim
curated_claims = curated_claims \
    .alias("a") \
    .join(
        curated_policies.alias("b"),
        on = F.col("a.policy_number") == F.col("b.policy_number"),
        how = "left"
    ) \
    .select(
        [F.col(f"a.{c}") for c in curated_claims.columns]
        + [F.col(f"b.{c}").alias(f"policy_{c}") for c in ("effective_date", "expiry_date")]
    ) \
    .withColumn(
        # Calculate the number of months between coverage starting and the claim being filed
        "months_since_covered", F.round(F.months_between(F.col("claim_date"), F.col("policy_effective_date")))
    ) \
    .withColumn(
        # Check if the claim was filed before the policy came into effect
        "claim_before_covered", F.when(F.col("claim_date") < F.col("policy_effective_date"), F.lit(1)).otherwise(F.lit(0))
    ) \
    .withColumn(
        # Calculate the number of days between the incident occurring and the claim being filed
        "days_between_incident_and_claim", F.datediff(F.col("claim_date"), F.col("incident_date"))
    )

# Return the curated dataset
return curated_claims

One significant advantage of DLT is the ability to specify and enforce data quality standards. We can set expectations for each DLT table with detailed data quality constraints that should be applied to the contents of the table. Currently, DLT supports expectations for three different scenarios:

Decorator         Description
expect            Retain records that violate expectations
expect_or_drop    Drop records that violate expectations
expect_or_fail    Halt the execution if any record(s) violate constraints

Expectations can be defined with one or more data quality constraints. Each constraint requires a description and a Python or SQL expression to evaluate. Multiple constraints can be defined using the expect_all, expect_all_or_drop, and expect_all_or_fail decorators. Each decorator expects a Python dictionary where the keys are the constraint descriptions, and the values are the respective expressions. The example below shows multiple data quality constraints for the retain and drop scenarios described above.

@dlt.expect_all({
    "valid_driver_license": "driver_license_issue_date > (current_date() - cast(cast(driver_age AS INT) AS INTERVAL YEAR))",
    "valid_claim_amount": "total_claim_amount > 0",
    "valid_coverage": "months_since_covered > 0",
    "valid_incident_before_claim": "days_between_incident_and_claim > 0"
})
@dlt.expect_all_or_drop({
    "valid_claim_number": "claim_number IS NOT NULL",
    "valid_policy_number": "policy_number IS NOT NULL",
    "valid_claim_date": "claim_date < current_date()",
    "valid_incident_date": "incident_date < current_date()",
    "valid_incident_hour": "incident_hour between 0 and 24",
    "valid_driver_age": "driver_age > 16",
    "valid_effective_date": "policy_effective_date < current_date()",
    "valid_expiry_date": "policy_expiry_date <= current_date()"
})
def curate_claims():
    ...
We can use more than one Databricks notebook to declare our DLT tables. Assuming we follow the medallion architecture, we can, for example, use different notebooks to define tables comprising the Bronze, Silver and Gold layers. The DLT framework can digest instructions defined across multiple notebooks to create a single workflow; all inter-table dependencies and relationships are processed and considered automatically. Figure 10 shows the complete workflow for our claims example. Starting with three source tables, DLT builds a comprehensive pipeline that delivers thirteen tables for business consumption.

Results for each table can be inspected by selecting the desired entity. Figure 11 provides an example of the results of the curated claims table. DLT provides a high-level overview of the results from the data quality controls:

Figure 11 — Example of the detailed view for a Delta Live Tables (DLT) table entity with the associated data quality report.
Results from the data quality expectations can be analyzed further by querying the event log. The event log contains detailed metrics about all expectations defined for the workflow pipeline. The query below provides an example for viewing key metrics from the last pipeline update, including the number of records that passed or failed expectations:

SELECT
  row_expectations.dataset AS dataset,
  row_expectations.name AS expectation,
  SUM(row_expectations.passed_records) AS passing_records,
  SUM(row_expectations.failed_records) AS failing_records
FROM
  (
    SELECT
      explode(
        from_json(
          details:flow_progress:data_quality:expectations,
          "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
        )
      ) row_expectations
    FROM
      event_log_raw
    WHERE
      event_type = 'flow_progress'
      AND origin.update_id = '${latest_update.id}'
  )
GROUP BY
  row_expectations.dataset,
  row_expectations.name;

Again, we can view the complete history of changes made to each DLT table by looking at the Delta history logs (see Figure 12). This allows us to understand how tables evolve over time and investigate complete threads of updates if a pipeline fails.

Figure 12 — View of the history of changes made to a resulting Delta Live Tables (DLT) table entity.

We can further use change data capture (CDC) to update tables based on changes in the source datasets. DLT CDC supports updating tables with slowly changing dimension (SCD) types 1 and 2 (see the sketch below).

We have one of two options for our batch process to trigger the DLT pipeline. We can use the Databricks Auto Loader to incrementally process new data as it arrives in the source tables, or create scheduled jobs that trigger at set times or intervals. In this example, we opted for the latter, with a scheduled job that executes the DLT pipeline every five minutes.
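A minimal DLT SQL sketch of the CDC pattern mentioned above, assuming hypothetical table and column names:

-- Target table maintained by DLT from a stream of change records
CREATE OR REFRESH STREAMING LIVE TABLE curated_policies_scd;

APPLY CHANGES INTO LIVE.curated_policies_scd
FROM STREAM(LIVE.raw_policy_changes)
KEYS (policy_number)
SEQUENCE BY ingest_timestamp
STORED AS SCD TYPE 2;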
Databricks SQL lets us run queries directly against the source data with up to 12x better price/performance than its alternatives.

We can leverage DB SQL to perform specific ad hoc queries against our curated and aggregated tables. We might, for example, run a query against the curated policies table that calculates the total exposure (see the sketch further below). The DB SQL query editor provides a simple, easy-to-use interface to build and execute such queries.

We can also use the DB SQL query editor to run queries against different versions of our Delta tables. For example, we can query a view of the aggregated claims records for a specific date and time (see the example below). We can further use DB SQL to compare results from different versions to analyze only the changed records between those states.
SELECT
  *
FROM
  insurance_demo_lakehouse.aggregated_claims_weekly TIMESTAMP AS OF '2022-06-05T17:00:00';
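And the ad hoc total-exposure query mentioned above could be as simple as the following sketch; the coverage_limit column is hypothetical, while the table name follows the curated policies example used earlier.

SELECT
  SUM(coverage_limit) AS total_exposure
FROM
  insurance_demo_lakehouse.curated_policies
WHERE
  policy_expiry_date >= current_date();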
DB SQL offers the option to use a serverless compute engine, eliminating the
need to configure, manage or scale cloud infrastructure while maintaining the
lowest possible cost. It also integrates with alternative SQL workbenches (e.g.,
DataGrip), allowing analysts to use their favorite tools to explore the data and
generate insights.
Business insights
Finally, we can use DB SQL queries to create rich visualizations on top of our query results. These visualizations can then be packaged and served to end users through interactive dashboards (see Figure 13).

For our use case, we created a dashboard with a collection of key metrics, rolling calculations, high-level breakdowns, and aggregate views. The dashboard provides a complete summary of our claims process at a glance. We also added the option to specify date ranges. DB SQL supports a range of query parameters that can substitute values into a query at runtime. These query parameters can be defined at the dashboard level to ensure all related queries are updated accordingly.
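A sketch of a dashboard query using such parameters; the table and column names are hypothetical, and start_date / end_date are example parameter names.

SELECT
  incident_type,
  COUNT(*) AS claim_count,
  SUM(total_claim_amount) AS total_claimed
FROM
  insurance_demo_lakehouse.curated_claims
WHERE
  claim_date BETWEEN '{{ start_date }}' AND '{{ end_date }}'
GROUP BY
  incident_type;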
DB SQL integrates with numerous third-party analytical and BI tools like Power
BI, Tableau and Looker. Like we did for Fivetran, we can use Partner Connect to
link our external platform with DB SQL. This allows analysts to build and serve
dashboards in the platforms that the business prefers without sacrificing the
performance of DB SQL and the Databricks Data Intelligence Platform.
Figure 13 — Example operational dashboard built on a set of resulting Delta Live Tables (DLT) table entities.

Conclusion

As we move into this fast-paced, volatile modern world of finance, batch processing remains a vital part of the modern data stack, able to hold its own against the features and benefits of streaming and real-time services. We've seen how we
can use the lakehouse architecture for financial services and its ecosystem of
partners to architect a simple, scalable, and extensible framework that supports
complex batch-processing workloads with a practical example in insurance
claims processing. With Delta Live Tables (DLT) and Databricks SQL (DB SQL),
we can build a data platform with an architecture that scales infinitely, is easy
to extend to address changing requirements, and will withstand the test of time.
To learn more about the sample pipeline described, including the infrastructure
setup and configuration used, please refer to this GitHub repository or watch
this demo video.
SECTION 05

SECTION 5.1
When it comes to advertising, consumers want to see content that’s relevant and
personalized — especially if it’s appearing on their mobile devices, where time is a valuable
resource — so it’s important to instantly capture the consumer’s attention and drive
engagement. InMobi does this by using real-time customer data to deliver targeted mobile
advertising and lock screen experiences. But as data processing requirements increased to
20+ terabytes per hour, the cost of running their multicloud data warehouse skyrocketed,
while its proprietary nature created silos that hindered collaboration and data sharing.
InMobi migrated from their multicloud data warehouse to Databricks to unify their various workloads (data warehousing, AI and analytics), streamline operations and free up their engineers to work on more valuable tasks, helping them to achieve better operational agility and efficiency. Since moving to the lakehouse, the company has not only significantly reduced their total cost of ownership (TCO) compared to when they used a multicloud data warehouse; they have also improved productivity across the organization, resulting in faster time-to-market of new products.

"Our focus to optimize price/performance was met head-on by Databricks. The lakehouse helped us reduce costs without sacrificing performance across mixed workloads, allowing us to optimize data and AI operations today and into the future."
— MOHIT SAXENA, Co-founder and Group CTO, InMobi

32% lower TCO compared to their multicloud data warehouse
15% faster queries compared to their multicloud data warehouse
20% improved performance of supplier reporting
Complex legacy infrastructure and multicloud data warehouse unwieldy to manage

InMobi is all about targeted advertising and helping brands reach and engage consumers in a meaningful and cost-effective way. But to accurately deliver relevant ads, you need data — a lot of it. Over time, they extended their on-premises Hadoop system by adding multiple cloud data warehouses to address various problems. However, as the amount of data the company needed to process increased exponentially (20TB of data per hour), InMobi continued to build on top of their legacy system, resulting in a multicloud data warehouse that had a host of challenges: It was overly complex, had outages, was extremely costly as they scaled, and led to data silos that limited data sharing and collaboration. The team at InMobi realized that if they continued with this system in place, it would slow down their ability to innovate and keep precious engineering resources locked into maintenance mode.

"The data infrastructure that we built worked, but it created significant complexity and overhead that pulled our focus away from our own core products," said Madhan Sundaram, Senior Director of Platform Engineering at InMobi. "We needed our talented engineers to create more value for our customers, and to do that, we needed a simpler and more unified system."

What InMobi wanted was a single system that could solve multiple problems. To accomplish this goal, the company needed to consolidate their disjointed systems into a single platform to free up engineers to focus on higher-value tasks such as developing ML and large language models. That's why the team looked to the Databricks Data Intelligence Platform to unify their data warehousing and AI workloads on a single platform.

Migration to lakehouse results in unified data, analytics and AI

Although InMobi has a team of fully capable engineers, they learned a valuable lesson in productivity and operational agility. "We found that we were spending more time maintaining our environment, which was hurting our ability to support and collaborate with the business," explained Sundaram. "We wanted to be more strategic and identify ways to operationalize data more efficiently." After careful evaluation of their current multicloud data warehouse and the prospects of building in-house, they determined that the Databricks Data Intelligence Platform clearly aligned best with their goals — to improve developer productivity through reduced infrastructure complexity while achieving the best possible price/performance.

Once they made their selection, the team partnered with Databricks to start the migration planning process. With over a decade of customizations built on top of existing systems and over 1 petabyte of data, InMobi knew the migration was going to be complex. On the ETL side alone, there were 150 pipelines and eight teams involved with migrating from open source Apache Spark™. They also needed to migrate approximately 300 reporting dashboards to ensure there was no disruption of information delivery to suppliers and customers. To help navigate this process, Databricks worked closely with InMobi — which had allocated two in-house engineers per team to provide support — and Databricks' implementation partner Celebal Technologies on the optimizations needed for a seamless migration.

As a result, Databricks was able to help InMobi migrate from their multicloud data warehouse to the lakehouse with ease. "Databricks gives us the edge in terms of being able to utilize industry-first features that help us establish technical differentiation," said Sundaram. "Their expertise came through during the migration as they helped us optimize our compute resources for cost and performance. The ownership that Databricks took to drive the optimizations and how transparent and educational they were was commendable."
Now, with a unified, more streamlined lakehouse architecture, InMobi is able to take full advantage of their robust customer data to deliver smarter, more personalized mobile advertising. Various teams leverage Databricks notebooks for ad hoc analysis, Power BI for visualizations on top of Databricks SQL — the serverless data warehouse on the lakehouse — and MLflow to build their next-generation AI platform. They've also found great success with Delta Live Tables for anomaly detection, with a 50% improvement in SLAs and an 80% reduction in costs. Data silos and data discoverability are no longer an issue either, thanks to Unity Catalog. With Unity Catalog, they can now govern access at the table and column levels, while ensuring complete data lineage is captured so they have visibility into where data comes from and whether it's stale.

In terms of business impact, there've been a number of measurable improvements across the board after the migration. Not only are infrastructure costs 34% lower than before, but there is also a 15% boost in query speeds and 20% fewer job failures compared to their previous data environment, all contributing to 20% better performance of reporting and insights delivery to end users. The TCO is 32% lower compared to when they used a multicloud data warehouse, and there's been a 24% cost reduction in running their ETL pipelines as well. More qualitatively, the team is seeing overall better reliability — with systems more stable than ever — and experiencing a positive reputation boost with their customers.
With a platform designed to meet their analytics and AI needs, the InMobi team is diving into new technologies, including large language models (LLMs), to help deliver insights to their customers more efficiently. "We've started to look at implementing LLMs to make it easier for our end users to ask a question and find the information they need. The lakehouse architecture will make this effort easier, as jobs will run automatically under the hood. This will give our teams the ability to simply ask a question without any expertise and get the contextual answers they want at their fingertips," explained Sundaram.

"Our rate of experimentation has improved tremendously," said Mohit Saxena, Co-founder and Group CTO at InMobi. "Databricks has simplified collaboration, and the unified approach that the lakehouse offers allows us to be more efficient, productive and compliant when delivering new features and products."

By moving to a unified platform on the Databricks Data Intelligence Platform, InMobi can now focus on innovating in the mobile advertising space to deliver real-time personalization that drives value for both InMobi's customers as well as their internal end users.
SECTION 5.2
Akamai runs a pervasive, highly distributed content delivery network (CDN). Its CDN uses
approximately 345,000 servers in more than 135 countries and over 1,300 networks
worldwide to route internet traffic for some of the largest enterprises in media, commerce,
finance, retail and many other industries. About 30% of the internet’s traffic flows through
Akamai servers. Akamai also provides cloud security solutions.
In 2018, the company launched a web security analytics tool that offers Akamai customers a single, unified interface for assessing a wide range of streaming security events and performing analysis of those events. The web analytics tool helps Akamai customers take informed actions in relation to security events in real time. Akamai is able to stream massive amounts of data and meet the strict SLAs it provides to customers by leveraging Delta Lake and the Databricks Data Intelligence Platform for the web analytics tool.

"Delta Lake allows us to not only query the data better but to also acquire an increase in the data volume. We've seen an 80% increase in traffic and data in the last year, so being able to scale fast is critical."
— TOMER PATEL, Engineering Manager, Akamai

<1 minute ingestion time, reduced from 15 minutes
>85% of queries have a response time of 7 seconds or less
Ingesting and streaming enormous amounts of data

Akamai's web security analytics tool ingests approximately 10GB of data related to security events per second. Data volume can increase significantly when retail customers conduct a large number of sales — or on big shopping days like Black Friday or Cyber Monday. The web security analytics tool stores several petabytes of data for analysis purposes. Those analyses are performed to protect Akamai's customers and provide them with the ability to explore and query security events on their own.

The web security analytics tool initially relied on an on-premises architecture running Apache Spark™ on Hadoop. Akamai offers strict service level agreements (SLAs) to its customers of 5 to 7 minutes from when an attack occurs until it is displayed in the tool. The company sought to improve ingestion and query speed to meet those SLAs. "Data needs to be as real-time as possible so customers can see what is attacking them," says Tomer Patel, Engineering Manager at Akamai. "Providing queryable data to customers quickly is critical. We wanted to move away from on-prem to improve performance and our SLAs so the latency would be seconds rather than minutes."

After conducting proofs of concept with several companies, Akamai chose to base its streaming analytics architecture on Spark and the Databricks Data Intelligence Platform. "Because of our scale and the demands of our SLA, we determined that Databricks was the right solution for us," says Patel. "When we consider storage optimization, and data caching, if we went with another solution, we couldn't achieve the same level of performance."

Improving speed and reducing costs

Today, the web security analytics tool ingests and transforms data, stores it in cloud storage, and sends the location of the file via Kafka. It then uses a Databricks Job as the ingest application. Delta Lake, the open source storage format at the base of the Databricks Data Intelligence Platform, supports real-time querying on the web security analytics data. Delta Lake also enables Akamai to scale quickly. "Delta Lake allows us to not only query the data better but to also acquire an increase in the data volume," says Patel. "We've seen an 80% increase in traffic and data in the last year, so being able to scale fast is critical."

Akamai also uses Databricks SQL (DBSQL) and Photon, which provide extremely fast query performance. Patel added that Photon provided a significant boost to query performance. Overall, Databricks' streaming architecture combined with DBSQL and Photon enables Akamai to achieve real-time analytics, which translates to real-time business benefits.

Patel says he likes that Delta Lake is open source, as the company has benefitted from a community of users working to improve the product. "The fact that Delta Lake is open source and there's a big community behind it means we don't need to implement everything ourselves," says Patel. "We benefit from fixed bugs that others have encountered and from optimizations that are contributed to the project." Akamai worked closely with Databricks to ensure Delta Lake can meet the scale and performance requirements Akamai defined. These improvements have been contributed back to the project (many of which were made available as part of Delta Lake 2.0), and so any user running Delta Lake now benefits from the technology being tested at such a large scale in a real-world production scenario.
The Databricks Data Intelligence Platform enables the web security analytics tool to stream vast volumes of data and provide low-latency, real-time analytics-as-a-service to Akamai's customers. That way Akamai is able to make available security event data to customers within the SLA of 5 to 7 minutes from when an attack occurs. "Our focus is performance, performance, performance," says Patel. "The platform's performance and scalability are what drives us."

Using the Databricks Data Intelligence Platform, it now takes under 1 minute to ingest the security event data. "Reducing ingestion time from 15 minutes to under 1 minute is a huge improvement," says Patel. "It benefits our customers because they can see the security event data faster and they have a view of what exactly is happening as well as the capability to filter all of it."
SECTION 5.3
— DANIEL KNIJNIK, CEO, Quartile
Size of data and performance needed to scale

Quartile faced challenges in storing and processing data for multiple advertising channels due to crucial needs in reporting and consolidating sales data, including over 60 days of attribution. The previous architecture couldn't keep up with the size of data Quartile was processing. The team was batch processing over 10TB of data on a single job, which applied all transformations required for data reporting, causing server unavailability and late deliveries of data points. This process alone, which is responsible for improving the performance of ads, had severe performance problems, such as individual jobs running up to 7.5 hours every day.

Technology evolution, from legacy to modern data stack

As part of the evolution of the company, Quartile's data architecture has reached new levels of maturity, growing from traditional SQL databases running on Azure cloud to a new solution with the Databricks Data Intelligence Platform as the foundation. This modernization has had a direct impact in several areas of the company and clear benefits for Quartile's customers: with Delta Lake there has been more data reliability, faster performance and reduced costs. When migrating the data from a traditional SQL database to Databricks, they saw a considerable reduction in data volume, mainly due to the optimizations of Delta with version control and Parquet compaction, resulting in a storage reduction from 90TB to about 18TB.

The lakehouse architecture has enabled Quartile's data stack to evolve. This has been accomplished through several processes: by leveraging Databricks Auto Loader for incremental and efficient processing of new data files as they arrive in cloud storage; by using Delta Lake for its open format storage layer, which delivers reliability, security and performance on the data lake; by using Databricks SQL, which continues to bring data engineers and data analysts closer with easy-to-build queries and dashboards; and with Databricks Workflows, which connects all of the pieces together in a stable and scalable way.

To provide the best experience for customers, Quartile needs to be able to retrieve accurate data. This is an important challenge that is easier to handle with Spark user-defined functions, as the company uses the power of parallelism to break down processing into as many parts as needed. For their solution to scale, they utilize Terraform for deploying all workspaces, easily spinning up new clusters and jobs as well as ensuring the correct standards are being followed across the company.

Helping their partners run ads better

It is critical that Quartile's customers have a centralized solution in which they can analyze their sales, costs and other metrics related to their campaigns. At Quartile, they take advantage of the solid data engineering work and integrate Databricks with Power BI for embedding the dashboards directly in their portal. This helps provide a single place for clients to both configure marketing campaigns across several channels as well as follow up on performance changes, while still maintaining a smaller data storage footprint that is 80% less expensive compared to the data stored in traditional data warehouses by leveraging Delta Lake's file format on object storage. The ability to combine data from all different channels has already helped several of their customers — SmartyPants, as an example, has grown over 100% since they partnered with Quartile.
But that’s not all. Quartile has also patented algorithms for improving the
performance of ads, which are implemented utilizing the machine learning
persona in Databricks. The ability to build their entire data stack on a central lakehouse architecture has made the lives of Quartile's developers
much simpler, allowing them to focus on developing innovative solutions and
bringing increasingly better results for their customers. As another example,
OfficeSupply has had excellent results during their first year of working with
Quartile, with a 67% increase in Google Ads revenue and a 103% increase in
Google Shopping clicks for trademark terms by improving the performance
of individual jobs — they used to take 7.5 hours but now run in 45 minutes on
the lakehouse architecture.
START YOUR FREE TRIAL
© Databricks 2023. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation. Privacy Policy | Terms of Use