Poll: How familiar are you with data observability? (0-10)
a. 0 - Never heard of it
b. 1-4 - I broadly know what it is
c. 5-6 - I am considering using it in my projects
d. 7-8 - I have already tried to implement a data observability approach
e. 9-10 - I have already successfully implemented data observability
Data Observability Fundamentals 101
Beyond Data Monitoring - Automation & Observability
What:
● How is data produced?
● How is data consumed?
● What does the data look like, and
○ Does it match my assumptions (volume, structure, freshness, …)? (see the sketch after this list)
○ Does it match others’ expectations (*)?
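A minimal sketch of the kinds of observations meant here, using plain pandas (the file name comes from this course's exercises; a real agent would send these metrics to the DO platform instead of printing them):

import os
import time
import pandas as pd

df = pd.read_csv("green_ingestion.csv")

observations = {
    "volume_rows": len(df),                               # volume
    "schema": {c: str(t) for c, t in df.dtypes.items()},  # structure
    "freshness_seconds": time.time() - os.path.getmtime("green_ingestion.csv"),  # freshness
}
print(observations)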
Why (examples):
● Address requests for information faster → Create trust
● Reduce the effort to detect and resolve issues → Recover trust
● Prevent issues, at least the known ones → Maintain trust
How: “By Design” - Turn your Data Platform into a Data Observable System
Data Observable Application
[Figure: three Data Applications, each running in its own Execution Environment, reading several inputs and writing an output, all connected to an unlabeled component marked “?”]
What is this thing? The Data Observability Platform.
Data Observability (in a Data Platform)
[Figure: a Data Platform in which every component is data observable: Ingestion Tools (Airbyte), Transformation Tools (dbt, Spark), Serving Tools (BigQuery), and Analytics Tools (Scikit-Learn), supported by Infrastructure, an Orchestration Platform, a Data Catalog, and the Data Observability Platform]
Data Observability Fundamentals: Environment + Setup
Environment
Prerequisites: Docker
cd python_environment
During this course, we will use a modified version of the NYC taxi trip
dataset (see https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
● Everything runs on Docker
● To follow the exercises, you will need:
○ A Kensu Community (free) account
○ Optionally (this week): A (free) Google Cloud account with a service account
[Figure: green_ingestion.csv is processed by the green_ingestion job into the company_bill_KPI dataset]
Spark Job: Code deep dive
Code walkthrough
How to make a Spark Job Data Observable? (1/3)
[Figure: a Spark Data Application in its Execution Environment, with several inputs and one output. Framework: Spark. Foundation: DAG (RDDs and their dependencies)]
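You can see this foundation yourself: Spark keeps the full lineage (DAG) behind every result, which is what an agent can walk to relate an output to its inputs. A quick illustration (file and column names assumed from the course dataset):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
df = spark.read.csv("green_ingestion.csv", header=True)
out = df.groupBy("VendorID").count()
# toDebugString() prints the chain of RDDs and dependencies behind `out`
print(out.rdd.toDebugString().decode())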
● Add the agent (a JAR file) to the Spark Session (driver) to generate the
observations and send them to the Kensu API
● Call the init function, which configures the agent’s behavior, its context,
and the connection to the DO platform’s API:
○ DO Platform API - URL and JWT Token
○ Project Name, Environment, Application
○ …
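A minimal sketch of these two steps in PySpark. The JAR path, the init_kensu_spark helper, and its parameter names are assumptions for illustration; check the Kensu agent documentation for the exact API:

from pyspark.sql import SparkSession

# Step 1 (assumed JAR location): add the Kensu agent JAR to the driver
spark = (SparkSession.builder
         .appName("spark_example")
         .config("spark.jars", "/opt/jars/kensu-spark-agent.jar")
         .getOrCreate())

# Step 2 (hypothetical signature): configure the agent's context and the
# DO platform API endpoint + JWT token
from kensu.pyspark import init_kensu_spark  # module / function name assumed

init_kensu_spark(spark,
                 kensu_url="https://<your-kensu-instance>",  # DO Platform API URL
                 kensu_token="<JWT token>",
                 project_name="oreilly-data-observability",
                 environment="dev",
                 process_name="spark_example")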
- spark_example_v1.py
- spark_example_v2.py
A first issue detected!
Here is the outcome of the company_bill dataset for both versions:
[Figure: the company_bill output, V1 vs. V2]
Let’s see how this can be detected in the data observability platform.
Databricks showcase
[Figure: Databricks jobs ingest green_ingestion.csv and yellow_ingestion.csv into the green_ingestion and yellow_ingestion BigQuery tables]
Pandas application
Code walkthrough
Python DO agent: How does it work?
Monkey patching: the Python agent augments the native Python modules
(e.g., pandas, NumPy) with extra capabilities to generate data observations and report them.
Monkey patch in the script:
import kensu.pandas as pd
import kensu.numpy as np
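Conceptually, the patched module delegates to native pandas and reports an observation on the way out. A simplified sketch of the idea (not Kensu's actual internals; _report_observation stands in for the agent's reporting to the platform):

import pandas as native_pd

def _report_observation(source, df):
    # Stand-in for sending observations to the DO platform API
    print(f"[observation] {source}: {len(df)} rows, "
          f"schema={dict(df.dtypes.astype(str))}")

def read_csv(*args, **kwargs):
    df = native_pd.read_csv(*args, **kwargs)  # delegate to native pandas
    source = args[0] if args else kwargs.get("filepath_or_buffer", "<unknown>")
    _report_observation(source, df)
    return df  # callers get exactly the DataFrame pandas would return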
Code walkthrough