

Data Observability Fundamentals

Concepts and methods


Who is with you today?

Founder / CPO @Kensu
Product Manager @Kensu


POLL: Who are you?
a. Data analyst
b. Data engineer
c. Data/ML ops
d. Data scientist
e. Software engineer
f. Team/Project Leader
g. Other, please share in the group chat
POLL: What is your main reason for following a data observability (DO) training?
a. Data Quality/Data Governance
b. Data Monitoring/DataOps
c. Curiosity/Learning
d. Upskill/Personal motivation
POLL: How would you rate your level of knowledge of data observability?
a. 0 - Never heard of it
b. 1 - 4 - I broadly see what it is
c. 5 - 6 - I am considering using it in my projects
d. 7 - 8 - I have already tried to implement a data observability approach
e. 9 - 10 - I have already successfully implemented data observability
Data Observability Fundamentals
101
Beyond Data Monitoring - Automation & Observability

What:
● How is data produced?
● How is data consumed?
● What does the data look like, and
○ Does it match my assumptions (volume, structure, freshness, …)?
○ Does it match others’ expectations (*)?

Why (examples):
● Address requests for information faster → Create trust
● Reduce the effort to detect and resolve issues → Recover trust
● Prevent issues, at least the known ones → Maintain trust
How: “By Design” - Turn your Data Platform into a Data Observable System
Data Observable Application

[Diagram: a Data Application running in its Execution Environment, with several inputs and one output]

Generate Data Observations Contextually


- Application Job (Spark job name)
- Code location and version (Git)
- Environment (PROD) and time (`now`)
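
As an illustration of contextual observations, here is a minimal sketch in plain Python. The `RunContext` class, the `capture_context` helper and all the sample values are hypothetical; they only show the kind of information that is attached to every observation of a run, not how a particular agent stores it.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunContext:
    """Context attached to every observation emitted during one run."""
    job_name: str        # e.g. the Spark job name
    code_location: str   # e.g. the Git repository URL
    code_version: str    # e.g. the Git commit hash
    environment: str     # e.g. "PROD"
    observed_at: str     # ISO-8601 timestamp of the run (`now`)


def capture_context(job_name, code_location, code_version, environment, now=None):
    """Build the run context; `now` can be injected to keep tests deterministic."""
    now = now or datetime.now(timezone.utc)
    return RunContext(job_name, code_location, code_version, environment,
                      now.isoformat())


context = capture_context(
    job_name="green_taxi_kpi",                     # hypothetical job name
    code_location="https://example.com/repo.git",  # hypothetical repository
    code_version="abc1234",                        # hypothetical commit hash
    environment="PROD",
    now=datetime(2023, 1, 1, tzinfo=timezone.utc),
)
print(asdict(context))
```

In a real job, these values would typically be read from the framework and the Git checkout rather than passed by hand.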
Data Observable Application

[Diagram: a Data Application running in its Execution Environment, with several inputs and one output]

Generate Data Observations Synchronously


- Data source Metadata: Location, Schema
- Compute metrics: Size, Null, Min, Max, Cardinality, …
- And more: Custom measures (Skew, Correlation, …)
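
The metrics listed above could be computed along these lines. This is a sketch in plain Python on a list of dictionaries, so that it runs without any dependency; the `compute_metrics` function and the sample rows are hypothetical, and an agent would compute the same kind of values on a DataFrame while the job runs.

```python
def compute_metrics(rows, column):
    """Return basic observations for one column of a list-of-dicts dataset."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "size": len(values),                       # number of rows
        "nulls": len(values) - len(present),       # missing values
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "cardinality": len(set(present)),          # distinct non-null values
    }


# Hypothetical sample rows, loosely shaped like taxi trip records.
trips = [
    {"vendor_id": 1, "total_amount": 12.5},
    {"vendor_id": 2, "total_amount": None},
    {"vendor_id": 1, "total_amount": 7.0},
]
metrics = compute_metrics(trips, "total_amount")
print(metrics)
```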
Data Observable Application

[Diagram: a Data Application running in its Execution Environment, with several inputs and one output]

Generate Continuously Validated Data Observations


- Size > 10K
- No Nulls for Address
- No skewed categories
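
The three example rules above can be sketched as a small validation function. The `validate` helper, its parameters and the 70% skew threshold are assumptions made for the illustration; a data observability platform would normally evaluate such rules on the collected metrics at every run.

```python
from collections import Counter


def validate(rows, min_size, not_null_column, category_column, max_share):
    """Return the list of violated rules (an empty list means the data passed)."""
    violations = []
    # Rule 1: the dataset must contain more than `min_size` rows.
    if len(rows) <= min_size:
        violations.append(f"size {len(rows)} is not > {min_size}")
    # Rule 2: no null values in the given column.
    nulls = sum(1 for row in rows if row.get(not_null_column) is None)
    if nulls:
        violations.append(f"{nulls} null value(s) in {not_null_column}")
    # Rule 3: no category may hold more than `max_share` of the rows.
    if rows:
        counts = Counter(row.get(category_column) for row in rows)
        top_category, top_count = counts.most_common(1)[0]
        if top_count / len(rows) > max_share:
            violations.append(f"category {top_category!r} is skewed")
    return violations


# Hypothetical rows that break all three rules.
rows = [
    {"address": "1 Main St", "category": "A"},
    {"address": None, "category": "A"},
    {"address": "3 Oak Ave", "category": "A"},
    {"address": "4 Elm Rd", "category": "B"},
]
violations = validate(rows, min_size=10_000, not_null_column="address",
                      category_column="category", max_share=0.7)
print(violations)
```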
Data Observable Principles
Generate Data Observations Contextually
● Application Job (Spark job name)
● Code location and version (Git)
● Environment (PROD) and time (`now`)
Generate Data Observations Synchronously
● Data source Metadata: Location, Schema
● Compute metrics: Size, Null, Min, Max, Cardinality, …
● And more: Custom measures (Skew, Correlation, …)
Generate Continuously Validated Data Observations
● Size > 10K
● No Null Addresses
● No skewed categories
[Diagram: Framework/Application + DO Agent → Data Observations → Data Observable Framework/Application]
Data Observable Pipeline/System

[Diagram: three pipeline steps (⚙ ⚙ ⚙) connected to a component marked “?”. What is this thing? A Data Observability Platform.]
Data Observability (in a Data Platform)

Data Platform components:
● Data Observable Ingestion Tools (Airbyte)
● Data Observable Transformation Tools (dbt, Spark)
● Data Observable Serving Tools (BigQuery)
● Data Observable Analytics Tools (Scikit-Learn)
● Infrastructure Platform
● Data Observability Platform
● Orchestration Platform
● Data Catalog
Data Observability Fundamentals
Environment + Setup
Environment

Today’s session focuses on PySpark and pandas

Please clone the following repository:


https://github.com/Fundamentals-of-Data-Observability/handson
Environment

Prerequisites: Docker

Run the following commands in your terminal:


install.sh/

cd python_environment

docker run -it --rm -v $(pwd)/volume:/volume mypython /bin/bash


Data we will use

During this course, we will use a modified version of the NYC taxi trip data
set (see https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

Starting from subsets of the data, we will:

● Read and join data with Python (pandas) and BigQuery
● Transform data and compute KPIs with Spark
● Compute and summarize KPIs with dbt
● Ingest data with Airbyte
● Orchestrate a pipeline with Airflow
Tools we will use

This week, we focus on Spark and Python:

● Everything runs in Docker
● To follow the exercises, you will need:
○ A Kensu Community (free) account
○ Optionally (this week): a (free) Google Cloud account with a service account

Next week, we focus on Airbyte, dbt and Airflow:

● To follow the exercises, you must have:
○ A Kensu Community account
○ A Google Cloud account with a service account (see the instructions in the code repository)
Data Observability Fundamentals
Spark and Databricks
Goals of the session

● Set up the environment for the course
● Make your first Spark integration
● Solve an issue thanks to data observability
● See how it can integrate with Databricks
Spark Job

What is the job doing?

- Transform the taxi_driver dataset for green taxis


- Compute a KPI on the money paid to the vendor

Structure of the job:

[Diagram: job lineage linking green_ingestion.csv, green_ingestion_KPI and company_bill]
Spark Job: Code deep dive

Code walkthrough
How to make a Spark Job Data Observable? (1/3)

[Diagram: Spark as the framework in the Execution Environment, with the DAG (RDDs and dependencies) as its foundation, several inputs and one output]

Generate Data Observations Contextually


- Application Job (Spark job name)
- Code location and version (Git)
- Environment (PROD) and time (`now`)
- Lineage (thanks GAD… ooops, DAG)
How to make a Spark Job Data Observable? (2/3)

[Diagram: Spark as the framework in the Execution Environment, with the DAG (RDDs and dependencies) as its foundation, several inputs and one output]

Generate Data Observations Synchronously


- Data source Metadata: Location, Schema
- Compute metrics: Size, Null, Min, Max, Cardinality, …
- And more: Custom measures (Skew, Correlation, …)
How to make a Spark Job Data Observable? (3/3)

[Diagram: Spark as the framework in the Execution Environment, with the DAG (RDDs and dependencies) as its foundation, several inputs and one output]

Generate Continuously Validated Data Observations


- Size > 10K
- No Nulls for Address
- No skewed categories
Getting started with the Spark DO Agent

In pyspark, you need to:

● Add the agent (a JAR file) to the Spark Session (driver) to generate the
observations and send them to the Kensu API
● Call the init function, which configures the agent’s behavior, context,
and API of the DO platform:
○ DO Platform API - URL and JWT Token
○ Project Name, Environment, Application
○ …

Note: All the parameters can also be added to a conf.ini file
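
The two steps above can be sketched as follows. This is not the agent's documented configuration: the `[agent]` section, its key names, the JAR path and both helper functions are hypothetical. The only real Spark element is the `spark.jars` property, which adds JAR files to the driver and executor classpaths.

```python
import configparser

# Hypothetical conf.ini content: the section and key names are illustrative,
# not the agent's documented configuration schema.
CONF_INI = """
[agent]
api_url = https://do-platform.example.com
api_token = <JWT token>
project_name = NYC taxi
environment = PROD
application = green_taxi_kpi
"""


def load_agent_settings(ini_text):
    """Parse the agent settings from conf.ini-style text into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser["agent"])


def spark_conf_for_agent(jar_path):
    """Spark options that make the agent JAR available to the Spark session."""
    # `spark.jars` is a standard Spark property (comma-separated list of JARs).
    return {"spark.jars": jar_path}


settings = load_agent_settings(CONF_INI)
spark_conf = spark_conf_for_agent("/path/to/agent.jar")  # hypothetical path

# With PySpark installed, the options would be applied roughly like this:
#   builder = SparkSession.builder.appName(settings["application"])
#   for key, value in spark_conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
print(settings["environment"], spark_conf)
```

Keeping the settings in a file rather than in the code lets the same job run unchanged in different environments.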


Hands-on

Add and configure the agent in the job


A first issue detected!

Now, let’s run 2 different versions of the code:

- spark_example_v1.py
- spark_example_v2.py
A first issue detected!

Here is the outcome of the company_bill dataset for both versions:

[Screenshots: company_bill output for V1 and for V2]
A first issue detected!

This mistake did not generate any error in the code!

Nevertheless, we obviously see that the result is wrong… 😱

Let’s see in the data observability platform how this can be detected.
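
One way such a silent issue can surface is by comparing the metrics of the same data source across runs. The sketch below is only an illustration of that idea: the `changed_metrics` function, the 20% tolerance and all the metric values are invented, and they do not reproduce the actual difference between the two versions of the job.

```python
def changed_metrics(previous, current, tolerance=0.2):
    """Return metrics whose relative change between two runs exceeds `tolerance`."""
    flagged = {}
    for name, old in previous.items():
        new = current.get(name)
        if new is None:
            flagged[name] = (old, new)      # the metric disappeared
        elif old == 0:
            if new != 0:
                flagged[name] = (old, new)  # any move away from zero is flagged
        elif abs(new - old) / abs(old) > tolerance:
            flagged[name] = (old, new)
    return flagged


# Hypothetical observations for the same output in two runs (values invented).
run_v1 = {"row_count": 1000, "total_amount.mean": 18.0, "total_amount.nulls": 0}
run_v2 = {"row_count": 1000, "total_amount.mean": 3.5, "total_amount.nulls": 0}
alerts = changed_metrics(run_v1, run_v2)
print(alerts)
```

The code of both runs succeeds, yet the comparison flags the metric that moved, which is the starting point for the investigation.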
Databricks showcase

Same method as with Spark:

install the JAR and the kensu-py package


Data Observability Fundamentals
Pandas and BigQuery
Goals of the session

● Set up the environment for the course
● Make pandas data observable
● Understand the underlying concepts of the Python agent
● See how we can apply the same concepts to the BigQuery client
Pandas application

What is the script doing?

- Ingest 2 datasets in BigQuery
- Join them in a QueryJob

Structure of the job:

[Diagram: green_ingestion.csv → green_ingestion (BigQuery table); yellow_ingestion.csv → yellow_ingestion (BigQuery table); the two tables are then joined by a QueryJob]
Pandas application

An example with a simpler script:

Copy from CSV to CSV

Code walkthrough
Python DO agent: How does it work?

Monkey patch: the python agent augments the native Python modules
with extra generation of data observations and reporting capabilities.
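
To make the idea concrete, here is a minimal sketch of a monkey patch using only the standard library. It is not the agent's code: the `observed` decorator and the `observations` list are hypothetical stand-ins for the logic that generates and reports data observations.

```python
import functools
import json

observations = []  # stand-in for reports sent to a data observability platform


def observed(func):
    """Wrap `func` so that each call also records a small observation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        observations.append({
            "function": func.__name__,
            "result_type": type(result).__name__,
            "size": len(result) if hasattr(result, "__len__") else None,
        })
        return result
    return wrapper


original_loads = json.loads
json.loads = observed(json.loads)      # the monkey patch itself
try:
    data = json.loads('[{"fare": 10}, {"fare": 12}]')  # caller code is unchanged
finally:
    json.loads = original_loads        # restore the native function

print(data, observations)
```

The calling code keeps using `json.loads` exactly as before; the observation is a side effect of the patched function.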
Monkey patch in the script

Initialization of the client


Collection of the execution context
Monkey patch in the script

Monkey patch of the pandas module


Monkey patch in the script

Collection of the data source metadata


Creation of the lineage
Computation of the metrics
Monkey patch in the script

Different levels of abstraction can be implemented. The higher the level, the more complex and time consuming it is to develop:

Level 1: Manual reporting of all the observations
Level 2: Best efforts to activate data observability
Level 3: Full Monkey Patch
Monkey patch in the script

To sum up, making a Python script Data Observable requires you to:

1. Install the agent (kensu-py) in the environment
2. Configure the client with:
a. Kensu API URL and Token
b. Project Name, Environment, Application
c. …
3. Modify the imports to enable the monkey patches

import kensu.pandas as pd

import kensu.numpy as np

import kensu.json as json


BigQuery module

The BigQuery client is used to manipulate data in the BigQuery database.

By augmenting this client, information can be retrieved from the QueryJob.

from kensu.google.cloud import bigquery
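
As an example of the information a QueryJob carries, the sketch below derives a simple lineage from it. In google-cloud-bigquery, a finished `QueryJob` exposes `referenced_tables` and `destination`, whose table references have `project`, `dataset_id` and `table_id` attributes. The two functions, the stand-in job object and all the table names are hypothetical, so that the sketch runs without credentials.

```python
from types import SimpleNamespace


def table_name(table_ref):
    """Format a table reference as project.dataset.table."""
    return f"{table_ref.project}.{table_ref.dataset_id}.{table_ref.table_id}"


def lineage_from_query_job(job):
    """Derive input and output table names from a finished query job."""
    return {
        "inputs": sorted(table_name(t) for t in job.referenced_tables),
        "output": table_name(job.destination) if job.destination else None,
    }


def fake_ref(table_id):
    """Stand-in for a table reference; project and dataset are hypothetical."""
    return SimpleNamespace(project="my-project", dataset_id="taxi", table_id=table_id)


# Stand-in for a real QueryJob so the sketch runs without a BigQuery account.
fake_job = SimpleNamespace(
    referenced_tables=[fake_ref("yellow_ingestion"), fake_ref("green_ingestion")],
    destination=fake_ref("joined_trips"),
)
lineage = lineage_from_query_job(fake_job)
print(lineage)
```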


BigQuery demonstration

Code walkthrough
