0% found this document useful (0 votes)
10 views101 pages

Data Science With PostgreSQL

Uploaded by

bupbechanh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views101 pages

Data Science With PostgreSQL

Uploaded by

bupbechanh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 101

Data Science with PostgreSQL

Data Science with PostgreSQL

Balázs Bárány

Data Scientist

pgconf.de 2015
Data Science with PostgreSQL

Contents

Introduction  What is Data Science?


Process model

Tools and methods of Data Scientists

Data Science with PostgreSQL


Business & data understanding
Preprocessing
Modeling
Evaluation

Deployment

Summary
Data Science with PostgreSQL
Introduction  What is Data Science?

Sexiest job of the 21st century

I According to Google, LinkedIn, ...


Data Science with PostgreSQL
Introduction  What is Data Science?

Sexiest job of the 21st century

I According to Google, LinkedIn, ...

I Who is a Data Scientist?


Data Science with PostgreSQL
Introduction  What is Data Science?

Data Science Venn Diagram

(c) Drew Conway, 2010. CC-BY-NC


Data Science with PostgreSQL
Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?
Data Science with PostgreSQL
Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis


Data Science with PostgreSQL
Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize


Data Science with PostgreSQL
Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe


Data Science with PostgreSQL
Introduction  What is Data Science?

Tasks of data scientists

I Get data from various sources

I Big data?

I Mash up & format for analysis

I Analyze & visualize

I Predict & prescribe

I Operationalize
Data Science with PostgreSQL
Introduction  What is Data Science?
Process model

The Data Mining process

Cross Industry Standard Process for Data Mining (Kenneth Jensen/Wikimedia Commons)
Data Science with PostgreSQL
Tools and methods of Data Scientists

Tools and methods

Tools and methods


Data Science with PostgreSQL
Tools and methods of Data Scientists

Scripting and programming

I R

I Python with extensions

I Octave/Matlab, other mathematic languages

I Hadoop and Big Data programming libraries (mostly Java)

I Cloud services
Data Science with PostgreSQL
Tools and methods of Data Scientists

Integrated GUI tools

I (partly) Open Source: RapidMiner, KNIME, Orange

I Data Warehouse tools extended for analytics: Pentaho, Talend

I Many commercial tools, e. g. SAS, IBM SPSS

I Hadoop-related newcomers: e. g. Datameer


Data Science with PostgreSQL
Tools and methods of Data Scientists

Data Infrastructure

I Databases and data stores

I Relational, NoSQL
I Hadoop clusters
I In-memory

I Data streams

I Free-form data: text, images, video, audio, ...

I Web APIs

I Open Data
Data Science with PostgreSQL
Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format


Data Science with PostgreSQL
Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, ltering, calculating, ...


Data Science with PostgreSQL
Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, ltering, calculating, ...

I Data cleansing

I Missing values
I Abnormal values
Data Science with PostgreSQL
Tools and methods of Data Scientists

Data acquisition and preprocessing

I Data ingestion in raw format

I Joining, aggregating, ltering, calculating, ...

I Data cleansing

I Missing values
I Abnormal values

I Result: data set suitable for analytics


Data Science with PostgreSQL
Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not


Data Science with PostgreSQL
Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not

I Classication (supervised): Prediction of a class or category

I Regression (supervised): Prediction of numeric value


Data Science with PostgreSQL
Tools and methods of Data Scientists

Predictive Modeling

I Supervised and unsupervised methods

I Target variable known or not

I Classication (supervised): Prediction of a class or category

I Regression (supervised): Prediction of numeric value

I Clustering (unsupervised): Automatic grouping of data

I Association analysis, outlier detection, time series prediction,


...
Data Science with PostgreSQL
Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + condence

I What to do with predictions?


Data Science with PostgreSQL
Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + condence

I What to do with predictions?

I Store in ERP or CRM

I Tell someone (email, popup)

I Add a label (e. g. mark email as spam)


Data Science with PostgreSQL
Tools and methods of Data Scientists

Deployment and operationalization

I Model application to new data => prediction + condence

I What to do with predictions?

I Store in ERP or CRM

I Tell someone (email, popup)

I Add a label (e. g. mark email as spam)

I Interrupt nancial transaction => prescription

I Order supplies => prescription

I ...
Data Science with PostgreSQL
Data Science with PostgreSQL

Data Science with PostgreSQL

Doing Data Science with PostgreSQL


Data Science with PostgreSQL
Data Science with PostgreSQL

Caveats

I This stu is not easy


Data Science with PostgreSQL
Data Science with PostgreSQL

Caveats

I This stu is not easy

I Must be root and postgres

I Maintain your PostgreSQL yourself


I Able to compile stu
Data Science with PostgreSQL
Data Science with PostgreSQL

Caveats

I This stu is not easy

I Must be root and postgres

I Maintain your PostgreSQL yourself


I Able to compile stu

I You should ask ;-)

I your boss
I co-workers

I customer
Data Science with PostgreSQL
Data Science with PostgreSQL

Data Science with PostgreSQL

Business & data understanding


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success

I Project goals and challenges

I Availability of data and resources

I Success criteria
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Business understanding

I What is the purpose of the business?

I What are existing processes?

I Drivers of business success

I Project goals and challenges

I Availability of data and resources

I Success criteria

I Not a technical activity, PostgreSQL can't help much


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding

I Existing data

I Entities and covered concepts


I Complete? Correct? In suitable form?

I Usable? (regulations, access constraints, ...)


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding

I Existing data

I Entities and covered concepts


I Complete? Correct? In suitable form?

I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding

I Existing data

I Entities and covered concepts


I Complete? Correct? In suitable form?

I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too small

I Too big
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding

I Existing data

I Entities and covered concepts


I Complete? Correct? In suitable form?

I Usable? (regulations, access constraints, ...)

I Connecting separate data sources

I Simple or complex IDs

I Data size

I Too small

I Too big

I Suitability for predictive modeling

I Target variable?
I Attribute types
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import process


I Foreign Data Wrappers
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import process


I Foreign Data Wrappers

I Analyze data distribution

I Group by and aggregate

I Count, Count Distinct, Min, Max

I Count NULLs
I Search for missing links (incomplete foreign keys)
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding with PostgreSQL

I Get data into PostgreSQL

I Classical import process


I Foreign Data Wrappers

I Analyze data distribution

I Group by and aggregate

I Count, Count Distinct, Min, Max

I Count NULLs
I Search for missing links (incomplete foreign keys)

I Analyze surprizes

I Impossible values
I Missing values in required elds
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding with PostgreSQL  summary

I Good SQL knowledge required

I Tedious manual process

I repetitive
I not suitable for large number of attributes

I No built-in visualization
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding with PostgreSQL  summary

I Good SQL knowledge required

I Tedious manual process

I repetitive
I not suitable for large number of attributes

I No built-in visualization

I Or maybe...
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

SQL barchart output


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Bar chart from GUI tool


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Boxplot output
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding wrap up

I DBMS not built for this

I It can support more specialized tools


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Data understanding wrap up

I DBMS not built for this

I It can support more specialized tools

I Introduction: R

I A free software environment for statistical computing and

graphics
I Available in PostgreSQL
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

PL/R: A statistical language for PostgreSQL

I R as a standalone language

I Mathematical and statistical methods

I Powerful visualization functions


I Classical, modern and bleeding edge modeling
I Arrays and data frames are central data types
I Operates only in memory
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

PL/R: A statistical language for PostgreSQL

I R as a standalone language

I Mathematical and statistical methods

I Powerful visualization functions


I Classical, modern and bleeding edge modeling
I Arrays and data frames are central data types
I Operates only in memory

I PL/R: R as a loadable procedural language for PostgreSQL

I First released in 2003 by Joe Conway


I License: GPL
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

R usage outside of PostgreSQL

I Development environments

I RStudio (AGPL or commercial, local & web)


I RKWard, Cantor (KDE projects)
I StatET (Eclipse)

I Frontends

I R Commander

I Deducer
I Rattle

I Web framework: Shiny (AGPL or commercial)


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Working with R in PostgreSQL


I Install functions in the database

Example
select install_rcmd('
myfunction <-function(x)
{print(x)}
');
I Install without function body

Example
CREATE FUNCTION rnorm
(n integer, mean double precision, sd double precision)
RETURNS double precision[]
AS 
LANGUAGE 'plr';
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics
Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics

I Execution in the database

I Clumsy, but direct data access


Data Science with PostgreSQL
Data Science with PostgreSQL
Business & data understanding

Using R in PostgreSQL for data understanding

I Advanced visualization

I Data distributions

I Advanced statistics

I Execution in the database

I Clumsy, but direct data access

I Execution outside

I Simple and interactive, but data transfer


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Data Science with PostgreSQL

Preprocessing
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing

I What databases are built for


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing

I What databases are built for

I Rows: very dynamic

I Easy to create new rows by joining


I Easy to lter

I Columns: not so much

I Easy to create new columns


I Only explicit access
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing

I What databases are built for

I Rows: very dynamic

I Easy to create new rows by joining


I Easy to lter

I Columns: not so much

I Easy to create new columns


I Only explicit access

I Wider interpretation of preprocessing

I Enrichment with external data


I New attributes from existing ones
I Recoding, recalculation
I Missing value handling
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: organizing workow

I Common Table Expressions

I organize processing steps

I partial and intermediate results

Example
WITH source AS (
SELECT *, ROW_NUMBER() OVER () AS rownum
FROM source_table
),
no_missings AS (
SELECT *
FROM source
WHERE field1 IS NOT NULL
AND field2 IS NOT NULL
)
etc.
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: attribute creation

I Aggregation
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...)


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...)


I In-group numbering

I ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...)


I In-group numbering

I ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)


I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: attribute creation

I Aggregation

I Partial aggregation by window functions

I In-group measures, e. g. ratio

I att / SUM(att) OVER (PARTITION BY ...)


I In-group numbering

I ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)


I Comparing to previous/next value

I att - LAG(att, 1) OVER (ORDER BY ...)


I Much easier in SQL than programming languages and data
mining tools
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: enrichment

I PostGIS for geodata


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,

MSSQL, JDBC, SQL Alchemy ...)


I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)
Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,

MSSQL, JDBC, SQL Alchemy ...)


I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,

MSSQL, JDBC, SQL Alchemy ...)


I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)

I Files (CSV, ZIP, JSON, ...)


Data Science with PostgreSQL
Data Science with PostgreSQL
Preprocessing

Preprocessing: enrichment

I PostGIS for geodata

I Foreign data wrappers (see PostgreSQL Wiki)

I Other databases (other PostgreSQL server, MySQL, Oracle,

MSSQL, JDBC, SQL Alchemy ...)


I NoSQL databases (MongoDB, Cassandra, CouchDB, Redis, ...)

I Big Data (Hadoop Hive, Impala)

I Network sources - Multicorn (RSS, IMAP, Twitter, S3, ...)

I Files (CSV, ZIP, JSON, ...)

I Write your own in C or Python or Ruby


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Data Science with PostgreSQL

Modeling
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Model development

I Machine learning algorithms not well suited for SQL


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Model development

I Machine learning algorithms not well suited for SQL

I Some attempts to build them

I Naive Bayes, Linear Regression


I Dicult for more advanced algorithms
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Model development

I Machine learning algorithms not well suited for SQL

I Some attempts to build them

I Naive Bayes, Linear Regression


I Dicult for more advanced algorithms

I Better done in specialized language or tool

I PL/R
I PL/Python
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

PL/Python

I Python procedural language available in PostgreSQL

I scikit-learn: Machine learning toolbox for Python

I Classication, regression, clustering


I Model selection, validation
I Preprocessing

I matplotlib: Generic and statistical plotting library


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

PL/Python

I Python procedural language available in PostgreSQL

I scikit-learn: Machine learning toolbox for Python

I Classication, regression, clustering


I Model selection, validation
I Preprocessing

I matplotlib: Generic and statistical plotting library

I PL/Python is an alternative to PL/R


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Data Science with PostgreSQL

Evaluation
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ...


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ...

I Results on the training set meaningless


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation of modeling results

I Models return predictions

I Prediction can be compared to known result (target variable)

I Measures of model performance: Accuracy, precision, recall, ...

I Results on the training set meaningless

I Split validation

I Cross validation
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation results

I Good result depends on the application

I If not good enough,


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation results

I Good result depends on the application

I If not good enough,

I get more data


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing
Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classier


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classier

I optimize classier parameters


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classier

I optimize classier parameters

I Cycle: preprocessing - modeling - evaluation


Data Science with PostgreSQL
Data Science with PostgreSQL
Modeling

Evaluation results

I Good result depends on the application

I If not good enough,

I get more data

I do more preprocessing

I select better classier

I optimize classier parameters

I Cycle: preprocessing - modeling - evaluation

I Better done in data mining environment


Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Data Science with PostgreSQL

Deployment
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers


Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment


Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface


Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Deployment

I Advantages of deployment in the database:

I Less overhead

I Instant application using triggers

I Well-known execution environment

I Functionality available over standard interface

I Some models easily expressed in SQL


Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Deployment of PL/R or PL/Python models

I Model developed in database or outside


Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Deployment of PL/R or PL/Python models

I Model developed in database or outside

I Put into global context

I PL/R: load(saved object, .GlobalEnv)


I PL/Python: Global dictionary GD

I Application function in matching language

I Uses existing model


I Returns target variable
Data Science with PostgreSQL
Data Science with PostgreSQL
Deployment

Deployment of PL/R or PL/Python models

I Model developed in database or outside

I Put into global context

I PL/R: load(saved object, .GlobalEnv)


I PL/Python: Global dictionary GD

I Application function in matching language

I Uses existing model


I Returns target variable

I Trigger func or UPDATE uses application function


Data Science with PostgreSQL
Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment


Data Science with PostgreSQL
Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing


Data Science with PostgreSQL
Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration


Data Science with PostgreSQL
Summary

Summary

I PostgreSQL's support for data science tasks

I Best: preprocessing, deployment

I Modern SQL for preprocessing

I Foreign Data Wrappers for data integration

I Procedural languages for data mining


Data Science with PostgreSQL
Summary

Questions?

I Balázs Bárány, <balazs@tud.at>

I https://datascientist.at/

You might also like