Lecture 1 Data Quality and Statistics
Introduction
Goals?
Metabolomics
Group 2
Group 1
ANOVA
PCA
PLS
Analytical Dimensions
variables
Samples
Statistical approaches
Multivariate approaches
Systems approaches
Pre-analysis
Data quality metrics: precision, accuracy
Remedies: normalization, outlier detection, missing value imputation
Normalization
sample-wise: sum, adjusted
measurement-wise: transformation (normality), encoding (trigonometric, etc.)
standard deviation
mean
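As a minimal sketch of the two normalization directions, the example below applies a sample-wise sum normalization to each row and a measurement-wise scaling by mean and standard deviation (autoscaling) to each column. The data matrix is hypothetical, and autoscaling is only one possible measurement-wise choice; the "adjusted" sum variant is not shown.

```python
from statistics import mean, stdev

def sum_normalize(sample):
    """Sample-wise: divide each measurement by the sample's total signal."""
    total = sum(sample)
    return [x / total for x in sample]

def autoscale(values):
    """Measurement-wise: center on the mean, scale by the standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(x - mu) / sd for x in values]

# Hypothetical data: rows are samples, columns are variables.
data = [
    [10.0, 20.0, 70.0],
    [5.0, 15.0, 30.0],
    [8.0, 12.0, 20.0],
]

row_normalized = [sum_normalize(row) for row in data]
columns = list(zip(*data))                      # transpose to get variables
col_scaled = [autoscale(col) for col in columns]
```

After sum normalization every sample adds up to 1; after autoscaling every variable has mean 0 and standard deviation 1.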
Outliers
single measurements (univariate)
Outliers
univariate/bivariate vs. multivariate
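For the univariate case, one common screening rule (an assumption here, not necessarily the one used in the lecture) flags values lying more than 1.5 interquartile ranges beyond the quartiles. The measurements below are hypothetical. Multivariate outliers need a different approach and are not covered by this sketch.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical replicate measurements with one suspicious value.
measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
flagged = iqr_outliers(measurements)
```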
Transformation
logarithm (shifted), power (Box-Cox), inverse
Quantile-quantile (Q-Q) plots give a useful visual overview of variable normality
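The three transformations listed above can be sketched as follows. The shift value and the Box-Cox exponent `lam` are illustrative parameters; in practice the exponent is usually estimated from the data (for example by maximum likelihood), which is not shown here.

```python
import math

def shifted_log(values, shift=1.0):
    """Natural log after adding a shift, so zeros can be handled."""
    return [math.log(x + shift) for x in values]

def box_cox(values, lam):
    """Box-Cox power transform; defined for strictly positive values."""
    if any(x <= 0 for x in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(x) for x in values]
    return [(x ** lam - 1.0) / lam for x in values]

def inverse(values):
    """Reciprocal transform; undefined for zeros."""
    return [1.0 / x for x in values]

# Hypothetical right-skewed measurements.
skewed = [1.0, 4.0, 9.0, 100.0]
transformed = box_cox(skewed, 0.5)
```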
X X -0.5
mean
analytical
biological
Imputation methods
single value (mean, min, etc.), multiple, multivariate
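Single-value imputation is the simplest of these options. The sketch below fills missing entries of one variable with either the mean or the minimum of the observed values. Representing missing values as `None` is an assumption made for the example; multiple and multivariate imputation are not shown.

```python
from statistics import mean

def impute_single_value(values, strategy="mean"):
    """Replace None entries with the mean or minimum of the observed values."""
    observed = [x for x in values if x is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "min":
        fill = min(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if x is None else x for x in values]

# Hypothetical variable with one missing measurement.
variable = [1.0, None, 3.0]
by_mean = impute_single_value(variable, "mean")
by_min = impute_single_value(variable, "min")
```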
PCA
Classification
Prediction
Useful Methods
analysis of variance (ANOVA)
partial least squares discriminant analysis (PLS-DA)
Others: random forest, CART, SVM, ANN
Useful Methods
correlation
partial least squares (PLS)
Data Structure
univariate: a single variable (1-D)
Data Types
continuous
discrete
binary
Data Complexity
Metadata; m, n (samples, variables); experimental design = complexity
Data
1-D, 2-D, m-D: number of variables = dimensionality
Univariate Analyses
univariate properties
length, center (mean, median, geometric mean)
standard deviation
mean
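These univariate properties can be computed directly with Python's standard `statistics` module. The values below are hypothetical and chosen so that the three measures of center differ.

```python
from statistics import geometric_mean, mean, median, stdev

def describe(values):
    """Summarize one variable: length, three measures of center, and spread."""
    return {
        "length": len(values),
        "mean": mean(values),
        "median": median(values),
        "geometric_mean": geometric_mean(values),
        "sd": stdev(values),
    }

# Hypothetical right-skewed variable.
summary = describe([1.0, 2.0, 4.0, 8.0, 16.0])
```

For skewed data like this, the mean (6.2) is pulled toward the large values, while the median and geometric mean (both 4) stay closer to the bulk of the data.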
Univariate Analyses
sensitive to distribution shape
wide n-of-one
Type I risk = 1 - (1 - p.value)^m
m = number of variables tested
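A short sketch of how this risk grows with the number of tests, assuming the tests are independent (the formula relies on that assumption). The function name and the threshold of 0.05 are illustrative.

```python
def type_one_risk(p_value, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - p_value) ** m

# Risk at a per-test threshold of 0.05 for an increasing number of variables.
for m in (1, 10, 100, 300):
    print(m, round(type_one_risk(0.05, m), 4))
```

With a single test the risk equals the threshold itself; with hundreds of variables a false positive is almost certain unless the p-values are corrected.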
FDR correction
Example:
Design: 30 samples, 300 variables
Test: t-test
FDR method: Benjamini and Hochberg (fdr) correction at q=0.05
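The Benjamini and Hochberg adjustment itself is short enough to sketch by hand. The p-values below are hypothetical and are not the results of the example design; the function is a plain-Python illustration rather than the routine used in the lecture.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [a <= 0.05 for a in adj]  # q = 0.05
```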
Results
Bivariate Data
relationship between two variables
correlation (strength)
regression (predictive)
Correlation
Parametric (Pearson) or rank-order (Spearman, Kendall)
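The difference between the parametric and rank-order coefficients can be illustrated with a monotone but non-linear relationship. This is a plain-Python sketch on hypothetical data; Kendall's coefficient is omitted for brevity.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical monotone, non-linear relationship (y = x squared).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]
r_pearson, r_spearman = pearson(x, y), spearman(x, y)
```

Spearman's coefficient is 1 here because the ordering is perfectly preserved, while Pearson's is slightly below 1 because the relationship is not a straight line.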
Regression describes the least-squares, or best-fit, line for the relationship (Y = m*X + b)
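A minimal sketch of fitting Y = m*X + b by ordinary least squares, using the closed-form slope and intercept. The points are hypothetical and lie exactly on a line so the result is easy to check.

```python
def least_squares_line(x, y):
    """Return slope m and intercept b minimizing the squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical points on the line Y = 2*X + 1.
m, b = least_squares_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predicted = m * 5.0 + b  # prediction for a new X value
```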
Geyser Example
Goal: Don't miss the eruption!
Data:
time between eruptions: 70 ± 14 min
duration of eruption: 3.5 ± 1 min
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357-365.
Covariates
Trends in the data that mask the effects of primary interest can be accounted for using covariate adjustment and appropriate modeling strategies
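One simple form of covariate adjustment, shown here only as an illustrative assumption, is to fit a linear trend against the covariate and subtract it while keeping the variable's overall mean. The covariate (for example, run order) and the signal values are hypothetical; real analyses often include the covariate directly in the statistical model instead.

```python
def adjust_for_covariate(y, covariate):
    """Remove the linear trend of y on the covariate, preserving the mean of y."""
    n = len(y)
    mc, my = sum(covariate) / n, sum(y) / n
    slope = sum((c - mc) * (v - my) for c, v in zip(covariate, y)) / sum(
        (c - mc) ** 2 for c in covariate
    )
    return [v - slope * (c - mc) for v, c in zip(y, covariate)]

# Hypothetical signal that drifts upward with run order.
run_order = [0.0, 1.0, 2.0, 3.0, 4.0]
signal = [10.0, 10.5, 11.0, 11.5, 12.0]
adjusted = adjust_for_covariate(signal, run_order)
```

In this constructed case the drift explains all of the variation, so every adjusted value equals the original mean of 11.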
Summary
Data exploration and pre-analysis:
increases robustness of results
guards against spurious findings
can greatly improve primary analyses
Univariate statistics:
are useful for identifying statistically significant changes or relationships
are sub-optimal for wide data
work best when combined with advanced multivariate techniques
Resources
Web-based data analysis platforms
MetaboAnalyst (http://www.metaboanalyst.ca/MetaboAnalyst/faces/Home.jsp)
MeltDB (https://meltdb.cebitec.uni-bielefeld.de/cgi-bin/login.cgi)
Programming tools
The R Project for Statistical Computing (http://www.r-project.org/)
Bioconductor (http://www.bioconductor.org/)
GUI tools
imDEV (http://sourceforge.net/projects/imdev/?source=directory)