Discovering Partial Least Squares with JMP
By Ian Cox and Marie Gaudard
About this ebook
Ian Cox and Marie Gaudard use a “learning through doing” style. This approach, coupled with the interactivity that JMP itself provides, allows you to actively engage with the content. Four complete case studies are presented, accompanied by data tables that are available for download. The detailed “how to” steps, together with the interpretation of the results, help to make this book unique.
Discovering Partial Least Squares with JMP is of interest to professionals engaged in continuing development, as well as to students and instructors in a formal academic setting. The content aligns well with topics covered in introductory courses on: psychometrics, customer relationship management, market research, consumer research, environmental studies, and chemometrics. The book can also function as a supplement to courses in multivariate statistics and to courses on statistical methods in biology, ecology, chemistry, and genomics.
While the book is helpful and instructive to those who are using JMP, a knowledge of JMP is not required, and little or no prior statistical knowledge is necessary. By working through the introductory chapters and the case studies, you gain a deeper understanding of PLS and learn how to use JMP to perform PLS analyses in real-world situations.
This book motivates current and potential users of JMP to extend their analytical repertoire by embracing PLS. Dynamically interacting with JMP, you will develop confidence as you explore underlying concepts and work through the examples. The authors provide background and guidance to support and empower you on this journey.
This book is part of the SAS Press program.
Ian Cox
Ian Cox currently works in the JMP Division of SAS. Before joining SAS in 1999, he worked for Digital, Motorola, and BBN Software Solutions Ltd. and has been a consultant for many companies on data analysis, process control, and experimental design. A Six Sigma Black Belt, he was a Visiting Fellow at Cranfield University and is a Fellow of the Royal Statistical Society in the United Kingdom. Cox holds a Ph.D. in theoretical physics.
Discovering Partial Least Squares with JMP®
Ian Cox and Marie Gaudard
support.sas.com/bookstore
The correct bibliographic citation for this manual is as follows: Cox, Ian and Gaudard, Marie. 2013. Discovering Partial Least Squares with JMP®. Cary, NC: SAS Institute Inc.
Discovering Partial Least Squares with JMP®
Copyright © 2013, SAS Institute Inc., Cary, NC, USA
ISBN 978-1-61290-829-8 (electronic book)
All rights reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others’ rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a) and DFAR 227.7202-4 and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414.
October 2013
SAS provides a complete selection of books and electronic products to help customers use SAS® software to its fullest potential. For more information about our offerings, visit support.sas.com/bookstore or call 1-800-727-3228.
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Contents
Preface
A Word to the Practitioner
The Organization of the Book
Required Software
Accessing the Supplementary Content
Chapter 1 Introducing Partial Least Squares
Modeling in General
Partial Least Squares in Today’s World
Transforming, and Centering and Scaling Data
An Example of a PLS Analysis
The Data and the Goal
The Analysis
Testing the Model
Chapter 2 A Review of Multiple Linear Regression
The Cars Example
Estimating the Coefficients
Underfitting and Overfitting: A Simulation
The Effect of Correlation among Predictors: A Simulation
Chapter 3 Principal Components Analysis: A Brief Visit
Principal Components Analysis
Centering and Scaling: An Example
The Importance of Exploratory Data Analysis in Multivariate Studies
Dimensionality Reduction via PCA
Chapter 4 A Deeper Understanding of PLS
Centering and Scaling in PLS
PLS as a Multivariate Technique
Why Use PLS?
How Does PLS Work?
PLS versus PCA
PLS Scores and Loadings
Some Technical Background
An Example Exploring Prediction
One-Factor NIPALS Model
Two-Factor NIPALS Model
Variable Selection
SIMPLS Fits
Choosing the Number of Factors
Cross Validation
Types of Cross Validation
A Simulation of K-Fold Cross Validation
Validation in the PLS Platform
The NIPALS and SIMPLS Algorithms
Useful Things to Remember About PLS
Chapter 5 Predicting Biological Activity
Background
The Data
Data Table Description
Initial Data Visualization
A First PLS Model
Our Plan
Performing the Analysis
The Partial Least Squares Report
The SIMPLS Fit Report
Other Options
A Pruned PLS Model
Model Fit
Diagnostics
Performance on Data from Second Study
Comparing Predicted Values for the Second Study to Actual Values
Comparing Residuals for Both Studies
Obtaining Additional Insight
Conclusion
Chapter 6 Predicting the Octane Rating of Gasoline
Background
The Data
Data Table Description
Creating a Test Set Indicator Column
Viewing the Data
Octane and the Test Set
Creating a Stacked Data Table
Constructing Plots of the Individual Spectra
Individual Spectra
Combined Spectra
A First PLS Model
Excluding the Test Set
Fitting the Model
The Initial Report
A Second PLS Model
Fitting the Model
High-Level Overview
Diagnostics
Score Scatterplot Matrices
Loading Plots
VIPs
Model Assessment Using Test Set
A Pruned Model
Chapter 7 Water Quality in the Savannah River Basin
Background
The Data
Data Table Description
Initial Data Visualization
Missing Response Values
Impute Missing Data
Distributions
Transforming AGPT
Differences by Ecoregion
Conclusions from Visual Analysis and Implications
A First PLS Model for the Savannah River Basin
Our Plan
Performing the Analysis
The Partial Least Squares Report
The NIPALS Fit Report
Defining a Pruned Model
A Pruned PLS Model for the Savannah River Basin
Model Fit
Diagnostics
Saving the Prediction Formulas
Comparing Actual Values to Predicted Values for the Test Set
A First PLS Model for the Blue Ridge Ecoregion
Making the Subset
Reviewing the Data
Performing the Analysis
The NIPALS Fit Report
A Pruned PLS Model for the Blue Ridge Ecoregion
Model Fit
Comparing Actual Values to Predicted Values for the Test Set
Conclusion
Chapter 8 Baking Bread That People Like
Background
The Data
Data Table Description
Missing Data Check
The First Stage Model
Visual Exploration of Overall Liking and Consumer Xs
The Plan for the First Stage Model
Stage One PLS Model
Stage One Pruned PLS Model
Stage One MLR Model
Comparing the Stage One Models
Visual Exploration of Ys and Xs
Stage Two PLS Model
Stage Two MLR Model
The Combined Model for Overall Liking
Constructing the Prediction Formula
Viewing the Profiler
Conclusion
Appendix 1: Technical Details
Ground Rules
The Singular Value Decomposition of a Matrix
Definition
Relationship to Spectral Decomposition
Other Useful Facts
Principal Components Regression
The Idea behind PLS Algorithms
NIPALS
The NIPALS Algorithm
Computational Results
Properties of the NIPALS Algorithm
SIMPLS
Optimization Criterion
Implications for the Algorithm
The SIMPLS Algorithm
More on VIPs
The Standardize X Option
Determining the Number of Factors
Cross Validation: How JMP Does It
Appendix 2: Simulation Studies
Introduction
The Bias-Variance Tradeoff in PLS
Introduction
Two Simple Examples
Motivation
The Simulation Study
Results and Discussion
Conclusion
Using PLS for Variable Selection
Introduction
Structure of the Study
The Simulation
Computation of Result Measures
Results
Conclusion
References
Index
Preface
A Word to the Practitioner
Welcome to Discovering Partial Least Squares with JMP. This book introduces you to the exciting area of partial least squares. Partial least squares is a multivariate modeling technique based on the idea of projection, which inspired the book’s cover design. You will obtain background understanding and see the technique applied in a number of examples. The book is built around the intuitive and powerful JMP statistical software, which will help you understand and internalize this new topic in a way that reading alone cannot.
Since our goal is to help you apply partial least squares in your own setting, the textual material exists only to build your understanding and confidence as you progress through the worked examples. Although we endeavor to provide the salient details, the area of partial least squares is very broad and this book is necessarily incomplete. To the extent that we cannot cover certain topics fully, we provide references for your further study.
The Organization of the Book
We open with a number of introductory chapters that describe the concepts behind partial least squares and help position it in the wider world of statistical methodology and application. The meat of the book is found in Chapters 5 through 8, which contain four examples. Working through these examples using JMP prepares you to apply partial least squares to your own data. The book also contains two appendixes that provide further statistical details and the results of some simulation studies. Depending on your level and area of interest, you might find these useful.
Required Software
Although a user of standard JMP 11 or later will find this book useful, many examples require JMP Pro 11 or later. Compared to the standard version of JMP, the Pro version is intended for those who require deeper analytical capabilities. In JMP Pro, the implementation of partial least squares is quite complete.
The book uses JMP Pro 11.0 in screenshots, instructions, and discussions. Even though JMP’s PLS capabilities will continue to be developed, the major features and design shown here will persist. However, in future versions, you may notice very slight differences from the specific instruction sequences and screenshots presented in this book.
Ideally, you will have JMP Pro 11 available as you work through this book. A fully functional version of JMP Pro 11 that runs for 30 days can be requested at http://www.jmp.com/webforms/jmp_pro_eval.shtml.
The standard version of JMP enables you to run some partial least squares analyses through a simplified interface. Using this version you will be able to work through some, but not all, of the examples, and many of the scripts linked to in the book will not function correctly. But the book should still help your understanding of partial least squares, and help you decide if you need the Pro version of JMP.
Accessing the Supplementary Content
The data tables and scripts associated with the book can be accessed at either http://support.sas.com/cox or http://support.sas.com/gaudard, either of which provides a single ZIP file. Once you have downloaded the file, you can unzip the contents to a convenient location on your hard disk. This process creates a master JMP journal file, Discovering Partial Least Squares with JMP.jrn, along with a folder for each chapter containing scripts. Data tables are created by running these scripts using the links in the master journal. The master journal file provides a convenient way to access all of the supplementary content, and the instructions in the text assume that you will do this.
The data tables themselves contain saved scripts that are referred to in the chapters. Often, when working through an example, we show the steps that you can follow to generate a report in JMP. In addition, either parenthetically or directly, we give the name of a script that has been saved to the data table and that generates that same analysis. This way, if you want to see the report without stepping through the selections to create it, you can simply run that script.
The scripts are used to illustrate concepts and to help you develop understanding. Because many of the scripts have an element of randomness built in, it is usually worth running the same script more than once to see the effect over various random choices. Also, be aware that the scripts have been encrypted. If you open one of these scripts directly rather than via the journal file mentioned earlier, you see what appears to be gibberish. Nevertheless, you can right-click within the script window and select Run Script.
Chapter 1 Introducing Partial Least Squares
Modeling in General
Partial Least Squares in Today’s World
Transforming, and Centering and Scaling Data
An Example of a PLS Analysis
The Data and the Goal
The Analysis
Testing the Model
Modeling in General
Applied statistics can be thought of as a body of knowledge, or even a technology, that supports learning about the real world in the face of uncertainty. The theme of learning is ubiquitous in more or less every context that can be imagined, and along with this comes the idea of a (statistical) model that tries to codify or encapsulate our current understanding.
Many statistical models can be thought of as relating one or more inputs (which we call collectively X) to one or more outputs (collectively Y). These quantities are measured on the items or units of interest, and models are constructed from these observations. Such observations yield quantitative data that can be expressed numerically or coded in numerical form.
By the standards of fundamental physics, chemistry, and biology, at least, statistical models are generally useful when current knowledge is moderately low and the underlying mechanisms that link the values in X and Y are obscure. So although one of the perennial challenges of any modeling activity is to take proper account of whatever is already known, the fact remains that statistical models are generally empirical in nature. This is not in any sense a failing, since there are many situations in research, engineering, the natural sciences, the physical sciences, life science, behavioral science, and other areas in which such empirical knowledge has practical utility or opens new, useful lines of inquiry.
However, along with this diversity of contexts comes a diversity of data. No matter what its intrinsic beauty, a useful model must be flexible enough to adequately support the more specific objectives of prediction from or explanation of the data presented to it. As we shall see, one of the appealing aspects of partial least squares as a modeling approach is that, unlike some more traditional approaches that might be familiar to you, it is able to encompass much of this diversity within a single framework.
A final comment on modeling in general—all data is contextual. Only you can determine the plausibility and relevance of the data that you have, and you overlook this simple fact at your peril. Although statistical modeling can be invaluable, just looking at the data in the right way can and should illuminate and guide the specifics of building empirical statistical models of any kind (Chatfield 1995).
Partial Least Squares in Today’s World
Increasingly, we are finding data everywhere. This data explosion, supported by innovative and convergent technologies, has arguably made data exploration (e-Science) a fourth learning paradigm, joining theory, experimentation, and simulation as a way to drive new understanding (Microsoft Research 2009).
In simple retail businesses, sellers and buyers are wrestling for more leverage over the selling/buying process, and are attempting to make better use of data in this struggle. Laboratories, production lines, and even cars are increasingly equipped with relatively low-cost instrumentation routinely producing data of a volume and complexity that was difficult to foresee even thirty years ago. This book shows you how partial least squares, with its appealing flexibility, fits into this exciting picture.
This abundance of data, supported by the widespread use of automated test equipment, results in data sets with a large number of columns, or variables, v and/or a large number of observations, or rows, n. Often, but not always, it is cheap to increase v and expensive to increase n.
When the interpretation of the data permits a natural separation of variables into predictors and responses, partial least squares, or PLS for short, is a flexible approach to building statistical models for prediction. PLS can deal effectively with the following:
• Wide data (when v >> n, and v is large or very large)
• Tall data (when n >> v, and n is large or very large)
• Square data (when n ~ v, and n is large or very large)
• Collinear variables, namely, variables that convey the same, or nearly the same, information
• Noisy data
Just to whet your appetite, we point out that PLS routinely finds application in the following disciplines as a way of taming multivariate data:
• Psychology
• Education
• Economics
• Political science
• Environmental science
• Marketing
• Engineering
• Chemistry (organic, analytical, medical, and computational)
• Bioinformatics
• Ecology
• Biology
• Manufacturing
Transforming, and Centering and Scaling Data
Data should always be screened for outliers and anomalies prior to any formal analysis, and PLS is no exception. In fact, PLS works best when the variables involved have somewhat symmetric distributions. For that reason, for example, highly skewed variables are often logarithmically transformed prior to any analysis.
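To make the effect of a logarithmic transformation concrete, here is a minimal sketch in Python (standard library only). The measurements are made up for illustration and are not from the book, and the `skewness` helper is a hypothetical name that computes a simple moment-based skewness.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical right-skewed measurements (made up; not from the book).
raw = [1.0, 1.2, 1.5, 2.0, 2.4, 3.1, 4.0, 6.5, 12.0, 40.0]

# Logarithmic transformation; the base does not affect the skewness.
logged = [math.log(v) for v in raw]

print(f"skewness before transform: {skewness(raw):.2f}")
print(f"skewness after transform:  {skewness(logged):.2f}")
```

For data like these, with one long right tail, the transformed values are noticeably closer to symmetric than the raw values.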
Also, the data are usually centered and scaled prior to conducting the PLS analysis. By centering, we mean that, for each variable, the mean of all its observations is subtracted from each observation. By scaling, we mean that each observation is divided by the variable’s standard deviation. Centering and scaling each variable results in a working data table where each variable has mean 0 and standard deviation 1.
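The two operations just described can be sketched in a few lines of Python. This is an illustration only, not JMP's implementation: the column names and values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics

def center_and_scale(column):
    """Subtract the column mean, then divide by the column standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # assumption: sample SD (n - 1 denominator)
    return [(value - mean) / sd for value in column]

# Hypothetical two-column table on very different scales (values are made up).
table = {
    "x1": [12.0, 15.0, 11.0, 18.0, 14.0],
    "x2": [0.002, 0.004, 0.003, 0.007, 0.004],
}

scaled = {name: center_and_scale(col) for name, col in table.items()}

# Each working column should now have mean 0 and standard deviation 1
# (up to floating-point rounding).
for name, col in scaled.items():
    print(name, round(statistics.fmean(col), 6), round(statistics.stdev(col), 6))
```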
Centering and scaling are important because the weights that form the basis for the PLS model are very sensitive to the measurement units of the variables. Without centering and scaling, variables with higher variance have more influence on the model. The process of centering and scaling puts all variables on an equal footing. If certain variables in X are indeed more important than others, and you want them to have higher influence, you can accomplish this by assigning them a higher scaling weight (Eriksson et al. 2006). As you will see, JMP makes centering and scaling easy.
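The sensitivity to measurement units can be illustrated with a small Python sketch. It relies on an assumption that goes beyond this passage: for a single response, the first PLS weight vector is taken to be proportional to X'y computed on the working data, as in the usual NIPALS formulation. The data and the helper names (`center`, `standardize`, `first_weights`) are hypothetical.

```python
import math
import statistics

def center(col):
    mean = statistics.fmean(col)
    return [v - mean for v in col]

def standardize(col):
    sd = statistics.stdev(col)
    return [v / sd for v in center(col)]

def first_weights(columns, y):
    """Unit-length vector proportional to X'y (assumed form of the first PLS weights)."""
    raw = [sum(x_i * y_i for x_i, y_i in zip(col, y)) for col in columns]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

# Made-up data: y is already centered; x_b is recorded in grams or in milligrams.
y = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
x_a = [1.0, 2.0, 1.5, 3.0, 3.5, 4.0]
x_b_grams = [0.0020, 0.0015, 0.0030, 0.0040, 0.0035, 0.0050]
x_b_mg = [1000.0 * v for v in x_b_grams]

# Centering only: the weights change when the units of x_b change.
w_centered_g = first_weights([center(x_a), center(x_b_grams)], y)
w_centered_mg = first_weights([center(x_a), center(x_b_mg)], y)

# Centering and scaling: the weights no longer depend on the units.
w_scaled_g = first_weights([standardize(x_a), standardize(x_b_grams)], y)
w_scaled_mg = first_weights([standardize(x_a), standardize(x_b_mg)], y)

print("centered only, grams:", w_centered_g)
print("centered only, mg:   ", w_centered_mg)
print("scaled, grams:       ", w_scaled_g)
print("scaled, mg:          ", w_scaled_mg)
```

With centering alone, recording x_b in grams gives it almost no weight, while recording it in milligrams gives it about the same weight as x_a. After scaling, the two sets of weights agree.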
Later we discuss how PLS relates to other modeling and multivariate methods. But for now, let’s dive into an example so that we can compare and contrast PLS with the more familiar multiple linear regression (MLR).
An Example of a PLS Analysis
The Data and the Goal
The data table Spearheads.jmp contains data relating to the chemical composition of spearheads known to originate from one of two African tribes (Figure 1.1). You can open this table by clicking on the correct link in the master journal. A total of 19 spearheads of known origin were studied. The Tribe of origin is recorded in the first column (Tribe A or Tribe B). Chemical measurements of 10 properties were made. These are given in the subsequent columns and are represented in the Columns panel in a column group called Xs. There is a final column called Set, indicating whether an observation will be used in building our model (Training) or in assessing that model (Test).
Figure 1.1: The Spearheads.jmp Data Table
Our goal is to build a model that uses the chemical measurements to help us decide whether other spearheads collected in the vicinity were made by Tribe A or Tribe B. Note that there are 10 columns in X (the chemical compositions) and only one column in Y (the attribution of the tribe).
The model will be built using the training set, rows 1–9. The test set, rows 10–19, enables us to assess the ability of the model to predict the tribe of origin for newly discovered spearheads. The column Tribe actually contains the numerical values +1 and –1, with –1 representing Tribe A and +1 representing Tribe B. The Tribe column displays Value Labels for these numerical values. It is the numerical values that the model actually predicts from the chemical measurements.
The table Spearheads.jmp also contains four scripts that help us perform the PLS analysis quickly. In the later chapters containing examples, we walk through the menu options that enable you to conduct such an analysis. But, for now, the scripts expedite the analysis, permitting us to focus on the concepts underlying a PLS analysis.
The Analysis
The first script, Fit Model Launch Window, located in the upper left of the data table as shown in Figure 1.2, enables us to set up the analysis we want. From the red-triangle menu, shown in Figure 1.2, select Run Script. This script runs only if you are using JMP Pro, since it uses the Fit Model partial least squares personality. If you are using the standard version of JMP, you can select Analyze > Multivariate Methods > Partial Least Squares from the JMP menu bar. You will be able to follow the text, but with minor modifications.
Figure 1.2: Running the Script Fit Model Launch Window
This script produces a populated Fit Model launch window (Figure 1.3). The column Tribe is entered as a response, Y, while the 10 columns representing metal composition measurements are entered as Model Effects. Note that the Personality is set to Partial Least Squares. In JMP Pro, you can access this launch window directly by selecting Analyze > Fit Model from the JMP menu bar.
Below the Personality drop-down menu, shown in Figure