Idams: Internationally Developed Data Analysis and Management Software Package
Internationally Developed
Data Analysis and Management
Software Package
WinIDAMS Reference Manual
(release 1.3)
April 2008
Copyright © 2001-2008 by UNESCO
Published by
the United Nations Educational, Scientific and Cultural Organization
Place de Fontenoy, 75700 Paris, France
© UNESCO ninth edition 2008
First published 1988
Revised 1990, 1992, 1993, 1996, 2001, 2003, 2004
Printed in France
UNESCO ISBN 92-3-102577-5 WinIDAMS Reference Manual
Preface
Objectives of IDAMS
The idea behind IDAMS is to provide UNESCO Member States, free of charge, with a reasonably
comprehensive data management and statistical analysis software package. IDAMS, used in combination with
CDS/ISIS (the UNESCO software for database management and information retrieval), will equip them
with integrated software for processing, in a unified way, both textual and numerical data
gathered for scientific and administrative purposes by universities, research institutes, national
administrations, etc. The ultimate objective is to help UNESCO Member States rationalize
the management of their various sectors of activity, a target which is crucial both for establishing sound
development plans and for monitoring their execution.
Origin and a Short History of IDAMS
IDAMS was originally derived from the software package OSIRIS III.2, developed in the early seventies at the
Institute for Social Research of the University of Michigan, U.S.A. It has been, and continues to be, enriched,
modified and updated by the UNESCO Secretariat with the co-operation of experts from different countries,
namely American, Belgian, British, Colombian, French, Hungarian, Polish, Russian, Slovak and Ukrainian
specialists; hence the name of the software: Internationally Developed Data Analysis and Management
Software Package.
In the beginning there was IDAMS for IBM mainframe computers
The first release (1.2) was issued in 1988; it already contained almost all of the data management and most of
the data analysis facilities. Although basic routines and a number of programs were taken from OSIRIS III.2,
they were substantially modified, and new programs were added providing tools for partial order scoring,
factor analysis, rank-ordering of alternatives and typology with ascending classification. Features for handling
code labels and for documenting program execution were incorporated. The software was accompanied by
the User Manual, Sample Printouts and Quick Reference Card.
Release 2.0 was issued in 1990; in addition to regrouping of: (1) programs for calculating Pearsonian
correlations; and (2) programs for rank-ordering of alternatives, it contained technical improvements in a
number of programs.
Release 3.0 was issued in 1992; it contained significant improvements such as: harmonization of parameters,
keywords and syntax of control statements, the possibility of checking the syntax of control statements without
execution, the possibility of executing a program on a limited number of cases, harmonization of error messages,
the possibility of aggregating and listing Recoded variables, alphabetic recoding and six new arithmetic functions
in the Recode facility. Two new programs were added: (1) for checking data consistency; and (2) for discriminant
analysis. An Annex with statistical formulas was added to the User Manual.
Note: In 1993, after preparation of release 3.02 for both OS and VM/CMS operating systems, the develop-
ment of the mainframe version was terminated.
In parallel, there was IDAMS for micro computers under MS-DOS
Development of the micro computer version started in 1988 and was pursued in parallel with the development
of the mainframe version until release 3.
The first release (1.0) was issued in 1989, with the same features and programs as the mainframe version.
Release 2.0 was issued in 1990; it was also fully compatible with the mainframe version. Moreover, the
User Interface provided facilities for dictionary preparation, data entry, preparation and execution of setup
files and printing of results.
Release 3.0 was issued in 1992 together with the mainframe version. However, the User Interface was made
much more user friendly, providing new dictionary and data editors, direct access to prototype setups for
all programs, as well as a module for interactive graphical exploration of data.
The two intermediate releases 3.02 and 3.04, issued in 1993 and 1994 respectively, included mainly inter-
nal technical improvements and debugging of a number of programs. Release 3.02 was the last one fully
compatible with the mainframe version.
Micro IDAMS started its independent existence in 1993. The software underwent full and systematic testing,
especially in the area of handling user errors, and it was fully debugged.
Release 4 (the last release for DOS), issued in 1996, includes an improved user-friendly interface, the possibility of
environment customization, an on-line User Manual, a simplified control language, new graphic presentation
modalities and the capability of producing national language versions. Two new programs were added, providing
cluster analysis and searching for structure techniques. The User Manual has been restructured in order to
present topics in an easy-to-follow but concise way. It was available in English first.
Since 1998, release 4 has been gradually developed in French, Spanish, Arabic and Russian.
2000: first version of IDAMS for Windows and further development
Release 1.0 of IDAMS for the 32-bit Windows graphical operating system was made available for testing in
2000, and its distribution started in 2001. It offers a modern user interface with a host of new features
to improve ease of use, and on-line access to the Reference Manual using standard Windows Help. New
interactive components for data analysis provide tools for construction of multidimensional tables, graphical
exploration of data and time series analysis.
Release 1.1 was issued in September 2002 with the following improvements: (1) externalization of text,
which makes it possible to have the IDAMS software in languages other than English; (2) harmonization of
text in the results. It was the first release of the Windows version to appear in English, French and
Spanish.
Release 1.2 was issued in July 2004 in English, French and Spanish with new functions in three
programs, in the User Interface and in the interactive modules for graphical exploration of data and for time
series analysis. It was issued in April 2006 in Portuguese.
Release 1.3 is also issued in English, French, Portuguese and Spanish, and contains a new program
for multivariate analysis of variance (MANOVA), calculation of the coefficient of variation in four programs,
improved handling of Recoded variables with decimals in SCAT and TABLES, and full harmonization of
data record length.
Acknowledgements
First of all, thanks should go to Prof. Frank-M. Andrews († 1994) from the Institute for Social Research,
University of Michigan, USA, as well as to the Institute who authorized UNESCO to take the OSIRIS III.2
source code and use it as a starting point in developing the IDAMS software package. Major improvements
and additions have taken place since then. In this respect, particular gratitude should go to: Dr Jean-Paul
Aimetti, Administrator of the D.H.E. Conseil, Paris and Professor at Conservatoire National des Arts et
Métiers (CNAM), Paris (France); Prof. J.-P. Benzécri and E.-R. Iagolnitzer, U.E.R. de Mathématiques,
Université de Paris V (France); Eng. Tibor Diamant and Dr Zoltan Vas, József Attila University, Szeged
(Hungary); Prof. Anne-Marie Dussaix, École Supérieure des Sciences Économiques et Commerciales
(ESSEC), Cergy-Pontoise (France); Dr Igor S. Enyukov and Eng. Nicola D. Vylegjanin, StatPoint, Moscow
(Russian Federation); Dr Peter Hunya, who has been the Director of the Kalmar Laboratory of Cybernetics,
József Attila University, Szeged (Hungary), and IDAMS Programme Manager at UNESCO between July
1993 and February 2001; Jean Massol, EOLE, Paris (France); Prof. Anne Morin, Institut de Recherche
en Informatique et Systèmes Aléatoires (IRISA), Rennes (France); Judith Rattenbury, ex-Director, Data
Processing Division, World Fertility Survey, London, and presently founder and head of SJ MUSIC
publishing house, Cambridge (United Kingdom); J.M. Romeder and Association pour le Développement et la
Diffusion de l'Analyse des Données (ADDAD), Paris (France); Prof. Peter J. Rousseeuw, Universitaire
Instelling Antwerpen (Belgium); Dr A.V. Skofenko, Academy of Sciences, Kiev (Ukraine); Eng. Neal Van
Eck, Susquehanna University, Selinsgrove (USA); Nicole Visart who has launched the IDAMS Programme
at UNESCO and who, in addition to her technical contributions at all stages, assured the coordination and
monitoring of the whole project until her retirement in 1992.
It is impossible to give due credit to all the many people, besides those already mentioned above, who have
contributed ideas and effort to IDAMS and to OSIRIS III.2 from which it was derived. Up to now IDAMS has
been developed mainly at UNESCO. The following is a list of the main programs, components and facilities
included in WinIDAMS, with the names of authors and programmers, and the names of institutions where
the work was done.
User Interface and Basic Facilities
Recode facility Ellen Grun ISR
Peter Solenberger ISR
User Interface Jean-Claude Dauphin UNESCO
On-line access to Pawel Hoser Polish Academy of Sciences
the Reference Manual Jean-Claude Dauphin UNESCO
Data Management Facilities
AGGREG Tina Bixby ISR
Jean-Claude Dauphin UNESCO
BUILD Carl Bixby ISR
Sylvia Barge ISR
Tibor Diamant UNESCO
CHECK Tina Bixby ISR
Jean-Claude Dauphin UNESCO
CONCHECK Neal Van Eck Van Eck Computing Consulting
CORRECT Tibor Diamant UNESCO
IMPEX Peter Hunya UNESCO
LIST Marianne Stover ISR
Sylvia Barge ISR
Jean-Claude Dauphin UNESCO
MERCHECK Karen Jensen ISR
Sylvia Barge ISR
Zoltan Vas JATE
MERGE Tina Bixby ISR
Nancy Barkman ISR
Jean-Claude Dauphin UNESCO
SORMER Carol Cassidy ISR
Jean-Claude Dauphin UNESCO
SUBSET Judy Mattson ISR
Judith Rattenbury ISR
Jean-Claude Dauphin UNESCO
TRANS Jean-Claude Dauphin UNESCO
Data Analysis Facilities
CLUSFIND Leonard Kaufman Vrije Universiteit Brussel
Peter J. Rousseeuw Vrije Universiteit Brussel
Neal Van Eck Van Eck Computing Consulting
Tibor Diamant UNESCO
CONFIG Herbert Weisberg ISR
DISCRAN J.-M. Romeder ADDAD
and ADDAD
Peter Hunya UNESCO
Tibor Diamant UNESCO
FACTOR J.P. Benzécri, Université de Paris V
E.R. Iagolnitzer Université de Paris V
Peter Hunya JATE
MANOVA Charles E. Hall George Washington University
Elliot M. Cramer George Washington University
Neal Van Eck ISR
Tibor Diamant UNESCO
MCA Edwin Dean ISR
John Sonquist ISR
Tibor Diamant UNESCO
MDSCAL Joseph Kruskal Bell Telephone
Frank Carmone Bell Telephone
Lutz Erbring ISR
ONEWAY Spyros Magliveras ISR
Tibor Diamant UNESCO
PEARSON John Sonquist ISR
Spyros Magliveras ISR
Neal Van Eck ISR
Ronald Nuttal Boston College
Tibor Diamant UNESCO
POSCOR Peter Hunya JATE
QUANTILE Robert Messenger ISR
Tibor Diamant UNESCO
RANK Anne-Marie Dussaix ESSEC
Albert David ESSEC
Peter Hunya JATE
A.V. Skofenko Ukrainian Academy of Sciences
REGRESSN M.A. Efroymson ESSO Corporation
Bob Hsieh ESSO Corporation
Neal Van Eck ISR
Peter Solenberger ISR
SCAT Judith Goldberg ISR
SEARCH John Sonquist ISR
Elizabeth Lauch Baker ISR
James N. Morgan ISR
Neal Van Eck Van Eck Computing Consulting
Tibor Diamant UNESCO
TABLES Neal Van Eck ISR and Van Eck Computing Consulting
Tibor Diamant UNESCO
TYPOL Jean-Paul Aimetti CFRO
Jean Massol CFRO
Peter Hunya JATE
Jean-Claude Dauphin UNESCO
Multidimensional Tables Jean-Claude Dauphin UNESCO
GraphID Igor S. Enyukov StatPoint
Nicola D. Vylegjanin StatPoint
TimeSID Igor S. Enyukov StatPoint
As for the documentation, recognition should be expressed to all the people who contributed to its
preparation, in particular to: Judith Rattenbury, who drafted the first original English version of the Manual
(1988) and who kept revising further editions till 1998; Jean-Paule Griset (UNESCO, Paris), who designed
together with Nicole Visart the typography of the Manual used until 1998; Teresa Krukowska (IDAMS
Group, UNESCO, Paris), who compiled the part with statistical formulas, changed the Manual's typography
in 1998, has been updating the original English version since 1999, is responsible for production of the
Manual in English, French, Portuguese and Spanish, and takes care of harmonizing, as much as possible,
the texts in these four languages.
Acknowledgement to the authors of OSIRIS documents from which material was taken for WinIDAMS
Reference Manual must be made as follows: the OSIRIS III.2 User Manual Vol.1 (edited by Sylvia Barge
and Gregory A. Marks) and Vol.5 (compiled by Laura Klem), Institute for Social Research, University of
Michigan, USA.
Thanks should also go to translators of the software and documentation into French, Portuguese and Spanish
for their co-operation:
Professor José Raimundo Carvalho, CAEN Pós-graduação em Economia, UFC, Fortaleza, Brazil, for
the translation of the Manual and texts as part of the software into Portuguese.
Professor Bernardo Liévano, Escuela Colombiana de Ingeniería (ECI), Bogotá, Colombia, for the
translation of the Manual and texts as part of the software into Spanish.
Professor Anne Morin, Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Rennes,
France, for contribution to the translation into French of texts as part of the software.
Nicole Visart, Grez-Doiceau, Belgium, for the translation of the Manual into French.
The following institutions have undertaken translation of the software and the Manual into Arabic and
Russian: ALECSO - Department of Documentation and Information, Tunis, Tunisia, and Russian State
Hydrometeorological University, Department of Telecommunications, St. Petersburg, Russian Federation.
Requests for WinIDAMS and Further Information
For further information on WinIDAMS regarding content, updating, training and distribution, please write
to:
UNESCO
Communication and Information Sector
Information Society Division
CI/INF - IDAMS
1, rue Miollis
75732 PARIS CEDEX 15
France
e-mail: idams@unesco.org
http://www.unesco.org/idams
Contents
1 Introduction
1.1 WinIDAMS User Interface
1.2 Data Management Facilities
1.3 Data Analysis Facilities
1.4 Data in IDAMS
1.5 IDAMS Commands and the Setup File
1.6 Standard IDAMS Features
1.7 Import and Export of Data
1.8 Exchange of Data Between CDS/ISIS and IDAMS
1.9 Structure of this Manual
I Fundamentals
2 Data in IDAMS
2.1 The IDAMS Dataset
2.1.1 General Description
2.1.2 Method of Storage and Access
2.2 Data Files
2.2.1 The Data Array
2.2.2 Characteristics of the Data File
2.2.3 Hierarchical Files
2.2.4 Variables
2.2.5 Missing Data Codes
2.2.6 Non-numeric or Blank Values in Numeric Variables - Bad Data
2.2.7 Editing Rules for Variables Output by IDAMS Programs
2.3 The IDAMS Dictionary
2.3.1 General Description
2.3.2 Example of a Dictionary
2.4 IDAMS Matrices
2.4.1 The IDAMS Square Matrix
2.4.2 The IDAMS Rectangular Matrix
2.5 Use of Data from Other Packages
2.5.1 Raw Data
2.5.2 Matrices
3 The IDAMS Setup File
3.1 Contents and Purpose
3.2 IDAMS Commands
3.3 File Specifications
3.4 Examples of Use of $ Commands and File Specifications
3.5 Program Control Statements
3.5.1 General Description
3.5.2 General Coding Rules
3.5.3 Filters
3.5.4 Labels
3.5.5 Parameters
3.6 Recode Statements
4 Recode Facility
4.1 Rules for Coding
4.2 Sample Set of Recode Statements
4.3 Missing Data Handling
4.4 How Recode Functions
4.5 Basic Operands
4.6 Basic Operators
4.7 Expressions
4.8 Arithmetic Functions
4.9 Logical Functions
4.10 Assignment Statements
4.11 Special Assignment Statements
4.12 Control Statements
4.13 Conditional Statements
4.14 Initialization/Definition Statements
4.15 Examples of Use of Recode Statements
4.16 Restrictions
4.17 Note
5 Data Management and Analysis
5.1 Data Validation with IDAMS
5.1.1 Overview
5.1.2 Checking Data Completeness
5.1.3 Checking for Non-numeric and Invalid Variable Values
5.1.4 Consistency Checking
5.2 Data Management/Transformation
5.3 Data Analysis
5.4 Example of a Small Task to be Performed with IDAMS
II Working with WinIDAMS
6 Installation
6.1 System Requirements
6.2 Installation Procedure
6.3 Testing the Installation
6.4 Folders and Files Created During Installation
6.4.1 WinIDAMS Folders
6.4.2 Files Installed
6.5 Uninstallation
7 Getting Started
7.1 Overview of Steps to be Performed with WinIDAMS
7.2 Create an Application Environment
7.3 Prepare the Dictionary
7.4 Enter Data
7.5 Prepare the Setup
7.6 Execute the Setup
7.7 Review Results and Modify the Setup
7.8 Print the Results
8 Files and Folders
8.1 Files in WinIDAMS
8.2 Folders in WinIDAMS
9 User Interface
9.1 General Concept
9.2 Menus Common to All WinIDAMS Windows
9.3 Customization of the Environment for an Application
9.4 Creating/Updating/Displaying Dictionary Files
9.5 Creating/Updating/Displaying Data Files
9.6 Importing Data Files
9.7 Exporting IDAMS Data Files
9.8 Creating/Updating/Displaying Setup Files
9.9 Executing IDAMS Setups
9.10 Handling Results Files
9.11 Creating/Updating Text and RTF Format Files
III Data Management Facilities
10 Aggregating Data (AGGREG)
10.1 General Description
10.2 Standard IDAMS Features
10.3 Results
10.4 Output Dataset
10.5 Input Dataset
10.6 Setup Structure
10.7 Program Control Statements
10.8 Restrictions
10.9 Example
11 Building an IDAMS Dataset (BUILD)
11.1 General Description
11.2 Standard IDAMS Features
11.3 Results
11.4 Output Dataset
11.5 Input Dictionary
11.6 Input Data
11.7 Setup Structure
11.8 Program Control Statements
11.9 Examples
12 Checking of Codes (CHECK)
12.1 General Description
12.2 Standard IDAMS Features
12.3 Results
12.4 Input Dataset
12.5 Setup Structure
12.6 Program Control Statements
12.7 Restrictions
12.8 Examples
13 Checking of Consistency (CONCHECK)
13.1 General Description
13.2 Standard IDAMS Features
13.3 Results
13.4 Input Dataset
13.5 Setup Structure
13.6 Program Control Statements
13.7 Restrictions
13.8 Examples
14 Checking the Merging of Records (MERCHECK)
14.1 General Description
14.2 Standard IDAMS Features
14.3 Results
14.4 Output Data
14.5 Input Data
14.6 Setup Structure
14.7 Program Control Statements
14.8 Restrictions
14.9 Examples
15 Correcting Data (CORRECT)
15.1 General Description
15.2 Standard IDAMS Features
15.3 Results
15.4 Output Dataset
15.5 Input Dataset
15.6 Setup Structure
15.7 Program Control Statements
15.8 Restriction
15.9 Example
16 Importing/Exporting Data (IMPEX)
16.1 General Description
16.2 Standard IDAMS Features
16.3 Results
16.4 Output Files
16.5 Input Files
16.6 Setup Structure
16.7 Program Control Statements
16.8 Restrictions
16.9 Examples
17 Listing Datasets (LIST)
17.1 General Description
17.2 Standard IDAMS Features
17.3 Results
17.4 Input Dataset
17.5 Setup Structure
17.6 Program Control Statements
17.7 Restriction
17.8 Examples
18 Merging Datasets (MERGE)
18.1 General Description
18.2 Standard IDAMS Features
18.3 Results
18.4 Output Dataset
18.5 Input Datasets
18.6 Setup Structure
18.7 Program Control Statements
18.8 Restrictions
18.9 Examples
19 Sorting and Merging Files (SORMER)
19.1 General Description
19.2 Standard IDAMS Features
19.3 Results
19.4 Output Dictionary
19.5 Output Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
19.6 Input Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
19.7 Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
19.8 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
19.9 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
19.10 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
19.11 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
20 Subsetting Datasets (SUBSET) 159
20.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
20.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
20.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
20.4 Output Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
20.5 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
20.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
20.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
20.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
20.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
21 Transforming Data (TRANS) 163
21.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
21.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
21.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
21.4 Output Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
21.5 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
21.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
21.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
21.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
21.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
IV Data Analysis Facilities 169
22 Cluster Analysis (CLUSFIND) 171
22.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
22.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
22.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
22.4 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
22.5 Input Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
22.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
22.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
22.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
22.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
23 Configuration Analysis (CONFIG) 177
23.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
23.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
23.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
23.4 Output Configuration Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
23.5 Output Distance Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
23.6 Input Configuration Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
23.7 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
23.8 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
23.9 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
23.10 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
24 Discriminant Analysis (DISCRAN) 183
24.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
24.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
24.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
24.4 Output Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
24.5 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
24.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
24.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
24.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
24.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
25 Distribution and Lorenz Functions (QUANTILE) 189
25.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
25.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
25.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
25.4 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
25.5 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
25.6 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
25.7 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
25.8 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
26 Factor Analysis (FACTOR) 193
26.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
26.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
26.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
26.4 Output Dataset(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
26.5 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
26.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
26.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
26.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
26.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
27 Linear Regression (REGRESSN) 201
27.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
27.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
27.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
27.4 Output Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
27.5 Output Residuals Dataset(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
27.6 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
27.7 Input Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
27.8 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
27.9 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
27.10 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
27.11 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
28 Multidimensional Scaling (MDSCAL) 211
28.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
28.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
28.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
28.4 Output Configuration Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
28.5 Input Data Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
28.6 Input Weight Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
28.7 Input Configuration Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
28.8 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
28.9 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
28.10 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
28.11 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
29 Multiple Classification Analysis (MCA) 217
29.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
29.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
29.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
29.4 Output Residuals Dataset(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
29.5 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
29.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
29.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
29.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
29.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
30 Multivariate Analysis of Variance (MANOVA) 225
30.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
30.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
30.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
30.4 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
30.5 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
30.6 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
30.7 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
30.8 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
31 One-Way Analysis of Variance (ONEWAY) 231
31.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
31.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
31.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
31.4 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
31.5 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
31.6 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
31.7 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
31.8 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
32 Partial Order Scoring (POSCOR) 235
32.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
32.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
32.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
32.4 Output Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
32.5 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
32.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
32.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
32.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
32.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
33 Pearsonian Correlation (PEARSON) 243
33.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
33.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
33.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
33.4 Output Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
33.5 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
33.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
33.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
33.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
33.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
34 Rank-Ordering of Alternatives (RANK) 249
34.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
34.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
34.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
34.4 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
34.5 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
34.6 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
34.7 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
34.8 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
35 Scatter Diagrams (SCAT) 257
35.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
35.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
35.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
35.4 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
35.5 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
35.6 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
35.7 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
35.8 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
36 Searching for Structure (SEARCH) 261
36.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
36.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
36.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
36.4 Output Residuals Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
36.5 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
36.6 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
36.7 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
36.8 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
36.9 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
37 Univariate and Bivariate Tables (TABLES) 269
37.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
37.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
37.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
37.4 Output Univariate/Bivariate Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
37.5 Output Bivariate Statistics Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
37.6 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
37.7 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
37.8 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
37.9 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
37.10 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
38 Typology and Ascending Classification (TYPOL) 281
38.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
38.2 Standard IDAMS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
38.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
38.4 Output Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
38.5 Output Configuration Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
38.6 Input Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
38.7 Input Configuration Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
38.8 Setup Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
38.9 Program Control Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
38.10 Restrictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
38.11 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
V Interactive Data Analysis 289
39 Multidimensional Tables and their Graphical Presentation 291
39.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
39.2 Preparation of Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
39.3 Multidimensional Tables Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
39.4 Graphical Presentation of Univariate/Bivariate Tables . . . . . . . . . . . . . . . . . . . . . . 294
39.5 How to Make a Multidimensional Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
39.6 How to Change a Multidimensional Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
40 Graphical Exploration of Data (GraphID) 301
40.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
40.2 Preparation of Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
40.3 GraphID Main Window for Analysis of a Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 301
40.3.1 Menu bar and Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
40.3.2 Manipulation of the Matrix of Scatter Plots . . . . . . . . . . . . . . . . . . . . . . . . 304
40.3.3 Histograms and Densities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
40.3.4 Regression Lines (Smoothed lines) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
40.3.5 Box and Whisker Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
40.3.6 Grouped Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
40.3.7 Three-dimensional Scatter Diagrams and their Rotation . . . . . . . . . . . . . . . . . 308
40.4 GraphID Window for Analysis of a Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
40.4.1 Menu bar and Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
40.4.2 Manipulation of the Displayed Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
41 Time Series Analysis (TimeSID) 311
41.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
41.2 Preparation of Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
41.3 TimeSID Main Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
41.3.1 Menu bar and Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
41.3.2 The Time Series Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
41.4 Transformation of Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
41.5 Analysis of Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
VI Statistical Formulas and Bibliographical References 317
42 Cluster Analysis 319
42.1 Univariate Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
42.2 Standardized Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
42.3 Dissimilarity Matrix Computed From an IDAMS Dataset . . . . . . . . . . . . . . . . . . . . 320
42.4 Dissimilarity Matrix Computed From a Similarity Matrix . . . . . . . . . . . . . . . . . . . . 320
42.5 Dissimilarity Matrix Computed From a Correlation Matrix . . . . . . . . . . . . . . . . . . . 320
42.6 Partitioning Around Medoids (PAM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
42.7 Clustering LARge Applications (CLARA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
42.8 Fuzzy Analysis (FANNY) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
42.9 AGglomerative NESting (AGNES) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
42.10 DIvisive ANAlysis (DIANA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
42.11 MONothetic Analysis (MONA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
42.12 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
43 Configuration Analysis 327
43.1 Centered Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
43.2 Normalized Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
43.3 Solution with Principal Axes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
43.4 Matrix of Scalar Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
43.5 Matrix of Interpoint Distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
43.6 Rotated Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
43.7 Translated Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
43.8 Varimax Rotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
43.9 Sorted Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
43.10 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
44 Discriminant Analysis 331
44.1 Univariate Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
44.2 Linear Discrimination Between 2 Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
44.3 Linear Discrimination Between More Than 2 Groups . . . . . . . . . . . . . . . . . . . . . . . 333
44.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
45 Distribution and Lorenz Functions 335
45.1 Formula for Break Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
45.2 Distribution Function Break Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
45.3 Lorenz Function Break Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
45.4 Lorenz Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
45.5 The Gini Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
45.6 Kolmogorov-Smirnov D Statistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
45.7 Note on Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
46 Factor Analyses 339
46.1 Univariate Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
46.2 Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
46.3 Core Matrices (Matrices of Relations) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
46.4 Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
46.5 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
46.6 Table of Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
46.7 Table of Principal Variables Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
46.8 Table of Supplementary Variables Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
46.9 Table of Principal Cases Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
46.10 Table of Supplementary Cases Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
46.11 Rotated Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
46.12 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
47 Linear Regression
47.1 Univariate Statistics
47.2 Matrix of Total Sums of Squares and Cross-products
47.3 Matrix of Residual Sums of Squares and Cross-products
47.4 Total Correlation Matrix
47.5 Partial Correlation Matrix
47.6 Inverse Matrix
47.7 Analysis Summary Statistics
47.8 Analysis Statistics for Predictors
47.9 Residuals
47.10 Note on Stepwise Regression
47.11 Note on Descending Regression
47.12 Note on Regression with Zero Intercept
48 Multidimensional Scaling
48.1 Order of Computations
48.2 Initial Configuration
48.3 Centering and Normalization of the Configuration
48.4 History of Computation
48.5 Stress for Final Configuration
48.6 Final Configuration
48.7 Sorted Configuration
48.8 Summary
48.9 Note on Ties in the Input Data
48.10 Note on Weights
48.11 References
49 Multiple Classification Analysis
49.1 Dependent Variable Statistics
49.2 Predictor Statistics for Multiple Classification Analysis
49.3 Analysis Statistics for Multiple Classification Analysis
49.4 Summary Statistics of Residuals
49.5 Predictor Category Statistics for One-Way Analysis of Variance
49.6 One-Way Analysis of Variance Statistics
49.7 References
50 Multivariate Analysis of Variance
50.1 General Statistics
50.2 Calculations for One Test in a Multivariate Analysis
50.3 Univariate Analysis
50.4 Covariance Analysis
51 One-Way Analysis of Variance
51.1 Descriptive Statistics for Categories of the Control Variable
51.2 Analysis of Variance Statistics
52 Partial Order Scoring
52.1 Special Terminology and Definitions
52.2 Calculation of Scores
52.3 References
53 Pearsonian Correlation
53.1 Paired Statistics
53.2 Unpaired Means and Standard Deviations
53.3 Regression Equation for Raw Scores
53.4 Correlation Matrix
53.5 Cross-products Matrix
53.6 Covariance Matrix
54 Rank-ordering of Alternatives
54.1 Handling of Input Data
54.2 Method of Classical Logic Ranking
54.3 Methods of Fuzzy Logic Ranking: the Input Relation
54.4 Fuzzy Method-1: Non-dominated Layers
54.5 Fuzzy Method-2: Ranks
54.6 References
55 Scatter Diagrams
55.1 Univariate Statistics
55.2 Paired Univariate Statistics
55.3 Bivariate Statistics
56 Searching for Structure
56.1 Means analysis
56.2 Regression Analysis
56.3 Chi-square Analysis
56.4 References
57 Univariate and Bivariate Tables
57.1 Univariate Statistics
57.2 Bivariate Statistics
57.3 Note on Weights
58 Typology and Ascending Classification
58.1 Types of Variables Used
58.2 Case Profile
58.3 Group profile
58.4 Distances Used
58.5 Building of an Initial Typology
58.6 Characteristics of Distances by Groups
58.7 Summary Statistics for Quantitative Variables and for Qualitative Active Variables
58.8 Description of Resulting Typology
58.9 Summary of the Amount of Variance Explained by the Typology
58.10 Hierarchical Ascending Classification
58.11 References
Appendix: Error Messages From IDAMS Programs
Index
Chapter 1
Introduction
IDAMS is a software package for the validation, manipulation and statistical analysis of data. It is organized
as a collection of data management and analysis facilities accessible through a user interface and a common
control language. Examples of the types of data that can be processed with IDAMS are: the answers to
questions by respondents in a survey, information about books in a library, the personal characteristics and
performance of students at a college, measurements from a scientific experiment. The common features of
such data are that they consist of values of variables for each of a collection of objects/cases (e.g. in a sample
survey, the questions correspond to the variables and the respondents to the cases).
Many different packages and programs exist to aid in the statistical analysis of such data. One special
feature of IDAMS is that it also provides facilities for extensive data validation (e.g. code checking and
consistency checking) before embarking on analysis. As far as analysis is concerned, IDAMS performs classical
techniques such as table building, regression analysis, one-way analysis of variance, discriminant and cluster
analysis and also some more advanced techniques such as principal components factor analysis and analysis of
correspondences, partial order scoring, rank ordering of alternatives, segmentation and iterative typology. In
addition, WinIDAMS provides for interactive construction of multidimensional tables, interactive graphical
exploration of data and interactive time series analysis.
1.1 WinIDAMS User Interface
The WinIDAMS User Interface is a multiple document interface (MDI) which allows the user to work
simultaneously with different types of documents in separate windows.
The Interface provides the following:
definition of Data, Work and Temporary folders for an application;
Dictionary window for creating/updating/displaying Dictionary files;
Data window for creating/updating/displaying Data files;
Setup window to prepare/display Setup files;
Results window to display, copy and print selected parts of results;
general text editor;
an option for executing IDAMS setups from a file or from the active Setup window;
interactive data import/export facilities;
access to interactive data analysis components (Multidimensional Tables, GraphID, TimeSID);
on-line access to the Reference Manual.
1.2 Data Management Facilities
Aggregating data (AGGREG). Groups records from a number of cases into one record and outputs a
new dataset with one record for each group; for example, records representing members of a household
are grouped into a single record representing the household. The variables in the new records are summary
statistics of specified variables from the individual records, e.g. the sum, mean, minimum/maximum value.
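The aggregation idea can be illustrated outside IDAMS with a minimal Python sketch. It does not reproduce AGGREG control statements; the record layout and the names `aggregate`, `household` and `income` are assumptions made only for this illustration.

```python
from collections import defaultdict

def aggregate(records, group_key, value_key):
    """Group records by group_key and summarize value_key within each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec[value_key])
    # One output record per group, holding summary statistics.
    return [
        {
            group_key: key,
            "n": len(values),
            "sum": sum(values),
            "mean": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in sorted(groups.items())
    ]

# Hypothetical person-level records: two members of household 1, one of household 2.
persons = [
    {"household": 1, "income": 100},
    {"household": 1, "income": 300},
    {"household": 2, "income": 50},
]
households = aggregate(persons, "household", "income")
```

Each element of `households` plays the role of one household-level record in the output dataset.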
Building an IDAMS dataset (BUILD). A raw data file (which may contain multiple records per case) is
input along with a dictionary describing the variables to be selected. BUILD checks for non-numeric values
in numeric fields; blank fields can be recoded to user-specified numeric values and other non-numerics are
reported and replaced by 9s. The output is an IDAMS dataset comprising a Data file with a single record
per case and a dictionary which describes each field in the data records.
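The kind of field check described for BUILD can be sketched in Python as follows. This is an illustration under stated assumptions, not the BUILD implementation: the function name `clean_numeric_field`, the pattern used to recognize numeric values, and the choice to leave blank fields untouched when no recode value is given are all hypothetical.

```python
import re

# Assumed pattern for a numeric value: optional sign, digits, optional decimal point.
NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")

def clean_numeric_field(raw, blank_value=None):
    """Return (cleaned_text, problem) for one fixed-width numeric field."""
    width = len(raw)
    text = raw.strip()
    if text == "":
        if blank_value is None:
            return raw, None  # assumption: blanks are left as they are
        # Recode the blank field to the user-specified value, right-justified.
        return str(blank_value).rjust(width), None
    if NUMERIC.fullmatch(text):
        return raw, None
    # Any other non-numeric content is reported and replaced by 9s.
    return "9" * width, "non-numeric value %r" % raw

ok_field = clean_numeric_field("  12")
blank_field = clean_numeric_field("   ", blank_value=0)
bad_field = clean_numeric_field("1A3")
```

The second element of each result is `None` when nothing needs reporting, and a short message otherwise.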
Checking of codes (CHECK). Reports cases which have invalid variable values. Valid codes for each
variable are specified by the user and/or taken from the dictionary.
Checking of consistency (CONCHECK). Reports cases with inconsistencies between two or more vari-
ables. IDAMS Recode statements are used to specify the logical relationships to be checked.
Checking the merging of records (MERCHECK). Checks that the correct records are present for each
case in a file with multiple records per case. It outputs a file containing equal numbers of records per case.
Invalid or duplicate records can be deleted and missing records can be inserted with missing values specified
by the user.
Correcting data (CORRECT). Updates a Data file by applying corrections to individual variable values
for specified cases. The Results file contains a written trace of corrections allowing them to be archived.
Importing/exporting data (IMPEX). Import is aimed at building IDAMS datasets or matrices from files
coming from other software. The aim of export is to make possible the use of Data and Matrix files, stored
in or created by IDAMS, in other packages. Free and DIF format text files can be imported/exported.
Listing datasets (LIST). Values for selected variables (original or recoded) and/or selected cases can be
listed in the column format.
Merging datasets (MERGE). Two datasets can be merged by matching cases according to a common set
of variables called match variables. There are 4 options for selecting cases for the output dataset: (1) only
cases present in both files (intersection); (2) cases present in either file (union); (3) each case in the first file;
(4) each case in the second file. The user specifies which variables from each of the two input files are to be
output. An option exists for matching a case from one file with more than one case from the second file, e.g.
for adding household data from one file to each individual's record in a second file.
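The four case-selection options can be pictured as set operations on the match-variable values. The Python sketch below is only an illustration: the function `merge_cases` and the dict-based representation of a file are assumptions, and the way MERGE itself fills in variables for cases missing from one file is not modelled here.

```python
def merge_cases(first, second, option):
    """Merge two files, each given as a dict keyed by the match-variable value.

    option: 1 = intersection, 2 = union,
            3 = each case in the first file, 4 = each case in the second file.
    """
    if option == 1:
        keys = first.keys() & second.keys()
    elif option == 2:
        keys = first.keys() | second.keys()
    elif option == 3:
        keys = first.keys()
    elif option == 4:
        keys = second.keys()
    else:
        raise ValueError("option must be 1, 2, 3 or 4")
    merged = {}
    for key in sorted(keys):
        row = {}
        row.update(first.get(key, {}))   # variables taken from the first file
        row.update(second.get(key, {}))  # variables taken from the second file
        merged[key] = row
    return merged

first = {1: {"V1": 10}, 2: {"V1": 20}}
second = {2: {"V2": 5}, 3: {"V2": 7}}
intersection = merge_cases(first, second, 1)
union = merge_cases(first, second, 2)
```

In this sketch a case absent from one file simply lacks that file's variables in the merged record.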
Sorting and merging files (SORMER). This is a general purpose utility for sorting data into ascending
or descending order on up to 12 fields. Up to 16 files may be merged.
Subsetting datasets (SUBSET). Outputs a new dataset (Data and Dictionary files) containing selected
cases and/or variables from the input dataset. There is an option to check for duplicate cases.
Transforming data (TRANS). Allows variables created with the IDAMS Recode facility to be saved in a
permanent dataset.
1.3 Data Analysis Facilities
Cluster analysis (CLUSFIND). Performs cluster analysis by partitioning a set of objects (cases or variables)
into a set of clusters as determined by one of 6 algorithms, 2 based on partitioning around medoids, one
based on fuzzy clustering and the other 3 based on hierarchical clustering.
Configuration analysis (CONFIG). Performs analysis on a single input configuration, created for example
by the MDSCAL program. It has the capability of centering, norming, rotating, translating dimensions,
computing inter-point distances and scalar products. The configuration can be plotted after each transformation.
Discriminant analysis (DISCRAN). Looks for the best linear discriminant function(s) of a set of variables
which reproduces, as far as possible, an a priori grouping of the cases. It uses a stepwise procedure, i.e.
in each step the most powerful variable is entered. Three samples of cases can be distinguished: basic
sample on which the main discriminant analysis steps are performed, test sample on which the power of the
discriminant function is checked and anonymous sample which is used only for classifying the cases. Case
assignment and values of the first two discriminant factors (if there are more than 2 groups) can be saved in
a dataset.
Distribution and Lorenz functions (QUANTILE). Distribution functions with 2 to 100 subintervals,
Lorenz functions, Lorenz curve and Gini coefficients, and the Kolmogorov-Smirnov test.
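As a reminder of what a Lorenz curve and a Gini coefficient measure, the following Python sketch uses the common textbook definition (trapezoidal area under the Lorenz curve). It is not necessarily the formula used by QUANTILE; the function names are hypothetical and the values are assumed to be non-negative with a positive sum.

```python
def lorenz_points(values):
    """Return the Lorenz curve as (cumulative share of cases, cumulative share of total)."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / n, running / total))
    return points

def gini(values):
    """Gini coefficient as 1 minus twice the area under the Lorenz curve."""
    pts = lorenz_points(values)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area

equal_shares = gini([1, 1, 1, 1])      # perfectly equal distribution
one_holds_all = gini([0, 0, 0, 4])     # one case holds the whole total
```

With four cases, a perfectly equal distribution gives a coefficient of 0, while the fully concentrated one gives 0.75.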
Factor analysis (FACTOR). Covers a set of principal component factor analyses (scalar products, co-
variances, correlations) and factor analysis of correspondences. For each analysis, it constructs a matrix
representing the relations between variables and computes its eigenvalues and eigenvectors. Then it cal-
culates the case and/or variable factors giving for each case and/or variable its ordinate, its quality of
representation and its contributions to the factors. Factors can be saved in a dataset and a graphic repre-
sentation of cases and/or variables in the factor space can be obtained. Active and passive variables and
cases can be distinguished.
Linear regression (REGRESSN). Multiple linear regression analysis: standard and stepwise. Either a
dataset or a correlation matrix may be used as input. Residuals can be printed with the Durbin-Watson
statistic for their first-order autocorrelation, and they can also be output for further analyses.
Multidimensional scaling (MDSCAL). This is a non-metric multidimensional scaling procedure for the
analysis of similarities. Operates on a matrix of similarity or dissimilarity measures and looks for the best
geometric representation of the data in n-dimensional space. The user controls the dimensionality of the
configuration obtained, the distance metric used and the way the ties (equal values) in the input data should
be handled.
Multiple classification analysis (MCA). Examines the relationships between several predictors and a
single dependent variable, and determines the effect of each predictor before and after adjustment for its
inter-correlations with other predictors. Provides information about bivariate and multivariate relationships
between predictors and the dependent variable. Residuals can be printed and/or saved in a dataset.
Multivariate analysis of variance (MANOVA). Performs univariate and multivariate analysis of variance
and of covariance, using a general linear model. Up to eight factors (independent variables) can be used.
If more than one dependent variable is specified, both univariate and multivariate analyses are performed.
The program performs an exact solution with either equal or unequal numbers of cases in the cells.
One-way analysis of variance (ONEWAY). Descriptive statistics of the dependent variable within cate-
gories of the control variable and one-way analysis statistics such as: total sum of squares, between means
sum of squares, within groups sum of squares, eta and eta squared (unadjusted and adjusted) and the F-test
value.
Partial order scoring (POSCOR). Calculates ordinal scale scores from interval or ordinal scale variables.
Scores are calculated for each case involved in analysis and they measure the relative position of the case
within the set of cases. The scores, optionally with other user-specified variables, are output in the form of
an IDAMS dataset.
Pearsonian correlation (PEARSON). Calculates Pearson's r correlation coefficients, covariances, and
regression coefficients. Pairwise or casewise deletion of missing data can be requested. Output correlation
and covariance matrices can be saved in a file.
Rank-ordering of alternatives (RANK). Determines a reasonable rank-order of alternatives using
preference data and three different ranking procedures, one based on classical logic and two others based on fuzzy
logic. Preference data can represent either a selection or ranking of alternatives. Two types of individual
preference relations can be specified: weak and strict. With fuzzy ranking, the data completely determine
the results obtained whereas with classical ranking the user has the possibility of controlling the calculations.
Scatter diagrams (SCAT). Scatter diagrams, univariate statistics (mean, standard deviation and N) and
bivariate statistics (Pearson's r and regression statistics: coefficient B and constant A).
Searching for structure (SEARCH). A binary segmentation procedure to develop predictive models. The
algorithm is based on the question "what dichotomous split on which predictor variable will give the
maximum improvement in the ability to predict values of the dependent variable?", embedded in an
iterative scheme.
Univariate and bivariate tables (TABLES). Options include: (1) univariate simple and cumulative
frequency and percentage distributions; (2) univariate statistics: mean, median, mode, variance, standard
deviation, skewness, kurtosis, minimum, maximum; (3) bivariate frequency tables with row, column and
total percentages; (4) tables of mean values of an additional variable; (5) bivariate statistics: t-test of means
between pairs of rows, Chi-square, contingency coefficient, Cramer's V, Kendall's Taus, Gamma, Lambdas,
Spearman rho, a number of statistics for Evidence Based Medicine, and 3 non-parametric tests: Wilcoxon,
Mann-Whitney and Fisher.
Typology and ascending classification (TYPOL). Creates a typology variable as a summary of a large
number of variables both quantitative and qualitative. The user chooses the initial and final number of
groups, the type of distance used, and the way the initial typology is started. The groups of initial typology
are stabilized using an iterative procedure. The number of groups can be reduced using an algorithm of
hierarchical ascending classication. A distinction can be made between active variables which participate
in the construction of typology, and passive variables, for which main statistics are calculated within the
groups of the typology.
Interactive multidimensional tables. This component allows the user to visualize and customize
multidimensional tables with frequencies, row, column and total percentages, summary statistics (sum, count, mean,
maximum, minimum, variance, standard deviation) of additional variables, and bivariate statistics. Up to
seven variables can be nested in rows or in columns. Construction of a table can be repeated for each value
of up to three page variables. The tables can also be printed, or exported in free format (comma or
tabulation character delimited) or in HTML format.
Interactive graphical exploration of data. A separate component, GraphID, is available for exploring
data through graphic displays. The basic display is in the form of multiple scatterplots for different pairs
of variables. Additional information such as histograms and regression lines may be displayed on each plot.
The plots may be manipulated in various ways. For example, selected cases can be marked in one plot and
then highlighted in all the other plots. Parts of the display may be enlarged (zoomed). IDAMS matrices
are displayed as three dimensional plots with rows and columns being represented by two of the axes and
the third dimension being used to show the size of the statistic for each cell.
Interactive time series analysis. Another separate component, TimeSID, supports interactive
analysis of time series. It contains analysis of trends, auto-correlations and cross-correlations,
statistical and graphical analysis of time series values, tests of randomness and trends, forecasting for short
terms, periodograms and estimation of spectral densities. Series can be transformed by calculating
averages, arithmetic compositions, sequential differences and rates of change, smoothed by moving averages and
decomposed using frequency filters.
1.4 Data in IDAMS
IDAMS dataset - the Data file. The data file input to IDAMS may be any character (ASCII) fixed
format file, i.e. the values for a given variable occupy the same position (field) in the record for every case.
Characteristics of this file are:
1-50 records per case;
each case can contain up to 4096 characters;
number of cases limited by the disk capacity and the internal representation of numbers;
variables can be numeric (up to 9 characters) or alphabetic (up to 255 characters).
IDAMS dataset - the Dictionary file. The dictionary is used to describe the data:
it may contain up to 1000 variables identified by a unique number between 1 and 9999;
for each variable, it contains at minimum the variable's number, its type (numeric or alphabetic), and
its location in the data record;
for each variable, a variable name, two missing data codes, the number of decimal places and a reference
number may also be specified;
for qualitative variables, codes and corresponding labels may be included.
The pair of files consisting of a Dictionary file and the Data file it describes is known as an IDAMS dataset.
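The relationship between a fixed format record and its dictionary can be sketched in Python. The in-memory dictionary layout used here (variable number mapped to start column, width and type) is a hypothetical simplification for illustration and is not the format of an IDAMS Dictionary file; the function name `parse_record` is likewise an assumption.

```python
def parse_record(record, dictionary):
    """Extract variable values from one fixed format record.

    dictionary maps a variable number to (start, width, vtype), where start is
    a 1-based column position and vtype is "numeric" or "alphabetic".
    """
    values = {}
    for number, (start, width, vtype) in dictionary.items():
        field = record[start - 1:start - 1 + width]
        name = "V%d" % number
        if vtype == "numeric":
            # An explicit decimal point gives a float, otherwise an integer.
            values[name] = float(field) if "." in field else int(field)
        else:
            values[name] = field
    return values

# Hypothetical layout: identification (4 columns), education (1), sex (1), age (2).
dictionary = {
    1: (1, 4, "numeric"),
    2: (5, 1, "numeric"),
    3: (6, 1, "numeric"),
    4: (7, 2, "numeric"),
}
case = parse_record("13006231", dictionary)
```

Because every value sits at the same position in every record, the same dictionary can be applied to each case in turn.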
IDAMS matrices. Some analysis programs use a square or rectangular matrix as input rather than the
raw data.
The square matrix is used for symmetric arrays of bivariate statistics with a constant on the diagonal.
Only the upper right-hand corner of the matrix is stored, without the diagonal.
The rectangular matrix is for non-symmetric arrays of values. The meaning of the rows and columns
varies according to the IDAMS program.
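Storing only the upper right-hand corner of a symmetric matrix, without the diagonal, can be illustrated with the Python sketch below. The row-by-row ordering of the stored values and the function names are assumptions for illustration; the actual layout of an IDAMS matrix file is described elsewhere in the manual.

```python
def pack_upper(matrix):
    """Return the off-diagonal upper triangle of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def unpack_upper(values, n, diagonal=1.0):
    """Rebuild the full symmetric n x n matrix from the packed values."""
    full = [[diagonal] * n for _ in range(n)]
    it = iter(values)
    for i in range(n):
        for j in range(i + 1, n):
            # Symmetry: the same value fills both (i, j) and (j, i).
            full[i][j] = full[j][i] = next(it)
    return full

corr = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
packed = pack_upper(corr)
restored = unpack_upper(packed, 3, diagonal=1.0)
```

For n variables only n(n-1)/2 values need to be kept, since the diagonal is a known constant and the lower triangle mirrors the upper one.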
1.5 IDAMS Commands and the Setup File
With the exception of WinIDAMS interactive components, execution of an IDAMS program is launched by
a setup. The setup contains information such as file specifications, program control statements, variable
recoding instructions, etc., separated by IDAMS commands (starting with a $ character) which identify the
kind of information being specified. The first IDAMS command in the Setup file always identifies the first
program to be executed, e.g.
$RUN TABLES
$FILES
DICTIN = name of Dictionary file
DATAIN = name of Data file
$SETUP
control statements for TABLES program
$RECODE
variable recoding statements
1.6 Standard IDAMS Features
Case selection. By default all cases from a Data file will be processed in a program execution. To select
a subset, a filter statement is included in the setup, e.g. INCLUDE V3=1 (include only those cases where
variable 3 is equal to 1).
Variable selection. Variables are referenced by their numbers assigned in the dictionary. A set of variables
is specified in a variable list following keywords such as VARS, CONVARS, OUTVARS. Such variable lists
may also include R-variables constructed by the IDAMS Recode facility (see below), e.g.
VARS=(V3-V6,V129,R100,R101).
Transforming/recoding data. A powerful Recode facility permits the recoding of variables and the
construction of new variables. Recoding instructions are prepared by the user in the IDAMS Recode language.
This includes the possibility of arithmetic computation as well as the use of several special functions for
operations such as the grouping of values, the creation of dummy variables, etc. Conditional statements
are also allowed. Examples of Recode statements for constructing 3 new variables R100, R101 and R102 are:
R100=V4+V5
R101=BRAC(V10,0-15=1,16-60=2,61-98=3,99=9)
IF (MDATA(V3,V4) OR V4 EQ 0) THEN R102=99 ELSE R102=V3*100/V4
The R-variables thus constructed for each case can be used temporarily in the program being executed or
can be saved in a dataset using the TRANS program.
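For readers more familiar with general-purpose languages, the logic of the three Recode statements for R100, R101 and R102 can be approximated in Python. This is a sketch, not a description of the Recode facility: `brac`, `mdata` and `recode` are hypothetical Python stand-ins, MDATA is assumed to be true when any listed variable holds a missing data code, and the treatment of missing data in plain arithmetic is not modelled.

```python
def brac(value, brackets):
    """Return the code of the first (low, high, code) bracket containing value."""
    for low, high, code in brackets:
        if low <= value <= high:
            return code
    return None  # assumption: no bracket matched

def recode(case, missing):
    """case maps variable names to values; missing maps names to sets of missing codes."""
    def mdata(*names):
        # Assumed meaning: true if any of the named variables is missing.
        return any(case[name] in missing.get(name, set()) for name in names)

    result = {}
    result["R100"] = case["V4"] + case["V5"]
    result["R101"] = brac(case["V10"],
                          [(0, 15, 1), (16, 60, 2), (61, 98, 3), (99, 99, 9)])
    if mdata("V3", "V4") or case["V4"] == 0:
        result["R102"] = 99
    else:
        result["R102"] = case["V3"] * 100 / case["V4"]
    return result

case = {"V3": 50, "V4": 200, "V5": 7, "V10": 42}
recoded = recode(case, missing={"V3": {999}})
```

The guard on R102 avoids both a division by zero and a percentage computed from a missing value.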
Weighting data. When complex sampling procedures are used during data collection, it may be necessary
to use different weights for cases during analysis. Such weights are usually stored as a variable in the Data
file. The WEIGHT parameter is then used in the program control statements to invoke weighting, e.g.
WEIGHT=V5.
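The effect of a weight variable on a simple statistic can be sketched in Python. The function `weighted_mean` and the variable names are assumptions for illustration; each IDAMS program documents its own use of weights.

```python
def weighted_mean(cases, variable, weight):
    """Mean of `variable`, with each case counted according to its weight."""
    total_weight = sum(case[weight] for case in cases)
    weighted_sum = sum(case[variable] * case[weight] for case in cases)
    return weighted_sum / total_weight

# Hypothetical cases: V2 is the analysis variable, V5 holds the weight.
cases = [
    {"V2": 10, "V5": 1},
    {"V2": 20, "V5": 3},
]
mean_v2 = weighted_mean(cases, "V2", "V5")
```

Here the second case counts three times as much as the first, so the weighted mean (17.5) lies closer to 20 than the unweighted mean of 15.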
Treatment of missing data and bad data. Special values for each numeric variable can be identified
as missing data codes and stored in the dictionary. During data processing missing data are handled through
two parameters:
MDVALUES (specifies which missing data codes are to be used to check for missing data in numeric
variables);
MDHANDLING (specifies what is to be done if missing data are encountered).
Normally it is assumed that data have been cleaned prior to analysis. If this is not the case, then the
BADDATA parameter is available for skipping cases with non-numeric values (including blank fields) in
numeric fields, or for treating such values as missing data.
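One common way of handling missing data, excluding the affected cases from a computation, can be sketched in Python. This is a simplification for illustration only: a value is assumed to be missing when it equals one of the listed codes, and the function name is hypothetical. The options actually offered by MDVALUES and MDHANDLING are described in the individual program write-ups.

```python
def mean_excluding_missing(cases, variable, missing_codes):
    """Mean of `variable` over cases whose value is not a missing data code."""
    valid = [case[variable] for case in cases
             if case[variable] not in missing_codes]
    if not valid:
        return None  # no valid cases: the mean is undefined
    return sum(valid) / len(valid)

# Hypothetical data: 98 and 99 are the missing data codes of V7.
cases = [{"V7": 4}, {"V7": 6}, {"V7": 99}, {"V7": 8}]
result = mean_excluding_missing(cases, "V7", {98, 99})
```

Without the exclusion, the code 99 would be treated as a real value and would distort the mean.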
1.7 Import and Export of Data
IDAMS does not use a special internal file format for storing data. Any character file in fixed format can be
described by an IDAMS dictionary and then input to IDAMS. In addition, free format data with Tab,
comma or semicolon used as separator can be imported through the WinIDAMS User Interface. Moreover,
the IMPEX program allows a fixed format IDAMS file to be created from any text file in free or DIF format.
Data files created by IDAMS are always character files in fixed format. Such files can be used directly by
other software along with the appropriate data descriptive information for that software. Free format files
with Tab, comma or semicolon used as separator can be obtained through the WinIDAMS User Interface.
Moreover, the IMPEX program allows a fixed format IDAMS file to be exported as a text file in free or DIF
format.
IDAMS matrices are stored in a format specific to IDAMS (described in the "Data in IDAMS" chapter).
The IMPEX program can be used to import/export free format matrices.
1.8 Exchange of Data Between CDS/ISIS and IDAMS
There is a separate program, WinIDIS, which prepares data description and performs data transfer between
IDAMS and CDS/ISIS (the UNESCO software for database management and information retrieval). Such
transfer is controlled by IDAMS and ISIS data description files (the IDAMS dictionary and the CDS/ISIS
Field Definition Table). When going from ISIS to IDAMS, new IDAMS Dictionary and Data files are always
constructed and they can be merged with other data using IDAMS data management facilities. When going
from IDAMS to ISIS, there are three possibilities: (1) a completely new data base can be constructed, (2)
transferred records can be added to an existing data base as new data base records, (3) records of an existing
data base can be updated with the transferred data.
1.9 Structure of this Manual
All the general features of IDAMS, including the Recode facility, are described in Part 1 of this Manual.
Part 2 includes installation instructions, a description of files and folders used in WinIDAMS, a section
entitled "Getting Started" which takes a user through the steps required to perform a simple task, and a
description of the WinIDAMS User Interface.
In-depth descriptions of each IDAMS program are given in Parts 3 and 4. These write-ups contain the
following sections:
General Description. A statement of the primary purpose of the program.
Standard IDAMS Features. Statements about the case and variable selection possibilities, data
transformation, weighting capabilities, and missing data handling.
Results. Details of results destined to be printed (or reviewed on the screen).
Description of output and input files. One section for each IDAMS dataset, each matrix and each
other input or output file, giving a description of their contents.
Setup Structure. A designation of the file specifications, IDAMS commands, and program control
statements needed to execute the program.
Program Control Statements. The parameters and/or formats of each of the program control
statements with an example of each type.
Restrictions. A summary of the program limitations.
Examples. Examples of complete sets of control statements for executing the program.
Part 5 provides a description of the WinIDAMS interactive components for construction of multidimensional
tables, for graphical exploration of data and for time series analysis.
Part 6 provides details of statistical techniques, formulas and bibliographical references for all analysis
programs.
Finally, errors issued by IDAMS programs are summarized in the Appendix.
Part I
Fundamentals
Chapter 2
Data in IDAMS
2.1 The IDAMS Dataset
2.1.1 General Description
The dataset consists of 2 separate files: a Data file and a Dictionary file which describes some or all of the
fields (variables) in the records of the data file. All Dictionary/Data files output by IDAMS programs are
IDAMS datasets.
2.1.2 Method of Storage and Access
Both Dictionary and Data files are read and written sequentially. Thus they may be stored on any media.
There is no special IDAMS internal system file as in some other packages. The files are in character/text
format (ASCII) and can be processed at any time with general utilities or editors, or input directly to other
statistical packages.
2.2 Data Files
2.2.1 The Data Array
Irrespective of its actual format in the data file, the data can be visualized as a rectangular array of variable
values, where element $x_{ij}$ is the value of the variable represented by the j-th column for the case
represented by the i-th row. For example, the data from a survey can be displayed in the following way:
Cases      Variables
           identification   education   sex   age   ...
_________________________________________________________________
case 1     1300             6           2     31    ...
case 2     1301             2           1     25    ...
.          1302             3           1     55    ...
.          .                .           .     .     ...
In the example, each row represents a respondent in a survey and each column represents an item from the
questionnaire.
2.2.2 Characteristics of the Data File
These files normally, but not necessarily, contain fixed length records, since the end of the record is recognized
by carriage return/line feed characters. However, the length of the longest record must be supplied on the
file definition (see $FILES command). There is no limit to the number of records in the Data file.
The maximum record length is 4096 characters.
Each case may consist of more than one record (up to a maximum of 50). If, in a particular program
execution, variables are to be accessed from more than one type of record, then there must be exactly the
same number of records for each case. The MERCHECK program can be used to create files complying with
this condition. Note that any Data file output by an IDAMS program is always restructured to contain a
single record per case.
If a raw data file contains different record types (and the record type is coded) and does not have exactly the
same number of records per case, IDAMS programs can be executed using variables from one record type at
a time by selecting only that record type at the start.
2.2.3 Hierarchical Files
IDAMS only processes rectangular files as described above. Hierarchical files can be handled by storing
records from the different levels in different files and then using the AGGREG and MERGE programs
to produce composite records containing variables from the different levels. Alternatively, the complete
hierarchical data file can be processed one level at a time by filtering records for that level only (providing
record types are coded).
2.2.4 Variables
Referencing variables. The variables in a Data file are identified by a unique number between 1 and 9999.
This number, preceded by a V (e.g. V3) is used to refer to a particular variable in control statements to
programs. The variable number is used to index a variable-descriptor record in the dictionary which provides
all other necessary information about the variable such as its name and its location in the data record.
Variable types. Variables can be of numeric or alphabetic type, both stored in character mode.
Numeric variables. These can be positive or negative valued with the following characteristics:
A value can be composed of the numeric characters 0-9, a decimal point and a sign (+,-). Leading
blanks are allowed.
Values must be right justified in the field (i.e. with no trailing blanks) unless an explicit decimal point
appears.
Maximum field width is 9 but only up to 7 significant digits (both integers and decimals taken together)
are retained in processing.
Variable values can be integers (e.g. an age variable or a categorical variable such as sex) or may be
decimal valued (e.g. a variable with percentage values). The number of decimals (NDEC) is stored in
the variable's descriptor record in the dictionary. Normally the decimal point is implicit and does
not appear in the data. In this case NDEC gives the number of digits of the variable's value that are
to be treated as decimal places. If an explicit decimal point is coded in the data, then NDEC is used
to determine the number of digits to the right of the decimal point that will be retained, rounding
the value if necessary, e.g. values coded 4.54 and 4.55 with NDEC=1 will be used as 4.5 and 4.6
respectively.
A sign (if it appears) must be the first character, e.g. -0123.
Blank fields are considered non-numeric and treated as bad data. See below for how to deal with
blanks used in the data to indicate missing or inapplicable data.
With the exception of BUILD, all IDAMS programs accept values in exponential notation, e.g. a value
coded .215E02 will be used as 21.5.
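The rules for numeric values can be illustrated with a short Python sketch. It is not part of IDAMS: the function name decode_numeric is hypothetical, rounding is assumed to be "half up" (which agrees with the 4.54/4.55 example), and applying NDEC to values in exponential notation is an assumption, since the rules above do not cover that case. The 9-character field width and 7-digit precision limits are not modelled.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Digits with an optional sign, decimal point and exponent (e.g. -0123, 4.55, .215E02).
NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)(E[+-]?\d+)?", re.IGNORECASE)


def decode_numeric(field, ndec=0):
    """Return the value of a numeric field as a Decimal, or None for bad data."""
    # Trailing blanks are tolerated only when an explicit decimal point appears.
    text = field.strip(" ") if "." in field else field.lstrip(" ")
    if not NUMERIC.fullmatch(text):
        return None  # blank or non-numeric field: bad data
    value = Decimal(text)
    if "." in text or "E" in text.upper():
        # Explicit decimal point (or exponent, by assumption): keep NDEC places.
        return value.quantize(Decimal(1).scaleb(-ndec), rounding=ROUND_HALF_UP)
    # Implicit decimal point: the last NDEC digits are decimal places.
    return value.scaleb(-ndec)


print(decode_numeric("4.55", 1))   # explicit decimal point, rounded to NDEC=1
print(decode_numeric("  123", 1))  # implicit decimal point
print(decode_numeric("04 ", 0))    # trailing blank without a decimal point
```

Returning None for bad data leaves the caller free to apply whatever bad-data handling is appropriate.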
Alphabetic variables. Alphabetic variables can be held in Data files and can be up to 255 characters
long. They can be used in data management programs. 1-4 character alphabetic variables can also be used
in filters. In order to be used in analysis, 1-4 character alphabetic variables must be recoded to numeric
values. This can be done with Recode's BRAC function.
2.2.5 Missing Data Codes
The value of a variable for a particular case may be unknown for a number of reasons, for example a question
may be inapplicable to certain respondents or a respondent may refuse to answer a question. Special missing
data codes can be established for each numeric variable and coded into the data when needed. Two missing
data codes are allowed: MD1 and MD2. If used, any value in the data equal to MD1 is considered a missing
value; any value greater than or equal to MD2 (if MD2 is positive or zero) or less than or equal to MD2 (if
MD2 is negative) is also considered missing.
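The MD1/MD2 rule can be written as a small predicate. The following Python sketch is illustrative only; the function name is_missing and the use of None for "no code defined" are assumptions made for the example.

```python
def is_missing(value, md1=None, md2=None):
    """Return True if value counts as missing under the MD1/MD2 rule."""
    if md1 is not None and value == md1:
        return True  # MD1 is an exact-match code
    if md2 is not None:
        if md2 >= 0:
            return value >= md2  # positive or zero MD2: everything at or above it
        return value <= md2      # negative MD2: everything at or below it
    return False


# An age variable with MD1=99, and a variable with a negative MD2 of -5.
print(is_missing(99, md1=99), is_missing(25, md1=99))
print(is_missing(-9, md2=-5), is_missing(3, md2=-5))
```

Because MD2 defines a whole range, a single MD2 code can cover several distinct reasons for missing data (for example codes 97, 98 and 99 with MD2=97).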
These missing data codes are stored in the dictionary record for the variable. Like data values, they
can be integer or decimal valued, with an implicit or explicit decimal point. If MD1 or MD2 is specified with
an implicit decimal point, NDEC gives the number of digits to be treated as decimal places. If an explicit
decimal point is coded in MD1 or MD2, then NDEC determines the number of digits to the right of the
decimal point to be retained, rounding the value accordingly.
When a variable's MD1 and MD2 codes are blank in the dictionary, this means that there are no special
numeric missing data codes. During an IDAMS program execution, blank dictionary MD1 and MD2 fields
are filled in by the default missing data codes of 1.5 × 10^9 and 1.6 × 10^9 respectively.
Since the missing data codes are each limited to a maximum of 7 digits (or 6 digits and a negative sign),
they can present a problem for 8 and 9 digit variables. The user should consider the use of a negative first
missing data code in this case.
2.2.6 Non-numeric or Blank Values in Numeric Variables - Bad Data
In IDAMS data management programs, data values are merely copied from one place to another and conver-
sion to a computational (binary) mode is not carried out; in this case there is no check on whether numeric
variables have numeric values. However, when variables are being used for analysis or in Recode operations,
then their values are converted to binary mode and values containing non-numeric characters will cause
problems. Normally data should be cleaned of such characters prior to analysis. In addition, blank values in
numeric variables are not automatically treated as missing values; they are also considered to be non-numeric
or bad data.
To allow for analysis of incompletely cleaned data and for the handling of unrecoded blank fields, the
BADDATA parameter may be used to treat blank and other non-numeric values as missing and thus have
the possibility of eliminating them from analysis. Specification of the parameter BADDATA=MD1 or
BADDATA=MD2 results in the conversion of bad values to the MD1 or MD2 code for the variable. If
the MD1 or MD2 codes are blank, then bad data values are converted to the corresponding default missing
data code (see above) and are thus treated as missing values (see the description of the BADDATA parameter
in "The IDAMS Setup File" chapter).
2.2.7 Editing Rules for Variables Output by IDAMS Programs
IDAMS programs always create a Data file and a corresponding IDAMS dictionary, i.e. an IDAMS dataset.
The Data file contains one record for each case. The record length is the sum of the field widths of all
variables output and is determined by the program.
Numeric variable values are edited to a standard form as described below:
If the entire field contains only the numeric characters 0-9, these are output exactly as they appear in
the input data.
If the field contains a number entered with leading blanks (e.g. "  5"), the blanks are converted to
zeros before the data are output. Fields with trailing blanks (e.g. "04 " in a three digit numeric field),
embedded blanks (e.g. "0 4") and all blanks are treated according to the BADDATA specification.
If the field contains a positive value or a negative value with the + and - characters explicitly entered,
the positive sign is removed and the negative sign is put before the first significant numeric digit.
If the field contains a number with an explicit decimal point, this is removed and the value output has
the same width as the input field and n decimal places as defined in the NDEC field of the variable
description. Leading blanks in the field are converted to zeros. If more than n digits are found in the
input field after the decimal point, the value is rounded and output to n decimal places (e.g. if n=2,
an input value of 2.146 will be output as 215; if n=0, an input value of 1.5 will be output as 002).
Trailing blanks do not cause an error condition. If fewer than n digits are found, zeros are inserted on
the right for the missing decimal places.
Values which are too big to fit into the field assigned are treated according to the BADDATA specification.
Alphabetic variable values are not edited and are the same on input and output.
2.3 The IDAMS Dictionary
2.3.1 General Description
The dictionary is used to describe the variables in the data. For each variable it must contain at minimum the
variable's number, its type and its location in the data record. In addition, a variable name, two missing data
codes, the number of decimal places and a reference number or name may be given. This information is stored
in variable-descriptor records sometimes known as T-records. Optional C-records for categorical variables
give labels for the different possible codes. The first record in the dictionary, the dictionary-descriptor record,
identifies the dictionary type, gives the first and last variable numbers used in the dictionary and specifies
the number of data records making up a case.
The original dictionary is prepared by the user to describe the raw data. IDAMS programs which output
datasets always produce new dictionaries reflecting the new format of the data.
Dictionary records have fixed format and are 80 characters long.
A detailed description of each type of dictionary record is given below.
Dictionary-descriptor record. This is always the first record in the dictionary.
Columns  Content
4        3 (indicates the type of dictionary).
5-8      First variable number (right justified).
9-12     Last variable number (right justified).
13-16    Number of records per case (right justified).
20       Form in which variable location is specified (columns 32-39) on the variable-descriptor records.
         blank  Record number and starting and ending columns. Record length must be 80 to use
                this format if the number of records per case is > 1.
         1      Starting location and field width.
Variable-descriptor records (T-records). The dictionary contains one such record for each variable.
These records are arranged in ascending order by the variable number. The variable numbers need not be
contiguous. The maximum number of variables is 1000.
Columns  Content
1        T
2-5      Variable number.
7-30     Variable name.
32-39    Location; according to column 20 of the dictionary-descriptor record.
         Either
         32-33  Record sequence number containing starting column of variable.
         34-35  Starting column number.
         36-37  Record sequence number containing ending column of variable.
         38-39  Ending column number.
         Or
         32-35  Starting location of the variable within the case.
         36-39  Field width (1-9 for numeric variables and 1-255 for alphabetic variables).
40       Number of decimal places (numeric variables only).
         Blank implies no decimal places.
41       Type of variable.
         blank  Numeric.
         1      Alphabetic.
45-51    First missing data code for numeric variables (or blanks if no 1st missing data code).
         Right justified.
52-58    Second missing data code for numeric variables (or blanks if no 2nd missing data code).
         Right justified.
59-62    Reference number (optional - can be used to contain some unchangeable alphanumeric reference
         for the variable, e.g. the original variable number or a question reference).
73-75    Study ID (optional - can be used to identify the study to which this dictionary belongs).
Note 1: When record and column numbers are used to indicate variable location, listings of the dictionary
records do not show the record and column numbers as they appear on the dictionary record. Rather, the
variable location is translated to and printed in the starting location/width format. For example, for a
variable in columns 22-24 of the third record of a multiple record (record length 80) per case data file, the
starting location will be 182 (2 * 80 + 22) and the width 3.
Note 2: If there is more than one record per case and the record length is not 80, then starting location and
field width notation must be used on the T-records. The starting location is counted from the start of the
first record. For example, for records of length 121, the starting location of a field at position 11 of the 2nd
record for a case would be 132.
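The translation described in Notes 1 and 2 is simple arithmetic: the records of a case are treated as if laid end to end. A minimal Python sketch follows; the function name starting_location is hypothetical.

```python
def starting_location(record_no, column, record_length=80):
    """Translate (record number, column) into a starting location within the case.

    Records and columns are numbered from 1.
    """
    if record_no < 1 or not 1 <= column <= record_length:
        raise ValueError("record number or column out of range")
    return (record_no - 1) * record_length + column


# The two examples given in Notes 1 and 2.
print(starting_location(3, 22))                    # third record, column 22
print(starting_location(2, 11, record_length=121)) # second record, position 11
```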
Code-label records (C-records). The dictionary may optionally contain these records for any of the
variables. They follow immediately after the T-record for the variable to which they apply and provide codes
and their labels for different possible values of the variable. They are used by programs such as TABLES to
print row and column labels along with the corresponding codes. They can also be used as the specification
of valid codes for a variable during data entry with the WinIDAMS User Interface and for data validation
with the program CHECK.
Columns  Content
1        C
2-5      Variable number.
6-9      Reference number (optional - can be used to contain some unchangeable alphanumeric reference
         for the variable, e.g. the original variable number or a question reference).
15-19    Code value left justified.
22-72    Label for this code. (Note that only the first 8 characters will be used by analysis programs
         printing code labels although the complete label will appear in listings of the dictionary).
73-75    Study ID (optional).
2.3.2 Example of a Dictionary
Columns:          1         2         3         4         5         6...
         123456789012345678901234567890123456789012345678901234567890...
            3   1  20   1   1
         T   1 Identification              1   5
         T   2 Age                         6   2          99
         T   3 Sex                         8   1
         C   3         1      Female
         C   3         2      Male
         T  11 Region                     16   1
         C  11         1      North
         C  11         2      South
         C  11         3      East
         C  11         4      West
         T  12 Grade average              17   31        000    900
         T  20 Name                       31  30 1
This is a dictionary describing 6 data fields in a data record as shown diagrammatically below.
1-5     6-7     8       16      17-19   31-60
V1      V2      V3      V11     V12     V20
ID      Age     Sex     Region  Grade   Name
Locations of variables are expressed in terms of starting position and field width (1 in column 20 of
dictionary-descriptor) and there is one record per case (1 in column 16). There is one implied decimal place in the
grade average variable (V12). The age variable has a code 99 for missing data. For the grade average, 0s
imply missing data as do all values greater than or equal to 90.0. The name of each respondent (V20) is
recorded as a 30 character alphabetic (type 1) variable. Note that variable numbers need not be contiguous
and that not all fields in the data need to be described.
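As an illustration of the T-record layout, the following Python sketch reads a variable-descriptor record in the "starting location and field width" form (1 in column 20 of the dictionary-descriptor record). It is not an IDAMS utility: parse_t_record and make_t_record are hypothetical names, and make_t_record exists only to build correctly aligned test records for the example.

```python
def make_t_record(number, name, start, width, ndec="", vtype="", md1="", md2=""):
    """Build a T-record with each item placed in its documented columns."""
    return (f"T{number:>4} {name:<24} {start:>4}{width:>4}"
            f"{ndec:1}{vtype:1}   {md1:>7}{md2:>7}")


def parse_t_record(line):
    """Parse a T-record written in starting location / field width form."""
    rec = line.ljust(80)
    if rec[0] != "T":
        raise ValueError("not a T-record")

    def col(first, last):
        # Columns are numbered from 1 and the range is inclusive.
        return rec[first - 1:last].strip()

    return {
        "number": int(col(2, 5)),
        "name": col(7, 30),
        "start": int(col(32, 35)),
        "width": int(col(36, 39)),
        "ndec": int(col(40, 40) or 0),     # blank implies no decimal places
        "alphabetic": col(41, 41) == "1",  # blank means numeric
        "md1": col(45, 51) or None,
        "md2": col(52, 58) or None,
    }


grade = make_t_record(12, "Grade average", 17, 3, ndec="1", md1="000", md2="900")
print(parse_t_record(grade))
```

The missing data codes are kept as text here because their interpretation depends on NDEC.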
2.4 IDAMS Matrices
There are two types of IDAMS matrices: square and rectangular. Both types are self-described, but unlike
the IDAMS dataset, the dictionary is stored in the same file as the array of values. In general, these
matrices are created by one IDAMS program to be used as input to another program and the user need
not be familiar with the format. If, however, it is necessary to prepare a similarity matrix, a configuration
matrix, etc. by hand, then the formats described below must be observed.
Regardless of type, all records are fixed length 80-character records.
2.4.1 The IDAMS Square Matrix
The square matrix can be used only for a square and symmetric array. Only the values in the upper-right
triangular, off-diagonal portion of the array are actually stored in the square matrix. An array of Pearsonian
correlation coefficients is suitably stored like this.
Programs which input/output square matrices. PEARSON outputs square matrices of correlations
and covariances; REGRESSN outputs a square matrix of correlations; TABLES outputs square matrices of
bivariate measures of association. These matrices are appropriate input to other programs, e.g. the correlation
matrix output from PEARSON can be input to REGRESSN and to CLUSFIND. Moreover, CLUSFIND
and MDSCAL input a square matrix of similarities or dissimilarities.
Example.
Columns:                      111111111122222222223...
                     123456789012345678901234567890...
Matrix descriptor  |    2   4
Format statements  | #F (12F6.3)
                   | #F (6E12.5)
Variable           | #T   1 AGE
identifications    | #T   3 EDUCATION
                   | #T   9 RELIGION
                   | #T  10 SEX
Array of values    |  -.011 -.174 -.033
                   |   .131 -.105
                   |  -.133
Means & standard   |  0.33350E 01 0.54950E 01 0.50251E 01 0.40960E 01
deviations         |  0.20010E 01 0.19856E 01 0.15000E 01 0.12345E 01
Format. The square matrix contains the following:
1. A matrix-descriptor record. This, the first record, gives the matrix type and the dimensions of the
array of values.
Columns  Content
4        2 (indicates square matrix).
5-8      The number of variables (right justified).
2. A Fortran format statement describing each row of the array of values. The format statement describes
the number of value fields per 80-character record and the format of each. For example, a format of
(12F6.3) indicates that each row of the array is recorded with up to 12 values per record, each value
occupying 6 columns, 3 of which are decimals. If a row contains more than 12 values, a new record
contains the 13th value, etc. Each new row of the array always starts on a new record.
Columns  Content
1-2      #F
3-80     The format statement, enclosed in parentheses.
3. A Fortran format statement describing the vectors of the variable means and standard deviations. The
format statement describes the number of values per record and the format of each.
Columns  Content
1-2      #F
3-80     The format statement, enclosed in parentheses.
4. Variable identification records. These are n records, where n is the number of variables specified on
the matrix-descriptor record. The order of these records corresponds to the order of variables indexing
the rows (and columns) of the array of values. When a matrix is created by an IDAMS program, the
variable numbers and names are retained from the IDAMS dataset from which the bivariate statistics
were generated.
Columns  Content
1-2      #T or #R (indicates variable identification for a row of the matrix).
3-6      The variable number (right justified).
8-31     The variable name.
The above four sections of the matrix are referred to as the matrix dictionary. Following the matrix
dictionary is the array of values.
5. The array of values. Since the array is symmetric and has diagonal cells usually containing a constant
(e.g. a correlation of 1.0 for a variable correlated with itself), only the off-diagonal, upper-right corner
of the array is stored. Note that for a covariance matrix the diagonal elements can be calculated using
standard deviations which are included in the matrix file (see point 7 below).
In the example of the 4-variable matrix above, the full array (before entering in the square format)
would be as follows:
vars        1       3       9      10
   1    1.000   -.011   -.174   -.033
   3    -.011   1.000    .131   -.105
   9    -.174    .131   1.000   -.133
  10    -.033   -.105   -.133   1.000
The portion of the array that is stored is:
vars        1       3       9      10
   1            -.011   -.174   -.033
   3                     .131   -.105
   9                            -.133
  10
Each row of this reduced array begins a new record and is written according to the format specification
in the matrix dictionary (see above).
6. A vector of variable means. The n values are recorded in accordance with the format statement in the
matrix dictionary.
7. A vector of variable standard deviations. The n values are recorded in accordance with the format
statement in the matrix dictionary.
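The reduction of a symmetric array to its stored portion (point 5) can be sketched in Python as follows. The function names stored_rows and full_matrix are hypothetical, and the sketch deals only with selecting and restoring the values, not with writing them in the Fortran format of the matrix dictionary.

```python
def stored_rows(matrix):
    """Return the off-diagonal, upper-right part of a symmetric array, row by row."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("array must be square")
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] != matrix[j][i]:
                raise ValueError("array must be symmetric")
    # Row i keeps the cells to the right of the diagonal; the last row keeps none.
    return [list(matrix[i][i + 1:]) for i in range(n - 1)]


def full_matrix(rows, diagonal=1.0):
    """Rebuild the full symmetric array, filling the diagonal with a constant."""
    n = len(rows) + 1
    full = [[diagonal] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for offset, value in enumerate(row, start=1):
            full[i][i + offset] = full[i + offset][i] = value
    return full


correlations = [
    [1.000, -.011, -.174, -.033],
    [-.011, 1.000, .131, -.105],
    [-.174, .131, 1.000, -.133],
    [-.033, -.105, -.133, 1.000],
]
for row in stored_rows(correlations):
    print(row)
```

For n variables this stores n(n-1)/2 values instead of n*n, which is 6 instead of 16 in the 4-variable example.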
2.4.2 The IDAMS Rectangular Matrix
The rectangular matrix differs from the square matrix in that the array of values may be square (and
non-symmetric) or rectangular. Further, since the rows of some arrays are not indexed by variables, e.g. a
frequency table, the rectangular matrix may or may not contain variable identification records; the
rectangular matrix does not contain variable means and standard deviations.
Programs which input/output rectangular matrices. These matrices are created by the CONFIG,
MDSCAL, TABLES and TYPOL programs. They are appropriate input for CONFIG, MDSCAL and
TYPOL.
Example.
Columns:                      111111111122222222223...
                     123456789012345678901234567890...
Matrix descriptor  |    3   4   3
Format statement   | #F (16F5.0)
Variable           | #T   2 IQ
identifications    | #T   5 EDUCATION
                   | #T   8 MOBILITY
                   | #T  12 SIBLING RIVALRY
Array of values    |    59   20   10
                   |    37   15    2
                   |    50   40    7
                   |     8   26   31
Format. The rectangular matrix contains the following:
1. A matrix-descriptor record.
Columns  Content
4        3 (indicates rectangular matrix).
5-8      The number of rows (right justified).
9-12     The number of columns (right justified).
16       Number of format (#F) statement records. (Blank implies 1).
20       Presence of row and column labels.
         blank/0  Row labels only are present (#R or #T records).
         1        Column labels only are present (#C records).
         2        Row and column labels are present (#R or #T, and #C records).
         3        No row or column labels are present.
21-40    Row variable name (optional).
41-60    Column variable name (optional).
61-80    Description of the matrix contents (optional):
         Weighted frequencies
         Unweighted freqs
         Row percentages
         Column percentages
         Total percentages
         Name of the variable for which mean values are included in the matrix.
2. A Fortran format statement describing each row of the array of values. The format describes an
80-character record. For example, a format of (16F5.0) indicates that each row of the array is recorded
with up to 16 values per record and with each value occupying 5 columns, none of which is a decimal
place.
Columns  Content
1-2      #F
3-80     The format statement, enclosed in parentheses.
3. Variable identification records. The order of these records corresponds to the order of the
variables/codes indexing the rows and columns of the matrix. When a rectangular matrix is created
by an IDAMS program, the variable/code numbers and names are retained from the input dataset or
matrix from which the array of values was derived.
Columns  Content
1-2      #T or #R for row labels, #C for column labels.
3-6      The variable number or the code value (right justified).
         The code values longer than 4 characters are replaced by ****.
8-58     The variable name or the code label.
The above three sections of the matrix are referred to as the matrix dictionary. Following the matrix
dictionary is the array of values.
4. The array of values. The full array is stored. Each row of the array begins a new record and is written
according to the format specified in the matrix dictionary.
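To show how a format statement such as (16F5.0) or (12F6.3) governs the layout of a record, the following Python sketch splits one record into values. It is an illustration under stated assumptions: only formats with a single repeated F or E descriptor are handled, blanks inside a field are ignored (so that a value written as 0.33350E 01 is accepted), and the names parse_format and read_values are hypothetical.

```python
import re


def parse_format(fmt):
    """Split a simple format such as '(12F6.3)' into (count, width, decimals)."""
    match = re.fullmatch(r"\(\s*(\d+)[FE](\d+)\.(\d+)\s*\)", fmt.strip().upper())
    if match is None:
        raise ValueError(f"unsupported format: {fmt!r}")
    count, width, decimals = (int(group) for group in match.groups())
    return count, width, decimals


def read_values(record, fmt):
    """Read the values held in one record written with the given format."""
    count, width, decimals = parse_format(fmt)
    values = []
    for start in range(0, count * width, width):
        text = record[start:start + width].replace(" ", "")
        if not text:
            break  # no more values on this record
        value = float(text)
        if "." not in text and "E" not in text.upper():
            value /= 10 ** decimals  # no explicit point: decimals are implied
        values.append(value)
    return values


print(read_values("   59   20   10", "(16F5.0)"))
print(read_values(" -.011 -.174 -.033", "(12F6.3)"))
```

A row longer than the number of values allowed per record would continue on the next record; joining such continuation records is left out of the sketch.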
2.5 Use of Data from Other Packages
2.5.1 Raw Data
Any data in the form of fixed format records in character (ASCII) mode can be input directly to IDAMS
programs. Nearly all data base and statistical packages have an export or convert function to produce
fixed format character mode data files. An IDAMS dictionary must be prepared to describe the fields
required from the data.
Free format data files with Tab, comma or semicolon used as separator can be imported directly through
the WinIDAMS User Interface. See the "User Interface" chapter for details.
Free format (any character being used as delimiter including blank) and DIF format text files can also be
imported using the IMPEX program.
Data stored in a CDS/ISIS data base can be imported to IDAMS using the WinIDIS program.
2.5.2 Matrices
The IMPEX program can be used to import free format matrices. Furthermore, matrices produced outside
IDAMS, for example a matrix provided in a publication, may also be entered according to the format given
above.
Chapter 3
The IDAMS Setup File
3.1 Contents and Purpose
To execute IDAMS programs, the user prepares a special file called the Setup file which controls the
execution of the programs. This file contains IDAMS commands and control statements necessary for
execution such as: reference to the program to be executed, the names of files, the options to be selected for the
program and variable transformation instructions, e.g.
$RUN program name
$FILES
file specifications
$SETUP
program control statements
$RECODE
Recode statements
3.2 IDAMS Commands
These commands, which start with a $, separate the different kinds of information being provided for an
IDAMS program execution. Available commands are:
$RUN program        (name of program to be executed)
$FILES [RESET]      (signals start of file specifications)
$RECODE             (signals start of Recode statements)
$SETUP              (signals start of program control statements)
$DICT               (signals start of dictionary)
$DATA               (signals start of data)
$MATRIX             (signals start of a matrix)
$PRINT              (turns printing on and off)
$COMMENT [text]     (comments)
$CHECK [n]          (checking if previous step terminated well).
The first line in a Setup file must always be a $RUN command identifying the IDAMS program to be
executed. Other commands relating to this program execution (followed by associated control statements or
data) can be placed in any order. These are then followed by the $RUN command for the next program (if
any) to be executed and so on. The individual IDAMS commands are described below in alphabetical order.
$CHECK [n]. If this command is present, the program will not be executed if the immediately preceding
program terminated with a condition code greater than n. If the command is present, but no value is
supplied, the value of n defaults to 1.
All IDAMS programs terminate with a condition code of 16 if setup errors are encountered. For
example, if TABLES is to be executed immediately after TRANS, but the user does not want to
execute TABLES if a setup error occurred in the TRANS execution, a $CHECK command after the
$RUN TABLES command will prevent execution of TABLES.
The $CHECK command may appear anywhere in the setup for the program, but is usually placed
immediately after the $RUN command.
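The decision made by $CHECK can be stated as a one-line rule. The following Python sketch only restates the description above; the function name should_execute and its arguments are hypothetical and do not correspond to anything in IDAMS itself.

```python
def should_execute(previous_code, check_present, n=None):
    """Decide whether a program step runs, given the previous condition code."""
    if not check_present:
        return True                      # no $CHECK: the step always runs
    threshold = 1 if n is None else n    # $CHECK without a value means n=1
    return previous_code <= threshold


# TRANS ended with a setup error (condition code 16); TABLES carries $CHECK.
print(should_execute(16, check_present=True))
# The same situation without $CHECK.
print(should_execute(16, check_present=False))
```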
$COMMENT [text]. The text from this command is printed in the listing of the setup. This command
has no effect on program execution.
$DATA. The $DATA command signals that the data follow.
This feature cannot be used if the program generates an output Data file and a DATAOUT file is not
specified, i.e. the data are output to a default temporary file.
This feature cannot be used if the $MATRIX feature is used.
The record length of data in the setup cannot exceed 80 characters. If longer records or lines are input,
only the first 80 characters will be used.
The print switch is turned off by the $DATA command. Thus, unless a $PRINT command immediately
follows the $DATA command, the data will not be printed.
$DICT. The $DICT command signals that an IDAMS dictionary follows.
This feature cannot be used if the program generates an output dictionary and a DICTOUT file is not
specified, i.e. if the dictionary is output to a default temporary file.
The print switch is turned off by the $DICT command. Thus, unless a $PRINT command immediately
follows the $DICT command, the dictionary will not be printed.
$FILES [RESET]. This signals the start of file specifications. Default file names are attached to each
file at the start of IDAMS program(s) execution through the use of a special file, idams.def. Any of these
default names may be changed by introducing file specification statements after the $FILES command (see
"File Specifications" below). To get back the default file names for Fortran FT files (except FT06 and FT50),
use the $FILES RESET command.
$MATRIX. The $MATRIX command signals that a matrix or set of matrices follows.
This feature cannot be used if the $DATA feature is used.
The print switch is turned off by the $MATRIX command. Thus, unless a $PRINT command immediately
follows the $MATRIX command, the matrix input will not be printed.
$PRINT. The print switch is reversed; if it was on, $PRINT will turn it off; if it was off, $PRINT will
turn it on. When printing is on, lines from the Setup file are listed as part of the program results.
When a $RUN command is encountered, the print switch is always turned on. The $DICT, $DATA,
and $MATRIX commands automatically turn the print switch off.
$RECODE. The occurrence of this command signals that the IDAMS Recode facility is to be used. The
Recode facility is described in the Recode Facility chapter of this manual.
The Recode statements normally follow the $RECODE command. If a new IDAMS command follows
immediately after a $RECODE command, Recode statements from the setup for the preceding program
will be used.
$RUN program. $RUN specifies the program to be executed and always is the first statement in the setup.
program is the 1 to 8 character name of the program.
All commands and statements following the $RUN command and up to the next $RUN command
apply to the program named.
The print switch is turned on when $RUN is encountered. See the $PRINT description.
$SETUP. The $SETUP command signals the beginning of the program control statements, i.e. the filter,
label, parameter statement, etc. (see below).
The $SETUP command is required even when program control statements follow immediately after
the $RUN command.
3.3 File Specifications
The names of the files to be used are given following the $FILES command and take the following format:
ddname=filename [RECL=maximum record length]
where:
ddname is the file reference name used internally by programs, e.g. DICTIN. The required files and
the corresponding ddnames for a particular program are given in the program write-up in the section
"Setup Structure".
filename is the physical file name. Enclose the name in primes if it contains blanks. See section "Folders
in WinIDAMS" for additional explanation.
RECL must be used if the first record in a Data file is not the longest. If RECL is not specified the
record length is taken as the record length of the first record. If a subsequent record is longer, an input
error results.
Examples:
DATAIN = A:ECON.DAT RECL=92
PRINT = RSLTS.LST
FT02 = ECON.MAT
DICTIN = \\nec0102\commondata\econ.dic
For additional explanation, see section Customization of the Environment for an Application in the User
Interface chapter.
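The file specification format can be illustrated by a short Python sketch which splits a specification line into its three parts. This is not how IDAMS itself reads the line: the function name parse_file_spec and the regular expression are assumptions made for the example, and primes are taken to be single quotes.

```python
import re

# ddname = filename (optionally enclosed in primes), then an optional RECL=n.
SPEC = re.compile(
    r"""^\s*(?P<ddname>\w+)\s*=\s*
        (?:'(?P<quoted>[^']*)'|(?P<plain>\S+))
        (?:\s+RECL\s*=\s*(?P<recl>\d+))?\s*$""",
    re.VERBOSE | re.IGNORECASE,
)


def parse_file_spec(line):
    """Return (ddname, filename, recl) for one file specification line."""
    match = SPEC.match(line)
    if match is None:
        raise ValueError(f"not a file specification: {line!r}")
    quoted = match.group("quoted")
    filename = quoted if quoted is not None else match.group("plain")
    recl = int(match.group("recl")) if match.group("recl") else None
    return match.group("ddname").upper(), filename, recl


print(parse_file_spec("DATAIN = A:ECON.DAT RECL=92"))
print(parse_file_spec("PRINT = 'my results.lst'"))
```

When RECL is absent the sketch returns None, leaving the caller to fall back on the length of the first record as described above.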
3.4 Examples of Use of $ Commands and File Specifications
Example A. Perform multiple executions of an analysis program, e.g. ONEWAY, using the same data but
with, for instance, different filters.
$RUN ONEWAY
$FILES
DICTIN = CHEESE.DIC
DATAIN = CHEESE.DAT
$SETUP
Filter 1
Other control statements for ONEWAY
$RUN ONEWAY
$SETUP
Filter 2
Other control statements for ONEWAY
Example B. Execute TABLES and ONEWAY, using the same Dictionary and Data files for each and using
the same Recode; do not list the Recode statements.
$RUN TABLES
$FILES
DICTIN = ABC.DIC
DATAIN = ABC.DAT RECL=232
$SETUP
Control statements for TABLES
$RECODE
$PRINT
Recode statements
$RUN ONEWAY
$SETUP
Control statements for ONEWAY
$RECODE
$COMMENT THE RECODE STATEMENTS INPUT FOR TABLES WILL BE REUSED FOR ONEWAY
Example C. Execute TABLES using IDAMS Recode, dictionary in the setup, data on diskette. Print the
input dictionary.
$RUN TABLES
$FILES
DATAIN = A:MYDATA
$RECODE
Recode statements
$SETUP
Control statements for TABLES
$DICT
$PRINT
Dictionary
Example D. Use the output from a data management program as input to analysis programs without
retaining the output file, e.g. execute TRANS followed by TABLES using the output data from TRANS by
specifying parameter INFILE=OUT. TABLES is not to be executed if TRANS has control statement
errors.
$RUN TRANS
$FILES
DICTIN = MYDIC4
DATAIN = MYDAT4
$SETUP
Control statements for TRANS
$RECODE
Recode statements
$RUN TABLES
$CHECK
$SETUP
Control statements for TABLES including parameter INFILE=OUT
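The structure shared by the examples above (one step per $RUN, each step divided into sections by the other $ commands) can be made explicit with a small Python sketch. It is a reading aid, not a description of how IDAMS processes the Setup file: the name split_setup is hypothetical, and it is assumed here that $PRINT, $COMMENT and $CHECK do not open a block of lines, so that lines following them stay in the section already open.

```python
INLINE = {"PRINT", "COMMENT", "CHECK"}  # assumed not to open a block of lines


def split_setup(text):
    """Group the lines of a Setup file into steps and sections."""
    steps, current = [], None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("$"):
            command, _, argument = line[1:].partition(" ")
            command, argument = command.upper(), argument.strip()
            if command == "RUN":
                steps.append({"program": argument.upper(), "sections": []})
                current = None
                continue
            if not steps:
                raise ValueError("the first line must be a $RUN command")
            section = (command, argument, [])
            steps[-1]["sections"].append(section)
            if command not in INLINE:
                current = section
        elif current is None:
            raise ValueError(f"statement outside any section: {line!r}")
        else:
            current[2].append(line.strip())
    return steps


example_d = """$RUN TRANS
$FILES
DICTIN = MYDIC4
DATAIN = MYDAT4
$SETUP
Control statements for TRANS
$RECODE
Recode statements
$RUN TABLES
$CHECK
$SETUP
Control statements for TABLES including parameter INFILE=OUT
"""
for step in split_setup(example_d):
    print(step["program"], [name for name, _, _ in step["sections"]])
```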
3.5 Program Control Statements
3.5.1 General Description
IDAMS program control statements (which follow the $SETUP command) are used to specify the parameters
for a particular execution. There are three standard control statements used by all programs:
1. the optional filter statement for selecting the cases from the Data file to be used,
2. the mandatory label statement which assigns a title for the execution,
3. a mandatory parameter statement which selects the options for the program; some program options
are standard across most programs, others are program specific.
Additional program control statements required by individual programs are described in the program
write-up.
3.5.2 General Coding Rules
Control statements are entered on lines up to 255 characters long.
Lines may be continued by entering a dash at the end of one line and continuing on the next.
The maximum length of information that may be entered for one control statement is 1024 characters
excluding the continuation characters.
Lower case letters, except for those occurring in strings enclosed in primes, are converted to upper
case.
If character strings enclosed in primes are included on a control statement, each string should be
kept on one line.
3.5.3 Filters
Purpose. A filter statement is used to select a subset of data cases. It is expressed in terms of variables
and the values assumed by those variables. For example, if variable V5 indicates sex of respondent in a
survey and code 1 represents female, then INCLUDE V5=1 is a filter statement which specifies female
respondents as the desired subset of cases.
The main filter selects cases from an input Data file and applies throughout a program execution. These
filters are available with all IDAMS programs which input a dictionary (except BUILD and SORMER).
Some programs allow for additional subsetting. Such local filtering applies only to a specific program
action, e.g. one frequency table.
Examples.
1. INCLUDE V2=1-5 AND V7=23,27,35 AND V8=1,2,3,6
2. EXCLUDE V10=2-3,6,8-9 AND V30=<5 OR V91=25
3. INCLUDE V50='FRAN','UK','MORO','INDI'
Placement. If a main filter is used, it is always the first program control statement. Each program write-up
indicates whether local filters may also be used.
Rules for coding.
The filter statement begins with the word INCLUDE or EXCLUDE. Depending on which word is
given, the filter statement defines the subset of cases to be used by the program (INCLUDE) or the
subset to be ignored (EXCLUDE).
A statement may contain a maximum of 15 expressions. An expression consists of a variable number,
an equals sign, and a list of possible values. The list of values can contain individual values and/or
ranges of values separated by commas, e.g. V2=1,5-9. Open ended ranges are indicated by < or >,
e.g. INCLUDE V1=0,3-5,>10; however the variable must always be followed by an = sign to begin
with, e.g. V1>0 must be expressed V1=>0 and V1<0 as V1=<0.
Expressions are connected by the conjunctions AND and OR.
AND indicates that a value from each of the series of expressions connected by AND must be
found.
OR indicates that a value from at least one of a series of expressions connected by OR must be
found.
Expressions connected by AND are evaluated before expressions connected by OR. For example,
expression-1 OR expression-2 AND expression-3 is interpreted as expression-1 OR (expression-2
AND expression-3). Thus, in order for a case to be in the subset defined by these expressions, either a
value from expression-1 occurs, values from both expression-2 and expression-3 occur, or a value from
each of the three expressions occurs.
Parentheses cannot be used in the filter statement to indicate precedence of expression evaluation.
Variables may appear in any order and in more than one expression. However, note that V1=1 OR
V1=2 is equivalent to the single expression V1=1,2. Note also that V1=1 AND V1=2 is an
impossible condition, as no single case can have both a 1 and a 2 as a value for variable V1.
A filter statement may optionally be terminated by an asterisk.
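The coding rules above can be illustrated with a small Python sketch. It is not IDAMS code: the function names `parse_values` and `case_selected` are hypothetical, a case is assumed to be a dictionary of numeric values keyed by variable name, and alphabetic values in primes are not handled. It only shows how AND binds before OR and how open-ended ranges behave.

```python
import re

def parse_values(spec):
    """Turn a value list such as '0,3-5,>10' into a list of predicates."""
    tests = []
    for item in spec.split(","):
        item = item.strip()
        if item.startswith("<"):
            tests.append(lambda x, b=float(item[1:]): x < b)
        elif item.startswith(">"):
            tests.append(lambda x, b=float(item[1:]): x > b)
        else:
            # A range is "low-high"; each bound may carry its own minus sign.
            rng = re.fullmatch(r"(-?\d+(?:\.\d+)?)-(-?\d+(?:\.\d+)?)", item)
            if rng:
                lo, hi = float(rng.group(1)), float(rng.group(2))
                tests.append(lambda x, lo=lo, hi=hi: lo <= x <= hi)
            else:
                tests.append(lambda x, v=float(item): x == v)
    return tests

def case_selected(statement, case):
    """Return True if the case is kept by an INCLUDE/EXCLUDE statement."""
    # Lower case is converted to upper case; a trailing asterisk is optional.
    text = statement.strip().rstrip("*").strip().upper()
    mode, _, body = text.partition(" ")
    if mode not in ("INCLUDE", "EXCLUDE"):
        raise ValueError("a filter must start with INCLUDE or EXCLUDE")

    def expression_true(expr):
        var, _, spec = expr.partition("=")
        value = case[var.strip()]
        return any(test(value) for test in parse_values(spec))

    # AND is evaluated before OR: split on OR first, then on AND.
    matched = any(
        all(expression_true(e) for e in re.split(r"\s+AND\s+", group))
        for group in re.split(r"\s+OR\s+", body.strip())
    )
    return matched if mode == "INCLUDE" else not matched
```

For instance, `case_selected("INCLUDE V1=1 OR V2=2 AND V3=3", case)` keeps a case when V1 is 1, or when V2 is 2 and V3 is 3 together.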
The variables in a filter.
Numeric and alphabetic character type variables can be used.
R-variables are not allowed in main filters. They are allowed in analysis specific or local filters.
Note that the REJECT statement in Recode can be used to filter cases on R-variables.
The values in a filter for numeric variables.
Numeric values may be integer or decimal, positive or negative, e.g. 1, 2.4, -10.
Values are expressed singly or in ranges and are separated by commas, e.g. 1-5, 8, 12-13.
For numeric filter variables, variable values in the data file are first converted to real binary mode
using the correct number of decimal places from the dictionary and the comparison with the filter
value is then done numerically. Note that this means that for a variable with decimals, filter
values must be given with the decimal point in the correct place, e.g. V2=2.5-2.8.
Cases for which a filter variable has a non-numeric value are always excluded from the execution.
The values in a filter for alphabetic variables.
Values of 1-4 characters are expressed as character strings enclosed in primes, e.g. 'F'. Blanks on
the right need not be entered, i.e. trailing blanks will be added.
If the variable has a field width greater than 4, only the first 4 characters from the data are used
for the comparison with the filter value.
Only single values, separated by commas, are allowed; ranges of character strings cannot be used.
Note. The first statement following a $SETUP command is recognized as a main filter if it starts with
INCLUDE or EXCLUDE. If the first non-blank characters are anything else, the statement is assumed to
be a label.
3.5.4 Labels
Purpose. A label statement is used to title the results of a program execution. Some IDAMS programs
print this label once at the start of the results, while others use it to title each page.
Examples.
1. TABLES ON 1998 ELECTION DATA - JULY, 2000
2. PRINTING OF CORRECTED A34 SURVEY DATA
Placement. A label statement is required by all IDAMS programs. The label is either the first or (if a
filter is used) the second program control statement. If no special labeling is desired, it is still necessary to
include a blank line.
Rules for coding.
The statement may be a string of any characters from which the first 80 characters are used, i.e. if a
label longer than 80 characters is input, it is truncated to the first 80.
If the label is not enclosed in primes, lower case letters are converted to upper case and blanks are
reduced to one blank.
The label should not begin with the words INCLUDE or EXCLUDE.
3.5.5 Parameters
Purpose. All IDAMS programs have been designed in a fairly general way, allowing the user to select
among several options. These options and values are generated by parameters and are supplied on program
control statements, such as parameters, regression specifications, table specifications, etc. Parameters
are specified by the user in a standard keyword format with an English word or abbreviation being used to
identify an option.
Examples.
1. WRITE=CORR WEIGHT=V3, PRINT=(DICT, PAIR)
(PEARSON - parameters)
2. DEPV=V5 METHOD=STEP VARS=(R3-R9,V30) WRITE=RESID
(REGRESSN - regression parameters)
3. ROWV=(V3,V9,V10) COLV=(V4,V11,V19) CELLS=(FREQ,ROWPCT) STATS=(CHI,TAUA)
(TABLES - table description)
Placement. The main parameter statement is required by all IDAMS programs and it must follow the
label statement. If all defaults are chosen, a line with a single asterisk must be supplied. Each program
write-up indicates the type and content of any other parameter lists that are required and indicates their
position relative to other program control statements.
Presentation of keyword parameters in the program write-ups. All write-ups have a standard
notation in the sections which describe the program parameters which are available. The basic notation is
as follows:
A slash indicates that only one of the mutually exclusive items can be chosen, e.g. SAMPLE/POPUL
or PRINT=CDICT/DICT.
A comma indicates that all, some, or none of the items may be chosen, e.g. STATS=(TAUA, TAUB,
GAMMA).
When commas and slashes are combined, only one (or none) of the items from each group separated
by commas and connected by slashes may be chosen, e.g. PRINT=(CDICT/DICT, LONG/SHORT).
Defaults, if any, are in bold, e.g. METHOD=STANDARD/STEPWISE/DESCENDING. A default
is a parameter setting that the program assumes if an explicit selection is not made by the user.
When a parameter setting is obligatory but has no default, the words "No default" are used.
Words in upper case are keywords. Words or phrases in lower case indicate that the user should replace
the word or phrase with an appropriate value, e.g. MAXCASES=n, VARS=(variable list).
Types of keywords. There are 5 types of keywords used for specifying parameters.
1. A keyword followed by a character string. This type of keyword identifies a parameter consisting of a
string of characters, e.g.
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input dictionary and data files.
A user might specify:
INFILE=IN2
(the ddnames would be DICTIN2 and DATAIN2)
2. A keyword followed by one or more variable numbers, e.g.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
VARS=(variable list)
Use only the variables in the list; the numbers may be listed in any order with or without V-
notation, i.e. VARS=(V1-V3) or VARS=(1-3). Note that the program write-ups always indicate
whether V- and R-type variables or only V-type variables may be used.
A user might specify:
WEIGHT=V39
(the weight variable is V39)
VARS=(32,1,10)
(only the variables specified are to be used)
3. A keyword followed by one or more numeric values, e.g.
MAXCASES=n
Only the first n cases will be processed.
IDLOC=(s1,e1,s2,e2, ...)
Starting and ending columns of 1-5 case identification fields.
A user might specify:
MAXCASES=100
(only the first 100 cases will be used)
IDLOC=(1,3,7,9)
(case ID is located in columns 1-3 and 7-9)
4. A keyword followed by one or more keyword values. The keyword values may be a mixture of mutually
exclusive options (separated by slashes) and independent options (separated by commas). For example:
PRINT=(OUTDICT/OUTCDICT/NOOUTDICT,DATA)
OUTD Print the output dictionary without C-records.
OUTC Print the output dictionary with C-records if any.
NOOU Do not print output dictionary.
DATA Print the values of the output variables.
A user might specify:
PRINT=(OUTC,DATA)
(full output dictionary is printed, and data values are printed)
PRINT=NOOUTDICT
(no output dictionary or data values are printed)
5. A set of mutually exclusive keywords. Only one of a set of options can be selected, e.g.
SAMPLE/POPULATION
SAMP Compute the variance and/or standard deviation using the sample equation.
POPU Use the population equation.
All keywords except the last type are followed by an equals sign. The character, numeric, and keyword
values that follow the equals sign are called the associated values.
Rules for coding.
Rules for specifying keywords
Only the first four letters of a keyword or an associated keyword need to be specified, although the
whole keyword may be supplied. Thus, TRAN is an appropriate abbreviated form of the keyword
TRANSVARS. There are no abbreviations for keywords with four letters or less.
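The abbreviation rule can be sketched in Python as follows. The function name `same_keyword` is hypothetical. The rule only states that the first four letters suffice and that the whole keyword may be supplied; how IDAMS treats spellings of intermediate length is not stated here, so accepting any prefix of four or more letters is an assumption of this sketch.

```python
def same_keyword(entered, full):
    """Return True if `entered` is an acceptable spelling of keyword `full`."""
    entered, full = entered.upper(), full.upper()
    if len(full) <= 4:
        # Keywords of four letters or less have no abbreviation.
        return entered == full
    # Assumption: any prefix of at least four letters is accepted.
    return len(entered) >= 4 and full.startswith(entered)
```

Under these assumptions `same_keyword("TRAN", "TRANSVARS")` is true, while `same_keyword("TRA", "TRANSVARS")` is not.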
Rules for specifying associated values
Associated value is a list of items.
The items in the list are separated by commas.
If there are two or more items, the list must be enclosed in parentheses.
Ranges of integer numeric values or variables are indicated by a dash.
Ranges of decimal numeric values are not allowed.
For example:
R=(V2,3,5)
PRIN=(DICT,DATA,STAT)
MAXC=5
TRAN=(V5,V10-V25,V32)
IDLOC=(1,3,7,8)
Associated value is a character string.
The string must be enclosed in primes if it contains any non-alphanumeric characters, e.g.
FNAME='EDUCATION: WAVE 1'. Note that blank, dot and comma are non-alphanumeric
characters. When in doubt, use primes.
Two consecutive primes (not a quotation mark) must be used to represent a prime, e.g. ANAME='KEVIN''S'
(the extra prime is deleted once the string is read).
A string should not be split across lines.
Rules for specifying lists of keywords
Keywords (with or without associated values) are separated from one another by a comma or by one
or more blanks, e.g.
FNAME=FRED, TRAN=3 KAISER
Lists of keywords may spread across several lines but in this case there must be a dash (-) at the end
of each line indicating continuation, e.g.
FNAME=FRED -
TRAN=3 -
KAISER
Keywords may be given in any order. If a keyword appears more than once in the list, then the last
value encountered is used.
A keyword may not be split across lines.
Each list of keywords may optionally be terminated by an asterisk.
If all default options are chosen, a line with a single asterisk must be supplied.
Details of most common parameters not described fully in each program write-up.
1. BADDATA. Treatment of non-numeric data values.
BADDATA=STOP/SKIP/MD1/MD2
When non-numeric characters (including embedded blanks and all-blank fields) are found in numeric
variables, the program should:
STOP Terminate the execution.
SKIP Skip the case.
MD1 Replace non-numeric values by the first missing data code (or 1.5 × 10^9 if the 1st missing
data code is not specified).
MD2 Replace non-numeric values by the second missing data code (or 1.6 × 10^9 if the 2nd missing
data code is not specified).
For SKIP, MD1, and MD2 a message is printed about the number of cases so treated.
2. MAXCASES. The maximum number of cases to be processed.
MAXCASES=n
The value given is the maximum number of cases that will be processed. If n=0, no cases are
read; this option can be used to test setups without reading the data. If the parameter is not
specified at all, all cases from the input file are processed.
3. MDVALUES. Specify which, if either, of the missing data codes are to be used to check for missing
data in variable values. Note that some programs have, in addition, a MDHANDLING parameter to
specify how data values which are missing are to be handled.
MDVALUES=BOTH/MD1/MD2/NONE
BOTH Variable values will be checked against the MD1 codes and against the ranges of codes
defined by MD2.
MD1 Variable values will be checked only against the MD1 codes.
MD2 Variable values will be checked only against the ranges of codes defined by MD2.
NONE MD codes will not be used. All data values will be considered valid.
The default is always that both MD codes are used.
4. INFILE, OUTFILE. Specifying ddnames with which input and output dictionary and data files are
defined.
INFILE=IN/xxxx
OUTFILE=OUT/yyyy
Input and output Dictionary and Data files for IDAMS programs are defined with ddnames
DICTxxxx, DATAxxxx, DICTyyyy and DATAyyyy. These normally default to DICTIN, DATAIN,
DICTOUT, DATAOUT. If several IDAMS programs are being executed in one setup, for example
programs using different datasets as input, or when using the output from one program as input
directly to another (chaining), then it is sometimes necessary to change these defaults.
5. WEIGHT. This parameter specifies the variable whose values are to be used for weighting data cases.
WEIGHT=variable number
The variable specified may be a V-type or R-type, integer or decimal valued. Cases with missing,
zero, negative and non-numeric weight values are always skipped and a message is printed about
the number of cases so treated. If the WEIGHT parameter is not specified, no weighting is
performed.
6. VARS. This parameter and similar ones such as ROWVARS, OUTVARS, CONVARS, etc. are used
to specify a list of variables.
VARS=(variable list)
If more than one variable is specified, the list must be enclosed in parentheses.
Rules for specifying variable lists
Variables are specified by a variable number preceded by a V or an R. A V denotes a variable
from an IDAMS dataset or matrix. An R denotes a resultant variable from a Recode operation.
Note that internal to the programs and in the results, V- and R-type variables are distinguished by
the sign of the variable number; positive numbers denote V-type variables and negative numbers
denote R-type variables.
To specify a set of contiguously numbered variables, such as V3, V4, V5, V6, connect two variable
numbers, each preceded by a V, with a dash (e.g. V3-V6 is valid; V3-6 is invalid). Use ranges
with caution if the dataset has gaps in the variable numbering, as all variables within the range
must appear in the dataset or matrix, i.e. V6-V8 implies V6,V7,V8. If V7 is not in the dictionary,
then an error message will result. V-type and R-type variables may not be mixed in a range, i.e.
V2-R5 is invalid.
Single variable numbers or ranges of variable numbers are separated by commas.
In general, for data management programs, variables may be listed more than once, while for
analysis programs specifying a variable more than once is inappropriate and will cause termination.
See the program write-up for details.
Blanks may be inserted anywhere in the list.
In general, variables may be specified in any order. The order of variables may, however, have
special meaning in some programs; check the program write-up for details.
Examples:
VARS=(V1-V6, V9, V16, V20-V102, V18, V11, V209)
OUTVARS=(R104, V7, V10-V12, R100-R103, -
V16, V1)
CONVARS=V10
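The range rules for variable lists can be illustrated with a short Python sketch. The function name `expand_variable_list` and the representation of the dictionary as a set of variable names are assumptions made for illustration; only the V/R-prefixed notation is covered, and line continuation is not handled.

```python
import re

def expand_variable_list(text, dictionary):
    """Expand e.g. '(V3-V6, V9)' into ['V3', 'V4', 'V5', 'V6', 'V9']."""
    result = []
    # Blanks may be inserted anywhere in the list, so they are dropped first.
    for item in text.replace(" ", "").strip("()").upper().split(","):
        m = re.fullmatch(r"([VR])(\d+)(?:-([VR])(\d+))?", item)
        if m is None:
            # Covers forms such as V3-6, where the second number lacks its V.
            raise ValueError(f"invalid item in variable list: {item!r}")
        kind, first, kind2, last = m.groups()
        if kind2 is None:
            numbers = [int(first)]
        elif kind2 != kind:
            raise ValueError(f"V and R variables mixed in a range: {item!r}")
        else:
            numbers = range(int(first), int(last) + 1)
        for n in numbers:
            name = f"{kind}{n}"
            # Every V-variable implied by a range must exist in the dictionary.
            if kind == "V" and name not in dictionary:
                raise ValueError(f"{name} is not in the dictionary")
            result.append(name)
    return result
```

With a dictionary that lacks V7, the list `(V6-V8)` raises an error in this sketch, mirroring the caution about gaps in the variable numbering.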
3.6 Recode Statements
The IDAMS Recode facility permits the temporary recoding of data during execution of IDAMS programs.
Results from such recoding operations (together with variables transferred from the input file) can also be
saved in permanent files using the TRANS program.
Recoding is invoked by the $RECODE command. This command and the associated Recode statements are
placed after the $RUN command for the program with which the Recode facility is to be used. For example:
$RUN program                    $RUN ONEWAY
$FILES                          $FILES
File definitions                DICTIN=MYDIC
                                DATAIN=MYDAT
$RECODE                         $RECODE
Recode statements               R10 = BRAC(V3,0-10=1,11-20=2)
                                R11 = SUM(V7,V8)
                                NAME R10 'EDUC LEVEL', R11'TOTAL INCOME'
$SETUP                          $SETUP
Program control statements      INCOME BY EDUC,SEX
                                BADDATA=SKIP
                                CONVARS=(R10,V2) DEPVAR=R11
A complete description of the Recode facility is provided in the Recode Facility chapter.
Chapter 4
Recode Facility
4.1 Rules for Coding
Recode statements take the form:
lab statement
where lab is an optional 1-4 character label starting in position 1 of the line and followed by at least
one blank. Unlabelled statements must start in position 2 or beyond.
The label allows control statements such as GO TO to refer to a specific statement, e.g. GO TO ST1.
Labels cannot be given on initialization statements (CARRY, MDCODES, NAME).
To continue a statement onto another line, enter a dash at the end of the line and continue from any
position on the next line.
The maximum line length is 255 characters and the maximum total number of characters for a statement
is 1024 excluding continuation dashes and trailing blanks after the dash.
4.2 Sample Set of Recode Statements
To give some idea of how the elements of the Recode language fit together, a sample set of Recode statements
is given below.
$RECODE
IF V5 LT 8 THEN REJECT (exclude cases where V5 < 8)
IF NOT MDATA(V6) THEN R51=TRUNC(V6/4) -
ELSE R51=0
R52=BRAC(V10,0-24=1,25-49=2,50-74=3, - (group values of V10)
74-99=4,TAB=1)
R53=BRAC(V11,TAB=1) (group V11 the same way as V10)
IF V26 INLIST(1-10) THEN R54=1 AND -
R55=1 ELSE R54=2
IF R54 EQ 1 THEN GO TO L1
R55=99
R56=V15 + V35
GO TO L2
L1 R56=99
L2 R57=COUNT(1,V20-V27,V29) (count how many of the listed
variables have the value 1)
NAME R52 'GROUPED AGE', -
R53 'GROUPED AGE AT MARRIAGE'
MDCODES R55(99),R56 (99)
4.3 Missing Data Handling
Except in the special functions MAX, MEAN, MIN, STD, SUM, VAR, Recode does not automatically check
the values of variables for missing data. The user must therefore control specifically for missing data before
doing calculations with variables. The MDATA function is available for this purpose; e.g.
IF MDATA (V5,V6) THEN R1=999 ELSE R1=V5+V6
There are two additional functions, MD1 and MD2, which return the 1st or 2nd missing data code value for
a variable; e.g.
R2=MD1(V6)
assigns R2 the value of the 1st missing data code of V6.
Finally, missing data codes can be assigned to R or V variables with the MDCODES definition statement;
e.g.
MDCODES R3(8,9)
assigns 8 and 9 as the 1st and 2nd missing data codes for R3.
Sometimes a set of Recode statements does not assign a value to an R-variable for a particular data record.
The R-variable will then take the default MD1 value of 1.5 × 10^9 to which it is initialized. To change this
to a more acceptable missing data value, we must test if the value is large and, if so, assign an appropriate
missing data value, e.g.
IF R100 GT 1000000 THEN R100=99
MDCODES R100(99)
4.4 How Recode Functions
Syntax checking and interpretation. Recode statements are read and analyzed for errors prior to
interpretation of other IDAMS program control statements and prior to program execution. If errors are
found, diagnostic messages are printed and execution of the program is terminated.
Results. Recode prints out the Recode statements input by the user along with syntax errors detected
if any. This occurs before the program is executed, i.e. before the interpretation of the program control
statements is printed.
Initialization before starting to process the Data file. If there are no syntax errors, tables, missing
data codes, names, etc. are initialized (according to the initialization/definition statements supplied by the
user) before starting to read the data. R-variables in CARRY statements are initialized to zero.
Initialization before processing each data case. At the start of processing of each case and before
execution of the Recode statements for that case, all R-variables, except those listed in CARRY statements,
are initialized to the IDAMS internal default missing data value (1.5 × 10^9).
Execution of Recode statements. The actual recoding takes place after the data for a case is read and
after the main filter has been applied. Cases not passing the filter are not passed to the recoding routines.
Recode variables cannot therefore be used in main filters.
The use of the Recode statements is sequential (i.e. the first statement is used first, then the second, third,
etc.) except as modified by GO TO, BRANCH, RETURN, REJECT, ENDFILE, ERROR statements (the
control statements). When all statements have been used, the case is passed to the IDAMS program being
executed.
When the IDAMS program has finished using the case, the next case passing the main filter is processed,
the R-variables (except the CARRY variables) being reinitialized to missing data and the Recode statements
executed for that case and so on until the end of the data file is reached.
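The processing order described above can be summarized in a Python sketch. It is a simplified model, not the IDAMS implementation: `process_cases`, the callable `main_filter`, and the callable `recode` (which returns False as a stand-in for REJECT) are all hypothetical names, and keeping CARRY values from a rejected case is an assumption.

```python
from collections import defaultdict

DEFAULT_MD1 = 1.5e9  # internal default missing data value quoted in the text

def process_cases(cases, main_filter, recode, carry=()):
    """Yield (V-values, R-values) for each case handed to the program."""
    carried = {name: 0.0 for name in carry}        # CARRY variables start at zero
    for v in cases:
        if not main_filter(v):
            continue                               # filtered cases never reach Recode
        r = defaultdict(lambda: DEFAULT_MD1)       # other R-variables restart as missing
        r.update(carried)
        keep = recode(v, r)
        for name in carried:                       # assumption: CARRY values always persist
            carried[name] = r[name]
        if keep is False:
            continue                               # stand-in for REJECT
        yield v, r

def example_recode(v, r):
    r["R1"] += 1                                   # running count held in a CARRY variable
    if v["V1"] == 2:
        r["R2"] = v["V1"] * 10                     # R2 stays at the default otherwise
```

Running `process_cases` with `carry=("R1",)` shows R1 accumulating across cases while R2 falls back to the default missing value whenever no statement assigns it.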
Testing Recode statements. Errors in logic can be made which are not detectable by the Recode facility.
To check the intended results against those generated by Recode, the Recode statements should be tested
on a few records using the LIST program with the parameter MAXCASES set, say, to 10. The data values
for the variables input and the corresponding result variables can then be inspected.
Files used by Recode. When a $RECODE command is encountered in the Setup file, subsequent lines
are copied into a work file on unit FT46. The RECODE program reads Recode statements from this file and
analyzes them for errors prior to interpretation of other IDAMS program control statements and prior to
program execution. If errors are found, diagnostic messages are printed and execution of the entire IDAMS
step is terminated.
Interpreted statements are written in the form of tables to a work file on unit FT49 from where they are
read by the IDAMS program being executed.
Messages about Recode statements are written to unit FT06 along with results from the IDAMS program
being executed.
4.5 Basic Operands
Variables. Variables in Recode refer either to input variables (V-variables) or result variables (R-variables).
They are defined as follows:
Input variables (Vn). V followed by a number. These are variables as defined by the input
dictionary. Their values may be changed by Recode (e.g. V10=V10+V11). Variables should
normally be numeric but alphabetic variables of not more than 4 characters can also be used; in
particular, they can be recoded to numeric values.
Result variables (Rn). R followed by a number (1 to 9999). These are variables that are
created by the user. R-variables (except for those listed in CARRY statements - see below) are
initialized to the default missing value of 1.5 × 10^9 before processing of each case.
To use an R-variable in a program, specify an R (instead of V) on the variable list attached to a
keyword parameter (e.g. WEIGHT=R50 or VARS=(R10-R20)). When printed out by programs,
a result variable number is sometimes identified by a negative sign. Thus, variable 10 is V10
and variable -10 is R10. It is less confusing to use numbers for the result variables which are
distinct from input variable numbers. R-variables are always numeric.
Numeric constants. Constants may be integer or decimal, positive or negative (e.g. 3, 5.5, -50, -0.5).
Character constants. Character constants are enclosed in single primes (e.g. 'ABCXYZ', 'M'). A prime
within a character constant must be represented by two adjacent primes (e.g. DON'T would be written:
'DON''T'). Character constants are used in the NAME statement to assign names to new variables. They
can also be used in logical expressions to test values of alphabetic variables (e.g. IF V10 EQ 'M'); only the
first 4 characters are used in such comparisons and constants/variable values of length < 4 are padded on
the right with blanks. Character constants cannot be used in arithmetic functions (except BRAC).
4.6 Basic Operators
Arithmetic operators. Arithmetic operators are used between arithmetic operands. Available operators,
in precedence order, are:
- (negation)
EXP x (exponentiation to the power x, where -181 < x < 175)
* (multiplication)
/ (division)
+ (addition)
- (subtraction)
Relational operators. Relational operators are used to determine whether or not two arithmetic values
have a particular relationship to one another. The relational operators are:
LT (less than)
LE (less than or equal)
GT (greater than)
GE (greater than or equal)
EQ (equal)
NE (not equal)
Logical operators. Logical operators are used between logical operands. Logical operands take only the
values true or false. These are:
NOT
AND (both)
OR (either)
4.7 Expressions
An expression is a representation of a value. A single constant, variable, or function reference is an expression.
Combinations of constants, variables, functions and other expressions with operators are also expressions.
Recode can evaluate arithmetic and logical expressions. Note that brackets can be used anywhere in an
expression to clarify the order in which it is to be evaluated.
Arithmetic expressions. Arithmetic expressions are created using arithmetic operators and variables,
constants and arithmetic functions. They yield a numeric value. Examples are:
V732 (the value of V732)
44 (the constant 44)
R67/V807 + 25 (25 plus the value of R67 divided by the value of V807)
LOG(R10) (the log of the value of R10)
Logical expressions. Logical expressions are evaluated to a true or false value. Logical variables do
not exist in the Recode language, so that the result of logical expressions cannot be assigned to a variable.
Logical expressions can only be used in IF statements. Examples are:
R5 EQ V333
True if the value of R5 is equal to the value of V333, and false otherwise.
(V62 GT 10) OR (R5 EQ V333)
True if either of the logical expressions results in a true value, and false if both result in a false value.
MDATA(V10,R20) AND V9 GT 2
True if the value of V10 or the value of R20 is a missing data code and the value of V9 is larger than 2, false
otherwise.
4.8 Arithmetic Functions
Arithmetic functions all return a single numeric value. The argument list for functions can be simple lists
enclosed in parentheses or highly structured lists involving both keyword elements and elements in specific
positions in the list. The available functions are:
Function  Example                             Purpose
ABS       ABS(R3)                             Absolute value
BRAC      BRAC(V5,TAB=1,ELSE=9, -             Univariate grouping
          1-10=1,11-20=2)
          BRAC(V10,'F'=1,'M'=2)               Alphabetic recoding
COMBINE   COMBINE V1(2), V42(3)               Combination of 2 variables
COUNT     COUNT(1,V20-V25)                    Counting occurrences of a value
                                              across a set of variables
LOG       LOG(V2)                             Logarithm to the base 10
MAX       MAX(V10-V20)                        Maximum value
MD1,MD2   MD1(V3)                             Value of missing data code
MEAN      MEAN(V5-V8,MIN=2)                   Mean value
MIN       MIN(V10-V20)                        Minimum value
NMISS     NMISS(V3-V6)                        Number of missing data values
NVALID    NVALID(V3-V6)                       Number of non-missing values
RAND      RAND(0)                             Random number
RECODE    RECODE V7,V8,(1/1)(1/2)=1, -        Multivariate recoding
          (2-3/3)=2, ELSE=0
SELECT    SELECT (BY=V10,FROM=R1-R5,9)        Selecting the value of one of a set of
                                              variables according to an index variable
SQRT      SQRT(V2)                            Square root
STD       STD(V20-V25,MIN=4)                  Standard deviation
SUM       SUM(V6,V8,V9-V12,MIN=3)             Sum of values
TABLE     TABLE(V5,V3,TAB=2,ELSE=9)           Bivariate recoding
TRUNC     TRUNC(V26/3)                        Integer part of the argument's value
VAR       VAR(V6,R5-R10,MIN=7)                Variance
The exact syntax for each function is given below.
ABS. The ABS function returns a value which is the absolute value of the argument passed to the function.
Prototype: ABS(arg)
Where arg is any arithmetic expression for which the absolute value is to be taken.
Example:
R5=ABS(V5-V6)
BRAC. The BRAC function returns a value which is derived from performing specified operations (rules)
upon a single variable.
Prototype: BRAC(var [,TAB=i] [,ELSE=value] [,rule1,...,rule n] )
Where:
var is any V- or R-type variable whose values are being tested.
TAB=i either numbers the set of rules and the associated ELSE established in this use of BRAC
(optional), or references a set of rules established in a previous use of BRAC. Note: The ELSE clause
is considered part of the set of rules.
ELSE=value is used when the value of var cannot be found in the rules given. If ELSE=value is
omitted, ELSE=99 is assumed, i.e. BRAC always recodes.
rule1, rule2,...,rule n are the set of rules defining the values to be returned depending on the value of
var. The rules are expressed in the form: x=c, where x defines one or more codes and c is the value to
be returned when the value of var equals the code(s) defined by x. The possible rules (where m is any
numeric or character constant) are:
>m=c (if the value of var is greater than m, return value c).
<m=c (if the value of var is less than m, return value c).
m=c (if the value of var is equal to m, return value c).
m1-m2=c (if the value of var is in the range m1 to m2, i.e. m1<=var<=m2, return value c).
As many rules may be given as necessary. They are evaluated from left to right, and the first one
which is satisfied is used. Note that > and < are used, not the GT and LT logical operators.
ELSE, TAB, and the rules may be specified in any order.
Ranges of alphabetic values, e.g. A-C, are not allowed.
Examples:
R1=BRAC(V10,TAB=1,ELSE=9,1-10=1,11-20=2,<0=0)
The value of R1 will be 1 if variable 10 is in the range 1 to 10, 2 if V10 is in the range 11 - 20, and 0 if V10
is less than 0. If V10 has any other value, e.g. 10.5, 25, 0, then the ELSE clause would be applied, and
R1 would be 9. These bracketing rules are labelled table 1 so they can be re-used, e.g.
R2=V1 + BRAC(V2, TAB=1) * 3
In this example V2 would be bracketed by the same rules as for V10 in the previous example. R2 would be
set to V1 + (the result of bracketing multiplied by 3).
R100=BRAC(V10,'F'=1,'M'=2,ELSE=9)
This is an example of recoding an alphabetic variable, which has values 'F' or 'M', to numeric values of 1
and 2.
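The left-to-right evaluation of BRAC rules can be modelled in a few lines of Python. This is a sketch of the rule semantics only: the function name `brac` and the tuple encoding of rules are hypothetical, and the TAB mechanism is represented simply by re-using the same rule list.

```python
def brac(value, rules, else_value=99):
    """Return the result of the first satisfied rule, or the ELSE value.

    Each rule is (test, result), where test is one of
    ('range', m1, m2), ('>', m), ('<', m) or ('=', m).
    """
    for test, result in rules:
        op = test[0]
        if op == "range":
            if test[1] <= value <= test[2]:
                return result
        elif op == ">":
            if value > test[1]:
                return result
        elif op == "<":
            if value < test[1]:
                return result
        elif op == "=":
            if value == test[1]:
                return result
        else:
            raise ValueError(f"unknown rule type: {op!r}")
    return else_value  # ELSE=99 is assumed when no ELSE is given

# Rules corresponding to 1-10=1, 11-20=2, <0=0 (re-usable, like TAB=1).
TABLE_1 = [(("range", 1, 10), 1), (("range", 11, 20), 2), (("<", 0), 0)]
```

With `TABLE_1` and `else_value=9`, a value of 15 gives 2, a value of -3 is caught by the `<0` rule and gives 0, and values such as 10.5, 25 or 0 fall through to the ELSE value.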
COMBINE. The COMBINE function returns a unique value for each combination of values of the variables
that are used as arguments. This function is normally used with categorical variables.
Prototype: COMBINE var1(n1), var2(n2),...,varm(nm)
Where:
var1 to varm are the V- or R-variables to combine.
n1 to nm are the maximum codes +1 of the respective variables.
The list of arguments to the COMBINE function is not enclosed in parentheses.
Each variable must have only non-negative and integer values.
The values returned are computed by the following formula:
V1 + (n1 * V2) + (n1 * n2 * V3) + (n1 * n2 * n3 * V4) etc.
The user, however, would normally determine the result of the function by listing the combinations of
values in a table as in the first example below.
Examples:
R1=COMBINE V6(2), R330(3)
Assume that V6 has two codes (0,1) representing men and women respectively and R330 has three codes
(0,1,2) representing young, middle aged and old respondents, the statement will combine the codes of V6
and R330 to give a single variable R1 as follows:
V6   R330   R1
0    0      0    Young men
1    0      1    Young women
0    1      2    Middle aged men
1    1      3    Middle aged women
0    2      4    Old men
1    2      5    Old women
Since V6 has two codes, and R330 has three, R1 will have six. In the above example, if V6 had codes 1 and
2 instead of 0 and 1, the maximum value should be stated as 3. This would allow for the values of 0,1,
and 2, although code 0 would never appear. To avoid these extra codes, the user should first recode such
variables to give a contiguous set of codes starting from 0, e.g. BRAC(V6,1=0,2=1).
Restrictions:
There may be up to 13 variables.
The COMBINE function cannot be used with other functions in the same assignment statement.
Care should be taken to accurately specify the maximum codes when using the COMBINE function.
Otherwise, non-unique values will be generated. For example, with COMBINE V1(2), V2(4) the
function will return a value of 7 for the pair of values, V1=1 and V2=3, and will also return a value of
7 for the pair of values V1=3 and V2=2. If values of 3 might exist for V1, then n1 should be specified
as 4 (1 + maximum code).
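The COMBINE formula and the effect of an understated maximum code can be checked outside IDAMS. The Python sketch below is an illustration only: the function name combine and its list of (value, n) pairs are assumptions made here, not part of the Recode language.

```python
def combine(pairs):
    """Return v1 + n1*v2 + n1*n2*v3 + ... for a list of (value, n) pairs.

    Each pair holds a variable's value and its declared n (maximum code + 1).
    Hypothetical helper mirroring the formula given for COMBINE.
    """
    result = 0
    multiplier = 1
    for value, n in pairs:
        if value < 0 or value != int(value):
            raise ValueError("COMBINE needs non-negative integer values")
        result += multiplier * value
        multiplier *= n
    return result


# COMBINE V6(2), R330(3): the six sex-by-age combinations map to 0..5.
codes = [combine([(v6, 2), (r330, 3)]) for r330 in range(3) for v6 in range(2)]
print(codes)

# COMBINE V1(2), V2(4) with an out-of-range V1=3: two different pairs collide.
print(combine([(1, 2), (3, 4)]), combine([(3, 2), (2, 4)]))
```

Declaring V1 with n1=4 instead of 2 gives the two pairs different results, which is the correction recommended above.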
COUNT. The COUNT function returns a value which is equal to the number of times the value of a variable
or constant occurs as the value of one of the variables in the list varlist.
Prototype: COUNT(val,varlist)
Where:
val is normally a constant but can also be a V- or R-variable.
varlist gives the V- and/or R-variables whose values are to be checked against val.
Examples:
R3=COUNT(1,V20-V25)
R3 will be assigned a value equal to the number of times the value 1 occurs in the 6 variables V20-V25. This
might be used for example to count the number of YES responses by a respondent to a set of questions.
R5=COUNT(V1,V8-V10)
R5 will be assigned a value equal to the number of times that the value of V1 occurs also as the value of
variables V8-V10.
LOG. The LOG function returns a floating-point value which is the logarithm to the base 10 of the argument
passed to the function.
Prototype: LOG(arg)
Where arg is any arithmetic expression for which the log to the base 10 is to be taken.
Examples:
R10=LOG(V30)
Note: The logarithm of any number X to any other base B can readily be found by the following simple
transformation:
R1=LOG(X)/LOG(B)
For the natural logarithm (base e), this becomes simply: R1=2.302585 * LOG(X).
Thus R1=2.302585 * LOG(V30) will assign to R1 the natural logarithm of variable 30.
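The change-of-base rule and the constant 2.302585 can be verified numerically. The Python sketch below is an illustration only; the helper name log_base and the sample value 30.0 are assumptions made here.

```python
import math


def log_base(x, b):
    """Logarithm of x to base b, built from base-10 logarithms only."""
    return math.log10(x) / math.log10(b)


x = 30.0  # arbitrary positive sample value

# Natural logarithm obtained from the base-10 logarithm, as in the note above.
natural = 2.302585 * math.log10(x)

print(natural, math.log(x))
print(log_base(8.0, 2.0))
```

The constant is 1/LOG(e) rounded to six decimals, so the two printed natural logarithms agree only to about six significant digits.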
MAX. The MAX function returns the maximum value in a set of variables. Missing data values are
excluded. The MIN argument can be used to specify the minimum number of valid values for a maximum
to be calculated. If fewer valid values are found, the default missing data value 1.5 × 10^9 is returned.
Prototype: MAX(varlist [,MIN=n] )
Where:
varlist is a list of V- and R-type variables, and constants.
n is the minimum number of valid values for computation of the maximum value. n defaults to 1.
Example:
R12=MAX(V20-V25)
MD1, MD2. The MD1 (or MD2) function returns a value which is the first (or second) missing data code
of the variable given as the argument.
Prototype: MD1(var) or MD2(var)
Where var is any input variable (V-variable) or previously defined result variable (R-variable).
Example:
R12=MD2(V20)
For each case processed, R12 will be assigned the second missing data code for input variable V20.
MEAN. The MEAN function returns the mean value of a set of variables. Missing data values are excluded.
The MIN argument can be used to specify the minimum number of valid values for a mean to be calculated.
If fewer valid values are found, the default missing value 1.5 × 10^9 is returned.
Prototype: MEAN(varlist [,MIN=n] )
Where:
varlist is a list of V- and R-type variables, and constants.
n is the minimum number of valid values for computation of the mean value. n defaults to 1.
Example:
R15=MEAN(R2-R4,V22,V5,MIN=2)
The result will be the mean of the specified variables, if at least two of the variables have non-missing values.
Otherwise, the result will be 1.5 × 10^9.
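How the MIN argument interacts with missing data can be sketched in Python. The sketch is an illustration only: representing missing data as None and the name mean_with_min are assumptions made here, whereas IDAMS identifies missing values through the MD1 and MD2 codes of each variable.

```python
DEFAULT_MD1 = 1.5e9  # default missing data value named in the text


def mean_with_min(values, min_valid=1):
    """Mean of the non-missing values, or DEFAULT_MD1 if too few are valid.

    Missing data is represented by None (an assumption of this sketch).
    """
    valid = [v for v in values if v is not None]
    if not valid or len(valid) < min_valid:
        return DEFAULT_MD1
    return sum(valid) / len(valid)


# Five variables, as in R15=MEAN(R2-R4,V22,V5,MIN=2).
print(mean_with_min([4, None, 8, None, None], min_valid=2))
print(mean_with_min([4, None, None, None, None], min_valid=2))
```

The same pattern, with the final division replaced by max, min or sum, applies to the other functions that accept the MIN argument.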
MIN. The MIN function returns the minimum value in a set of variables. Missing data values are excluded.
The MIN argument can be used to specify the minimum number of valid values for a minimum to be
calculated. If fewer valid values are found, the default missing value 1.5 × 10^9 is returned.
Prototype: MIN(varlist [,MIN=n] )
Where:
varlist is a list of V- and R-type variables, and constants.
n is the minimum number of valid values for computation of the minimum value. n defaults to 1.
Example:
R10=MIN(V5,V7,V9,R2)
NMISS. The NMISS function returns the number of missing values in a set of variables.
Prototype: NMISS(varlist)
Where varlist is a list of V- and R-type variables.
Example:
R22=NMISS(R6-R10)
The returned value depends on how many of the variables R6 - R10 have missing values. The maximum
value is 5 for a case in which all 5 variables have missing data.
NVALID. The NVALID function returns the number of valid values (non-missing values) in a set of vari-
ables.
Prototype: NVALID(varlist)
Where varlist is a list of V- and R-type variables.
Example:
R2=NVALID(V20,V22,V24)
The returned value depends on how many of the variables have valid values. The maximum value of 3 will
be obtained if all 3 variables have valid values. 0 will be returned if all 3 are missing.
RAND. The RAND function returns a value which is a uniformly distributed random number based upon
the arguments starter and limit as described below.
Prototype: RAND(starter [,limit] )
Where:
starter is an integer constant that is used to initiate the random sequence. If starter is 0, then the
current clock time is used.
limit is an optional argument. It is an integer constant that is used to specify the range (i.e. 3 means
a range of 1 to 3). The default value is 10, which means the default range is 1 to 10.
Examples:
R1=RAND(0)
IF RAND(0) NE 1 THEN REJECT
For each case processed, R1 will be set equal to a random number, uniformly distributed from 1 to 10. The
sequence is initialized to the clock time the first time RAND is executed. Note that RAND can be used
with the REJECT statement to select a random sample of cases. The 2nd example will result in including
a random 1/10 sample of cases.
RECODE. The RECODE function is used to return one value based upon the concurrent values of m
variables.
Prototype: RECODE var1,var2,...,varm [,TAB=i] [,ELSE=value] [,rule1,rule2,...,rule n]
Where:
var1,var2,...,varm is a list of up to 12 V and/or R variables to be tested.
TAB=i either numbers the set of recode rules established in this use of RECODE (optional) or references
a set of rules established in a previous use of RECODE. Note: the ELSE value is not considered
a part of the set of recode rules.
ELSE=value (optional) indicates the value to be returned if none of the code lists match the values
of the variables. While it is usually a constant, the value may be any arithmetic expression. If ELSE
is omitted, and none of the code lists match the variable values, the function does not return a value,
i.e. the value of the result variable is left unchanged. If this is the first assignment statement for a
variable, then its value will be the input data value for a V-variable or missing data for an R-variable.
rule1, rule2,..., rule n are the set of rules defining the values to be returned depending on the values
of var1, var2,..., varm. Each rule is of the form (code list 1) (code list 2) ... (code list p)=c. Each
code list is of the form (a1/a2/.../am) where a1 is the code to be compared with var1, a2 is the code
to be compared with var2, etc. Here c is the value to be returned when var1,var2,..., varm match the
codes defined in any of the code lists.
The prototype for a rule is:
(a1/a2/.../am)(b1/b2/.../bm)...(x1/x2/.../xm)=c
Each code list contains a list and/or a range of values for every variable, e.g. with two variables,
(3/2)(6-9/4)(0/1,3,5)=1.
The codes in the code list may be separated by a slash (indicating AND) or by a vertical bar
(indicating OR), although only one or the other may be used in any given code list.
For example:
(a1/a2/a3)=c
(the function will return c if var1=a1 and var2=a2 and var3=a3)
(a1|a2|a3)=c
(the function will return c if var1=a1 or var2=a2 or var3=a3)
Rules are examined from left to right. The first code list which matches the variable list values
determines the value to be returned.
The argument list for the RECODE function is not enclosed in parentheses.
TAB, ELSE and rules may be in any order.
Examples:
R7=RECODE V1,V2,(3/5)(7/8)=1,(6-9/1-6)=2
R7 will be assigned a value based on the values of V1 and V2. In this example, R7 will be set to 1 if V1=3
and V2=5, or if V1=7 and V2=8. R7 will be set to 2 if V1=6-9 and V2=1-6. In all other instances, R7 will
be unchanged (see above).
R7=RECODE V1,V2,TAB=1,ELSE=MD1(R7),(3/5)(7/8)=1,(6-9/1-6)=2
R7 will be assigned a value the same as in the preceding example, except that R7 will be set equal to its
MD1 value when the rules are not met. The TAB=1 will allow these rules to be used in another RECODE
function call.
Restriction: When the RECODE function is used, it must be the only operand on the right-hand side of the
equals sign.
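The left-to-right rule matching of RECODE can be sketched in Python. The sketch is an illustration only: the function name recode, the nested list layout and the use of sets and integer ranges for code lists are assumptions made here, and only the slash (AND) form of a code list is modelled.

```python
def recode(values, rules, current, else_value=None):
    """Return the result of the first rule with a matching code list.

    rules is a list of (code_lists, result); each code list holds one
    collection of accepted codes per variable (the slash / AND form).
    If nothing matches, else_value is returned when given; otherwise the
    current value of the result variable is returned unchanged.
    """
    for code_lists, result in rules:
        for code_list in code_lists:
            if all(v in accepted for v, accepted in zip(values, code_list)):
                return result
    return current if else_value is None else else_value


# Rules of R7=RECODE V1,V2,(3/5)(7/8)=1,(6-9/1-6)=2 (integer codes assumed).
rules = [
    ([({3}, {5}), ({7}, {8})], 1),
    ([(range(6, 10), range(1, 7))], 2),
]

print(recode((3, 5), rules, current=0))                 # first code list of rule 1
print(recode((8, 4), rules, current=0))                 # range rule
print(recode((3, 8), rules, current=0))                 # no match, R7 unchanged
print(recode((3, 8), rules, current=0, else_value=99))  # no match, ELSE used
```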
SELECT. The SELECT function returns the value of the variable or constant in the FROM list holding
the same position as the value of the BY variable. (Warning: If the value of the BY variable is less than 1 or
greater than the number of variables in the FROM list, a fatal error results). There may be up to 50 items
in the FROM list. The maximum value of the BY variable is therefore 50. A SELECT function may be
combined with other functions, operations, and variables to form a complex expression. Note: The SELECT
function selects the value of one of a set of variables; the SELECT statement selects the variable to be
used for the result. (See section "Special Assignment Statements" for a description of the SELECT statement.)
Prototype: SELECT (FROM=list of variables and/or constants, BY=variable)
Example:
R10=SELECT (FROM=R1-R3,9,BY=V2)
R10 will take the value of R1, R2, R3 or 9 for values of 1, 2, 3 or 4 respectively of V2.
SQRT. The SQRT function returns a value which is the square root of the argument passed to the function.
Prototype: SQRT(arg)
Where arg is any arithmetic expression.
Example:
R5=SQRT(V5)
STD. The STD function returns the standard deviation of the values of a set of variables. Missing data
values are excluded. The MIN argument can be used to specify the minimum number of valid values for a
standard deviation to be calculated. If fewer valid values are found, the default missing value 1.5 × 10^9
is returned.
Prototype: STD(varlist [,MIN=n] )
Where:
varlist is a list of V- and R-type variables, and constants.
n is the minimum number of valid values for computation of the standard deviation. n defaults to 1.
Example:
R5=STD(V20-V24,R56-R58,MIN=3)
SUM. The SUM function returns the sum of the values of a set of variables. Missing values are excluded.
The MIN argument can be used to specify the minimum number of valid values for a sum to be calculated.
If fewer valid values are found, the default missing value 1.5 × 10^9 is returned.
Prototype: SUM(varlist [,MIN=n] )
Where:
varlist is a list of V- and R-type variables, and constants.
n is the minimum number of valid values for computation of the sum. n defaults to 1.
Example:
R8=SUM(V20,V22,V24,V26,MIN=3)
If three or more of the variables have valid values, the sum of these is returned. Otherwise the value 1.5 × 10^9
is returned.
TABLE. The TABLE function returns a value based on the concurrent values of two variables.
Prototype: TABLE (r, c, [TAB=i,] [ELSE=value,] [PAD=value,] COLS c1,c2,...,cm,
ROWS r1(row r1 values),r2(row r2 values),...,rn(row rn values))
Where:
r is a variable or constant that will be used as a row index to a table.
c is a variable or constant that will be used as a column index to a table.
TAB=i either numbers the table defined in this use of TABLE (optional) or references a table defined
in a previous use of TABLE.
ELSE=value gives a value to use for pairs of values that are not defined in the table. The value may be
an arithmetic expression. The value of ELSE defaults to 99 if not specified, i.e. TABLE always returns
a value.
PAD=value gives a value to be inserted into any cell which is defined by the COLS specifications but
not defined by the ROWS specifications.
TAB, ELSE and PAD may be specified in any order.
c1,c2,...,cm are the columns of the table. Ranges may be used in the column definitions.
r1,r2,...,rn are the rows of the table. The total size of the table will be m by n, where m is the number
of columns and n is the number of rows.
(row r1 values), (row r2 values),...,(row rn values) are the values returned depending on the values of r
and c. The values are given in the same order as the column specifications; the first value corresponds
to c1, the second to c2, etc. Ranges may be used in the row value definitions.
Examples: Assume the following table:
        Col:   1  2  3  4  5  6
Row:  2        1  1  2  2  3  4
      3        1  2  2  2  3  4
      5        1  2  2  2  3  4
      6        3  3  3  3  3  4
      8        9  9  9  9  9  9
R1=TABLE (V6, V4, TAB=1, ELSE=0, PAD=9, COLS 1-6, ROWS 2(1,1,2,2,3,4), -
3(1,2,2,2,3,4),5(1,2,2,2,3,4),6(3,3,3,3,3,4),8(9))
If V6 equals 5 and V4 equals 3, then R1 will be assigned the value 2 (intersect of row 5 and column 3).
If V6 equals 2 and V4 equals 6, then R1 will be assigned the value 4 (intersect of row 2 and column 6).
If V6 equals 4 and V4 equals 2, then R1 will be assigned the value 0 (row 4 is not defined; the ELSE value
is used).
R5=TABLE (3, V8, TAB=7, ELSE=TABLE(V1,V8,TAB=1) )
This will use the table named 7 with 3 as the row index and the value of V8 as the column index. If a
value of V8 is not in table 7 then the table 1 will be used with row index V1 and column index V8.
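The lookup performed by the first TABLE example, including the roles of PAD and ELSE, can be sketched in Python. The sketch is an illustration only: the names build_table and table_lookup and the dictionary keyed by (row, column) are assumptions made here, not a description of how IDAMS stores tables.

```python
def build_table(cols, rows, pad):
    """Build a {(row, col): value} dictionary from COLS and ROWS definitions.

    Rows shorter than the column list are completed with the PAD value.
    """
    table = {}
    for row_key, row_values in rows.items():
        padded = list(row_values) + [pad] * (len(cols) - len(row_values))
        for col_key, cell in zip(cols, padded):
            table[(row_key, col_key)] = cell
    return table


def table_lookup(table, r, c, else_value=99):
    """Return the cell at row r, column c, or the ELSE value if undefined."""
    return table.get((r, c), else_value)


# COLS 1-6, ROWS 2(...),3(...),5(...),6(...),8(9) with PAD=9.
cols = range(1, 7)
rows = {
    2: (1, 1, 2, 2, 3, 4),
    3: (1, 2, 2, 2, 3, 4),
    5: (1, 2, 2, 2, 3, 4),
    6: (3, 3, 3, 3, 3, 4),
    8: (9,),
}
table_1 = build_table(cols, rows, pad=9)

print(table_lookup(table_1, 5, 3, else_value=0))  # row 5, column 3
print(table_lookup(table_1, 2, 6, else_value=0))  # row 2, column 6
print(table_lookup(table_1, 4, 2, else_value=0))  # row 4 undefined, ELSE used
```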
TRUNC. The TRUNC function returns the integer value of an argument.
Prototype: TRUNC(arg)
Where arg is any arithmetic expression for which the integer value is to be taken.
Example:
R5=TRUNC(V5)
R5 will be assigned the value of the input variable V5 truncated to an integer.
VAR. The VAR function returns the variance of the values of a set of variables, excluding missing data. The
MIN argument can be used to specify the minimum number of valid values for the variance to be calculated.
If fewer valid values are found, the default missing value 1.5 × 10^9 is returned.
Prototype: VAR(varlist [,MIN=n] )
Where:
varlist is a list of V- and R-type variables, and constants.
n is the minimum number of valid values for computation of the variance. n defaults to 1.
Example:
R9=VAR(V5-V10)
4.9 Logical Functions
Logical functions return a value of true or false when evaluated. They cannot be used as arithmetic
operands. Logical functions are used in logical expressions and logical expressions comprise the test portion
of conditional IF test THEN... statements. The available functions are:
Function   Example                         Purpose
EOF        IF EOF THEN GO TO NEXT          Checks for the end of the data file
INLIST     IF V5 INLIST(2,4,6) THEN -      Searches a list of values
             R100=1 ELSE R100=0
MDATA      IF MDATA(V5,V6) THEN R101=99    Checks for missing data
EOF. The EOF function is used for aggregation of values across cases. See example 10 in section "Examples
of Use of Recode Statements". The presence of the EOF function causes the Recode statements to be
executed once more after the end-of-file has been encountered. The value of the EOF function is true during
this after-end-of-file pass of the Recode statements and is false at all other times.
For the final pass through the Recode statements, V-variables will have the value they had after the last case
was fully processed. R-variables (except those listed in CARRY statements) will be reinitialized to 1.5 × 10^9.
CARRY R-variables will be left untouched. The user must be careful to set up a correct path to be followed
through the Recode statements when end-of-file is reached.
Prototype: EOF
Example:
IF R1 NE V1 OR EOF THEN GO TO L1
INLIST. The INLIST function (abbreviated IN) returns a value of true if the result of an arithmetic
expression is one of a specified set of values. If the expression equals a value outside the set of values, the
function returns a value of false.
Prototype: expr INLIST(values) or expr IN(values)
Where:
expr is any arithmetic expression or a single variable.
values is a list of values. These may be discrete and/or value ranges.
Examples:
IF R12 INLIST(1-5,9,10) THEN V5=0
If R12 has a value of 1,2,3,4,5,9 or 10, the INLIST function returns a value of true, and input variable V5
is set to 0. Otherwise, INLIST returns a value of false and input variable V5 retains its original value.
IF (V3 + V7) IN(2,4,5,6) THEN R1=1 ELSE R1=9
If the sum of input variables V3 and V7 results in the value 2,4,5, or 6, then INLIST returns a value of
true and result variable R1 will contain the value 1. Otherwise, INLIST returns a value of false and R1
will be set to 9.
MDATA. The MDATA function returns a value of true if any of the variables passed to the function
have missing data values; otherwise, the function returns a value of false. This function is used quite often,
since missing data is not automatically checked in the evaluation of expressions except in the MAX, MEAN,
MIN, STD, SUM and VAR functions.
Prototype: MDATA(varlist)
Where varlist is a list of V- and R-variables. There can be a maximum of 50 variables in this list.
Example:
IF MDATA(V1,V5-V6) THEN R1=MD1(R1) ELSE R1=V1+V5+V6
If any variable in the list V1, V5, V6 has a value equal to its MD1 code or in the range specified by its
MD2 code, the MDATA function will return a value of true, and result variable R1 will be set to its first
missing data code. Otherwise, the MDATA function will return a value of false and R1 is set to the sum
of V1, V5, V6.
4.10 Assignment Statements
These are the main structural units of the Recode language. They are used to assign a value to a result.
Any number between 1 and 9999 may be used for an R-variable but it avoids confusion if the R-numbers are
distinct from V-numbers of variables in the input dictionary, e.g. if there are 22 variables in the dictionary
then start numbering R-variables from R30. Assignment statements can also be used to assign a new value
to an input variable. In this case the original value of the input variable is lost for the duration of the
particular IDAMS program execution.
Prototype: variable=expression
Where:
variable is any input (Vn) or result (Rn) variable.
expression is any arithmetic expression optionally using Recode arithmetic functions.
Note that variables used in the expression are not automatically checked for missing data except in the
special functions MAX, MEAN, MIN, STD, SUM, VAR. In all other cases, specific statements to check
for missing data must be introduced where appropriate. See below under "Conditional Statements" for an
example.
Examples:
R10=5
R10 is assigned the constant 5 as its value.
R5=2*V10 + (V11 + V12)/2
Any arithmetic expression may be used and parentheses are used to change normal precedence of the arith-
metic operators.
V20=SQRT(V20)
The value in V20 is replaced by its square root using the SQRT function.
R20=BRAC(V6,0-15=1,16-25=2,26-35=3,36-90=4,ELSE=9)
R20 is assigned the value 1, 2, 3, 4 or 9 according to the group into which the value of V6 falls.
R10=MD1(V10)
R10 is assigned a value equal to V10's first missing data code.
4.11 Special Assignment Statements
DUMMY. The DUMMY statement produces a series of dummy variables, coded 0 or 1, from a single
variable.
Prototype: DUMMY var1,...,varn USING var(val1)(val2)...(valn)[ELSE expression]
Where:
var1, var2,...,varn is a list of the dummy variables whose values are defined by this statement. They
may be V- or R-variables, may be listed singly or in ranges, and must be separated by commas (e.g.
R1-R3, R10, R7-R9, V20). The order specied is preserved.
Double references (R1, R3, R1) are valid.
var is any V- or R-variable. The value of this variable is tested against the value lists (val1)(val2) etc.
to set the appropriate value of the dummy variables.
(val1)(val2)...(valn) are lists of values used to set the values of the dummy variables. There must be
the same number of lists as dummy variables (var1, var2, ..., varn). Value lists can contain single
constants or ranges or both.
expression is any arithmetic expression that is used as the value for all dummy variables when the
value of the variable var is not in one of the lists of values. Expression defaults to the constant 0.
The value of the variable var is tested against the value lists (the number of value lists must equal the
number of dummy variables); if var has a value in the first value list, the first dummy variable is set
to 1, the others to 0; if the var value occurs in the second value list, the second dummy variable is set
to 1, the others to 0, etc. If the var value occurs in none of the value lists, all dummy variables are set
to the value specified after the ELSE (defaults to 0).
Example:
DUMMY R1-R3 USING V8(1-4)(5,7,9)(0,8) ELSE 99
The following chart shows the values of R1, R2 and R3 based on different V8 values:
V8:   1   2   3   4   5   7   8   9   0   OTHER
R1:   1   1   1   1   0   0   0   0   0   99
R2:   0   0   0   0   1   1   0   1   0   99
R3:   0   0   0   0   0   0   1   0   1   99
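The chart can be reproduced with a short Python sketch of the DUMMY logic. It is an illustration only: the function name dummy and the use of Python sets for the value lists are assumptions made here.

```python
def dummy(value, value_lists, else_value=0):
    """Return one 0/1 flag per value list, using the first list that matches.

    If value occurs in none of the lists, every flag takes else_value.
    """
    for position, codes in enumerate(value_lists):
        if value in codes:
            return [1 if i == position else 0 for i in range(len(value_lists))]
    return [else_value] * len(value_lists)


# Value lists of DUMMY R1-R3 USING V8(1-4)(5,7,9)(0,8) ELSE 99.
v8_lists = [{1, 2, 3, 4}, {5, 7, 9}, {0, 8}]

for v8 in (1, 2, 3, 4, 5, 7, 8, 9, 0, 6):
    r1, r2, r3 = dummy(v8, v8_lists, else_value=99)
    print(v8, r1, r2, r3)
```

The value 6 stands in for the OTHER column of the chart, since it appears in none of the three lists.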
SELECT. The SELECT statement causes the variable in the FROM list holding the same position as the
value of the BY variable to be set equal to the value of the expression to the right of the equals sign i.e.
it selects which variable is to be assigned a value. If the value of the BY variable is less than 1 or greater
than the number of variables in the FROM list, a fatal error results. The maximum number of items in the
FROM list is 50. Therefore the maximum value of the BY variable is 50.
Prototype: SELECT (FROM=variable list, BY=variable)=expression
Examples:
SELECT (FROM=R1,V3-V10, BY=R99)=1
SELECT (BY=V1, FROM=V8,R2,R5)=R7*5
In the first example, R1 will be set to 1 if R99 equals 1; V3 will be set to 1 if R99 equals 2; ... ; and V10
will be set to 1 if R99 equals 9. If R99 is greater than 9 or less than 1, a fatal error will result. The values
of the eight variables not selected will not be altered.
SELECT may be used to form a loop as follows:
R99=1
L1 SELECT (BY=R99, FROM=R1,V3-V10)=0
IF R99 LT 9 THEN R99=R99+1 AND GO TO L1
The nine variables R1, V3-V10 will be set to zero, one after another, as R99 is incremented from 1 to 9. The
loop is completed when R99 equals 9 and all variables have been initialized.
4.12 Control Statements
Recode statements are normally executed on each data case in order from rst to last. The order can be
changed with one of the control statements:
Statement   Example              Purpose
BRANCH      BRANCH (V16,L1,L2)   Branch depending on the value of a variable
CONTINUE    CONTINUE             Continue with next statement
ENDFILE     ENDFILE              Do not process any more data cases after this one
ERROR       ERROR                Terminate execution completely
GO TO       GO TO TOWN           Branch unconditionally
REJECT      REJECT               Reject the current data case
RELEASE     RELEASE              Release the current data case to the program
                                 for processing and then execute recode
                                 statements again without reading another case
RETURN      RETURN               Use the current case for analysis
                                 with no further recoding
BRANCH. The BRANCH statement changes the sequence in which statements are executed, depending
on the value of a variable.
Prototype: BRANCH(var,labels)
Where:
var is a V or R-variable.
labels is a list of one or more 1 to 4-character statement labels.
Example:
BRANCH(R99,LAB1,LAB2,LAB3)
Transfer is made to LAB1, LAB2, or LAB3, depending on whether R99 has a value of 1,2, or 3.
CONTINUE. CONTINUE is a simple statement which performs no operation. It is used as a convenient
transfer point.
Prototype: CONTINUE
Example:
IF V17 EQ 10 THEN GO TO AT
R10=V11
GO TO THAT
AT R20=V11*100
THAT CONTINUE
ENDFILE. The ENDFILE statement causes the Recode facility to close the input dataset exactly as if an
end-of-file had been reached. If the EOF function has been specified, the EOF function will be given a true
value for a nal pass through the Recode statements from the beginning, after ENDFILE has been executed.
Prototype: ENDFILE
Example:
IF V1 EQ 100 THEN ENDFILE
This statement can be used to test a set of Recode statements or an IDAMS setup on the first n cases of a
dataset.
ERROR. The ERROR statement directs the Recode facility to terminate execution with an error message
that indicates the number of the case and the number of the Recode statement at which the error occurred.
Prototype: ERROR
Example:
IF R6 EQ 2 THEN GO TO B
ERROR
B CONTINUE
GO TO. The GO TO statement is used to change the sequence in which the statements are executed. In
the absence of a GO TO or a BRANCH statement, each statement is executed sequentially.
Prototype: GO TO label
Where label is a 1-4 character statement label. The statement identified by the label may be physically
before or after the GO TO statement. (Warning: Be careful of referencing a statement before the GO TO,
as an endless loop can be formed).
Example:
GO TO TOWN
.
.
R10=R5
GO TO 1
TOWN R10=R5+V11
1 R11=...
REJECT. The REJECT statement directs the Recode facility to reject the present case and obtain another
case. The new case is then processed from the beginning of the Recode statements. Thus, REJECT can be
used as a filter with R-variables.
Prototype: REJECT
Example:
IF MDATA (V8,V12-V13) THEN REJECT
RELEASE. The RELEASE statement directs the Recode facility to release the present case to the program
for processing and to regain control after the processing without reading another case. After regaining control,
Recode resumes with the first Recode statement. RELEASE can be used to break up a single record into
several cases for analysis. Note: When using the RELEASE statement, care should be taken that processing
will not continue indefinitely.
Prototype: RELEASE
Example:
CARRY (R1)
R1=R1+1
IF R1 LT V1 THEN RELEASE ELSE R1=0
RETURN. The RETURN statement directs the Recode facility to return control to the IDAMS program.
No other Recode statements are executed for the current case.
Prototype: RETURN
Example:
IF V8 LT 12 THEN GO TO A
RETURN
A R10=V8
4.13 Conditional Statements
The IF statement allows conditional assignment and/or conditional control. It is a compound statement
with several simple statements connected by the keywords THEN, AND and ELSE.
Prototype:
IF test THEN stmt1 [AND stmt2 AND ... stmt n][ELSE estmt1] [AND estmt2 AND ... estmt n]
Where:
test may be any combination of logical expressions (including logical functions) connected by AND or
OR and optionally preceded by NOT. It may be, but need not be, enclosed in parentheses.
stmt1,...,stmt n,estmt1,...,estmt n may be any assignment or control statement (except CONTINUE).
The statement(s) between the THEN and ELSE are executed if the test is true.
The statement(s) after the ELSE are executed if the test is false. If no ELSE clause is present, the
next statement is executed.
The THEN and ELSE keywords may each be followed by any number of statements, each connected
by the keyword AND.
Examples:
IF V5 EQ V6 THEN R1=1 ELSE R1=2
Set R1 to 1 if the value of V5 equals the value of V6; otherwise set R1 to 2.
IF MDATA(V7,V10-V12) THEN R6=MD1(V7) AND R10=99 -
ELSE R6=V7+V10+V11 AND R10=V12*V7
Set R6 to V7's first missing data value and R10 to 99 if any of the variables V7, V10, V11, V12 are equal to
their missing data codes. Otherwise set R6 equal to the sum of V7, V10 and V11, and also set R10 equal to
the product of V12 and V7.
IF (V5 NE 7 AND R8 EQ 9) THEN V3=1 ELSE V3=0
Set V3 to 1 if both V5 is not equal to 7 and R8 is equal to 9. (Note: The parentheses are not required).
IF MDATA(V6) OR V10 LT 0 THEN GO TO X
If the value of V6 is missing or V10 is less than 0, branch to the statement labelled X; otherwise continue
with the next statement.
4.14 Initialization/Definition Statements
These statements are executed once, before processing of the data starts, to initialize values to be used during
the execution of Recode statements. They cannot be used in expressions and they cannot have labels.
CARRY. The CARRY statement causes the values of the variables listed to be carried over from case to
case. CARRY variables are initialized only once (before starting to read the data) to zero. The CARRY
variables can be used as counters or as accumulators for aggregation.
Prototype: CARRY(varlist)
Where varlist is a list of R-variables.
Example:
CARRY(R1,R5-R10,R12)
MDCODES. The MDCODES statement changes dictionary missing data codes for input variables or
assigns missing data codes for result variables. Defaults used by Recode for R- and V-variables with no
dictionary missing data specification and no MDCODES specification are MD1=1.5 × 10^9 and MD2=1.6 × 10^9.
Prototype: MDCODES (varlist1)(md1,md2),(varlist2)(md1,md2), ..., (varlistn)(md1,md2)
Where:
varlist1, varlist2, ..., varlistn are variable lists containing lists of single variables and variable ranges.
md1 and md2 are first and second missing data codes respectively, for all variables listed. Decimal
valued missing data codes must be specified with explicit decimal point. Warning: only 2 decimal
places are retained for R-variables, rounding up the values accordingly, e.g. md1 specified as 9.999 is
treated as 10.00.
Either md1 or md2 may be omitted. If md1 is omitted, a comma must precede the md2 value.
Examples:
MDCODES V5(8,9)
The first missing data code for V5 will be 8; the second missing data code will be 9.
MDCODES (R9-R11)(,99), V7(8,9), V6(9)
For R9, R10 and R11, the first missing data code will be 1.5 × 10^9 and the second missing data code will be
99.
For V7, the first missing data code will be 8 and the second missing data code will be 9.
For V6, the first missing data code will be 9 and the second missing data code will be 1.6 × 10^9.
NAME. The NAME statement assigns names to R-variables or renames V-variables.
Prototype: NAME var1'name1', var2'name2', ..., varn'name n'
Where:
var1,var2,...,varn are V- or R-variables.
name1, name2,...,name n are names to assign to these variables.
The maximum number of characters per name is 24; if longer, the name is truncated to 24 characters.
Default name for an R-variable is RECODED VARIABLE Rn.
To include an apostrophe in a name (e.g. PERSON'S), use two primes (e.g. PERSON''S).
Example:
NAME R1'V5 + V6', V1'PERSON''S STATUS'
4.15 Examples of Use of Recode Statements
Suppose a data file exists with the following variables:
V1    Village ID
V2    Sex                      1=male, 2=female
V4    Age                      21-98, 99=not stated
V5    Education level          1=primary, 2=secondary,
                               3=university, 9=Not stated
V8    Income from 1st job
V9    Income from 2nd job
V10   Partner's income
V21   Weight in kg (one decimal)
V22   Height in meters (2 decimals)
V31   Owns car?                1=yes, 2=no, 9=NS
V32   Owns TV?
V33   Owns stereo?
V34   Owns freezer?
V35   Owns Micro computer?
V41   Number of children
V42   Age of 1st child
V43   Age of 2nd child
V44   Age of 3rd child
V45   Age of 4th child
Ways to construct some possible analysis variables from this data are outlined below.
1. Total Income. If income from 1st and 2nd jobs are both missing, then the total income will be missing.
If only one is missing, then use the other as the total.
IF NVALID(V8,V9) EQ 0 THEN R101=-1 AND GO TO END
IF NVALID(V8,V9) EQ 2 THEN R101=V8+V9 AND GO TO END
IF MDATA(V8) THEN R101=V9 ELSE R101=V8
END CONTINUE
MDCODES R101(-1)
or
R101=SUM(V8,V9,MIN=1)
IF R101 EQ 1.5 * 10 EXP 9 THEN R101=-1
MDCODES R101(-1)
2. Do not use the case if total income is zero or missing.
IF MDATA(R101) OR R101 EQ 0 THEN REJECT
3. Composite income taking 3/4 of own income plus 1/4 of partner's income. If partner's income is
missing, assume zero.
IF MDATA(V10) THEN V10=0
IF MDATA(R101) THEN R102=MD1(R102) -
ELSE R102=R101 * .75 + V10 * .25
NAME R102'Composite income'
MDCODES R102(99999)
4. Weight of respondent grouped into light (30-50), medium (51-70) and heavy (70+).
R103=BRAC(V21,30-50=1,50-70=2,70-200=3,ELSE=9)
Note that V21 is recorded with a decimal place. To make sure that values such as 50.2 get assigned to
a category, ranges in the BRAC statement should overlap. Recode works from left to right and assigns
the code for the first range into which the case falls. Thus a value of 50.0 will fall in category 1 but a
value 50.1 will fall into category 2. To put values of 50 in the 2nd category, use
R103=BRAC(V21, <50=1, <70=2, <200=3, ELSE=9)
A value of 49 would fit in all 3 ranges, but Recode will use the first valid range it finds (code 1). A
value of 50 will not satisfy the first range and will be assigned code 2.
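The left-to-right, first-match behaviour described for BRAC can be imitated in a few lines of Python. This is a sketch under stated assumptions, not a model of the Recode implementation: the helper names brac, closed and below are invented here, and a range a-b is treated as inclusive at both ends.

```python
def brac(value, rules, default):
    """Return the code of the first rule whose test accepts the value."""
    for test, code in rules:
        if test(value):
            return code
    return default


def closed(lo, hi):
    """Test for an inclusive range lo-hi."""
    return lambda x: lo <= x <= hi


def below(limit):
    """Test for the '<limit' form."""
    return lambda x: x < limit


overlapping = [(closed(30, 50), 1), (closed(50, 70), 2), (closed(70, 200), 3)]
open_ended = [(below(50), 1), (below(70), 2), (below(200), 3)]

print(brac(50.0, overlapping, 9))  # 1: the first range already matches
print(brac(50.1, overlapping, 9))  # 2: only the second range matches
print(brac(50, open_ended, 9))     # 2: 50 is not below 50
print(brac(49, open_ended, 9))     # 1: first matching rule wins
```

Because the first matching rule wins, the order of the rules is part of the definition of the categories.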
5. Affluence index with values 0-5 according to the number of possessions owned.
R104=COUNT(1,V31-V35)
If all items are coded 1 (yes), the index, R104, will take the value 5. If all are coded 2 (no) or are
missing, then the index will be zero.
6. Create 3 dummy variables (coded 0/1) from the education variable.
DUMMY R105-R107 USING V5(1)(2)(3)
The 3 result variables will take values as follows:
V5=1 R105=1, R106=0, R107=0
V5=2 R105=0, R106=1, R107=0
V5=3 R105=0, R106=0, R107=1
V5 not 1,2 or 3 R105=0, R106=0, R107=0 (default if no ELSE value given)
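The table of results above corresponds to a simple one-of-n encoding. The following Python sketch reproduces it for illustration; the function name dummies is hypothetical and the code is not part of IDAMS.

```python
def dummies(value, codes=(1, 2, 3)):
    """Return one 0/1 indicator per listed code (R105, R106, R107 in the example)."""
    return [1 if value == code else 0 for code in codes]


for v5 in (1, 2, 3, 9):
    print(v5, dummies(v5))
```

A value that matches none of the listed codes gives all zeros, which mirrors the default described when no ELSE value is given.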
7. Age of youngest child. Ages of the last 4 children are stored in variables 42 to 45, the oldest child
being in V42. If someone has 3 children, then the value of V44 gives the age of the youngest child; if
someone has 4 or more children then we want V45. In this case, V41 (number of children) can be used
as an index to select the correct variable using the SELECT function.
IF V41 GT 4 THEN V41=4
IF V41 EQ 0 OR MDATA(V41) THEN R109=99 ELSE -
R109=SELECT (FROM=V42-V45, BY=V41)
NAME R109'Last child''s age'
MDCODES R109(99)
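The indexing idea behind SELECT in example 7 can be sketched in Python as follows. The function name youngest_child_age and the use of None for a missing V41 are assumptions of this sketch; it only illustrates the logic and does not claim to match SELECT in every detail.

```python
MISSING = 99  # missing-data code given to R109 in the example


def youngest_child_age(n_children, ages):
    """ages holds V42..V45 (oldest child first); n_children is V41 or None."""
    if n_children is None or n_children == 0:
        return MISSING
    index = min(n_children, len(ages))  # 5 or more children: use the 4th slot
    return ages[index - 1]              # positions count from 1, Python lists from 0


print(youngest_child_age(3, [15, 12, 8, 99]))
print(youngest_child_age(6, [20, 18, 15, 11]))
print(youngest_child_age(0, [99, 99, 99, 99]))
```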
8. Weight/Height ratio as a decimal number and rounded to the nearest integer.
IF MDATA (V21,V22) OR V22 EQ 0 THEN R111=99 AND R112=99 -
ELSE R111=V21/V22 AND R112=TRUNC ((V21/V22) + .5)
NAME R111'Weight/Height ratio dec', R112'W/H rounded'
MDCODES (R111,R112)(99)
9. Create a single variable combining sex and educational level into 4 groups as follows:
Females, primary education only
Females, secondary+ education
Males, primary education only
Males, secondary+ education
Method a. First reduce the codes for sex and education into contiguous codes starting from 0, storing
the results temporarily in variables R901, R902.
R901=BRAC (V5,1=0,2=1,ELSE=9)
R902=BRAC (V6,1=0,2=1,3=1,ELSE=9)
Then use the COMBINE function, making sure first that cases with spurious codes are put in a missing
data category.
IF R901 GT 1 OR R902 GT 1 THEN R110=9 ELSE -
R110=COMBINE R901(2),R902(2)
Method b. Use IFs, setting a default value of 9 at the start.
R110=9
IF V5 EQ 1 AND V6 EQ 1 THEN R110=1
IF V5 EQ 1 AND V6 INLIST (2,3) THEN R110=2
IF V5 EQ 2 AND V6 EQ 1 THEN R110=3
IF V5 EQ 2 AND V6 INLIST (2,3) THEN R110=4
Method c. Use the RECODE function.
R110=RECODE V5,V6(1/1)=1,(1/2-3)=2,(2/1)=4,(2/2-3)=5,ELSE=9
10. Aggregating cases with Recode. Suppose we want to analyze the data (consisting of individual level
records) at the village level, for example to produce a table showing the distribution of villages by
income (V8,V9) and % of people owning a car (V31) in the village. We could do this by using
AGGREG to aggregate the data to the village level and then executing TABLES. Alternatively, we
may use the CARRY, EOF and REJECT statements of the Recode language and use TABLES directly.
1 CARRY (R901,R902,R903,R904)
2 IF (R901 EQ 0) THEN R901=V1
3 IF (R901 NE V1) THEN GO TO VIL
4 IF EOF THEN GO TO VIL
5 R902=R902+1
6 R903=R903+V8+V9
7 IF (V31 EQ 1) THEN R904=R904+1
8 REJECT
9 VIL R101=(R904*100)/R902
10 R101=BRAC(R101,<25=1,<50=2,<75=3,<101=4)
11 R102=R903/R902
12 R102=BRAC(R102,<1000=1,<2000=2,<5000=3,ELSE=4)
13 R901=V1
14 R902=1
15 R903=V8+V9
16 IF (V31 EQ 1) THEN R904=1 ELSE R904=0
17 NAME R102'average income', R101'% owning car'
R901 is a work variable used to hold the current village ID; when the first case is read (R901=0), R901
is assigned the value of the village ID (V1); R902 to R904 are work variables for, respectively, the
number of people in the village, the total income of the people in the village and the number of people
owning cars in the village.
While the village ID stays the same, data is accumulated in variables R902 to R904 (whose values are
carried as new cases are read). The case is then rejected (not passed to the analysis) and the next
case read. When a change in village ID is encountered, the instructions at label VIL are executed: the
current contents of R902, R903 and R904 are used to compute the required variables (grouped mean
income and grouped % of car owners) and these variables are then passed to the analysis after first
resetting the work variables to the values for the last case read (the first case for the next village).
When the end of file is reached, we need to make sure that the data from the last village is used.
Statement 4 achieves this.
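The carry-and-flush pattern of example 10 is a classic control-break loop. The Python sketch below shows the same idea under stated assumptions: the input is a sequence of (village, V8, V9, V31) tuples already sorted by village, incomes are never missing, and the names bracket, summarize and village_summaries are invented for the illustration.

```python
def bracket(value, limits):
    """Code of the first limit the value lies below, else len(limits) + 1."""
    for code, limit in enumerate(limits, start=1):
        if value < limit:
            return code
    return len(limits) + 1


def summarize(village, people, income, cars):
    """Turn the accumulated totals into the two grouped village variables."""
    pct_cars = cars * 100 / people      # counterpart of R101 before grouping
    mean_income = income / people       # counterpart of R102 before grouping
    return (village,
            bracket(pct_cars, [25, 50, 75, 101]),
            bracket(mean_income, [1000, 2000, 5000]))


def village_summaries(cases):
    """Yield one (village, car_group, income_group) tuple per village."""
    current = None                      # plays the role of R901
    people = income = cars = 0          # play the roles of R902, R903, R904
    for village, v8, v9, v31 in cases:
        if current is not None and village != current:
            yield summarize(current, people, income, cars)
            people = income = cars = 0  # start accumulating the new village
        current = village
        people += 1
        income += v8 + v9
        cars += 1 if v31 == 1 else 0
    if current is not None:             # end of file: flush the last village
        yield summarize(current, people, income, cars)


data = [(1, 500, 0, 1), (1, 700, 200, 2), (2, 3000, 1000, 1)]
print(list(village_summaries(data)))
```

The final flush after the loop does the job that the EOF test does in the Recode version: without it the last village would never reach the analysis.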
4.16 Restrictions
1. Maximum number of R-variables is 200.
2. Maximum number of numbered tables (BRAC, RECODE, TABLE) is 20.
3. Maximum number of characters in a Recode statement excluding continuation dashes (-) is 1024.
4. Maximum number of statement labels is approximately 60.
5. Maximum number of constants, including those in all tables, is approximately 1500.
6. Maximum number of names that may be defined in NAME statements is 70.
7. Maximum number of missing data values that may be defined in MDCODES statements is 100 and
only 2 decimal places are retained for R-variables.
8. Maximum number of parenthetical nestings within a statement (i.e. parentheses within parentheses)
is 20.
9. Maximum number of arithmetic operators is approximately 400.
10. Maximum number of variables with SELECT statement is 50.
11. Maximum number of IF statements is approximately 100.
12. Maximum number of function nestings (i.e. function references as function arguments) is 25.
13. Maximum number of statements is approximately 200.
14. Maximum number of labels in a BRANCH statement is 20.
15. Maximum number of CARRY variables is 100.
16. The maximum number of variables given in the Restrictions section of each analysis program
write-up includes R- and V-variables used in the analysis and V-variables used in Recode but not used
in the analysis. Thus, if a program has a 40-variable maximum and 40 input variables are used in the
analysis, one cannot use any other input variables than those 40 in the Recode statements. R-variables
defined in Recode statements but not used in the analysis need not be counted toward the maximum
number of variables.
17. Filtering takes place prior to recoding so that result variables may not be referenced in main filters.
4.17 Note
Univariate/bivariate recoding can be achieved using the TABLE, IF or RECODE method. Below is a brief
comparison of these methods taking into account two execution aspects.
Completeness
TABLE...performs complete recoding. A result value is produced even when the input value is outside
the table (since ELSE defaults to 99).
RECODE allows partial recoding. If no test is true, and no ELSE value is specified, no recoding occurs.
Size of table
Large, complete bivariate and univariate recodings are performed most efficiently by TABLE and IF...
For a large one-to-one, univariate recoding, using one line of a rectangular table, TABLE is better than
IF...
Chapter 5
Data Management and Analysis
5.1 Data Validation with IDAMS
5.1.1 Overview
Before starting analysis of data with whatever software, data normally need to be validated. Such validation
typically comprises three stages:
1. Checking data completeness, i.e. verifying that all cases expected are present in the data file and that
the correct records exist for each case if there are multiple records per case.
2. Checking that numeric variables have only numeric values and checking that values are valid.
3. Consistency checking between variables.
Like much other statistical software, IDAMS requires that there must be the same amount of data for each
case. If the data for one case spans several records, then each case must comprise exactly the same set
of records. If certain variables are not applicable to some cases, then missing values must nonetheless
be assigned. Record merge checking capabilities in IDAMS allow for checking that each case of data has
the correct set of records. This is performed by the program MERCHECK which produces a rectangular
output file where extra/duplicate records have been deleted and cases with missing records have either been
dropped or else padded with dummy records.
Checking for non-numeric values in numeric variables and the optional conversion of blank fields to
user-specified numeric values is performed by the BUILD program. Checking for other invalid codes is performed
by the program CHECK, where the valid codes are defined on special control statements or taken from
C-records in the dictionary describing the data.
If data are entered using the WinIDAMS User Interface, non-numeric characters (except empty fields) in
numeric fields are not allowed. Moreover, there is a possibility of code checking during data entry and of an
overall check for invalid codes in the whole data file. C-records in the dictionary are used for this purpose.
Consistency checks can be expressed in the IDAMS Recoding language and used with the CONCHECK
program to list cases with inconsistencies.
Errors found in any of these steps can be corrected directly through the User Interface or by using the
IDAMS program CORRECT. A typical sequence of steps for data error detection and correction with
IDAMS is described in more detail below.
5.1.2 Checking Data Completeness
Step 1 Produce summary tables showing the distribution of cases amongst sampling units, geograph-
ical areas, etc. for checking against expected totals. This is particularly useful in a sample
survey. For example, suppose a survey of households is done. A sample is taken by first
selecting primary sampling units (PSU) then up to 5 areas within each PSU and then inter-
viewing households in those areas. The distribution of households by PSU and area in the
data can be produced by preparing a small dictionary containing just the 2 variables: PSU
and area. The table would look something like this:
                    V2 AREA
             01   02   03   04   05
         01   3    6    2
V1 PSU   02  10    4    2    8    5
         03
          .
          .
This table could be compared with the interviewers' log-book to check whether the data for
all interviews taken exist in the file.
Steps 2, 3 and 4 are necessary only when cases are composed of more than one record.
Step 2 The original raw data records are sorted into case identification/record identification order
using the SORMER program.
Step 3 The sorted raw data are checked with MERCHECK to see if they have the correct set of
records for each case. The output file contains only good cases, i.e. ones with the correct
records. Extra records and duplicate records are dropped. Cases with missing records are
either dropped or padded. All cases with merge errors are listed.
Step 4 Corrections are now made for the errors detected by MERCHECK. These can be done in a
variety of ways:
Re-enter bad cases and merge them with the output file of MERCHECK using SORMER.
Correct the original raw data with an editor and re-do steps 2 and 3.
Re-enter bad cases, perform steps 2 and 3 on these and then merge the output from
this execution of step 3 with the original output from step 3.
Whichever method is selected, MERCHECK should be re-executed on the corrected file to
make sure all errors have been dealt with.
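The kind of check performed in step 3 can be pictured with a short Python sketch. It is not MERCHECK and makes several assumptions: records are (case_id, record_id, payload) tuples already sorted as in step 2, the first of a set of duplicates is the one kept, and the names merge_check, required, pad and filler are invented for the illustration.

```python
from itertools import groupby


def merge_check(records, required, pad=False, filler=None):
    """Return (good_cases, errors) for records sorted by case and record ID."""
    good, errors = [], []
    for case_id, group in groupby(records, key=lambda r: r[0]):
        kept = {}
        for _, rec_id, payload in group:
            if rec_id not in required:
                errors.append((case_id, rec_id, "extra"))      # dropped
            elif rec_id in kept:
                errors.append((case_id, rec_id, "duplicate"))  # dropped
            else:
                kept[rec_id] = payload
        missing = [r for r in required if r not in kept]
        for rec_id in missing:
            errors.append((case_id, rec_id, "missing"))
        if missing and not pad:
            continue                     # drop the incomplete case
        good.append((case_id, [kept.get(r, filler) for r in required]))
    return good, errors


records = [(1, 1, "a"), (1, 2, "b"),
           (2, 1, "c"), (2, 1, "c2"), (2, 3, "x"),
           (3, 1, "d"), (3, 2, "e")]
good, errors = merge_check(records, required=[1, 2])
print(good)
print(errors)
```

Running the check again on the corrected input and getting an empty error list is the sketch's counterpart of re-executing MERCHECK after step 4.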
5.1.3 Checking for Non-numeric and Invalid Variable Values
Step 5 Prepare a dictionary for all variables with appropriate instructions for dealing with blank
fields. Execute BUILD. An IDAMS dataset is output (Data file and Dictionary file). All
unexpected non-numeric values are converted to 9s and reported in the results.
Step 6 Using TABLES, print frequency distributions of all qualitative variables and minimum, maxi-
mum and mean values for quantitative variables. This gives an initial idea of the content of the
data and shows which variables have invalid codes (qualitative variables) or too large/small
values (quantitative variables). It can also be compared later with similar distributions and
values obtained after cleaning to see how data validation has affected the data.
Step 7 Prepare control statements specifying the valid codes or range of values for each variable.
These can be prepared ahead of time for all variables or alternatively, after step 6 for only
those variables which are known to have invalid codes. Use the output dataset from step 5
as input to the CHECK program to get a list of cases with invalid values. Note that the
specification of valid codes for variables can also be taken from C-records in the dictionary if
these were introduced in step 5.
Step 8 Prepare corrections for errors detected at step 5 and step 7. Use the CORRECT program
to update the IDAMS dataset created in step 5.
Note that corrections could also be done with the WinIDAMS User Interface if the number
of cases is not too large. However, using CORRECT is a less error-prone method.
Perform steps 7 and 8 until no errors are reported.
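The essence of step 7, listing the cases that hold a value outside the declared set of valid codes, can be sketched in Python as below. This only illustrates the idea and is not the CHECK program; the function name invalid_values, the dictionary layout and the sample codes (taken from the V2 and V4 descriptions used earlier in this manual) are assumptions.

```python
def invalid_values(cases, valid):
    """Return (case_number, variable, value) for every value not allowed."""
    problems = []
    for number, case in enumerate(cases, start=1):
        for var, allowed in valid.items():
            if case[var] not in allowed:
                problems.append((number, var, case[var]))
    return problems


valid = {"V2": {1, 2},             # sex: 1=male, 2=female
         "V4": range(21, 100)}     # age: 21-98, 99=not stated
cases = [{"V2": 1, "V4": 34},
         {"V2": 3, "V4": 17}]
print(invalid_values(cases, valid))
```

An empty list from such a check corresponds to the stopping rule above: repeat the check and the corrections until no errors are reported.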
5.1.4 Consistency Checking
Step 9 Prepare logical statements of the consistency checks to be performed, e.g.
PREGNANT (V32) = inapplicable if and only if SEX (V6) = Male.
Assign a result number to each consistency check and translate the logic into Recode
statements where the result is set to 1 for an inconsistency, e.g.
R1001=0
IF V6 EQ 1 AND V32 NE 9 THEN R1001=1
IF V6 NE 1 AND V32 EQ 9 THEN R1001=1
Use the set of Recode statements with CONCHECK to print cases with errors.
Step 10 Correct cases with errors as in step 8.
Perform steps 9 and 10 until no errors are reported. The data output from the final execution of CORRECT
will be ready for analysis.
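An "if and only if" rule such as the one in step 9 is violated exactly when one side holds and the other does not. The Python sketch below expresses that for the PREGNANT/SEX example; it assumes, as the Recode statements do, that 1 codes Male and 9 codes inapplicable, and the function name pregnancy_inconsistent is hypothetical.

```python
MALE = 1          # code of V6 for Male (as in the example)
INAPPLICABLE = 9  # code of V32 for "inapplicable" (as in the example)


def pregnancy_inconsistent(sex, pregnant):
    """Return 1 when 'male' and 'inapplicable' do not go together, else 0."""
    return int((sex == MALE) != (pregnant == INAPPLICABLE))


for sex, pregnant in [(1, 9), (1, 2), (2, 9), (2, 1)]:
    print(sex, pregnant, pregnancy_inconsistent(sex, pregnant))
```

Both directions of the rule are covered by the single comparison: a male with an applicable answer and a non-male with an inapplicable answer are each flagged.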
5.2 Data Management/Transformation
IDAMS contains an extensive set of facilities for generating indices, derived measures, aggregations, and
other transformations of the data, including alphabetic recoding. The most frequently used capabilities are
provided by the Recode facility, which can perform temporary operations in all analysis programs that input
an IDAMS dataset. Results of recoding can be saved as permanent variables using the TRANS program.
These facilities operate on variables within one case and permit recoding of the values of one or more
variables, generation of variables by combinations of variables, control of the sequence of these operations
through tests of logical expressions, and a number of specialized statements and functions. The necessary
new dictionary information to describe the results of the operations performed is automatically produced.
For aggregation across cases, the AGGREG program is available. AGGREG provides arithmetic sums and
related measures, ranges, and counts of valid data values within groups of cases. Typical use of AGGREG
involves the prior use of the SORMER program to order the Data le into the desired groups.
There are a number of circumstances in which it is necessary to combine the records from two different
files, for example, data collected at different points in time. As values for variables for each new wave are
received, the objective is to add them to the record containing all the previous data for the same respondent
or case. The MERGE program will accomplish this, including appropriate padding with missing data where
respondents are not found in the new wave. Similar examples occur when residuals or some form of scale
scores are generated for each case by an analysis program and need to be included with the original data.
A somewhat different combination process occurs when data from different levels of analysis are to be
combined. One illustration of this is the addition of household data to individual respondents' records. When
a dataset is ordered such that all respondents in the same household are together, MERGE will provide the
necessary duplicate record merge. A similar situation occurs when group summaries from AGGREG are to
be added to the records for each case in each respective group.
Another dataset combination process, often also termed a merge, occurs when additional cases are to be
added to a dataset. The new records must be described by the same dictionary as the original data. This
type of merge may be achieved with the SORMER program.
Sub-setting functions are available as temporary operations in most IDAMS programs (by using a filter)
to select particular cases for processing. Permanent files containing subsets of IDAMS datasets (a subset of
variables or a subset of cases, or both) may also be created. The SUBSET and TRANS programs are most
likely to be used for such tasks, although several other programs that output datasets, such as MERGE, may
also be used. Selection of cases may be done on the basis that only certain cases are logically of interest (such
as only the female respondents), or it may be done on a random basis using the Recode function RAND
with the TRANS program.
A display of the actual values stored in an IDAMS dataset is often of substantial help for checking the results
from data modication steps and indeed at any other stages. The LIST program is available for this purpose,
and allows complete listings of a selection of specific cases and variables. The selection or filtering of cases
for display may be done using combinations of several variables in logical expressions; an example would be
a selection of only records for unmarried women between 21 and 25 years of age. Numeric and alphabetic
variables from a dataset as well as variables constructed with Recode statements can be listed. The User
Interface also has an option to print the data in a table format.
5.3 Data Analysis
The paramount consideration for the user in selecting analysis programs is whether the appropriate statistical
functions are provided. Guidance on such matters is well beyond the scope of this manual. A summary of
the functions of each IDAMS analysis program can be found in the Introduction. More details are given
in the individual program write-ups. The formulas used for computing the statistics in each program, and
references are given in relevant chapters of the part Statistical Formulas and Bibliographic References.
5.4 Example of a Small Task to be Performed with IDAMS
Suppose that an IDAMS dataset contains responses to a survey questionnaire and includes the following
variables:
V11 gives the sex of the respondent according to the following code:
1. Male 2. Female 9. Not ascertained
V12 is the respondent's income in dollars (99999 = not ascertained).
V13 through V16 are attitudinal measures on different issues. The variables are each coded to reflect the
feelings of the respondent as follows:
1. Very positive 2. Positive 3. Neutral 4. Negative 5. Very negative 8. Don't know
9. Not ascertained 0. The question is irrelevant for this respondent
Suppose that only a grouping or recoding of income levels is needed of the following kind:
New code Meaning
1 Income in the range $0 to $9999
2 Income in the range $10,000 to $29,999
3 Income $30,000 and over
9 Refused, Not ascertained, Don't know
Cross-tabulations are desired between the recoded version of the income variable, V12, and each of the
attitudinal variables, V13 to V16. Only the female respondents are to be selected for this analysis.
An IDAMS setup containing the necessary control statements to perform this work is shown below. The
numbers in parentheses on the left identify each control statement and link it to the subsequent explanation.
(1) $RUN TABLES
(2) $FILES
(3) DICTIN = ECON.DIC
(4) DATAIN = ECON.DAT
(5) $RECODE
(6) R101=BRAC(V12,0-9999=1,10000-29999=2,30000-99998=3, -
(7) ELSE=9)
(8) NAME R101'GROUPED INCOME'
(9) $SETUP
(10) INCLUDE V11=2
(11) EXAMPLE OF TABLES USING ECONOMIC DATA
(12) *
(13) TABLES
(14) ROWVARS=(R101,V13-V16)
(15) ROWVAR=R101 COLVARS=(V13-V16) CELLS=(FREQS,ROWPCT) STATS=CHI
Briefly, this is what each statement does:
(1) $RUN TABLES is an IDAMS command specifying that the TABLES program is to be
executed.
(2) This statement signals the start of file definitions for the execution.
(3)&(4) The IDAMS dataset is stored in two separate files. One contains the dictionary, the other
the data.
(5) This statement signals that transformations of the data are required. The statements following
this are the specific commands to the Recode facility.
(6)(7) These two lines (an original and a continuation) form a statement to the Recode facility
indicating the desired grouping for the income variable, V12, following the scheme outlined
earlier. The result of the BRAC function is stored as result variable R101.
(8) This statement assigns a name to the variable R101.
(9) $SETUP is a command which indicates the end of Recode statements and that the TABLES
program control statements follow.
(10) This is a lter which states that the only data cases to be used are those where variable
V11 has the code value 2, for females.
(11) This is a label, which contains the text to be used to title the results.
(12) This line specifies the main parameters. Since only the asterisk is given, all the default options
for the parameters are chosen for the current execution.
(13) The word TABLES is supplied here to separate the preceding global information for the entire
execution from the specifications for individual tables that follow.
(14) This statement requests univariate frequency distributions for 5 variables.
(15) Now bivariate (2-way) tables are requested. The cells are to contain the counts (frequencies)
and row percentages; a Chi-square statistic will be printed for each table. The 2 lists of
variables following the keywords ROWVAR and COLVARS specify the variables that will be
used for the rows and columns of the tables respectively. Four tables will be produced: R101
(grouped income) by V13, V14, V15 and V16.
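For readers who want to verify by hand what statements (6), (10) and (15) ask for, the following Python sketch reproduces the filter, the income grouping and one bivariate table with row percentages. It is an illustration under assumptions, not IDAMS code: the function names and the dictionary layout of a case are invented, and the Chi-square statistic is left out.

```python
from collections import Counter


def income_group(income):
    """Grouping of V12 following the BRAC statement in lines (6) and (7)."""
    if 0 <= income <= 9999:
        return 1
    if 10000 <= income <= 29999:
        return 2
    if 30000 <= income <= 99998:
        return 3
    return 9


def crosstab(cases, col_var):
    """Count (income group, column code) pairs for female respondents only."""
    return Counter((income_group(c["V12"]), c[col_var])
                   for c in cases if c["V11"] == 2)


def row_percentages(counts):
    """Express each cell as a percentage of its row total."""
    totals = Counter()
    for (row, _), n in counts.items():
        totals[row] += n
    return {cell: 100 * n / totals[cell[0]] for cell, n in counts.items()}


cases = [{"V11": 2, "V12": 5000, "V13": 1},
         {"V11": 2, "V12": 8000, "V13": 2},
         {"V11": 1, "V12": 5000, "V13": 1},
         {"V11": 2, "V12": 99999, "V13": 3}]
counts = crosstab(cases, "V13")
print(dict(counts))
print(row_percentages(counts))
```

Repeating the call with "V14", "V15" and "V16" gives the other three tables of statement (15).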
Part II
Working with WinIDAMS
Chapter 6
Installation
6.1 System Requirements
The WinIDAMS software is available for 32-bit versions of Windows operating systems (Windows 95,
98, NT 4.0, 2000 and XP).
A Pentium II or faster processor and 64 megabytes RAM are recommended.
On all systems, you should have about 11 megabytes of free disk space before attempting to install the
WinIDAMS software in each language.
6.2 Installation Procedure
Release 1.3 of WinIDAMS is stored on CD in a self-extracting file
WinIDAMS\English\Install\WIDAMSR13E.EXE : English version
WinIDAMS\French\Install\WIDAMSR13F.EXE : French version
WinIDAMS\Portuguese\Install\WIDAMSR13P.EXE : Portuguese version
WinIDAMS\Spanish\Install\WIDAMSR13S.EXE : Spanish version
or in an equivalent downloaded file.
To install the English version:
1. Select WIDAMSR13E.EXE with Windows explorer.
2. Double-click on this file and follow the prompts.
3. At the end of the installation procedure, a dialog box appears asking: "Do you wish to install
HTML Help 1.3 update now?". It is recommended to answer YES.
The installation procedure creates two items in the Program Manager/Start menu, one for executing
WinIDAMS and one for uninstalling WinIDAMS. It also creates an icon on the desktop which is a
link/shortcut to WinIDAMS.
6.3 Testing the Installation
A Setup file containing instructions for executing 4 data management programs (CHECK, CONCHECK,
TRANS and AGGREG) and 6 data analysis programs (TABLES, REGRESSN, MCA, SEARCH, TYPOL
and RANK) is copied into the Work folder during the installation. To execute it:
Start WinIDAMS by a double-click on its icon.
You will see the WinIDAMS main window with a default application displayed in the left pane. Open
the Setups folder. There is the demo.set file with instructions for execution of the 10 programs.
Double-click the file to open it in the Setup window. Execute it from this window. Results of the
execution are sent to the file idams.lst which is immediately opened in the Results window.
The distributed version of the results is provided in the file demo.lst in the Results folder.
Compare the two versions of the results.
6.4 Folders and Files Created During Installation
6.4.1 WinIDAMS Folders
The full path name of the WinIDAMS System folder is given on the Select Destination Directory page of the
installation wizard and the following folders are created during the installation (see the Files and Folders
chapter for details):
English version French version
<WinIDAMS13-EN>\appl <WinIDAMS13-FR>\appl
<WinIDAMS13-EN>\data <WinIDAMS13-FR>\data
<WinIDAMS13-EN>\temp <WinIDAMS13-FR>\temp
<WinIDAMS13-EN>\trans <WinIDAMS13-FR>\trans
<WinIDAMS13-EN>\work <WinIDAMS13-FR>\work
Portuguese version Spanish version
<WinIDAMS13-PT>\appl <WinIDAMS13-SP>\appl
<WinIDAMS13-PT>\data <WinIDAMS13-SP>\data
<WinIDAMS13-PT>\temp <WinIDAMS13-SP>\temp
<WinIDAMS13-PT>\trans <WinIDAMS13-SP>\trans
<WinIDAMS13-PT>\work <WinIDAMS13-SP>\work
6.4.2 Files Installed
System files in the System folder
(\WinIDAMS13-EN, \WinIDAMS13-FR, \WinIDAMS13-PT, \WinIDAMS13-SP)
WinIDAMS.exe Main executable file for the WinIDAMS User Interface
Ter32.dll |
Hts32.dll | Dlls used by WinIDAMS User Interface
unesys.exe Executable file used for processing setups
Idame.mst Master file of the text data base for IDAMS programs
Idame.xrf Cross reference file of the text data base for IDAMS programs
idams.def Definition of the mapping between ddnames and file names
Graph32.exe GraphID executable file
graphid.ini Ini file used by GraphID for storing colours, fonts and co-ordinates
Idtml32.exe TimeSID executable file
idaddto32.dll Dll used by GraphID and TimeSID
IDAMSC_DLL.dll Dll used by TimeSID
Idams.chm WinIDAMS Manual help file
<pgmname>.pro Prototypes for IDAMS programs
Dictionary and data files used for examples in the Data folder
(\WinIDAMS13-EN\data, \WinIDAMS13-FR\data, \WinIDAMS13-PT\data, \WinIDAMS13-SP\data)
educ.dic
educ.dat
rucm.dic
rucm.dat
watertim.dic
watertim.dat
data.csv
tab.mat
Demonstration setup and result files in the Work folder
(\WinIDAMS13-EN\work, \WinIDAMS13-FR\work, \WinIDAMS13-PT\work, \WinIDAMS13-SP\work)
demo.set
demo.lst
6.5 Uninstallation
An uninstaller program is created during the installation procedure. The user can execute the uninstaller
either by clicking on WinIDAMS13-EN/Uninstall WinIDAMS13-EN in the Program Manager/Start menu
or by deleting the WinIDAMS Release 1.3, English version, July 2004 entry in the Add/Remove Programs
Control Panel applet. This uninstaller deletes the content of the WinIDAMS folder selected during the
installation process. It does not delete folders if they are not empty.
Chapter 7
Getting Started
7.1 Overview of Steps to be Performed with WinIDAMS
In this example, an IDAMS dictionary for the description of data collected by a questionnaire is prepared
and data for a few respondents are entered. A set of IDAMS control statements (a setup) is then prepared
and used to produce frequency distributions of Age, Sex and Education (number of years) bracketed into 4
groups. The steps below are followed:
1. Create an application environment.
2. Prepare and store an IDAMS dictionary describing the variables in the data.
3. Enter the data (this step would be eliminated if the data were prepared outside WinIDAMS).
4. Prepare and store a setup of instructions specifying what is to be done with the data.
5. Execute the IDAMS program as given in the setup.
6. Review the results and modify the setup if necessary; then repeat from step 4.
7. Print the results.
To get started, first launch WinIDAMS. You will see the WinIDAMS Main window.
7.2 Create an Application Environment
The application environment allows you to predefine full paths for three folders. All input/output files will
be opened/created by default in one of these folders. This saves you from having to enter the full folder
path.
The Data and Dictionary files: in the Data folder.
The Setup and Results files: in the Work folder.
The temporary files: in the Temporary folder.
Click on Application in the menu bar and then on New. You now see the following dialogue:
We will create a new application with the name MyAppl and with application folders C:\MyAppl\data,
C:\MyAppl\work and C:\MyAppl\temp by entering these names in the corresponding text-boxes.
For each application folder entered which does not exist, you will see a dialogue like this:
Click on Yes for each new folder and then click on OK. Now you see the WinIDAMS Main window again.
7.3 Prepare the Dictionary
We will create a dictionary to describe data records containing the following variables:
Number  Name            Width  Missing Data code
1       Identification  3
2       Age             2
3       Sex             1      9
          1 Male
          2 Female
          9 MD
4       Education       2
Press Ctrl/N or click on File/New. These commands open the New document dialogue:
The dialogue displays the list of document types used in WinIDAMS. Choose IDAMS Dictionary
file, already selected by default.
Click in the File name field and enter the name demog. Then click OK. Note that the extension .dic is
added automatically to the file name.
You now see:
the Application window;
a 2 pane window for entering variable descriptions and optional associated codes and labels. The
full Dictionary file name demog.dic is displayed in the tab.
Click on the first cell in the row of the pane for describing variables and enter the first variable number.
As soon as you begin to enter information in the row marked with an asterisk, a new row is created
just after the current row and the row you are editing displays a pencil in the row header. Pressing
Enter or Tab moves you to the next field. Now enter the variable name and width. Skip the rest of the fields
by pressing Enter or Tab and accept the description by pressing Enter or Tab on the last field. Note
that the default location is provided by WinIDAMS when the variable description row has been accepted.
When you press Enter or Tab on the last field, the pencil disappears, which means that the row has
been accepted after some rudimentary checking of the fields. The current field is now the first field of
the next row (marked with an asterisk) and you can enter the description for the 2nd variable, Age. Do
the same for variable 3, Sex, but give this variable an MD1 (missing data) code of 9 (the non-response
code).
After accepting the description of variable 3, the first field (variable number) of the row with an asterisk
becomes the current field. Click on any field of the row just entered (variable 3, Sex) to make it the
current row.
Switch to the pane for codes and their labels by clicking on the code field in the first row. Note that
this pane is synchronized with the variable selected in the pane for describing variables.
Enter 1 in the code field. Again, as soon as you begin to enter a code, a new row with an asterisk
is created just after the current row and the row you are editing displays a pencil. Press Enter to move
to the next field, then enter Male in the label field. Press Enter. The current field is now the code field of
the next row and you can enter code 2 with label Female and similarly for code 9.
Go back to the variable description pane by clicking on the variable number field of the row with an
asterisk. Enter the information for variable 4.
To delete rows, click at the side of the row and select Cut from the Edit menu.
Save the dictionary by clicking on File/Save As, and accepting the Dictionary file name demog.dic.
7.4 Enter Data
Press Ctrl/N or click on File/New. The same New document dialogue as we have seen above for the
dictionary is displayed.
Select the IDAMS Data file item from the list and enter the name of the Data file. By convention, it
is better to use the same name for the Data file and the Dictionary file which describes the data. Only
the file extension changes, .dic for the Dictionary file and .dat for the Data file. The dictionary
and data make up an IDAMS dataset. Enter demog as the file name and click on OK.
A File Open dialogue now displays the dictionaries which exist for the active application and asks you
to select the dictionary which describes the data. Select demog.dic and click Open.
A window with three panes now appears. You enter data only in the bottom pane. The 2 other panes
are synchronized for displaying the current variable description and the code labels if any. The full
Data file name demog.dat (extension .dat is added automatically) is displayed in the tab.
Note that in illustrations presented below the Application window has been closed.
Click on the first field of the row with an asterisk and type the first line of data as given below, pressing
the Enter key after entering each data value. As soon as you begin to enter data, a new row is created
just after the current row and the current row header displays a pencil which means that you are
editing this row.
After entering the value for the last variable V4 and pressing Enter, the first field of the next row
becomes the current field.
Enter the data for the 5 cases given below.
Click on File/Save to save the data in the file demog.dat.
7.5 Prepare the Setup
Press Ctrl/N or click on File/New.
Select the IDAMS Setup file item from the list and enter a name, e.g. demog1 for the Setup
file. Click OK. Note that extension .set is added automatically to the file name and the full file name
demog1.set is displayed in the tab.
You will now see an empty window for entering the setup. Type the following:
The $RUN command identifies the desired IDAMS program; the $FILES command is followed by the
specifications of the Data file and the associated Dictionary file; the $RECODE command is followed by
Recode statements (here the recoding is used to bracket years of education into 4 groups); the $SETUP
command is followed by the parameters for the task (in this case requesting univariate frequency
distributions), given according to the rules for the TABLES program.
Click on File/Save and save the setup in the file demog1.set.
7.6 Execute the Setup
From inside the Setup window, click on Execute/Current Setup. The current setup is saved in a
temporary file and executed. A dialogue appears during the execution and disappears if the execution
is successful.
The results are, by default, written into the file idams.lst. This can be changed by adding a PRINT
line under $FILES giving the name of the Results file, e.g. print=a:demog1.lst to store the results
in a file on diskette.
7.7 Review Results and Modify the Setup
The Results file is loaded automatically when the execution is finished.
The table of contents provided in the left pane allows quick location of parts of the results. Open it
by clicking idams.lst and pressing the asterisk key on the numeric keypad. Then, click on the
element you want to see.
If you want to change something in the setup while reviewing the results, then click on the tab
demog1.set and make the required modifications. Press Ctrl/E to execute.
7.8 Print the Results
Select File/Print.
Select the pages that you wish to print and click on OK.
Chapter 8
Files and Folders
8.1 Files in WinIDAMS
User files
They are created by the user with the help of tools provided by the WinIDAMS User Interface, or they
are produced by an IDAMS procedure as a final result or as output for further processing. All user files
in IDAMS are ASCII text files. Tabulation characters are allowed; they are automatically converted to the
correct number of blanks. Standard filename extensions are used by the Interface for recognizing the file
type.
Data file (*.dat). Any data file can be input to IDAMS programs providing that each case is
contained in an equal number of fixed format records. However, if a data file is used by the WinIDAMS
User Interface, then there can only be one record per case.
Records can be of variable length with a maximum of 4096 characters per case. If the first record
in the file is not the longest, then the maximum record length (RECL) must be provided on the
corresponding file specifications. Data files produced by IDAMS programs have fixed length records
with no tabulation characters. There is generally no limit to the number of cases that can be input to
an IDAMS program.
Dictionary file (*.dic). The dictionary is used to describe the variables in the data. It may, at
minimum, describe just the variables being used for a particular program execution, but it can also
describe all the variables in each data record. The record length is variable but the maximum length
is 80. If a dictionary is output by an IDAMS program, then the record length is fixed (80 characters)
with no tabulation characters.
The dictionary can be prepared, without knowing its internal format, in the Dictionary window of the
User Interface. Alternatively, it can be prepared using the General Editor and following the format
given in the Data in IDAMS chapter.
Matrix file (*.mat). IDAMS matrices for storing various statistics have fixed length (80 characters)
records with no tabulation characters.
Setup file (*.set). This file is used to store IDAMS commands, file specifications, program control
statements and Recode statements (if any). The Setup file can be prepared in the Setup window of
the User Interface. The record length is variable although the maximum is 255 characters.
Results file (*.lst). IDAMS normally writes the results into a file. The contents of this file can then
be reviewed before actually printing.
Note: In order to facilitate the work with WinIDAMS, it is advisable to use a common name for Data and
Dictionary files, and also a common name for Setup and Results files.
The user files are specified in the Setup file following the $FILES command (see The IDAMS Setup File
chapter for detailed description).
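The record length limits quoted above for user files can be checked before a file is handed to IDAMS. The following Python sketch is only an illustration and not part of WinIDAMS: the function name check_records and the returned keys are hypothetical, and it assumes one record per case for Data files, so that the 4096-character limit per case applies to each record.

```python
import os

# Maximum record lengths quoted in the text, keyed by file extension.
# Assumption: one record per case, so the per-case limit for .dat
# files is applied to each record.
MAX_RECORD_LENGTH = {".dat": 4096, ".dic": 80, ".mat": 80, ".set": 255}


def check_records(filename, records):
    """Summarise record lengths for a user file given as a list of lines."""
    ext = os.path.splitext(filename)[1].lower()
    lengths = [len(r.rstrip("\r\n")) for r in records]
    longest = max(lengths, default=0)
    limit = MAX_RECORD_LENGTH.get(ext)
    return {
        "longest": longest,
        # If the first record of a Data file is not the longest,
        # RECL has to be given on the file specifications.
        "needs_recl": ext == ".dat" and bool(lengths) and lengths[0] < longest,
        "too_long": limit is not None and longest > limit,
    }
```

The value reported under "longest" is the one that would be supplied as RECL when "needs_recl" is true.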
System files
System files are normally not accessed directly by the user. They are created during the installation process
(permanent System files), during application customization (Application files) or during the execution of
WinIDAMS procedures (temporary work files).
Permanent System files. These include the executable program files, dll files, system parameter
files, file with the on-line Manual (in HTML Help format), and setup prototype files.
System control files.
Idams.def : default file definitions providing connection between logical and physical filenames
for user files and temporary work files.
<application name>.app : one file per application containing paths of Data folder, Work folder
and Temporary folder.
lastapp.ini : file containing the name of the last application used.
graphid.ini : configuration settings for the GraphID component.
tml.ini : configuration settings for the TimeSID component.
Temporary work files. They need not concern the user since they are defined and removed
automatically. They have filename extensions .tmp and .tra.
8.2 Folders in WinIDAMS
Files used in WinIDAMS are stored in the following folders:
System files in the System folder,
Application files in the Application folder,
Data, Dictionary and Matrix files in the Data folder,
Setup files and Results files in the Work folder, and
temporary work files in the Temporary folder and Transposed folder.
Five folders, mandatory for the default application, should always be present under the <system dir>
folder. They are defined and created first during the installation process. Then, whenever WinIDAMS is
started and any of the folders is missing, it is automatically recreated.
Application folder <system dir>\appl
Data folder <system dir>\data
Temporary folder <system dir>\temp
Transposed folder <system dir>\trans
Work folder <system dir>\work
where <system dir> is the name of the System folder fixed during the installation.
For more details on how IDAMS programs use the paths defined in the application, see section
Customization of the Environment for an Application in the User Interface chapter.
Chapter 9
User Interface
9.1 General Concept
The WinIDAMS User Interface is a multiple document interface. It can display different types of
documents, such as Dictionary, Data, Setup, Results and any Text document, in separate windows and
lets you work with them simultaneously. Moreover, it provides access to execution of IDAMS setups and to
components for interactive data analysis, namely: Multidimensional Tables, Graphical Exploration of Data
and Time Series Analysis from any document window. The WinIDAMS Main window contains:
the menu bar to open drop-down menus with WinIDAMS commands or options,
the toolbar to choose commands quickly,
the status bar to display information about the active document or highlighted command/option,
the Application window, docked on the left side, to display the active application name, and folders
and documents for this application,
the document windows to display different WinIDAMS documents.
The menu bar and the toolbar have fixed, document dependent contents. The common menus are described
below while document type dependent menus are described in relevant sections.
9.2 Menus Common to All WinIDAMS Windows
The main menu bar always contains the following seven menus: File, Edit, View, Execute, Interactive,
Window and Help.
File
New Calls the dialogue box to select the type of document to be created, and to
provide its name and location.
Open After choosing the type of document, calls the dialogue box to select the
document to be opened.
Close Closes the active window.
Save Saves the document displayed in the active window.
Save As Calls the dialogue box to save the document in the active window.
Print Setup Calls the dialogue box for modifying printing and printer options.
Print Preview Displays the active document as it will look when printed.
Print Calls the dialogue box for printing the contents of the document displayed
in the active pane/window. Note that hidden parts of the document are not
printed.
Exit Terminates the WinIDAMS session.
The menu can also contain the list of up to 7 recently opened documents, i.e. documents used in previous
WinIDAMS sessions.
Edit
The availability and sometimes the title of some commands in this menu may be different in different
windows.
Undo Cancels the last action.
Redo Repeats the last cancelled action.
Cut Moves the selection to the Clipboard.
Copy Copies the selection to the Clipboard.
Paste Copies the Clipboard content to the place where the cursor is positioned.
Find Starts the Windows searching mechanism.
Replace Starts the Windows replacing mechanism.
Find again/next Looks for the next appearance of the character string displayed in the Find
dialogue box.
Note that in the Results and Text windows, the search/replace actions are activated by the Search, Search
Forward, Search Backward and Replace commands.
View
Toolbar Displays/hides toolbar.
Status Bar Displays/hides status bar.
Application Displays/hides the Application window.
Show Full Screen Displays the active window in full screen. Click the Close Full Screen icon
in the left-top corner or press Esc to go back to the previous screen.
Execute
With the exception of the Setup window, the menu has only one command, Select Setup, to select a file with
the setup to be executed.
Interactive
Through this menu, three components for interactive analysis can be accessed, namely:
Multidimensional Tables
Graphical Exploration of Data
Time Series Analysis
See relevant chapters for a detailed description of each component.
Window
The menu contains the list of opened windows and standard Windows commands for arranging them.
Help
WinIDAMS Manual Provides access to the WinIDAMS Reference Manual.
About WinIDAMS Displays information about the version and copyright of WinIDAMS and a
link for accessing the IDAMS Web page at UNESCO Headquarters.
9.3 Customization of the Environment for an Application
Names of Data folder, Work folder and Temporary folder can be defined by the user and saved in an
Application file with the application name as filename. The name of the last application used is stored by
the system and the settings defined for this application are loaded at the beginning of the following session.
These settings can be changed any time during the working session by selecting/creating and activating
another application.
Since at least one Application file is necessary for the use of WinIDAMS, a standard application called
Default is provided and will be activated when you start WinIDAMS for the first time after installation.
Defined default settings are the following:
Data folder <system dir>\data
Work folder <system dir>\work
Temporary folder <system dir>\temp
where <system dir> is the System folder name fixed during the installation. This application (stored in the
file Default.app) should neither be deleted nor modified by the user.
Application files (except Default.app) can be created, modified or deleted by the user through the
Application menu in the WinIDAMS Main window. It contains the following commands:
New Calls the dialogue box for creating a new application.
Open Calls the dialogue box to select the file containing details of the application
to be opened.
Display Calls the dialogue box to select the application file and displays the
application settings.
Close Closes the active application and opens the Default application.
Refresh Recreates the current application tree.
Creating a new application. Selection of the menu command Application/New provides a dialogue box
for entering the name of a new application as well as names of Data, Work and Temporary folders. Except
for the application name field, which is empty, all the other fields contain default values taken from the
Default application. You can type in the pathname directly or select it by moving the highlight to the
required name in the displayed tree of folders.
Press the OK button to save the application. Pressing Cancel cancels the creation of a new application and
returns to the WinIDAMS Main window with the settings displayed previously.
Opening an application. The menu command Application/Open calls the dialogue box to select an
application file to be opened and provides a list of existing applications in the Application folder. Clicking
the required file name activates the settings for this application.
Modifying an application. To modify an application, first open it and then change the values in the same
way as for creating a new application.
Displaying the settings for an application. Use the menu command Application/Display to call the
dialogue box and click the required file name.
To display settings for the active application, double-click its name in the Application window.
Deleting an application. It can be done by deleting the corresponding file. Use the menu command
Application/Open to get a list of Application files, select the file to delete and use the right button to access
the Windows Delete command. The file Default.app should not be deleted.
Resetting WinIDAMS defaults. To replace the displayed application by the default application you can
either close it using the menu command Application/Close, or select and open the Default.app file.
Closing an active application. Use the menu command Application/Close. The default application
becomes active.
IDAMS programs use the paths defined in the application to prefix any filename not beginning with
<drive>:\... or with \...
The Data folder path is prefixed to all filenames in statements with ddnames DICT..., DATA..., or
FTnn referring to matrices.
The Work folder path is prefixed to filenames in statements with ddnames PRINT or FT06.
The Temporary folder path is prefixed to names of temporary files.
Examples:
Data folder: c:\MyStudy\students\data
Specification in the setup: dictin=students2004.dic
Complete dictionary file name: c:\MyStudy\students\data\students2004.dic
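The prefixing rule can be illustrated with a short Python sketch. It is not the WinIDAMS implementation: the function name resolve and the dictionary keys "data" and "work" are hypothetical, every FTnn other than FT06 is assumed to refer to a matrix, and temporary files are left out because they are managed by the system.

```python
import re

# A filename is left untouched when it begins with "<drive>:\" or with "\".
_ABSOLUTE = re.compile(r"^(?:[A-Za-z]:\\|\\)")


def resolve(ddname, filename, app):
    """Return the complete file name for a file specification."""
    if _ABSOLUTE.match(filename):
        return filename
    name = ddname.upper()
    if name in ("PRINT", "FT06"):
        folder = app["work"]
    elif name.startswith(("DICT", "DATA", "FT")):
        # Assumption: any other FTnn refers to a matrix file.
        folder = app["data"]
    else:
        raise ValueError("ddname not covered by this sketch: " + ddname)
    return folder.rstrip("\\") + "\\" + filename


app = {"data": r"c:\MyStudy\students\data", "work": r"c:\MyStudy\students\work"}
complete_name = resolve("dictin", "students2004.dic", app)
```

With the settings of the example above, complete_name is c:\MyStudy\students\data\students2004.dic, while a name such as \results\demog1.lst would be returned unchanged.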
9.4 Creating/Updating/Displaying Dictionary Files
The Dictionary window to create, update or display an IDAMS dictionary is called when:
you create a new Dictionary file (the menu command File/New/IDAMS Dictionary file or the toolbar
button New),
you open a Dictionary file (with extension .dic) displayed in the Application window (double-click on
the required file name in the Datasets list),
you open a Dictionary file (with any extension) which is not in the Application window (the menu
command File/Open/Dictionary or the toolbar button Open).
This window provides two panes: one for the variable definitions (Variables pane) and another for the codes
and code labels of the current variable (Codes pane). A blue line at the top of each pane indicates which
pane is active.
The column headings in the Variables pane have the following meaning:
Number Variable number.
Name Variable name.
Loc, Width Starting location and field width of the variable in the Data file.
Dec Number of decimal places; blank implies no decimal places.
Type Type of variable (N = numeric, A = alphabetic).
Md1 First missing data code for numeric variables.
Md2 Second missing data code for numeric variables.
Refe Reference number.
StId Study ID.
For more details, see section The IDAMS Dictionary in Data in IDAMS chapter. Note that only dictio-
naries describing data with one record per case can be created, updated or displayed using the Dictionary
window.
Changing the pane appearance. The appearance of each pane can be changed separately and the changes
apply exclusively to the active pane.
The following modification possibilities are available in each pane:
Increasing the font size - use the toolbar button Zoom In.
Decreasing the font size - use the toolbar button Zoom Out.
Resetting default font size - use the toolbar button 100%.
Increasing/Decreasing the width of a column - place the mouse cursor on the line which separates two
columns in the column heading until the cursor becomes a vertical bar with two arrows and move it
to the right/left holding the left mouse button.
The Variables pane can further be modied as follows:
Increasing/Decreasing the height of rows - place the mouse cursor on the line which separates two rows
in the row heading until the cursor becomes a horizontal bar with two arrows and move it down/up
holding the left mouse button.
Defining a variable. Place the cursor in the Variables pane, fill in the variable number (at least one is
mandatory, subsequent variables will be numbered by adding the value 1), name (optional), location (if not
supplied, 1 will be assigned to the first variable and for subsequent variables, location will be calculated
by adding the width of the preceding variable to its location) and width (mandatory). Other fields have
default values (which you can either accept or modify) or they are optional and can be left blank. Press
Enter or Tab to accept a value in a field and move to the next field, or Shift/Tab to move to the previous
field. Note that as long as a little pencil appears in the row heading, the row is not saved. Press Enter to
accept the complete variable definition. An asterisk in the row heading indicates that this is the next row
and you can enter a new variable description.
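The default numbering and location rules can be written out as a small Python sketch. It only illustrates the arithmetic described above; the function name complete_variables and the dictionary keys number, loc and width are hypothetical and do not reflect the internal dictionary format.

```python
def complete_variables(rows):
    """Fill in missing variable numbers and starting locations."""
    completed = []
    prev = None
    for row in rows:
        number = row.get("number")
        loc = row.get("loc")
        if prev is None:
            # At least the first variable number is mandatory.
            if number is None:
                raise ValueError("the first variable needs a number")
            if loc is None:
                loc = 1
        else:
            if number is None:
                number = prev["number"] + 1
            if loc is None:
                # Next location = preceding location + preceding width.
                loc = prev["loc"] + prev["width"]
        var = {**row, "number": number, "loc": loc}
        completed.append(var)
        prev = var
    return completed


variables = complete_variables([
    {"number": 1, "name": "ID", "width": 3},
    {"name": "Age", "width": 2},
    {"name": "Sex", "width": 1},
])
```

Here the three variables receive the numbers 1, 2, 3 and the starting locations 1, 4, 6.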
Defining the codes and code labels for a variable. Switch to the Codes pane and fill in the code and label
fields. Fill in the code value, then press Enter or Tab and fill in the code label, then Enter or Tab to accept
the row and move to the next row. When all codes and labels have been defined, switch back to the Variables
pane to continue with another variable definition.
Modifying a field in either the Variables pane or the Codes pane. Click the field and enter the new value
(entering the first character of the new value clears the field). After a double-click on a field, its current
value can be partly modified. The Esc key may be used to restore the previous value.
Editing operations can be performed on one row or on a block of rows. To mark one row, click any field
of this row. A triangle appears in the row heading and the row is coloured in dark blue. To mark a block of
rows, place the mouse cursor in the row heading where you want to start marking and click the left mouse
button. The row becomes yellow, indicating that it is active. Then move the mouse cursor up or down to
the row where you want to end marking and click the left mouse button holding the Shift key. Marked rows
become dark blue, and the yellow colour shows the active row.
You can Cut, Copy and Paste marked row(s) using the Edit commands, equivalent toolbar buttons or
shortcut keys Ctrl/X, Ctrl/C and Ctrl/V respectively.
Using the right mouse button you can Insert Before, Insert After, Delete or Clear the active row (even when
a block of rows is marked).
Detecting errors in a dictionary. Use the menu command Check/Validity. Errors are signaled one by
one and can be corrected once they have all been displayed. Moreover, the Interface tries to prevent you from
saving dictionaries with errors. Also, when you open a dictionary with errors, their presence is signaled
before the dictionary is actually opened.
9.5 Creating/Updating/Displaying Data Files
The Data window is used to create, update or display an IDAMS Data file. Note that the corresponding
Dictionary file must already have been constructed and that only Data files with one record per case can be
created, updated or displayed using the Data window. This window is called when:
you create a new Data file (the menu command File/New/IDAMS Data file or the toolbar button
New),
you open a Data file (with extension .dat) displayed in the Application window (double-click on the
required file name in the Datasets list),
you open a Data file (with any extension) which is not in the Application window (the menu command
File/Open/Data or the toolbar button Open).
The window is divided into 3 panes: one displaying the codes and code labels of the current variable (Codes
pane), the second displaying variable definitions (Variables pane) and the third providing place for data
entry/modification (Data pane). Only the Data pane can be edited. The other two panes just display
the relevant information. A blue line at the top of each pane indicates which pane is active. The panes
are synchronized, i.e. selection of a variable field in the Data pane highlights the corresponding variable
description, and selection of a field in the Variables pane shows the corresponding variable value in the
current case. For the selected variable, codes and code labels (if any) are always displayed.
Changing the pane appearance. The appearance of each pane can be changed separately and the changes
apply exclusively to the active pane.
The following modification possibilities are available in all panes:
Increasing the font size - use the menu command View/Zoom In or the toolbar button Zoom In.
Decreasing the font size - use the menu command View/Zoom Out or the toolbar button Zoom Out.
Resetting default font size - use the menu command View/100% or the toolbar button 100%.
Increasing/Decreasing the width of a column - place the mouse cursor on the line which separates two
columns in the column heading until the cursor becomes a vertical bar with two arrows and move it
to the right/left holding the left mouse button.
The Data pane can be modied further as follows:
Increasing/Decreasing the height of rows - place the mouse cursor on the line which separates two rows
in the row heading until the cursor becomes a horizontal bar with two arrows and move it down/up
holding the left mouse button.
Placing column(s) at the beginning - mark the required column(s) and use the menu command View/Freeze
Columns (use the menu command View/Unfreeze Columns to put them back).
Displaying data in a multiple pane - use the menu command Window/Split. You are provided with a
cross to determine the size of four panes. This size can be changed later using the standard Windows
technique. Your entire data are displayed four times. The horizontal split can be removed by a double-
click on the horizontal line, the vertical split can be removed by a double-click on the vertical line, and
the whole split can be removed by a double-click on the split centre.
Entering a new case. Click the first field in an empty row and start entering data values. Press Enter
or Tab to accept a data value for the variable and move to the next variable, or Shift/Tab to move to the
previous variable. Note that as long as a little pencil appears in the row heading, the case is not saved.
Pressing Enter on the last variable saves the case and moves the cursor to the beginning of next row. A new
row can be inserted before or after the highlighted row (click on the right mouse button), or can be added
at the end of file (row with an asterisk in the row heading).
Data entry can be facilitated taking advantage of two options given in the Options menu:
Code Checking checks data values during data entry against codes defined in the dictionary, being the
only codes considered valid.
AutoSkip moves the cursor automatically to the next field once enough digits have been entered to fill the
field. If not selected, you have to press Enter or Tab to move to the next field.
Modifying a variable value. Click the variable field and enter the new value (entering the first character
of the new value clears the field). A double-click on a variable field can be used to modify part of the current
value. The Esc key may be used to restore the previous value.
Copying a variable value to another field. Click the variable field and copy its content to the Clipboard
(Edit/Copy command, Ctrl/C or Copy button in the toolbar). Then click the required field and paste the
value (Edit/Paste command, Ctrl/V or Paste button in the toolbar). The menu command Edit/Undo Case
may be used to restore the previous value.
Editing operations on one row or on a block of rows can be performed in the same way as in the Dictionary
window. To mark one row, click any field of this row. A triangle appears in the row heading and the row is
coloured in dark blue. To mark a block of rows, place the mouse cursor in the row heading where you want
to start marking and click the left mouse button. The row becomes yellow, indicating that it is active.
Then move the mouse cursor up or down to the row where you want to end marking and click the left mouse
button holding the Shift key. Marked rows become dark blue, and the yellow colour shows the active row.
You can Cut, Copy and Paste marked row(s) using the Edit commands, equivalent toolbar buttons or
shortcut keys Ctrl/X, Ctrl/C and Ctrl/V respectively.
Using the right mouse button you can Insert Before, Insert After, Delete or Clear the active row (even when
a block of rows is marked).
Two data management commands are provided in the Management menu to allow for data verification
and sorting:
Check Codes checks data values for all cases in the Data file against codes defined in the dictionary, being
the only codes considered valid. At the end of verification, a message showing the number of errors
found is displayed and you are invited to correct them one by one using the data correction dialogue
box. This box provides case sequential number, variable number and name, invalid code value and a
drop-down list of valid codes as defined in the dictionary.
Sort calls the sort dialogue box to specify up to 3 sort variables and corresponding sort order for each of
them. After clicking OK, the sorted file appears in the Data pane.
Sorting the data on one variable (one column) can also be done by a double-click on the variable number
in the Data pane heading. One double-click sorts cases in ascending order. To get the sort in descending
order, repeat the double-click.
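The logic behind the Check Codes and Sort commands can be sketched in Python as follows. This is an illustration only, not the WinIDAMS code: the function names, the representation of cases as dictionaries keyed by variable name, and the (variable, descending) pairs are all hypothetical.

```python
def check_codes(cases, valid_codes):
    """Return (case sequential number, variable, value) for each invalid code."""
    errors = []
    for seq, case in enumerate(cases, start=1):
        for var, codes in valid_codes.items():
            if case[var] not in codes:
                errors.append((seq, var, case[var]))
    return errors


def sort_cases(cases, keys):
    """Sort on up to 3 (variable, descending) pairs, most significant first."""
    if len(keys) > 3:
        raise ValueError("at most 3 sort variables")
    result = list(cases)
    # A stable sort is applied from the least to the most significant key.
    for var, descending in reversed(keys):
        result.sort(key=lambda case: case[var], reverse=descending)
    return result


cases = [{"V3": 1, "V4": 12}, {"V3": 2, "V4": 8}, {"V3": 7, "V4": 12}]
errors = check_codes(cases, {"V3": {1, 2, 9}})
ordered = sort_cases(cases, [("V4", True), ("V3", False)])
```

In this example the third case is reported as having the invalid code 7 for V3, and the cases are ordered by descending V4 and, within equal V4, by ascending V3.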
Two types of graphics are proposed for a variable in the menu Graphics.
Bar Chart provides a bar chart based on either frequencies or percentages for qualitative variable categories.
For quantitative variables, the user defines the number of bars (NB) on both sides of the mean (M) and
a coefficient (C) for calculating bar (class) width. The bar width (BW) is equal to the value of standard
deviation (STD) multiplied by the coefficient (BW=C*STD). The bars are constructed using the values
M-NB*BW, ..., M-2*BW, M-BW, M, M+BW, M+2*BW, ..., M+NB*BW. The height of a rectangle = (relative
frequency of class)/(class width). In addition, a normal distribution curve having the calculated mean and
standard deviation can be projected for quantitative variables.
Histogram, meant for quantitative variables, provides a histogram based either on frequencies or on
percentages with the number of bins specified by the user.
Graphics for quantitative variables contain also univariate statistics for the projected variable such as: mean,
standard deviation, variance, skewness and kurtosis. Variables with decimal places are multiplied by a scale
factor in order to obtain integer values. In this case, mean value, standard deviation and variance should be
adjusted accordingly.
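The construction of bars for quantitative variables in the Bar Chart can be illustrated by the following Python sketch. The function names are hypothetical, and two points not stated in the text are assumptions: classes are taken as half-open intervals [lower, upper), and values outside the outermost boundaries are ignored.

```python
def bar_classes(mean, std, nb, c):
    """Class boundaries M-NB*BW, ..., M, ..., M+NB*BW with BW = C*STD."""
    bw = c * std
    return [mean + k * bw for k in range(-nb, nb + 1)]


def bar_heights(values, boundaries):
    """Height of a rectangle = (relative frequency of class) / (class width)."""
    n = len(values)
    heights = []
    for lower, upper in zip(boundaries, boundaries[1:]):
        # Assumption: a class contains the values with lower <= v < upper.
        count = sum(1 for v in values if lower <= v < upper)
        heights.append(count / n / (upper - lower))
    return heights


boundaries = bar_classes(mean=10.0, std=2.0, nb=2, c=0.5)
heights = bar_heights([8.5, 9.5, 9.7, 10.2], boundaries)
```

With M=10, STD=2, NB=2 and C=0.5 the bar width is 1 and the boundaries are 8, 9, 10, 11, 12, giving four bars, two on each side of the mean.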
9.6 Importing Data Files
WinIDAMS provides a tool for importing data files to IDAMS directly through the WinIDAMS User
Interface. This facility can be accessed in the WinIDAMS Main window, the Data window and the
Multidimensional Tables window.
Three types of free format files can be imported:
.txt files in which fields are separated by tabs,
.csv files in which fields are separated by commas,
.csv files in which fields are separated by semicolons.
Information provided in the first row is considered to be column labels and is used as variable names during
the dictionary construction process. Thus, the presence of column labels is mandatory in the first row of
input files.
Also the separation character is determined from the first line while the character used as decimal separator
is detected from the second line (first data line) of the file. Thus, if a variable is expected to have decimal
values, it should be shown in the first data line.
During the import process, contents of imported alphabetic variables can be changed to numeric codes,
keeping the alphabetic values as code labels in the created IDAMS dictionary. Commas used as decimal
separator for numeric variables are changed to points.
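How such detection and recoding might work is sketched below in Python. The manual does not describe the actual algorithm, so everything here is an assumption made for illustration: the function names are hypothetical, the field separator is taken to be the candidate occurring most often in the first line, a comma is accepted as decimal separator only when it is not the field separator, and numeric codes are assigned in order of first appearance.

```python
import re


def detect_separators(first_line, second_line):
    """Guess the field separator and the decimal separator."""
    # Field separator: the candidate occurring most often in the header line.
    sep = max(["\t", ",", ";"], key=first_line.count)
    decimal = "."
    if sep != ",":
        for field in second_line.rstrip("\r\n").split(sep):
            if re.fullmatch(r"-?\d+,\d+", field.strip()):
                decimal = ","
                break
    return sep, decimal


def recode_alphabetic(values):
    """Replace alphabetic values by numeric codes; keep the values as labels."""
    code_of = {}
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(code_of) + 1
        codes.append(code_of[value])
    labels = {code: value for value, code in code_of.items()}
    return codes, labels


sep, decimal = detect_separators("Name;Age;Height\n", "Ann;31;1,72\n")
codes, labels = recode_alphabetic(["Male", "Female", "Male"])
```

The example shows why a decimal value has to be present in the first data line: without the value 1,72 the decimal separator would be taken to be a point.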
The Data Import operation is activated with the command File/Import, followed by selection of required
file in the standard file Open dialogue box. The separation character and the character used as decimal
separator are displayed together with values of all fields for the first three cases. Data reading can then be
checked before launching the import. Afterwards, you are provided with two windows called External data
and Variables Definition, both having form of a spreadsheet.
The External data window only displays the contents of the file to import. No editing operations are
allowed, except copying a selection to the Clipboard.
The Variables Definition window serves for preparing IDAMS variable descriptions. Its initial content
is provided by default and on the basis of the imported data, but you are free to change and to complete it
as necessary.
The columns contain the following information:
Description Variable name.
Type Type of variable (numeric by default). This is the input variable type. If
an input variable is alphabetic and should be output as numeric, ask for
recoding (see below).
MaxWidth Maximum field width of the variable.
NumDec Number of decimal places; blank implies no decimal places.
Md1 First missing data code for numeric variables.
Md2 Second missing data code for numeric variables.
Recoding Requesting a recoding of alphabetic variables to numeric values.
To modify variable definitions, place the cursor inside the window. Then use the navigation keys or the
mouse to move to the required field and change its contents.
Use the menu command Build/IDAMS Dataset to create IDAMS Dictionary and Data les. They will both
be placed in the Data folder of the current application.
9.7 Exporting IDAMS Data Files
WinIDAMS also has a tool for exporting IDAMS Data files directly through the WinIDAMS User Interface.
This can be done from the Data window using the command File/Export. The IDAMS Data file displayed
in the active window can be saved in one of the three types of free format data files:
- .txt files in which fields are separated by tabs,
- .csv files in which fields are separated by commas,
- .csv files in which fields are separated by semicolons.
Variable names from the corresponding Dictionary file are output in the first row of the exported data as
column labels.
If code labels exist for a variable, numeric code values can be optionally replaced by their corresponding
code label in the output data file. Moreover, numeric variables can be output with comma used as decimal
separator.
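To make the export options concrete, the following is a minimal Python sketch of what a semicolon-separated export with decimal commas and code labels could look like. It is not WinIDAMS code: the function name export_rows, its arguments and the sample variables are hypothetical and serve only to illustrate the layout described above.

```python
import csv
import io

def export_rows(names, rows, delimiter=";", decimal_comma=False, code_labels=None):
    """Return exported text: a header row of variable names, then the data rows.

    code_labels (hypothetical) maps a variable name to a {code: label} dict.
    """
    code_labels = code_labels or {}
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(names)  # variable names become the column labels
    for row in rows:
        cells = []
        for name, value in zip(names, row):
            labels = code_labels.get(name, {})
            if value in labels:
                cells.append(labels[value])  # replace code by its label
            elif isinstance(value, float) and decimal_comma:
                cells.append(str(value).replace(".", ","))  # decimal comma
            else:
                cells.append(str(value))
        writer.writerow(cells)
    return out.getvalue()

text = export_rows(
    ["SEX", "INCOME"],
    [(1, 1250.5), (2, 980.0)],
    decimal_comma=True,
    code_labels={"SEX": {1: "Male", 2: "Female"}},
)
print(text)
```

The semicolon variant matters mainly when the decimal comma is requested, since a comma-separated file would otherwise need quoting around every decimal value.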
9.8 Creating/Updating/Displaying Setup Files
The Setup window to prepare or to display an IDAMS Setup file is called when:
- you create a new Setup file (the menu command File/New/IDAMS Setup file or the toolbar button
  New),
- you open a Setup file (with extension .set) displayed in the Application window (double-click on the
  required file name in the Setups list),
- you open a Setup file (with any extension) which is not in the Application window (the menu command
  File/Open/Setup or the toolbar button Open).
The window provides two panes: the top one is for preparing the Setup file itself (Setup pane) and the
bottom one for displaying error messages when filter and Recode statements are checked (Messages pane).
Only the Setup pane can be edited. Note that IDAMS commands are displayed in bold and program names
in pink if they are spelled correctly. Text put on a $comment command is displayed in green.
To prepare a new program setup, you can either type in all statements or you can use the prototype
setup for the required program and modify it as necessary. Prototype setups are provided for all programs.
They can be accessed by selecting the program name in the list under the toolbar button Prototype. To copy
the prototype to the Setup pane, click the required program name. For details on how to prepare setups,
see the chapter The IDAMS Setup File and the relevant program write-up.
Editing operations can be performed as with any ASCII file editor, i.e. you can Cut, Copy and Paste any
selection, using the Edit commands, equivalent toolbar buttons or shortcut keys Ctrl/X, Ctrl/C and Ctrl/V
respectively.
Two setup verification commands are provided in the Check menu to allow for syntax verification of
sets of Recode statements and filter statements:
Recode Syntax activates verification of syntax in Recode statements included in the setup. All errors
found are reported in the Messages pane giving the Recode set number, erroneous statement line and
character(s) causing the syntax problem. A double-click on the erroneous line text or on the error
message in the Messages pane shows this line in the Setup pane with a yellow arrow. You can correct
the errors and repeat syntax verification, before passing the setup for execution.
Filter Syntax activates verification of syntax in filter statements included in the setup. All errors
found are reported in the Messages pane giving the filter statement number, erroneous statement line
and character(s) causing the syntax problem. A double-click on the erroneous line text or on the error
message in the Messages pane shows this line in the Setup pane with a yellow arrow.
Note that although most syntax errors in filter and Recode statements can be detected and corrected here,
another syntax verification is systematically performed by IDAMS during setup execution. Also, execution
errors, which cannot be detected here, are reported in the results.
9.9 Executing IDAMS Setups
To execute IDAMS program(s) (for which instructions have been prepared and saved in a Setup file), use
the menu command Execute/Select Setup in any WinIDAMS document window. You are asked, through
the standard Windows dialogue box, to select the file from which instructions should be taken for execution.
If you are preparing your instructions in the Setup window, you can execute programs from the Current
Setup using the menu command Execute/Current Setup.
The program(s) will be executed and the results written to the file specified for PRINT under $FILES (the
default is IDAMS.LST in the current Work folder). At the end of execution, the Results file will be opened
in the Results window.
9.10 Handling Results Files
The Results window to access, display and print selected parts of the results is called when:
- you open a Results file (with extension .lst) displayed in the Application window (double-click on the
  required file name in the Results list),
- you open a Results file (with any extension) which is not in the Application window (the menu command
  File/Open/Results or the toolbar button Open),
- you execute an IDAMS setup; the contents of the Results file are displayed automatically.
Quick navigation in the results is facilitated through their table of contents. You can access the beginning
of particular program results or even a particular section. Moreover, the menu Edit provides access to a
searching facility.
The window is divided into 3 panes: one showing the table of contents (TOC) of the results as a structure
tree, the second displaying the results themselves and the third displaying error messages and warnings
included in the results.
By default, the pagination of results done by programs is retained (the Page Mode option in the check box
of View menu is marked). To make the results more compact, unmark this option. Trailing blank lines will
be removed from all pages and page breaks inserted by programs will be replaced by Page break text line.
To open/close quickly the TOC tree, three buttons on the numeric pad are available:
* opens all levels of the tree under the selected node
- closes all levels of the tree under the selected node
+ opens one level under the selected node.
To view a particular part of the results, double-click on its name in the TOC.
To locate an error message or a warning, double-click its text.
Modification of the results is not allowed. However, selected parts (highlighted or marked in tick-boxes
in the TOC tree) or all the results can be copied to the Clipboard (Edit/Copy command, Ctrl/C or Copy
button in the toolbar) and pasted to any document using standard Windows techniques.
Printing the whole contents or selected pages of the results can be done through the menu command
File/Print or using the Print toolbar button. Note that printing is done in Landscape orientation, and this
orientation cannot be changed.
The contents of the Results file as displayed can be saved in RTF or in text format using the menu command
File/Save As. Trailing blank lines are always removed. Page breaks are handled according to the Page Mode
option.
9.11 Creating/Updating Text and RTF Format Files
WinIDAMS has a General Editor which allows you to open and modify any type of document in character
format. However, its basic function is to provide a facility for editing Text files and to offer sophisticated
formatting and editing features. Manipulation of Dictionary, Data or Setup files using the General Editor
should be avoided, and manipulation of Matrix files should be performed with caution.
The Text window is called when:
- you create a new Text file (the menu command File/New/Text file or RTF file, or the toolbar button
  New),
- you open a Matrix file (with extension .mat) displayed in the Application window (double-click on the
  required file name in the Matrices list),
- you open any character file which is not in the Application window (the menu command File/Open/File
  Using General Editor or the toolbar button Open).
The General Editor provides a number of standard editing commands which are known to Windows users.
They are listed below but will not be described in detail.
Insert provides commands for inserting page and section breaks, picture, OLE object (Object Linking &
Embedding), frame and drawing object.
Font commands allow you to change font and colour of selected text, and the colour of its background.
Paragraph commands enable you to align paragraphs differently, to indent them, to display them in
double space, and to draw a border around and shade the background.
Table gives access to a number of commands to insert and manipulate tables.
View contains three additional commands to display the active document in page mode, to display the ruler
and the paragraph marker.
Formatting toolbar allows you to choose quickly formatting commands that are used most frequently.
Part III
Data Management Facilities
Chapter 10
Aggregating Data (AGGREG)
10.1 General Description
AGGREG aggregates individual records (data cases) into groups defined by the user and computes summary
descriptive statistics on specified variables for each group. The statistics include sums, means, variances,
standard deviations, as well as minimum and maximum values and the counts of non-missing data values. An
output IDAMS dataset is created, i.e. the grouped (aggregated) data file described by an IDAMS dictionary;
the aggregated data file contains one record (case) per group with variables that are the summary to the
group level of each of the selected input variables.
Formulas for calculating mean, variance and standard deviation can be found in Part Statistical Formulas
and Bibliographic References, chapter Univariate and Bivariate Tables. However, they need to be adjusted
since cases are not weighted and the coefficient N/(N-1) is not used in computation of sample variance and/or
standard deviation. Note that the summary statistics are selected for the entire set of aggregate variables.
Thus, if there were 2 aggregate variables and if 3 statistics were selected, there would be 6 computed variables.
AGGREG enables the user to change the level of aggregation of data e.g. from individual family members to
household, or from district to regional level, etc. For example, suppose a data file contains records on every
individual in a household and that we wish to analyze these data at the household level. AGGREG would
permit us to aggregate values of variables across all the individual records for each household to create a file
of household level records for further analysis. If, to be more specific, the individual level data file contained
a variable giving the person's income, AGGREG could create household level records with a variable on the
total household income.
Grouping the data. The user specifies up to 20 group definition (ID) variables which determine the
level of aggregation for the output file. For example, if one wanted to aggregate individual level data to
the household level, a variable identifying the household would be the group definition variable. Each time
AGGREG reads an input record, it checks for a change in any of the ID variables. When this is encountered,
a record is output containing the summary statistics on the specified aggregate variables for the group of
records just processed.
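The change-detection behaviour described above can be illustrated with a short Python sketch. This is not AGGREG itself; the function aggregate_sums and the sample records are hypothetical, and only the sum is computed. It shows why records with the same ID values must be contiguous: a run that is interrupted produces a second output record.

```python
from itertools import groupby

def aggregate_sums(records, id_vars, agg_var):
    """Yield one (id_values, total) pair per run of contiguous records
    sharing the same values on all ID variables."""
    def key(rec):
        return tuple(rec[v] for v in id_vars)
    for ids, group in groupby(records, key=key):
        yield ids, sum(rec[agg_var] for rec in group)

# Hypothetical individual-level records: V1 = household ID, V5 = income.
records = [
    {"V1": "001", "V5": 100},
    {"V1": "001", "V5": 250},
    {"V1": "002", "V5": 300},
    {"V1": "001", "V5": 50},   # out of sort order: starts a new group
]
result = list(aggregate_sums(records, ["V1"], "V5"))
print(result)
```

Sorting the records on the ID variables first would merge the two "001" runs into a single household record.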
Inserting constants into the group records. Constants can be inserted into each group record using
the parameters PAD1, ... , PAD5, which specify so-called pad variables. The value of a pad variable is a
constant.
Transferring variables. Variables can be transferred to the output group records. Note that only the
values of the first case in the group are transferred.
10.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of the cases from the input
data. ID variables defining the groups and the variables to be aggregated are specified with the parameters.
The ID variables are automatically included in the output group dataset.
Transforming data. Recode statements may be used.
Treatment of missing data. Each aggregate variable value is compared to both missing data codes and if
found to be a missing data value, is automatically excluded from any calculation. A user-supplied percentage,
the cutoff point (see the parameter CUTOFF) determines the number of missing data values allowed before
the summarization value is output as a missing data code. Thus, for example, suppose the mean value of an
aggregate variable within a group was to be computed, and the group contained 12 records and 6 of them
had missing data values, i.e. 50%. If the CUTOFF value was 75%, the mean of the 6 non-missing values
would be calculated and output for that group. If the CUTOFF value was 25%, however, the mean would
not be calculated and the first missing data code would be output.
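The CUTOFF rule can be sketched in Python as follows. The sketch is illustrative only and is not the AGGREG implementation. The function group_mean is hypothetical, and two points are assumptions not stated in the text: a missing-data percentage exactly equal to CUTOFF is treated as allowed, and a group with no valid values always yields the missing data code.

```python
def group_mean(values, md_codes, cutoff_pct, md1):
    """Return the mean of the non-missing values, or md1 when the share of
    missing values in the group exceeds cutoff_pct."""
    if not values:
        return md1
    valid = [v for v in values if v not in md_codes]
    missing_pct = 100.0 * (len(values) - len(valid)) / len(values)
    # Assumption: a percentage equal to the cutoff is still allowed.
    if missing_pct > cutoff_pct or not valid:
        return md1
    return sum(valid) / len(valid)

# The example from the text: 12 records, 6 of them missing (code 99).
values = [1, 2, 3, 4, 5, 6] + [99] * 6
mean_75 = group_mean(values, {99}, cutoff_pct=75, md1=99)
mean_25 = group_mean(values, {99}, cutoff_pct=25, md1=99)
print(mean_75, mean_25)
```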
10.3 Results
Missing data summary. (Optional: see the parameter PRINT). For each variable in each group, the input
variable number, the output variable number, the number of records with substantive data (i.e. non-missing
data) and the percentage of records with missing data are printed.
Group summary. (Optional: see the parameter PRINT). The number of input records for each group.
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Output dictionary. (Optional: see the parameter PRINT).
Statistics. (Optional: see the parameter PRINT). All of the computed variables can be printed for each
aggregate record. The variable number of the corresponding aggregate variable and the ID variables are also
given.
10.4 Output Dataset
The grouped output dataset is a Data file, described by an IDAMS dictionary. Each record contains values of
the ID variables, computed variables, transferred variables and pad constants; there is one record produced
for each group.
Variable sequence and variable numbers. The output variables are in the same relative order as
the input variables from which they were derived, regardless of whether the input variable is used as an ID,
aggregate, or variable to be transferred. Thus, if the first variable in the input is used, the variable(s) derived
from it will be the first output variable(s). Each input variable used as an ID or variable to be transferred
corresponds to one output variable; each aggregate variable corresponds to from 1 to 7 output variables,
according to the number of summary statistics requested (these variables are output in the relative order:
sum, mean, variance, standard deviation, count, minimum, maximum). The output variables are always
renumbered, starting with the number supplied in the parameter VSTART. Pad constants always come last.
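As an illustration of the ordering rule, the following Python sketch lists the output variables for the case where IDVARS=(V5,V7), TRANSVARS=(V10,V11), AGGV=(V31,V41-V43), STATS=(SUM,MEAN,SD) and VSTART=1001. It is a sketch under stated assumptions, not AGGREG's algorithm: the function output_layout is hypothetical, input order is taken to be ascending variable number, output numbers are assumed to be consecutive, and recoded (R) variables are left out because their position in the sequence is not specified here.

```python
# Relative order in which statistics are output for each aggregate variable.
STAT_ORDER = ["SUM", "MEAN", "VARIANCE", "SD", "COUNT", "MIN", "MAX"]

def output_layout(used_vars, agg_vars, stats, vstart=1, n_pads=0):
    """Return (output variable number, description) pairs."""
    names = []
    for var in sorted(used_vars):          # same relative order as the input
        if var in agg_vars:
            names.extend(f"V{var} {stat}" for stat in STAT_ORDER if stat in stats)
        else:                              # ID or transferred variable
            names.append(f"V{var}")
    # Pad constants always come last.
    names.extend(f"Pad variable {i}" for i in range(1, n_pads + 1))
    # Assumption: renumbering is consecutive from VSTART.
    return [(vstart + offset, name) for offset, name in enumerate(names)]

layout = output_layout(
    used_vars=[5, 7, 10, 11, 31, 41, 42, 43],
    agg_vars={31, 41, 42, 43},
    stats={"SUM", "MEAN", "SD"},
    vstart=1001,
    n_pads=1,
)
for number, name in layout:
    print(number, name)
```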
Variable names. The output variables have the same names as input variables from which they were
derived except that for the aggregate variables, the 23rd and 24th characters of the name field are coded:
S = sum
M = mean
V = variance
D = standard deviation
CT = count
MN = minimum
MX = maximum.
Pad constants are given names Pad variable 1, Pad variable 2, etc.
Variable type. ID variables and transferred variables are output in their input type. Computed variables
are always output as numeric.
Field width and number of decimals. Field widths for output aggregated variables depend on the
statistic, the input field width (FW), the input number of decimal places (ND) and the extra decimal places
requested by the user with the DEC parameter. Field widths and decimal places are assigned as shown below,
where FW=input field width and ND=input number of decimal places for input variables, and FW=6 and
ND=0 for recoded variables.
Statistic   Field Width    Decimal Places
SUM         FW + 3 *       ND
MEAN        FW + DEC **    ND + DEC ***
VARIANCE    FW + DEC **    ND + DEC ***
SD          FW + DEC **    ND + DEC ***
MIN         FW             ND
MAX         FW             ND
COUNT       4              0
* If the field width exceeds 9, then it is reduced to 9.
** If the field width exceeds 9, then the number of extra decimals (DEC) is reduced accordingly.
*** If the number of decimals exceeds 9, then DEC is reduced accordingly.
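The table and its footnotes can be read as the following Python sketch. It is one possible interpretation, not the program's code: the function output_format is hypothetical, and it assumes that a reduction of DEC forced by either limit applies to both the field width and the number of decimal places.

```python
def output_format(stat, fw, nd, dec=2):
    """Return (field width, decimal places) for a computed variable.

    fw, nd: input field width and decimals (FW=6, ND=0 for recoded variables).
    dec:    extra decimals requested with the DEC parameter.
    """
    if stat == "COUNT":
        return 4, 0
    if stat in ("MIN", "MAX"):
        return fw, nd
    if stat == "SUM":
        return min(fw + 3, 9), nd          # footnote *
    # MEAN, VARIANCE and SD: reduce DEC so that neither limit is exceeded
    # (assumption: one reduced DEC is used for both width and decimals).
    extra = min(dec, max(0, 9 - fw))       # footnote **
    extra = min(extra, max(0, 9 - nd))     # footnote ***
    return fw + extra, nd + extra

print(output_format("SUM", 7, 2))          # width capped at 9
print(output_format("MEAN", 5, 1, dec=2))  # room for both extra decimals
print(output_format("SD", 8, 0, dec=3))    # DEC reduced from 3 to 1
```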
Missing data codes. Missing data codes for ID variables and transferred variables are taken from the
input dictionary. The second missing data code (MD2) for the computed variables is always blank. The
value of the first missing data code (MD1) is allocated as follows:
Output variable    Output MD1
Output FW <= 7     9s
Output FW > 7      -999999
COUNT variable     9999
Reference numbers. Computed variables are given the reference number of their base variable.
C-records. C-records in the input dictionary are transferred to the output dictionary for ID and transfer
variables.
A note on computation of the statistics. Before output, computed values are rounded up to the
calculated width and number of decimal places. If the computed value exceeds 999999999 or is less than
-99999999, it is output as 999999999.
10.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. Group-definition (ID) variables and variables to
be transferred may be numeric or alphabetic, although numeric variables are treated as strings of characters,
i.e. a value of 044 is different from 44. They cannot be recoded variables. Variables to be aggregated
must be numeric and may be recoded variables.
The file is processed serially and contiguous records with the same value on the ID variables are aggregated.
Thus, the input file should be sorted on the ID variables prior to using AGGREG. Note that AGGREG does
not check the input file sort order.
10.6 Setup Structure
$RUN AGGREG
$FILES
   File specifications
$RECODE (optional)
   Recode statements
$SETUP
   1. Filter (optional)
   2. Label
   3. Parameters
$DICT (conditional)
   Dictionary
$DATA (conditional)
   Data
Files:
   DICTxxxx   input dictionary (omit if $DICT used)
   DATAxxxx   input data (omit if $DATA used)
   DICTyyyy   output dictionary
   DATAyyyy   output data
   PRINT      results (default IDAMS.LST)
10.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V1=10,20,30,50 OR V10=90-300
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: AGGREGATION TEACHER/STUDENT DATA
3. Parameters (mandatory). For selecting program options.
Example: IDVARS=(V1,V2) STATS=(SUM,VARI) DEC=3 -
AGGV=(V5-V10,V50-V75) PAD1=80
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values in aggregate variables and in variables used in Recode.
See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
IDVARS=(variable list)
Up to 20 variable numbers to define the groups. R-variables are not allowed.
No default.
AGGV=(variable list)
V- or R-variables to be aggregated.
No default.
STATS=(SUM, MEAN, VARIANCE, SD, COUNT, MIN, MAX)
Parameters for selecting required statistics (at least one of: SUM, MEAN, VARIANCE, SD must
be selected). They are output for each group and for each AGGV variable.
SUM Sum.
MEAN Mean.
VARI Variance.
SD Standard deviation.
COUN Number of valid cases.
MIN Minimum value.
MAX Maximum value.
SAMPLE/POPULATION
SAMP Compute the variance and/or standard deviation using the sample equation.
POPU Use the population equation.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
VSTART=1/n
Variable number for the first variable in the output dataset.
CUTOFF=100/n
The percentage of cases with MD codes allowed before a MD code is output. An integer value.
DEC=2/n
For computed variables involving mean, variance or standard deviation: the number of decimal
places in addition to those of the corresponding input variables (see Restriction 7).
TRANSVARS=(variable list)
Variables whose values, as given for the first case of each group, are to be transferred to the
output file. R-variables are not allowed.
PAD1=constant
PAD2=constant
PAD3=constant
PAD4=constant
PAD5=constant
Up to 5 constants can be added to the output dataset. The number of characters given determines
the field width of the constant.
PRINT=(MDTABLES, GROUPS, DATA, CDICT/DICT, OUTDICT/OUTCDICT/NOOUTDICT)
MDTA Print a table giving the percentage of missing data found for each aggregate variable
in each group.
GROU Print the number of cases per group.
DATA Print values for each computed variable in each group record.
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
OUTD Print the output dictionary without C-records.
OUTC Print the output dictionary with C-records of ID and transfer variables if any.
NOOU Do not print the output dictionary.
10.8 Restrictions
1. Maximum number of variables to be aggregated is 400.
2. Maximum number of ID variables is 20.
3. Maximum number of characters in ID variables is 180.
4. Maximum number of variables to be transferred is 100.
5. Recoded variables are not allowed as IDVARS or as TRANSVARS.
6. The same variable cannot appear in two variable lists.
10.9 Example
Output a dataset containing one aggregate case for each unique value of V5 and V7; the variables in each
case are to be the sum, mean and standard deviation of 4 input variables and 1 recoded variable, aggregated
over the cases forming the group (i.e. with the same values for V5, V7); values of V10, V11 for the first
case of each group are to be transferred to the output records; a listing of the values output for each case is
requested; in the output file, variables are to be numbered starting from 1001.
$RUN AGGREG
$FILES
PRINT = AGGR.LST
DICTIN = IND.DIC input Dictionary file
DATAIN = IND.DAT input Data file
DICTOUT = AGGR.DIC output Dictionary file
DATAOUT = AGGR.DAT output Data file
$RECODE
R100=COUNT(1,V20-V29)
NAME R100'WEALTH INDEX'
$SETUP
AGGREGATION OF 4 INPUT VARIABLES AND 1 RECODED VARIABLE
IDVARS=(V5,V7) AGGV=(V31,V41-V43,R100) STATS=(SUM, MEAN, SD) -
VSTART=1001 PRINT=DATA TRANS=(V10,V11)
Chapter 11
Building an IDAMS Dataset (BUILD)
11.1 General Description
BUILD takes a raw data file, which may contain several records per case, along with a dictionary describing
the required variables and creates a new Data file with a single record per case containing values only for
the specified variables. At the same time, it outputs an IDAMS dictionary describing the newly formatted
Data file, in other words an IDAMS dataset is created.
In addition to restructuring the data, BUILD also checks for non-numeric values in numeric variables.
Why use BUILD? Any IDAMS program can be used without first using BUILD by preparing separately an
IDAMS dictionary. However, BUILD is recommended as a preliminary step since it:
- provides checks on the correct preparation of the dictionary,
- ensures that there is an exact match between the dictionary and the data,
- ensures that there are no unexpected non-numeric characters in the data,
- reduces the data into a compact single record per case form,
- recodes all blank fields to user-specified values.
Numeric variable processing. When BUILD processes a field as containing a numeric variable, it checks
that the field either contains a recognizable number or is blank. If a value other than these occurs, e.g. 3J,
3-, **2, etc. the sequential position of the case, the variable number associated with the field, and the
input case are printed and a string of nines is used as the output value.
Processing rules are as follows:
- If a field contains a recognizable number, the number is edited into a standard form and output (see
  the Data in IDAMS chapter for details).
- If a field contains all blanks, it is either recoded to the 1st or 2nd missing data code, nines or zeros, or,
  if no recoding is specified, it is signaled as an error and output as a blank field. Column 64 of T-records
  may be used to specify the recoding rule for the variable (see Input Dictionary section for details).
- If a field contains illegal trailing blanks, e.g. 04 in a three digit numeric field, or embedded blanks,
  e.g. 0 4, it is reported as an error and the value is changed to 9s.
- If a field contains a positive value or a negative value with the + or - characters wrongly entered,
  e.g. 1-23, it is reported as an error and the value is changed to 9s.
- If a missing data code for a variable has one more digit than the input field, the output field will be
  one character longer than the input. This feature can be used when it is necessary to increase the
  output field width without changing the input field width; for example, if codes 0-9 and a blank were
  defined for a single column variable, the blank field could not be recoded to a unique numeric value
  without allowing a 2-digit code on output.
Table showing examples of editing performed by BUILD
and the contents of the output field for a 3-digit input numeric field
==========================================================================
Input  No.   MD1   Recoding   Output  Output  Error message
value  dec.        specified  value   field
                                      width
=====  ====  ====  =========  ======  ======  =========================
032    0     9999  -          0032    4       -
 32    0           -          032     3       -
3 2    0           -          999     3       embedded blanks in var ...
32     0           -          999     3       embedded blanks in var ...
-03    0           -          -03     3       -
 -3    0           -          -03     3       -
- 3    0           -          -03     3       -
3.2    0           -          003     3       -
 32    1           -          032     3       -
.32    1           -          003     3       -
3.2    1           -          032     3       -
.32    2           -          032     3       -
.35    1           -          004     3       -
-.3    0           -          -00     3       -
-.3    1           -          -03     3       -
-03    1           -          -03     3       -
       -     8888  1          8888    4       (only if PRINT=RECODES)
       -           0          000     3       (only if PRINT=RECODES)
       -           None               3       blanks in var ...
A32    -           -          999     3       bad characters in var ...
3-2    -           -          999     3       bad characters in var ...
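The checks illustrated in the table can be approximated with a short Python sketch. This is a hypothetical classifier, not BUILD's code: the function check_numeric_field and its return strings are invented for illustration, and the treatment of leading blanks and of a blank between the sign and the digits as acceptable is inferred from the rows that produce no error message.

```python
import re

def check_numeric_field(raw):
    """Classify a raw fixed-width field as 'ok', 'blank',
    'embedded blanks' or 'bad characters'."""
    if raw.strip() == "":
        return "blank"
    # Only digits, blanks, a decimal point and a sign may appear at all.
    if re.search(r"[^0-9 .+-]", raw):
        return "bad characters"
    body = raw.lstrip(" ")                 # leading blanks are accepted
    if body[0] in "+-":
        body = body[1:].lstrip(" ")        # blanks after the sign are accepted
    if " " in body:                        # trailing or embedded blanks
        return "embedded blanks"
    if re.fullmatch(r"\d+\.?\d*|\.\d+", body):
        return "ok"
    return "bad characters"                # e.g. a misplaced sign

for value in [" 32", "32 ", "3 2", "- 3", "-.3", "A32", "3-2", "   "]:
    print(repr(value), "->", check_numeric_field(value))
```

Values classified as anything other than "ok" or "blank" correspond to the rows where the output value is replaced by 9s.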
11.2 Standard IDAMS Features
Case and variable selection. This program has no provision for selecting cases from the input data file.
The standard filter is not available. By way of the variable descriptions, any subset of the fields within a
case may be selected for the output data.
Transforming data. Recode statements may not be used.
Treatment of missing data. BUILD makes no distinction between substantive data and missing data
values. However, blank fields may be replaced by missing data codes, zeros or nines.
11.3 Results
Input dictionary. (Optional: see the parameter PRINT). Brule column on the dictionary listing contains
recoding rules for blank fields, as specified in col. 64 of the input dictionary. Note that error messages for
the dictionary are interspersed with the dictionary listing and do not contain a variable number. If the input
dictionary is not printed, the errors may be difficult to identify.
Output dictionary. (Optional: see the parameter PRINT). Variable description records (T-records) are
printed without or with C-records, if any.
Output data file characteristic. Record length of the output data file.
Data editing messages. For each case containing errors, the input case (up to 100 characters per line)
and a report of errors in variable number order are printed.
Blank field recoding messages. (Optional: see the parameter PRINT). For each case containing blank
fields that were recoded, a message about this along with the input data case are printed. These messages
are integrated with the data editing messages, if any errors also occur in the case.
11.4 Output Dataset
BUILD creates a Data file and a corresponding IDAMS dictionary, i.e. an IDAMS dataset. Note that the
T-records always define the locations of variables in terms of starting position and field width.
The data file contains one record for each case. The record length is the sum of the field widths of all
variables output and is determined by the BUILD program.
Numeric variable values. Numeric variable values are edited to a standard form as described in the
Numeric variable processing paragraph above.
Alphabetic variable values. The data values for alphabetic variables are not edited and are the same on
input and output.
Variable width. Normally BUILD assigns the width of a variable to be the same as the number of characters
the variable occupies in the input data. However, if a missing data code has one more significant digit than
the input field width, the output field width will be increased by one.
Variable location. BUILD assigns the output fields in variable number order. Thus, if the first two
variables have output widths of 5 and 3, locations 1-5 are assigned to the first variable and 6-8 are assigned
to the second, etc.
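The location rule amounts to a running sum of field widths, which the following Python sketch illustrates. The function assign_locations is hypothetical and is shown only to make the arithmetic explicit.

```python
def assign_locations(widths):
    """Return (start, end) column pairs for fields laid out back to back,
    in variable number order, starting at column 1."""
    locations = []
    start = 1
    for width in widths:
        locations.append((start, start + width - 1))
        start += width
    return locations

locations = assign_locations([5, 3, 2])
record_length = sum([5, 3, 2])   # the record length is the sum of the widths
print(locations, record_length)
```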
Reference number and study ID. The reference number, if it is not blank, and study ID are the same
as their input values. If the reference number field of an input T-record or C-record is blank, it is filled with
the variable number.
11.5 Input Dictionary
This describes those variables that are to be selected for output. The format is as described in the Data in
IDAMS chapter with column 64 of T-records being used to specify a recoding rule for blanks in a variable
as follows:
blank - no recoding of blank fields,
0 - recode blank fields to zeros,
1 - recode blank fields to 1st missing data code for variable,
2 - recode blank fields to 2nd missing data code for variable,
9 - recode blank fields to 9s.
Note: The Dictionary window of the User Interface does not provide access to column 64. Thus, use the
WinIDAMS General Editor (File/Open/File Using General Editor) or any other text editor to fill in this
column.
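The effect of the column 64 codes on a blank field can be summarized in a small Python sketch. It is illustrative only: the function recode_blank is hypothetical, missing data codes are passed as strings, and returning None stands for the case where the blank is reported as an error and left blank.

```python
def recode_blank(rule, width, md1=None, md2=None):
    """Return the output value for an all-blank field, or None when no
    recoding is requested (the blank is then signaled as an error)."""
    if rule == "0":
        return "0" * width
    if rule == "9":
        return "9" * width
    if rule == "1":
        return md1      # may be one digit longer than the input field
    if rule == "2":
        return md2
    return None         # blank in column 64: no recoding

print(recode_blank("0", 3))               # zeros
print(recode_blank("1", 3, md1="8888"))   # first missing data code
print(recode_blank(" ", 3))               # no recoding
```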
11.6 Input Data
The data can be any fixed-length record file with one or more records per case providing there are exactly
the same number of records for each case. The file should be sorted by record type within case ID. The
values for any variable must be located in the same columns in the same record for every case.
If the input data has more than one record per case, MERCHECK should always be used prior to BUILD
to ensure that the data do have the same set of records for each case.
Note that the exponential notation of data is not accepted by BUILD.
11.7 Setup Structure
$RUN BUILD
$FILES
   File specifications
$SETUP
   1. Label
   2. Parameters
$DICT (conditional)
   Dictionary
$DATA (conditional)
   Data
Files:
   DICTxxxx   input dictionary (omit if $DICT used)
   DATAxxxx   input data (omit if $DATA used)
   DICTyyyy   output dictionary
   DATAyyyy   output data
   PRINT      results (default IDAMS.LST)
11.8 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-2 below.
1. Label (mandatory). One line containing up to 80 characters to label the results.
Example: FILE BUILDING STUDY A35
2. Parameters (mandatory). For selecting program options.
Example: MAXERROR=50
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
LRECL=80/n
The length of each input data record.
(Used to check if variable starting locations on T-records are valid).
MAXCASES=n
The maximum number of cases to be used from the input file.
Default: All cases will be used.
VNUM=CONTIGUOUS/NONCONTIGUOUS
CONT Check that variables are numbered in ascending order and consecutively in the input
dictionary.
NONC Check only that variables are numbered in ascending order.
MAXERR=10/n
The maximum number of cases with errors (unrecoded blanks and non-numeric values for numeric
variables) before BUILD terminates execution.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
PRINT=(RECODES, CDICT/DICT, OUTDICT/OUTCDICT/NOOUTDICT)
RECO Print input cases that contain one or more blank fields which have been recoded.
CDIC Print the input dictionary for all variables with C-records if any.
DICT Print the input dictionary without C-records.
OUTD Print the output dictionary without C-records.
OUTC Print the output dictionary with C-records if any.
NOOU Do not print the output dictionary.
11.9 Examples
Example 1. Build an IDAMS dataset (dictionary and data file); input data records have a record length
of 80 with 3 records per case; variables are numbered non-contiguously in the input dictionary; variable V2
is the complete ID (columns 5-10) while variables V3 and V4 contain the two parts of the ID (columns 5-8,
9-10 respectively); blank fields should be replaced by the first missing data code for variables V101, V122,
V168, and by zeros for variable V169; blanks for V123 (age) should be treated as errors.
$RUN BUILD
$FILES
DATAIN = ABCDATA RECL=80 input Data file
DICTOUT = ABC.DIC output Dictionary file
DATAOUT = ABC.DAT output Data file
$SETUP
BUILDING A IDAMS DATASET
VNUM=NONC MAXERR=200
$DICT
3 1 169 3
T 1 TOWN CODE 1 1 1 3 ID
T 2 RESPONDENT ID 5 10 ID
T 3 HOUSEHOLD NUMBER 5 8 ID
T 4 RESPONDENT NUMBER 9 10 ID
T 101 RESP POSITION IN FAMILY 13 0 9 1 QS1
T 122 SEX 225 9 1 QS2
T 123 AGE 48 49 QS2
T 168 OCCUPATION 358 59 99 98 1 QS3
T 169 INCOME 61 65 99998 0 QS3
Example 2. Verify the presence of non-numeric characters in 4 numeric fields; the input data file has one
record per case; records are identified by an alphabetic field; the 5 variables are not numbered contiguously;
the output files normally produced by BUILD are not required and are defined as temporary files (extension
TMP) which are automatically deleted by IDAMS at the end of execution.
$RUN BUILD
$FILES
DATAIN = A:NEWDATA RECL=256 input Data file
DICTOUT = DIC.TMP temporary output Dictionary file
DATAOUT = DAT.TMP temporary output Data file
$SETUP
CHECKING FOR AND REPORTING NON-NUMERIC CHARACTERS AND BLANKS
VNUM=NONC LRECL=256 PRINT=NOOU MAXERR=200
$DICT
3 1 35 1 1
T 1 RESPONDENT NAME 1 20 1
T 21 AGE 21 2
T 22 INCOME 29 6
T 25 NO. WORK PLACES 129 1
T 35 SCI. TITLE 201 1
Chapter 12
Checking of Codes (CHECK)
12.1 General Description
CHECK verifies whether variables have valid data values and lists all invalid codes by case ID and variable
number.
Code specification. There are two ways in which the codes for the variables to be checked may be specified.
First, the program control statements include a set of code specifications with which to define the variables
and their valid codes. Second, the user may supply a list of variables for which valid codes are to be taken
from C-records in the dictionary. In any given execution of CHECK, the user may apply the first method
for some variables and the second method for others. Code specifications for a variable in the setup override
dictionary specifications.
Method used for checking data values. Data values for variables, both numeric and alphabetic, are
checked against the valid codes specified on a character by character basis. Thus, if a valid code specification
of V2=02,03 is given, then a value of 2 preceded by a blank in the data will be invalid; a leading blank in
the data is not considered equal to a zero. If code values are specified with fewer digits than the field width
of the variable, leading zeros are assumed. Thus, if the specification V2=2,3 is given where V2 is a 2-digit
variable, valid values used for comparison to the data will be taken as 02, 03. Similarly, if -3 and 1 were
supplied as valid codes for a 3-digit variable, CHECK would edit the codes to -03 and 001 before comparing
any data value to them.
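The padding rule above can be illustrated with a short Python sketch. This is not CHECK's actual implementation; the function names normalize_code and is_valid are hypothetical and exist only for this illustration.

```python
def normalize_code(code: str, width: int) -> str:
    """Pad a code with leading zeros to the field width.

    A minus sign, if present, stays in the leftmost position.
    """
    code = code.strip()
    if code.startswith("-"):
        return "-" + code[1:].zfill(width - 1)
    return code.zfill(width)


def is_valid(data_value: str, valid_codes: list[str], width: int) -> bool:
    """Character-by-character check: the raw field must equal a padded code exactly."""
    padded = {normalize_code(code, width) for code in valid_codes}
    return data_value in padded
```

With these definitions, normalize_code("-3", 3) gives "-03" and normalize_code("1", 3) gives "001", while a data value consisting of a blank followed by 2 does not match the code 02.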
Note. If a syntax error is found in a code specification, the other code specifications are checked but the
data are not processed.
12.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
dataset. The user selects the variables to be checked by specifying them on a variable list, on the code
specifications, or both.
Transforming data. Recode statements may not be used.
Treatment of missing data. CHECK makes no distinction between substantive data and missing data
values; all data are treated the same.
12.3 Results
Input dictionary. (Optional: see the parameter PRINT). Dictionary records for all variables are printed,
not just for those being checked.
Documentation of invalid codes. For each case in which a variable is found to have an invalid code,
CHECK prints the ID variable value(s), the variables in error and their values.
12.4 Input Dataset
The input is a Data file described by an IDAMS dictionary. CHECK can check for valid data on both
numeric and alphabetic variables. If the dictionary contains C-records, these can be used to define valid
codes for variables.
Values for numeric variables are assumed to be in the form they would have after being edited by BUILD.
This assumption implies that there are no leading blanks (they have been replaced by zeros), that a negative
sign, if any, appears in the leftmost position, and that explicit decimal points do not appear.
12.5 Setup Structure
$RUN CHECK
$FILES
File specifications
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Code specifications (repeated as required)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
12.6 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V10=3 AND V20=1-9
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: DATA: THESIS DATA, VERSION 1
3. Parameters (mandatory). For selecting program options.
Example: IDVA=(V1-V4) VARS=(V22-V26,V101-V102)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
START=1/n
The sequential number of the first case to be checked.
VARS=(variable list)
Variables for which valid codes are to be taken from the C-records in the dictionary.
MAXERR=100/n
Maximum number of cases with invalid codes allowed; if this number is exceeded, the execution
is terminated.
IDVARS=(variable list)
Up to 20 variables whose value(s) are to be printed when an invalid code is found. These will
normally consist at minimum of the variables that identify a case but can include others which
will provide additional information to the user. The variables may be alphabetic or numeric.
No default.
PRINT=CDICT/DICT
CDIC Print the input dictionary for all variables with C-records if any.
DICT Print the input dictionary without C-records.
4. Code specifications (optional). These specifications define the variables to be checked and their
valid or invalid code values.
Examples:
V3=1,3,5-9 (The data for variable 3 may have codes 1,3,5-9.
Any other code values are invalid and will be documented).
V7,V9,V12-V14= - (The data for variables 7,9 and 12 through 14
2,50-75,100 may only have values 2,50-75,100).
V50 <> 75 (The data for variable 50 may have any code except 75).
General format
variable list = list of code values
or
variable list <> list of code values
Rules for coding
Each code specification must start on a new line. To continue to another line, break after a comma
and enter a dash. As many continuation lines may be used as necessary. Blanks may occur anywhere
on the specifications.
Variable list
Each variable number must be preceded by a V.
Variables may be expressed singly (separated by a comma), in ranges (separated by a dash), or
as a combination of both (V1, V2, V10-V20).
The variables may be defined in any order.
All the variables grouped together in one expression must have the same field width (e.g. for V2,
V3=10-20 V2 and V3 must both have the same field width defined in the dictionary).
The variables to be checked may be alphabetic or numeric.
Valid (=) or invalid (<>)
An = sign indicates that the code values which follow are the valid codes for the variables specified.
All other codes will be documented as errors.
<> (not equal) indicates that the codes which follow are invalid. All cases having these codes for
the variables specified will be documented as errors.
List of code values
Codes may be expressed singly (separated by a comma), in ranges (separated by a dash), or as a
combination of both.
For numeric variables, leading zeros do not have to be entered (e.g. V1=1-10), but remember
that several variables being checked for common codes must all have the same field width defined
in the dictionary.
For data with decimal places, do not enter the decimal point in the value, but give the value
which accurately reflects the number assuming implied decimal places, e.g. the number 2 with
one decimal place should be given as 20.
For alphabetic values, trailing blanks do not have to be entered; they are added by the program
to match variable width.
To define a blank or to specify a value containing embedded blanks, enclose the value in primes
(e.g. V10='NEW YORK',WASHINGTON,' ').
Code values may be defined in any order.
Notes.
1) If two different specifications are given for the same variable, only the last one is used.
2) Code specifications for a variable override use of code label records from the dictionary for the
variables provided with VARS parameter.
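As an illustration of how a list of code values with the = and <> operators can be evaluated, here is a minimal Python sketch. It is not the CHECK program itself: parse_codes and in_error are hypothetical names, the sketch assumes non-negative numeric codes only, and it compares values numerically rather than character by character.

```python
def parse_codes(code_list: str) -> list[tuple[int, int]]:
    """Turn '1,3,5-9' into inclusive (low, high) ranges; a single code n becomes (n, n)."""
    ranges = []
    # Blanks may occur anywhere on a specification, so drop them first.
    for item in code_list.replace(" ", "").split(","):
        low, _, high = item.partition("-")
        ranges.append((int(low), int(high or low)))
    return ranges


def in_error(value: int, operator: str, code_list: str) -> bool:
    """Return True when the value would be documented as an error.

    '='  : listed codes are valid, everything else is an error.
    '<>' : listed codes are invalid, everything else is accepted.
    """
    listed = any(low <= value <= high for low, high in parse_codes(code_list))
    return not listed if operator == "=" else listed
```

For example, with the specification V3=1,3,5-9 a value of 4 is an error and a value of 7 is not; with V50<>75 only the value 75 is an error.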
12.7 Restrictions
1. The maximum number of ID variables is 20.
2. The maximum number of distinct codes which can be given on the code specifications is 4000. This
restriction can be overcome by using ranges of codes, since a range of codes counts as only 2 codes.
12.8 Examples
Example 1. Check for illegal codes in qualitative variables and out-of-range values in quantitative variables;
the only valid codes for variables V10, V12 and V21 through V25 are 1 to 5 and 9; code 9998 is illegal for
variable V35; codes 0 and 8 are illegal for variables V41, V44, V46; variables V71 to V77 should have values
within the range 0 to 100, or 999; cases are identified by variables V1, V2 and V4; code values from the
dictionary are not used.
$RUN CHECK
$FILES
PRINT = CHECK1.LST
DICTIN = STUDY1.DIC input Dictionary file
DATAIN = STUDY1.DAT input Data file
$SETUP
JOB TO SCAN FOR ILLEGAL CODES AND OUT-OF-RANGE VALUES
IDVARS=(V1,V2,V4)
V10,V12,V21-V25=1-5,9
V35<>9998
V41,V44,V46<>0,8
V71-V77=0-100,999
Example 2. Check for code validity only for a subset of cases (when variable V21 equals 2 or 3 and
variable V25 equals 1); valid codes for some variables are taken from dictionary C-records; in addition, a
code specification is given for variable V48; cases are identified by variable V1.
$RUN CHECK
$FILES
DICTIN = STUDY2.DIC input Dictionary file
DATAIN = STUDY2.DAT input Data file
PRINT = CHECK.PRT
$SETUP
INCLUDE V21=2,3 AND V25=1
JOB TO SCAN FOR ILLEGAL CODES
IDVARS=V1 VARS=(V18-V28,V36-V41)
V48=15-45,99
Chapter 13
Checking of Consistency
(CONCHECK)
13.1 General Description
CONCHECK, used in conjunction with IDAMS Recode statements, provides a consistency check capability to
test for illegal relationships between values of different variables. Condition statements in the CONCHECK
setup are used to name each check and to indicate which variables are to be listed in the event of an error.
The consistency checks are defined through Recode by testing a logical relationship and then setting the
value of a result variable to 1 if the relationship is not satisfied, e.g. if V3 cannot logically take the
value 9 when V2 takes the value 3 then the following Recode statement can be used:
IF V2 EQ 3 AND V3 EQ 9 THEN R100=1 ELSE R100=0
When an inconsistency is detected in a case, values of specified ID variables for the case are printed. In
addition, the values for a set of variables, defined with parameter VARS, are printed. This set is used to get
an overall picture of the case in order to more easily detect the reason for the inconsistency and to make sure
that a correction for one inconsistency will not cause another. For each consistency condition that fails, a
separate set of variables, normally consisting of the particular variables being checked, can be printed along
with the number and name of the condition.
13.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases for checking.
Variables to be listed when inconsistencies occur are specified with the parameter VARS (for the case) or
CVARS (for an individual condition).
Transforming data. Recode statements are used to express the required consistency checks.
Treatment of missing data. CONCHECK makes no distinction between substantive data and missing
data values; all data are treated the same.
13.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Inconsistencies. For each case containing an inconsistency, one line of identification is printed consisting
of the case sequence number and, optionally, the values of specified ID variables. This is followed by the
values of the variables specified with the VARS parameter.
For each individual inconsistency detected in a case, the number and name of the corresponding condition
and the values of the variables specified on the condition statement are printed.
Error statistics. At the end of the execution, a summary table is printed giving the number of cases
processed, the number of cases containing at least one inconsistency and, for each consistency condition, its
number and name, and the number of cases failing the test.
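The flow described above (one result flag per condition, then a summary table) can be pictured with the following Python sketch. It is only an analogy for what CONCHECK reports, not its implementation; run_checks, the conditions list and the dictionary-based cases are all assumptions made for this illustration.

```python
from collections import Counter

# Each condition is (number, name, test). The test returns True when the case
# is inconsistent, mirroring a Recode result variable set to 1.
conditions = [
    (1, "V2/V3 COMBINATION", lambda case: case["V2"] == 3 and case["V3"] == 9),
]


def run_checks(cases, conditions):
    """Apply every condition to every case and build the summary counts."""
    failures = Counter()
    cases_with_errors = 0
    for case in cases:
        failed = [number for number, _name, test in conditions if test(case)]
        if failed:
            cases_with_errors += 1
            failures.update(failed)
    return {
        "processed": len(cases),
        "with_errors": cases_with_errors,
        "per_condition": dict(failures),
    }
```

Calling run_checks on a list of cases returns the three figures that the error statistics table contains: cases processed, cases with at least one inconsistency, and failures per condition.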
13.4 Input Dataset
The input is a Data file described by an IDAMS dictionary. Numeric or alphabetic variables can be used.
13.5 Setup Structure
$RUN CONCHECK
$FILES
File specifications
$RECODE (optional)
Recode statements expressing inconsistencies
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Condition statements
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
13.6 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-4 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V1=1
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: TESTING FOR INCONSISTENCIES IN NORTH REGION
3. Parameters (mandatory). For selecting program options.
Example: IDVARS=(V1,V3-V4) MAXERR=50
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MAXERR=999/n
The maximum number of inconsistencies to be printed before CONCHECK will stop.
IDVARS=(variable list)
Up to 5 variables whose values will be listed to identify cases with inconsistencies.
Default: Case sequential number is printed.
VARS=(variable list)
Variables to be listed for any case which has at least one error.
FILLCHAR=string
Up to 8 characters used to separate variables when listing inconsistencies.
Default: 2 spaces.
PRINT=(CDICT/DICT, VNAMES)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
VNAM Print the first 6 characters of variable names instead of variable numbers when listing
values of variables for inconsistent cases.
4. Condition statements (at least one must be given). One condition statement is supplied for each
consistency to be tested giving a reference to the corresponding Recode statements, a name for the
test and the variables whose values are to be listed when the test fails.
The coding rules are the same as for parameters. Each condition statement must begin on a new line.
Example: TEST=R3 CVARS=(V34,V36,V52) -
CNAME=AGE, SEX AND PREGNANCY STATUS
TEST=variable number
Variable for which a non zero value indicates that a consistency check failed.
No default.
CVARS=(variable list)
List of variables whose values will be listed when this inconsistency is encountered.
Default: Only variables specified with IDVARS and VARS will be listed.
CNUM=n
Condition number.
Default: Condition sequence number.
CNAME=string
Name for this condition, up to 40 characters.
Default: No name.
13.7 Restrictions
1. Only the first 4 characters of alphabetic variables are printed.
2. Condition names may not be more than 40 characters long.
3. Maximum number of ID variables is 5.
4. Maximum number of variables listed for each case in error (VARS list) is 20.
5. Maximum number of variables listed for each condition (CVARS list) is 20.
13.8 Examples
Example 1. Test the relationship between V5 and V7 and between V20 and V21; the identification variables
V2 and V3 should be printed for each case with an error along with the values of key variables V8-V10;
names of variables should be printed.
$RUN CONCHECK
$FILES
PRINT = CONCH1.LST
DICTIN = MY.DIC input Dictionary file
DATAIN = MY.DAT input Data file
$RECODE
R1=0
R2=0
IF V5 INLIST(1-5,8) AND V7 EQ 2 THEN R1=1
IF V20 LE 3 AND V21 EQ 5 OR V20 EQ 8 AND V21 EQ 7 OR V20 EQ V21 THEN R2=1
$SETUP
TESTING FOR 2 INCONSISTENCIES
PRINT=VNAMES IDVARS=(V2,V3) VARS=(V8-V10)
TEST=R1 CNAME=1st Inconsistency CVARS=(V5,V7)
TEST=R2 CNAME=2nd Inconsistency CVARS=(V20,V21)
Example 2. Test 5 conditions in part 2 of a questionnaire; tests are numbered starting at 201; all variables
from part 2 should be listed for each questionnaire with an error, along with key variables from part 1
(V5-V10); in addition, particular variables used in tests should be listed again for each test that fails. Note
the use of the Recode SELECT function to initialize the corresponding result variables to 0.
$RUN CONCHECK
$FILES
DICTIN = MY.DIC input Dictionary file
DATAIN = MY.DAT input Data file
$SETUP
PART 2 OF CONSISTENCY CHECKING
MAXERR=400 IDVARS=(V1,V3) VARS=(V5-V10,V200-V231)
TEST=R1 CNUM=201 CVARS=(V203-V205)
TEST=R2 CNUM=202 CVARS=(V203,V210-V212)
TEST=R3 CNUM=203 CVARS=(V214,V215)
TEST=R4 CNUM=204 CVARS=(V222-V226)
TEST=R5 CNUM=205 CVARS=(V229,V230)
$RECODE
R900=1
A SELECT (FROM=(R1-R5), BY R900) = 0
IF R900 LT 5 THEN R900=R900+1 AND GO TO A
IF V203 IN(1-5,17,20-25) AND V204 EQ 3 OR V205 EQ M THEN R1=1
IF V203 GT 6 AND MDATA(V210,V211,V212) THEN R2=1
IF 2*TRUNC(V214/2) EQ V214 OR V215 EQ 0 THEN R3=1
IF COUNT(1,V222-V226) LT 2 THEN R4=1
IF MDATA(V229) AND NOT MDATA(V230) THEN R5=1
Chapter 14
Checking the Merging of Records
(MERCHECK)
14.1 General Description
The MERCHECK program detects and corrects merge errors (missing, duplicate or invalid records) in a
data file containing multiple records per case. It outputs a file containing equal numbers of records per case
by padding in missing records and deleting duplicate and invalid records. Although the program was originally
written for checking card-image data, the input data record length may be any value up to 128. Since all other
IDAMS programs assume that each case in a data file has exactly the same number of records, using MERCHECK
is an essential first checking step for all data files which have more than one record per case.
Program operation. The user supplies a set of Record descriptions defining the permissible record types.
While processing the data, the program reads into a work area all the contiguous input data records it finds
which have identical case ID values. These records are compared one by one with the defined record types,
and an output case is constructed. Records are padded, deleted, reordered, etc., as needed. The data case
is then transferred to the output file, and the program returns to read the set of input records for the next
case. The results document the corrections of the input data performed by the program.
Case and record identification. MERCHECK requires that the case ID is in the same position for all
records. Case ID fields may be located in non-contiguous columns and may be composed of any characters.
Record types are identified by a single record ID field (of 1-5 columns) which may be composed of any
character except a blank. A sketch of a data file with two record types follows. The intervening periods
stand for data or blank fields.
...SE23...01...............10......
...SE23...01...............12......
...SE23...02...............10......
...SE23...02...............12......
...SE24...01...............10......
...SE24...01...............12......
   first  second           record ID
   case   case             field
   ID     ID
   field  field
In the example, there are 2 types of record for each case, identified by a 10 or 12 in columns 28, 29. The
case ID consists of two non-contiguous fields, columns 4-7 and columns 11-12. Thus SE2301 is a case ID,
as are SE2302 and SE2401.
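The column arithmetic in this example can be checked with a small Python sketch. The helper extract_fields is a hypothetical name used for illustration only; the record is built programmatically so that the fields sit in the columns stated above.

```python
def extract_fields(record: str, locations: list[tuple[int, int]]) -> str:
    """Concatenate 1-based, inclusive column ranges taken from a fixed-format record."""
    return "".join(record[start - 1:end] for start, end in locations)


# First record of the sketch: case ID fields in columns 4-7 and 11-12,
# record ID in columns 28-29, periods standing for other data.
record = "..." + "SE23" + "..." + "01" + "." * 15 + "10" + "." * 6

case_id = extract_fields(record, [(4, 7), (11, 12)])   # two non-contiguous fields
record_id = extract_fields(record, [(28, 29)])
```

Here case_id is the concatenation SE2301 and record_id is 10, matching the description of the sketch.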
Eliminating invalid records. An input data record containing a record ID not defined by the Record
descriptions, known as an extra record, is optionally printed but never transmitted to the output file. In
addition, there are two options for eliminating other types of invalid records.
Records which do not contain a specified constant are rejected. (See the parameters CONSTANT,
CLOCATION, and MAXNOCONSTANT).
The user may supply the case ID value of the first valid data case. All records containing a case ID
value less than the one specified are rejected. (See the parameter BEGINID).
Options to handle cases with missing records. The user must select, using the parameter DELETE,
one of the three possible ways to handle incomplete cases.
1. DELETE=ANYMISSING. A case is not output if one or more of its record types is missing.
2. DELETE=ALLMISSING. A case is not output if not a single valid record ID is found for a particular
case ID.
3. DELETE=NEVER. The program never excludes from the output file a case missing one or more
records. Instead, it constructs a record for each missing record type and pads its contents with
blanks or user-supplied values. See the PADCH parameter and the PAD parameter on the Record
descriptions. Padding takes place in column locations other than the case and record ID fields. The
appropriate case and record IDs are always inserted by the program.
Options to handle cases with duplicate records. A duplicate record is one having the same case and
record IDs as another record regardless of the rest of the contents of the two records. The user specifies which
duplicate is to be kept if there is more than one input record bearing the same case and record IDs. For
example, the option DUPKEEP=1 causes the program to retain the first record and to discard any others.
The case is not transferred to the output file if fewer than n duplicates are found (where DUPKEEP=n);
i.e. to delete cases with duplicate records, specify a large value for n. Caution: It may happen that records
with duplicate IDs do not contain the same data. It is up to the user to determine the appropriateness of
the record that was retained.
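The DELETE and DUPKEEP options can be summarized in a Python sketch of the decision made for a single case. This is an interpretation of the text, not the MERCHECK source: check_case is a hypothetical function, and the sketch assumes that the DUPKEEP rule applies only to record types that actually occur more than once and that DELETE=ALLMISSING pads partially missing cases in the same way as DELETE=NEVER.

```python
from collections import defaultdict


def check_case(records, record_types, delete="NEVER", dupkeep=1):
    """Decide what is output for one case.

    records      : list of (record_id, text) pairs sharing one case ID, in input order
    record_types : record IDs defined on the Record descriptions, in output order
    Returns the output records, or None when the case is dropped.
    A missing record type is returned as (record_id, None), i.e. "to be padded".
    """
    by_type = defaultdict(list)
    for rec_id, text in records:
        if rec_id in record_types:           # extra records are never output
            by_type[rec_id].append(text)

    if delete == "ALLMISSING" and not by_type:
        return None                           # no valid record at all
    if delete == "ANYMISSING" and len(by_type) < len(record_types):
        return None                           # at least one record type missing

    output = []
    for rec_id in record_types:
        copies = by_type.get(rec_id, [])
        if not copies:
            output.append((rec_id, None))     # missing: a padded record is needed
        elif len(copies) == 1:
            output.append((rec_id, copies[0]))
        elif len(copies) < dupkeep:
            return None                       # fewer than n duplicates: drop the case
        else:
            output.append((rec_id, copies[dupkeep - 1]))
    return output
```

With two copies of record type 10 in a case, dupkeep=1 keeps the first copy, dupkeep=2 keeps the second, and a large dupkeep drops the case, which corresponds to the advice given for deleting cases with duplicate records.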
Options to handle deleted records. Those input data records which are deleted, i.e. not written to the
output file, may be saved in a separate file (see the parameter WRITE).
Selection of record types. MERCHECK allows the user to subset selected record types from a more
comprehensive input data file. Simply include only the required IDs in the Record descriptions, and choose
an appropriate error printing option (EXTRAS=n or PRINT=ERRORS, for example) and a realistic MAXERR
value. Minimizing printed output for cases in error is essential, as nearly every case in the input data
file will be reported in error due to records with invalid record IDs (i.e. those not specified on Record
descriptions).
Restart capabilities. The parameter BEGINID can be used to restart MERCHECK if a prior execution
terminated before all input data were processed. The user must determine the case ID value for the last case
output and set BEGINID equal to that value +1. (If termination occurred because the parameter MAXERR
was exceeded, the last input record read will appear displayed in the results, and BEGINID should be set
to the case ID of that record).
Note. MERCHECK is intended for checking data files with multiple records per case and there must be a
record ID entered in each record. MERCHECK could theoretically be used for eliminating duplicate records
and records without a particular constant for data files with a single record per case. This however can only
be done if each data record contains a constant value which can be treated as the record ID. This operation
is better performed by the SUBSET program, using a filter to exclude records without a constant and the
DUPLICATE=DELETE option to eliminate duplicates. (See write-up for SUBSET).
14.2 Standard IDAMS Features
Case and variable selection. Except as dened above, not available for this program.
Transforming data and missing data. These options do not apply in MERCHECK.
14.3 Results
Error cases. The full report with the documentation of each error case has three parts: an error summary,
the records not transferred to the output (bad records), and the case as it appears in the output file (good
records). See below for more details of these components. For data with a large number of record types and
with many cases in error, the report for error cases can be costly and, for some jobs, quite unnecessary. The
amount of report needed depends on how much a user knows about the data, as well as the ability to correct
or double-check the errors. For instance, if a user expects considerable padding to occur, but virtually no
duplicate or invalid records, it may be sufficient to have only the error summary printed and to specify that
cases with errors (if any) be saved (see the option WRITE=BADRECS) and listed later. Various controls
on the quantity of results are possible with the parameters PRINT, EXTRAS, DUPS, and PADS.
Error cases: error summary. The error summary consists of an identification of the error case (case
count or case ID) and any of three messages about the errors which occurred. The sequential case count
does not account for records or cases eliminated because they appear before the beginning ID or lack the
required constant. The case ID is taken from the case ID field(s) as specified by the IDLOC parameter.
Three kinds of errors are reported, namely:
1. invalid record types,
2. cases with missing records,
3. cases with duplicate records.
Error cases: bad records. These are the invalid and duplicate records as well as all records for cases
which have been rejected because of missing records. They are printed in the order that they appear in the
input file.
Error cases: good records. If a case is kept after an error has been encountered, the actual records
written to the output file, including any padding records, are listed.
Records occurring before the one with BEGINID. These are optionally printed. See the parameter
PRINT=LOWID.
Records out of sort order. These are normally printed although results can be suppressed. See the
parameter PRINT=NOSORT.
Records without the specified constant. Any record which does not contain the user-specified constant
in the correct columns is printed. This report can be suppressed. See the parameter PRINT=NOCONSTANT.
Execution statistics. At the end of the report the total number of missing records, invalid records and
duplicate records, and the total number of cases which were read, written, deleted and found to contain
errors are printed.
14.4 Output Data
The output data is a file with the same record length as the input data and equal number of records per
case. Each case contains one each of the record types specified on the Record descriptions.
14.5 Input Data
The input consists of a file of fixed length data records normally sorted by case ID and record ID within
case. The record length may not exceed 128.
14.6 Setup Structure
$RUN MERCHECK
$FILES
File specifications
$SETUP
1. Label
2. Parameters
3. Record descriptions (repeated as required)
$DATA (conditional)
Data
Files:
FT02 rejected records ("bad case" records)
when WRITE=BADRECS specified
DATAxxxx input data (omit if $DATA used)
DATAyyyy output data (good cases)
PRINT results (default IDAMS.LST)
14.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Label (mandatory). One line containing up to 80 characters to label the results.
Example: CHECKING THE MERGE OF RECORDS IN STUDY 95 DATA
2. Parameters (mandatory). For selecting program options.
Example: MAXE=25 RECORDS=8 IDLOC=(1,5)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Data file.
Default ddname: DATAIN.
MAXCASES=n
The maximum number of cases to be used from the input file.
Default: All cases will be used.
MAXERR=10/n
Maximum number of cases with errors. When n + 1 error cases occur, execution terminates.
Cases before the BEGINID, those out of sort order, and records without the constant do not
count as error cases. Error cases are those with invalid, duplicate, or missing records.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output Data file.
Default ddname: DATAOUT.
RECORDS=2/n
The number of records per case (as defined on the Record descriptions).
IDLOC=(s1,e1, s2,e2, ...)
Starting and ending columns of 1-5 case identification fields. At least one must be given. If there
is more than one case ID field, then they must be specified in the order in which the input data
are sorted.
No default.
BEGINID=case id
Lowest valid case ID value at which program begins processing: 1 to 40 characters, enclosed in
primes if the value contains any non-alphanumeric characters. If multiple case ID fields are used, the
value should be the concatenation of the individual case IDs supplied in sort order.
Default: Blanks.
NOSORT=0/n
The maximum number of cases out of sort order tolerated by the program. When n + 1 cases
out of order occur, execution terminates.
DELETE=NEVER/ANYMISSING/ALLMISSING
Specifies under what conditions with respect to missing records a case is to be deleted.
NEVE Never reject a case due to missing records. If any or all of the records are missing, the
program will pad (with blanks or user-supplied values) all records which are missing
and reject any records with invalid record IDs before outputting the case.
ANYM Do not output any case in which one or more records is missing, i.e. no incomplete
case is to be output.
ALLM Do not output any case in which there are no valid records, i.e. when all records for a
case have invalid record IDs.
PADCH=x
Character to be used on padded records. A non-alphanumeric character must be enclosed in primes.
See also Record descriptions for more detailed padding values.
Default: Blank.
DUPKEEP=1/n
Specifies (for duplicate data records) that the n-th duplicate encountered is to be kept. If fewer
than n duplicates are found, the case in which they occur is deleted (even if DELETE=NEVER
is specified).
WRITE=BADRECS
Create a file of the rejected (bad case) records.
CONSTANT=value
Value of a constant. Must be enclosed in primes if it contains non-alphanumeric characters. Any
input data record without the constant is rejected. The location of the constant must be the same
across all input records regardless of record type.
CLOCATION=(s, e)
(Supplied only if CONSTANT is used). Location of the constant field.
s Starting column of the constant field on each record.
e Ending column of the constant field on each record.
MAXNOCONSTANT=0/n
(Supplied only if CONSTANT is used). Maximum number of records without the constant toler-
ated by the program. When n + 1 records without the constant are encountered, MERCHECK
terminates execution.
PRINT=(CONSTANT/NOCONSTANT, SORT/NOSORT, ERRORS/NOERRORS, LOWID,
BADRECS, GOODRECS)
CONS Print records without the specified constant.
NOCO Do not print records without the constant.
SORT Print a 3-line notice for cases out of sort order.
NOSO Do not print cases out of sort order.
LOWI Print all records with case ID lower than the one specified with BEGINID.
The following print options refer to the report of cases with errors (i.e. missing, invalid, or
duplicate records).
ERRO Print error summary for each case with an error.
NOER Do not print error summary for cases with errors.
BADR Print rejected (bad) records for cases with errors.
GOOD Print kept (good) records for cases with errors.
EXTRAS=0/n
DUPS=0/n
PADS=0/n
If a case has fewer than n invalid (extra/duplicate/padded) records and no other errors, no report
will occur for the case. Thus, a case with only 2 invalid records and no missing or duplicate records
would not generate a report if EXTRAS=3, but would print according to the PRINT specification
if it also had 1 missing record.
Default: All error cases will be printed according to PRINT specification.
3. Record descriptions (mandatory: one for each type of record to be selected for output). The coding
rules are the same as for parameters. Each record description must begin on a new line.
Example: RECID=21 RIDLOC=1
RECID=3 RIDLOC=2 PAD=43599-
999998889999999881119
RECID=xxxxx
A 1-5 non-blank character record type code. Must be enclosed in primes if it contains lower case
characters.
No default.
RIDLOC=s
Starting column of record ID field.
No default.
PAD=xxx....
Pad values to be used when padding a record of this type. The string of values must be enclosed
by primes if it contains non-alphanumeric characters. The first character will be put in column 1
of the output padded record, etc. To continue on a subsequent line, enter a dash. If the length of
the string is less than the record length, then the rest of the string is filled on the right with the
PADCH specified on the parameter statement.
Default: PADCH is used for entire string.
Note: The correct case ID and record ID are automatically inserted into the padded record in the
correct positions.
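The padding rule described for PAD can be illustrated with a short Python sketch. This is not MERCHECK code: the function name build_padded_record and its arguments are hypothetical, a single case ID field is assumed (the program allows up to 5), and PADCH is assumed to be one character.

```python
def build_padded_record(pad, padch, record_length, case_id, id_loc, rec_id, rid_loc):
    """Return a padded record. Column numbers are 1-based, as in the manual."""
    # Start from the PAD string; fill the remainder on the right with PADCH.
    body = list(pad[:record_length].ljust(record_length, padch))
    # Overwrite the case ID and record ID fields in their own columns.
    body[id_loc - 1:id_loc - 1 + len(case_id)] = case_id
    body[rid_loc - 1:rid_loc - 1 + len(rec_id)] = rec_id
    return "".join(body)


record = build_padded_record(pad="9999", padch="0", record_length=12,
                             case_id="017", id_loc=1, rec_id="3", rid_loc=6)
print(record)  # prints 017903000000
```

With an empty pad string the whole record consists of PADCH, which corresponds to the stated default.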
14.8 Restrictions
1. Maximum record length of input data records is 128.
2. Maximum number of output records per case is 50.
3. The program reserves work space for a maximum of 60 records with identical case ID value. Included in
the count are invalid, duplicate, and valid records, and also records which are padded by the program.
MERCHECK terminates execution if more than 60 records with identical case ID values occur in the
work area.
4. Maximum combined length of the individual case ID fields is 40 characters.
5. Maximum length of the record ID field is 5 contiguous non-blank characters.
6. Maximum length of a constant to be checked for is 12 characters.
7. Maximum number of case ID fields is 5.
14.9 Examples
Example 1. Check the merge of three records per case which have record types 1, 2 and 3 respectively;
missing records are padded: records 1 and 2 are padded with blanks, record 3 is padded with a copy of the
values given with the PAD parameter; cases with no valid records (when all records for a case have invalid
record types) are written to the file BAD; cases with up to four duplicate records are also written to the file
BAD (if a case has 5 or more duplicates of a particular record type, then it is kept as a good case using the
5th of the duplicates and eliminating the others).
$RUN MERCHECK
$FILES
PRINT = MERCH1.LST
FT02 = \DEMO\BAD file for output bad cases
DATAIN = \DEMO\DATA1 input Data file
DATAOUT = \DEMO\DATA2 output Data file (with only good cases)
$SETUP
CHECKING THE MERGE OF DATA
IDLO=(1,3,5,6,10,10) RECO=3 DELE=ALLM DUPK=5 WRITE=BADRECS MAXE=200
RECID=1 RIDLOC=12
RECID=2 RIDLOC=12
RECID=3 RIDLOC=12 PAD=9999999999-
9399999999999999999999999999999999999999999999999999999999999999999999
Example 2. Check data, deleting all cases with missing records and eliminating cases which do not belong
to the study; Data file contains two records per case; cases with duplicate records are kept (dropping all
except the first of a set of duplicate records); there is a record type TT in columns 4 and 5 of one record
and one of AB in columns 7 and 8 of the other; the study ID, HST, should appear in columns 124-126 of
each record.
$RUN MERCHECK
$FILES
FT02 = BAD file for output bad cases
DATAIN = DATA RECL=126 input Data file
DATAOUT = GOOD output Data file (with only good cases)
$SETUP
CHECKING THE MERGE OF DATA
IDLO=(1,3) RECO=2 WRITE=BADRECS MAXE=20 -
CONS=HST CLOC=(124,126)
RECID=TT RIDLOC=4
RECID=AB RIDLOC=7
Chapter 15
Correcting Data (CORRECT)
15.1 General Description
CORRECT provides correction facilities for data in an IDAMS dataset. Individual variable values in
specified cases may be corrected or entire cases deleted.
CORRECT is useful for correcting errors in individual variables for specific cases as detected for example
by BUILD, CHECK or CONCHECK. The preparation of update instructions is easy. Checks are made for
compatibility between the data and the correction and good documentation is printed describing all the
corrections made.
Program operation. CORRECT first reads the dictionary and stores the information about all the
variables in the dataset. Each data correction instruction is then processed. After an instruction is read,
CORRECT reads the data file, copying cases until the case identified in the instruction is encountered.
CORRECT executes the instruction, listing the case, or revising values for selected variables and outputting
the case, or deleting the case from the output as appropriate. When all instructions are exhausted, the
remaining data cases (if any) are copied to the output, and execution terminates normally. If the correction
instructions or data cases are out of sort order, or if a correction instruction contains a syntax error,
CORRECT documents the situation in the results and continues with the next instruction.
Variable correction. The user specifies the case identification followed by the variable numbers of the
variables to be corrected together with their new values. Both numeric (integer or decimal valued) and
alphabetic variables can be corrected.
Correcting case ID variables. If an ID field is to be corrected, normally the sort order will be affected
and the parameter CKSORT=NO should therefore be specified. If the ID variable contains erroneous non-
numeric characters, then enclose its value in primes on the correction instruction.
Case deletion. The user can delete a case from the data file by specifying case identification information
and the word DELETE.
Case listing. The user can choose to have a particular data case listed by specifying case identification
information and the word LIST.
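The sequential pass described under Program operation can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program's implementation: cases and instructions are taken to be already sorted by case ID, the function name apply_corrections and the action word CORRECT are hypothetical, a listed case is assumed to be copied unchanged to the output, and an instruction whose case is not found is simply skipped (CORRECT itself documents such situations in the results).

```python
def apply_corrections(cases, instructions):
    """cases: list of (case_id, {variable: value}) in ascending ID order.
    instructions: list of (case_id, action, changes) in the same order.
    Returns (output_cases, listed_cases)."""
    output, listed = [], []
    remaining = iter(cases)
    current = next(remaining, None)
    for target_id, action, changes in instructions:
        # Copy cases to the output until the target case is reached.
        while current is not None and current[0] < target_id:
            output.append(current)
            current = next(remaining, None)
        if current is None or current[0] != target_id:
            continue  # target case not found: skip this instruction
        case_id, values = current
        if action == "LIST":
            listed.append(current)
            output.append(current)
        elif action == "CORRECT":
            output.append((case_id, {**values, **changes}))
        elif action != "DELETE":
            raise ValueError(f"unknown action: {action}")
        # For DELETE nothing is written, so the case disappears from the output.
        current = next(remaining, None)
    # Instructions exhausted: copy the remaining cases.
    while current is not None:
        output.append(current)
        current = next(remaining, None)
    return output, listed


cases = [("001", {"V5": 1}), ("002", {"V5": 2}), ("003", {"V5": 3}), ("004", {"V5": 4})]
instructions = [("002", "CORRECT", {"V5": 9}), ("003", "DELETE", {}), ("004", "LIST", {})]
output, listed = apply_corrections(cases, instructions)
```

Because the data are read only once, the instructions must follow the same order as the cases, which is why the sort order matters.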
15.2 Standard IDAMS Features
Case and variable selection. One may select a subset of cases to be processed and output by including
a standard filter. Selection of variables is inappropriate.
Transforming data. Recode statements may not be used.
Treatment of missing data. CORRECT makes no distinction between substantive data and missing data
values; the concept does not apply to the program operation.
15.3 Results
Input dictionary. (Optional: see the parameter PRINT). Dictionary records for all variables are printed,
not just for those being corrected.
Listing of the correction instructions. Correction instructions are always listed. With each correction
the program also optionally lists: (1) input data records, (2) deleted records, or (3) corrected records (see
PRINT parameter).
15.4 Output Dataset
A copy of the dictionary is always output. If it is not required, the DICTOUT file definition can be omitted.
The data are always copied to the output, even if there are no corrections or deletions.
15.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. Normally, CORRECT expects the data cases
to be sorted in ascending order on values of their case ID variables. The user can, however, indicate (via the
parameter CKSORT) that the cases are not in ascending order. This option should be used with caution:
the order of the correction instructions must exactly match the order of the data in the file.
15.6 Setup Structure
$RUN CORRECT
$FILES
File specifications
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Correction instructions (repeated as required)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output dictionary
DATAyyyy output data
PRINT results (default IDAMS.LST)
15.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V1=10,20,30 AND V12=1,3,7
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: CORRECTION OF ALPHA CODES IN 1968 ELECTION
3. Parameters (mandatory). For selecting program options.
Example: PRINT=CORRECTIONS, IDVARS=V4
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input dictionary and data files.
Default ddnames: DICTIN, DATAIN.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file. If MAXC=0, all
correction instructions will be checked for syntax errors but no data processed.
Default: All cases will be used.
IDVARS=(variable list)
Up to 5 variable numbers for the case identification fields. If more than one case ID field is
specified, the variable numbers must be given in major to minor sort field order.
No default.
CKSORT=YES/NO
Indicates whether the data cases will have their case ID field(s) checked for ascending sequential
ordering. The execution terminates if a case out of order is detected.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output dictionary and data files.
Default ddnames: DICTOUT, DATAOUT.
PRINT=(DELETIONS, CORRECTIONS, CDICT/DICT)
DELE List those cases for which the delete option is specied in correction instructions.
CORR List corrected cases.
CDIC Print the input dictionary for all variables with C-records if any.
DICT Print the input dictionary without C-records.
4. Correction instructions. These statements indicate which of the listing, deletion, or correction
options are to be applied and for which cases.
Examples:
ID=1026,V5=9,- (For the case with ID "1026" change the
V6=22 value of V5 to 9 and the value of V6 to 22)
ID='JOHN DOE',DELETE (Delete the case with ID "JOHN DOE" from the output)
ID=091,3,LIST (List the case with ID "091", "3")
ID=023,16,V8='DON_T',- (Change V8 to DON'T and V9 to TEACH,RES)
V9='TEACH|RES'
Rules for coding
Each correction instruction must start on a new line. To continue to another line, break after the
comma at the end of a complete variable correction and enter a dash. As many continuation lines may
be used as necessary. Blanks may occur anywhere on the instructions.
The correction instructions must be ordered in exactly the same relative sequence by case ID values
as the data cases.
Case ID values
The case to be corrected is identified using the keyword ID= followed by the value(s) of the ID
variable(s).
The list of values on the instruction is not enclosed in parentheses.
Each value, including the last, must be followed by a comma, and the order of the values should
correspond to the order of the variables in the list of ID variables specified with the IDVARS
parameter.
The number of digits or characters in a value must equal the width of the variable as stated in
the dictionary, i.e. leading zeros may need to be included.
Values containing non-numeric characters should be enclosed in primes, e.g. ID=9,'PAM'.
Type of instruction
The case identification is followed either by the word LIST, by the word DELETE, or by a string
of variable corrections.
Variable corrections
A variable correction consists of a variable number preceded by a V and followed by an equals
sign (=) and the correct value, e.g. V3=4.
Variable corrections for different variables for the same case are separated by commas.
Correction values for numeric variables may be specified without leading zeros.
If the variable includes decimal places, the decimal point may be entered, but is not written to
the output file. The digits are aligned according to the number of decimal places indicated in the
dictionary and excess decimal digits are rounded.
If the value contains non-numeric characters it must be enclosed in primes. An embedded comma
must be represented as a vertical bar and an embedded prime must be represented as an
underscore; the program will convert the vertical bar and underscore to the comma and prime
respectively, e.g. V8='DON_T' is stored as DON'T.
Correction values for alphabetic variables must match the variable width. If the correction value
contains blanks or lower case characters it should be enclosed in primes.
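The two value conventions above (stand-in characters in primed values, and implicit decimal places in numeric values) can be sketched in Python. The function names are hypothetical, and two details are assumptions rather than documented behaviour: half-up rounding of excess decimals and zero padding on the left of the stored field.

```python
from decimal import Decimal, ROUND_HALF_UP


def decode_alpha(value):
    """Convert the stand-in characters: vertical bar -> comma, underscore -> prime."""
    return value.replace("|", ",").replace("_", "'")


def encode_numeric(value, width, ndec):
    """Store a numeric correction in a fixed-width field with an implicit decimal point."""
    # Round to the number of decimal places given in the dictionary (assumed half-up).
    step = Decimal(10) ** -ndec
    rounded = Decimal(value).quantize(step, rounding=ROUND_HALF_UP)
    # The decimal point itself is not written to the output.
    digits = str(rounded).replace(".", "").zfill(width)
    if len(digits) > width:
        raise ValueError(f"value {value} does not fit into {width} columns")
    return digits


print(decode_alpha("DON_T"))           # prints DON'T
print(decode_alpha("TEACH|RES"))       # prints TEACH,RES
print(encode_numeric("4.5", 5, 2))     # prints 00450
print(encode_numeric("12.345", 5, 2))  # prints 01235
```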
15.8 Restriction
The maximum number of case ID variables is 5.
15.9 Example
Correction of data file; both numeric and alphabetic variables are to be corrected, and two cases are to be
deleted; cases are identied by variables V1, V2 and V5; the dictionary is not changed, and therefore an
output dictionary is not needed.
$RUN CORRECT
$FILES
PRINT = CORRECT1.LST
DICTIN = DATA1.DIC input Dictionary file
DATAIN = DATA1.DAT input Data file
DICTOUT = DATA2.DIC output Dictionary file (same as input)
DATAOUT = DATA2.DAT output Data file (corrected)
$SETUP
CORRECTING A DATA FILE
IDVARS=(V1,V2,V5)
ID=311,01,21,V12='JOHN MILLER'
ID=311,05,41,DELETE
ID=557,11,32,V58=199,V76=2,V90=155
ID=559,11,35,V12='AGATA CHRISTI',V13=F
ID=657,31,11,V58=100,V77=4,V90=105,V36=999999,V37=999999,V38=999999, -
V41=98,V44=99
ID=711,15,11,DELETE
Chapter 16
Importing/Exporting Data (IMPEX)
16.1 General Description
The IMPEX program performs import/export of data in free or DIF format, and import/export of matrices
in free format. In a free format file, fields may be separated with a space, tabulator, comma, semicolon or any
character defined by the user. A decimal point or comma can be used in decimal notation. An imported/exported
Data file may contain variable numbers and/or variable names as column headings. An imported/exported
Matrix file may contain variable numbers/code values and/or variable names/code labels as column/row
headings.
Data import. The program creates a new IDAMS dataset from an existing ASCII data file in free or DIF
format (DIF is a format for data interchange developed by Software Arts Products Corp.) and from an IDAMS
dictionary. The input dictionary defines how the fields of the input data file are to be transferred into the
output IDAMS dataset.
Data export. The program creates a new ASCII data file containing variables from an existing IDAMS
dataset and new variables defined by IDAMS Recode statements. The exported file may be of free or DIF
format.
Matrix import. The program creates an IDAMS Matrix file from a free format ASCII file containing a
lower triangle of a square matrix or a rectangular matrix.
Matrix export. The program creates an ASCII file containing all matrices stored in an IDAMS Matrix
file. For matrix export, only free format is available.
16.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data when data export is requested. Also in data export, variables are selected through the parameter
OUTVARS.
Transforming data. Recode statements may be used in data export.
Treatment of missing data. No missing data checks are made on data values except through the use of
Recode statements in data export. In data import, empty fields (i.e. empty strings between consecutive delimiters)
are replaced with the first missing data code or with a field of 9s if the first missing data code is not defined.
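The treatment of empty fields in data import can be illustrated with a small Python sketch. The function import_line and the dictionary layout with the keys width and md1 are hypothetical stand-ins for the information held in an IDAMS dictionary; the sketch is not IMPEX code.

```python
def import_line(line, delimiter, variables):
    """Split one free format line and fill empty fields.

    variables: one dict per field with 'width' and 'md1'
    (first missing data code as a string, or None if not defined)."""
    fields = line.rstrip("\r\n").split(delimiter)
    values = []
    for spec, raw in zip(variables, fields):
        raw = raw.strip()
        if raw == "":
            # Empty field: use the first missing data code, else a field of 9s.
            raw = spec["md1"] if spec["md1"] is not None else "9" * spec["width"]
        values.append(raw)
    return values


variables = [
    {"width": 3, "md1": "999"},
    {"width": 2, "md1": None},
    {"width": 4, "md1": "-1"},
]
print(import_line("12;;", ";", variables))  # prints ['12', '99', '-1']
```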
16.3 Results
Data Import
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, for all variables included in the input dictionary.
Input column labels and codes. (Optional: see the parameters PRINT and EXPORT/IMPORT).
Column labels and column codes are printed (unformatted) as they are read from the input file.
Input data. (Optional: see the parameter PRINT). Unformatted input data lines are printed for all cases
exactly as they are read from the input data file.
Output dictionary. (Optional: see the parameter PRINT).
Output data. (Optional: see the parameter PRINT). Values for all cases and for all variables are given,
10 values per line, in the same order as input data lines.
Data Export
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Output data. (Optional: see the parameter PRINT). Values for all cases for each V- or R-variable are
given, 10 values per line. For alphabetic variables, only the first 10 characters are printed.
Matrix Import
Input matrix. (Optional: see the parameter PRINT). A matrix contained in the input ASCII file is printed
with or without column labels and column codes.
Matrix Export
Input matrices. (Optional: see the parameter PRINT). Matrices contained in the input IDAMS matrix
file are printed with or without variable descriptor records or code label records.
16.4 Output Files
Import
The output is either an IDAMS dataset or an IDAMS matrix depending on whether data or matrix import
is requested.
In the case of an IDAMS dataset, values of the numeric variables are edited according to IDAMS rules (see
the Data in IDAMS chapter).
Empty numerical fields (i.e. empty strings between delimiter characters) in a free format input file are
replaced with the corresponding first missing data code or with 9s if the first missing data code is not
defined.
Export
The output is an ASCII file, the content of which varies according to the export requirements.
Data in DIF format. This is a file with standard Header and Data sections. Vectors correspond to
IDAMS variables, and TUPLES to cases. In addition to the required header items, LABEL (a standard
optional item) is used to export variable names. In the Data section, the Value Indicator V is always used
for numeric values. A decimal point or comma is used in decimal notation if the number of decimals dened
in the dictionary is greater than zero.
Data in free format. This is a file in which variable values are separated by a delimiter (see the parameters
WITH and DELCHAR) and cases are separated additionally by carriage return plus line feed characters.
For numeric variable values, a decimal point or comma (see the parameter DECIMALS) is included if the
number of decimals defined in the dictionary is greater than zero. Alphabetic variable values may be enclosed
in primes or quotes, or not enclosed in any special characters (see the parameter STRINGS).
Matrix in free format. The format of matrices output by IMPEX is the same as the format required
for imported matrices (see Matrix Import in the Input Files section below). The only difference is
that additional delimiter characters are inserted to ensure correct positioning of column and row labels in a
spreadsheet package.
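The layout of one case in a free format export file can be sketched in Python as follows. The function export_case is hypothetical and only mirrors the roles of the parameters WITH/DELCHAR, DECIMALS and STRINGS; whether IMPEX writes a delimiter after the last value is not stated here, so the sketch assumes it does not.

```python
def export_case(values, ndecs, delimiter=";", decimal_char=",", string_char='"'):
    """Format one case as a free format line.

    values: numbers or strings, one per variable.
    ndecs: number of decimals per variable, or None for an alphabetic variable."""
    fields = []
    for value, ndec in zip(values, ndecs):
        if ndec is None:
            # Alphabetic value, optionally enclosed in primes or quotes.
            fields.append(f"{string_char}{value}{string_char}")
        elif ndec > 0:
            # Decimal character appears only when decimals are defined.
            fields.append(f"{value:.{ndec}f}".replace(".", decimal_char))
        else:
            fields.append(str(int(value)))
    # Cases are separated by carriage return plus line feed.
    return delimiter.join(fields) + "\r\n"


line = export_case([12, 3.5, "PARIS"], [0, 2, None])
print(repr(line))  # prints '12;3,50;"PARIS"\r\n'
```

Passing string_char="" corresponds to STRINGS=NONE.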
16.5 Input Files
Data Import
For data import, the input is either:
an ASCII file containing a free format data array in which fields are separated with a delimiter, and
an IDAMS dictionary which defines how to transfer data into an IDAMS dataset (all fields have to be
described in the input dictionary); or
a DIF format data file, and also an IDAMS dictionary.
The input files may also contain dictionary information. For free format files, this means that column labels
and column codes (which correspond to variable names and variable numbers) are supplied with the data
array as the first rows in the array. Both labels and codes are optional. If provided, column labels override
variable names from the input dictionary, and they are inserted in the output dictionary. They may be
enclosed in special characters (see the parameter STRINGS). Column codes are used only to perform a
check against variable numbers from the input dictionary. For DIF format files, column labels appear as
LABEL items in the Header section. Column codes can be present as the first row in the data array.
Matrix Import
The input is always a free format ASCII file in which numerical values/strings of characters are separated
with a delimiter. Empty fields (i.e. empty strings between delimiter characters) are skipped. Each file may
contain only one matrix to import.
The input matrix file may optionally provide dictionary information consisting of a series of strings for
labelling columns/rows of the matrix and the corresponding codes. If provided, they must follow the syntax
given below (which is different for rectangular and square matrices).
Rectangular matrix
This is an ASCII file containing a free format rectangular array of values; dictionary information may be
optionally included.
Example.
Average salary; Age group; Sex;
Male; Female;
1;2;
20 - 30;1;600;530;
31 - 40;2;650;564;
41 - 60;3;723;618;
Format.
1. The first three strings contain, respectively: (1) a description of the matrix contents, (2) the row title
(row variable name), and (3) the column title (column variable name). (Optional).
2. Column labels. (Optional: one label per column of the array of values).
3. Column codes. (Optional: one code per column of the array of values).
4. The array of values. (This may optionally contain one row label and/or code before each row of values).
Note. If row and column labels and/or codes are not present, they are automatically generated for the
output IDAMS matrix (labels as R-#0001, R-#0002, ... C-#0001, C-#0002, ... and codes from 1 to the
number of rows and columns respectively).
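The rectangular layout can be read with a few lines of Python. The sketch below is not the IMPEX reader: it assumes that all optional parts (titles, column labels, column codes, row labels and row codes) are present, and the names split_fields and parse_rectangular are hypothetical.

```python
def split_fields(line, delimiter=";"):
    """Split a line on the delimiter; empty fields are skipped."""
    return [field.strip() for field in line.split(delimiter) if field.strip()]


def parse_rectangular(text, n_rows, n_cols, delimiter=";"):
    """Parse a rectangular matrix file that contains all optional items."""
    lines = [line for line in text.splitlines() if line.strip()]
    description, row_title, col_title = split_fields(lines[0], delimiter)
    col_labels = split_fields(lines[1], delimiter)
    col_codes = [int(code) for code in split_fields(lines[2], delimiter)]
    rows = []
    for line in lines[3:3 + n_rows]:
        # Each row: label, code, then one value per column.
        label, code, *cells = split_fields(line, delimiter)
        rows.append((label, int(code), [float(cell) for cell in cells[:n_cols]]))
    return {
        "description": description, "row_title": row_title, "col_title": col_title,
        "col_labels": col_labels, "col_codes": col_codes, "rows": rows,
    }


example = """Average salary; Age group; Sex;
Male; Female;
1;2;
20 - 30;1;600;530;
31 - 40;2;650;564;
41 - 60;3;723;618;
"""
matrix = parse_rectangular(example, n_rows=3, n_cols=2)
print(matrix["rows"][0])  # prints ('20 - 30', 1, [600.0, 530.0])
```

The arguments n_rows and n_cols play the role of MATSIZE=(n,m).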
Square matrix
This is an ASCII file containing a lower-left triangle of a matrix (only off-diagonal elements), and optionally
vectors of means and standard deviations following the matrix, in free format.
Example.
;;Paris;London;Brussels;Madrid; ...
;;1;2;3;4; ...
Paris;1;
London;2;0.55;
Brussels;3;0.45;0.35;
Madrid;4;1.45;2.35;1.15;
. . .
Format.
1. Column labels (variable names). (Optional: as many labels as columns/rows in the array of values).
2. Column codes (variable numbers). (Optional: as many codes as columns/rows in the array of values).
3. The array of values. (This may optionally contain one row label and/or code before each row of values).
4. A vector of means. (Optional).
5. A vector of standard deviations. (Optional).
Note. If labels and/or codes are not present, they are automatically generated for the output IDAMS matrix
(labels as V-#0001, V-#0002, ... and codes from 1 to the number of columns/rows).
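A comparable Python sketch for the square layout expands the lower-left triangle into a full symmetric array. It assumes that column labels, column codes, row labels and row codes are all present and that no vectors of means or standard deviations follow; parse_square is a hypothetical name, and diagonal cells are left as None because the file holds only off-diagonal elements.

```python
def parse_square(text, size, delimiter=";"):
    """Parse a lower-left triangle with labels and codes into a symmetric matrix."""
    def fields(line):
        # Empty fields are skipped.
        return [f.strip() for f in line.split(delimiter) if f.strip()]

    lines = [line for line in text.splitlines() if line.strip()]
    labels = fields(lines[0])
    codes = [int(code) for code in fields(lines[1])]
    matrix = [[None] * size for _ in range(size)]
    for i, line in enumerate(lines[2:2 + size]):
        # Row i: label, code, then i off-diagonal values.
        _label, _code, *cells = fields(line)
        for j, cell in enumerate(cells[:i]):
            matrix[i][j] = matrix[j][i] = float(cell)
    return labels, codes, matrix


example = """;;Paris;London;Brussels;Madrid;
;;1;2;3;4;
Paris;1;
London;2;0.55;
Brussels;3;0.45;0.35;
Madrid;4;1.45;2.35;1.15;
"""
labels, codes, matrix = parse_square(example, size=4)
print(matrix[1][3])  # prints 2.35
```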
Data and Matrix Export
Depending on whether data or matrix(ces) are to be exported, the input is either a data file described by
an IDAMS dictionary (both numeric and alphabetic variables can be used) or a file of IDAMS square or
rectangular matrix(ces).
16.6 Setup Structure
$RUN IMPEX
$FILES
File specifications
$RECODE (optional with data export; unavailable otherwise)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary for data export/import (omit if $DICT used)
DATAxxxx input data/matrix (omit if $DATA used)
DICTyyyy output dictionary for data import
DATAyyyy output data/matrix
PRINT results (default IDAMS.LST)
16.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution if data export is specified.
Example: EXCLUDE V19=2-3
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: EXPORTING SOCIAL DEVELOPMENT INDICATORS
3. Parameters (mandatory). For selecting program options.
Example: EXPORT=(DATA,NAMES) FORMAT=DELIMITED WITH=SPACE
IMPORT=(DATA/MATRIX, NAMES, CODES)
DATA Data import is requested.
MATR Matrix import is requested.
NAME Variable names are included in the Data file to import. Variable names/code labels
are included in the Matrix file to import.
CODE Variable numbers are included in the Data file to import. Variable numbers/code
values are included in the Matrix file to import.
EXPORT=(DATA/MATRIX, NAMES, CODES)
DATA Data export is requested.
MATR Matrix export is requested.
NAME Variable names are to be exported in the output Data file. Variable names/code labels
are to be exported in the output Matrix file.
CODE Variable numbers are to be exported in the output Data file. Variable numbers/code
values are to be exported in the output Matrix file.
Note. No defaults. Either IMPORT or EXPORT (but not both) must be specified.
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input file(s):
Data or Matrix file to import (default ddname: DATAIN),
Dictionary and Data files to export data (default ddnames: DICTIN, DATAIN),
IDAMS Matrix file to export (default ddname: DATAIN).
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric import or export data values and insufficient field width output
values. See The IDAMS Setup File chapter.
MAXCASES=n
Applicable only if data import/export is specied.
The maximum number of cases (after filtering) to be used from the input data file.
Default: All cases will be used.
MAXERR=0/n
The maximum number of insufficient field width errors allowed before execution stops. These
errors occur when the value of a variable is too big to fit into the field assigned, e.g. a value of
250 when a field width of 2 has been specified.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output file(s):
Dictionary and Data files obtained by import (default ddnames: DICTOUT, DATAOUT),
IDAMS Matrix file obtained by import (default ddname: DATAOUT),
exported Data or Matrix file (default ddname: DATAOUT).
OUTVARS=(variable list)
Applicable only if data export is specied.
V- and R-variables which are to be exported. The order of the variables in the list is not significant,
since they are output in ascending numerical order. All V- and R-variable numbers must be
unique.
No default.
MATSIZE=(n,m)
Applicable only if matrix import is specied.
Number of rows and columns of the matrix to import. The program assumes a rectangular matrix
if both are specied and a square symmetric matrix if one of them is omitted.
n Number of rows.
m Number of columns.
No default.
FORMAT=DELIMITED/DIF
Specifies the input data/matrix format for import, or the output data/matrix format for export.
DELI Data/matrix(ces) is expected to be of free format, in which fields are separated with
a delimiter (see below).
DIF Data are expected to be in DIF format.
Note: DIF format is available only for data export or import.
WITH=SPACE/TABULATOR/COMMA/SEMICOLON/USER
(Conditional: see FORMAT=DELIMITED).
Specifies the delimiter character to separate fields in free format file.
SPAC Blank character (ASCII code: 32).
TABU Tabulator character (ASCII code: 9).
COMM Comma , (ASCII code: 44).
SEMI Semicolon ; (ASCII code: 59).
USER User specified character (see the parameter DELCHAR below).
Note: In importing/exporting DIF files, COMMA is always used as the delimiter character,
independently of what is selected.
DELCHAR=x
(Conditional: see the parameter WITH=USER above).
Defines the character used to separate fields in free format files.
Default: Blank.
DECIMALS=POINT/COMMA
Defines the character used in decimal notation.
POIN Point . (ASCII code: 46).
COMM Comma , (ASCII code: 44).
STRINGS=PRIME/QUOTE/NONE
Defines the character used to enclose character strings.
PRIM Prime.
QUOT Quote.
NONE No special character is used.
Note: In importing/exporting DIF files, QUOTE is always used, independently of what is selected.
NDEC=2/n
Number of decimal places to be retained in export.
PRINT=(DICT/CDICT/NODICT, DATA)
DICT Print the dictionary without C-records.
CDIC Print the dictionary with C-records if any.
DATA Print data values.
Note:
(a) Dictionary printing options control both input and output dictionary printing.
(b) Data printing option controls output data printing if a data file is exported, and controls both
input and output if data import is requested (input is never printed if a DIF format data file is
imported).
(c) For matrices, the input matrix is printed whenever data printing is specied.
16.8 Restrictions
1. The maximum number of R-variables that can be exported is 250.
2. The maximum number of variables that can be used in one execution (including variables used only in
Recode statements) is 500.
3. The maximum number of matrix rows is 100.
4. The maximum number of matrix columns is 100.
5. The maximum number of matrix cells is 1000.
16.9 Examples
Example 1. Selected variables from the input dataset are transferred to the output le along with two
new variables; data are output in free format with values separated by a semicolon; commas will be used
in decimal notation while alphabetic variable values will be enclosed in quotes; variable names and variable
numbers will be included in the output data file.
$RUN IMPEX
$FILES
PRINT = EXPDAT.LST
DICTIN = OLD.DIC input Dictionary file
DATAIN = OLD.DAT input Data file
DATAOUT = EXPORTED.DAT exported Data file
$SETUP
EXPORTING IDAMS FIXED FORMAT DATA TO FREE FORMAT DATA
EXPORT=(DATA,NAMES,CODES) BADD=MD1 MAXERR=20 -
OUTVARS=(V1-V20,V33,V45-V50,R105,R122) -
FORMAT=DELIM WITH=SEMI DECIM=COMMA STRINGS=QUOTE
$RECODE
R105=BRAC(V5,15-25=1,<36=2,<46=3,<56=4,<66=5,<90=6,ELSE=9)
MDCODES R105(9)
NAME R105'GROUPS OF AGE'
IF MDATA(V22) THEN R122=99.9 ELSE R122=V22/3
MDCODES R122(99.9)
NAME R122'NO ARTICLES PER YEAR'
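For readers unfamiliar with Recode statements, the effect of the two statements that create R105 and R122 can be paraphrased in Python. This is a reading of the example, not IDAMS code: it assumes that the BRAC rules are tried from left to right and the first matching rule wins, and it represents a missing data value of V22 by None. The function names are hypothetical.

```python
def recode_r105(v5):
    """Bracket V5 into groups, trying the rules in the order they are written."""
    if 15 <= v5 <= 25:          # 15-25=1
        return 1
    # <36=2, <46=3, <56=4, <66=5, <90=6
    # Under this left-to-right reading, values below 15 also match <36.
    for upper, code in ((36, 2), (46, 3), (56, 4), (66, 5), (90, 6)):
        if v5 < upper:
            return code
    return 9                    # ELSE=9, declared as missing data by MDCODES


def recode_r122(v22):
    """IF MDATA(V22) THEN R122=99.9 ELSE R122=V22/3."""
    return 99.9 if v22 is None else v22 / 3


print(recode_r105(20), recode_r105(30), recode_r105(90))  # prints 1 2 9
print(recode_r122(6))                                     # prints 2.0
```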
Example 2. DIF format data are imported to IDAMS; column labels and column codes are included in the
input data file, and commas are used in decimal notation.
$RUN IMPEX
$FILES
PRINT = IMPDAT.LST
DICTIN = IDA.DIC Dictionary file describing data to be imported
DATAIN = IMPORTED.DAT Data file to be imported
DICTOUT = IDAFORM.DIC output Dictionary file
DATAOUT = IDAFORM.DAT output Data file
$SETUP
IMPORTING DIF FORMAT DATA TO IDAMS FIXED FORMAT DATA
IMPORT=(DATA,NAMES,CODES) BADD=MD1 MAXERR=20 -
FORMAT=DIF DECIM=COMMA
Example 3. A set of rectangular matrices created by the TABLES program is exported; values will be
separated by a semicolon and commas will be used in decimal notation; column and row labels and codes
will be included in the output matrix file; input matrices are printed.
$RUN IMPEX
$FILES
PRINT = EXPMAT.LST
DATAIN = TABLES.MAT file with rectangular matrices
DATAOUT = EXPORTED.MAT file with exported matrices
$SETUP
EXPORTING IDAMS RECTANGULAR FIXED FORMAT MATRICES TO FREE FORMAT MATRICES
EXPORT=(MATRIX,NAMES,CODES) PRINT=DATA -
FORMAT=DELIM WITH=SEMI DECIM=COMMA STRINGS=QUOTE
Example 4. Importing a square matrix containing distance measures for 10 objects numbered from 1 to
10; only integer values are included and are separated by the % sign; column/row codes as well as vectors
of means and standard deviations are included in the matrix file.
$RUN IMPEX
$FILES
PRINT = IMPMAT.LST
DATAOUT = IMPORTED.MAT file with the imported matrix
$SETUP
IMPORTING A FREE FORMAT MATRIX TO THE IDAMS SQUARE FIXED FORMAT MATRIX
IMPORT=(MATRIX,CODES) MATSIZE=10 -
FORMAT=DELIM WITH=USER DELCH=%
$DATA
$PRINT
% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10%
1%
2%38%
3%72%25%
4%24%53%17%
5%64%26%76%18%
6%48%25%63%15%61%
7%12%50%7%42%8%8%
8%19%7%13%4%14%1%15%
9%29%37%34%21%24%35%3%5%
10%32%57%29%45%26%28%74%24%61%
%46%15%7%7119%74%38%9%19%34%256%
%9%11%84%8971%23%28%12%20%35%843%
Chapter 17
Listing Datasets (LIST)
17.1 General Description
LIST can be used to print data values from a file, recoded variables and information from the associated
IDAMS dictionary. Specific variables may be selected for printing, or the entire data and/or dictionary may
be listed.
Each record in a data file is a continuous stream of data values. When printed as is, it becomes difficult
to distinguish the values of adjacent variables. LIST eliminates this inconvenience by offering a data printing
format which separates variable values.
An IDAMS dictionary can be printed without the corresponding Data file by supplying a dummy file (i.e.
an empty or null file), when defining the Data file.
17.2 Standard IDAMS Features
Case and variable selection. Cases may be selected by using a filter, or the skip cases option (SKIP).
The skip option, if used, specifies that the first and every subsequent n-th case is to be printed. If a filter is
specified, the skip option applies to those cases passing the filter. From the cases selected, the data values
are listed for all the variables described in the dictionary or a subset if the parameter VARS is specified.
Transforming data. Recode statements may be used.
Treatment of missing data. Missing data values are printed as they occur, causing no special action.
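The interaction of the filter and the skip option can be shown in a short Python sketch. It assumes one particular reading of "the first and every subsequent n-th case", namely cases 1, 1+n, 1+2n, ... counted among those passing the filter; select_cases is a hypothetical name.

```python
def select_cases(cases, passes_filter, skip):
    """Apply the filter first, then keep the first and every skip-th case after it."""
    filtered = [case for case in cases if passes_filter(case)]
    return filtered[::skip]


# 20 cases, the filter keeps even-numbered ones, every 3rd of those is printed.
selected = select_cases(list(range(1, 21)), lambda case: case % 2 == 0, skip=3)
print(selected)  # prints [2, 8, 14, 20]
```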
17.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution. If all variables are selected for printing, then the complete
dictionary is printed in sequential order.
Data. Numeric variables are printed with explicit decimal point, if any, and without leading zeros. If a
value overflows the field width it is printed as a string of asterisks. Bad data replaced by default missing
data codes are printed as blanks. Values for a variable are printed in a column that extends for as many
pages as necessary for all cases selected for printing. Below is a block sketch of the printing format:
v v v v
xxx xxxx x xxxxxxxx
xxx xxxx x xxxxxxxx
xxx xxxx x xxxxxxxx
. . . .
. . . .
The v headings on the columns represent variable numbers and the x's represent variable values. If the
user requests printing of more variables than will fit on a line (127 characters), LIST will make a number
of passes through the data, listing as many variables as it can each time. For example, if 50 variables were
to be printed, LIST would read through the data, printing all the values, say, for the first 10 variables.
Then the data would be read again for the printing, say, of the next 12 variables, and so on. The number of
variables printed on any pass over the data depends on the field width of the variables being printed and is
automatically computed by LIST.
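The splitting of a long variable list into passes can be pictured with a short Python sketch. It is an illustration under stated assumptions only: the function name plan_passes, the greedy packing rule and the id_width argument are hypothetical, and the manual does not document the algorithm LIST actually uses. Only the 127-character line and the default of 3 spaces between columns come from the text.

```python
def plan_passes(widths, line_width=127, space=3, id_width=0):
    """Greedily group variables into passes that fit on one print line.

    widths     -- list of (variable name, field width) pairs, in print order
    line_width -- characters per print line (127 for PRINT=LONG)
    space      -- blanks between columns (the SPACE parameter)
    id_width   -- room reserved on every pass for sequence/ID columns (assumed)
    """
    passes, current, used = [], [], id_width
    for name, width in widths:
        needed = space + width
        # Start a new pass when the next column would overflow the line.
        if current and used + needed > line_width:
            passes.append(current)
            current, used = [], id_width
        current.append(name)
        used += needed
    if current:
        passes.append(current)
    return passes


# Five hypothetical 40-character variables: two fit on a 127-character line.
example = [("V1", 40), ("V2", 40), ("V3", 40), ("V4", 40), ("V5", 40)]
print(plan_passes(example))
```

With narrower variables more columns fit on each pass, which is why the number of variables per pass varies with the field widths.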
Sequence and case identification. Options exist to print a case sequence number and/or values of
identification variable(s) with each case. (See parameters PRINT and IDVARS). They are printed as the
first columns.
Recode variables. These are printed with 11 digits including an explicit decimal point and 2 decimal
places.
17.4 Input Dataset
The input is a Data file described by an IDAMS dictionary. If only a listing of the dictionary is required,
the Data file is specified as NUL.
17.5 Setup Structure
$RUN LIST
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
17.6 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V5=100-199
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: PRINTING THE STUDY: 113A
3. Parameters (mandatory). For selecting program options.
Example: VARS=(V3,V10-V25) IDVARS=V1
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases to be printed.
Default: All cases will be printed.
SKIP=n
Every n-th case (or every n-th case passing the filter) is printed, starting with the first case. The last
case will always be printed unless the MAXCASES option forbids it.
Default: All cases (or all cases passing the filter) are printed.
VARS=(variable list)
Print the data values for the specified variables. Variable values will be printed in the order they
appear in this list.
Default: All variables in the dictionary are listed.
IDVARS=(variable list)
The values of the variable(s) specified are printed to identify each case.
SPACE=3/n
Number of spaces between columns.
The maximum value is SPACE=8.
PRINT=(CDICT/DICT, SEQNUM, LONG/SHORT, SINGLE/DOUBLE)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
SEQN Print a case sequence number for each case printed. Note that cases are numbered
after the filter is applied.
LONG Assume 127 characters per print line.
SHOR Assume 70 characters per print line.
SING Single space between data lines.
DOUB Double space between data lines.
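The interplay of the SKIP and MAXCASES parameters described above can be sketched in Python. This is one reading of the text, not a description of the program's internals: the function name select_cases is hypothetical, the input list is assumed to hold the cases that already passed the filter, and MAXCASES is assumed to cap the number of printed cases.

```python
def select_cases(cases, skip=1, maxcases=None):
    """Return the cases that would be printed under one reading of SKIP/MAXCASES."""
    # Keep the 1st case and every skip-th case after it.
    selected = [case for index, case in enumerate(cases) if index % skip == 0]
    # The last case is always kept, unless it was already picked.
    if cases and (len(cases) - 1) % skip != 0:
        selected.append(cases[-1])
    # MAXCASES caps the number of printed cases and may drop the last one.
    if maxcases is not None:
        selected = selected[:maxcases]
    return selected


# 25 cases numbered 1..25 with SKIP=10: cases 1, 11, 21 and the last case.
print(select_cases(list(range(1, 26)), skip=10))
# Adding MAXCASES=2 stops the listing before the last case is reached.
print(select_cases(list(range(1, 26)), skip=10, maxcases=2))
```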
17.7 Restriction
The sum of the field widths of variables to be printed, including case ID variables, must be less than or equal
to 10,000 characters.
17.8 Examples
Example 1. Listing fifty variables including one recoded variable; all cases will be printed with their
identication variables (V1, V2 and V4); dictionary will be printed but without C-records.
$RUN LIST
$FILES
PRINT = LIST1.LST
DICTIN = STUDY.DIC input Dictionary file
DATAIN = STUDY.DAT input Data file
$RECODE
R6=BRAC(V6,0-50=1,51-99=2)
$SETUP
LISTING THE VALUES OF 50 VARIABLES WITH 3 ID VARIABLES WITH EACH GROUP
IDVA=(V1,V2,V4) VARS=(V3-V49,V59,V52,R6) PRIN=DICT
Example 2. Listing a complete dictionary with C-records without listing the data.
$RUN LIST
$FILES
DICTIN = STUDY.DIC input Dictionary file
DATAIN = NUL
$SETUP
LISTING COMPLETE DICTIONARY
PRIN=CDICT
Example 3. Check recoding by listing values of input and recoded variables for 10 cases.
$RUN LIST
$FILES
DICTIN = A.DIC input Dictionary file
DATAIN = A.DAT input Data file
$RECODE
R101=COUNT(1,V40-V49)
IF MDATA(V9,V10) THEN R102=99 ELSE R102=V9+V10
R103=BRAC(V16,15-24=1,25-34=2,35-54=3,ELSE=9)
$SETUP
CHECKING VALUES FOR 3 RECODED VARIABLES
MAXCASES=10 SKIP=10 SPACE=1 -
VARS=(V40-V49,R101,V9,V10,R102,V16,R103)
Chapter 18
Merging Datasets (MERGE)
18.1 General Description
MERGE merges variables from cases in one IDAMS dataset with variables from a second dataset, matching
the cases pair-wise on a common match variable(s). The cases in the two datasets do not have to be identical;
that is, all cases present in one dataset do not have to be present in the other. The output data file consists
of records containing user-specified variables from each of the two input files along with a corresponding
IDAMS dictionary. In order to distinguish the two input datasets, one is referred to as dataset A, the
other as dataset B throughout the write-up.
Combining datasets with identical collections of cases. An example of one use of the program is
the combination of the data from the first and a subsequent wave of interviews with the same collection of
respondents.
Combining datasets with somewhat different collections of cases. When there is more than one
wave of interviews in a survey, some respondents may drop out, and some may be added. The program
allows for these discrepancies between datasets and may, for example, be requested to output the records for
all respondents, including those interviewed in only one wave. In this example, the variable values for the
wave when a respondent was not interviewed would be output as missing data values.
Combining datasets with different levels of data. MERGE may also be used to combine two datasets,
one of which contains data at a more aggregated level than the other. For example, household data can be
added to individual household member records.
18.2 Standard IDAMS Features
Case and variable selection. A filter may be specified for either or both of the input datasets. The only
difference in the format of the filter is that it must be preceded by "A:" or "B:" in columns 1-2 to indicate
the dataset to which the filter applies.
All or selected variables from each input dataset can be included in the output dataset. These output
variables are specified in a variable list which has the usual format, except that variables are denoted by an
"A" or "B" (instead of "V") to identify the input dataset in which they exist. For example, "A1, B5, A3-A45"
selects variables V1 and V3-V45 from dataset A and variable V5 from dataset B. See the output variables
description in the "Program Control Statements" section.
Transforming data. Recode statements may not be used.
Treatment of missing data. For the options MATCH=UNION, MATCH=A, and MATCH=B, missing
data codes are used as values for the output variables which are not available for a particular case. See
the paragraph Handling cases that appear in only one input dataset in the section describing the output
dataset below. The missing data codes are obtained from the dictionaries of the A and B datasets. For each
dataset, the user specifies whether the first or second missing data code should be used for all variables
from that dataset (see the parameters APAD and BPAD). If a variable does not have an appropriate
missing data code in the dictionary, then blanks are output.
Missing data are never output as the value for an output variable that is also one of the match variables,
because a match variable value is always available from the one dataset that does contain the case. For
example, with MATCH=UNION selected, suppose that variables A1 and B3 were used as the match variables
and that only A1 was listed as an output variable (A1 and B3 would not both be listed as they presumably
have the same value): then, if a case in dataset A was missing, the value for the A1 output variable would
be the B3 value.
18.3 Results
Old (input) versus new (output) variable numbers. (Optional: see the parameter PRINT). A chart
containing the input variable numbers and reference numbers, and the corresponding output variable numbers
and reference numbers.
Output dictionary. (Optional: see the parameter PRINT).
Documentation of unmatched cases in either dataset A or B. There are several ways that unmatched
cases, i.e. cases appearing in only one file, may be documented (see the parameter PRINT).
The values of match variables may be printed:
- whenever output variables from one of the datasets are padded with missing data,
- whenever cases from dataset A are deleted,
- whenever cases from dataset B are deleted.
The values of variables A may be printed whenever a case from dataset A does not match any case
from dataset B. The variables are printed in the order specified for the dataset in the output variables,
followed by all the match variables which are not also output variables.
The values of variables B may be printed whenever a case from dataset B does not match any case
from dataset A. The variables are printed in the order specified for the dataset in the output variables,
followed by all the match variables which are not also output variables.
Case counts. The program prints the number of cases existing in datasets A and B, the number of cases
in dataset A and not in dataset B, the number of cases in dataset B and not in dataset A, and the total
number of output cases written.
18.4 Output Dataset
The output is a new Data file and a corresponding IDAMS dictionary.
Each data record contains the values of the output variables for matching cases from datasets A and B. Note
that a match variable is not automatically output: the user must include the match variable(s) from one of
the datasets in the output variable list in order to give the output a case ID.
Handling cases that appear in only one input dataset. Four actions are possible:
1. MATCH=INTERSECTION. Cases that appear in only one input dataset are not included in the
output dataset. (If data sets A and B are thought of as sets of cases, the output is the intersection of
sets A and B).
2. MATCH=UNION. Any case that appears in either input dataset is included in the output dataset.
Variables from the input dataset that does not contain the case are assigned missing data values in
the output dataset. (The output is the union of sets A and B).
3. MATCH=A. Any case that appears in dataset A is included in the output dataset, while a case that
appears only in dataset B is not included. If a case is found only in dataset A, variables from dataset
B are assigned missing data values in the output dataset for that case. (The output is set A).
4. MATCH=B. The same as option 3, except that dataset B defines the cases included in the output
dataset. (The output is set B).
Handling duplicate cases. When one of the two input datasets contains more than one case with the
same value on the match variable(s), the dataset is said to contain duplicate cases. Normally (i.e. when the
parameter DUPBFILE is not specified) the program prints a message about the occurrence of duplicates
and then treats each of them as a separate case. The cases actually written to the output file depend on the
MATCH option selected. The following figure shows how this works.
Merging Files with Duplicates (DUPBFILE not specified)
       Input                                          Output
A        | B         | MATCH = UNION | MATCH = A     | MATCH = B     | MATCH = INTER
         |           |               |               |               |
ID N1    | ID N2     | ID N1   N2    | ID N1   N2    | ID N1   N2    | ID N1   N2
         |           |               |               |               |
01 MARY  | 01 JOHN   | 01 MARY JOHN  | 01 MARY JOHN  | 01 MARY JOHN  | 01 MARY JOHN
01 ANN   | 02 PETER  | 01 ANN  ____  | 01 ANN  ____  | 02 JANE PETER | 02 JANE PETER
02 JANE  | 03 MIKE   | 02 JANE PETER | 02 JANE PETER | 03 ____ MIKE  |
         |           | 03 ____ MIKE  |               |               |
However, duplicates can be interpreted and handled differently when one of the two datasets contains cases
at a lower level of analysis than the other. For example, one dataset contains household data and the second
contains data for household members. In this instance, the match variables specified from each file would
be the household identification. Thus, duplicates would naturally occur in the household members
dataset, as most households would have more than one member. By specifying the parameter DUPBFILE,
the message about the occurrence of duplicates is not printed and cases are constructed for each duplicate
case in dataset B with the variables from the matching A case copied onto each. The following figure shows
an example of this procedure.
Merging Files at Different Levels (DUPBFILE specified)
       Input                                         Output
A        | B        | MATCH = UNION | MATCH = A     | MATCH = B     | MATCH = INTER
         |          |               |               |               |
ID N1    | ID N2    | ID N1   N2    | ID N1   N2    | ID N1   N2    | ID N1   N2
         |          |               |               |               |
01 JONE  | 01 MARY  | 01 JONE MARY  | 01 JONE MARY  | 01 JONE MARY  | 01 JONE MARY
03 SMIT  | 01 JOHN  | 01 JONE JOHN  | 01 JONE JOHN  | 01 JONE JOHN  | 01 JONE JOHN
04 SCOT  | 01 ANN   | 01 JONE ANN   | 01 JONE ANN   | 01 JONE ANN   | 01 JONE ANN
         | 02 PETE  | 02 ____ PETE  | 03 SMIT MIKE  | 02 ____ PETE  | 03 SMIT MIKE
         | 02 JANE  | 02 ____ JANE  | 04 SCOT ____  | 02 ____ JANE  |
         | 03 MIKE  | 03 SMIT MIKE  |               | 03 SMIT MIKE  |
         |          | 04 SCOT ____  |               |               |
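The two figures above can be reproduced with a small Python sketch of a sorted two-file merge. It is a minimal illustration, assuming each case is an (id, value) pair and both lists are already sorted on the id; the function name merge, the PAD constant and the data layout are hypothetical and say nothing about how MERGE is implemented internally.

```python
PAD = "____"  # stands in for a missing data code, as in the figures


def merge(a, b, match, dupbfile=False):
    """Pair two sorted lists of (id, value) cases the way the figures show.

    match    -- "UNION", "A", "B" or "INTER"
    dupbfile -- when True, one A case is copied onto every B case with its id
    """
    out = []
    i = j = 0
    a_paired = False  # has the current A case already been paired? (DUPBFILE only)
    while i < len(a) or j < len(b):
        ka = a[i][0] if i < len(a) else None
        kb = b[j][0] if j < len(b) else None
        if kb is None or (ka is not None and ka < kb):
            # Current A case has no (further) partner in B.
            if not a_paired and match in ("UNION", "A"):
                out.append((ka, a[i][1], PAD))
            i += 1
            a_paired = False
        elif ka is None or kb < ka:
            # Current B case has no partner in A.
            if match in ("UNION", "B"):
                out.append((kb, PAD, b[j][1]))
            j += 1
        else:
            # Ids are equal (compared as character strings).
            out.append((ka, a[i][1], b[j][1]))
            j += 1
            if dupbfile:
                a_paired = True  # keep the A case for further B duplicates
            else:
                i += 1           # each case is used once
    return out


# Data of the first figure (DUPBFILE not specified).
a1 = [("01", "MARY"), ("01", "ANN"), ("02", "JANE")]
b1 = [("01", "JOHN"), ("02", "PETER"), ("03", "MIKE")]
# Data of the second figure (DUPBFILE specified).
a2 = [("01", "JONE"), ("03", "SMIT"), ("04", "SCOT")]
b2 = [("01", "MARY"), ("01", "JOHN"), ("01", "ANN"),
      ("02", "PETE"), ("02", "JANE"), ("03", "MIKE")]

for row in merge(a2, b2, "UNION", dupbfile=True):
    print(*row)
```

Both inputs must be in ascending order on the id, which mirrors the requirement that the input files be sorted on the match variables before MERGE is run.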
Variable sequence and variable numbers. Variables are output in the order they are given in the
output variable list and are always renumbered, starting at the value of the parameter VSTART. Thus, an
output variable list such as A1-A5, B6, A7-A25, B100 would create a dataset with variables V1 through
V26 if VSTART=1. Reference numbers for variables, if they exist, are transferred unchanged to the output
dictionary.
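The renumbering rule can be checked with a short Python sketch. The helper names expand_output_list and renumber are hypothetical, and the sketch assumes that every variable number inside a range such as A7-A25 exists in the input dictionary, which need not hold for a real dataset.

```python
import re


def expand_output_list(spec):
    """Expand 'A1-A5, B6' into [('A', 1), ..., ('A', 5), ('B', 6)]."""
    items = []
    for part in spec.replace(" ", "").split(","):
        m = re.fullmatch(r"([AB])(\d+)(?:-([AB])(\d+))?", part)
        if m is None:
            raise ValueError(f"cannot parse {part!r}")
        dataset, low, dataset2, high = m.groups()
        if high is None:
            items.append((dataset, int(low)))
        else:
            if dataset2 != dataset:
                raise ValueError(f"range {part!r} mixes datasets")
            # Assumption: all numbers between low and high exist.
            items.extend((dataset, n) for n in range(int(low), int(high) + 1))
    return items


def renumber(spec, vstart=1):
    """Map each output variable number to the input variable it comes from."""
    return [(f"V{vstart + k}", f"{dataset}{n}")
            for k, (dataset, n) in enumerate(expand_output_list(spec))]


mapping = renumber("A1-A5, B6, A7-A25, B100", vstart=1)
print(mapping[0], mapping[-1], len(mapping))
```

The list expands to 5 + 1 + 19 + 1 = 26 variables, which is why the output dataset runs from V1 through V26 when VSTART=1.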
Variable locations. Variable locations are assigned by MERGE starting with the first output variable and
continuing in order through the output variable list.
18.5 Input Datasets
MERGE requires 2 input Data files each described by an IDAMS dictionary.
The match variables may be alphabetic or numeric. Corresponding match variables from the A and B
datasets must have the same field width.
The output variables may be alphabetic or numeric.
Each input Data file must be sorted in ascending order on its match variables prior to using MERGE.
18.6 Setup Structure
$RUN MERGE
$FILES
File specifications
$SETUP
1. Filter(s) (optional)
2. Label
3. Parameters
4. Match variable specification
5. Output variables
$DICT (conditional)
Dictionary (see Note below)
$DATA (conditional)
Data (see Note below)
Files:
DICTxxxx input dictionary for dataset A (omit if $DICT used)
DATAxxxx input data for dataset A (omit if $DATA used)
DICTyyyy input dictionary for dataset B (omit if $DICT used)
DATAyyyy input data for dataset B (omit if $DATA used)
DICTzzzz output dictionary
DATAzzzz output data
PRINT results (default IDAMS.LST)
Note. Either the A dataset or the B dataset, but not both, may be introduced in the setup. However,
records following $DICT and $DATA are copied into files defined by DICTIN and DATAIN respectively.
Therefore, if the A file is introduced in the setup, the A dataset will be defined by DICTIN and DATAIN
and INAFILE=IN must be specified. Similarly, if the B file is introduced in the setup, then INBFILE=IN
must be specified.
18.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter(s) (optional). Selects a subset of cases from dataset A and/or dataset B to be used in the
execution. Note that each filter statement must be preceded by "A:" or "B:" in columns one and two
to indicate the dataset to which the filter applies.
Example: A: INCLUDE V1=10,20,30
B: INCLUDE V1=10,20,30
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: MERGE OF TEACHER DATA AND STUDENT DATA
3. Parameters (mandatory). For selecting program options.
Example: MATCH=INTE PRINT=(A, B)
INAFILE=INA/xxxx
A 1-4 character ddname suffix for the A input Dictionary and Data files.
Default ddnames: DICTINA, DATAINA.
INBFILE=INB/xxxx
A 1-4 character ddname suffix for the B input Dictionary and Data files.
Default ddnames: DICTINB, DATAINB.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file A.
Default: All cases will be used.
MATCH=INTERSECTION/UNION/A/B
INTE Output only cases appearing in both datasets A and B.
UNIO Output cases appearing in either or both datasets A and B, padding variables with
missing data when necessary.
A Output cases appearing in the A dataset only, padding B variables with missing data
when necessary.
B Output cases appearing in the B dataset only, padding A variables with missing data
when necessary.
No default.
DUPBFILE
A case in dataset A may be paired with one or more cases (i.e. duplicates) from dataset B. For
each pairing, an output record will be created, depending on the MATCH parameter.
Note: The dataset with the expected duplicates must be defined as the B dataset.
Default: Duplicate cases in either dataset will be noted in the printed output and then treated
as distinct cases according to the MATCH specification.
OUTFILE=OUT/zzzz
A 1-4 character ddname suffix for the output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
VSTART=1/n
Variable number for the first variable in the output dataset.
APAD=MD1/MD2
When padding A variables with missing data:
MD1 Output first missing data code.
MD2 Output second missing data code.
BPAD=MD1/MD2
When padding B variables with missing data:
MD1 Output first missing data code.
MD2 Output second missing data code.
PRINT=(PAD/NOPAD, ADELETE/NOADELETE, BDELETE/NOBDELETE, VARNOS,
A, B, OUTDICT/OUTCDICT/NOOUTDICT)
PAD Print the values of match variables when padding any A or B variables with missing
data.
ADEL Print the values of match variables for dataset A whenever a case from dataset A is
not included in the output data file.
BDEL Print the values of match variables for dataset B whenever a case from dataset B is
not included in the output data file.
VARN Print a list of the variable numbers in the input datasets and corresponding variable
numbers in the output dataset.
A Print all output and match variable values for cases appearing only in dataset A,
whether or not they are included in the output dataset.
B Print all output and match variable values for cases appearing only in dataset B,
whether or not they are included in the output dataset.
OUTD Print the output dictionary without C-records.
OUTC Print the output dictionary with C-records if any.
NOOU Do not print the output dictionary.
4. Match variable specification (mandatory). This statement defines the variables from datasets A
and B that are to be compared to match cases. Note that each input data file must be sorted on its
match variable(s) prior to using MERGE.
Example: A1=B3, A5=B1
which means that for a case from dataset A to match a case from dataset B, the value of variable V1
from the dataset A must be identical to the value of variable V3 from the dataset B, and similarly for
the variables V5 and V1.
General format
An=Bm, Aq=Br, ...
Rules for coding
The field width of the two variables to be compared must be identical. The comparison is done
on a character basis, not a numeric one. Thus, "0.9" is not equivalent to "009", nor is "9" equal to
"09". If the field widths are not the same, use the TRANS program to change the width of one of
the variables prior to using MERGE.
Each match variable pair is separated by a comma.
Blanks may occur anywhere in the statement.
To continue to another line, terminate the information at a comma and enter a dash (-) to indicate
continuation.
5. Output variables (mandatory). This defines which variables from each input dataset are to be
transferred to the output and specifies their order in the output.
Example: A1, B2, A5-A10, B5, B7-B10
which means that the output dataset will contain variable V1 from dataset A, followed by variable V2
from dataset B, followed by variables V5 through V10 from dataset A, etc., in that order.
Rules for coding
The rules for coding are the same as for specifying variables with the parameter VARS, except
that the letters A and B are used instead of V. Each variable number from dataset A is preceded by an
"A" and each variable number from dataset B is preceded by a "B".
Duplicate variables in the list count as separate variables.
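The character-based comparison of match variables described in item 4 above can be illustrated with a few lines of Python. This is a sketch only: the function name keys_match is hypothetical, and the blank-padded value " 9" is an assumed example of a two-character field, not a value taken from the manual.

```python
def keys_match(a_values, b_values):
    """Compare match-variable values character by character."""
    for a_val, b_val in zip(a_values, b_values):
        if len(a_val) != len(b_val):
            raise ValueError("match variables must have the same field width")
        if a_val != b_val:
            return False
    return True


# Identical characters in every match variable: the cases match.
print(keys_match(["09", "113A"], ["09", "113A"]))
# Same numeric value, different characters: the cases do not match.
print(keys_match(["09"], [" 9"]))
# A numeric comparison would have treated the two values as equal.
print(int("09") == int(" 9"))
```

Values that are numerically equal but coded differently therefore need to be brought to a common coding before merging.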
18.8 Restrictions
1. The maximum number of match variables from each dataset is 20.
2. Match variables must be of the same type and field width in each file.
3. The maximum total length of the set of match variables from each dataset is 200 characters.
18.9 Examples
Example 1. Combining records from 2 datasets with an identical set of cases; in both datasets cases are
identified by variables 1 and 3; all variables are to be selected from each input dataset.
$RUN MERGE
$FILES
DICTOUT = AB.DIC output Dictionary file
DATAOUT = AB.DAT output Data file
DICTINA = A.DIC input Dictionary file for dataset A
DATAINA = A.DAT input Data file for dataset A
DICTINB = B.DIC input Dictionary file for dataset B
DATAINB = B.DAT input Data file for dataset B
$SETUP
COMBINING RECORDS FROM 2 DATASETS WITH AN IDENTICAL SET OF CASES
MATCH=UNION
A1=B1,A3=B3
A1-A112,B201-B401
Example 2. Combining datasets with somewhat different collections of cases; only cases having records
in both datasets are output; cases are identified by variables 2 and 4 in the first dataset, and by variables
105 and 107 respectively in the second dataset; variables in the output dataset will be re-numbered starting
from the number 201, and a listing of references is requested; only selected variables will be taken from each
input dataset.
$RUN MERGE
$FILES
as for Example 1
$SETUP
COMBINING RECORDS FROM 2 DATASETS WITH DIFFERENT SETS OF CASES
MATCH=INTE VSTA=201 PRIN=VARNOS
A2=B105,A4=B107
B105,B107,A36-A42,B120,B131
Example 3. Combining datasets with different levels of data; cases from dataset A are combined with a
subset of cases from dataset B; a case from dataset A may be paired with one or more cases from dataset
B; cases in dataset A which do not match a case in the selected subset of dataset B are dropped and not
listed.
$RUN MERGE
$FILES
as for Example 1
$SETUP
B: INCLUDE V18=2 AND V21=3
COMBINING 2 DATASETS WITH DIFFERENT LEVELS OF DATA
MATCH=B DUPB
A1=B15
B15,A2,A6-A12,B20-B31,B40
Example 4. Household income is to be calculated from a file of household members and then merged back
into individual member records; AGGREG is first used to sum the income (V6) over the individuals in the
household; V3 is the variable which identifies the household; the output file from AGGREG (defined by
DICTAGG and DATAAGG) will contain 2 variables, the household ID (V1) and household income (V2);
this file is then used as the "A" file with MERGE to add the appropriate household income (variable A2)
to each original individual's record (variables B1-B46).
$RUN AGGREG
$FILES
PRINT = MERGE4.LST
DICTIN = INDIV.DIC input Dictionary file
DATAIN = INDIV.DAT input Data file
DICTAGG = AGGDIC.TMP temporary output Dictionary file from AGGREG
DATAAGG = AGGDAT.TMP temporary output Data file from AGGREG
DICTOUT = INDIV2.DIC output Dictionary file from MERGE
DATAOUT = INDIV2.DAT output Data file from MERGE
$SETUP
AGGREGATING INCOME
IDVARS=V3 AGGV=V6 STATS=SUM OUTF=AGG
$RUN MERGE
$SETUP
MERGING HOUSEHOLD INCOME TO INDIVIDUAL RECORDS
INAFILE=AGG INBFILE=IN DUPB MATCH=B
A1=B3
B1-B46,A2
Note that once file assignments have been made under $FILES, they do not need to be repeated if they are
being reused in subsequent steps.
Chapter 19
Sorting and Merging Files
(SORMER)
19.1 General Description
SORMER allows the user to execute a Sort/Merge more conveniently by allowing the specification of the
sort or merge control-field information in the usual IDAMS parameter format. If the data file is described
by an IDAMS dictionary, then a copy of the dictionary corresponding to the sorted data can be output and
the sort fields may be specified by providing the appropriate variables; if not, they are specified by their
location.
Sort order. The user may specify that the data are to be sorted/merged in ascending or descending order.
19.2 Standard IDAMS Features
SORMER is a utility program and contains none of the standard IDAMS features.
19.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, for sort key variables.
Sort/Merge results. Number of records sorted/merged.
19.4 Output Dictionary
A copy of the input dictionary corresponding to the output Data file.
19.5 Output Data
Output consists of one file with the same attributes as the input file(s) with the records sorted into the
requested order.
19.6 Input Dictionary
If the sort fields are being specified with variable numbers, then an IDAMS dictionary containing T-records
for at minimum these variables must be input. Only dictionaries describing one record per case data are
allowed.
19.7 Input Data
For sorting, one data file is input, containing one or more fields (or variables) whose values define the desired
order.
For merging, input consists of 2-16 data files, each with the same record format, i.e. the same record length
and fields defining the sort order in the same positions. Each file must be sorted into order by the merge
control fields before merging.
19.8 Setup Structure
$RUN SORMER
$FILES
File specifications
$SETUP
1. Label
2. Parameters
$DICT (conditional)
Dictionary for sort/merge field variables
Files for sorting:
DICTxxxx IDAMS dictionary for sort field variables (omit if $DICT used)
SORTIN input data
DICTyyyy output dictionary
SORTOUT output data
Files for merging:
DICTxxxx IDAMS dictionary for merge field variables (omit if $DICT used)
SORTIN01 1st data file
SORTIN02 2nd data file
.
.
DICTyyyy output dictionary
SORTOUT output data
PRINT results (default IDAMS.LST)
Note. When SORMER execution is requested more than once in one setup file, the input file definitions
specified in a subsequent execution modify, but do not replace, the input file definitions specified previously,
e.g. if SORTIN01, SORTIN02 and SORTIN03 are specified for the first execution, and SORTIN01 and
SORTIN02 are specified for the second execution in the same setup, the new SORTIN01 and SORTIN02
as well as the old SORTIN03 will be taken for merging.
19.9 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-2 below.
1. Label (mandatory). One line containing up to 80 characters to label the results.
Example: SORTING WAVE ONE
2. Parameters (mandatory). For selecting program options.
Example: KEYVARS=(V2,V3)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary file.
Default ddname: DICTIN.
OUTFILE=yyyy
A 1-4 character ddname suffix for the output Dictionary file.
Must be specified in order to output a copy of the input Dictionary.
SORT/MERGE
SORT The input data are to be sorted.
MERG Two or more data files are to be merged.
ORDER=A/D
A Sort in ascending order on sort fields.
D Sort in descending order.
KEYVARS=(variable list)
List of variables to be used as sort fields (IDAMS dictionary must be supplied).
Note: The data file must have one record per case for this option to be selected. If there is more than one
record per case, use KEYLOC.
KEYLOC=(s1,e1, s2,e2, ...)
Sn Starting location of n-th sort field.
En Ending location of n-th sort field. Must be specified even when equal to the starting
location.
Note. No defaults. Either KEYVARS or KEYLOC (but not both) must be specified.
PRINT=CDICT/DICT
CDIC Print the input dictionary for the sort key variables with C-records if any.
DICT Print the input dictionary without C-records.
19.10 Restrictions
1. A maximum of 16 files may be merged.
2. A maximum of 12 Sort/Merge control fields or variables may be specified.
3. The maximum number of records depends on the disk space available for the work files SORTWK01,
02, 03, 04, 05. These work files can be assigned to a disk other than the default drive if necessary.
19.11 Examples
Example 1. Merging three pre-sorted data files of the same format; each file is described by the same
IDAMS dictionary; cases are sorted in ascending order on three variables: V1, V2 and V4.
$RUN SORMER
$FILES
PRINT = SORT1.LST
DICTIN = \SURV\DICT.DIC input Dictionary file
SORTIN01 = DATA1.DAT input Data file 1
SORTIN02 = DATA2.DAT input Data file 2
SORTIN03 = DATA3.DAT input Data file 3
DICTOUT = \SURV\DATA123.DIC output Dictionary file
SORTOUT = \SURV\DATA123.DAT output Data file
$SETUP
MERGING THREE IDAMS DATA FILES: DATA1, DATA2 AND DATA3
MERG KEYVARS=(V1,V2,V4) OUTF=OUT
Example 2. Sorting a Data file in descending order on two fields: first field is 4 characters long, starting in
column 12; second field is 2 characters long, starting in column 3; a dictionary is not used.
$RUN SORMER
$FILES
SORTIN = RAW.DAT input Data file
SORTOUT = SORT.DAT output Data file
$SETUP
SORTING DATA FILE WITHOUT USING DICTIONARY
KEYLOC=(12,15,3,4) ORDER=D
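The effect of the KEYLOC specification in Example 2 can be mimicked in Python on a few fixed-width records. This is a sketch under assumptions: the function name sort_by_keyloc and the sample records are hypothetical, Python string ordering stands in for whatever collating sequence SORMER uses, and ORDER=D is assumed to apply to all sort fields.

```python
def sort_by_keyloc(records, keyloc, descending=False):
    """Sort fixed-width records on (start, end) column pairs, 1-based inclusive."""
    pairs = list(zip(keyloc[0::2], keyloc[1::2]))

    def key(record):
        # Columns s..e (1-based) correspond to the slice [s-1:e].
        return tuple(record[start - 1:end] for start, end in pairs)

    return sorted(records, key=key, reverse=descending)


def make_record(code, year):
    """Build a hypothetical record: code in columns 3-4, year in columns 12-15."""
    return "xx" + code + "-" * 7 + year + "zz"


records = [make_record("AA", "1999"), make_record("AA", "2001"),
           make_record("BB", "2001")]
for record in sort_by_keyloc(records, (12, 15, 3, 4), descending=True):
    print(record)
```

The first pair of locations is the major sort field, so the records are ordered on columns 12-15 first and on columns 3-4 within equal values of the first field.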
Chapter 20
Subsetting Datasets (SUBSET)
20.1 General Description
SUBSET subsets a Data file and corresponding IDAMS dictionary by case and/or by variable, or copies the
complete files.
Sort order check. The program has an option to check that the data cases are in ascending order, based
on a list of sort order variables (see the parameter SORTVARS). Adjacent cases with duplicate identification
are not considered out of order. However, there is an option to delete duplicate occurrences of any case.
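The sort order check can be sketched in Python as follows. The function name check_sort_order and the dictionary-based case layout are hypothetical, and how SUBSET itself reacts to an out-of-order case is not stated in this section, so the sketch only records where such cases occur.

```python
def check_sort_order(cases, sortvars, delete_duplicates=False):
    """Check ascending order on sortvars; optionally drop adjacent duplicates.

    Returns (kept cases, sequence numbers out of order, sequence numbers of
    duplicates). Sequence numbers are 1-based positions in the input.
    """
    kept, out_of_order, duplicates = [], [], []
    previous = None
    for seq, case in enumerate(cases, start=1):
        key = tuple(case[v] for v in sortvars)
        if previous is not None:
            if key < previous:
                out_of_order.append(seq)
            elif key == previous:
                # Adjacent duplicates are not out of order, only reported.
                duplicates.append(seq)
                if delete_duplicates:
                    continue
        kept.append(case)
        previous = key
    return kept, out_of_order, duplicates


cases = [{"V1": 1, "V2": 1}, {"V1": 1, "V2": 1},
         {"V1": 2, "V2": 0}, {"V1": 1, "V2": 5}]
kept, out_of_order, duplicates = check_sort_order(cases, ["V1", "V2"])
print(len(kept), out_of_order, duplicates)
```

Passing delete_duplicates=True keeps only the first of each run of adjacent cases with the same identification.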
20.2 Standard IDAMS Features
Case and variable selection. Case subsetting is accomplished by using a filter to select a particular set of
cases from the input dataset. Variable selection is done by defining a set of input variables to be transferred
to the output dataset. The variables may be output in any order, and may be transferred more than once,
provided that the output variable numbers are re-numbered.
Transforming data. Recode statements may not be used.
Treatment of missing data. SUBSET makes no distinction between substantive data and missing data
values; all data are treated the same.
20.3 Results
Output dictionary. (Optional: see the parameter PRINT).
Subsetting statistics. The output record length, the number of output dictionary records and the number
of output data records.
Old (input) versus new (output) variable numbers. (Optional: see the parameter PRINT). A chart
containing the input variable numbers and reference numbers, and the corresponding output variable numbers
and reference numbers.
Notification of duplicate cases. (Conditional: if the sort order of the file is being checked, all duplicate
cases are documented whether or not the parameter DUPLICATE=DELETE is specified). For each case
identification which appears more than once in the data, the number of duplicates, the sequential number of
the case, and the case identification are printed. In addition, the program prints the number of input data
records and the number of input data records deleted.
20.4 Output Dataset
The output is an IDAMS dataset constructed from the user-specified subset of cases and/or variables from
the input file. When all variables are copied, i.e. when OUTVARS is not specified, the output and input
data records have the same structure and the dictionary output is an exact copy of the input. Otherwise,
the dictionary information for the variables in the output file is assigned as follows:
Variable sequence and variable numbers. If VSTART is specified, variables are placed as they appear
in the OUTVARS list and they are numbered according to the VSTART parameter. If VSTART is not
specified, the output variables have the same numbers as input variables and they are sorted in ascending
order by variable number.
Variable locations. Variable locations are assigned contiguously according to the order of the variables in
the OUTVARS list (if VSTART is specified) or after sorting into variable number order (if VSTART is not
specified).
Variable type, width and number of decimals are the same as for input variables.
Reference numbers. As from input or modied according to REFNO parameter.
C-records. Codes and their labels are copied as they are in the input dictionary.
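The numbering and location rules above can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, variables are represented by their numbers, and locations are assumed to start at column 1.

```python
def output_variable_numbers(outvars, vstart=None):
    """Return (input_number, output_number) pairs in output order.

    With VSTART, the OUTVARS order is kept and variables are numbered from vstart.
    Without VSTART, input numbers are retained and sorted; repeats are not allowed.
    """
    if vstart is None:
        if len(set(outvars)) != len(outvars):
            raise ValueError("repeated variables in OUTVARS require VSTART")
        return [(v, v) for v in sorted(outvars)]
    return [(v, vstart + i) for i, v in enumerate(outvars)]


def assign_locations(widths):
    """Assign contiguous starting locations from field widths given in output order.

    Starting at column 1 is an assumption of this sketch.
    """
    locations, start = [], 1
    for width in widths:
        locations.append(start)
        start += width
    return locations
```

For instance, OUTVARS=(V5,V1,V5) with VSTART=10 would yield output variables 10, 11 and 12, whereas the same list without VSTART would be rejected because V5 is repeated.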
20.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. Numeric or alphabetic variables can be used.
20.6 Setup Structure
$RUN SUBSET
$FILES
File specifications
$SETUP
1. Filter (optional)
2. Label
3. Parameters
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output dictionary
DATAyyyy output data
PRINT results (default IDAMS.LST)
20.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V1=10,20,30 AND V2=1,5,7
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: SUBSET OF 1968 ELECTION, V1-V50
3. Parameters (mandatory). For selecting program options.
Example: SORT=(V1,V2), DUPLICATE=DELETE
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
SORTVARS=(variable list)
If the sort order of the file is to be checked, specify up to 20 variables which define the sort
sequence in major to minor order. Duplicates are considered as being in ascending order.
DUPLICATE=KEEP/DELETE
Deletion of duplicate cases (only applicable if SORTVARS is specified).
KEEP Output all occurrences of duplicate cases.
DELE Output only the first occurrence of duplicate cases, and print a message for the duplicate(s).
OUTVARS=(variable list)
Supply this list only if a subset of the variables in the input dataset is to be output. If VSTART
is not selected, then duplicates are not allowed. Otherwise, variables can be provided in any order
and repeated as needed.
Default: All variables are output.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
VSTART=n
The variables will be numbered sequentially, starting at n, in the output dataset.
Default: Input variable numbers are retained.
REFNO=OLDREF/VARNO
OLDR Retain the reference numbers in C- and T-records as in the input dictionary.
VARN Update the reference number field in C- and T-records to match the output variable
number.
PRINT=(OUTDICT/OUTCDICT/NOOUTDICT, VARNOS)
OUTD Print the output dictionary without C-records.
OUTC Print the output dictionary with C-records if any.
VARN Print a list of the old and new variable numbers and reference numbers.
20.8 Restrictions
1. The maximum number of sort variables that may be defined is 20.
2. The combined field widths of the sort variables must not exceed 200 characters.
20.9 Examples
Example 1. Constructing a subset of cases for selected variables; variables will be re-numbered starting at
1 and a table giving the old and new variable numbers will be printed.
$RUN SUBSET
$FILES
PRINT = SUBS1.LST
DICTIN = ABC.DIC input Dictionary file
DATAIN = ABC.DAT input Data file
DICTOUT = SUBS.DIC output Dictionary file
DATAOUT = SUBS.DAT output Data file
$SETUP
INCLUDE V5=2,4,5 AND V6=2301
SUBSETTING VARIABLES AND CASES
PRINT=VARNOS VSTART=1 -
OUTVARS=(V1-V5,V18,V43-V57,V114,V116)
Example 2. Using the SUBSET program to check for duplicate cases; cases are identified by variables in
columns 1-3 and 7-8; there is one record per case; the output dataset is not required and is not kept.
$RUN SUBSET
$FILES
DATAIN = DEMOG.DAT input Data file
$SETUP
CHECKING FOR DUPLICATE CASES
SORT=(V2,V4) PRIN=NOOUTDICT
$DICT
$PRINT
3 2 4 1 1
T 2 CASE FIRST ID VAR 1 3
T 4 CASE SECOND ID VAR 7 2
Chapter 21
Transforming Data (TRANS)
21.1 General Description
The TRANS program creates a new IDAMS dataset containing variables from an existing dataset and new
variables defined by Recode statements. It is the way to save recoded variables.
TRANS has a print option and so it can also be used for testing Recode statements on a small number of
cases before executing an analysis program or before saving the complete file.
21.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of the cases from the input
data. Variable selection is accomplished through the parameter OUTVARS.
Transforming data. Recode statements may be used.
Treatment of missing data. Appropriate missing data codes are written to the output dictionary; these
are normally copied from the input dictionary but can also be overridden or supplied for output variables
through the Recode statement MDCODES. No missing data checks are made on data values except through
the use of Recode statements.
21.3 Results
Output dictionary. (Optional: see the parameter PRINT).
Output data. (Optional: see the parameter PRINT). Values for all cases for each V- or R-variable are
given, 10 variable values per line. For alphabetic variables, only the first 10 characters are printed.
21.4 Output Dataset
The output is an IDAMS dataset which contains only those variables (V and R) specified in the OUTVARS
parameter. The dictionary information for the variables in the output file is assigned as follows:
Variable sequence and variable numbers. If VSTART is specified, variables are placed as they appear
in the OUTVARS list and they are numbered according to the VSTART parameter. If VSTART is not
specified, the output variables have the same numbers as in the OUTVARS list and they are sorted in
ascending order by variable number.
Variable names and missing data codes. Taken from the input dictionary (V-variables only) or from
Recode NAME and MDCODES statements, if any.
Variable locations. Variable locations are assigned contiguously according to the order of the variables in
the OUTVARS list (if VSTART is specified) or after sorting into variable number order (if VSTART is not
specified).
Variable type, width and number of decimals.
V-variables: Type, field width and number of decimals are the same as their input values.
R-variables: Type for R-variables is always numeric; width and number of decimals are assigned according
to the values specified for parameters WIDTH (default 9) and DEC (default 0), or according to the
values provided for individual variables on dictionary specifications.
Reference numbers and study ID. The reference number and study ID for a V-variable are the same as
their input values. For R-variables, the reference number is left blank and the study ID is always REC.
C-records. C-records cannot be created for R-variables. C-records (if any) for all V-variables are copied
to the output dictionary. Note that if a V-variable is recoded during the TRANS execution, the C-records
that are output may no longer apply to the new version of the variable.
21.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. Numeric or alphabetic variables can be used.
21.6 Setup Structure
$RUN TRANS
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Dictionary specifications (optional)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output dictionary
DATAyyyy output data
PRINT results (default IDAMS.LST)
21.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-4 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: EXCLUDE V19=2-3
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: CONSTRUCTING VIOLENCE INDICATORS
3. Parameters (mandatory). For selecting program options.
Example: VSTART=1, WIDTH=2 OUTVARS=(V2-V5,R7)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric input data values and insufficient field width output values. See
The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MAXERR=0/n
The maximum number of insufficient field width errors allowed before execution stops. These
errors occur when the value of a variable is too big to fit into the field assigned, e.g. a value of
250 when WIDTH=2 has been specified. See Data in IDAMS chapter.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
OUTVARS=(variable list)
V- and R-variables which are to be output. The order of the variables in the list is significant
only if the parameter VSTART is specified. If VSTART is not specified, all V- and R-variable
numbers must be unique.
No default.
VSTART=n
The variables will be numbered sequentially, starting at n, in the output dataset.
Default: Input variable numbers are retained.
WIDTH=9/n
The default output variable field width to be used for R-variables. This default may be overridden
for specific variables with the dictionary specification WIDTH. To change the field width of a
numeric V-variable, create an equivalent R-variable (see Example 1).
DEC=0/n
Number of decimal places to be retained for R-variables.
PRINT=(OUTDICT/OUTCDICT/NOOUTDICT, DATA)
OUTD Print the output dictionary without C-records.
OUTC Print the output dictionary with C-records if any.
DATA Print the values of the output variables.
4. Dictionary specifications (optional). For any particular set of variables, the field width and number
of decimals may be specified. These specifications will override the values set by the main parameters
WIDTH and DEC. Note that missing data codes and variable names are assigned by the Recode statements
MDCODES and NAME respectively. Warning: The MDCODES statement retains only 2 decimal
places for R-variables, rounding up the values accordingly.
The coding rules are the same as for parameters. Each dictionary specification must begin on a new
line.
Examples: VARS=R4, WIDTH=4, DEC=1
VARS=R8, WIDTH=2
VARS=(R100-R109), WIDTH=1
VARS=(variable list)
The R-variables to which the WIDTH and DEC parameters apply.
WIDTH=n
Field width for the output variables.
Default: Value given for WIDTH parameter.
DEC=n
Number of decimal places.
Default: Value given for DEC parameter.
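The interplay between the main parameters WIDTH and DEC, the dictionary specifications and the insufficient field width errors counted by MAXERR can be sketched in Python. This is an illustration, not the TRANS implementation: the function names are hypothetical, and the sketch assumes that decimals are implied (no decimal point is stored) and that a minus sign occupies one character of the field.

```python
def resolve_format(rvar, specs, width=9, dec=0):
    """Return (width, dec) for an R-variable.

    A dictionary specification for the variable overrides the main
    parameters WIDTH (default 9) and DEC (default 0).
    """
    spec = specs.get(rvar, {})
    return spec.get("WIDTH", width), spec.get("DEC", dec)


def fits(value, width, dec=0):
    """True if value, stored with dec implied decimals, fits in width characters.

    Implied decimals and a one-character sign are assumptions of this sketch.
    """
    digits = str(abs(round(value * 10 ** dec)))
    sign = 1 if value < 0 else 0
    return len(digits) + sign <= width
```

Under these assumptions a value of 250 does not fit when WIDTH=2, while 99.9 fits a variable declared with WIDTH=3 and DEC=1.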
21.8 Restrictions
1. The maximum number of R-variables that can be output is 250.
2. The maximum number of variables that can be used in the execution (including variables used only in
Recode statements) is 500.
3. The maximum number of dictionary specifications is 200.
21.9 Examples
Example 1. Selected variables from the input dataset are transferred to the output le along with the 2
new variables; variable numbers are not changed; the field width of input variable V20 is changed to 4.
$RUN TRANS
$FILES
PRINT = TRANS1.LST
DICTIN = OLD.DIC input Dictionary file
DATAIN = OLD.DAT input Data file
DICTOUT = NEW.DIC output Dictionary file
DATAOUT = NEW.DAT output Data file
$SETUP
CONSTRUCTING TWO NEW VARIABLES
PRINT=NOOUTDICT OUTVARS=(V1-V19,R20,V33,V45-V50,R105,R122)
VARS=R105,WIDTH=1
VARS=R122,WIDTH=3,DEC=1
VARS=R20,WIDTH=4
$RECODE
R20=V20
NAME R20VARIABLE 20
R105=BRAC(V5,15-25=1,<36=2,<46=3,<56=4,<66=5,<90=6,ELSE=9)
MDCODES R105(9)
NAME R105GROUPS OF AGE
IF MDATA(V22) THEN R122=99.9 ELSE R122=V22/3
MDCODES R122(99.9)
NAME R122NO ARTICLES PER YEAR
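For readers unfamiliar with the Recode language, the two computed variables of Example 1 can be paraphrased in Python. This is a rough analogue only: the function names are hypothetical, and it assumes that BRAC tests its rules from left to right and returns the code of the first rule that matches, which should be checked against the Recode facility chapter.

```python
def brac_age(v5):
    """Analogue of R105=BRAC(V5,15-25=1,<36=2,<46=3,<56=4,<66=5,<90=6,ELSE=9).

    Assumes rules are tried left to right and the first match wins.
    """
    if 15 <= v5 <= 25:
        return 1
    for upper, code in [(36, 2), (46, 3), (56, 4), (66, 5), (90, 6)]:
        if v5 < upper:
            return code
    return 9


def articles_per_year(v22, missing_codes):
    """Analogue of IF MDATA(V22) THEN R122=99.9 ELSE R122=V22/3."""
    return 99.9 if v22 in missing_codes else v22 / 3
```

Note that under the first-match assumption a value below 15 falls into the "<36" bracket and receives code 2.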
Example 2. This example shows the use of TRANS to check Recode statements; data values for the ID
variables (V1, V2), the variables being used in the recodes and the result variables are listed for the first 30
cases; the output dataset is not required and is not defined.
$RUN TRANS
$FILES
PRINT = TRANS2.LST
DICTIN = STUDY.DIC input Dictionary file
DATAIN = STUDY.DAT input Data file
$SETUP
CHECKING RECODES
WIDTH=2 PRINT=(DATA,NOOUTDICT) MAXCASES=30 -
OUTVARS=(V1-V2,V71-V74,V118,V12,V13,R901-R903)
$RECODE
R901=BRAC(V118,1-16=2,17=1,18-23=3,24=1,25-35=3,36=1,37=2,ELSE=9)
IF NOT MDATA(V12,V13) THEN R902=TRUNC(V12/V13) ELSE R902=99
R903=COUNT(1,V71-V74)
Example 3. Creating a test file of data with a random 1/20 sample of the data file; there is no need to save
the output dictionary as it will be identical to the input.
$RUN TRANS
$FILES
DICTIN = STUDY.DIC input Dictionary file
DATAIN = STUDY.DAT input Data file
DATAOUT = TESTDATA output Data file
$SETUP
CREATING TEST FILE WITH ALL VARIABLES AND 1/20 SAMPLE OF CASES
PRINT=NOOUTDICT OUTVARS=(V1-V505)
$RECODE
IF RAND(0,20) NE 1 THEN REJECT
Part IV
Data Analysis Facilities
Chapter 22
Cluster Analysis (CLUSFIND)
22.1 General Description
CLUSFIND performs cluster analysis by partitioning a set of objects (cases or variables) into a set of clusters
as determined by one of six algorithms: two algorithms based on partitioning around medoids, one based on
fuzzy clustering and three based on hierarchical clustering.
22.2 Standard IDAMS Features
Case and variable selection. If raw data are input, the standard filter is available to select a subset of
cases from the input data. The variables for analysis are specified in the parameter VARS.
Transforming data. If raw data are input, Recode statements may be used.
Weighting data. Use of weight variables is not applicable.
Treatment of missing data. If raw data are input, the MDVALUES parameter is available to indicate
which missing data values, if any, are to be used to check for missing data. The cases in which missing data
occur in all variables are deleted automatically. Otherwise, missing data are suppressed by pairs. If the
data are standardized, the average and the mean absolute deviation are calculated using only valid values.
When calculating the distances, only those variables are considered in the sum for which valid values are
present for both objects.
If a matrix is input, the MDMATRIX parameter is available to indicate which value should be used to check
for invalid matrix elements.
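The treatment of missing data for raw data input can be sketched as follows. The sketch is illustrative Python, not CLUSFIND code: the function names are hypothetical, missing values are represented by None, and a variable is assumed not to be constant (otherwise the mean absolute deviation is zero). The manual does not say whether the sum is rescaled to compensate for omitted variables, so no rescaling is done here.

```python
import math


def standardize(column):
    """Standardize one variable using the average and mean absolute deviation
    computed from its valid (non-None) values only."""
    valid = [x for x in column if x is not None]
    mean = sum(valid) / len(valid)
    mad = sum(abs(x - mean) for x in valid) / len(valid)
    return [None if x is None else (x - mean) / mad for x in column]


def distance(a, b, dtype="EUCLIDEAN"):
    """Distance between two objects, summing only over variables
    that have valid values for both objects."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None  # no variable is valid for both objects
    if dtype == "CITY":
        return sum(abs(x - y) for x, y in pairs)
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))
```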
22.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Input data after standardization. (Optional: see the parameter PRINT). Standardized values for all
cases for each V- or R-variable used in analysis, preceded by the average and the mean absolute deviation
for those variables.
Dissimilarity matrix. (Optional: see the parameter PRINT). The lower-left triangle of the matrix, as
input or computed by the program.
PAM analysis results. For each number of clusters in turn (going from CMIN to CMAX) the following
is printed:
number of representative objects (clusters) and the final average distance,
for each cluster: representative object ID, number of objects and the list of objects belonging to this
cluster,
coordinates of medoids (values of analysis variables for each representative object; for input dataset
only),
clustering vector (vector of numbers corresponding to the objects indicating to which cluster each
object belongs) and clustering characteristics,
graphical representation of results, i.e. a plot of silhouette for each cluster (optional - see the parameter
PRINT).
FANNY analysis results. For each number of clusters in turn (going from CMIN to CMAX) the following
is printed:
number of clusters,
objective function value at each iteration,
for each object, its ID and the membership coefficient for each cluster,
partition coefficient of Dunn and its normalized version,
closest hard clustering, i.e. number of objects and the list of objects belonging to each cluster,
clustering vector,
graphical representation of results, i.e. a plot of silhouette for each cluster (optional - see the parameter
PRINT).
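The manual does not give the formula for Dunn's partition coefficient. The sketch below uses the usual textbook definition (the mean over objects of the sum of squared membership coefficients, with the normalized version rescaled to the interval 0 to 1); treat the assumption that CLUSFIND uses exactly this definition, and the function name, as unverified.

```python
def dunn_partition_coefficient(memberships):
    """Return (F, normalized F) for a fuzzy clustering.

    memberships: one row per object, one membership coefficient per cluster;
    each row is assumed to sum to 1 and there are at least 2 clusters.
    """
    n = len(memberships)
    k = len(memberships[0])
    f = sum(u * u for row in memberships for u in row) / n
    normalized = (k * f - 1) / (k - 1)
    return f, normalized
```

With this definition F ranges from 1/k for a completely fuzzy clustering to 1 for a hard clustering, and the normalized version ranges from 0 to 1.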
CLARA analysis results. For the number of clusters tried the following is printed:
list of objects selected in the sample retained,
clustering vector,
for each cluster: representative object ID, number of objects and the list of objects belonging to this
cluster,
average and maximum distances to each medoid,
graphical representation of results, i.e. a plot of silhouette for each cluster belonging to the selected
sample (optional - see the parameter PRINT).
AGNES analysis results contain the following:
final ordering of objects (identified by their ID) and dissimilarities between them,
graphical representation of results, i.e. a plot of dissimilarity banner (optional - see the parameter
PRINT).
DIANA analysis results contain the following:
final ordering of objects (identified by their ID) and diameters of the clusters,
graphical representation of results, i.e. a plot of dissimilarity banner (optional - see the parameter
PRINT).
MONA analysis results contain the following:
trace of splits (optional - see the parameter PRINT) with, for each step, the cluster to be separated,
the list of objects (identified by their ID variable values) in each of the two subsets and the variable
used for the separation,
the final ordering of objects,
graphical representation of results, i.e. a separation plot with the list of objects in each cluster and
the variable used for the separation (optional - see the parameter PRINT).
22.4 Input Dataset
The input dataset is a Data file described by an IDAMS dictionary. All variables used for analysis must be
numeric; they may be integer or decimal valued. The case ID variable can be alphabetic. Variables used
in PAM, CLARA, FANNY, AGNES or DIANA analysis should be interval scaled. Variables used in the
MONA analysis should be binary (with 0 or 1 values). Note that CLUSFIND uses at most 8 characters of
the variable name as provided in the dictionary.
22.5 Input Matrix
This is an IDAMS square matrix. See Data in IDAMS chapter. It can contain measures of similarities,
dissimilarities or correlation coefficients. Note that CLUSFIND uses at most 8 characters of the object name
as provided on variable identification records.
22.6 Setup Structure
$RUN CLUSFIND
$FILES
File specifications
$RECODE (optional with raw data input; unavailable with matrix input)
Recode statements
$SETUP
1. Filter (optional; for raw data input only)
2. Label
3. Parameters
$DICT (conditional)
Dictionary for raw data input
$DATA (conditional)
Data for raw data input
$MATRIX (conditional)
Matrix for matrix input
Files:
FT09 input matrix (if $MATRIX not used and a matrix input)
DICTxxxx input dictionary (if $DICT not used and INPUT=RAWDATA)
DATAxxxx input data (if $DATA not used and INPUT=RAWDATA)
PRINT results (default IDAMS.LST)
22.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution. Available only with raw data
input.
Example: INCLUDE V8=5-10
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: PARTITION AROUND MEDOIDS
3. Parameters (mandatory). For selecting program options.
Example: ANALYSIS=PAM VARS=(V7-V12) IDVAR=V1
INPUT=RAWDATA/SIMILARITIES/DISSIMILARITIES/CORRELATIONS
RAWD Input: Data file described by an IDAMS dictionary.
SIMI Input: measures of similarities in the form of an IDAMS square matrix.
DISS Input: measures of dissimilarities in the form of an IDAMS square matrix.
CORR Input: correlation coefficients in the form of an IDAMS square matrix.
Parameters only for raw data input
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=100/n
The maximum number of cases (after filtering) to be used from the input file.
Its value depends on the memory available.
n=0 No execution, only verification of parameters.
0<n<=100 Normal execution.
n>100 Only CLARA analysis allowed.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
STANDARDIZE
Standardize the variables before computing dissimilarities.
DTYPE=EUCLIDEAN/CITY
Type of distance to be used for computing dissimilarities.
EUCL Euclidean distance.
CITY City block distance.
IDVAR=variable number
Variable to be printed as case ID. Only 3 characters are used on the results. Thus, integer variables
must have values smaller than 1000. Only the first three characters of an alphabetic variable are
printed.
No default.
PRINT=(CDICT/DICT, STAND)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
STAN Print the input data after standardization.
Parameters only for matrix input
DISSIMILARITIES=ABSOLUTE/SIGN
For INPUT=CORR, specifies how the dissimilarity matrix should be computed.
ABSO Consider absolute values of correlation coefficients as similarity measures.
SIGN Use correlation coefficients with their signs.
MDMATRIX=n
Treat matrix elements equal to n as missing data.
Default: All values are valid.
PRINT=MATRIX
Print the input matrix.
Parameters for both types of input
VARS=(variable list)
The variables to be used in this analysis.
No default.
ANALYSIS=PAM/FANNY/CLARA/AGNES/DIANA/MONA
Specifies the type of analysis to be performed.
PAM Partition around medoids.
FANN Partition with fuzzy clustering.
CLAR Partition around medoids (same as PAM), but for datasets of at least 100 cases. CLUSFIND
will sample the cases and choose the best representative sample. Five samples
of 40+2*CMAX cases are drawn (see CMAX parameter below).
Only for raw data input.
AGNE Agglomerative hierarchical clustering.
DIAN Divisive hierarchical clustering.
MONA Monothetic clustering of data consisting of binary variables. Requires at least 3
variables.
Only for raw data input.
No default.
CMIN=2/n
For PAM and FANNY. The minimum number of clusters to try.
CMAX=n
For PAM and FANNY, the maximum number of clusters to try.
For CLARA, the exact number of clusters to try.
Default: The larger of 20 and the value specified for CMIN.
PRINT=(DISSIMILARITIES, GRAPH, TRACE, VNAMES)
DISS Print the dissimilarity matrix.
GRAP Print the graphical representation of the results.
TRAC Print each step of the binary split when MONA is specied.
VNAM For matrix input, print the first 3 or 8 characters of variable names instead of variable
numbers as object identification.
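The numeric rules scattered through the parameter descriptions above (the CLARA sample size, the CMAX default and the case limits tied to MAXCASES) can be collected in a short Python sketch. The function names are hypothetical and the sketch covers raw data input only; it restates the rules given in this chapter rather than reproducing CLUSFIND's internal checks.

```python
ALL_ANALYSES = ["PAM", "FANNY", "CLARA", "AGNES", "DIANA", "MONA"]


def clara_sample_size(cmax):
    """Size of each of the five samples drawn by CLARA: 40 + 2*CMAX."""
    return 40 + 2 * cmax


def default_cmax(cmin=2):
    """Default CMAX: the larger of 20 and the value of CMIN."""
    return max(20, cmin)


def allowed_analyses(n_cases):
    """Analyses permitted for a given number of cases (raw data input).

    More than 100 cases: only CLARA. Fewer than 100 cases: everything
    except CLARA, which needs at least 100 cases.
    """
    if n_cases > 100:
        return ["CLARA"]
    if n_cases == 100:
        return list(ALL_ANALYSES)
    return [a for a in ALL_ANALYSES if a != "CLARA"]
```

For example, a CLARA run with CMAX=5 would draw five samples of 50 cases each.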
22.8 Restrictions
1. The maximum number of cases which can be used in an analysis (except CLARA) is 100.
2. The minimum number of cases required for CLARA analysis is 100.
3. The maximum number of objects in an input matrix is 100.
4. Only 3 characters of the ID variable are used on the results.
22.9 Examples
Example 1. Clustering the first 100 cases into 5 groups using 6 quantitative variables V11-V16; variable
values are standardized and Euclidean distance is used in calculations; clustering is done as partitioning
around medoids; printing of graphics is requested; cases are identified by variable V2.
$RUN CLUSFIND
$FILES
PRINT = CLUS1.LST
DICTIN = MY.DIC input Dictionary file
DATAIN = MY.DAT input Data file
$SETUP
PAM ANALYSIS USING RAW DATA AS INPUT
BADD=MD1 VARS=(V11-V16) STAND IDVAR=V2 CMIN=5 CMAX=5 PRINT=GRAP
Example 2. Agglomerative hierarchical clustering of 30 towns; the input matrix contains distances between
the towns and the towns are numbered from 1 to 30; printing of graphics is requested; town names are used
on the results.
$RUN CLUSFIND
$FILES
PRINT = CLUS2.LST
FT09 = TOWNS.MAT input Matrix file
$SETUP
AGNES ANALYSIS USING MATRIX OF DISTANCES AS INPUT
$COMMENT ACTUAL DISTANCES WERE DIVIDED BY 10,000 TO BE IN THE INTERVAL 0-1
INPUT=DISS VARS=(V1-V30) ANAL=AGNES PRINT=(GRAP,VNAMES)
Chapter 23
Configuration Analysis (CONFIG)
23.1 General Description
CONFIG performs analysis on a single spatial configuration input in the form of an IDAMS rectangular
matrix (as output for example by MDSCAL). It has the capability of centering, norming, rotating, translating
dimensions, computing interpoint distances and computing scalar products.
Each row of a configuration matrix provides the coordinates of one point of the configuration. Thus the
number of rows equals the number of points (variables), while the number of columns equals the number of
dimensions.
CONFIG can provide output which allows the user to compare more easily configurations which originally
had dissimilar orientations. It can also be used to perform further analysis on a configuration. Rotation,
for example, may make a configuration more easily interpreted.
23.2 Standard IDAMS Features
Case and variable selection. Selecting a subset of the cases is not applicable and a filter is not available.
Nor is there an option within CONFIG to subset the input configuration. An option for selection of one
matrix from a file containing multiple matrices is available within CONFIG (see the parameter DSEQ).
Transforming data. Use of Recode statements is not applicable in CONFIG.
Weighting data. Use of weight variables is not applicable.
Treatment of missing data. CONFIG does not recognize missing data in the input configuration. Ordinarily
this presents no problem, as configurations are usually complete.
23.3 Results
Input matrix dictionary. (Conditional: only if the input matrix contained a dictionary. See the parameter
MATRIX). Input variable dictionary records with corresponding numbers used on plots (plot labels).
Input configuration. A printed copy of the input configuration.
Centered configuration. (Optional: see the parameter PRINT). If PRINT=ALL or PRINT=CENT is
specified and the input configuration is already centered, the message "Input configuration is centered" is
printed.
Normalized configuration. (Optional: see the parameter PRINT). If PRINT=ALL or PRINT=NORM
is specified and the input configuration is already normalized, the message "Configuration is normalized" is
printed.
Solution with principal axes. (Optional: see the parameter PRINT). The rows of the matrix are the
points and the columns are the principal axes. The elements in the matrix are the projections of the points
on the axes.
Scalar products. (Optional: see the parameter PRINT). The lower-left half of the symmetric matrix is
printed. Each element of the matrix is the scalar product for a pair of points (variables).
Inter-point distances. (Optional: see the parameter PRINT). The lower-left half of the symmetric matrix
is printed. Each element in the matrix is the distance between a pair of points (variables). The diagonal,
always all zeros, is printed.
Transformed configuration(s). (Optional: see the transformation specification parameter PRINT). The
transformed configuration is printed after the rotation/translation.
Plot of the transformed configuration(s). (Optional: see the transformation specification parameter
PRINT). The transformed configuration is plotted 2 axes at a time after the rotation/translation. The points
are numbered.
Varimax rotation history. (Optional: see the parameter PRINT). A vector is printed which contains
the variance of the configuration matrix before each iteration cycle. This is followed by the configuration
matrix after rotation to maximize the normal varimax criterion. It will have the same number of rows and
columns as the input configuration matrix.
Sorted configuration. (Optional: see the parameter PRINT). Each column of the configuration matrix,
after being ordered, is printed horizontally across the page.
Vector plots. (Optional: see the parameter PRINT). The final configuration is plotted two axes at a time.
The points are numbered using the plot labels for the variables as printed with the input configuration
dictionary.
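The centering, normalizing, scalar product and distance computations listed above can be sketched in Python for a small configuration held as a list of rows. This is an illustration under assumptions, not the CONFIG implementation: the function names are hypothetical, the distances are assumed to be Euclidean, and normalization follows the description of the NORM option (the sum of squared elements is made equal to the number of points).

```python
import math


def center(config):
    """Shift the origin to the centroid by subtracting each column mean."""
    n, m = len(config), len(config[0])
    means = [sum(row[j] for row in config) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in config]


def normalize(config):
    """Rescale so that the sum of squared elements equals the number of points."""
    n = len(config)
    total = sum(x * x for row in config for x in row)
    scale = math.sqrt(n / total)
    return [[x * scale for x in row] for row in config]


def scalar_products(config):
    """Lower-left half (with diagonal) of the matrix of scalar products."""
    return [[sum(a * b for a, b in zip(config[i], config[j]))
             for j in range(i + 1)] for i in range(len(config))]


def distances(config):
    """Lower-left half (with the zero diagonal) of the inter-point distances."""
    return [[math.dist(config[i], config[j])
             for j in range(i + 1)] for i in range(len(config))]
```

Because the effects of the analysis options are cumulative, a typical use would be `distances(normalize(center(config)))`.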
23.4 Output Configuration Matrix
The final configuration may be written to a file (see the parameter WRITE). It is output as an IDAMS
rectangular matrix. See Data in IDAMS chapter for a description of IDAMS matrices. Variable identification
records are output only if such records are included in the input configuration file (see the parameter
MATRIX). The format for the matrix elements is 10F7.3. The records containing the matrix elements are
identified by CFG in columns 73-75 and a sequence number in columns 76-80. The dimensions of the matrix
will be the same as the dimensions of the input matrix.
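The record layout just described (ten values per record in F7.3, CFG in columns 73-75, a sequence number in columns 76-80) can be mimicked in Python. The sketch is illustrative only: the function name is hypothetical, the sequence number is assumed to be right-justified and blank-padded, and the values are taken as one flat list without regard to where matrix rows begin.

```python
def format_matrix_records(values, start_seq=1):
    """Format values ten per 80-column record in F7.3 layout.

    Columns 1-70 hold the values, 71-72 are blank, 73-75 hold CFG
    and 76-80 hold the sequence number (right-justified: an assumption).
    """
    records = []
    for offset in range(0, len(values), 10):
        chunk = values[offset:offset + 10]
        body = "".join(f"{v:7.3f}" for v in chunk)
        seq = start_seq + offset // 10
        records.append(f"{body:<72}CFG{seq:5d}")
    return records
```

Values that need more than 7 characters in F7.3 (for example 1000.0 or -100.0) would overflow their field; the sketch does not guard against this.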
23.5 Output Distance Matrix
The inter-point distance matrix may be written to a file (see the parameter WRITE). This is output in
the form of an IDAMS square matrix with dummy records supplied for the means and standard deviations
expected in such a matrix. Variable identification records are output only if these are included in the input
configuration file (see the parameter MATRIX). The format of the matrix elements is 10F7.3. The records
containing the matrix elements are identified by CFG in columns 73-75 and a sequence number in columns
76-80.
23.6 Input Configuration Matrix
The input matrix must be in the form of an IDAMS rectangular matrix, either with or without variable
identification records (see the parameter MATRIX). See Data in IDAMS chapter for a description of the
format.
Configuration matrices obtained from the MDSCAL program can be input directly to CONFIG.
The n (rows) by m (columns) input matrix should contain the coordinates of n points for m dimensions. There
may be no missing data in the input matrix.
More than one configuration can exist in a file being input to CONFIG. The one to be analyzed is selected
using the parameter DSEQ.
23.7 Setup Structure
$RUN CONFIG
$FILES
File specifications
$SETUP
1. Label
2. Parameters
3. Transformation specifications (conditional)
$MATRIX (conditional)
Matrix
Files:
FT02 output configuration and/or distance matrix
FT09 input configuration (omit if $MATRIX used)
PRINT results (default IDAMS.LST)
23.8 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Label (mandatory). One line containing up to 80 characters to label the results.
Example: CONFIG EXECUTED AFTER MDSCAL
2. Parameters (mandatory). For selecting program options.
Example: PRINT=(CENT,SORT,DIST) TRANS
MATRIX=STANDARD/NONSTANDARD
STAN Variable identification records are included in the input configuration matrix.
NONS Variable identification records are not included.
DSEQ=1/n
The sequence number on the input file of the configuration which is to be analyzed.
WRITE=(CONFIG,DISTANCES)
CONF Output the final configuration to a file.
DIST Output the matrix of inter-point distances to a file.
TRANSFORM
Transformation specifications will be provided.
PRINT=(CENTER, NORMALIZE, PRINAXIS, SCALARS, DISTANCES, VARIMAX, SORTED,
PLOT, ALL)
CENT Shift origin to centroid of space.
NORM Alter size of the space so sum of squared elements of the matrix equals the number of
variables.
PRIN Look for principal axes.
SCAL Matrix of scalar products.
DIST Matrix of inter-point distances.
VARI Orthogonal (varimax) rotation (after transformation if any).
SORT Sorted configuration (after transformation if any).
PLOT Plot the final configuration.
ALL Print CENT, NORM, PRIN, SCAL, DIST, VARI, SORT, PLOT.
Default: Input configuration is printed.
Note. Analysis options are performed on the input configuration in the sequence specified above,
regardless of the order in which they are specified with the PRINT parameter. Transformations, if
any, are performed just before orthogonal rotation of the configuration. After each operation, the
results are printed. The effects of the analysis options are cumulative. If the final configuration is
plotted and/or saved, this is done after all the analyses have been performed.
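As an illustration of the CENT, NORM and DIST options, the following Python sketch applies them to a small configuration. It is not CONFIG's own code: the list-of-rows layout (one row per variable, one column per dimension) and the function names center, normalize and distances are assumptions made for this example.

```python
import math

def center(config):
    """CENT: shift the origin to the centroid by subtracting each column mean."""
    n = len(config)
    dims = len(config[0])
    means = [sum(row[d] for row in config) / n for d in range(dims)]
    return [[row[d] - means[d] for d in range(dims)] for row in config]

def normalize(config):
    """NORM: rescale so the sum of squared elements equals the number of rows."""
    n = len(config)
    total = sum(x * x for row in config for x in row)
    scale = math.sqrt(n / total)
    return [[x * scale for x in row] for row in config]

def distances(config):
    """DIST: matrix of Euclidean inter-point distances."""
    return [[math.dist(a, b) for b in config] for a in config]

# Hypothetical 4-point, 2-dimensional configuration.
square = [[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]]
centered = center(square)
scaled = normalize(centered)
```

Because the effects of the options are cumulative, normalize(center(config)) corresponds to requesting CENT and NORM together.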
3. Transformation specifications. (Conditional: if TRANSFORM was specified, use parameters as
specified below). As many transformations as desired may be specified; each one must start on a new
line.
If the user specifies the angle of rotation (DEGREES) and two dimensions (DIMENSION), rotation
is performed. If a constant (ADD) and one dimension (DIMENSION) are specified, translation is
performed.
Example: DEGR=45, DIME=(5,8) PRINT=PLOT
PRINT=(CONFIG, PLOT)
CONF Print the translated or rotated configuration (automatic for configurations with 2 dimensions
and for the final configuration).
PLOT Plot the translated or rotated configuration.
Note: There will be no printed output for the transformation if PRINT is not specified. It must
be specified for each transformation.
Rotation parameters
DIMENSION=(n, m)
The two dimensions to be rotated (only pairwise rotation).
DEGREES=n
Angle of rotation in degrees (only orthogonal rotation).
Translation parameters
DIMENSION=n
The one dimension to be translated.
ADD=n
Value to be added to each coordinate for the specified dimension (may be negative and have
decimal places).
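The two kinds of transformation can be sketched in Python as follows. This is an illustration only: the function names rotate and translate are hypothetical, dimensions are numbered from 1 as in the DIMENSION parameter, and the counter-clockwise direction of rotation is an assumption, since the sign convention used by CONFIG is not stated here.

```python
import math

def rotate(config, dims, degrees):
    """Rotate the two 1-based dimensions in `dims` by `degrees`.

    Counter-clockwise rotation is assumed; all other dimensions are unchanged.
    """
    i, j = dims[0] - 1, dims[1] - 1
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    out = []
    for row in config:
        new = list(row)
        new[i] = c * row[i] - s * row[j]
        new[j] = s * row[i] + c * row[j]
        out.append(new)
    return out

def translate(config, dim, add):
    """Add the constant `add` to every coordinate of the 1-based dimension `dim`."""
    i = dim - 1
    return [[x + add if k == i else x for k, x in enumerate(row)] for row in config]

# Counterparts of "DEGR=60 DIME=(1,2)" followed by "ADD=6 DIME=1".
points = [[1.0, 0.0, 5.0], [0.0, 2.0, 5.0]]
moved = translate(rotate(points, (1, 2), 60), 1, 6)
```

A rotation of this kind leaves the inter-point distances unchanged, which is why it can be applied freely before plotting.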
23.9 Restrictions
The maximum size of the input configuration matrix is 60 rows x 10 columns.
23.10 Examples
Example 1. Rotation and transformation of a configuration matrix previously created by the MDSCAL
program; the final configuration is written into a file and plotted; dimensions 1 and 2 are to be rotated by
60 degrees; dimension 1 is to be transformed by adding 6.
$RUN CONFIG
$FILES
PRINT = CONF1.LST
FT02 = CONFIG.MAT output file for configuration matrix
FT09 = MDS.MAT input configuration matrix
$SETUP
CONFIGURATION ANALYSIS
PRINT=(PLOT,VARI) TRAN WRITE=CONF
DEGR=60 DIME=(1,2) PRINT=PLOT
ADD=6 DIME=1 PRINT=PLOT
Example 2. Computation of the matrix of scalar products and the matrix of inter-point distances for the
4th configuration from the input file; no plots are requested.
$RUN CONFIG
$FILES
PRINT = CONF2.LST
FT02 = SCAL.MAT output file for scalar products and distances
FT09 = MDS.MAT input configuration matrix
$SETUP
CONFIGURATION ANALYSIS
PRINT=(SCAL,DIST) DSEQ=4
Chapter 24
Discriminant Analysis (DISCRAN)
24.1 General Description
The task of discriminant analysis is to find the best linear discriminant function(s) of a set of variables which
reproduce(s), as far as it is possible, an a priori grouping of the cases considered.
A stepwise procedure is used in this program, i.e. in each step the most powerful variable is entered into
the discriminant function. The criterion function for selecting the next variable depends on the number of
groups specified (number of groups varies between 2 and 20). In the case of two groups the Mahalanobis
distance is used. When the number of groups is greater than 2 then the variable selection criterion is the
trace of a product of the covariance matrix for the variables involved and the inter-class covariance matrix
at a particular step. This is a generalization of Mahalanobis distance defined for two groups.
Besides executing the main discriminant analysis steps on a basic sample there are two optional possibilities:
checking the power of the discriminant function(s) with the help of a test sample, in which the group
assignment of the cases is known (as in the basic sample) but which cases were not used in the analysis, and
classifying the cases with the help of discriminant function(s) provided by the analysis in an anonymous
sample where the group assignment of the cases is unknown, or at least is not used.
24.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. A further subsetting is possible with the use of the sample and group variables. Analysis variables are
selected with the VARS parameter.
Transforming data. Recode statements may be used.
Weighting data. A variable can be used to weight the input data; this weight variable may have integer or
decimal values. When the value of the weight variable for a case is zero, negative, missing or non-numeric,
then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. Cases with missing data in the sample variable, the
group variable and/or the analysis variables can be optionally excluded from the analysis.
24.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Number of cases in samples. The number of cases in the basic, test and anonymous samples according
to the sample definition parameters.
Revised number of cases in samples. The number of cases in the basic, test and anonymous samples
revised according to the sample and group definition parameters. Note that the revised figures may be
smaller than the non-revised ones for the basic and the test samples if the groups defined do not cover
the samples completely.
Basic sample. (Optional: see the parameter PRINT). The identification and the analysis variables of the
cases in the basic sample are printed by groups, while the groups are separated from each other by a line of
asterisks.
Test sample. As for basic sample.
Anonymous sample. As for basic sample except that there are no groups.
Univariate statistics. For each variable used in the analysis the program prints the group means and
standard deviations as well as the total mean.
Stepwise procedure results (for each step)
Step number. The sequence number of the step.
Variables entered. The list of variables retained in this step.
Linear discriminant function. (Conditional: only if 2 groups specified). The constant term and the
coefficients of the linear discriminant function corresponding to the variables already entered.
Classification table for basic sample. Bivariate frequency table showing the re-distribution of cases
between the original groups and the groups to which they are allocated on the basis of the discriminant
function, followed by the percentage of the correctly classified cases.
Classification table for test sample. As for basic sample.
Case assignment list. (Optional: see the parameter PRINT). The cases of the three samples are printed
here with case identification, case allocation, and discriminant function value (for 2 groups) or distances to
each group (for more than 2 groups).
Discriminant factor analysis results. (Conditional: only if more than 2 groups specified). Overall
discriminant power and the discriminant power of the first three factors, followed by the values of discriminant
factors for group means. In addition, a graphical representation of cases and means in the space of the first
two factors is also given.
24.4 Output Dataset
A dataset with the final assignment of groups to cases can be requested. It is output in the form of a data
file described by an IDAMS dictionary (see parameter WRITE and Data in IDAMS chapter).
It contains in the following order:
- the transferred variables,
- the code of the original groups as renumbered by DISCRAN (Original group),
- the code of groups assigned to cases at the end (Assigned group),
- the Sample type (1=basic, 2=test, 3=anonymous) and,
- for analysis with more than 2 original groups, the values of the first two discriminant factors
(Factor-1, Factor-2).
The variables are renumbered starting from one.
The code of the original groups is set to the first missing data code (999.9999) for cases in the anonymous sample;
factors are set to the first missing data code (999.9999) for cases in the test and anonymous samples.
Note: the variable specified in IDVAR is not output automatically, so ID variables should be
included in the transfer variable list.
24.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. Three types of sample can be specified in the
input file, namely:
- basic sample,
- test sample, and
- anonymous sample.
The analysis is based on the basic sample. The test sample is used for testing the discriminant function(s)
while the cases of the anonymous sample are simply classified using the discriminant functions.
The samples are defined by a sample variable. The basic sample must not be empty. The groups to be
separated by the discriminant function(s) should be defined by a group variable. This variable defines an
a priori classification of the basic and test sample cases.
All variables used for analysis must be numeric; they may be integer or decimal valued. The case ID variable
and variables to be transferred can be alphabetic.
24.6 Setup Structure
$RUN DISCRAN
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output dictionary if WRITE=DATA specified
DATAyyyy output data if WRITE=DATA specified
PRINT results (default IDAMS.LST)
24.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V3=6 OR V11=99
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: DISCRIMINANT ANALYSIS ON AGRICULTURAL SURVEY
3. Parameters (mandatory). For selecting program options.
Example: MDHA=SAMPVAR IDVAR=V4 SAVAR=R5 BASA=(1,5) VARS=(V12-V15)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
VARS=(variable list)
List of V- and/or R-variables to be used in the analysis.
No default.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
MDHANDLING=(SAMPVAR, GROUPVAR, ANALVARS)
Choice of missing data treatment.
SAMP Cases with missing data in the sample variable are excluded from the analysis.
GROU Cases of basic and test samples with missing data in the group variable are excluded
from the analysis.
ANAL Cases with missing data in the analysis variables are excluded from the analysis.
Default: Cases with missing data are included.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
IDVAR=variable number
Case identification variable for the data and/or case assignment listing.
Default: DISC is used as identifier for all cases.
STEPMAX=n
Maximum number of steps to be performed. It must be less than or equal to the number of
analysis variables.
Default: Number of analysis variables.
MEMORY=20000/n
Memory necessary for program execution.
WRITE=DATA
Create an IDAMS dataset containing transferred variables, case assignment variables, sample type
and values of the discriminant factors, if any.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
TRANSVARS=(variable list)
Variables (up to 99) to be transferred to the output dataset.
PRINT=(CDICT/DICT, OUTCDICT/OUTDICT, DATA, GROUP)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
OUTC Print the output dictionary with C-records if any.
OUTD Print the output dictionary without C-records.
DATA Print the data with original group assignments of cases.
GROU Print for each case the group assignment based on discriminant function.
Sample specification
These parameters are optional. If they are not specified, all cases from the input file are taken for
the basic sample. Test and anonymous samples, if they exist, must always be explicitly defined. The
pair-wise intersection of the samples must be empty. However, they need not cover the whole input
data file. A single value or a range of values can be used for selecting the cases which belong to the
corresponding sample.
m1 = value of sample variable
or
m1 <= value of sample variable < m2
where m1 and m2 may be integer or decimal values.
SAVAR=variable number
The variable used for sample definition. A V- or R-variable can be used.
BASA=(m1, m2)
Conditional: defines the basic sample. Must be provided if SAVAR is specified.
TESA=(m1, m2)
Conditional and optional: if SAVAR is specified. Defines the test sample.
ANSA=(m1, m2)
Conditional and optional: if SAVAR is specified. Defines the anonymous sample.
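The selection rule for samples (a single value m1, or the half-open range m1 <= value < m2) can be illustrated with a short Python sketch. The helper names in_range and assign, and the particular BASA/TESA/ANSA values, are hypothetical and chosen only for this example; the sketch does not reproduce DISCRAN's internal logic.

```python
def in_range(value, spec):
    """Return True if `value` satisfies a sample definition.

    `spec` is either a single value m1 (exact match) or a pair (m1, m2)
    interpreted as the half-open range m1 <= value < m2.
    """
    if isinstance(spec, tuple):
        m1, m2 = spec
        return m1 <= value < m2
    return value == spec

def assign(value, definitions):
    """Return the name of the definition matching `value`, or None if no
    definition covers it (definitions need not cover the whole file)."""
    for name, spec in definitions.items():
        if in_range(value, spec):
            return name
    return None

# Hypothetical counterparts of BASA=(1,5) TESA=(5,8) ANSA=9.
samples = {"BASA": (1, 5), "TESA": (5, 8), "ANSA": 9}
```

Note that the upper bound m2 is excluded, so with these definitions a case with sample value 5 falls into the test sample and a case with value 8 belongs to no sample.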
Basic sample classification
These parameters define the a priori groups used in the discriminant analysis procedure. All the groups
must be defined explicitly and their pair-wise intersection must be empty. However, they need not
cover the whole basic sample.
GRVAR=variable number
The variable used for group definition. A V- or R-variable can be used.
No default.
GR01=(m1, m2)
Defines the first group in the basic sample.
GR02=(m1, m2)
Defines the second group in the basic sample.
GRnn=(m1, m2)
Defines the n-th group in the basic sample (nn <= 20).
Note. At least two groups have to be specified.
24.8 Restrictions
1. Maximum number of a priori groups is 20.
2. The same variable cannot be used twice.
3. Maximum field width of case ID variable is 4.
4. Maximum number of variables to be transferred is 99.
5. R-variables cannot be transferred.
6. If a variable to be transferred is alphabetic with width > 4, only the first four characters are used.
24.9 Examples
Example 1. Discriminant analysis on all cases together; cases are identified by the variable V1; 5 steps of analysis
are requested; a priori groups are defined by the variable V111 which includes categories 1-6.
$RUN DISCRAN
$FILES
PRINT = DISC1.LST
DICTIN = MY.DIC input Dictionary file
DATAIN = MY.DAT input Data file
$SETUP
CANONICAL LINEAR DISCRIMINANT ANALYSIS
PRINT=(DATA,GROUP) IDVAR=V1 STEP=5 VARS=(V101-V105) -
GRVAR=V111 GR01=(1,3) GR02=(3,5) GR03=(5,7)
Example 2. Repeat the analysis described in Example 1 using the subset of respondents having the value
1 on V5 as the basic sample and test the results on the respondents having the value 2 on V5.
$RUN DISCRAN
$FILES
as for Example 1
$SETUP
CANONICAL LINEAR DISCRIMINANT ANALYSIS USING BASIC AND TEST SAMPLES
PRINT=(DATA,GROUP) IDVAR=V1 STEP=5 VARS=(V101-V105) -
SAVAR=V5 BASA=1 TESA=2 -
GRVAR=V111 GR01=(1,3) GR02=(3,5) GR03=(5,7)
Chapter 25
Distribution and Lorenz Functions
(QUANTILE)
25.1 General Description
QUANTILE generates distribution functions, Lorenz functions, and Gini coefficients for individual variables,
and performs the Kolmogorov-Smirnov test between two variables or between two samples.
25.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. In addition, each analysis may be performed on a further subset by use of a filter parameter. Variables
to be analysed are specified with the VAR parameter.
Transforming data. Recode statements may be used.
Weighting data. A variable can be used to weight the input data; this weight variable may have integer
values not greater than 32,767. Note that decimal valued weights are rounded to the nearest integer. When
the value of the weight variable for a case is zero, negative, missing, non-numeric or exceeding the maximum,
then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. Cases containing a missing data value on analysis
variable are eliminated from that analysis.
25.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Results for each analysis.
Distribution function: minimum, maximum, and subinterval break points.
Lorenz function (optional): minimum, maximum, subinterval break points, and Gini coefficient.
Lorenz curve (optional): plotted in deciles.
Kolmogorov-Smirnov test statistics (optional).
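To make the Lorenz function and Gini coefficient concrete, here is a minimal Python sketch for unweighted data. It is an illustration under stated assumptions, not QUANTILE's algorithm: the exact formulas used by the program are not given here, values are assumed non-negative with a positive total, the Gini coefficient is approximated by the trapezoid rule over equally spaced Lorenz points, and the function names are hypothetical.

```python
def lorenz_points(values, n):
    """Cumulative share of the total held by the lowest k/n of the cases,
    for k = 0..n (n subintervals)."""
    data = sorted(values)
    total = sum(data)
    count = len(data)
    points = [0.0]
    for k in range(1, n + 1):
        upto = round(k * count / n)  # number of cases in the lowest k/n fraction
        points.append(sum(data[:upto]) / total)
    return points

def gini(points):
    """Gini coefficient: 1 minus twice the trapezoid-rule area under the Lorenz points."""
    n = len(points) - 1
    area = sum((points[k - 1] + points[k]) / (2 * n) for k in range(1, n + 1))
    return 1 - 2 * area

equal = lorenz_points([5, 5, 5, 5], 4)      # perfectly equal distribution
unequal = lorenz_points([0, 0, 0, 10], 4)   # one case holds everything
```

With perfectly equal values the Lorenz points lie on the diagonal and the coefficient is 0; the more the total is concentrated in a few cases, the closer the coefficient gets to 1.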
25.4 Input Dataset
The input is a Data file described by an IDAMS dictionary. All variables referenced (except main filter)
must be numeric; they may be integer or decimal valued.
25.5 Setup Structure
$RUN QUANTILE
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Subset specifications (optional)
5. QUANTILE
6. Analysis specifications (repeated as required)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
25.6 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 and 6 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V5=1
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: MAKING DECILES
3. Parameters (mandatory). For selecting program options.
Example: MDVAL=MD1, PRINT=DICT
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter. Cases with missing data in an analysis are eliminated from that
analysis.
PRINT=CDICT/DICT
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
4. Subset specifications (optional). These statements permit selection of a subset of cases for a particular
analysis.
Example: FEMALE INCLUDE V6=2
Rules for coding
Prototype: name statement
name
Subset name. 1-8 alphanumeric characters beginning with a letter. This name must match
exactly the name used on subsequent analysis specifications. Embedded blanks are not allowed.
It is recommended that all names be left-justified.
statement
Subset definition which follows the syntax of the standard IDAMS filter statement.
5. QUANTILE. The word QUANTILE on this line signals that analysis specifications follow. It must
be included (in order to separate subset specifications from analysis specifications) and must appear
only once.
6. Analysis specifications. The coding rules are the same as for parameters. Each analysis specification
must begin on a new line.
Examples: VAR=R10 N=5 PRINT=CLORENZ
VAR=V25 N=10 FILTER=MALE ANALID=M
VAR=V25 N=10 FILTER=FEMALE KS=M
VAR=variable number
Variable to be analysed.
No default.
WEIGHT=variable number
The weight variable number if the data are to be weighted. Data weighting is not allowed for the
Kolmogorov-Smirnov test.
N=20/n
Number of subintervals. If n<2 or n>100, a warning is printed and the default value of 20 is used.
FILTER=xxxxxxxx
Only cases which satisfy the condition defined on the subset specification named xxxxxxxx will
be used for this analysis. Enclose the name in primes if it contains non-alphanumeric characters.
Upper case letters should be used in order to match the name on the subset specification which
is automatically converted to upper case.
ANALID=label
A label for this analysis so that it can be referenced for doing a Kolmogorov-Smirnov test. Must
be enclosed in primes if it contains non-alphanumeric characters.
KS=label
Label is the label assigned to a previous analysis through the ANALID parameter and defines the
variable and/or sample with which this analysis is to be compared using the Kolmogorov-Smirnov
test. Must be enclosed in primes if it contains non-alphanumeric characters.
PRINT=(FLORENZ, CLORENZ)
FLOR Print the Lorenz function and Gini coefficient.
CLOR Print the Lorenz curve plotted in deciles. (Lorenz function is also printed).
Note: If KS is specified, the PRINT parameter is ignored.
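The comparison requested with ANALID and KS rests on the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between two empirical distribution functions. The Python sketch below computes only that statistic for unweighted samples; the function name is hypothetical and the sketch does not reproduce QUANTILE's output or its significance calculation.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the empirical distribution
    functions of two unweighted samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    # Once one sample is exhausted its distribution function equals 1,
    # so the gap can only shrink and the maximum is already recorded.
    return d
```

Identical samples give a statistic of 0, while samples with no overlap in their ranges give 1.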
25.7 Restrictions
1. Maximum number of variables used (analysis+weight+local filter) is 50.
2. Maximum number of cases that can be analyzed is 5000.
3. Minimum number of subintervals is 2; maximum is 100.
4. Maximum number of subset specifications is 25.
5. If using the Kolmogorov-Smirnov test, the maximum number of cases that can be analyzed is 2500.
6. The Lorenz function and the Kolmogorov-Smirnov test cannot be requested for the same analysis.
7. The break point values are always printed with three decimal places. Variables with more than three
decimals are truncated to three places when printed.
25.8 Example
Generation of distribution function, Lorenz function and Gini coefficients for variable V67; separate analyses
are performed on all the data and then on two subsets; the Kolmogorov-Smirnov test is performed to test
the difference of distributions of variable V67 in the two subsets of data.
$RUN QUANTILE
$FILES
PRINT = QUANT.LST
DICTIN = MY.DIC input Dictionary file
DATAIN = MY.DAT input Data file
$SETUP
COMPARISON OF AGE DISTRIBUTIONS FOR FEMALE AND MALE
* (default values taken for all parameters)
FEMALE INCLUDE V12=1
MALE INCLUDE V12=2
QUANTILE
VAR=V67 N=15 PRINT=(FLOR,CLOR)
VAR=V67 N=15 PRINT=(FLOR,CLOR) FILT=FEMALE ANALID=F
VAR=V67 N=15 PRINT=(FLOR,CLOR) FILT=MALE
VAR=V67 N=15 FILT=MALE KS=F
Chapter 26
Factor Analysis (FACTOR)
26.1 General Description
FACTOR covers a set of principal component factor analyses and analysis of correspondences having common
specifications. It provides the possibility of performing, with only one read of the data, factor analyses of
correspondences, scalar products, normed scalar products, covariances and correlations.
For each analysis the program constructs a matrix representing the relations among the variables and com-
putes its eigenvalues and eigenvectors. It then calculates the case and variable factors giving for each
case and variable its ordinate, its quality of representation and its contribution to the factors. A graphic
representation of the factors with ordinary or simplicio-factorial options can also be printed.
The principal variables/cases are the variables/cases on the basis of which the factorial decomposition
procedure is performed, i.e. they are used in computing the matrix of relations. One can also look for a
representation of other variables/cases in the factor space corresponding to the principal variables. Such
variables/cases (having no influence on the factors) are called supplementary variables/cases.
One speaks about ordinary representation (of variables/cases) if the values (factor scores) coming directly
from the analysis are used in the graphic representation. However, for a better understanding of the relation
between variables and cases, another simultaneous representation, the simplicio-factorial representation,
is possible.
26.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. Variables are selected with the PVARS and SVARS parameters.
Transforming data. Recode statements may be used.
Weighting data. A variable can be used to weight the input data; this weight variable may have integer or
decimal values. When the value of the weight variable for a case is zero, negative, missing or non-numeric,
then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. There are two ways of handling missing data:
- cases with missing data in principal variables are excluded from the analysis,
- cases with missing data in principal and/or supplementary variables are excluded from the analysis.
26.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Summary statistics. (Optional: see the parameter PRINT). Variable number, variable label, new variable
number (re-numbered from 1), minimum and maximum values, mean, standard deviation, coefficient of
variability, total, variance, skewness, kurtosis and weighted number of valid cases for each variable. Note
that standard deviation and variance are estimates based upon weighted data.
Input data. (Optional: see the parameter PRINT). Groups of 16 variables with, on each row: the corre-
sponding number of cases, the total for principal variables and the values of all the variables, preceded by
the total for the columns (calculated for only the principal cases). Values are printed with explicit decimal
point and with one decimal place. If more than 7 characters are required for printing a value, it is replaced
by asterisks.
Matrix of relations (core matrix). (Optional: see the parameter PRINT). The matrix (after multipli-
cation by ten to the n-th power as indicated in the line printed before the matrix), the trace value and the
table of eigenvalues and eigenvectors.
Histogram of eigenvalues. The histogram with the percentages and cumulative percentages of each
eigenvalue's contribution to the total inertia. The dashes in the histogram show the Kaiser criterion for the
correlation analysis.
Dictionaries of the output data files. (Optional: see the parameter PRINT). The dictionary pertaining
to the case factors followed by that of the variable factors.
Table(s) of factors. Depending upon the option(s) chosen, there will be: one table (either for case
factors or for variable factors), or two tables (for both case and variable factors, in that order).
According to the printing option chosen, these tables will contain only the principal cases (variables), only
the supplementary ones, or both.
Table of case factors. It gives, line by line:
- case ID value,
- information relevant to all factors taken together, i.e. the quality of representation of the case in the
space defined by the factors, the weight of the case and the inertia of the case,
- information for each factor in turn, i.e. the ordinate of the case, the square cosine of the angle between
the case and the factor, and the contribution of the case to the factor.
Table of variable factors. It gives, line by line, similar information for the variables.
Scatter plots. (Optional: see the parameter PLOTS). The first line gives the number of the factor represented
along the horizontal axis with its eigenvalue and its min-max range. The second line gives the same
information concerning the vertical axis. Along with the label of the execution, the number of cases/variables
(i.e. points) that are represented is given. At the right side of each graph are printed:
- number of points which cannot be printed for that ordinate (overlapping points),
- number of points which it was not possible to represent,
- page number.
Rotated factors. (Optional: see the parameter ROTATION). The variance calculated for each factor ma-
trix in each iteration of the rotation (using the VARIMAX method) is printed, followed by the communalities
of the variables before and after rotation, ending with the table of rotated factors.
Termination message. At the end of each analysis a termination message is printed with the type of
analysis performed.
26.4 Output Dataset(s)
Two Data files, each with an associated IDAMS dictionary, can optionally be constructed. In the case
factors dataset, the records correspond to the cases (both principal and supplementary), the columns correspond
to variables (including the case identification and transferred variables) and factors. In the variable
factors dataset, the records correspond to the analysis variables, while the columns contain the variable
identifications (original variable numbers) and factors.
Output variables are numbered sequentially starting from 1 and they have the following characteristics:
Case identification (ID) and transferred variables: V-variables have the same characteristics as their
input equivalents, Recode variables are output with WIDTH=9 and DEC=2.
Computed factor variables:
    Name              specified by FNAME
    Field width       7
    No. of decimals   5
    MD1 and MD2       9999999
26.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. All variables used for analysis must be numeric;
they may be integer or decimal valued. They should be dichotomous or measured on an interval scale.
The case ID variable and variables to be transferred can be alphabetic. There are two kinds of analysis
variables, namely, principal and supplementary. In addition one variable identifying the case must exist.
Other variables can be selected for transfer to the output data file of case factors. One or more cases at
the end of the input data file can be specified as supplementary cases.
For analysis of correspondence, two types of data are suitable: a) dichotomous variables from a raw data file
or b) a contingency table described by a dictionary and input as an IDAMS dataset.
26.6 Setup Structure
$RUN FACTOR
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. User-defined plot specifications (conditional)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output dictionary for case factors
DATAyyyy output data for case factors
DICTzzzz output dictionary for variable factors
DATAzzzz output data for variable factors
PRINT results (default IDAMS.LST)
26.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-4 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: EXCLUDE V10=99 OR V11=99
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: AGRICULTURAL SURVEY 1984
3. Parameters (mandatory). For selecting program options.
Example: ANAL=(CRSP,SSPRO) TRANS=(V16,V20) IDVAR=V1 -
PVARS=(V31-V35)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
MDHANDLING=PRINCIPAL/ALL
PRIN Cases with missing data in the principal variables are excluded from the analysis while
cases with missing data in supplementary variables are included. Supplementary vari-
ables factors are based on valid data only.
ALL All cases with missing data are excluded.
ANALYSIS=(CRSP/NOCRSP, SSPRO, NSSPRO, COVA, CORR)
Choice of analyses.
CRSP Factor analysis of correspondences.
SSPR Factor analysis of scalar products.
NSSP Factor analysis of normed scalar products.
COVA Factor analysis of covariances.
CORR Factor analysis of correlations.
PVARS=(variable list)
List of V- and/or R-variables to be used as the principal variables.
No default.
SVARS=(variable list)
List of V- and/or R-variables to be used as supplementary variables.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
NSCASES=0/n
Number of supplementary cases. Note: These cases are not included in the computations of
statistics, matrix and factors; they are the last n ones in the data file.
IDVAR=variable number
Case identification variable for points on the plots and for cases in the output file.
No default.
KAISER/NFACT=n/VMIN=n
Criterion for determining the number of factors.
KAIS Kaiser's criterion - number of roots greater than 1.
NFAC Number of factors desired.
VMIN The minimum percentage of variance to be explained by the factors taken all together.
Do not type the decimal, e.g. VMIN=95.
ROTATION=KAISER/UDEF/NOROTATION
Specifies VARIMAX rotation of variable factors. Only for correlation analysis.
KAIS Number of factors to be rotated is defined according to Kaiser's criterion.
UDEF Number of factors to be rotated is specified by the user (see the parameter NROT).
NROT=1/n
Number of factors to be rotated (if ROTATION=UDEF specified).
WRITE=(OBSERV, VARS)
Controls output of files of case and variable factors. If more than one analysis is requested
on the ANALYSIS parameter, these files will only be for the first specified.
OBSE Create a file containing case factors.
VARS Create a file containing variable factors.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the Dictionary and Data files for case factors.
Default ddnames: DICTOUT, DATAOUT.
OUTVFILE=OUTV/zzzz
A 1-4 character ddname suffix for the Dictionary and Data files for variable factors.
Default ddnames: DICTOUTV, DATAOUTV.
TRANSVARS=(variable list)
Variables (up to 99) to be transferred to the output case factor file.
FNAME=uuuu
A 1-4 character string used as a prefix for variable names of factors in output dictionaries. Must
be enclosed in primes if it contains any non-alphanumeric characters. Factors have names
uuuuFACT0001, uuuuFACT0002, etc.
Default: Blank.
PLOTS=STANDARD/USER/NOPLOTS
Controls graphical representation of results.
STAN Standard plots will be printed for factor pairs 1-2, 1-3, 2-3 with options PAGES=1,
OVLP=LIST, NCHAR=4, REPR=COORD, VARPLOT=(PRINCIPAL,SUPPL).
USER User-defined plots are desired (see parameters for user-defined plots below).
PRINT=(CDICT/DICT, OUTCDICTS/OUTDICTS, STATS, DATA, MATRIX, VFPRINC/NOVFPRINC,
VFSUPPL, OFPRINC, OFSUPPL)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
OUTC Print output dictionaries with C-records if any.
OUTD Print output dictionaries without C-records.
STAT Print statistics of principal and supplementary variables.
DATA Print input data.
MATR Print the matrix of relations (core matrix) and eigenvectors.
VFPR Print variable factors for the principal variables.
VFSU Print variable factors for supplementary variables.
OFPR Print case factors for the principal cases.
OFSU Print case factors for supplementary cases.
4. User-defined plot specifications (conditional: if PLOT=USER specified as parameter). Repeat
for each two-dimensional plot to be printed. The coding rules are the same as for parameters. Each
plot specification must begin on a new line.
Example: X=3 Y=10
X=factor number
Number of the factor to be represented on the horizontal axis.
Y=factor number
Number of the factor to be represented on the vertical axis (see also the plot parameter
FORMAT=STANDARD).
ANSP=ALL/CRSP/SSPRO/NSSPRO/COVA/CORR
Specifies the analyses for which the plots are to be printed.
ALL Plots for all analyses specified in the ANALYSIS parameter.
For the rest, a plot for a single analysis (keywords have same meaning as for ANALYSIS
parameter). These options imply one plot only.
OBSPLOT=(PRINCIPAL, SUPPL)
Choice of cases to be represented on the plot(s).
PRIN Represent principal cases.
SUPP Represent supplementary cases.
VARPLOT=(PRINCIPAL/NOPRINCIPAL, SUPPL)
Choice of variables to be represented on the plot(s).
PRIN Represent principal variables.
SUPP Represent supplementary variables.
REPRESENT=COORD/BASVEC/NORMBV
Choice of simultaneous representation of points (variables/cases).
COOR Coordinates as indicated in the table of factors.
BASV Represent basic vectors.
NORM Represent basic vectors using special norm for simplicio-factorial representation.
OVLP=FIRST/LIST/DEN
Option concerning the representation of overlapping points.
FIRS Print the variable number/case ID of the first point only.
LIST Give a vertical list of the points having the same abscissa in the graph until another
point is met (the variable number/case IDs are then lost).
DEN Print the density (number of overlapping points): "." for one point, ":" for two
(overlapping) points, "3" for three points, and so on up to "9" for nine points, and "*" for
more than nine points. NCHAR=2 must be specified if this option is selected.
NCHAR=4/n
Number of digits/characters used for the identification of the variables/cases on the plot(s) (1 to
4 characters).
PAGES=1/n
Number of pages per plot.
FORMAT=STANDARD/NONSTANDARD
Defines frame size of the plot.
STAN Use a 21 x 30 cm frame for the plot showing the factor with the wider range on the
horizontal axis and using different scales for the two axes.
NONS The frame will not be standardized in the sense above. Size of plot is defined by
PAGES=n, and meaning of axes by X and Y.
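The KAISER and VMIN criteria for the number of factors, described among the parameters above, can be illustrated with a short Python sketch. This is not the FACTOR program's own code: the function names and the eigenvalues are hypothetical, and the sketch assumes that the roots are the eigenvalues of the analysed matrix and that VMIN is applied to their cumulative share of the total.

```python
def n_factors_kaiser(eigenvalues):
    """Kaiser's criterion: count the roots greater than 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


def n_factors_vmin(eigenvalues, vmin_percent):
    """Smallest number of leading factors whose cumulative share of the
    total reaches vmin_percent (e.g. 95 for VMIN=95)."""
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for k, ev in enumerate(ordered, start=1):
        cumulative += ev
        if 100.0 * cumulative / total >= vmin_percent:
            return k
    return len(ordered)


# Illustrative eigenvalues only (assumed, not IDAMS output).
roots = [4.0, 2.0, 1.0, 0.5, 0.5]
print(n_factors_kaiser(roots))      # two roots exceed 1
print(n_factors_vmin(roots, 85))    # three factors reach 87.5 percent
```

With NFACT=n, by contrast, the number of factors is simply the value supplied by the user.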
26.8 Restrictions
1. Maximum number of analysis variables is 80.
2. One (and only one) identification variable must be specified.
3. Maximum number of variables to be transferred is 99.
4. Maximum number of input variables including those used in filter and Recode statements is 100.
5. Maximum of 24 user-defined plots.
6. If the ID variable or a variable to be transferred is alphabetic with width > 4, only the first four
characters are used.
7. For the parameters the following must hold:
max(D1,D2,D3) < 5000
where
D1 = NPV * NPV + 10 * NV
D2 = NV * (NF + 6) + NPV * NIF
D3 = NV + NF + NIF + 3 * NP
and NV, NPV, NF, NIF, NP denote the total number of analysis variables, number of principal
variables, number of factors to be computed, number of factors to be ignored, maximum number of
points to be represented in the plots respectively.
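Restriction 7 can be checked before a run with a few lines of Python. The sketch below only evaluates the three expressions given above; the function name and the sample figures are hypothetical.

```python
def factor_size_check(nv, npv, nf, nif, np_points, limit=5000):
    """Evaluate D1, D2, D3 from restriction 7 and test max(D1, D2, D3) < limit.

    nv        total number of analysis variables (NV)
    npv       number of principal variables (NPV)
    nf        number of factors to be computed (NF)
    nif       number of factors to be ignored (NIF)
    np_points maximum number of points represented in the plots (NP)
    """
    d1 = npv * npv + 10 * nv
    d2 = nv * (nf + 6) + npv * nif
    d3 = nv + nf + nif + 3 * np_points
    return max(d1, d2, d3) < limit, (d1, d2, d3)


# Assumed figures: 22 analysis variables, 20 of them principal, 7 factors,
# none ignored, at most 500 points per plot.
ok, sizes = factor_size_check(nv=22, npv=20, nf=7, nif=0, np_points=500)
print(ok, sizes)
```

In this illustration D3 is the largest term, so the number of plotted points is what would have to be reduced first if the limit were exceeded.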
26.9 Examples
Example 1. Factor analysis of correlations; analyses are based upon 20 variables and 7 factors are requested;
number of factors to be rotated is defined according to Kaiser's criterion; statistics, correlation matrix and
eigenvectors will be printed, followed by variable factors and standard plots; factors will not be kept in a file.
$RUN FACTOR
$FILES
PRINT = FACT1.LST
DICTIN = A.DIC input Dictionary file
DATAIN = A.DAT input Data file
$SETUP
FACTOR ANALYSIS OF CORRELATIONS
ANAL=(NOCRSP,CORR) ROTA=KAISER NFACT=7 IDVAR=V1 PRINT=(STATS,MATRIX) -
PVARS=(V12-V16,V101-V115)
Example 2. Factor analysis of scalar products based upon 10 variables; 2 supplementary variables, V5 and
V7, are to be represented on plots; plots are defined by user since only the 1st point of overlapping points
is required; Kaiser's criterion is used to determine the number of factors; both variable and case factors will
be written into files.
$RUN FACTOR
$FILES
DICTIN = A.DIC input Dictionary file
DATAIN = A.DAT input Data file
DICTOUT = CASEF.DIC Dictionary file for case factors
DATAOUT = CASEF.DAT Data file for case factors
DICTOUTV = VARF.DIC Dictionary file for variable factors
DATAOUTV = VARF.DAT Data file for variable factors
$SETUP
FACTOR ANALYSIS OF SCALAR PRODUCTS
ANAL=(NOCRSP,SSPR) IDVAR=V1 WRITE=(OBSERV,VARS) PRINT=STATS PLOT=USER -
PVARS=(V112-V116,V201-V205) SVARS=(V5,V7)
X=1 Y=2 VARP=(PRINCIPAL,SUPPL)
X=1 Y=3 VARP=(PRINCIPAL,SUPPL)
X=2 Y=3 VARP=(PRINCIPAL,SUPPL)
Example 3. Correspondence analysis using a contingency table described by a dictionary and entered as
a dataset in the Setup file to be executed; number of factors is defined by Kaiser's criterion; matrix of
relations will be printed, followed by variable and case factors, and by user-defined plots of variables and
cases.
$RUN FACTOR
$FILES
PRINT = FACT3.LST
$SETUP
CORRESPONDENCE ANALYSIS ON CONTINGENCY TABLE
BADD=MD1 IDVAR=V8 PLOTS=USER PRINT=(MATRIX,OFPRINC) PVARS=(V31-V33)
$DICT
$PRINT
3 8 33 1 1
T 8 Scientific degree 1 20
C 8 81 Professor
C 8 82 Ass.Prof.
C 8 83 Doctor
C 8 84 M.Sc
C 8 85 Licence
C 8 86 Other
T 31 Head 4 20
T 32 Scientifc 7 20
T 33 Technician 10 20
$DATA
$PRINT
81 5 0 0
82 1 3 0
83 0 17 01
84 0 28 04
85 0 0 01
86 0 0 17
Chapter 27
Linear Regression (REGRESSN)
27.1 General Description
REGRESSN provides a general multiple regression capability designed for either standard or stepwise linear
regression analysis. Several regression analyses, using different parameters and variables, may be performed
in one execution.
Constant term. If the input is raw data, the user may request that the equations have no constant term
(see the regression parameter CONSTANT=0). In such case, a matrix based on the cross-product matrix is
analyzed instead of a correlation matrix. This changes the slope of the fitted line and can substantially affect
the results. In stepwise regression, variables may enter the equation in a different order than they would if
a constant term were estimated. If a correlation matrix is input, the regression equation always includes a
constant term.
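The effect of suppressing the constant term can be seen in a one-predictor case. The following Python sketch is an illustration only, with hypothetical function names and made-up data; it uses the textbook least-squares formulas, with centered sums when a constant is estimated and raw cross-products when it is not.

```python
def fit_with_constant(x, y):
    """Least-squares slope and constant term for y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centered sums of squares and cross-products.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def fit_without_constant(x, y):
    """Least-squares slope for y = b*x (line forced through the origin)."""
    # Raw (uncentered) sums of squares and cross-products.
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]          # made-up data lying exactly on y = 1 + 2x
slope, constant = fit_with_constant(x, y)
slope_origin = fit_without_constant(x, y)
print(slope, constant, slope_origin)
```

For these data the slope is 2 when a constant is estimated and 70/30 (about 2.33) when the line is forced through the origin, which shows how much the choice can alter the coefficients.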
Use of categorical variables as independent variables. An option is available to create a set of dummy
(dichotomous) variables from specified categorical variables (see the parameter CATE). These can be used
as independent variables in the regression analysis.
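As an illustration of what such a set of dummy variables looks like, the Python sketch below builds one 0/1 indicator per listed code. The function name is hypothetical, and the treatment of unlisted codes (all indicators zero, so that they act as the reference category) is an assumption made for the sketch rather than a description of the program's internals.

```python
def make_dummies(values, codes):
    """Return one row of 0/1 indicators per case, one column per listed code.

    A case whose value is not among the listed codes gets zeros in every
    column (assumed here to serve as the reference category).
    """
    return [[1 if value == code else 0 for code in codes] for value in values]


# Hypothetical categorical variable with codes 1, 2, 5 and 6 present in the
# data; only codes 1, 5 and 6 are listed, so code 2 is left as reference.
cases = [1, 5, 6, 2]
dummies = make_dummies(cases, codes=[1, 5, 6])
for value, row in zip(cases, dummies):
    print(value, row)
```

Leaving at least one existing code out of the list keeps the dummy columns from summing to a constant, which would otherwise make the matrix singular when a constant term is estimated.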
F-ratio for a variable to enter in the equation. In a stepwise regression, variables are added in turn to
the regression equation until the equation is satisfactory. At each step the variable with the highest partial
correlation with the dependent variable is selected. A partial F-test value is then computed for the variable
and this value is compared to a critical value supplied by the user. As soon as the partial F for the next to
be entered variable becomes less than the critical value, the analysis is terminated.
F-ratio for a variable to be removed from the equation. A variable which may have been the best
single variable to enter at an early stage of a stepwise regression may, at a later stage, not be the best because
of the relationship between it and other variables now in the regression. To detect this, the partial F-value
for each variable in the regression at each step of the calculation is computed and compared with a critical
value supplied by the user. Any variable whose partial F-value falls below the critical value is removed from
the model.
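The two F-ratio tests can be sketched in Python as follows. The partial F used here is the usual textbook definition (the drop in the residual sum of squares due to the variable, divided by the residual mean square of the larger equation); the manual does not spell out the exact computation used by REGRESSN, so the formula, the function names and the numbers are assumptions for illustration.

```python
def partial_f(rss_without, rss_with, residual_df_with):
    """Partial F for a single variable.

    rss_without      residual sum of squares of the equation without the variable
    rss_with         residual sum of squares of the equation including it
    residual_df_with residual degrees of freedom of the equation including it
    """
    return (rss_without - rss_with) / (rss_with / residual_df_with)


def should_enter(f_value, f_to_enter):
    """A candidate enters unless its partial F is below the critical value."""
    return f_value >= f_to_enter


def should_remove(f_value, f_to_remove):
    """A variable already in the equation is removed when its partial F
    falls below the critical value."""
    return f_value < f_to_remove


# Assumed figures: adding the variable lowers the residual sum of squares
# from 120 to 100, leaving 50 residual degrees of freedom.
f_value = partial_f(120.0, 100.0, 50)
print(f_value, should_enter(f_value, 4.0), should_remove(f_value, 3.9))
```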
Stepwise regression. If stepwise regression is requested, the program determines which variables or which
sets of dummy variables among the specified set of independent variables will actually be used for the
regression, and in which order they will be introduced, beginning with the forced variables and continuing
with the other variables and sets of dummy variables, one by one. After each step the algorithm selects from
the remaining predictor variables the variable or set of dummy variables which yields the largest reduction
in the residual (unexplained) variance of the dependent variable, unless its contribution to the total F-ratio
for the regression remains below a specified threshold. Similarly, the algorithm evaluates after each step
whether the contribution of any variable or set of dummy variables already included falls below a specified
threshold, in which case it is dropped from the regression.
Descending stepwise regression. Like the stepwise regression, except that the algorithm starts with all
the independent variables and then drops variables and sets of dummy variables in a stepwise manner. At
each step the algorithm selects from the remaining included predictor variables the variable or set of dummy
variables which yields the smallest reduction in the explained variance of the dependent variable, unless this
exceeds a specified threshold. Similarly, the algorithm evaluates at each step whether the contribution of
any variable or set of dummy variables previously dropped from the regression has risen above a specified
threshold, in which case it is added back into the regression.
Generating a residuals dataset. With raw data input, residuals may be computed and output as a
data file described by an IDAMS dictionary. See the Output Residuals Datasets section for details on the
content. Note that a separate residuals dataset is generated from each equation. Also, since REGRESSN
has no facility to transfer specific variables of interest in a residuals analysis from the input raw data to the
residuals dataset, it may be necessary to use the MERGE program to create the dataset containing all of
the desired variables. A case ID variable from the input dataset is output to the residuals dataset to make
matching possible.
Generating a correlation matrix. If raw data are input, the program computes correlation coefficients
which may be output in the format of an IDAMS square matrix and used for further analysis. REGRESSN
correlations include all variables across all regression equations and are based on cases which have valid data
on all variables in the matrix. Thus, the correlations will usually differ from correlations obtained from
the PEARSON program execution with the MDHANDLING=PAIR option. When missing data elimination
in REGRESSN leaves the sample size acceptably large, REGRESSN is an alternative to PEARSON for
generating a correlation matrix (see the paragraph Treatment of missing data).
27.2 Standard IDAMS Features
Case and variable selection. If raw data are input, the standard filter is available to select a subset of
cases from the input data. If a matrix of correlations is used as input to the program, case selection is not
applicable. The variables for the regression equation are specified in the regression parameters DEPVAR
and VARS.
Transforming data. If raw data are input, Recode statements may be used.
Weighting data. If raw data are input, a variable can be used to weight the input data; this weight variable
may have integer or decimal values. The program will force the sum of the weights to equal the number of
input cases. When the value of the weight variable for a case is zero, negative, missing or non-numeric, then
the case is always skipped; the number of cases so treated is printed.
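The weight handling described above can be pictured with the following Python sketch. It is an assumption-laden illustration, not the program's code: the function name is hypothetical, missing or non-numeric weights are represented by None or non-numeric objects, and the weights are rescaled so that they sum to the number of cases actually retained (the manual says only "the number of input cases").

```python
def normalize_weights(weights):
    """Drop unusable weights and rescale the rest to sum to the case count.

    Returns the rescaled weights of the retained cases and the number of
    cases skipped because the weight was zero, negative, missing or
    non-numeric.
    """
    kept = [w for w in weights
            if isinstance(w, (int, float)) and not isinstance(w, bool) and w > 0]
    skipped = len(weights) - len(kept)
    if not kept:
        return [], skipped
    scale = len(kept) / sum(kept)
    return [w * scale for w in kept], skipped


# Made-up weights: one zero, one missing (None) and one negative value.
scaled, skipped = normalize_weights([2.0, 4.0, 0.0, None, -1.0, 2.0])
print(scaled, skipped, sum(scaled))
```

Rescaling in this way keeps the relative importance of the cases while preventing the weights from inflating or deflating the sample size used in the significance tests.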
Treatment of missing data.
1. Input. If raw data are input, the MDVALUES parameter is available to indicate which missing
data values, if any, are to be used to check for missing data. Cases in which missing data occur in
any regression variable in any analysis are deleted (case-wise missing data deletion). An option
(see the parameter MDHANDLING) allows the user to specify the maximum number of missing data
cases which can be tolerated before the execution is terminated. Warning: If multiple analyses are
performed in one REGRESSN execution, a single correlation matrix is computed for all variables used
in the different analyses. Because of the case-wise method of deleting cases with missing data, the
number of cases used and thus the regression statistics produced may be different if the analyses are
then performed separately.
If a matrix is input, cases with missing data should have been accommodated when the matrix was
created. If a cell of the input matrix has a missing data code (i.e. 99.999) any analysis involving that
cell will be skipped.
2. Output residuals. If residuals are requested, predicted values and residuals are computed for all
cases which pass the (optional) filter. If a case has missing data on any of the variables required for
these computations, output missing data codes are generated.
3. Output correlation matrix. The REGRESSN algorithm for handling missing data on raw data
input cannot result in missing data entries in the correlation matrix.
27.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Univariate statistics. (Raw data input only). The sum, mean, standard deviation, coefficient of variation,
maximum, and minimum are printed for all dependent and independent variables used.
Matrix of total sums of squares and cross-products. (Raw data input only. Optional: see the
parameter PRINT).
Matrix of residual sums of squares and cross-products. (Raw data input only. Optional: see the
parameter PRINT).
Total correlation matrix. (Optional: see the parameter PRINT).
Partial correlation matrix. (Optional for each regression: see the regression parameter PARTIALS).
The ij-th element is the partial correlation between variable i and variable j, holding constant the variables
specified in the PARTIALS variable list.
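For a single variable held constant, the partial correlation can be obtained from three ordinary correlations with the standard first-order formula. The Python sketch below shows that special case only; the function name is hypothetical, and a PARTIALS list with several variables would require the general (higher-order) computation, which is not shown.

```python
import math


def partial_correlation(r_ij, r_ik, r_jk):
    """First-order partial correlation of i and j, holding k constant."""
    numerator = r_ij - r_ik * r_jk
    denominator = math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))
    return numerator / denominator


# Made-up correlations: i and j each correlate 0.5 with k and 0.6 with
# each other.
print(partial_correlation(0.6, 0.5, 0.5))
```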
Inverse matrix. (Optional for each regression: see the regression parameter PRINT).
Analysis summary statistics. The following statistics are printed for each regression or for each step of
a stepwise regression:
standard error of estimate,
F-ratio,
multiple correlation coefficient (adjusted and unadjusted),
fraction of explained variance (adjusted and unadjusted),
determinant of the correlation matrix,
residual degrees of freedom,
constant term.
Analysis statistics for predictors. The following statistics are printed for each regression or for each
step of a stepwise regression:
coefficient B (unstandardized partial regression coefficient),
standard error (sigma) of B,
coefficient beta (standardized partial regression coefficient),
standard error (sigma) of beta,
partial and marginal R squared,
t-ratio,
covariance ratio,
marginal R squared values for all predictors and t-ratios for all sets of dummy variables (for stepwise
regression).
Residual output dictionary. (For raw data input only. Optional: see the regression parameter WRITE).
Residual output data. (For raw data input only. Optional: see the regression parameter PRINT). If
there are fewer than 1000 cases, calculated values, observed values and residuals (differences) may be listed
in ascending order of residual value. Any number of cases may be listed in input case sequence order. The
Durbin-Watson statistic for association of residuals will be printed for residuals listed in case sequence order.
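The Durbin-Watson statistic mentioned above is, in its standard definition, the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal Python sketch, with a hypothetical function name and made-up residuals taken in case sequence order:

```python
def durbin_watson(residuals):
    """Standard Durbin-Watson statistic for residuals in case sequence order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


print(durbin_watson([1.0, -1.0, 1.0, -1.0]))   # alternating signs
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))     # identical residuals
```

Values near 2 suggest no first-order association between successive residuals, values towards 0 suggest positive association and values towards 4 negative association. This is why the statistic is meaningful only when residuals are listed in case sequence order.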
27.4 Output Correlation Matrix
The computed correlation matrix may be output (see the parameter WRITE). It is written in the form of
an IDAMS square matrix (see Data in IDAMS chapter). The format is 6F11.7 for the correlations and
4E15.7 for the means and standard deviations. In addition, labeling information is written in columns 73-80
of the records as follows:
matrix-descriptor record      N=nnnnn
correlation records           REG xxx
means records                 MEAN xxx
standard deviation records    SDEV xxx
(nnnnn is the REGRESSN sample size. The xxx is a sequence number beginning with 1 for the first
correlation record and incremented by one for each successive record through the last standard deviation
record).
The elements of the matrix are Pearson r's. They, as well as the means and standard deviations, are based
on the cases that have valid data on all the variables specified in any of the regression variable lists. The
correlations are for all pairs of variables from all the analysis variable lists taken together.
27.5 Output Residuals Dataset(s)
For each analysis, a residuals dataset can be requested (see the regression parameter WRITE). This is output
in the form of a Data file described by an IDAMS dictionary. It contains either four or five variables per
case, depending on whether or not the data were weighted: an ID variable, a dependent variable, a predicted
(calculated) dependent variable, a residual, and a weight, if any. Cases are output in the order of the input
cases. The characteristics of the dataset are as follows:
                        Variable                     Field   No. of     MD1
                        No.       Name               Width   Decimals   Code
(ID variable)           1         same as input      *       0          same as input
(dependent variable)    2         same as input      *       **         same as input
(predicted variable)    3         Predicted value    7       ***        9999999
(residual)              4         Residual           7       ***        9999999
(weight-if weighted)    5         same as input      *       **         same as input
* transferred from input dictionary for V variables or 7 for R variables
** transferred from input dictionary for V variables or 2 for R variables
*** 6 plus no. of decimals for dependent variable minus width of dependent variable; if this is
negative, then 0.
If the calculated value or residual exceeds the allocated field width, it is replaced by MD1 code.
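The rule marked *** for the number of decimals of the predicted value and the residual is easy to misread, so a small Python sketch may help. The function name and the sample widths are hypothetical; the arithmetic is exactly the rule stated in the note above.

```python
def output_decimals(dep_width, dep_decimals):
    """Decimals for the predicted value and residual: 6 plus the number of
    decimals of the dependent variable minus its field width, or 0 if that
    result is negative."""
    return max(0, 6 + dep_decimals - dep_width)


# Made-up dependent variables: (field width, number of decimals).
for width, decimals in [(4, 1), (3, 2), (8, 0)]:
    print(width, decimals, output_decimals(width, decimals))
```

A narrow dependent variable therefore leaves more of the 7-character field for decimals, while a dependent variable wider than 6 characters yields predicted values and residuals with no decimals.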
27.6 Input Dataset
The input raw dataset is a Data file described by an IDAMS dictionary. All variables used for analysis must
be numeric; they may be integer or decimal valued. The case ID variable can be alphabetic.
27.7 Input Correlation Matrix
This is an IDAMS square matrix. A correlation matrix generated by PEARSON or by a previous
REGRESSN is an appropriate input matrix for REGRESSN.
The input matrix dictionary must contain variable numbers and names. The matrix must contain correla-
tions, means and standard deviations. Both the means and standard deviations are used.
27.8 Setup Structure
$RUN REGRESSN
$FILES
File specifications
$RECODE (optional with raw data input; unavailable with matrix input)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Definition of dummy variables (conditional)
5. Regression specifications (repeated as required)
$DICT (conditional)
Dictionary for raw data input
$DATA (conditional)
Data for raw data input
$MATRIX (conditional)
Matrix for correlation matrix input
Files:
FT02 output correlation matrix
FT09 input correlation matrix
(if $MATRIX not used and INPUT=MATRIX)
DICTxxxx input dictionary (if $DICT not used and INPUT=RAWDATA)
DATAxxxx input data (if $DATA not used and INPUT=RAWDATA)
DICTyyyy output residuals dictionary ) one set for each
DATAyyyy output residuals data ) residuals file requested
PRINT results (default IDAMS.LST)
27.9 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 and 5 below.
1. Filter (optional). Selects a subset of cases to be used in the execution. Available only with raw data
input.
Example: INCLUDE V3=5
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: REGRESSION ANALYSIS
3. Parameters (mandatory). For selecting program options.
Example: IDVAR=V1 MDHANDLING=100
INPUT=RAWDATA/MATRIX
RAWD The input data are in the form of a Data file described by an IDAMS dictionary.
MATR The input data are correlation coefficients in the form of an IDAMS square matrix.
Parameters only for raw data input
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
MDHANDLING=0/n
The number of missing data cases to be allowed before termination. A case is counted missing if
it has missing data in any of the variables in the regression equations.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
CATE
Specify CATE if a definition of dummy variables is provided.
IDVAR=variable number
Variable to be output or printed as case ID if residuals dataset is requested. The ID variable
should not be included in any variable list.
WRITE=MATRIX
Write the correlation matrix computed from the raw data input to an output file.
PRINT=(CDICT/DICT, XMOM, XPRODUCTS, MATRIX)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
XMOM Print the matrix of residual sums of squares and cross-products.
XPRO Print the matrix of total sums of squares and cross-products.
MATR Print the correlation matrix.
Parameters for correlation matrix input
CASES=n
Set CASES equal to the number of cases used to create the input matrix. This number is used in
calculating the F-level.
No default; must be supplied when a correlation matrix is input.
PRINT=MATRIX
Print the correlation matrix.
4. Definition of dummy variables (conditional: if CATE was specified as a parameter). The REGRESSN
program can transform a categorical variable to a set of dummy variables. To have a variable
treated as categorical, the user must a) include the CATE parameter in the parameter list and b) specify
the variables to be considered categorical and the codes to be used. Each categorical variable to be
transformed is followed by the codes to be used enclosed in brackets. For each variable, any codes not
listed will be excluded from the construction. Note: The list of codes should not be exhaustive, i.e. not
all existing codes should be listed, or else a singular matrix will result.
Example: V100(5,6,1), V101 (1-6)
Codes 5, 6 and 1 of variable 100 will be represented in the regression as dummy variables, along
with codes 1 through 6 of variable 101.
A variable specified in the definition of dummy variables, when used in predictor (VARS), partials
(PARTIALS) or forced (FORCE) variables lists for stepwise regression, will refer to the set of dummy
variables created from that variable. In stepwise regressions, the codes of such a variable will be
entered or excluded together, and marginal R-squares and F-ratios will be calculated for all codes
of the variable together as well as for codes individually. A variable used in a definition of dummy
variables may not be used as a dependent variable.
5. Regression specifications. The coding rules are the same as for parameters. Each set of regression
parameters must begin on a new line.
Example: DEPV=V5 METH=STEP FORCE=(V7) VARS=(V7,V16,V22,V37-V47,R14)
METHOD=STANDARD/STEPWISE/DESCENDING
STAN A standard regression will be done.
STEP A stepwise regression will be done.
DESC A descending stepwise regression will be done.
DEPVAR=variable number
Variable number of dependent variable.
No default.
VARS=(variable list)
The independent variables to be used in this analysis.
No default.
PARTIALS=(variable list)
Compute and print a partial correlation matrix with the specified variables removed from the
independent variable list.
Default: No partials.
FORCE=(variable list)
Force the variables listed to enter into the stepwise regression (METH=STEP) or to remain in
the descending stepwise regression (METH=DESC).
Default: No forcing.
FINRATIO=.001/n
The F-ratio value below which a variable will not be entered in a stepwise procedure; this is the
F-ratio to enter. The decimal point must be entered.
FOUTRATIO=0.0/n
The F-ratio value above which a variable must remain in order to continue in a stepwise procedure;
this is the F-ratio to remove. The decimal point must be entered.
CONSTANT=0
For raw data input only.
The constant term is required to equal zero and no constant term will be estimated.
Default: A constant term will be estimated.
WRITE=RESIDUALS
Residuals are to be written out as an IDAMS dataset.
OUTFILE=OUT/yyyy
Applicable only if WRITE=RESI specied.
A 1-4 character ddname suffix for the residuals output Dictionary and Data files. If outputting
residuals from more than 1 analysis, the default ddname, OUT, may be used only once.
PRINT=(STEP, RESIDUALS, ERESIDUALS, INVERSE)
STEP Applies to the stepwise regression only: print marginal R-squares for all predictors in
each step.
RESI Print residuals in input case sequence order and Durbin-Watson statistic.
ERES Print residuals, except for missing data, in error magnitude order, provided there are
fewer than 1000 cases.
INVE Print the inverse correlation matrix.
27.10 Restrictions
1. With raw data input, there may be as many as 99 or 100 (depending on whether a weight variable is
used) distinct variables used in any single regression equation; the total number of variables across all
analyses, including Recode variables, weight variable and ID variable, can be no more than 200.
2. With matrix input, the matrix can be 200 x 200, and up to 100 variables may be used in any single
regression equation.
3. FINRATIO must be greater than or equal to FOUTRATIO.
4. Residuals may be listed in ascending order of residual value only if there are fewer than 1000 cases.
5. A variable specified in a definition of dummy variables may not be used as a dependent variable.
6. A maximum of 12 dummy variables can be defined from one categorical variable.
7. If the ID variable is alphabetic with width > 4, only the first four characters are used.
27.11 Examples
Example 1. Standard regression with five independent variables using an IDAMS correlation matrix as
input.
$RUN REGRESSN
$FILES
FT09 = A.MAT input Matrix file
$SETUP
STANDARD REGRESSION - USING MATRIX AS INPUT
INPUT=MATR CASES=1460
DEPV=V116 VARS=(V18,V36,V55-V57)
Example 2. Standard regression with six independent variables and with two variables each with 3
categories transformed to 6 dummy variables; raw data are used as input; residuals are to be computed and
written into a dataset (cases are identified by variable V2).
$RUN REGRESSN
$FILES
PRINT = REGR2.LST
DICTIN = STUDY.DIC input Dictionary file
DATAIN = STUDY.DAT input Data file
DICTOUT = RESID.DIC Dictionary file for residuals
DATAOUT = RESID.DAT Data file for residuals
$SETUP
STANDARD REGRESSION - USING RAW DATA AS INPUT AND WRITING RESIDUALS
MDHANDLING=50 IDVAR=V2 CATE
V5(1,5,6),V6(1-3)
DEPV=V116 WRITE=RESI VARS=(V5,V6,V8,V13,V75-V78)
Example 3. Two regressions: one standard and one stepwise using raw data as input.
$RUN REGRESSN
$FILES
DICTIN = STUDY.DIC input Dictionary file
DATAIN = STUDY.DAT input Data file
$SETUP
TWO REGRESSIONS
PRINT=(XMOM,XPROD)
DEPV=V10 VARS=(V101-V104,V35) PRINT=INVERSE
DEPV=V11 METHOD=STEP PRINT=STEP VARS=(V1,V3,V15-V18,V23-V29)
Example 4. Two-stage regression; the first stage uses variables V2-V6 to estimate values of the dependent
variable V122; in the 2nd stage, two additional variables V12, V23 are used to estimate the predicted values
of V122, i.e. V122 with the effects of V2-V6 removed.
In the first regression, predicted values for the dependent variable (V122) are computed and written to the
residuals file (OUTB) as variable V3. MERGE is then used to merge this variable with the variables from
the original file that are required in the second stage. The output dataset from MERGE (a temporary file
so it need not be defined) will contain the 5 variables from the build list, numbered V1 to V5 where A12
and A23 (to be used as predictors in the second stage) become V2 and V3, A122, the original dependent
variable, becomes V4, and B3, the variable giving predicted values of V122 becomes V5. This output file is
then used as input to the second stage regression.
$RUN REGRESSN
$FILES
PRINT = REGR4.LST
DICTIN = STUDY.DIC input Dictionary file
DATAIN = STUDY.DAT input Data file
DICTOUTB = RESID.DIC Dictionary file for residuals
DATAOUTB = RESID.DAT Data file for residuals
$SETUP
TWO STAGE REGRESSION - FIRST STAGE
MDHANDLING=100 IDVAR=V1
DEPV=V122 WRITE=RESI OUTF=OUTB VARS=(V2-V6)
$RUN MERGE
$SETUP
MERGING PREDICTED VALUE (V3 IN RES FILE) INTO DATA FILE
MATCH=INTE INAF=IN INBF=OUTB
A1=B1
A1,A12,A23,A122,B3
$RUN REGRESSN
$SETUP
TWO STAGE REGRESSION - SECOND STAGE
MDHANDLING=100 INFI=OUT
DEPV=V5 VARS=(V2,V3)
Chapter 28
Multidimensional Scaling (MDSCAL)
28.1 General Description
MDSCAL is a non-metric multidimensional scaling program for the analysis of similarities. The program,
which operates on a matrix of similarity or dissimilarity measures, is designed to find, for each dimensionality
specified, the best geometric representation of the data in the space.
The uses of non-metric multidimensional scaling are similar to those of factor analysis, e.g. clusters of
variables can be spotted, the dimensionality of the data can be discovered, and dimensions can sometimes be
interpreted. The CONFIG program can be used to perform analysis on an MDSCAL output conguration.
Input conguration. Normally an internally created arbitrary starting conguration is used to begin the
computation. The user may, however, supply an initial conguration. There are several possible reasons for
providing a starting conguration. The user may have theoretical reasons for beginning with a certain con-
guration; one may wish to perform further iteration on a conguration which is not yet close enough to the
best conguration; or, to save computing time, one may wish to provide a higher dimensional conguration
as a starting point for a lower dimensional conguration.
Scaling algorithm. The program starts with an initial configuration, either generated arbitrarily or
supplied by the user, and iterates (using a procedure of the steepest descent type) over successive trial
configurations, each time comparing the rank order of inter-point differences in the trial configuration with
the rank order of the corresponding measure in the data. A badness of fit measure (stress coefficient)
is computed after each iteration and the configuration is rearranged accordingly to improve the fit to the
data, until, ideally, the rank order of distances in the configuration is perfectly monotonic with the rank
order of dissimilarities given by the data; in that case, the stress will be zero. In practice, the scaling
computation stops, in any given number of dimensions, because the stress reaches a sufficiently small value
(STRMIN), the scale factor (magnitude) of the gradient reaches a sufficiently small value (SFGRMN), the
stress has been improving too slowly (SRATIO), or the preset maximum number of iterations is reached
(ITERATIONS). The program stops on whichever condition comes first. The same procedure is repeated
for the next lower dimensionality using the previous results as the initial configuration, until a specified
minimum number of dimensions is reached. During computation, the cosine of the angle between successive
gradients plays an important role in several ways; optionally, two internal weighting parameters may be
specified (see parameters COSAVW and ACSAVW).
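The four stopping rules described above can be illustrated with a small Python sketch. It is not the MDSCAL implementation: the function and class names are hypothetical, the order in which the rules are tested is an assumption, and the stress ratio is assumed to be the current stress divided by the previous stress. The default values are those listed for STRMIN, SFGRMN, SRATIO and ITERATIONS in the parameter descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopSettings:
    strmin: float = 0.01    # STRMIN default
    sfgrmn: float = 0.0     # SFGRMN default
    sratio: float = 0.999   # SRATIO default
    iterations: int = 50    # ITERATIONS default


def stop_reason(iteration, stress, prev_stress, sfgr, s=StopSettings()):
    """Return a label for the first stopping rule that applies, else None.

    Assumptions: the order of the checks, and the stress ratio being
    the current stress divided by the previous stress.
    """
    if stress <= s.strmin:
        return "stress small enough (STRMIN)"
    if sfgr <= s.sfgrmn:
        return "gradient small enough (SFGRMN)"
    if prev_stress is not None and prev_stress > 0 \
            and stress / prev_stress >= s.sratio:
        return "stress improving too slowly (SRATIO)"
    if iteration >= s.iterations:
        return "iteration limit reached (ITERATIONS)"
    return None
```

Calling stop_reason once per iteration with the values printed in the history of the computation (Stress, SFGR) gives a rough idea of why a run ended; the labels returned here are illustrative and are not the program's own termination remarks.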
Dimensionality and metric. Solutions may be obtained in 2 to 10 dimensions. The user controls the
dimensionality of the configurations obtained by specifying the maximum and minimum number of dimensions
desired, and the difference between the dimensionality of the successive solutions produced (see parameters
DMAX, DMIN, and DDIF). The user also specifies, using parameter R, whether the distance metric should
be Euclidean (R=2), the usual case, or some other Minkowski r-metric.
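As an illustration of these two ideas, the following Python sketch computes a Minkowski r-metric distance between two points and lists the dimensionalities that a given DMAX, DMIN and DDIF would produce. The function names are hypothetical and the code is only a sketch of the standard definitions, not of the program's internals.

```python
def minkowski(p, q, r=2.0):
    """Minkowski r-metric distance between points p and q (r >= 1.0)."""
    if r < 1.0:
        raise ValueError("r must be >= 1.0")
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


def dimension_schedule(dmax=2, dmin=2, ddif=1):
    """Dimensionalities scaled in turn: from DMAX down in steps of DDIF,
    stopping when DMIN is reached or would be passed."""
    return list(range(dmax, dmin - 1, -ddif))


print(minkowski((0, 0), (3, 4), r=2.0))   # Euclidean distance
print(minkowski((0, 0), (3, 4), r=1.0))   # city block distance
print(dimension_schedule(dmax=5, dmin=2, ddif=2))
```

With DMAX=5, DMIN=2 and DDIF=2 the schedule contains 5 and 3 only, since a further step would pass the minimum.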
Stress. Stress is a measure of how well the configuration matches the data. The user may choose between
two alternate formulas for computing the stress coefficient: either the stress is standardized by the sum of
the squared distances (SQDIST) or the stress is standardized by the sum of the squared
deviations from the mean (SQDEV). In many situations, the configurations reached by the two formulas will
not be substantially different. Larger values of stress result from the second formula (SQDEV) for the same degree of fit.
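The two standardizations can be compared with a short Python sketch. It uses the usual definitions of Kruskal's stress formulas 1 and 2 (square root of the sum of squared differences between the distances and the fitted monotone values, divided by the chosen standardizing sum); that MDSCAL computes exactly these quantities is an assumption, and the function name is hypothetical. DIST and DHAT correspond to the columns printed in the summary of results.

```python
import math


def stress(dist, dhat, formula="SQDIST"):
    """Stress for distances `dist` and fitted monotone values `dhat`."""
    numerator = sum((d - h) ** 2 for d, h in zip(dist, dhat))
    if formula == "SQDIST":
        # standardize by the sum of the squared distances
        denominator = sum(d * d for d in dist)
    elif formula == "SQDEV":
        # standardize by the sum of squared deviations from the mean distance
        mean = sum(dist) / len(dist)
        denominator = sum((d - mean) ** 2 for d in dist)
    else:
        raise ValueError("formula must be SQDIST or SQDEV")
    return math.sqrt(numerator / denominator)


dist = [1.0, 2.0, 3.0, 4.0]
dhat = [1.0, 2.5, 2.5, 4.0]
print(stress(dist, dhat, "SQDIST"), stress(dist, dhat, "SQDEV"))
```

Because the sum of squared deviations from the mean can never exceed the sum of squared distances, the SQDEV value is at least as large as the SQDIST value for the same configuration.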
Ties in input coefficients. There are two alternative methods for handling ties among the input data
values; the corresponding distances can be required to be equal (TIES=EQUAL) or they can be allowed to
differ (TIES=DIFFER). When there are few ties, it makes little difference which approach is used. When
there are a great many ties it does make a difference, and the context must be considered in making the
choice.
28.2 Standard IDAMS Features
Case and variable selection. Filtering of cases must be performed at the time the matrix is created, not
in MDSCAL. The parameter VARS allows the computation to be performed on subsets of the matrix rather
than on the entire matrix.
Transforming data. Use of Recode statements is not applicable in MDSCAL. Data transformations must
be performed at the time the input matrix is created.
Weighting data. Weighting in the usual sense (weighting cases to correct for different sampling rates
or different levels of aggregation) must be accomplished before using MDSCAL; such weighting must be
incorporated in the input data matrix. There is a weight option of a quite different sort available in MDSCAL
(see parameter INPUT=WEIGHTS). It may be used to assign weights to cells of the input matrix; the user
supplies a matrix of values which are to be used as weights for the corresponding elements in the input
matrix.
Treatment of missing data. Missing data for individual cases must be accounted for at the time the input
data matrix is created, not in MDSCAL. If, after the matrix has been created, an entry in the matrix is
missing, i.e. contains a missing data code, there is a possibility of processing it in MDSCAL: the MDSCAL
cutoff option (see parameter CUTOFF) can be used to exclude from analysis missing data values if these
are less than valid data values. MDSCAL has no option for recognizing missing data values that are large
numbers (such as 99.99901, the missing data code output by PEARSON). If large missing data values do
exist, these should be edited to small numbers. If one particular variable has many missing entries, possibly
it should be dropped from the analysis.
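The editing step suggested above (turning large missing data codes into small numbers so that CUTOFF can discard them) might look as follows in Python when the matrix elements are prepared outside IDAMS. The threshold, the replacement value and the function names are assumptions chosen for illustration; the cutoff of -1.01 is the value the manual recommends for coefficients ranging from -1.0 to 1.0.

```python
def recode_large_missing(values, md_threshold=99.999, replacement=-9.0):
    """Replace large missing data codes (e.g. 99.99901) by a small number."""
    return [replacement if v >= md_threshold else v for v in values]


def kept_by_cutoff(values, cutoff=0.0):
    """Values that survive CUTOFF: those strictly greater than the cutoff."""
    return [v for v in values if v > cutoff]


coefficients = [0.35, 99.99901, -0.20]
edited = recode_large_missing(coefficients)
print(edited)
print(kept_by_cutoff(edited, cutoff=-1.01))
```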
28.3 Results
Input matrix. (Optional: see the parameter PRINT).
Input weights. (Optional: see the parameter PRINT).
Input configuration. If a starting configuration is supplied, it is always printed.
History of the computation. For each solution, the program prints a complete history of computations,
reporting the stress value and its ancillary parameters for each iteration:
Iteration the iteration number
Stress the current value of the stress
SRAT the current value of the stress ratio
SRATAV the current stress ratio average (it is an exponentially weighted average)
CAGRGL the cosine of the angle between the current gradient and the previous gradient
COSAV the current value of the average cosine of the angle between successive gradients
(a weighted average)
ACSAV the current value of the average absolute value of the cosine of the angle
between successive gradients (a weighted average)
SFGR the length (more properly, the scale factor) of the gradient
STEP the step size.
Reason for termination. When computation is terminated, the reason is indicated by one of the remarks:
Minimum was achieved, Maximum number of iterations were used, Satisfactory stress was reached,
or Zero stress was reached.
Final configuration. For each solution, the Cartesian coordinates of the final configuration are printed.
Sorted configuration. (Optional: see the parameter PRINT). For each solution, the projections of points
of the final configuration are sorted separately on each dimension into ascending order and printed.
Summary. For each solution, the original data values are sorted and printed together with their
corresponding final distances (DIST) and the hypothetical distances required for a perfect monotonic fit (DHAT).
28.4 Output Configuration Matrix
As the final configuration for each dimensionality is calculated, it may be output as an IDAMS rectangular
matrix. The configuration is centered and normalized. The rows represent variables and the columns
represent dimensions. The matrix elements are written in 10F7.3 format. Dictionary records are generated.
This matrix may be submitted as a configuration input for another execution of MDSCAL or it may be
input to another program such as CONFIG for additional analysis.
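For readers unfamiliar with Fortran format codes, 10F7.3 means up to ten numbers per record, each right-justified in a field of 7 characters with 3 decimal places. The Python sketch below formats one row of coordinates in that layout; the function name is hypothetical and the sketch covers only the numeric fields, not the dictionary records or any other part of the IDAMS matrix file.

```python
def format_10f7_3(row):
    """Format up to 10 coordinates as fixed-width F7.3 fields."""
    if len(row) > 10:
        raise ValueError("a 10F7.3 record holds at most 10 values")
    return "".join(f"{value:7.3f}" for value in row)


print(repr(format_10f7_3([0.5, -1.25])))
```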
28.5 Input Data Matrix
The usual input to MDSCAL is an IDAMS square matrix (see Data in IDAMS chapter). This matrix
is the upper-right-half matrix with no diagonal and it is defined by the parameter INPUT=STANDARD.
TABLES and PEARSON generate matrices suitable for input to MDSCAL. Means and standard deviations
are not used but appropriate (dummy) records must be supplied. MDSCAL will accept matrices in other
formats than the upper-right triangle with no diagonal. However, such matrices must contain the dictionary
portion of an IDAMS square matrix and must have records containing pseudo means and standard deviations
at the end.
The following INPUT parameters indicate the exact format of matrix being input:
STAN upper-right triangle, no diagonal
STAN, DIAG upper-right triangle, with diagonal
LOWER, DIAG lower-left triangle, with diagonal
LOWER lower-left triangle, no diagonal
SQUARE full square matrix with diagonal.
The measures contained in the data matrix may either be measures of similarity (such as correlations) or
dissimilarities. Although the input to MDSCAL is usually a matrix of correlation coefficients (e.g. a matrix
of gammas or a matrix of Pearson r's), the input matrix may contain any measure that makes sense as a
measure of proximity. Because non-metric scaling uses only ordinal properties of the data, nothing need
be assumed about the quantitative or numerical properties of the data. There should be, at the very least,
twice as many variables as dimensions.
28.6 Input Weight Matrix
If a weight matrix is supplied, it must be in exactly the same format as the input data matrix. The parameter
INPUT=(STAN/LOWE/SQUA, DIAG) applies to the weight matrix as well as to the data matrix. The
dictionary for the weight matrix should be the same as for the input data matrix. Means and standard
deviations are not used, but corresponding dummy lines should be supplied.
This matrix contains values, in one-to-one correspondence with elements of the data matrix, which are to
be used as weights for the data. These values are used in conjunction with the value for the parameter
CUTOFF when applied to the data. If a data value is greater than the cutoff value, but the corresponding
weight value is less than or equal to zero, an error condition is signaled. Likewise, if the data value is less
than or equal to the cutoff value, and the corresponding weight value is greater than zero, an error condition
is set. If either of these inconsistencies occurs, the execution terminates.
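Since either inconsistency stops the execution, it can be worth checking a weight matrix against the data matrix before the run. The following Python sketch applies the two rules stated above to matching lists of data and weight elements; the function name and the returned message texts are hypothetical.

```python
def find_weight_conflicts(data, weights, cutoff=0.0):
    """Return (index, reason) pairs for elements that break the rules."""
    conflicts = []
    for i, (d, w) in enumerate(zip(data, weights)):
        if d > cutoff and w <= 0:
            conflicts.append((i, "data above cutoff but weight <= 0"))
        elif d <= cutoff and w > 0:
            conflicts.append((i, "data at or below cutoff but weight > 0"))
    return conflicts


data = [0.50, -2.00, 0.30]
weights = [1.0, 0.0, 0.0]
print(find_weight_conflicts(data, weights, cutoff=-1.01))
```

An empty list means the two matrices are consistent with the chosen CUTOFF value.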
28.7 Input Configuration Matrix
The input configuration must be in the format of an IDAMS rectangular matrix. See Data in IDAMS
chapter.
It provides a starting configuration to be used in the computations. The rows should represent variables
and the columns dimensions. It is usually produced by a previous execution of MDSCAL and is submitted
in order that a new execution may continue where the previous one left off.
The matrix must contain at least as many dimensions as the value given for the parameter DMAX.
Note: If a variable list (VARS) is specified, MDSCAL uses the first n rows of the input configuration where
n is the number of variables in the list, without checking the variable numbers.
28.8 Setup Structure
$RUN MDSCAL
$FILES
File specifications
$SETUP
1. Label
2. Parameters
$MATRIX (conditional)
Data matrix
Weight matrix
Starting configuration matrix
(Note: Not all of the matrices need be included here; however, if
more than one matrix is included, they must be in the above order).
Files:
FT02 output configuration matrix
FT03 input weight matrix if INPUT=WEIGHTS specified (omit if $MATRIX used)
FT05 input starting configuration if INPUT=CONFIG specified
(omit if $MATRIX used)
FT08 input data matrix (omit if $MATRIX used)
PRINT results (default IDAMS.LST)
28.9 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-2 below.
1. Label (mandatory). One line containing up to 80 characters to label the results.
Example: MDSCAL EXECUTION ON DATASET X4952
2. Parameters (mandatory). For selecting program options.
Example: DMAX=5 ITER=75 WRITE=CONFIG
INPUT=(STANDARD/LOWER/SQUARE, DIAGONAL, WEIGHTS, CONFIG)
STAN The input is an IDAMS square matrix, i.e. off-diagonal, upper-right-half matrix.
LOWE The input matrix is a lower-left-half matrix.
SQUA The input matrix is a full square matrix.
DIAG The input matrix has the diagonal elements.
WEIG A matrix of weight values is being supplied.
CONF The starting configuration matrix is being supplied.
VARS=(variable list)
List of variables in the matrix on which analysis is to be performed.
Default: The entire input matrix is used.
FILE=(DATA, WEIGHTS, CONFIG)
DATA The input data matrix is in a file.
WEIG The weight matrix is in a file.
CONF The input configuration matrix is in a file.
Default: All matrices are assumed to follow a $MATRIX command in the order data, weight,
configuration.
COEFF=SIMILARITIES/DISSIMILARITIES
SIMI Large coefficients in the data matrix indicate that points are similar or close.
DISS Large coefficients indicate that points are dissimilar or far.
DMAX=2/n
The dimension maximum: scaling starts with the space of maximum dimension.
DMIN=2/n
The dimension minimum: scaling proceeds until it reaches or would pass the minimum dimension.
DDIF=1/n
The dimension difference: scaling proceeds from maximum dimension to minimum dimension by
steps of the dimension difference.
R=2.0/n
Indicate which Minkowski r-metric is to be used. Any value >= 1.0 can be used.
R=1.0 City block metric.
R=2.0 Ordinary Euclidean distance.
CUTOFF=0.0/n
Data values less than or equal to n are discarded. If the legitimate values of the input coefficients
range from -1.0 to 1.0, CUTOFF=-1.01 should be used.
TIES=DIFFER/EQUAL
DIFF Unequal distances corresponding to equal data values do not contribute to the stress
coefficient and no attempt is made to equalize these distances.
EQUA Unequal distances corresponding to equal data values do contribute to the stress and
there is an attempt to equalize these distances.
ITERATIONS=50/n
The maximum number of iterations to be performed in any given number of dimensions. This
maximum is a safety precaution to control execution time.
STRMIN=.01/n
Stress minimum. The scaling procedure will stop if the stress reaches the minimum value.
SFGRMN=0.0/n
Minimum value of the scale factor of the gradient. The scaling procedure will stop if the magnitude
of the gradient reaches the minimum value.
SRATIO=.999/n
The stress ratio. Scaling procedure stops if the stress ratio between successive steps reaches n.
ACSAVW=.66/n
The weighting factor for the average absolute value of the cosine of the angle between successive
gradients.
COSAVW=.66/n
The weighting factor for the average cosine of the angle between successive gradients.
STRESS=SQDIST/SQDEV
SQDI Compute the stress using the standardization by the sum of the squared distances.
SQDE Compute the stress using the standardization by the sum of the squared deviations
from the mean.
WRITE=CONFIG
Output the final configuration of each solution into a file.
PRINT=(MATRIX, SORTCONF, LONG/SHORT)
MATR Print the input data matrix and the weight matrix if one is supplied.
SORT Sort each dimension of the final configuration and print it.
LONG Print matrices on long lines.
SHOR Print matrices on short lines.
28.10 Restrictions
1. The capacity of the program is 1800 data points (e.g. 1800 elements of the similarity or dissimilarity
matrix). This is equivalent to a triangle of a 60 x 60 matrix or to a 42 x 42 square matrix.
2. Variables may be scaled in up to 10 dimensions.
3. The starting configuration matrix may have a maximum of 60 rows and 10 columns.
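The arithmetic behind restriction 1 can be checked with a few lines of Python. The element counts are the standard ones for triangular and square layouts; the function names are hypothetical, and how the program itself counts elements against the 1800 limit (in particular for layouts with a diagonal) is an assumption.

```python
CAPACITY = 1800  # maximum number of data points stated in restriction 1


def n_elements(n, layout="STAN", diag=False):
    """Number of stored elements for an n x n matrix in a given layout."""
    if layout == "SQUA":
        return n * n
    # STAN (upper-right) and LOWE (lower-left) triangles hold the same count
    return n * (n + 1) // 2 if diag else n * (n - 1) // 2


def largest_order(layout="STAN", diag=False):
    """Largest n whose element count does not exceed CAPACITY."""
    n = 1
    while n_elements(n + 1, layout, diag) <= CAPACITY:
        n += 1
    return n


print(n_elements(60, "STAN"), largest_order("STAN"))
print(n_elements(42, "SQUA"), largest_order("SQUA"))
```

A 60 x 60 triangle without diagonal holds 1770 elements and a 42 x 42 square holds 1764, which is why both fit within the limit while the next larger orders do not.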
28.11 Example
Generation of an output configuration matrix; the input data matrix is in standard IDAMS form and in a
file; there is neither input weight matrix nor input configuration matrix; 20 iterations are requested; analysis
is to be performed on a subset of variables.
$RUN MDSCAL
$FILES
FT02 = MDS.MAT output configuration Matrix file
FT08 = ABC.COR input data Matrix file
$SETUP
MULTIDIMENSIONAL SCALING
ITER=20 WRITE=CONFIG FILE=DATA VARS=(V18-V36)
Chapter 29
Multiple Classification Analysis
(MCA)
29.1 General Description
MCA examines the relationships between several predictor variables and a single dependent variable and
determines the effects of each predictor before and after adjustment for its inter-correlations with other
predictors in the analysis. It also provides information about the bivariate and multivariate relationships
between the predictors and the dependent variable. The MCA technique can be considered the equivalent of
a multiple regression analysis using dummy variables. MCA, however, is often more convenient to use and
interpret. MCA also has an option for one-way analysis of variance.
MCA assumes that the effects of the predictors are additive i.e. that there are no interactions between
predictors. It is designed for use with predictor variables measured on nominal, ordinal, and interval scales.
It accepts an unequal number of cases in the cells formed by cross-classification of the predictors.
Alternatives to MCA are REGRESSN and ONEWAY. REGRESSN provides a general multiple regression
capability. ONEWAY performs a one-way analysis of variance. The advantage of MCA over REGRESSN is
that it accepts predictor variables in as weak a form as nominal scales, and it does not assume linearity of
the regression. The advantage over ONEWAY is that in MCA the maximum code for a control variable
in a one-way analysis is 2999 (instead of 99 in ONEWAY).
Generating a residuals dataset. Residuals may be computed and output as a Data file described by an
IDAMS dictionary. See the Output Residuals Dataset(s) section for details on the content. The option is
not available if only one predictor is specified.
Iterative procedures. MCA uses an iteration algorithm for approximating the coefficients constituting
the solutions to the set of normal equations. The iteration algorithm stops when the coefficients being
generated are sufficiently accurate. This involves setting a tolerance and specifying a test for determining
when that tolerance has been met (see analysis parameters CRITERION and TEST). Four convergence
tests are available. If the coefficients do not converge within the limits set by the user, the program prints
out its results on the basis of the last iteration. The number of useful iterations depends somewhat on the
number of predictors used in the analysis and on the fraction specified for tolerance. If there are fewer than
10 predictors, it has usually been found satisfactory to specify 10 as the maximum number of iterations.
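The convergence tests selected by TEST and CRITERION can be sketched in Python as follows. The sketch follows the wording of the TEST options in the analysis specifications (PCTMEAN, CUTOFF, PCTRATIO, NONE); the function name is hypothetical, and the use of the largest absolute change over all coefficients is an assumption about how "the change in all coefficients" is measured.

```python
def converged(prev, curr, test="PCTMEAN", criterion=0.005,
              grand_mean=None, sd=None):
    """Decide whether coefficients changed little enough between iterations."""
    if test == "NONE":
        return False  # iterate until the maximum number of iterations
    max_change = max(abs(c - p) for p, c in zip(prev, curr))
    if test == "PCTMEAN":
        limit = criterion * abs(grand_mean)        # fraction of the grand mean
    elif test == "CUTOFF":
        limit = criterion                          # absolute value
    elif test == "PCTRATIO":
        limit = criterion * abs(sd / grand_mean)   # fraction of sd / mean
    else:
        raise ValueError("unknown TEST option")
    return max_change < limit


prev = [1.000, 2.000]
curr = [1.001, 2.002]
print(converged(prev, curr, "PCTMEAN", 0.005, grand_mean=10.0))
print(converged(prev, curr, "CUTOFF", 0.001))
```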
Detection and treatment of interactions. The program assumes that the phenomena being examined
can be understood in terms of an additive model.
If, on a priori grounds, particular variables are suspected to be interacting, MCA itself can be used to
determine the extent of the interaction as follows. If one predictor is specified, MCA performs a one-way
analysis of variance. Such an analysis can assist in detecting and eliminating predictor interactions. The
complete procedure is as follows (see also Example 3):
1. Determine a set of suspected interacting predictors.
2. Form a single combination variable using these predictors and the Recode statement COMBINE.
3. Perform one MCA analysis using the suspect predictors to get adjusted R squared.
4. Perform one MCA analysis with the combination variable as the control in a one-way analysis of
variance to get adjusted eta squared, which will be greater than or equal to adjusted R squared.
5. Use the difference, adjusted eta squared minus adjusted R squared (the fraction of variance explained which
is lost due to the additivity assumption), as a guide to determine whether the use of a combination
variable in place of the original predictors is justified.
The test for interaction must be based on the same sample as the normal MCA execution. If interactions
are detected, then the combination variable should be used as predictor variable in place of the individual
interacting variables.
29.2 Standard IDAMS Features
Case and variable selection. Cases may be excluded from all analyses in the MCA execution by use of
a standard filter statement. In multiple classification analysis, cases may be excluded also by exceeding the
predictor maximum code. (Note: If a predictor variable from any analysis has a code outside the range 0-31,
the case containing the value is eliminated from all analyses). For any particular analysis, additional cases
may be excluded due to the following conditions:
A case (referred to as an outlier) has a dependent variable value that is more than a specified number
of standard deviations from the mean of the dependent variable. See analysis parameters
OUTDISTANCE and OUTLIERS.
A case has a dependent variable value that is greater than a specified maximum. See analysis parameter
DEPVAR.
A case has missing data for the dependent or weight variable. See the Treatment of missing data
and Weighting data paragraphs below.
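The first two exclusion conditions can be expressed compactly in Python. The sketch below is illustrative only: the function names are hypothetical, the defaults are those given for OUTDISTANCE (5) and for the DEPVAR maximum code (9999999), and the mean and standard deviation are assumed to be already known.

```python
def is_outlier(y, grand_mean, sd, outdistance=5.0):
    """True if y lies more than `outdistance` standard deviations from the mean."""
    return abs(y - grand_mean) > outdistance * sd


def exclusion_reason(y, grand_mean, sd, maxcode=9999999,
                     outliers="INCLUDE", outdistance=5.0):
    """Return why a case would be dropped from one analysis, or None."""
    if y > maxcode:
        return "dependent variable exceeds maximum code"
    if outliers == "EXCLUDE" and is_outlier(y, grand_mean, sd, outdistance):
        return "outlier"
    return None


print(exclusion_reason(100, grand_mean=50, sd=10, outliers="EXCLUDE", outdistance=4))
print(exclusion_reason(85, grand_mean=50, sd=10, outliers="EXCLUDE", outdistance=4))
```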
Transforming data. Recode statements may be used.
Weighting data. A variable can be used to weight the input data; this weight variable may have integer or
decimal values. When the value of the weight variable for a case is zero, negative, missing or non-numeric,
then the case is always skipped; the number of cases so treated is printed. When weighted data are used,
tests of statistical significance must be interpreted with caution.
Treatment of missing data. The MDVALUES analysis parameter is available to indicate which missing
data values, if any, are to be used to check for missing data in the dependent variable. Cases with missing
data in the dependent variable are always excluded. Cases with missing data in predictor variables may be
excluded from all analyses using the filter. (Using the filter to exclude cases with missing data on predictor
variables in multiple classification is only needed if the missing data codes are in the range 0-31; if the value
for any predictor is outside this range, a case is automatically excluded from all analyses requested in the
execution).
29.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Weighted frequency table. (Optional: see the analysis parameter PRINT). An N x M matrix is printed
for each pair of predictors where N=maximum code of row predictor and M=maximum code of column
predictor. The total number of tables is P(P-1)/2 where P is the number of predictors.
Coefficients for each iteration. (Optional: see the analysis parameter PRINT). The coefficients for each
class for each predictor.
Dependent variable statistics. For the dependent variable (Y):
grand mean, standard deviation and coefficient of variation,
sum of Y and sum of Y-squared,
total, explained and residual sums of squares,
number of cases used in the analysis and sum of weights.
Predictor statistics for multiple classification analysis.
For each category of each predictor:
the category (class) code, and label if it exists in the dictionary,
the number of cases with valid data (in raw, weighted and per cent form),
mean (unadjusted and adjusted), standard deviation and coefficient of variation of the dependent
variable,
unadjusted deviation of the category mean from the grand mean, and coefficient of adjustment.
For each predictor variable:
eta and eta squared (unadjusted and adjusted),
beta and beta squared,
unadjusted and adjusted sums of squares.
Analysis statistics for multiple classification analysis. For all predictors combined:
multiple R-squared (unadjusted and adjusted),
coefficient of adjustment for degrees of freedom,
multiple R (adjusted),
listing of betas in descending order of their values.
One-way analysis of variance statistics.
For each category of the predictor:
the category (class) code, and label if it exists in the dictionary,
the number of cases with valid data (in raw, weighted and per cent form),
mean, standard deviation and coefficient of variation of the dependent variable,
sum and percentage of dependent variable values,
sum of dependent variable values squared.
For the predictor variable:
eta and eta squared (unadjusted and adjusted),
coefficient of adjustment for degrees of freedom,
total, between means and within groups sums of squares,
F value (degrees of freedom are printed).
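To make the one-way quantities concrete, the following Python sketch computes the total, between-means and within-groups sums of squares, the unadjusted eta squared and the F value for unweighted data, using the textbook definitions. The function name is hypothetical; the adjustment for degrees of freedom applied by MCA is not reproduced here because its formula is not given in this section.

```python
def oneway_statistics(groups):
    """groups maps a category code to the list of dependent variable values."""
    all_y = [y for ys in groups.values() for y in ys]
    n, k = len(all_y), len(groups)
    grand_mean = sum(all_y) / n
    total_ss = sum((y - grand_mean) ** 2 for y in all_y)
    between_ss = sum(len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
                     for ys in groups.values())
    within_ss = sum((y - sum(ys) / len(ys)) ** 2
                    for ys in groups.values() for y in ys)
    return {
        "total_ss": total_ss,
        "between_ss": between_ss,
        "within_ss": within_ss,
        "eta_squared": between_ss / total_ss,          # unadjusted
        "f_value": (between_ss / (k - 1)) / (within_ss / (n - k)),
        "df": (k - 1, n - k),
    }


result = oneway_statistics({1: [1.0, 2.0, 3.0], 2: [4.0, 5.0, 6.0]})
print(result)
```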
Residuals. (Optional: see the analysis parameter PRINT). The identifying variable, observed value,
predicted value, residual and weight variable, if any, are printed for cases in the order of the input file.
Summary statistics of residuals. If residuals are requested, the program prints the number of cases, sum
of weights, and mean, variance, skewness, and kurtosis of the residual variable.
29.4 Output Residuals Dataset(s)
For each analysis, residuals can optionally be output in a Data file described by an IDAMS dictionary. (See
analysis parameter WRITE=RESIDUALS). A record is output for each case passing the filter containing an
ID variable, an observed value, a calculated value, a residual value for the dependent variable and a weight
variable value, if any. The characteristics of the dataset are as follows:
                         Variable                      Field    No. of      MD
                         No.        Name               Width    Decimals    Codes
(ID variable)            1          same as input      *        0           same as input
(dependent variable)     2          same as input      *        **          same as input
(predicted variable)     3          Predicted value    7        ***         9999999
(residual)               4          Residual           7        ***         9999999
(weight, if weighted)    5          same as input      *        **          same as input
* transferred from input dictionary for V variables or 7 for R variables
** transferred from input dictionary for V variables or 2 for R variables
*** 6 plus no. of decimals for dependent variable minus width of dependent variable; if this is
negative, then 0.
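The rule in the third footnote is easy to misread, so here is a minimal Python sketch of it. The function name is hypothetical; the formula is taken directly from the footnote.

```python
def residual_decimals(dep_width, dep_decimals):
    """Decimals for the predicted value and residual variables (footnote ***)."""
    return max(0, 6 + dep_decimals - dep_width)


# Dependent variable of width 5 with 2 decimals: 6 + 2 - 5
print(residual_decimals(5, 2))
# Dependent variable of width 9 with 1 decimal: the result would be negative
print(residual_decimals(9, 1))
```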
If the observed value or weight variable value is missing or the case was excluded by maximum code checking
or by the outlier criteria, a residual record is output with all variables (except the identifying variable) set
to MD1.
29.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. All variables used for analysis must be numeric;
they may be integer or decimal valued, except for predictors which must have integer values, between 0 and
31 for multiple classification and up to 2999 for one-way analysis of variance. The case ID variable can be
alphabetic.
A large number of cases is necessary for an MCA analysis; a good rule of thumb is that the total number of
categories (i.e. the sum of categories over all predictors) should not exceed 10% of the sample size.
The dependent variable must be measured on an interval scale or be a dichotomy, and it should not be
badly skewed. Predictor variables for MCA must be categorized, preferably with not more than 6 categories.
Although MCA is designed to handle correlated predictors, no two predictors should be so strongly correlated
that there is perfect overlap between any of their categories. (If there is perfect overlap, recoding to combine
categories or filtering to remove offending cases is necessary).
29.6 Setup Structure
$RUN MCA
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Analysis specifications (repeated as required)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output residuals dictionary ) one set for each
DATAyyyy output residuals data ) residuals file requested
PRINT results (default IDAMS.LST)
29.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-4 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V6=2-6
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: TEST RUN FOR MCA
3. Parameters (mandatory). For selecting program options.
Example: *
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
PRINT=CDICT/DICT
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
4. Analysis specifications. The coding rules are the same as for parameters. Each analysis specification
must begin on a new line.
Example: PRINT=TABLES, DEPVAR=(V35,98), ITER=100, CONV=(V4-V8)
DEPVAR=(variable number, maxcode)
Variable number and maximum code for the dependent variable.
No default; the variable number must always be specified.
Default for maxcode is 9999999.
CONVARS=(variable list)
Variables to be used as predictors. If only one variable is given, a one-way analysis of variance
will be performed.
No default.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values for the dependent variable are to be used. See The IDAMS Setup
File chapter.
Note: Missing data values are never checked for predictor variables.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
ITERATIONS=25/n
The maximum number of iterations. Range: 1-99999.
TEST=PCTMEAN/CUTOFF/PCTRATIO/NONE
The convergence test desired.
PCTM Test whether the change in all coefficients from one iteration to the next is below a
specified fraction of the grand mean.
CUTO Test whether the change in all coefficients from one iteration to the next is less than a
specified value.
PCTR Test whether the change in all coefficients from one iteration to the next is less than
a specified fraction of the ratio of the standard deviation of the dependent variable to
its mean.
NONE The program will iterate until the maximum number of iterations has been exceeded.
CRITERION=.005/n
Supply a numeric value which is the tolerance of the convergence test selected. It ranges from 0.0
to 1.0. (Enter the decimal point).
OUTLIERS=INCLUDE/EXCLUDE
INCL Cases with outlying values of the dependent variable will be counted and included in
the analysis.
EXCL Outliers will be excluded from the analysis.
OUTDISTANCE=5/n
Number of standard deviations from its grand mean used to define an outlier for the dependent
variable.
WRITE=RESIDUALS
Write residuals to an IDAMS dataset; apply the MCA model only to the subset of cases passing
missing data, maximum-code, and outlier criteria. Cases to which the MCA model does not apply
are included in the residuals dataset with all values (except the identifying variable value) set to
MD1.
Residuals cannot be obtained if only one predictor variable is specified.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the residuals output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
Note: If more than one analysis requests residual output, the default ddnames DICTOUT and
DATAOUT can only be used for one.
IDVAR=variable number
Number of an identification variable to be included in the residuals dataset.
Default: A variable is created whose values are numbers indicating the sequential position of the
case in the residuals file.
PRINT=(TABLES, HISTORY, RESIDUALS)
TABL Print the pair-wise cross-tabulations of the predictors.
HIST Print the coefficients from all iterations. If the HIST option is not selected and if
the iterations converge, only the final coefficients are printed; if the iterations do not
converge, the coefficients from only the last 2 iterations are printed.
RESI Print residuals in input case sequence order.
29.8 Restrictions
1. The maximum number of input variables, including variables used in Recode statements is 200.
2. Maximum number of predictor (control) variables per analysis is 50.
3. It is not possible to use the maximum number of predictors, each with the maximum number of
categories, in an analysis. If a problem exceeds the available memory, an error message is printed, and
the program skips to the next analysis.
4. Maximum number of analyses per execution is 50.
5. Predictor variables for multiple classification analysis must be categorized, preferably with 6 or fewer
categories. The categories must have integer codes in the range 0-31. Cases with any other value will
be dropped from the analysis.
6. Predictor variable for one-way analysis of variance must be coded in the range 0-2999. Cases with any
other value are dropped from the analysis.
7. If a predictor variable has decimal places, only the integer part is used.
8. If the ID variable is alphabetic with width > 4, only the first four characters are used.
29.9 Examples
Example 1. Multiple classification analysis using four control variables (predictors): V7, V9, V12, V13,
and dependent variable V100; separate analyses will be performed on the whole dataset and on two subsets
of cases.
$RUN MCA
$FILES
PRINT = MCA1.LST
DICTIN = LAB.DIC input Dictionary file
DATAIN = LAB.DAT input Data file
$SETUP
ALL RESPONDENTS TOGETHER
* (default values taken for all parameters)
DEPV=V100 CONV=(V7,V9,V12-V13)
$RUN MCA
$SETUP
INCLUDE V4=21,31-39
ONLY SCIENTISTS
* (default values taken for all parameters)
DEPV=V100 CONV=(V7,V9,V12-V13)
$RUN MCA
$SETUP
INCLUDE V4=41-49
ONLY TECHNICIANS
* (default values taken for all parameters)
DEPV=V100 CONV=(V7,V9,V12-V13)
Example 2. Multiple classification analysis with dependent variable V201 and three predictor variables
V101, V102, V107; data are to be weighted by variable V6; a residuals dataset is produced in which cases are
identified by variable V2; cases with extreme values (outliers of more than 4 standard deviations from the
grand mean) on the dependent variable are to be excluded from analysis. Residuals for the first 20 cases are
listed afterwards using the LIST program.
$RUN MCA
$FILES
PRINT = MCA2.LST
DICTIN = LAB.DIC input Dictionary file
DATAIN = LAB.DAT input Data file
DICTOUT = LABRES.DIC Dictionary file for residuals
DATAOUT = LABRES.DAT Data file for residuals
$SETUP
MULTIPLE CLASSIFICATION ANALYSIS - RESIDUALS WRITTEN INTO A FILE
* (default values taken for all parameters)
DEPV=V201 OUTL=EXCL OUTD=4 IDVA=V2 WRITE=RESI -
CONV=(V101,V102,V107) WEIGHT=V6
$RUN LIST
$SETUP
LISTING START OF RESIDUAL FILE
MAXCASES=20 INFILE=OUT
Example 3. For a dependent variable V52, interactions between three variables (V7, V9, V12) will be
checked. V7 is coded 1,2,9, V9 is coded 1,3,5,9 and V12 is coded 0,1,9 where 9s are missing values. A
single combination variable is constructed using Recode. This involves recoding each variable to a set of
contiguous codes starting from zero and then using the COMBINE function to produce a unique code for
each possible combination of codes for the three separate variables. MCA is performed using the 3 separate
variables as predictors and a one-way analysis of variance is performed using the combination variable as
control. Cases with missing data on the predictors will be excluded. Cases with values greater than 90000
on the dependent variable will also be excluded.
$RUN MCA
$FILES
DICTIN = CON.DIC input Dictionary file
DATAIN = CON.DAT input Data file
$SETUP
EXCLUDE V7=9 OR V9=9 OR V12=9
CHECKING INTERACTIONS
BADD=SKIP
DEPV=(V52,90000) CONVARS=(V7,V9,V12)
DEPV=(V52,90000) CONVARS=R1
$RECODE
R7=V7-1
R9=BRAC(V9,1=0,3=1,5=2)
R1=COMBINE R7(2),R9(3),V12(2)
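The idea behind the combination variable in Example 3 can be sketched in Python: once each variable has contiguous codes starting from zero, a mixed-radix sum gives every combination its own code. The numbering that the IDAMS COMBINE function actually assigns is not stated here; this sketch assumes the first listed variable varies fastest, and the function name combine is hypothetical.

```python
def combine(codes, sizes):
    """Map a tuple of zero-based codes to one unique combination code.

    codes: one code per variable, each in range(size)
    sizes: the number of codes per variable, e.g. (2, 3, 2)
    Assumption: the first variable varies fastest (mixed-radix numbering).
    """
    result, multiplier = 0, 1
    for code, size in zip(codes, sizes):
        if not 0 <= code < size:
            raise ValueError(f"code {code} outside 0..{size - 1}")
        result += code * multiplier
        multiplier *= size
    return result

# Recodes used in Example 3: V7 1,2 -> 0,1 and V9 1,3,5 -> 0,1,2.
R7 = {1: 0, 2: 1}
R9 = {1: 0, 3: 1, 5: 2}

def combination_code(v7, v9, v12):
    """Combination code for one case with no missing data on V7, V9, V12."""
    return combine((R7[v7], R9[v9], v12), (2, 3, 2))
```

The three variables yield 2 x 3 x 2 = 12 distinct codes, which is the number of groups the one-way analysis of variance on R1 would compare.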
Chapter 30
Multivariate Analysis of Variance
(MANOVA)
30.1 General Description
MANOVA performs univariate and multivariate analysis of variance and of covariance, using a general linear
model. Up to eight factors (independent variables) can be used. If more than one dependent variable is
specified, both univariate and multivariate analyses are performed. The program accepts both equal and
unequal numbers of cases in the cells.
MANOVA is the only IDAMS program for multivariate analysis of variance. ONEWAY is recommended for
one-way univariate analysis of variance. MCA handles multifactor univariate problems. It has no limitations
with respect to empty cells, accepts more than 8 predictors, and allows for more than 80 cells. However, the
basic analytic model of MCA is different from that of MANOVA. One important difference is that MCA is
insensitive to interaction effects.
Hierarchical regression model. MANOVA uses a regression approach to analysis of variance. More
particularly, the program employs a hierarchical model. There is an important consequence for the user:
if a MANOVA execution involves more than 1 factor variable, and if there are disproportionate numbers of
cases in the cells formed by the cross-classification of the factors, then consideration must be given to the
order in which factor variables are specified. Disproportionality of subclass numbers confounds the main
effects and the researcher must choose the order in which the confounded effects should be eliminated. When
using MANOVA, this choice is accomplished by the order in which factor variables are specified. When using
standard ordering, variables early in the specification have the effects of later variables removed, e.g. the first
listed effect will be tested with all other main effects eliminated. The general rule is that each test eliminates
effects listed before it on the test name specifications and ignores effects listed afterward. For a standard
two-way analysis, the interaction term is not affected by the order of factor variables; more generally, for
a standard n-way analysis, the n-th order interaction term, and that term only, is unaffected. The problem
exists for both univariate and multivariate analysis.
Contrast option. Two options are available for setting up contrasts (see the factor parameter
CONTRAST). Nominal contrasts are generated by default; they are the customary deviations of row and column
means from the grand mean and the generalization of these for the interaction contrasts. The program can
also generate Helmert contrasts.
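As a small illustration of nominal contrasts for a single factor, the Python sketch below expresses each level mean as a deviation from the grand mean. It assumes the grand mean is the unweighted mean of the level means, which this description does not confirm; the function name nominal_contrasts is hypothetical.

```python
from statistics import fmean

def nominal_contrasts(level_means):
    """Return each level mean as a deviation from the grand mean.

    Assumption: the grand mean is the unweighted mean of the level means.
    """
    grand_mean = fmean(level_means)
    return [m - grand_mean for m in level_means]
```

Under this assumption the deviations always sum to zero, so a factor with r levels carries r-1 independent contrasts.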
Augmentation of within cells sum of squares. It is possible to augment the within cells sum of squares
(error term) using the orthogonal estimates (see the parameter AUGMENT). This allows the program to be
used for Latin squares and for pooling of interaction terms with error.
Reordering and/or pooling orthogonal estimates. A conventional ordering of orthogonal estimates of
effects (e.g. mean, C, B, A, BxC, AxC, AxB, AxBxC for three-factor design) is built into the program for
standard usage. However, orthogonal estimates may be rearranged into some other order (see the parameter
REORDER). Further, it is possible to pool several orthogonal estimates, such as several interaction terms,
for simultaneous testing or to partition the cluster of orthogonal estimates for a given effect into smaller
clusters for separate testing (see the test name parameter DEGFR).
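The reordering can be pictured as a permutation of the conventional list. The Python sketch below reads each REORDER value as a 1-based position in the conventional order; that reading is inferred from the manual's own three-factor REORDER example, and the names CONVENTIONAL and reorder are hypothetical.

```python
# Conventional order of orthogonal estimates for a three-factor design.
CONVENTIONAL = ["mean", "C", "B", "A", "BxC", "AxC", "AxB", "AxBxC"]

def reorder(estimates, order):
    """Rearrange estimates; `order` holds 1-based positions in the old order."""
    if sorted(order) != list(range(1, len(estimates) + 1)):
        raise ValueError("order must list every position exactly once")
    return [estimates[i - 1] for i in order]

# The list (1,4,3,2,7,6,5,8) turns the conventional order into
# mean, A, B, C, AxB, AxC, BxC, AxBxC.
new_order = reorder(CONVENTIONAL, [1, 4, 3, 2, 7, 6, 5, 8])
```

The test name specifications would then have to be written in the same sequence as new_order.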
30.2 Standard IDAMS Features
Case and variable selection. The standard filter is available for selecting cases for the execution.
Dependent variables are selected by the parameter DEPVARS and covariates by the parameter COVARS. Factor
variables are specified on special factor statements.
Transforming data. Recode statements may be used. Note that only integer values (positive or negative)
are accepted for variables used as factors.
Weighting data. Use of weight variables is not applicable.
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. Cases with missing data codes on any of the input
variables (dependent, covariate or factor variables) are excluded. This may result in many excluded cases
and constitutes a potential problem which should be considered when planning an analysis.
30.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Cell means and Ns. For each cell, N is printed and the mean for each dependent variable and covariate.
The means are not adjusted for any covariates. Cells are labelled consecutively starting with 1 1 (for a
2 factor design) regardless of actual codes of factor variables. In indexing the cells, the indices of the last
factor are the minor indices (fastest moving).
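The cell labelling rule can be sketched in Python as follows. The function name cell_labels is hypothetical; the sketch only reproduces the indexing described above (1-based level indices, last factor moving fastest), not the printed layout.

```python
from itertools import product

def cell_labels(factor_codes):
    """Yield (label, codes) per cell; the last factor moves fastest.

    factor_codes: one list of code values per factor, in FACTOR order.
    Labels are 1-based level indices, independent of the actual codes.
    """
    index_ranges = [range(1, len(codes) + 1) for codes in factor_codes]
    for indices in product(*index_ranges):
        codes = tuple(f[i - 1] for f, i in zip(factor_codes, indices))
        yield " ".join(map(str, indices)), codes
```

For a first factor coded 1,2,3 and a second factor coded 21,31, the cells come out as 1 1, 1 2, 2 1, 2 2, 3 1, 3 2, with cell 1 2 holding the cases coded (1, 31).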
Basis of design. This is the design matrix generated by the program. The effects equations are in
columns beginning with the mean effect in column 1. If REORDER was specified, the matrix is printed
after reordering.
Intercorrelations among the coefficients of the normal equations.
Error correlation matrix. In a multivariate analysis of variance, the error term is a variance-covariance
matrix. This is that error term (before adjustment for covariates, if any) reduced to a correlation matrix.
Principal components of the error correlation matrix. The components are in columns. These are
the components of the error term (before adjustment for covariates, if any) of the analysis.
Error dispersion matrix and the standard errors of estimation. This is the error term, a variance-
covariance matrix, for the analysis. The matrix is adjusted for covariates, if any. Each diagonal element
of the matrix is exactly what would appear in a conventional analysis of variance table as the within mean
square error for the variable. Degrees of freedom are adjusted for augmentation if that was requested.
Standard errors of estimation correspond to the square roots of the diagonal elements of the matrix.
For analysis with covariate(s)
Adjusted error dispersion matrix reduced to correlations. This is the error term, a variance-
covariance matrix, after adjustments for covariates, reduced to a correlation matrix.
Summary of regression analysis.
Principal components of the error correlation matrix after covariate adjustments. The
components are in columns. These are the components of the error term of the analysis after adjustment for
covariates.
For univariate analysis
An anova table. Degrees of freedom, sum of squares, mean squares and F-ratios.
For multivariate analysis
The following items are printed for each effect. Adjustments are made for covariates, if any. The order of
effects is exactly opposite to the order of the test name specifications.
F-ratio for the likelihood ratio criterion. Rao's approximation is used. This is a multivariate test of
significance of the overall effect for all the dependent variables simultaneously.
Canonical variances of the principal components of the hypothesis. These are the roots, or
eigenvalues, of the hypothesis matrix.
Coefficients of the principal components of the hypothesis. These are the correlations between the
variables and the components of the hypothesis matrix. The number of nonzero components for any effect
will be the minimum of the degrees of freedom and the number of dependent variables.
Contrast component scores for estimated effects. These are the scores of the hypothesis for the
contrasts used in the design. They are analogous to the column means in a univariate analysis of variance
and can be used in the same manner to locate variables and contrasts which give unusual departures from
the null hypothesis.
Cumulative Bartlett's tests on the roots. This is an approximate test for the remaining roots after
eliminating the first, second, third, etc.
F-ratios for univariate tests. These are exactly the F-ratios which would be obtained in a conventional
univariate analysis.
30.4 Input Dataset
The input is a Data file described by an IDAMS dictionary. All variables must be numeric. The dependent
variable(s) and covariate(s) should be measured on an interval scale or be a dichotomy. The factor variables
may be nominal, ordinal or interval but must have integer values; they are used to designate the proper cell
for the case.
30.5 Setup Structure
$RUN MANOVA
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Factor specifications
(repeated as required; at least one must be provided)
5. Test name specifications
(repeated as required; at least one must be provided)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
30.6 Program Control Statements
Refer to The IDAMS Setup File chapter for further description of the program control statements, items
1-5 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V2=1-4 AND V15=2
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: ANALYSIS OF AGE AND SALARY WITH SEX AND PROFESSION AS FACTORS
3. Parameters (mandatory). For selecting program options.
Example: DEPVARS=(V5,V8) COVA=(V101,V102)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
DEPVARS=(variable list)
A list of variables to be used as dependent variables.
No default.
COVARS=(variable list)
A list of variables to be used as covariates.
AUGMENT=(m,n)
To form error term, within sum of squares will be augmented by the columns m,m+1,m+2,...,n
of the orthogonal estimates matrix.
Default: Within sum of squares will be used as the error term.
REORDER=(list of values)
Reorder the orthogonal estimates according to the list (see the paragraph Reordering and/or
pooling orthogonal estimates above). Note that if reordering of estimates is requested, the order
of the test name specifications should correspond to the new order.
Example: the conventional ordering for a three-factor design can be changed to the order: mean,
A, B, C, AxB, AxC, BxC, AxBxC using REORDER=(1,4,3,2,7,6,5,8).
PRINT=CDICT/DICT
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
4. Factor specifications (at least one must be provided). Up to 8 factor specifications may be supplied.
The coding rules are the same as for parameters. Each factor specification must begin on a new line.
Example: FACTOR=(V3,1,2)
FACTOR=(variable number, list of code values)
Variable to be used as factor, followed by the code values which should be used to designate
proper cell to the case.
CONTRAST=NOMINAL/HELMERT
Specifies the type of contrast to be used in computation.
NOMI Nominal contrasts. Effect means deviated from the grand mean, i.e. M(1)-GM,
M(2)-GM, etc.
HELM Helmert contrasts. Mean of effect 1 deviated from the sum of means 1 through r, where
r levels are involved.
5. Test name specifications (at least one must be provided). These specifications identify the tests that
should be performed. They must be in the correct order. Ordinarily, there will be a specification for
the grand mean, followed by a name specification for each main effect, and finally, a name specification
for each possible interaction. If the design parameters are reordered or the degrees of freedom are
regrouped (see the parameters REORDER and DEGFR), the test name statements must be made
to conform to the modifications. The coding rules are the same as for parameters. Each test name
specification must begin on a new line.
Example: TESTNAME='grand mean'
TESTNAME=test name
Up to 12 character name for each test to be performed. Primes are mandatory if the name contains
non-alphanumeric characters.
DEGFR=n
The natural grouping of degrees of freedom (or hypothesis parameter equations) occurs when
the conventional ordering of statistical tests is used. DEGFR is used only to change the grouping,
e.g. when you want to pool several interaction terms and test them simultaneously or to partition
the degrees of freedom of some effect into two or more parts. When using the DEGFR parameter,
be sure to use it on all test name statements, including a degree of freedom for the grand mean.
Default: Use the natural grouping of degrees of freedom.
30.7 Restrictions
1. The maximum number of dependent variables is 19.
2. The maximum number of covariates is 20.
3. The maximum number of factor specifications is 8.
4. The maximum number of code values on a factor specification is 10.
5. The maximum number of cells is 80.
6. Cells with zero frequencies, with only one case, or with multiple identical cases, sometimes cause
problems; the execution may end prematurely, or it may go to the end but produce invalid F-ratios
and other statistics.
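Before preparing a setup, the numeric limits above can be checked with a few lines of Python. This is only a planning aid under stated assumptions: the cell count is taken as the product of the numbers of code values on the factor specifications, and the function name check_design and its messages are hypothetical.

```python
from math import prod

# Limits taken from the restrictions listed above.
MAX_DEPVARS = 19
MAX_COVARS = 20
MAX_FACTORS = 8
MAX_CODES_PER_FACTOR = 10
MAX_CELLS = 80

def check_design(n_depvars, n_covars, codes_per_factor):
    """Return messages for every limit the planned design exceeds.

    codes_per_factor: number of code values on each FACTOR specification.
    Assumption: number of cells = product of those numbers.
    """
    problems = []
    if n_depvars > MAX_DEPVARS:
        problems.append("too many dependent variables")
    if n_covars > MAX_COVARS:
        problems.append("too many covariates")
    if len(codes_per_factor) > MAX_FACTORS:
        problems.append("too many factors")
    if any(n > MAX_CODES_PER_FACTOR for n in codes_per_factor):
        problems.append("too many code values on a factor")
    if prod(codes_per_factor) > MAX_CELLS:
        problems.append("too many cells")
    return problems
```

A design with factors of 2, 3 and 4 levels gives 24 cells and passes, whereas two factors with 10 levels each give 100 cells and exceed the limit of 80. The check cannot detect empty or single-case cells, which depend on the data.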
30.8 Examples
Example 1. Univariate analysis of variance (V10 is the dependent variable) with two factors represented
by A with codes 1,2,3 and B with codes 21 and 31; nominal contrasts will be used in calculations, and tests
will be performed in a conventional order.
$RUN MANOVA
$FILES
PRINT = MANOVA1.LST
DICTIN = CM-NEW.DIC input Dictionary file
DATAIN = CM-NEW.DAT input Data file
$SETUP
UNIVARIATE ANALYSIS OF VARIANCE
DEPVARS=v10
FACTOR=(V3,1,2,3)
FACTOR=(V8,21,31)
TESTNAME='grand mean'
TESTNAME=B
TESTNAME=A
TESTNAME=AB
Example 2. Multivariate analysis of variance (V11-V14 are dependent variables) with two factors (sex
coded 1,2 and age coded 1,2,3); nominal contrasts will be used in calculations, and tests will be performed
in a conventional order.
$RUN MANOVA
$FILES
as for Example 1
$SETUP
MULTIVARIATE ANALYSIS OF VARIANCE
DEPVARS=(v11-v14)
FACTOR=(V2,1,2)
FACTOR=(V5,1,2,3)
TESTNAME='grand mean'
TESTNAME=age
TESTNAME=sex
TESTNAME='sex & age'
Example 3. Multivariate analysis of variance (V11-V14 are dependent variables) with three factors (A
coded 1,2, B coded 1,2,3, C coded 1,2,3,4); nominal contrasts will be used in calculations, and tests will be
performed in a modified order (mean, A, B, AxB, C, AxC, BxC, AxBxC).
$RUN MANOVA
$FILES
as for Example 1
$SETUP
MULTIVARIATE ANALYSIS OF VARIANCE - TESTS IN MODIFIED ORDER
DEPVARS=(v11-v14) REORDER=(1,4,3,7,2,6,5,8)
FACTOR=(V2,1,2)
FACTOR=(V5,1,2,3)
FACTOR=(V8,1,2,3,4)
TESTNAME=mean
TESTNAME=A
TESTNAME=B
TESTNAME=AxB
TESTNAME=C
TESTNAME=AxC
TESTNAME=BxC
TESTNAME=AxBxC
Chapter 31
One-Way Analysis of Variance
(ONEWAY)
31.1 General Description
ONEWAY is a one-way analysis of variance program. An unlimited number of tables, using various
independent and dependent variable pairs, may be produced in a single execution. Each analysis may be
performed on all the cases or on a subset of cases of the data file; the selection of cases for one analysis is
independent of the selection for other analyses. The term control variable used in ONEWAY is equivalent
to independent variable, predictor or, in analysis of variance terminology, treatment variable.
An alternative to ONEWAY is the MCA program when only one predictor is specified. It permits a maximum
code of 2999 for a control variable, whereas ONEWAY is limited to a maximum code of 99.
31.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. This filter affects all analyses in an execution. In addition, up to two local filters are available for
independently selecting a subset of the data cases for each analysis. If two local filters are used, a case
must satisfy both of them in order to be included in the analysis. Variables are selected for each analysis by
the table parameters DEPVARS and CONVARS. A separate table is produced for each variable from the
DEPVARS list with each variable from the CONVARS list.
Transforming data. Recode statements may be used.
Weighting data. A variable can be used to weight the input data; this weight variable may have integer or
decimal values. When the value of the weight variable for a case is zero, negative, missing or non-numeric,
then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data. The MDVALUES table parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. Cases with missing data on the dependent variable
are always excluded. Cases with missing data on the control variable may be optionally excluded (see the
table parameter MDHANDLING).
31.3 Results
Table specifications. A list of table specifications providing a table of contents for the results.
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Descriptive statistics within categories of the control variable. Intermediate statistics are printed
in table form for each code value of the control variable showing:
the number of valid cases (N) and sum of weights (rounded to nearest integer),
sum of weights as percent of the total sum,
mean, standard deviation, coefficient of variation, sum and sum of squares of dependent variable,
sum of dependent variable as percent of the total sum.
A totals row is printed for the table giving sums over all categories of the control variable (except categories
with zero degrees of freedom, which are excluded from totals).
Analysis of variance statistics. Categories of the control variable which have zero degrees of freedom are
not included in the computation of these statistics. The following statistics are printed for each table:
total sum of squares of the dependent variable,
eta and eta squared (unadjusted and adjusted),
the sum of squares between groups (between means sum of squares) and sum of squares within groups,
the F-ratio (printed only if the data are unweighted).
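For unweighted data, the statistics listed above follow from the usual one-way decomposition of the total sum of squares. The Python sketch below is an illustration only: it takes "zero degrees of freedom" to mean a category with fewer than two cases, omits the adjusted eta (whose formula is not given here), and uses the hypothetical function name oneway.

```python
from statistics import fmean

def oneway(groups):
    """Unweighted one-way analysis of variance.

    groups: dict mapping control-variable code -> list of dependent values.
    Assumption: categories with fewer than two cases have zero degrees
    of freedom and are left out of all computations.
    """
    used = {k: v for k, v in groups.items() if len(v) >= 2}
    values = [x for v in used.values() for x in v]
    grand = fmean(values)
    ss_total = sum((x - grand) ** 2 for x in values)
    ss_between = sum(len(v) * (fmean(v) - grand) ** 2 for v in used.values())
    ss_within = ss_total - ss_between
    df_between = len(used) - 1
    df_within = len(values) - len(used)
    return {
        "ss_total": ss_total,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "eta_squared": ss_between / ss_total,
        "f_ratio": (ss_between / df_between) / (ss_within / df_within),
    }

# Category 3 has a single case, so it is left out of the statistics.
result = oneway({1: [1, 2, 3], 2: [4, 5, 6], 3: [100]})
```

Here eta squared is the between-groups share of the total sum of squares, and eta is its square root.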
31.4 Input Dataset
The input is a Data file described by an IDAMS dictionary. All analysis variables must be numeric; they
may be integer or decimal valued.
A dependent variable should be measured on an interval scale or be a dichotomy. A control variable may be
nominal, ordinal or interval but must have values in the range 0-99. If, for any case, the control variable for
an analysis has a value exceeding this range, the case is eliminated from that analysis; no message is given.
If the value of the control variable has decimal places, only the integer part is used (e.g. 1.1 and 1.6 are both
placed in group 1); no message is given.
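The treatment of control variable values can be sketched in Python as below. Whether ONEWAY checks the range before or after dropping the decimal places is not stated; this sketch assumes the check is made on the integer part, and the function name control_group is hypothetical.

```python
from math import trunc

def control_group(value, low=0, high=99):
    """Return the group code for a control value, or None if out of range.

    Only the integer part of the value is used.
    Assumption: the 0-99 range check is applied to the integer part.
    """
    code = trunc(value)
    return code if low <= code <= high else None
```

A result of None stands for a case that is silently dropped from the analysis.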
31.5 Setup Structure
$RUN ONEWAY
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Table specifications (repeated as required)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
31.6 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-4 below.
1. Filter (optional). Selects a subset of the cases to be used in the execution.
Example: EXCLUDE V3=9
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: DATA ON TRAINING EFFECTS FOR FOOTBALL PLAYERS
3. Parameters (mandatory). For selecting program options.
Example: *
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
PRINT=CDICT/DICT
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
4. Table specifications. The coding rules are the same as for parameters. Each table specification must
begin on a new line.
Examples: CONV=V6 DEPV=V26 WEIG=V3 F1=(V14,2,7) F2=(V13,1,1)
CONV=V5 DEPV=(V27-V29,V80)
DEPVARS=(variable list)
A list of variables to be used as dependent variables.
CONVARS=(variable list)
A list of variables to be used as control variables.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this set of tables. See The
IDAMS Setup File chapter.
MDHANDLING=DELETE/KEEP
DELE Delete cases with missing data on the control variable.
KEEP Include cases with missing data on the control variable.
Note: Cases with missing data on the dependent variable are always deleted.
F1=(variable number, minimum valid code, maximum valid code)
F1 refers to the first filter variable which is used to create a subset of the data. The variable
number should be the number of the filter variable; cases whose values for this variable fall
in the minimum-maximum range will be entered in the table. The minimum value may be a
negative integer. The maximum must be less than 99,999. Decimal places must be entered where
appropriate.
F2=(variable number, minimum valid code, maximum valid code)
F2 refers to the second filter variable. If this second filter is specified, a case must satisfy the
requirements of both filters to enter the table.
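The way the two local filters combine can be sketched in Python. The sketch assumes that both range bounds are inclusive and that the filter variable is present and numeric for every case; the function name passes_local_filters and the dictionary representation of a case are hypothetical.

```python
def passes_local_filters(case, f1=None, f2=None):
    """Decide whether a case enters the table.

    case: dict mapping variable number -> value
    f1, f2: (variable number, minimum, maximum) or None if not specified
    Assumption: minimum and maximum are both inclusive.
    """
    for spec in (f1, f2):
        if spec is None:
            continue
        var, low, high = spec
        if not low <= case[var] <= high:
            return False
    return True
```

With F1=(V14,2,7) and F2=(V13,1,1), a case with V14=5 and V13=1 enters the table, while a case with V14=5 and V13=2 does not.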
31.7 Restrictions
1. The maximum number of control variables is 99. The maximum number of dependent variables is
99. The total number of variables which may be accessed is 204, including variables used in Recode
statements.
2. ONEWAY uses control variable values in the range 0 to 99. If, for any case, the control variable for a
certain analysis has a value exceeding this range, the case is eliminated from that table.
3. The maximum sum of weights is about 2,000,000,000.
4. The F-ratio is printed for unweighted data only.
31.8 Examples
Example 1. Three one-way analyses of variance using V201 as control and V204 as dependent variable:
first for the whole dataset, second for a subset of cases having values 1-3 for variable V5, and the third for
a subset of cases having values 4-7 for variable V5.
$RUN ONEWAY
$FILES
PRINT = ONEW1.LST
DICTIN = STUDY.DIC input Dictionary file
DATAIN = STUDY.DAT input Data file
$SETUP
ONE-WAY ANALYSES OF VARIANCE DESCRIBED SEPARATELY
* (default values taken for all parameters)
CONV=V201 DEPV=V204
CONV=V201 DEPV=V204 F1=(V5,1,3)
CONV=V201 DEPV=V204 F1=(V5,4,7)
Example 2. Generation of a one-way analysis of variance for all combinations of control variables V101,
V102, V105 and V110, and dependent variables V17 through V21; data are weighted by variable V3.
$RUN ONEWAY
$FILES
as for Example 1
$SETUP
MASS-GENERATION OF ONE-WAY ANALYSES OF VARIANCE
* (default values taken for all parameters)
CONV=(V101,V102,V105,V110) DEPV=(V17-V21) WEIGHT=V3
Chapter 32
Partial Order Scoring (POSCOR)
32.1 General Description
POSCOR calculates (ordinal scale) scores using a procedure based on the hierarchical position of the elements
in a partially ordered set according to a number of properties (or characteristics, etc.). The scores, calculated
separately for each element of the set, are output to a Data file described by an IDAMS dictionary. This file
can then be used as input to other analysis programs.
Using the ORDER parameter, different types of scores can be obtained, namely: (1) four types of scores
where calculations are based on the proportion of cases dominated by the case; (2) four other scores where
calculations are based on the proportion of cases which dominate the case examined. The range of the scores
is determined by the SCALE parameter. Meaningful score values can be expected only when the number of
cases involved is much greater than the number of variables (or components of the score) specified.
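The general idea of scoring by dominance in a partially ordered set can be sketched in Python. This is not the formula of any particular ORDER option, which this section does not give: the dominance rule (at least as high on every variable and higher on at least one), the use of the other n-1 cases as the denominator, and the names dominates and dominated_share_scores are all assumptions made for illustration.

```python
def dominates(a, b):
    """True if case a is at least as high as b everywhere and higher once."""
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)

def dominated_share_scores(cases, scale=100):
    """Score each case by the share of the other cases it dominates.

    cases: list of equal-length tuples of integer variable values
    scale: upper end of the score range, as with the SCALE parameter
    Assumption: the share is taken over the other n-1 cases.
    """
    n = len(cases)
    scores = []
    for i, a in enumerate(cases):
        count = sum(dominates(a, b) for j, b in enumerate(cases) if j != i)
        scores.append(scale * count / (n - 1))
    return scores
```

With a single variable the order is total, so three cases valued 1, 2 and 3 score 0, 50 and 100. With two variables, cases such as (3,1) and (1,3) are not comparable and neither dominates the other.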
In applications with variables of unequal importance, a priority list can be defined using the analysis
parameter LEVEL in the partial ordering. If the variables of higher priority unambiguously determine the
relation of two cases, the variables of lower priority are not considered.
In the special case when only one variable is used in an analysis, the transformed values correspond to their
probabilities (see ORDER=ASEA/DEEA/ASCA/DESA options).
In one analysis, a series of mutually exclusive subsets can be examined using the subset facility. In this
event, the score variable(s) are computed within each subset of cases.
32.2 Standard IDAMS Features
Case and variable selection. The standard filter is available for selecting cases for the execution. A case
subsetting option is also available for each analysis. Variables to be transferred to the output file are selected
using the TRANSVARS parameter. Variables for each analysis are selected in the analysis specifications.
Transforming data. Recode statements may be used. Note that the program uses only integer values of
recoded variables, i.e. recoded variables are rounded to the nearest integer.
Weighting data. Use of weight variables is not applicable.
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. The MDHANDLING parameter indicates whether
variables or cases with missing data are to be excluded from an analysis.
32.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Output dictionary. (Optional: see the parameter PRINT).
32.4 Output Dataset
The output le contains the computed scores along with transferred variables and, optionally, analysis
variables, for each case used in the analysis (i.e. all cases passing the filter and not excluded through the
use of the missing data handling option). An associated IDAMS dictionary is also output.
Output variables are numbered sequentially starting from 1 and have the following characteristics:
Analysis and subset variables (optional: only if AUTR=YES). V-variables have the same characteristics
as their input equivalents. Recode variables are output with WIDTH=7 and DEC=0.
Case identification (ID) and transferred variables. V-variables have the same characteristics as their
input equivalents. Recode variables are output with WIDTH=7 and DEC=0.
Computed score variables.
For ORDER=ASEA/DEEA/ASCA/DESA, one variable for each analysis with:
Name specified by ANAME (default: blank)
Field width specified by FSIZE (default: 5)
No. of decimals 0
MD1 specified by OMD1 (default: 99999)
MD2 specified by OMD2 (default: 99999)
For ORDER=ASER/DESR/ASCR/DEER, two variables for each analysis with names specified by
ANAME and DNAME parameters respectively and other characteristics as outlined above.
Note. If an analysis is repeated for several mutually exclusive subsets of cases, the score variable is computed
for the cases in each subset in turn. If a case does not fall into any of the defined subsets for the analysis,
then its score variable(s) values will be set to the MD1 code.
32.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. For analysis variables, only integer values are
used. Decimal values, if any, are rounded to the nearest integer. The case ID variable and variables to be
transferred can be alphabetic.
32.6 Setup Structure
$RUN POSCOR
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Subset specifications (optional)
5. POSCOR
6. Analysis specifications (repeated as required)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output dictionary
DATAyyyy output data
PRINT results (default IDAMS.LST)
32.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further description of the program control statements, items
1-3 and 6 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V2=1-4 AND V15=2
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: SCALING THE RU INPUT VARIABLES
3. Parameters (mandatory). For selecting program options.
Example: MDHAND=CASES TRAN=V5 IDVAR=R6
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
MDHANDLING=VARS/CASES
Treatment of missing data.
VARS A variable containing a missing data value is excluded from the comparison.
CASE A case containing a missing data value is excluded from the analysis.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
IDVAR=variable number
Variable to be transferred to the output dataset to identify the cases.
No default.
TRANSVARS=(variable list)
Additional variables (up to 99) to be transferred to the output dataset. This list should not include
analysis variables or variables used in subset specifications. These are transferred automatically
using the AUTR parameter.
AUTR=YES/NO
YES Analysis variables and variables used in subset specifications will be automatically
transferred to the output dataset.
NO No transfer of analysis and subset variables.
FSIZE=5/n
Field width of the variables (scores) computed.
SCALE=100/n
The value (scale factor) specifying the range (0 - n) of the scores computed.
OMD1=99999/n
Value of the first missing data code for the computed variables (scores).
OMD2=99999/n
Value of the second missing data code for the computed variables (scores).
PRINT=(CDICT/DICT, OUTDICT/OUTCDICT/NOOUTDICT)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
OUTD Print the output dictionary without C-records.
OUTC Print the output dictionary with C-records if any.
NOOU Do not print the output dictionary.
4. Subset specifications (optional). These specify mutually exclusive subsets of cases for a particular
analysis.
Example: AGE INCLUDE V5=15-20,21-45,46-64
Rules for coding
Prototype: name statement
name
Subset name. 1-8 alphanumeric characters beginning with a letter. This name must match
exactly the name used on subsequent analysis specifications. Embedded blanks are not allowed.
It is recommended that all names be left-justified.
statement
Subset definition.
Start with word INCLUDE.
Specify variable number (V- or R-variable) on which subsets are to be based (alphabetic
variables are not allowed).
Specify values and/or ranges of values separated by commas. Each value or range defines
one subset. Commas separate the subsets. Negative ranges must be expressed in numeric
sequence, e.g. -4 - -2 (for -4 to -2); -2 - 5 (for -2 to +5). The subsets must be mutually
exclusive (i.e. same values cannot appear in two ranges). In the example above, 3 subsets
based on the value of V5 are defined for the AGE subset specification.
Enter a dash at the end of one line to continue to another.
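The coding rules above can be illustrated with a short Python sketch that parses a one-line subset specification and checks that its subsets are mutually exclusive. The names parse_subset and subset_of are hypothetical and the sketch ignores continuation lines; it only mirrors the rules as stated here and is not the way IDAMS itself reads the statement.

```python
import re

# One value or range: "7", "15-20", "-4 - -2", "-2 - 5".
_ITEM = re.compile(r"\s*(-?\d+)\s*(?:-\s*(-?\d+))?\s*")

def parse_subset(line):
    """Parse 'NAME INCLUDE Vn=values' into (name, variable, [(low, high), ...])."""
    name, keyword, rest = line.split(None, 2)
    if keyword.upper() != "INCLUDE":
        raise ValueError("subset definition must start with INCLUDE")
    variable, values = rest.split("=", 1)
    ranges = []
    for item in values.split(","):
        m = _ITEM.fullmatch(item)
        if m is None:
            raise ValueError(f"bad value or range: {item!r}")
        low = int(m.group(1))
        high = int(m.group(2)) if m.group(2) is not None else low
        if low > high:
            raise ValueError(f"range not in numeric sequence: {item!r}")
        ranges.append((low, high))
    # Mutual exclusivity: once sorted, no range may start inside the previous one.
    ordered = sorted(ranges)
    for (_, prev_high), (next_low, _) in zip(ordered, ordered[1:]):
        if next_low <= prev_high:
            raise ValueError("subsets must be mutually exclusive")
    return name.upper(), variable.strip().upper(), ranges

def subset_of(value, ranges):
    """Return the 0-based index of the subset holding value, or None."""
    for i, (low, high) in enumerate(ranges):
        if low <= value <= high:
            return i
    return None
```

For the example AGE INCLUDE V5=15-20,21-45,46-64, a case with V5=30 falls into the second subset, while a case with V5=70 belongs to no subset and would therefore receive the MD1 code for its score.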
5. POSCOR. The word POSCOR on this line signals that analysis specifications follow. It must be
included (in order to separate subset specifications from analysis specifications) and must appear only
once.
6. Analysis specifications. The coding rules are the same as for parameters. Each analysis specification
must begin on a new line.
Example: ORDER=ASER ANAME=MSDCORE DNAME=DOWNSCORE -
VARS=(V3-V6) LEVELS=(1,1,2,2)
VARS=(variable list)
The V- and/or R-variables to be used in the analysis.
No default.
ORDER=ASEA/DEEA/ASCA/DESA/ASER/DESR/ASCR/DEER
Specifies the type of score to be computed.
The score is based upon:
ASEA cases better or equal/dominating
DEEA cases worse or equal/dominated
ASCA cases strictly better/strictly dominating
DESA cases strictly worse/strictly dominated
relative to the total number of cases
ASER/DESR
ASER cases better or equal/dominating
DESR cases strictly worse/strictly dominated
relative to the number of comparable cases
ASCR/DEER
ASCR cases strictly better/strictly dominating
DEER cases worse or equal/dominated
relative to the number of comparable cases
Note. For ASER/DESR and for ASCR/DEER, both scores of the pair are computed whichever of the two is
selected. Their sum equals the value specified in the SCALE parameter.
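To make the idea of scores "relative to the number of comparable cases" concrete, here is a minimal Python sketch of an ASER/DESR-style pair of scores. It rests on assumptions that this chapter does not state: two cases are compared variable by variable, a case is "better or equal" when it is greater than or equal on every variable, the counts are taken over the other cases only, and LEVELS priorities and missing data are ignored. The function names are hypothetical.

```python
def compare(a, b):
    """Compare case a with case b variable by variable.

    Returns 'better', 'worse', 'equal', or None when the two cases are
    not comparable (assumed: each is higher on at least one variable).
    """
    ge = all(x >= y for x, y in zip(a, b))
    le = all(x <= y for x, y in zip(a, b))
    if ge and le:
        return "equal"
    if ge:
        return "better"
    if le:
        return "worse"
    return None

def relative_scores(cases, scale=100):
    """Return one (better_or_equal, strictly_worse) score pair per case,
    both relative to the number of comparable cases."""
    result = []
    for i, current in enumerate(cases):
        better_or_equal = strictly_worse = 0
        for j, other in enumerate(cases):
            if i == j:
                continue
            relation = compare(other, current)
            if relation in ("better", "equal"):
                better_or_equal += 1
            elif relation == "worse":
                strictly_worse += 1
        comparable = better_or_equal + strictly_worse
        if comparable == 0:
            result.append((None, None))  # nothing to compare with
        else:
            first = round(scale * better_or_equal / comparable)
            # The second score is the complement, so the pair sums to scale.
            result.append((first, scale - first))
    return result
```

Every case is compared with every other case, so the work grows with the square of the number of cases; the two scores of a pair always add up to the scale value.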
SUBSET=xxxxxxxx
Specifies the name of the subset specification to be used, if any. Enclose the name in primes if it
contains non-alphanumeric characters. Upper case letters should be used in order to match the
name on the subset specification which is automatically converted to upper case.
LEVELS=(1, 1,..., 1) / (N1, N2, N3,...,Nk)
k is the number of variables used in the analysis variable list. Ni defines the priority order of
the i-th variable in the list of variables involved in the partial ordering. A higher value implies a
lower priority. The priority values must be specified in the same sequence as the corresponding
variables in the analysis variable list. The default of all 1s implies that all variables have the
same priority.
ANAME=name
Up to 24 character name for the increasing score. Primes are mandatory if the name contains
non-alphanumeric characters.
Default: Blanks.
DNAME=name
Up to 24 character name for the decreasing score. Primes are mandatory if the name contains
non-alphanumeric characters.
Default: Blanks.
32.8 Restrictions
1. The values of the analysis variables must be between -32,767 and +32,767.
2. Components of the priority list in the LEVELS parameter must be positive integers between 1 and
32,767.
3. Maximum number of analyses is 10.
4. Maximum number of variables to be transferred is 99.
5. A variable can only be used once whether it be an ID variable, in an analysis list or in a transfer list.
If it is required to use the same variable twice, then use recoding to obtain a copy with a different
variable (result) number.
6. Maximum number of variables used for analysis, in subset specifications and in a transfer list is 100
(including both V- and R-variables).
7. Maximum number of subset specifications is 10.
8. If the ID variable or a variable to be transferred is alphabetic with width > 4, only the first four
characters are used.
9. Although the number of cases processed is not limited, it should be noted that the execution time
increases as a quadratic function of the number of cases being analysed.
32.9 Examples
Example 1. Computation of two scores using the same variables V10, V12, V35 through V40; the first
score will be calculated on the whole dataset, while the second one will be calculated separately on three
subsets (for values 1, 2 and 3 of the variable V7); cases with missing data are to be excluded from analyses;
both scores are based upon the cases strictly dominated relative to the number of comparable cases; cases
are identified by variables V2 and V4 which are transferred to the output file. Note that Recode is used to
make a copy of the variables since a restriction of the program means that a variable may only be used once
in an execution.
$RUN POSCOR
$FILES
PRINT = POSCOR1.LST
DICTIN = PREF.DIC input Dictionary file
DATAIN = PREF.DAT input Data file
DICTOUT = SCORES.DIC output Dictionary file
DATAOUT = SCORES.DAT output Data file
$SETUP
COMPUTATION OF TWO SCORES
MDHAND=CASES IDVAR=V2 TRANSVARS=V4
TYPE INCLUDE V7=1,2,3
POSCOR
ORDER=DESR ANAME='GLOBAL SCORE INCR' DNAME='GLOBAL SCORE DECR' -
VARS=(V10,V12,V35-V40)
ORDER=DESR ANAME='ADJUSTED SCORE INCR' -
DNAME='ADJUSTED SCORE DECR' SUBS=TYPE -
VARS=(R10,R12,R35-R40)
$RECODE
R10=V10
R12=V12
R35=V35
R36=V36
R37=V37
R38=V38
R39=V39
R40=V40
Example 2. Computation of three scores based upon cases dominating relative to the total number of
cases; analysis variables are not to be transferred to the output file; variables containing missing data values
are to be excluded from the comparison; case identification variables V1 and V5 are transferred.
$RUN POSCOR
$FILES
as for Example 1
$SETUP
COMPUTATION OF THREE SCORES
AUTR=NO IDVAR=V1 TRANSVARS=V5
POSCOR
ORDER=ASEA ANAME='SCORE 1 INCR' VARS=(V11,V17,V55-V60)
ORDER=ASEA ANAME='SCORE 2 INCR' VARS=(V108-V110,V114,V116,V118,V120)
ORDER=ASEA ANAME='SCORE 3 INCR' VARS=(V22,V33,V101-V105)
Chapter 33
Pearsonian Correlation (PEARSON)
33.1 General Description
PEARSON computes and prints matrices of Pearson r correlation coefficients and covariances for all pairs
of variables in a list (square matrix option) or for every pair of variables formed by taking one variable from
each of two variable lists (rectangular matrix option).
Either pair-wise or case-wise deletion of missing data may be specified.
PEARSON can also be used to output a correlation matrix which can subsequently be input to the
REGRESSN or MDSCAL programs. Although REGRESSN is capable of computing its own correlation matrix,
its missing data handling is limited to case-wise deletion. In contrast, a matrix can be generated by
PEARSON using a pair-wise deletion algorithm for missing data.
33.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. The variables for which correlations are desired are specified with the ROWVARS and COLVARS
parameters.
Transforming data. Recode statements may be used.
Weighting data. A variable can be used to weight the input data; this weight variable may have integer or
decimal values. When the value of the weight variable for a case is zero, negative, missing or non-numeric,
then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. The univariate statistics for each variable are
computed from the cases which have valid (non-missing) data for the variable.
Missing data: pair-wise deletion. Paired statistics and each correlation coefficient can be computed from
the cases which have valid data for both variables (MDHANDLING=PAIR). Thus, a case may be used in the
computations for some pairs of variables and not used for other pairs. This method of handling missing data
is referred to as the pair-wise deletion algorithm. Note: If there are missing data, individual correlation
coefficients may be computed on different subsets of the data. If there is a great deal of missing data,
this can lead to internal inconsistencies in the correlation matrix which can cause difficulties in subsequent
multivariate analysis.
Missing data: case-wise deletion. The program can also be instructed (MDHANDLING=CASE) to
compute the paired statistics and correlations from the cases which have valid data on all variables in the
variable list. Thus, a case is either used in computations for all pairs of variables or not used at all. This
method of handling missing data is referred to as the case-wise deletion algorithm (also available in the
REGRESSN program), and applies only to the square matrix option.
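The difference between the two deletion methods can be shown with a small unweighted Python sketch. It is an illustration only: missing values are represented by None (an assumption; IDAMS uses the MD1/MD2 codes from the dictionary), the function names are hypothetical, and weighting is left out.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r for two equal-length lists of valid values; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant variable has no defined correlation
    return sxy / sqrt(sxx * syy)

def correlations(cases, mdhandling="PAIR"):
    """Return {(i, j): (n_valid, r)} for every pair of variables with j < i."""
    if mdhandling == "CASE":
        # Case-wise deletion: keep only cases valid on all variables.
        cases = [c for c in cases if None not in c]
    nvars = len(cases[0]) if cases else 0
    result = {}
    for i in range(nvars):
        for j in range(i):
            # Pair-wise deletion: keep cases valid on these two variables.
            pairs = [(c[i], c[j]) for c in cases
                     if c[i] is not None and c[j] is not None]
            xs = [p[0] for p in pairs]
            ys = [p[1] for p in pairs]
            result[(i, j)] = (len(pairs), pearson_r(xs, ys))
    return result
```

With pair-wise deletion the number of valid cases can differ from one coefficient to the next; with case-wise deletion it is the same for every pair.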
33.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Square matrix option
Paired statistics. (Optional: see the parameter PRINT). For each pair of variables in the variable list the
following are printed:
number of valid cases (or weighted sum of cases),
mean and standard deviation of the X variable,
mean and standard deviation of the Y variable,
t-test for correlation coefficient,
correlation coefficient.
Univariate statistics. For each variable in the variable list the following are printed:
number of valid cases and sum of weights,
sum of scores and sum of scores squared,
mean and standard deviation.
Regression coefficients for raw scores. (Optional: see the parameter PRINT). For each pair of variables
x and y, the regression coefficients a and c and the constant terms b and d in the regression equations x=ay+b
and y=cx+d are printed.
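As an illustration of the two regression equations, the following Python sketch computes a, b, c and d by ordinary least squares on unweighted data. The use of ordinary least squares is an assumption (the manual does not name the estimation method here), and the function name is hypothetical.

```python
def raw_score_regressions(xs, ys):
    """Return (a, b, c, d) for x = a*y + b and y = c*x + d."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    a = sxy / syy      # slope of x regressed on y
    b = mx - a * my    # constant term for x = a*y + b
    c = sxy / sxx      # slope of y regressed on x
    d = my - c * mx    # constant term for y = c*x + d
    return a, b, c, d
```

Under this assumption the product of the two slopes, a*c, equals the square of the correlation coefficient, which gives a quick consistency check against the printed correlation.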
Correlation matrix. (Optional: see the parameter PRINT). The lower-left triangle of the matrix.
Cross-products matrix. (Optional: see the parameter PRINT). The lower-left triangle of the matrix.
Covariance matrix. (Optional: see the parameter PRINT). The lower-left triangle of the matrix with
diagonal.
In each of the above matrices, a maximum of 11 columns and 27 rows are printed per page.
Rectangular matrix option
Table of variable frequencies. Number of valid cases for each pair of variables.
Table of mean values for column variables. Means are calculated and printed for each column variable
over the cases which are valid for each row variable in turn.
Table of standard deviations for column variables. As for means.
Correlation matrix. (Optional: see the parameter PRINT). Correlation coefficients for all pairs of
variables.
Covariance matrix. (Optional: see the parameter PRINT). Covariances for all pairs of variables.
In each of the above tables, a maximum of 8 columns and 50 rows are printed per page.
Note: If a variable pair has no valid cases, 0.0 is printed for the mean, standard deviation, correlation and
covariance.
33.4 Output Matrices
Correlation matrix
The correlation matrix in the form of an IDAMS square matrix is output when the parameter WRITE=CORR
is specified. The format used to write the correlations is 8F9.6; the format for both the means and standard
deviations is 5E14.7. Columns 73-80 are used to identify the records.
The matrix contains correlations, means, and standard deviations. The means and standard deviations are
unpaired. The dictionary records which are output by PEARSON contain variable numbers and names from
the input dictionary and/or Recode statements. The order of the variables is determined by the order of
variables in the variable list.
PEARSON may generate correlations equal to 99.99901, and means and standard deviations equal to 0.0
when it is unable to compute a meaningful value. Typical reasons are that all cases were eliminated due
to missing data or one of the variables was constant in value. Note that MDSCAL does not accept these
missing values although REGRESSN does.
Covariance matrix
The covariance matrix without the diagonal in the form of an IDAMS square matrix is output when the
parameter WRITE=COVA is specified.
33.5 Input Dataset
The input is a Data file described by an IDAMS dictionary. All analysis variables must be numeric; they
may be integer or decimal valued.
33.6 Setup Structure
$RUN PEARSON
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
FT02 output matrices if WRITE parameter specified
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
33.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V2=11-15,60 OR V3=9
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: FIRST EXECUTION OF PEARSON - APRIL 27
3. Parameters (mandatory). For selecting program options.
Example: WRITE=CORR, PRINT=(CORR,COVA) ROWV=(V1,V3-V6,R47,V25)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MATRIX=SQUARE/RECTANGULAR
SQUA Compute Pearson correlation coefficients for all pairs of variables from the ROWV list.
RECT Compute Pearson correlation coefficients for every pair of variables formed by taking
one variable from each of the ROWV and COLV lists.
ROWVARS=(variable list)
A list of V- and/or R-variables to be correlated (MATRIX=SQUARE) or the list of row variables
(MATRIX=RECTANGULAR).
No default.
COLVARS=(variable list)
(MATRIX=RECTANGULAR only).
A list of V- and/or R-variables to be used as the column variables. Eight columns are printed per
page; if either the row variable list or the column variable list contains fewer than eight variables,
it is preferable (for ease of reading results) to have the short list as the column variable list.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
MDHANDLING=PAIR/CASE
Method of handling missing data.
PAIR Pair-wise deletion.
CASE Case-wise deletion (not available with MATRIX=RECTANGULAR).
WEIGHT=variable number
The weight variable number if the data are to be weighted.
WRITE=(CORR, COVA)
(MATRIX=SQUARE only).
CORR Output the correlation matrix with means and standard deviations.
COVA Output the covariance matrix with means and standard deviations.
PRINT=(CDICT/DICT, CORR/NOCORR, COVA, PAIR, REGR, XPRODUCTS)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
CORR Print the correlation matrix.
COVA Print the covariance matrix.
PAIR Print the paired statistics (MATRIX=SQUARE only).
REGR Print the regression coefficients (MATRIX=SQUARE only).
XPRO Print the matrix of cross-products (MATRIX=SQUARE only).
33.8 Restrictions
When MATRIX=SQUARE is specified
1. The maximum number of variables permitted in an execution is 200. This limit includes all analysis
variables, and variables used in Recode statements.
2. Recode variable numbers must not exceed 999 if the parameter WRITE is specied. (They are output
as negative numbers in the descriptive part of the matrix which has only 4 columns reserved for the
variable number e.g. R862 becomes -862).
When MATRIX=RECTANGULAR is specified
1. The maximum number of variables in either the row or the column variable list is 100.
2. Maximum total number of row variables, column variables, variables used in Recode statements, and
the weight variable is 136.
33.9 Examples
Example 1. Calculation of a square matrix of Pearson's r correlation coefficients with pair-wise deletion of
cases having missing data; the matrix will be written into a file and printed.
$RUN PEARSON
$FILES
PRINT = PEARS1.LST
FT02 = BIRDCOR.MAT output Matrix file
DICTIN = BIRD.DIC input Dictionary file
DATAIN = BIRD.DAT input Data file
$SETUP
MATRIX OF CORRELATION COEFFICIENTS
PRINT=(PAIR,REGR,CORR) WRITE=CORR ROWV=(V18-V21,V36,V55-V61)
Example 2. Calculation of Pearson's r correlation coefficients for variables V10-V20 with variables V5-V6.
$RUN PEARSON
$FILES
DICTIN = BIRD.DIC input Dictionary file
DATAIN = BIRD.DAT input Data file
$SETUP
CORRELATION COEFFICIENTS
MATRIX=RECT ROWV=(V10-V20) COLV=(V5-V6)
Chapter 34
Rank-Ordering of Alternatives
(RANK)
34.1 General Description
RANK determines a reasonable rank-order of alternatives, using preference data as input and three different
ranking procedures, one based on classical logic (the method ELECTRE) and two others based on fuzzy
logic. The two approaches essentially differ in the way the relational matrices are constructed. With fuzzy
ranking, the data completely determine the result whereas with classical ranking the user, relying on concepts
of classical logic, has the possibility of controlling the calculation of the overall relations among alternatives.
The ELECTRE method (classical logic) implemented in RANK, in a first step, uses the input preference
data to calculate a final matrix expressing the overall collective opinion about the dominance among
alternatives, the structure of the relation not necessarily corresponding to a linear or partial order. The
dominance relation for each pair of alternatives is controlled by the conditions for concordance and for
discordance fixed by the user. Different relational structures may be obtained from the same data by
varying the analysis parameters. In the second step, the procedure looks for a sequence of non-dominated
layers (cores) of alternatives. The first core consists of the alternatives of highest rank in the whole set
considered. It should be noted that in certain cases further cores may not exist due to loops in the relation.
This may be true even at the highest level.
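The second step, the search for successive non-dominated cores, can be sketched in Python as follows. The representation of the dominance relation as a set of (a, b) pairs meaning "a dominates b" and the function name are assumptions made for illustration; the sketch shows the layering idea and how a loop stops it, not the program's actual code.

```python
def non_dominated_cores(alternatives, dominates):
    """Peel off successive non-dominated layers (cores).

    Returns (cores, leftover). leftover is non-empty when a loop in the
    relation leaves no non-dominated alternative among those remaining.
    """
    remaining = list(alternatives)
    cores = []
    while remaining:
        # An alternative is in the core if no other remaining one dominates it.
        core = [b for b in remaining
                if not any((a, b) in dominates for a in remaining if a != b)]
        if not core:
            break  # every remaining alternative is dominated: a loop
        cores.append(core)
        remaining = [x for x in remaining if x not in core]
    return cores, remaining
```

If two alternatives dominate each other, neither can enter a core, so the sequence stops at that level; when this happens among all alternatives, not even a first core is found.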
The first fuzzy method (non-dominated layers) was originally developed for solving decision-making
problems with fuzzy information. This method makes it possible to find a sequence of non-dominated
layers (cores) of alternatives in a fuzzy preference structure, which does not necessarily represent a (total)
linear order. The subsequent cores are such groups of alternatives which have the highest rank among the
alternatives which do not belong to previous, higher level cores. The first core stands for the alternatives of
highest rank in the whole set considered.
The second fuzzy method (ranks) tries to find the credibility of statements of the form "the j-th alternative is
exactly at the p-th position in the rank-order". The results are straightforward in the case of a (total) linear
order relation behind the data; otherwise special care should be given to the interpretation of the results.
The optimization procedure, developed to handle the general (normalized or non-normalized) case, allows
the user to decide whether to normalize the fuzzy relational matrix before the actual ranking procedure (see
option NORM). A careful interpretation of the results is needed after normalization. Usually incomplete
data result in a non-normalized relational matrix especially when DATA=RAWC is used and the number
of selected alternatives in individual answers is smaller than the number of possible alternatives. Although
a non-normalized matrix gives results in which the level of uncertainty is higher, it may provide a more
realistic picture about the latent relation determining the data; indeed the normalization can be interpreted
as a kind of extrapolation.
Two types of individual preference relations (strict or weak) can be specified, both in the case of data
representing a selection of alternatives, and in the case of data representing a ranking of alternatives.
1. Data representing a selection of alternatives.
Strict preference: each selected alternative is considered to have a unique (different) rank,
while the non-selected ones are given the same lowest rank.
Weak preference: all selected alternatives are considered to have the same common rank, which
is higher than the rank of the non-selected ones.
2. Data representing a ranking of alternatives.
Strict preference: all ranked alternatives are supposed to have different values, and
relations between alternatives having the same rank are disregarded in the calculation of the overall
preference relation across the alternatives.
Weak preference: alternatives with the same rank are taken into account in the calculation.
34.2 Standard IDAMS features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data, and the parameter VARS is used to select variables.
Transforming data. Recode statements may be used. Note that the program uses only integer values:
recoded variables are rounded to the nearest integer.
Weighting data. Data may be weighted by integer values. Note that decimal valued weights are rounded to
the nearest integer. When the value of the weight variable for a case is zero, negative, missing or non-numeric,
then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. For DATA=RAWC, the variables with missing data
are skipped; for DATA=RANKS, the missing data values are substituted by the lowest rank.
34.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Invalid data. Messages about incorrect (rejected) data.
Methods based on fuzzy logic (METHOD=NOND/RANKS)
Matrix of relations. A square matrix representing the fuzzy relation is printed by rows. If the rows have
more than ten elements they are continued on subsequent line(s).
Description of the relations. After printing the type of relation, three measures are given which charac-
terize concisely the relation, namely: absolute coherence, intensity and absolute dominance indices.
Analysis results. The results are presented in a different form for each method.
For METHOD=NOND the cores are printed sequentially from the highest rank and for each of them the
following information is given:
its sequential number, with the certainty level,
the codes and code labels of the alternatives, or the variable numbers and names (up to 8 characters),
the membership function values of the alternatives indicating how strongly they are connected to the
core; membership values of alternatives belonging to previous cores are substituted by asterisks,
list of alternatives belonging to the core with the highest membership value (most credible alternatives).
For METHOD=RANKS the normalized relational matrix is printed first if normalization was requested.
The results are then printed, in two forms for easier interpretation.
1. All alternatives are listed sequentially with, for each:
the code and code label of the alternative, or the variable number and name,
the membership function values of the alternative indicating how strongly it is connected to each
rank,
the list of most credible rank(s) for that alternative.
2. All ranks are listed sequentially with, for each:
the rank number,
the codes and code labels of the alternatives, or the variable numbers and names,
the membership function values of the alternatives indicating how strongly they are connected to
that rank,
the list of most credible alternative(s) for that rank.
Method based on classical logic (METHOD=CLAS)
Analysis results. For each final dominance relational structure resulting from one analysis, the rank
differences and the minimum/maximum population proportions specified by the user are printed, followed
by the list of successive non-dominated cores (identified by their sequential number) with the alternatives
belonging to them.
Note. Alternatives are labelled either with the first 8 characters of the variable label for DATA=RANKS
or with the 8-character code label (if C-records are present in the dictionary) for DATA=RAWC.
34.4 Input Dataset
The input is a Data file described by an IDAMS dictionary. All analysis variables must have positive integer
values. Note that decimal valued variables are rounded to the nearest integer.
Preferences can be represented in 2 ways in the data. The following illustration shows these.
Suppose that data are to be collected about employee preferences for various factors relating to their job:
Own office
High salary
Long holidays
Minimum supervision
Compatible colleagues
The 2 ways of representing this in a questionnaire are:
1. DATA=RAWC
In this case, the factors are coded (e.g. 1 to 5) and the respondent is asked to pick them in order of
preference. The variables in the data would represent the rank, e.g.
V6 Most important factor
V7 2nd most important factor
.
.
V10 Least important factor
and the codes assigned to each of these variables by a respondent would represent the factors (e.g.
1=own office, 2=high salary, etc.).
Not all possible factors need be selected; one could ask, say, for the 3 most important by specifying
only these variables on the variable list, e.g. V6, V7, V8. The number of different factors being used is
specified with the NALT parameter.
2. DATA=RANKS
Here, each factor is listed in the questionnaire as a variable, e.g.
V13 Own office
V14 High salary
.
.
V17 Compatible colleagues
and the respondent is invited to assign a rank to each, where 1 is given to the most important factor,
2 to the next most important, etc. Here the variables represent the factors and their values represent
the rank. Each variable must be assigned a rank and all factors will always enter into the analysis.
The ranks must be coded from 1 to n where n is the number of variables being considered.
Notes.
1. If DATA=RANKS, the code 0 and all codes greater than n where n is the number of variables (i.e.
number of alternatives) are treated as missing values and are assigned to the lowest rank.
2. If DATA=RAWC, the first NALT different codes encountered while reading the data (excluding 0)
are used as valid codes. Other codes encountered later in the data are taken as illegal codes. Zero is
always treated as an illegal code. If the number of alternatives selected by the respondents is less than
NALT, then the non-selected alternatives appear in the results with zero code value and empty code
label.
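The relationship between the two data layouts can be illustrated with a short Python sketch that collects the valid codes as described in note 2 and then rewrites one DATA=RAWC case as a rank per alternative, i.e. in the DATA=RANKS layout. The function names are hypothetical, and two points are assumptions: non-selected alternatives are given the lowest rank, written here as NALT, and illegal codes are simply ignored.

```python
def valid_codes(cases, nalt):
    """Collect the first nalt different non-zero codes met while reading the data."""
    codes = []
    for case in cases:
        for code in case:
            if code != 0 and code not in codes and len(codes) < nalt:
                codes.append(code)
    return codes

def rawc_to_ranks(case, codes, nalt):
    """Return one rank per alternative, in the order of codes.

    case holds the chosen alternative codes, most preferred first.
    Assumption: alternatives not selected in this case get rank nalt.
    """
    ranks = {code: nalt for code in codes}
    seen = set()
    for position, code in enumerate(case, start=1):
        # Zero and codes outside the valid set are treated as illegal and skipped.
        if code in ranks and code not in seen:
            ranks[code] = position
            seen.add(code)
    return [ranks[code] for code in codes]
```

For instance, with NALT=5 and three variables per case, a respondent choosing alternatives 2, 5 and 1 in that order gives ranks 1, 2 and 3 to those alternatives and the lowest rank to the two others.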
34.5 Setup Structure
$RUN RANK
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Analysis specifications (repeated as required)
(for classical logic only)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
34.6 Program Control Statements
Refer to The IDAMS Setup File chapter for further description of the program control statements, items
1-4 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V2=11
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: FIRST RUN OF RANK
3. Parameters (mandatory). For selecting program options.
Example: DATA=RANKS PREF=STRICT MDVALUES=NONE VARS=(V11-V13)
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
For DATA=RAWC, variables with missing data are not included in the ranking.
For DATA=RANKS, missing data values are recoded to the lowest rank.
VARS=(variable list)
A list of V- and/or R-variables to be used in the ranking procedure.
No default.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
METHOD=(CLASSICAL/NOCLASSICAL, NONDOMINATED, RANKS)
Specifies the method to be used in the analysis.
CLAS Method of classical logic (ELECTRE).
NOND Fuzzy method-1, called non-dominated layers.
RANK Fuzzy method-2, called ranks.
DATA=RAWC/RANKS
Type of data.
RAWC The variables correspond to ranks (the first variable in the list has the first rank,
the second one the second rank, etc.), while their value is the code number of the
alternative selected.
RANK Variables represent alternatives, their values being ranks of the corresponding alterna-
tives.
PREF=STRICT/WEAK
Determines the type of the preference relation to be used in the analysis.
STRI A strict preference relation is used.
WEAK A weak preference relation is used.
NALT=5/n
(DATA=RAWC only). Total number of alternatives to be ranked.
Note: If DATA=RANKS, the number of alternatives is automatically set to the number of analysis
variables.
NORMALIZE=NO/YES
(METHOD=RANKS only).
NO No normalization.
YES Normalization of the relational matrix is performed before calculating the value of
membership function of alternatives.
PRINT=CDICT/DICT
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
4. Analysis specifications (conditional: only in case of classical logic method). The coding rules are
the same as for parameters. Each analysis specification must begin on a new line.
Example: PCON=66 DDIS=4 PDIS=20
DCON=1/n
Rank difference controlling the concordance in individual opinions (cases). It must be an integer
in the range 0 to NALT-1.
PCON=51/n
Minimum proportion of individual concordance, expressed as a percentage, required in the col-
lective opinion. It must be an integer in the range 0 to 99. The default value means that at least
51% agreement is requested for a collective concordance.
DDIS=2/n
Rank difference controlling the discordance in individual opinions (cases). It must be an integer
in the range 0 to NALT-1.
PDIS=10/n
Maximum proportion of individual discordance, expressed as a percentage, tolerated in the col-
lective opinion. It must be an integer in the range 0 to 100. The default value means that no
more than 10% individual discordance is tolerated.
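The two layouts accepted by the DATA parameter carry the same information in different shapes. The following Python sketch, which is not part of IDAMS, converts one DATA=RAWC case into the DATA=RANKS layout. It assumes that alternatives are coded 1 to NALT; the function name and the error handling are hypothetical.

```python
from typing import List, Optional


def rawc_to_ranks(selected: List[int], nalt: int) -> List[Optional[int]]:
    """Convert one DATA=RAWC case into the DATA=RANKS layout.

    selected[i] is the code of the alternative given rank i + 1.
    Assumption: alternatives are coded 1..nalt.
    The result has one entry per alternative; entry k - 1 holds the rank
    of alternative k, or None if the case did not select it.
    """
    ranks: List[Optional[int]] = [None] * nalt
    for position, code in enumerate(selected, start=1):
        if not 1 <= code <= nalt:
            raise ValueError(f"alternative code {code} outside 1..{nalt}")
        if ranks[code - 1] is not None:
            raise ValueError(f"alternative {code} selected twice")
        ranks[code - 1] = position
    return ranks


# Hypothetical case: alternative 3 ranked first, then 1, then 4 (NALT=5).
print(rawc_to_ranks([3, 1, 4], nalt=5))  # [2, None, 1, 3, None]
```

Alternatives that were not selected stay unranked (None) in this sketch; how RANK itself treats them is described under MDVALUES.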
34.7 Restrictions
1. The maximum number of variables permitted in any execution is 200, including those used in Recode
statements and the weight variable.
2. The maximum number of analysis variables is 60.
34.8 Examples
Example 1. Determination of a rank-order of alternatives using data collected in the form of ranking of
alternatives; there are 10 alternatives, weak preference relation is assumed, and analysis is to be done using
the Ranks method.
$RUN RANK
$FILES
PRINT = RANK1.LST
DICTIN = PREF.DIC input Dictionary file
DATAIN = PREF.DAT input Data file
$SETUP
RANK - ORDERING OF ALTERNATIVES : RANKS METHOD
DATA=RANKS PREF=WEAK METH=(NOCL,RANKS) VARS=(V21-V30)
Example 2. Determination of a rank-order of alternatives using data collected in the form of a selection
of priorities; three alternatives are selected out of 20 and the order of variables determines the priority of
selection; strict preference relation is assumed; both fuzzy methods are requested in analysis.
$RUN RANK
$FILES
as for Example 1
$SETUP
RANK - ORDERING OF ALTERNATIVES : TWO FUZZY METHODS
NALT=20 METH=(NOCL,NOND,RANKS) VARS=(V101-V103)
Example 3. Determination of a rank-order of alternatives using data collected in the form of a selection of
priorities; 4 alternatives are selected out of 15 and the order of variables does not determine the priority of
selection (weak preference); four classical logic analyses are to be performed keeping rank differences always
equal to 1, but increasing proportion of discordance and decreasing proportion of concordance.
$RUN RANK
$FILES
as for Example 1
$SETUP
RANK - ORDERING OF ALTERNATIVES : CLASSICAL LOGIC
PREF=WEAK NALT=15 METH=CLAS VARS=(V21,V23,V25,V27)
PCON=75 DDIS=1 PDIS=5
PCON=66 DDIS=1 PDIS=10
PCON=51 DDIS=1 PDIS=15
PCON=40 DDIS=1 PDIS=20
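Before running a setup such as Example 3, the analysis specifications can be checked against the ranges documented for DCON, PCON, DDIS and PDIS. The Python sketch below is an illustration only and is not part of IDAMS; the function name is hypothetical and the defaults are those listed in the parameter descriptions.

```python
def check_analysis_spec(nalt, dcon=1, pcon=51, ddis=2, pdis=10):
    """Return a list of range violations for one analysis specification."""
    problems = []
    if not 0 <= dcon <= nalt - 1:
        problems.append(f"DCON={dcon} not in 0..{nalt - 1}")
    if not 0 <= pcon <= 99:
        problems.append(f"PCON={pcon} not in 0..99")
    if not 0 <= ddis <= nalt - 1:
        problems.append(f"DDIS={ddis} not in 0..{nalt - 1}")
    if not 0 <= pdis <= 100:
        problems.append(f"PDIS={pdis} not in 0..100")
    return problems


# The four analysis specifications of Example 3 (NALT=15, DCON left at default).
example3 = [
    dict(pcon=75, ddis=1, pdis=5),
    dict(pcon=66, ddis=1, pdis=10),
    dict(pcon=51, ddis=1, pdis=15),
    dict(pcon=40, ddis=1, pdis=20),
]
for spec in example3:
    print(spec, check_analysis_spec(nalt=15, **spec))
```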
Chapter 35
Scatter Diagrams (SCAT)
35.1 General Description
SCAT is a bivariate analysis program which produces scatter diagrams, univariate statistics, and bivariate
statistics. The scatter diagrams are plotted on a rectangular coordinate system; for each combination of
coordinate values that appears in the data, the frequency of its occurrence is displayed.
SCAT is useful for displaying bivariate relationships if the number of different values for each variable
is large and the number of data cases containing any one value is small. If, however, a variable assumes
relatively few different values in a large number of data cases, the TABLES program is more appropriate.
Plot format. Each plot desired is defined separately by specifying the two variables to be used (called
the X and Y variables). The scales of the axes are adjusted separately for each plot to allow variables
with radically different scales to be plotted against each other without loss of discrimination. Normally, the
program plots the variable with the greater range (before rescaling) along the horizontal axis. However, the
user may request that the X variable always be plotted along the horizontal axis. The actual frequencies
are entered into the diagram if they are less than 10. For frequencies from 10-65, the letters of the alphabet
are used. If the frequency of a point is greater than 65, an asterisk is placed in the diagram. This coding
scheme is part of the results for easy reference.
Statistics. The mean, standard deviation, minimum and maximum values are printed for each variable
accessed, including the plot filter and weight variable, if any. For each plot the program also prints the
mean, standard deviation, case count and range for the two variables, Pearson's correlation coefficient r, the
regression constant, and the unstandardized regression coefficient for predicting Y from X.
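The bivariate statistics named above follow the usual textbook definitions. The Python sketch below, which is not IDAMS code, computes Pearson's r, the regression constant A and the unstandardized coefficient B for predicting Y from X on unweighted data. It assumes that a missing value is represented by None and drops any pair with a missing member; the function name is hypothetical.

```python
import math


def bivariate_stats(pairs):
    """Pearson's r and the regression of Y on X for (x, y) pairs.

    Assumption: None marks a missing value; pairs with a missing member
    are dropped before anything is computed.
    """
    valid = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(valid)
    if n < 2:
        raise ValueError("at least two complete pairs are required")
    mean_x = sum(x for x, _ in valid) / n
    mean_y = sum(y for _, y in valid) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in valid)
    syy = sum((y - mean_y) ** 2 for _, y in valid)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in valid)
    if sxx == 0 or syy == 0:
        raise ValueError("a variable is constant; r is undefined")
    b = sxy / sxx                 # unstandardized regression coefficient B
    a = mean_y - b * mean_x       # regression constant A
    r = sxy / math.sqrt(sxx * syy)
    return {"n": n, "r": r, "A": a, "B": b}


stats = bivariate_stats([(1, 2), (2, 4), (3, 6), (None, 5), (4, 8)])
print(stats)
```

In this hypothetical data Y is exactly twice X, so the sketch reports four complete pairs, r close to 1, B close to 2 and A close to 0.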
35.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. In addition, a plot filter variable and range of values may be specified to restrict the data cases included
in a particular plot. The variables to be plotted are specified in pairs with plot parameters.
Transforming data. Recode statements may be used. Note that for R-variables, the number of decimals
to be retained is specified by the NDEC parameter.
Weighting data. A weight variable may be specified for each plot. Both V- and R-variables with decimal
places are multiplied by a scale factor in order to obtain integer values. See Input Dataset section below.
When the value of the weight variable for a case is zero, negative, missing or non-numeric, then the case is
always skipped; the number of cases so treated is printed.
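The rule for unusable weights can be sketched in Python as follows. This is an illustration rather than IDAMS code: it assumes that a missing weight is represented by None, and the function names are hypothetical.

```python
import math


def usable_weight(raw):
    """Return the weight as a float, or None if the case must be skipped."""
    if raw is None:                      # assumption: None stands for missing
        return None
    try:
        weight = float(raw)
    except (TypeError, ValueError):      # non-numeric value
        return None
    if math.isnan(weight) or weight <= 0:  # zero or negative weight
        return None
    return weight


def apply_weight_rule(raw_weights):
    """Split weights into usable ones and a count of skipped cases."""
    kept = []
    skipped = 0
    for raw in raw_weights:
        weight = usable_weight(raw)
        if weight is None:
            skipped += 1
        else:
            kept.append(weight)
    return kept, skipped


kept, skipped = apply_weight_rule([2, 0, -1, None, "abc", "1.5"])
print(kept, skipped)  # [2.0, 1.5] 4
```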
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. The univariate statistics which appear at the
beginning of the results, immediately following the dictionary, are based on all cases which have valid data
on each variable considered singly. For the plots themselves, the program eliminates cases which have missing
data on either or both of the variables in a particular plot. This pair-wise deletion also affects univariate
and bivariate statistics which are printed at the top of each plot.
35.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Univariate statistics. The following are printed for each variable referenced, including plot filter and
weight variables: minimum and maximum values, mean and standard deviation, and the number of cases
with valid data values.
Key to plot coding scheme. A table showing the correspondence between the actual frequencies and the
codes used in the plots.
Plot and statistics. For each plot requested, an 8 1/2 inch by 12 inch scatter diagram is printed. Univariate
statistics (means, standard deviations) and bivariate statistics (Pearson's r, the regression constant A, and
the unstandardized regression coefficient B) are printed at the top of the plot.
35.4 Input Dataset
The input is a Data file described by an IDAMS dictionary. All analysis and plot filter variables must be
numeric; integer or decimal valued. Variables with decimals are multiplied by a scale factor in order to
obtain integer values. This factor is calculated as 10^n, where n is the number of decimals taken from the
dictionary for V-variables and from the NDEC parameter for R-variables; it is printed for each variable.
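The scaling described above can be sketched in Python as follows. This is an illustration, not IDAMS code; the function name is hypothetical, and the sketch deliberately refuses values that carry more decimals than n, because the manual does not say how such values are handled.

```python
from decimal import Decimal


def scale_to_integer(value: str, ndec: int) -> int:
    """Multiply a decimal value by 10**ndec to obtain an integer.

    Decimal arithmetic is used so that values such as 12.34 scale
    exactly, without binary floating-point error.
    """
    factor = 10 ** ndec
    scaled = Decimal(value) * factor
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {ndec} decimals")
    return int(scaled)


print(scale_to_integer("12.34", 2))  # 1234
print(scale_to_integer("7", 0))      # 7
```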
35.5 Setup Structure
$RUN SCAT
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Plot specifications (repeated as required)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
35.6 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-4 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V21=6 AND V37=5
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: STUDY 600. JULY 16, 1999. AGE BY HEIGHT FOR SUBSAMPLE 3
3. Parameters (mandatory). For selecting program options. New parameters are preceded by an aster-
isk.
Example: BADD=MD2
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
* NDEC=0/n
Number of decimals (maximum 4) to be retained for R-variables.
PRINT=CDICT/DICT
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
4. Plot specifications. One set for each plot. The coding rules are the same as for parameters. Each
plot specification must begin on a new line.
Example: X=V3 Y=R17 FILTER=(V3,1,1)
X=variable number
Variable number of the X variable.
Y=variable number
Variable number of the Y variable.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
FILTER=(variable number, minimum valid code, maximum valid code)
Plot filter. Only those cases where the value of the filter variable is greater than or equal to the
minimum code, and less than or equal to the maximum code, will be entered into the plot. For
example, to specify that only cases with codes 0-40 on variable 6 are to be included, specify:
FILTER=(V6,0,40).
HORIZAXIS=MAXRANGE/X
MAXR Plot the variable with the greatest range along the horizontal axis.
X Always plot the X variable along the horizontal axis.
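The FILTER and HORIZAXIS plot specifications can be illustrated with a short Python sketch. It is not IDAMS code: the function names and the sample cases are hypothetical, and the handling of equal ranges (the X variable is kept on the horizontal axis) is an assumption.

```python
def passes_plot_filter(value, minimum, maximum):
    """FILTER=(var,min,max): both bounds are inclusive."""
    return minimum <= value <= maximum


def horizontal_axis(x_values, y_values, horizaxis="MAXR"):
    """Return 'X' or 'Y', naming the variable placed on the horizontal axis."""
    if horizaxis == "X":
        return "X"
    if horizaxis != "MAXR":
        raise ValueError("HORIZAXIS must be MAXR or X")
    x_range = max(x_values) - min(x_values)
    y_range = max(y_values) - min(y_values)
    # Assumption: on equal ranges the X variable stays horizontal.
    return "X" if x_range >= y_range else "Y"


# Hypothetical cases; FILTER=(V6,0,40) keeps the first two only.
cases = [
    {"V6": 0, "V3": 1, "V21": 150},
    {"V6": 40, "V3": 5, "V21": 180},
    {"V6": 41, "V3": 9, "V21": 900},
]
selected = [c for c in cases if passes_plot_filter(c["V6"], 0, 40)]
xs = [c["V3"] for c in selected]
ys = [c["V21"] for c in selected]
print(len(selected), horizontal_axis(xs, ys))  # 2 Y
```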
35.7 Restrictions
1. Not more than 50 variables can be used in one execution of the program. This maximum includes
everything: X and Y variables, plot filter variables, weight variables and variables used in Recode statements.
2. There is no limit on the number of plots, but SCAT produces only 5 plots for each pass of the input data.
35.8 Example
Generation of two plots (weighted by variable V100 and unweighted) repeated for three different subsets of
data.
$RUN SCAT
$FILES
PRINT = SCAT1.LST
DICTIN = MY.DIC input dictionary file
DATAIN = MY.DAT input data file
$SETUP
GENERATION OF TWO PLOTS REPEATED FOR EACH SUBSET OF DATA
* (default values taken for all parameters)
X=V21 Y=V3 FILTER=(V5,1,2)
X=V21 Y=V3 FILTER=(V5,1,2) WEIGHT=V100
X=V21 Y=V3 FILTER=(V5,3,3)
X=V21 Y=V3 FILTER=(V5,3,3) WEIGHT=V100
X=V21 Y=V3 FILTER=(V5,4,7)
X=V21 Y=V3 FILTER=(V5,4,7) WEIGHT=V100
Chapter 36
Searching for Structure (SEARCH)
36.1 General Description
SEARCH is a binary segmentation procedure used to develop a predictive model for dependent variable(s).
It searches among a set of predictor variables for those predictors which most increase the researcher's ability
to account for the variance or for the distribution of a dependent variable. The question "what dichotomous
split on which single predictor variable will give us a maximum improvement in our ability to predict values
of the dependent variable?", embedded in an iterative scheme, is the basis for the algorithm used in this
program.
SEARCH divides the sample, through a series of binary splits, into a series of mutually exclusive subgroups.
The subgroups are chosen so that, at each step in the procedure, the split into the two new subgroups
accounts for more of the variance or the distribution (reduces the predictive error more) than a split into
any other pair of subgroups.
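For a means analysis, one common way to measure how much of the variance a split accounts for is the between-group sum of squares. The Python sketch below scans the splits of a single predictor whose codes are kept in order and returns the one with the largest between-group sum of squares. It is a generic illustration of binary segmentation under that assumption, not the SEARCH implementation: it ignores weights, MINCASES and EXPL, and the function name is hypothetical.

```python
def best_ordered_split(cases):
    """cases: list of (predictor_code, y) pairs.

    Returns (between_group_ss, left_codes, right_codes) for the best
    split that keeps the predictor codes in order, or None if the
    predictor has fewer than two distinct codes.
    """
    codes = sorted({code for code, _ in cases})
    grand_mean = sum(y for _, y in cases) / len(cases)
    best = None
    for cut in range(1, len(codes)):
        left_codes = set(codes[:cut])
        left = [y for code, y in cases if code in left_codes]
        right = [y for code, y in cases if code not in left_codes]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        bss = (len(left) * (mean_left - grand_mean) ** 2
               + len(right) * (mean_right - grand_mean) ** 2)
        if best is None or bss > best[0]:
            best = (bss, codes[:cut], codes[cut:])
    return best


# Hypothetical data: codes 1-2 have low values, codes 3-4 high values.
data = [(1, 10), (1, 12), (2, 11), (3, 20), (3, 22), (4, 21)]
print(best_ordered_split(data))  # (150.0, [1, 2], [3, 4])
```

Repeating this search over all predictors and all current groups, and splitting the group where the gain is largest, gives the iterative scheme described above.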
SEARCH can perform the following functions:
* Maximize differences in group means, group regression lines, or distributions (maximum likelihood
chi-square criterion).
* Rank the predictors to give them preference in the partitioning.
* Sacrifice explanatory power for symmetry.
* Start after a specified partial tree structure has been generated.
Generating a residuals dataset. Residuals may be computed and output as a data file described by an
IDAMS dictionary. See the Output Residuals Dataset section for details on the content.
36.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. The dependent variable(s) are specified in the parameter DEPVAR, and the predictors are specified
in the parameter VARS on predictor statements.
Transforming data. Recode statements may be used.
Weighting data. A variable can be used to weight the input data; this weight variable may have integer or
decimal values. When the value of the weight variable for a case is zero, negative, missing or non-numeric,
then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data. Cases with missing data in a continuous dependent variable or a covariate
are deleted automatically. Cases with missing data in a categorical dependent variable can be excluded by
using a filter statement or by specifying valid codes with the DEPVAR parameter. Cases with missing data
in the predictor variables are not automatically excluded. However, the filter statement and/or the CODES
parameter may be used for this purpose.
36.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Outliers. (Optional: see the parameter PRINT). Outliers with the ID variable values and the dependent
variable values.
Trace. (Optional: see the parameter PRINT, TRACE and FULLTRACE options). The trace of splits for
each predictor for each split containing: the candidate groups for splitting, the group selected for splitting,
all eligible splits for each predictor, the best split for each predictor and the split-on group.
Analysis summary containing the analysis of variance or distribution, the split summary and the summary
of final groups.
Predictor summary tables. (Optional: see the parameter PRINT, TABLE, FIRST and FINAL options).
The first group tables (PRINT=FIRST), the final group tables (PRINT=FINAL) or all groups tables
(PRINT=TABLE) containing summary of best splits for each predictor for each group. The tables are
printed in reverse group order, i.e. last group first.
Tree diagram. (Optional: see the parameter PRINT). Hierarchical tree diagram. Each node (box) of
the tree contains: group number, number of cases (N), split number, predictor variable number, mean
of dependent variable (for means analysis), and mean of dependent variable and covariate, and slope (for
regression analysis).
36.4 Output Residuals Dataset
Residuals can optionally be output in the form of a data file described by an IDAMS dictionary. (See the
parameter WRITE). For means and regression analysis, and chi-square analysis with multiple dependent
variables, each output record contains: an ID variable, the group variable, dependent variable(s), predicted
(calculated) dependent variable(s), residual(s), and a weight, if any.
For chi-square analysis with one categorical dependent variable, it contains: an ID variable, the group
variable, the first category of the dependent variable, the predicted (calculated) first category of the dependent
variable, the residual for the first category of the dependent variable, the second category of the dependent
variable, the predicted (calculated) second category of the dependent variable, the residual for the second
category of the dependent variable, etc., and a weight, if any.
The characteristics of the output variables are as follows:
                        Variable                       Field   No. of     MD1
                        No.       Name                 Width   Decimals   Code
(ID variable)           1         same as input        *       0          same as input
(group variable)        2         Group variable       3       0          999
(dependent var 1)       3         same as input        *       **         same as input
(predicted var 1)       4         same as input cal    7       ***        9999999
(residual for var 1)    5         same as input res    7       ***        9999999
(dependent var 2)       6         same as input        *       **         same as input
(predicted var 2)       7         same as input cal    7       ***        9999999
(residual for var 2)    8         same as input res    7       ***        9999999
...                     .         ...                  .       ...        ...
(weight-if weighted)    n         same as input        *       **         same as input
* transferred from input dictionary for V variables or 7 for R variables
** transferred from input dictionary for V variables or 2 for R variables
*** 6 plus no. of decimals for dependent variable minus width of dependent variable; if this is
negative, then 0.
If the calculated value or residual exceeds the allocated field width, it is replaced by the MD1 code.
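The number of decimals given to predicted and residual variables (footnote *** above) is a small piece of arithmetic, sketched here in Python. The function name is hypothetical and the widths used in the example are made up.

```python
def residual_decimals(dep_width: int, dep_decimals: int) -> int:
    """Decimals for predicted and residual variables (footnote ***).

    6 plus the number of decimals of the dependent variable minus its
    field width; a negative result is replaced by 0.
    """
    return max(0, 6 + dep_decimals - dep_width)


# Hypothetical dependent variables: (field width, number of decimals).
for width, decimals in [(3, 0), (5, 2), (8, 1)]:
    print(width, decimals, residual_decimals(width, decimals))
# 3 0 3
# 5 2 3
# 8 1 0
```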
36.5 Input Dataset
The input is a data file described by an IDAMS dictionary. All variables used for analysis must be numeric;
they can be integer or decimal valued. The dependent variable may be continuous or categorical. Predictor
variables may be ordinal or categorical. The case ID variable can be alphabetic.
36.6 Setup Structure
$RUN SEARCH
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Predictor specifications
5. Predefined split specifications (optional)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output residuals dictionary
DATAyyyy output residuals data
PRINT results (default IDAMS.LST)
36.7 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-5 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V3=5
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: SEARCHING FOR STRUCTURE
3. Parameters (mandatory). For selecting program options.
Example: DEPV=V5
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
ANALYSIS=MEAN/REGRESSION/CHI
MEAN Means analysis.
REGR Regression analysis.
CHI Chi-square analysis. With a single dependent variable, the default list of codes 0-9 will
be used and no missing data verification will be made.
DEPVAR=variable number/(variable list)
The dependent variable or variables. Note that a list of variables can be provided only when
ANALYSIS=CHI is specified.
No default.
CODES=(list of codes)
A list of codes may only be supplied for ANALYSIS=CHI and one dependent variable. Note that
in this case no missing data verification is made for the dependent variable and only cases with
the codes listed are used in analysis.
COVAR=variable number
The covariate variable number. Must be supplied for ANALYSIS=REGR.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
MINCASES=25/n
Minimum number of cases in one group.
MAXPARTITIONS=25/n
Maximum number of partitions.
SYMMETRY=0/n
The amount of explanatory power one is willing to lose in order to have symmetry, expressed as
a percentage.
EXPL=0.8/n
Minimum increase in explanatory power required for a split, expressed as a percentage.
OUTDISTANCE=5/n
Number of standard deviations from the parent-group mean defining an outlier. Note that outliers
are reported if PRINT=OUTL is specied, but they are not excluded from analysis.
IDVAR=variable number
Variable to be output with residuals and/or printed with each case classified as an outlier.
WRITE=RESIDUALS/CALCULATED/BOTH
Residuals and/or calculated values are to be written out as an IDAMS dataset.
RESI Output residual values only.
CALC Output calculated values only.
BOTH Output both calculated values and residuals.
OUTFILE=OUT/yyyy
Applicable only if WRITE specified.
A 1-4 character ddname suffix for the residuals output dictionary and data files.
Default ddnames: DICTOUT, DATAOUT.
PRINT=(CDICT/DICT, TRACE, FULLTRACE, TABLE, FIRST, FINAL, TREE, OUTLIERS)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
TRAC Print the trace of splits for each predictor for each split.
FULL Print the full trace of splits for each predictor, including eligible but suboptimal splits.
TABL Print the predictor summary tables for all the groups.
FIRS Print the predictor summary tables for the first group.
FINA Print the predictor summary tables for the final groups.
TREE Print the hierarchical tree diagram.
OUTL Print the outliers with ID variable and dependent variable values.
4. Predictor specifications (mandatory). Supply one set of parameters for each group of predictors
which may be described with the same parameter values. The coding rules are the same as for
parameters. Each predictor specification must begin on a new line.
Example: VARS=(V8,V9) TYPE=F
VARS=(variable list)
Predictor variables to which the other parameters apply.
No default.
TYPE=M/F/S
The predictor constraint.
M Predictors are considered to be monotonic, i.e. the codes of the predictors are to be
kept adjacent during the partition scan.
F Predictor codes are considered to be free.
S Predictor codes will be selected and separated from the remaining codes in forming
trial partitions.
CODES=(0-9)/maxcode/(list of codes)
Either the value of the largest acceptable code or a list of acceptable codes. The codes may range
from 0 to 31. Cases with codes outside the range 0 to 31 are always discarded.
RANK=n
Assigned rank. If ranking is desired, assign a predictor rank of 0 to 9. A zero rank indicates that
statistics are to be computed for the predictors, but they are not to be used in the partitioning.
5. Predefined split specifications (optional). If predefined splits are desired, supply one set of
parameters for each predefined split. The coding rules are the same as for parameters. Each predefined split
specification must begin on a new line.
Example: GNUM=1 VAR=V18 CODES=(1-3)
GNUM=n
Number of the group to be split. Groups are specified in ascending order, where the entire original
sample is group 1. Each set of parameters forms two new groups.
No default.
VAR=variable number
Predictor variable used to make the split.
No default.
CODES=(list of codes)
List of the predictor codes defining the first subgroup. All other codes will belong to the second
subgroup.
No default.
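The predictor constraint TYPE limits which trial partitions of a predictor's codes are considered. The Python sketch below enumerates candidate two-way partitions under one reading of the three constraints. It is an assumption-laden illustration, not the SEARCH algorithm: M is taken to mean a cut between adjacent codes, S to mean one code separated from all others, and F to mean any division of the codes into two non-empty sets. The function name is hypothetical.

```python
from itertools import combinations


def candidate_splits(codes, ptype):
    """Enumerate (first_subgroup, second_subgroup) code partitions.

    Assumed readings of TYPE:
      M  codes stay adjacent, so only cuts in the ordered code list
      S  a single code is separated from all remaining codes
      F  any division of the codes into two non-empty sets
    """
    codes = sorted(set(codes))
    if len(codes) < 2:
        return []
    if ptype == "M":
        return [(codes[:i], codes[i:]) for i in range(1, len(codes))]
    if ptype == "S":
        return [([c], [d for d in codes if d != c]) for c in codes]
    if ptype == "F":
        # Fix the lowest code in the first subgroup so that each
        # partition is listed once rather than twice.
        first, rest = codes[0], codes[1:]
        splits = []
        for size in range(len(rest)):
            for extra in combinations(rest, size):
                left = [first, *extra]
                right = [c for c in codes if c not in left]
                splits.append((left, right))
        return splits
    raise ValueError("TYPE must be M, F or S")


for ptype in ("M", "S", "F"):
    print(ptype, len(candidate_splits([1, 2, 3, 4], ptype)))
# M 3
# S 4
# F 7
```

The counts show why the constraint matters: with k codes the ordered scan has k-1 candidates, while the unconstrained enumeration has 2^(k-1) - 1.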
36.8 Restrictions
1. Minimum number of cases required is 2 * MINCASES.
2. Maximum number of predictors is 100.
3. Maximum predictor value is 31.
4. Maximum number of categorical variable codes is 400.
5. Maximum number of predefined splits is 49.
6. If the ID variable is alphabetic with width > 4, only the first four characters are used.
36.9 Examples
Example 1. Means analysis with five predictor variables; minimum of 10 cases per group are requested;
outliers of more than 3 standard deviations from the parent group mean are reported; cases are identified
by the variable V1.
$RUN SEARCH
$FILES
PRINT = SEARCH1.LST
DICTIN = STUDY.DIC input dictionary file
DATAIN = STUDY.DAT input data file
$SETUP
MEANS ANALYSIS - FIVE PREDICTOR VARIABLES
DEPV=V4 MINC=10 OUTD=3 IDVAR=V1 PRINT=(TRACE,TREE,OUTL)
VARS=(V3-V5,V12)
VARS=V21 TYPE=F CODES=(1-4)
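The outlier rule requested with OUTD=3 in Example 1 can be sketched in Python as follows. This is an illustration only: the function name and data are hypothetical, and the population standard deviation is used because the manual does not state which form of standard deviation SEARCH applies.

```python
import statistics


def find_outliers(cases, outdistance=5.0):
    """cases: list of (case_id, y) pairs belonging to one parent group.

    Returns the cases lying more than `outdistance` standard deviations
    from the group mean. Assumption: population standard deviation.
    """
    values = [y for _, y in cases]
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(cid, y) for cid, y in cases if abs(y - mean) > outdistance * sd]


# Hypothetical parent group: ten cases with value 10 and one with value 100.
group = [(i, 10) for i in range(1, 11)] + [(11, 100)]
print(find_outliers(group, outdistance=3))  # [(11, 100)]
```

As the manual notes for OUTDISTANCE, such cases are only reported; they remain in the analysis.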
Example 2. Regression analysis with six predictor variables; residuals and calculated values are to be
computed and written into a dataset (cases are identified by variable V2).
$RUN SEARCH
$FILES
PRINT = SEARCH2.LST
DICTIN = STUDY.DIC input dictionary file
DATAIN = STUDY.DAT input data file
DICTOUT = RESID.DIC dictionary file for residuals
DATAOUT = RESID.DAT data file for residuals
$SETUP
REGRESSION ANALYSIS - SIX PREDICTOR VARIABLES
ANAL=REGR DEPV=V12 COVAR=V7 MINC=10 IDVAR=V2 -
WRITE=BOTH PRINT=(TRACE,TABLE,TREE)
VARS=(V3-V5,V18)
VARS=V22 TYPE=F
Example 3. Chi analysis with one dependent categorical variable and selected codes; the first two splits
are predefined.
$RUN SEARCH
$FILES
DICTIN = STUDY.DIC input dictionary file
DATAIN = STUDY.DAT input data file
$SETUP
CHI ANALYSIS - ONE DEPENDENT CATEGORICAL VARIABLE, PREDEFINED SPLITS
ANAL=CHI DEPV=V101 CODES=(1-5) MINC=5 PRINT=(FINAL,TREE)
VARS=(V3,V8) TYPE=S
GNUM=1 VAR=V8 CODES=3
GNUM=2 VAR=V3 CODES=(1,2)
Chapter 37
Univariate and Bivariate Tables
(TABLES)
37.1 General Description
The main use of TABLES is to obtain univariate or bivariate frequency tables with optional row, column
and corner percentages and optional univariate and bivariate statistics. Tables of mean values of a variable
can also be obtained.
Both univariate/bivariate tables and bivariate statistics can be output to a file so that they can be used with
a report generating program, or can be input to GraphID or other packages such as EXCEL for graphical
display.
Univariate tables. Both univariate frequencies and cumulative univariate frequencies may be generated
for any number of input variables and may also be expressed as percentages of the weighted or unweighted
total frequency. In addition, the mean of a cell variable can be obtained.
Bivariate tables. Any number of bivariate tables may be generated. In addition to the weighted and/or
unweighted frequencies, a table may contain frequencies expressed as percentages based on the row marginals,
column marginals or table total, and the mean of a cell variable. These various items may be placed in a
single table with a possible six items per cell, or each may be obtained as a distinct table.
Univariate statistics. For univariate analyses, the following statistics are available: mean, mode, median,
variance (unbiased), standard deviation, coefficient of variation, skewness and kurtosis. A quantile option
(NTILE) is also available. Division into as few as three parts or as many as ten parts may be requested.
Bivariate statistics. For bivariate analyses, the following statistics can be requested:
- t-tests of means (assumes independent populations) between pairs of rows,
- chi-square, contingency coefficient and Cramer's V,
- Kendall's Taus, Gamma, Lambdas,
- S (numerator of the tau statistics and of gamma), its standard and normal deviations, and its variance,
- Spearman rho,
- Evidence Based Medicine (EBM) statistics,
- non-parametric tests: Wilcoxon, Mann-Whitney and Fisher.
Matrices of statistics. Matrices of any of the above bivariate statistics except tests, EBM statistics or
statistics of S can be printed or written to a file. Corresponding matrices of weighted and/or unweighted n's
can be produced.
3- and 4-way tables. These can be constructed by making use of the repetition and subsetting features.
The repetition variable can be thought of as a control or panel variable. The subsetting feature can be used
to further select cases for a particular group of tables.
Tables of sums. Tables in which the cells contain the sum of a dependent variable can be produced by
specifying the dependent variable as the weight. E.g. specify WEIGHT=V208, where V208 represents a
respondent's income, in order to get the total income of all respondents falling into a cell.
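Why weighting by the dependent variable yields cell sums can be seen in a short Python sketch. It is not IDAMS code: the function name, the row and column variables V1 and V2, and the data are hypothetical; V208 plays the role of income as in the text.

```python
from collections import defaultdict


def weighted_table(cases, row, col, weight=None):
    """Bivariate table whose cells hold the sum of the weights.

    With weight=None every case counts as 1 (plain frequencies).
    Cases with a zero or negative weight are skipped, as for any
    IDAMS weight variable.
    """
    table = defaultdict(float)
    for case in cases:
        w = 1.0 if weight is None else case[weight]
        if w <= 0:
            continue
        table[(case[row], case[col])] += w
    return dict(table)


respondents = [
    {"V1": 1, "V2": 1, "V208": 1000},
    {"V1": 1, "V2": 1, "V208": 2500},
    {"V1": 2, "V2": 1, "V208": 400},
]
print(weighted_table(respondents, "V1", "V2"))
# {(1, 1): 2.0, (2, 1): 1.0}
print(weighted_table(respondents, "V1", "V2", weight="V208"))
# {(1, 1): 3500.0, (2, 1): 400.0}
```

Note that, under the weight rule, respondents with a zero or negative value on the weight variable drop out of such a table.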
Note. The following options are available to control the appearance of the results:
A title may be specified for each set of tables.
Percentages and mean values, if requested, may be printed in separate tables.
The grid can be suppressed.
Rows which have no entries in a particular section of a large frequency table can be printed;
tables with more than ten columns are printed in sections and the use of this zero rows option
ensures that the various sections have the same number of rows (which is important if they are
to be cut and pasted together).
37.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. In addition, local filters and repetition factors (called subset specifications) may be used to select a
subset of cases for a particular table. For tables which are individually specified, the variable(s) to be used
for the table are selected with the table specification parameters R and C. For sets of tables, variables are
selected with the table specification parameters ROWVARS and COLVARS.
Transforming data. Recode statements may be used. Note that for R-variables, the number of decimals
to be retained is specified by the NDEC parameter.
Weighting data. A weight variable may optionally be specified for each set of tables. Both V- and R-
variables with decimal places are multiplied by a scale factor in order to obtain integer values. See Input
Dataset section below. When the value of the weight variable for a case is zero, negative, missing or
non-numeric, then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data.
1. The MDVALUES parameter is available to indicate which missing data values, if any, are to be used
to check for missing data.
2. Univariate and bivariate frequencies are always printed for all codes in the data whether or not they
represent missing data. To remove missing data from tables completely, a filter or a subset can be
specified. Alternatively, appropriate minimum and/or maximum values of the row and column variables can
be defined.
3. Cases with missing data may optionally be included in the computation of percentages and bivariate
statistics. This can be done using the MDHANDLING table parameter.
4. Cases with missing data on a cell variable are always excluded from univariate and bivariate tables.
5. Cases with missing data are always excluded from the computation of univariate statistics.
37.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
A table of contents for the results. The contents shows each table produced and gives the page number
where it is located. The following information is provided:
- row and column variable numbers (0 if none)
- variable number for the mean value - cell variable (0 if none)
- weight variable number (0 if none)
- row minimum and maximum values (0 if none)
- column minimum and maximum values (0 if none)
- filter name and repetition factor name
- percentages: row, column and total (T=requested, F=not requested)
- RMD: row-variable missing data (T=delete, F=do not delete)
- CMD: column-variable missing data (T=delete, F=do not delete)
- CHI: chi-square (T=requested, F=not requested)
- TAU: tau a, b or c (T=requested, F=not requested)
- GAM: gamma (T=requested, F=not requested)
- TEE: t-tests (T=requested, F=not requested)
- EXA: Fisher non-parametric test (T=requested, F=not requested)
- WIL: Wilcoxon non-parametric test (T=requested, F=not requested)
- MW: Mann-Whitney non-parametric test (T=requested, F=not requested)
- SPM: Spearman rho (T=requested, F=not requested)
- EBM: Evidence Based Medicine statistics (T=requested, F=not requested).
Tables which were requested using the PRINT=MATRIX or WRITE=MATRIX table parameters are not
listed in the contents and are always printed first with negative page and table numbers.
Other tables are printed in the order of the table specifications except for tables for which only univariate
statistics are requested; these are always grouped together and printed last.
Bivariate tables. Each bivariate table starts on a new page; a large table may take more than one page.
Tables are printed with up to 10 columns and up to 16 rows per page depending on the number of items in
each cell. Columns and rows are printed only for codes which actually appear in the data. Row and column
totals, and cumulative marginal frequencies and percentages if requested, are printed around the edges of
the table.
A large table is printed in vertical strips. For example, a table with 40 row codes and 40 column codes would
normally be printed on 12 pages as indicated in the following diagram, where the numbers in the cells show
the order in which the pages are printed:
                  1st   2nd   3rd   4th
                   10    10    10    10  codes
1st 16 codes        1     4     7    10
2nd 16 codes        2     5     8    11
last 8 codes        3     6     9    12
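The page order in the diagram can be reproduced with a short calculation. The following Python sketch is illustrative only and is not part of IDAMS; the function name is hypothetical, and it assumes the maximum page capacity of 16 rows and 10 columns (the actual capacity depends on the number of items in each cell).

```python
import math

def strip_page_order(n_row_codes, n_col_codes, rows_per_page=16, cols_per_page=10):
    """Return a grid whose [i][j] entry is the print order of the page that
    holds row block i of column strip j (each strip is printed top to bottom)."""
    row_blocks = math.ceil(n_row_codes / rows_per_page)
    col_strips = math.ceil(n_col_codes / cols_per_page)
    return [[j * row_blocks + i + 1 for j in range(col_strips)]
            for i in range(row_blocks)]

# A table with 40 row codes and 40 column codes, as in the example above.
for row in strip_page_order(40, 40):
    print(row)
```

For the 40 by 40 table this gives 3 row blocks in each of 4 strips, i.e. 12 pages.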
Bivariate statistics. (Optional: see the table parameter STATS).
t-tests. (Optional: see the table parameter STATS). If t-tests were requested, they and the means and
standard deviations of the column variable for each row are printed on a separate page.
Matrices of bivariate statistics. (Optional: see the table parameter PRINT). The lower-left corner of
the matrix is printed. Eight columns and 25 rows are printed per page.
Matrix of Ns. (Optional: see the table parameter PRINT). This is printed in the same format as the
corresponding statistical matrix.
Univariate tables. (Optional: see the table parameter CELLS). Normally each univariate table is printed
beginning on a new page. Frequencies, percents and mean values of a variable, if requested, for ten codes
are printed across the page.
Univariate statistics. (Optional: see the table parameter USTATS).
Quantiles. (Optional: see the table parameter NTILE). N-1 points are printed; e.g. if quartiles are
requested, the parameter NTILE is set to 4 and 3 breakpoints will be printed.
Page numbers. These are of the form: ttt.rr.ppp where
ttt = table number
rr = repetition number (00 if no repetition used)
ppp = page number within the table.
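When results are post-processed outside IDAMS, a page label of this form can be split into its three parts. The Python sketch below is illustrative only; the function name and the sample label are hypothetical, and zero-padded fields are an assumption.

```python
def parse_page_number(label):
    """Split a 'ttt.rr.ppp' page label into table, repetition and page numbers."""
    table, repetition, page = label.split(".")
    return {"table": int(table), "repetition": int(repetition), "page": int(page)}

# Hypothetical label: table 5, repetition 2, first page of the table.
print(parse_page_number("005.02.001"))
```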
37.4 Output Univariate/Bivariate Tables
Univariate and/or bivariate tables with statistics requested in the table parameter CELLS may be output
to a file by specifying WRITE=TABLES. The tables are in the format of an IDAMS rectangular matrix (see
Data in IDAMS chapter). One matrix is output for each statistic requested. If a repetition factor is used,
one matrix is output for each repetition.
Columns 21-80 on the matrix-descriptor record contain additional description of the matrix as follows:
21-40 Row variable name (for bivariate tables).
41-60 Column variable name.
61-80 Description of the values in the matrix.
Variable identification records (#R and #C) contain code values and code labels for the row and the column
variable respectively.
The statistics are written as 80-character records according to a 7F10.2 Fortran format. Columns 73-80
contain an ID as follows:
73-76 Identification of the statistic: FREQ, UNFR, ROWP, COLP, TOTP or MEAN.
77-80 Table number.
Note that the missing data codes are not included in the matrix.
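A program reading this file outside IDAMS could decode each statistics record by column position. The Python sketch below is illustrative only; the function name is hypothetical. It assumes that unused trailing value fields are left blank and that the table number in columns 77-80 is a plain integer.

```python
def parse_table_record(record):
    """Decode one 80-character record written with Fortran format 7F10.2.

    Columns 1-70 hold up to seven 10-character numeric fields,
    columns 73-76 the statistic ID and columns 77-80 the table number.
    """
    record = record.ljust(80)
    values = []
    for start in range(0, 70, 10):
        field = record[start:start + 10].strip()
        if field:  # assumption: unused trailing fields are blank
            values.append(float(field))
    return values, record[72:76].strip(), int(record[76:80])

# Build a sample record with three frequencies for table 3 (made-up values).
sample = "".join(f"{v:10.2f}" for v in (12.0, 7.5, 0.0)).ljust(72) + "FREQ" + "   3"
print(parse_table_record(sample))
```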
37.5 Output Bivariate Statistics Matrices
Selected statistics may be output to a file. If, for example, gamma and tau b were selected, a matrix
of gammas and a separate matrix of tau b values would be generated. Output matrices of bivariate statistics
are requested by specifying WRITE=MATRIX and either the ROWVARS table parameter alone or both the
ROWVARS and COLVARS table parameters. If a repetition factor is used, one matrix is output for each repetition. The matrices are in the
format of IDAMS square or rectangular matrices (see Data in IDAMS chapter). The values in the matrix
are written with Fortran format 6F11.5. Columns 73-80 contain an ID as follows:
73-76 Identification of the statistic: TAUA, TAUB, TAUC, GAMM, LSYM, LRD, LCD, CHI, CRMV
or RHO.
77-80 Table number.
Note. If only ROWVARS is provided, dummy means and standard deviations records are written, 2 records
per 60 variables. The second format (#F) record in the dictionary specifies a format of 60I1 for these dummy
records. This is so that the matrix conforms to the format of an IDAMS square matrix.
37.6 Input Dataset
The input is a data file described by an IDAMS dictionary. With the exception of variables used in the main
filter, all the other variables used must be numeric.
In distributions and weights, variables (both V and R) with decimal places are multiplied by a scale factor
in order to obtain integer values. The scale factor is calculated as $10^n$, where n is the number of decimals
taken from the dictionary for V-variables and from the NDEC parameter for R-variables; it is printed for
each variable.
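The effect of the scale factor can be illustrated with a short calculation. The Python sketch below is illustrative only; the function name is hypothetical, and rounding the product is an assumption made here to guard against floating-point error, not a documented IDAMS behaviour.

```python
def scale_to_integer(value, n_decimals):
    """Multiply a value by the scale factor 10**n_decimals.

    Returns the resulting integer together with the scale factor.
    """
    scale = 10 ** n_decimals
    return round(value * scale), scale

# A variable with 2 decimal places: 2.75 is tabulated as 275 with scale factor 100.
print(scale_to_integer(2.75, 2))
```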
Univariate statistics without distributions are calculated using the number of decimals specified in the
dictionary for V-variables and taken from the NDEC parameter for R-variables.
Fields containing non-numeric characters (including fields of blanks) can be tabulated by setting the parameter
BADDATA to MD1 or MD2. See The IDAMS Setup File chapter.
37.7 Setup Structure
$RUN TABLES
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
4. Subset specifications (optional)
5. TABLES
6. Table specifications (repeated as required)
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
FT02 output tables/matrices
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
PRINT results (default IDAMS.LST)
37.8 Program Control Statements
Refer to The IDAMS Setup File chapter for further descriptions of the program control statements, items
1-3 and 6 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V3=6
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: FREQUENCY TABLES
3. Parameters (mandatory). For selecting program options. New parameters are preceded by an aster-
isk.
Example: BADDATA=SKIP
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
* NDEC=0/n
Number of decimals (maximum 4) to be retained for R-variables.
PRINT=(CDICT/DICT, TIME)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
TIME Print the time after each table.
4. Subset specifications (optional). These statements permit selection of a subset of cases for a table
or set of tables.
Example: CLASS INCLUDE V8=1,2,3,-7,9
There are two types of subset specifications: local filters and repetition factors. Each has a different
function, but their formats are very similar. One specification may be used as a local filter for one or
more tables and as a repetition factor for other tables.
Rules for coding
Prototype: name statement
name
Subset name. 1-8 alphanumeric characters beginning with a letter. This name must match
exactly the name used on subsequent analysis specifications. Embedded blanks are not allowed.
It is recommended that all names be left-justified.
statement
Subset definition which follows the syntax of the standard IDAMS filter statement.
For repetition factors, only one variable may be specified in the expression.
The way local filters and repetition factors work is described below.
Local filters. A subset specification is identified as a local filter for a table or set of tables by
specifying the subset name with the FILTER parameter. The local filter operates in the same manner
as the standard filter except that it applies only to the table specification(s) in which it is referenced.
Example: EDUCATN INCLUDE V4=0-4,9 AND V5=1
(subset name) (expression)
In the example above, if EDUCATN is designated as a local filter on the table specification, the table
would be produced including only cases coded 0, 1, 2, 3, 4 or 9 for V4 and 1 for V5.
Repetition factors. A subset specification is identified as a repetition factor for a table or set of
tables by specifying the subset name with the REPE parameter. Only one variable may be given on
a subset specification to be used as a repetition factor. Repetition factors permit the generation of
3-way tables where the variable used in the repetition factor can be considered as the control or panel
variable. Using a repetition factor and a filter, 4-way tables may be produced.
INCLUDE expressions cause tables to be produced including cases for each value or range of values of
the control variable used in the expression. Commas separate the values or ranges. Thus if there are
n commas in the expression, n+1 tables will be produced.
Example: EDUCATN INCLUDE V4=0-4,9
(subset name) (expression)
In the above example, if EDUCATN is designated as a repetition factor, two tables will result: one
including cases coded 0-4 for variable 4, and another including cases coded 9 for variable 4.
EXCLUDE may be used to produce tables with all values except those specified.
Example: EDUCATN EXCLUDE V1=1,4
(subset name) (expression)
In the above example, if EDUCATN is designated as a repetition factor, two tables will result: one
including all values except 1 and another including all values except 4.
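The rule that n commas in a repetition factor expression give n+1 tables can be checked with a small sketch. The Python code below is illustrative only; the function name is hypothetical, and it handles just the one-variable value list (the part after INCLUDE or EXCLUDE), not the full IDAMS filter syntax.

```python
def repetition_groups(expression):
    """Split a one-variable subset expression such as 'V4=0-4,9'.

    Returns the variable name and one group per comma-separated value or
    range, so an expression with n commas yields n + 1 groups (tables).
    """
    variable, values = expression.split("=", 1)
    return variable.strip(), [v.strip() for v in values.split(",")]

variable, groups = repetition_groups("V4=0-4,9")
print(variable, groups, len(groups))  # one table per group
```

With INCLUDE, each group selects the cases for one table; with EXCLUDE, each group lists the values left out of one table.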
5. TABLES. The word TABLES on this line signals that table specifications follow. It must be included
(in order to separate subset specifications from table specifications) and must appear only once.
6. Table specifications. Table specifications are used to describe the characteristics of the tables to be
produced. The coding rules are the same as for parameters. Each set of table specifications must start
on a new line.
Examples:
R=(V6,1,8) CELLS=FREQS                  (One univariate table).
R=(V6,1,8) C=(V9,0,4) -                 (One bivariate table with repetition
REPE=SEX CELLS=(ROWP,FREQS)             factor, i.e. 3-way table).
ROWV=(V5-V9) CELLS=FREQS USTA=MEAN      (Set of univariate tables).
ROWV=(V3,V5) COLV=(V21-V31) -           (Set of bivariate tables).
R=(0,1,8) C=(0,1,99)
ROWVARS=(variable list)
List of variables for which univariate tables are required or to be used as the rows in bivariate
tables.
COLVARS=(variable list)
List of variables to be used as columns for bivariate tables.
R=(var, rmin, rmax)
var Row or univariate variable number for a single table. To supply minimum and max-
imum values for a set of tables, set the variable number to zero, e.g. R=(0,1,5); in
this case the minimum and maximum codes apply to all variables in the ROWVARS
parameter.
rmin Minimum code of the row variable(s) for statistical and percent calculations.
rmax Maximum code of the row variable(s) for statistical and percent calculations.
If either rmin or rmax is specified, both must be specified. If only the variable number is specified,
minimum and maximum values are not applied.
C=(var, cmin, cmax)
var Column variable number for a single bivariate table. To supply minimum and max-
imum values for a set of tables, set the variable number to zero, e.g. C=(0,2,5); in
this case, the minimum and maximum codes apply to all variables in the COLVARS
parameter.
cmin Minimum code of the column variable(s) for statistical and percent calculations.
cmax Maximum code of the column variable(s) for statistical and percent calculations.
If either cmin or cmax is specified, both must be specified. If only the variable number is specified,
minimum and maximum values are not applied.
TITLE=table title
Title to be printed at the top of each table in this set.
Default: No table title.
CELLS=(ROWPCT, COLPCT, TOTPCT, FREQS/NOFREQS, UNWFREQS, MEAN)
Contents of cells for tables when PRINT=TABLES or WRITE=TABLES is specified.
ROWP Percentages for univariate tables or percentages based on row totals for bivariate tables.
COLP Percentages based on column totals in bivariate tables.
TOTP Percentages based on grand total in bivariate tables.
FREQ Weighted frequency counts (same as unweighted if WEIGHT is not specified).
UNWF Unweighted frequency counts.
MEAN Mean of the variable specified by VARCELL.
VARCELL=variable number
Variable number of the variable for which mean value is to be computed for each cell in the table.
MDHANDLING=ALL/R/C/NONE
Indicates which missing data values should be excluded from statistics and percent calculations.
ALL Delete all missing data values.
R Delete missing data values of row-variables.
C Delete missing data values of column-variables.
NONE Do not delete missing data. Note: missing data cases are always excluded from uni-
variate statistics.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
FILTER=xxxxxxxx
The 1-8 character name of the subset specification to be used as a local filter. Enclose the name
in primes if it contains any non-alphanumeric characters. If the name does not match any
subset specification, the table will be skipped. Upper case letters should be used in order to match
the name on the subset specification, which is automatically converted to upper case.
REPE=xxxxxxxx
The 1-8 character name of the subset specification to be used as a repetition factor. Enclose
the name in primes if it contains any non-alphanumeric characters. If the name does not match
any subset specification, the table will be skipped. Tables will be repeated for each group
of cases specified. Upper case letters should be used in order to match the name on the subset
specification, which is automatically converted to upper case.
USTATS=(MEANSD, MEDMOD)
(Univariate tables only).
MEAN Print mean, minimum, maximum, variance (unbiased), standard deviation, coefficient
of variation, skewness, kurtosis, weighted and unweighted total number of cases.
MEDM Print median and mode (if there are ties, numerically smallest value is selected).
NTILE=n
(Univariate tables only).
The n is the number of quantiles to be calculated; it must be in the range 3-10.
STATS=(CHI, CV, CC, LRD, LCD, LSYM, SPMR, GAMMA, TAUA, TAUB, TAUC, EBMSTAT,
WILC, MW, FISHER, T)
If any bivariate statistics are to be printed or output, supply the STATS parameter with each of
the statistics desired.
Bivariate tables and matrix output
CHI Chi-square. (If MATRIX is not requested, the selection of CHI, CV or CC will cause
all three to be computed).
CV Cramér's V.
CC Contingency coefficient.
LRD Lambda, row variable is the dependent variable. (If MATRIX is not requested, the
selection of any of the lambdas will cause all three to be computed).
LCD Lambda, column variable is the dependent variable.
LSYM Lambda, symmetric.
SPMR Spearman rho statistic.
GAMM Gamma statistic.
TAUA Tau a statistic. (If MATRIX is not requested, the selection of any of the three taus
will cause all three to be computed).
TAUB Tau b statistic.
TAUC Tau c statistic.
Bivariate tables only
EBMS Evidence Based Medicine statistics.
WILC Wilcoxon signed ranks test.
MW Mann-Whitney test.
FISH Fisher exact test.
T t-tests between all combinations of rows, up to a limit of 50 rows.
DECPCT=2/n
Number of decimals, maximum 4, printed for percentages.
DECSTATS=2/n
Number of decimals printed for mean, median, taus, gamma, lambdas, and chi-square statistics.
All other statistics will be printed with 2+n decimals (i.e. default of 4).
WRITE=MATRIX/TABLES
If an output file is to be generated, supply the WRITE parameter and the type of output.
MATR Output the matrices of selected statistics.
If the ROWVARS parameter is specified, produce a square matrix for each statistic
requested by the STATS parameter using all pairings of the variables appearing in the
list.
If the ROWVARS and COLVARS parameters are specified, produce a rectangular matrix
for each statistic requested by the STATS parameter using each variable appearing
in the ROWVARS list paired with each variable appearing in the COLVARS list.
TABL Output the tables of statistics requested with the CELLS parameter.
PRINT=(TABLES/NOTABLES, SEPARATE, ZEROS, CUM, GRID/NOGRID,
N, WTDN, MATRIX)
Options relevant to univariate/bivariate tables only.
TABL Print tables with items specified by CELLS.
SEPA Print each item specified in CELLS as a separate table.
ZERO Keep rows with zero marginals in results. (Applicable only if table has more than 10
columns and hence must be printed in strips).
CUM Print cumulative row and column marginal frequencies and percentages. If data are
weighted, figures are computed on weighted frequencies only.
GRID Print grid around cells of bivariate tables.
NOGR Suppress grid around cells of bivariate tables.
Options relevant to WRITE=MATRIX only.
N Print matrix of n's for matrices of statistics requested.
WTDN Print matrix of weighted n's for matrices of statistics requested.
MATR Print matrices of statistics specified under STATS.
37.9 Restrictions
1. The maximum number of variables for univariate frequencies is 400.
2. The combination of variables and subset specifications is subject to the restriction:
5NV + 107NF < 8499
where NF is the number of subset specifications and NV is the number of variables.
3. Code values for univariate tables must be in the range -2,147,483,648 to 2,147,483,647.
4. Code values for bivariate tables must be in the range -32,768 to 32,767. Any code values outside
this range are automatically recoded to the end points of the range, e.g. -40,000 will become -32,768
and 40,000 will become 32,767. Thus, on the bivariate table specification, 32,767 is the largest value
that can be given as a maximum. (Note that a 5-digit variable with a missing data code of 99999 will have the
missing data row labeled 32,767 on the results).
5. The maximum cumulative weighted or unweighted frequency for a table (and for any cell, row or
column) is 2,147,483,647.
6. Table dimension maximums.
Bivariate: 500 row codes, 500 column codes, 3000 cells with non-zero entities.
Univariate: 3000 categories if frequencies, median/mode requested; otherwise, unlimited.
Note: For a variable such as income, if there are more than 3000 unique income values, one
cannot get a median or mode without first bracketing the variable.
7. Non-integer V-variable values in distributions and in weights are treated as if the decimal point were
absent; a scale factor is printed for each variable.
8. t-tests of means between rows are performed only on the first 50 rows of a table.
9. For bivariate statistical matrix output, the maximum number of variables that may be requested for a
row or column is 95.
10. If output files for tables and matrices are both requested, these are output to the same physical file.
11. There is no way of labelling rows and columns of tables when recoded variables are used.
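Restrictions 2 and 4 can be checked before a setup is run. The Python sketch below is illustrative only and is not part of IDAMS; both function names are hypothetical.

```python
def within_subset_limit(n_variables, n_subsets):
    """Restriction 2: 5*NV + 107*NF must be less than 8499."""
    return 5 * n_variables + 107 * n_subsets < 8499

def clamp_bivariate_code(code):
    """Restriction 4: codes outside -32,768..32,767 are recoded to the end points."""
    return max(-32768, min(32767, code))

print(within_subset_limit(400, 10))   # 5*400 + 107*10 = 3070
print(clamp_bivariate_code(99999))    # a 5-digit missing data code
```

For 400 variables the inequality holds for up to 60 subset specifications (2000 + 6420 = 8420) and fails at 61 (2000 + 6527 = 8527).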
37.10 Example
In the example below, the following tables are requested:
1. Frequency counts for variables V201-V220.
2. Univariate statistics with no frequency tables for variables V54-V62 and V64. Means will have 1
decimal and other statistics 3 decimals.
3. Weighted and unweighted frequency counts and percentages with cumulative frequencies and percent-
ages for variables V25-V30 and a grouped version of variable V7. Missing data cases are not to be
excluded from the percentages or statistics. Median and mode statistics requested.
4. For the categories of the single variable V201, frequency counts and the mean of variable V54.
5. 8 bivariate tables (with row variables V25-V28 and column variables V29, V30) repeated by values 1
and 2 of variable V10 (sex), i.e. with sex as a panel (control) variable. Counts, row, column and total
percentages will be in each cell. Chi-square and Taus statistics requested.
6. 3-way tables, using region (V3) grouped into 3 categories as the panel variable. Tables are restricted
to male cases only (V10=1). Frequency counts and mean of variable V54 will appear in each cell.
7. A single weighted frequency count table, excluding cases where either the row variable and/or the
column variable take the value 9.
8. Matrices of Tau A and Gamma statistics to be printed and written to a le for all pairs of variables
V54-V62. A matrix of counts of valid cases for each pair of variables will also be printed.
$RUN TABLES
$FILES
PRINT = TABLES.LST
FT02 = TREE.MAT matrices of statistics
DICTIN = TREE.DIC input Dictionary file
DATAIN = TREE.DAT input Data file
$RECODE
R7=BRAC(V7,0-15=1,16-25=2,26-35=3,36-45=4,46-98=5,99=9)
NAME R7'GROUPED V7'
$SETUP
TABLE EXAMPLES
BADDATA=MD1
MALE INCLUDE V10=1
SEX INCLUDE V10=1,2
REGION INCLUDE V3=1-2,3-4,5
MD EXCLUDE V19=9 OR V52=9
TABLES
1. ROWV=(V201-V220) TITLE='Frequency counts'
2. ROWV=(V54-V62,V64) USTATS=MEANSD PRINT=NOTABLES DECSTAT=1
3. ROWV=(V25-V30,R7) USTATS=MEDMOD CELLS=(FREQS,UNWFREQS,ROWP) -
WEIGHT=V9 PRINT=CUM MDHAND=NONE
4. R=(V201,1,3) CELLS=(FREQS,MEAN) VARCELL=V54
5. ROWV=(V25-V28) COLV=(V29-V30) -
CELLS=(FREQS,ROWP,COLP,TOTP) STATS=(CHI,TAUA) REPE=SEX
6. ROWV=(V201-V203) COLV=V206 -
CELLS=(FREQS,MEAN) VARCELL=V54 REPE=REGION FILT=MALE
7. R=V19 C=V52 WEIGHT=V9 FILT=MD
8. ROWV=(V54-V62) STATS=(TAUA,GAMMA) PRINT=(MATRIX,N) WRITE=MATRIX
Chapter 38
Typology and Ascending
Classification (TYPOL)
38.1 General Description
TYPOL creates a classification variable summarizing a large number of variables. An initial
classification variable defined a priori (key variable), a random sample of cases, or a step-wise sample
may be used to constitute the initial core of groups. An iterative procedure refines the results by stabilizing
the cores. The final groups constitute the categories of the classification variable sought. The number of
groups of the typology may be reduced using an algorithm of hierarchical ascending classification.
The active variables are the variables on the basis of which the grouping and regrouping of cases is
performed. One can also look for the main statistics of other variables within the groups constructed
according to the active variables. Such variables (having no influence on the construction of the groups) are
called passive variables.
TYPOL accepts both quantitative and qualitative variables, the latter being treated as quantitative after
full dichotomization of their respective categories, which results in the construction of as many dichotomized
(1/0) variables as the number of categories of the qualitative variable. It is also possible to standardize the
active variables (the quantitative variables, and the qualitative after dichotomization).
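Full dichotomization of a qualitative variable can be pictured as follows. The Python sketch below is illustrative only; the function name is hypothetical and the code values are made up. It shows the general technique, not the internal TYPOL implementation.

```python
def dichotomize(values):
    """Turn one qualitative variable into one 1/0 variable per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# A qualitative variable with three categories observed on four cases.
categories, rows = dichotomize([2, 1, 3, 1])
print(categories)
for row in rows:
    print(row)
```

Each case has exactly one dichotomized variable set to 1, so a variable with k categories becomes k quantitative (1/0) variables.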
TYPOL operates in two steps:
1. Building of an initial typology. The program builds a typology of n groups, as requested by the
user, from the cases characterized by a given number of variables (considered as being quantitative).
The user may select the way an initial configuration is established (see INITIAL parameter), and also
the type of distance (see DTYPE parameter) used by the program for calculating the distance between
cases and groups.
2. Further ascending classification (optional). If the user wants a typology in fewer groups, the
program, using an algorithm of hierarchical ascending classification, reduces the number of
groups one by one down to the number specified by the user.
38.2 Standard IDAMS Features
Case and variable selection. The standard filter is available to select a subset of cases from the input
data. The variables are specified with parameters.
Transforming data. Recode statements may be used.
Weighting data. A variable can be used to weight the input data; this weight variable may have integer or
decimal values. When the value of the weight variable for a case is zero, negative, missing or non-numeric,
then the case is always skipped; the number of cases so treated is printed.
Treatment of missing data. The MDVALUES parameter is available to indicate which missing data
values, if any, are to be used to check for missing data. Cases with missing data in the quantitative variables
can be excluded from the analysis (see MDHANDLING parameter).
38.3 Results
Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if
any, only for variables used in the execution.
Initial typology
Construction of an initial typology. (Optional: see the parameter PRINT).
The regrouping of initial groups, followed by a table of cross-reference numbers attributed to the
groups before and after the constitution of the initial groups.
Table(s) showing the re-distribution of cases between one iteration and the following one, and
giving the percentage of the total number of cases properly grouped.
Evolution of the percentage of explained variance from one iteration to the other.
Characteristics of distances by groups. The number of cases in each initial group of the typology,
together with the mean value and the standard deviation of distances.
Classification of distances. (Optional: see the parameter PRINT). Table showing, within each group,
the distribution of cases across fifteen continuous intervals, these intervals being:
different for each group (first table),
identical for all groups (second table).
Global characteristics of distances. The total number of cases, with the overall mean and standard
deviation of distances.
Summary statistics. The mean, standard deviation and the variable weight for the quantitative variables
and for categories of qualitative active variables.
Description of resulting typology. For each typology group, its number and the percentage of cases
belonging to it are printed first. Then the statistics are provided, variable by variable, in the following
order: (1) quantitative active variables; (2) quantitative passive variables; (3) qualitative active variables;
(4) qualitative passive variables.
For each quantitative variable, the following are given: its amount of explained variance, its overall
mean value and, within each group of the typology, its mean value and standard deviation.
For each category of a qualitative variable, the amount of variance explained and the
percentage of cases belonging to it are given first; then, within each group of the typology, two
percentages are printed: vertically, the percentage of cases across the categories of the variable
in the 1st line and, horizontally, the percentage of cases across the groups of the typology
(row percentages) in the 2nd line (optional: see the parameter PRINT).
Summary of the amount of variance explained by the typology. The following percentages of
explained variance are given:
the variance explained by the most discriminant variables, i.e. those which taken altogether are re-
sponsible for eighty per cent of the explained variance,
the mean amount of variance explained by the active variables,
the mean amount of variance explained by all the variables together,
the mean amount of variance explained by the most discriminant variables together with the proportion
of these variables.
Note: When qualitative variables appear in tables, the first 12 characters of the variable name are printed
together with the code value identifying the category. When quantitative variables appear in tables, all 24
characters of the variable name are printed.
Ascending hierarchical classification
Table of square roots of displacements and distances calculated for each pair of groups. (Optional: see
the parameter PRINT).
Table of regrouping No. 1. Summary statistics for the quantitative active variables and categories of
qualitative active variables for groups involved in regroupment.
Description of new resulting typology. (Optional: see the parameter LEVELS). The same information
as above.
Summary of the amount of variance explained by the new typology. The same information as above.
Note here the mean amount of variance explained by the most discriminant variables before regrouping.
The summary of the ascending hierarchical classification is printed after each regroupment up to the number
of groups specified by the user.
Three diagrams showing the percentage of explained variance as a function of the number of groups of the
successive typologies, in turn for:
all the variables,
the active variables,
the variables explaining 80% of the variance before the regroupings took place.
Profiles of each group of the typology. (Optional: see the parameter PRINT). These profiles are
printed and plotted for all the groups of the first resulting typology and then for the groups obtained at each
regrouping.
A hierarchical tree is produced at the end.
38.4 Output Dataset
A classification variable dataset for the first resulting typology can be requested and is output in the form
of a data file described by an IDAMS dictionary (see parameter WRITE and Data in IDAMS chapter).
It contains the case ID variable, the transferred variables, the classification variable (GROUP NUMBER)
and, for each case, its distance multiplied by 1000 from each category of the classification variable, called
n GROUP DISTANCE. The variables are numbered starting from one and incrementing by one in the
following order: case ID variable, transferred variables, classification variable and distance variables.
38.5 Output Configuration Matrix
An output configuration matrix may optionally be written in the form of an IDAMS rectangular matrix (see
parameter WRITE). See Data in IDAMS chapter for a description of the format. This matrix provides,
line by line, for each quantitative variable and for each category of qualitative active variables, its mean
value across the groups and its overall standard deviation for the initial typology, i.e. before the regroupings
take place. The elements of the matrix are written in 8F9.3 format. Dictionary records are written.
38.6 Input Dataset
The input is a Data file described by an IDAMS dictionary. All analysis variables must be numeric; they
may be integer or decimal valued. The case ID variable and variables to be transferred can be alphabetic.
38.7 Input Configuration Matrix
The input configuration matrix must be in the form of an IDAMS rectangular matrix. See Data in IDAMS
chapter for a description of the format. This matrix is optional and provides a starting configuration to be
used in the computations. The statistics included should be mean values for the quantitative variables and
proportions (not percentages) for the categories of qualitative variables (e.g. .180 instead of 18.0 per cent).
A configuration matrix output by the program in a previous execution may serve as input configuration.
38.8 Setup Structure
$RUN TYPOL
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
$MATRIX (conditional)
Input configuration matrix
Files:
FT02 output configuration matrix if WRITE=CONF specified
FT09 input configuration matrix if INIT=INCONF specified
(omit if $MATRIX used)
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output dictionary if WRITE=DATA specified
DATAyyyy output data if WRITE=DATA specified
PRINT results (default IDAMS.LST)
38.9 Program Control Statements
Refer to The IDAMS Setup File chapter for further description of the program control statements, items
1-3 below.
1. Filter (optional). Selects a subset of cases to be used in the execution.
Example: INCLUDE V1=10-40,50
2. Label (mandatory). One line containing up to 80 characters to label the results.
Example: FIRST CONSTRUCTION OF CLASSIFICATION VARIABLE
3. Parameters (mandatory). For selecting program options.
Example: MDHAND=ALL AQNTV=(V12-V18) DTYP=EUCL -
PRINT=(GRAP,ROWP,DIST) INIG=5 FING=3
INFILE=IN/xxxx
A 1-4 character ddname suffix for the input Dictionary and Data files.
Default ddnames: DICTIN, DATAIN.
BADDATA=STOP/SKIP/MD1/MD2
Treatment of non-numeric data values. See The IDAMS Setup File chapter.
MAXCASES=n
The maximum number of cases (after filtering) to be used from the input file.
Default: All cases will be used.
AQNTVARS=(variable list)
A variable list specifying quantitative active variables.
PQNTVARS=(variable list)
A variable list specifying quantitative passive variables.
AQLTVARS=(variable list)
A variable list specifying qualitative active variables.
PQLTVARS=(variable list)
A variable list specifying qualitative passive variables.
MDVALUES=BOTH/MD1/MD2/NONE
Which missing data values are to be used for the variables accessed in this execution. See The
IDAMS Setup File chapter.
MDHANDLING=ALL/QUALITATIVE/QUANTITATIVE
ALL Cases with missing data values in quantitative variables will be skipped and missing
data codes in qualitative variables will be excluded from analysis.
QUAL Missing data values in qualitative variables will be excluded from analysis.
QUAN Cases with missing data values in quantitative variables will be skipped.
REDUCE
Standardization of active variables, both quantitative and qualitative.
WEIGHT=variable number
The weight variable number if the data are to be weighted.
DTYPE=CITY/EUCLIDEAN/CHI
CITY City block distance.
EUCL Euclidean distance.
CHI Chi-square distance.
Note: Concerning the choice of type of distance it is advisable to use:
the City block distance when some active variables are qualitative and others are quantitative,
the Euclidean distance when active variables are all quantitative (with standardization if they
are not measured on the same scale),
the Chi-square distance when active variables are all qualitative.
INIGROUP=n
Number of initial groups. If a key variable is to serve as a basis for the typology, and if the
number of initial groups specified here is greater than the maximum value of the key variable,
the program corrects this automatically. Also, if there are certain categories with zero cases, the
number of initial groups will be the number of non-empty categories.
No default.
FINGROUP=1/n
Number of final groups.
INITIAL=STEPWISE/RANDOM/KEY/INCONF
The way the initial configuration is established.
STEP Stepwise sample.
RAND Random sample.
KEY Profile of initial groups is created according to a key variable.
INCO An a priori profile of initial groups is given in an input configuration file.
Note: Variables included in the input configuration must correspond exactly to the
variables provided with the AQNTV and/or AQLTV parameters.
STEP=5/n
If stepwise sample of cases is requested (INIT=STEP), n is the length of the step.
NCASES=n
If the random sample of cases is requested (INIT=RAND), n is the number of cases (unweighted)
in the input file, or a good underestimation of it.
No default; must be specified if INIT=RAND.
KEY=variable number
If a key variable is used to construct initial groups (INIT=KEY), this is the number of the key
variable.
No default; must be specified if INIT=KEY.
ITERATIONS=5/n
Maximum number of iterations for convergence of the group profile.
REGROUP=DISPLACEMENT/DISTANCE
DISP Regrouping is based on minimum displacement.
DIST Regrouping is based on minimum distance.
WRITE=(DATA, CONFIG)
DATA Create an IDAMS dataset containing the case ID variable, transferred variables,
classification variable and distance variables.
CONF Output the configuration matrix into a file.
OUTFILE=OUT/yyyy
A 1-4 character ddname suffix for the output Dictionary and Data files.
Default ddnames: DICTOUT, DATAOUT.
IDVAR=variable number
Variable to be transferred to the output dataset to identify the cases.
Obligatory if WRITE=DATA specified.
TRANSVARS=(variable list)
Additional variables (up to 99) to be transferred to the output dataset.
LEVELS=(n1, n2, ...)
Print description of resulting typology for the number of groups specified.
Default: Description is printed after each regrouping.
PRINT=(CDICT/DICT, OUTCDICT/OUTDICT, INITIAL, TABLES, GRAPHIC, ROWPCT,
DISTANCES)
CDIC Print the input dictionary for the variables accessed with C-records if any.
DICT Print the input dictionary without C-records.
OUTC Print the output dictionary with C-records if any.
OUTD Print the output dictionary without C-records.
INIT Print history of initial typology construction.
TABL Print two tables with classification of distances.
GRAP Print the graphic of profiles.
ROWP Print row percentages for categories of qualitative variables.
DIST Print table of distances and displacements for each regrouping.
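The three DTYPE options name standard distance measures. As a rough illustration only, the sketch below computes one common textbook form of each for a pair of profiles. The function names are hypothetical, and the chi-square form (squared differences of profiles weighted by overall category proportions) is an assumption, since this section does not give the exact formulas used by TYPOL.

```python
import math

def city_block(x, y):
    """City block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def chi_square(x, y, col_weights):
    """Chi-square distance between two profiles of proportions.

    col_weights holds the overall proportion of each category. This
    weighting is an assumed textbook form, not taken from the manual.
    """
    return math.sqrt(sum((a - b) ** 2 / w for a, b, w in zip(x, y, col_weights)))
```

For example, city_block([1, 2], [4, 6]) sums the absolute differences 3 and 4, while euclidean([0, 0], [3, 4]) is the length of the straight line between the two points.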
38.10 Restrictions
1. Maximum number of initial groups is 30.
2. Maximum total number of variables is 500, including weight variable, key variable, variables to be
transferred, analysis variables (quantitative variables + number of categories for qualitative variables)
and variables used temporarily in Recode statements.
3. If the ID variable or a variable to be transferred is alphabetic with width > 4, only the first four
characters are used.
4. R-variables cannot be used as ID or as variables to be transferred.
38.11 Examples
Example 1. Creation of a classification variable summarizing 5 quantitative and 4 qualitative variables using
the City block distance; initial configuration will be established by random selection of cases; classification
starts with 6 groups and will terminate with 3 groups; regrouping will be based on minimum distance;
missing data will be excluded from analysis.
$RUN TYPOL
$FILES
PRINT = TYPOL1.LST
DICTIN = A.DIC input Dictionary file
DATAIN = A.DAT input Data file
$SETUP
SEARCHING FOR NUMBER OF CATEGORIES IN A CLASSIFICATION VARIABLE
AQNTV=(V114,V116,V118,V120,V122) AQLTV=(V5-V7,V36) REDU -
INIG=6 FING=3 INIT=RAND NCAS=1200 -
REGR=DIST PRINT=(GRAP,ROWP,DIST)
Example 2. Generating a classification variable from Example 1 with 4 categories; the variable is to be
written into a file; variables V18 and V34 are used as quantitative passive and variables V12 and V14 as
qualitative passive.
$RUN TYPOL
$FILES
PRINT = TYPOL2.LST
DICTIN = A.DIC input Dictionary file
DATAIN = A.DAT input Data file
DICTOUT = CLAS.DIC output Dictionary file
DATAOUT = CLAS.DAT output Data file
$SETUP
GENERATING A CLASSIFICATION VARIABLE
AQNTV=(V114,V116,V118,V120,V122) AQLTV=(V5-V7,V36) REDU -
PQNTV=(V18,V34) PQLTV=(V12,V14) -
INIG=6 FING=4 INIT=RAND NCAS=1200 -
REGR=DIST PRINT=(GRAP,ROWP) WRITE=DATA IDVAR=V1
Part V
Interactive Data Analysis
Chapter 39
Multidimensional Tables and their
Graphical Presentation
39.1 Overview
The interactive Multidimensional Tables component of WinIDAMS allows you to visualize and customize
multidimensional tables with frequencies, row, column and total percentages, univariate statistics (sum,
count, mean, maximum, minimum, variance, standard deviation) of additional variables, and bivariate
statistics. Variables in rows and/or columns can either be nested (maximum 7 variables) or they can be put
at the same level. Construction of a table can be repeated for each value of up to three page variables.
Each page of the table can also be printed, or exported in free format (comma or tabulation character
delimited) or in HTML format.
IDAMS datasets used as input must have the same name for the Dictionary and Data files with extensions
.dic and .dat respectively.
Only one dataset can be used at a time, i.e. opening another dataset automatically closes the one being
used.
39.2 Preparation of Analysis
Selection of data. A dataset selected for constructing multidimensional tables remains available until it is
changed by activating the Multidimensional Tables component again. The dialogue box lets you choose
a Data file either from a list of recently used Data files (Recent) or from any folder (Existing). The Data
folder of the current application is the default. Setting Files of type: to IDAMS Data Files (*.dat)
displays only IDAMS Data files.
Selection of variables. Selection of a dataset for analysis calls the dialogue box for table definition.
You are presented with a list of available variables and with four windows to specify variables for different
purposes. Use the Drag and Drop technique to move variables between and/or within the required windows.
Page variables are used to construct separate pages of the table for each distinct value of each variable in
turn, and for all cases taken together (Total page). Cases included on a particular page all have the
same value on the page variable. Page variables are never nested. The order in which variables are
specified determines the order in which pages are placed in the Table window.
Row variables are the variables whose values are used to define table rows. Their order determines the
nesting sequence.
Column variables are the variables whose values are used to define table columns. Their order determines
the nesting sequence.
Cell variables are variables whose values are used to calculate univariate statistics (e.g. mean) in the table
cells. The order in which they are specified determines the order of their appearance in the table.
There may be up to 10 cell variables.
Nesting. If more than one row and/or column variable is specified, by default they are nested. To use them
sequentially, at the same level, double-click on the variable in the row or column variable list and mark the
option for treating at the same level. Note: This option is not available for the first variable in a list.
Percentages. Percentages in each cell (row, column or total) can be obtained by double-clicking on the
last nested row variable in the table definition window and selecting the type of percentages required.
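As an illustration of the three percentage types, the following sketch derives them from a small frequency table. The function name and the nested-list layout are hypothetical; the sketch only shows the arithmetic and makes no claim about how WinIDAMS stores or computes its tables.

```python
def percentages(freq):
    """Return (row_pct, col_pct, total_pct) for a 2-D frequency table."""
    row_sums = [sum(row) for row in freq]
    col_sums = [sum(col) for col in zip(*freq)]
    total = sum(row_sums)

    def pct(n, base):
        # A zero base yields 0.0 rather than a division error.
        return 100.0 * n / base if base else 0.0

    row_pct = [[pct(n, row_sums[i]) for n in row] for i, row in enumerate(freq)]
    col_pct = [[pct(n, col_sums[j]) for j, n in enumerate(row)] for row in freq]
    total_pct = [[pct(n, total) for n in row] for row in freq]
    return row_pct, col_pct, total_pct
```

Row percentages use the row sum as the base, column percentages the column sum, and total percentages the sum over the whole page.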
Univariate statistics. Different statistics (sum, count, mean, maximum, minimum, variance, standard
deviation) for each of the cell variables can be obtained by double-clicking on the variable in the table
definition window and marking the required statistic(s). Formulas for calculating mean, variance and
standard deviation can be found in the section Univariate Statistics of the Univariate and Bivariate Tables chapter.
However, they need to be adjusted since cases are not weighted.
Missing data treatment. The default missing data treatment is applied to the first construction of the
table. Then, it can be changed using the menu Change.
Missing Data Values option is used to indicate which missing data values, if any, are to be used to check
for missing data in row and column variables.
Both Variable values will be checked against the MD1 codes and against the ranges of codes
defined by MD2.
MD1 Variable values will be checked only against the MD1 codes.
MD2 Variable values will be checked only against the ranges of codes defined by MD2.
None MD codes will not be used. All data values will be considered valid.
By default, both MD codes are used.
Missing Data Handling option is used to indicate which missing data values should be excluded from
computation of percentages and bivariate statistics.
All Delete all missing data values.
Row Delete missing data values of row variables.
Column Delete missing data values of column variables.
None Do not delete missing data values.
By default, all missing data values are deleted.
Note: Cases with missing data on cell variables are always excluded from calculation of univariate statistics.
The exclusion is done cell by cell, separately for each cell variable. Thus, the number of valid cases may not
be equal to the cell frequency. The statistic Count shows the number of valid cases.
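The difference between the cell frequency and the statistic Count can be sketched as follows. The function name is hypothetical, and missing values are represented by None purely for illustration; in IDAMS they are identified through the MD1 and MD2 codes of each variable.

```python
def cell_summary(values):
    """Summarise one cell variable for the cases falling in one table cell.

    values: one entry per case in the cell; None marks missing data.
    """
    valid = [v for v in values if v is not None]
    return {
        "frequency": len(values),  # all cases in the cell
        "count": len(valid),       # cases with valid data on this variable
        "mean": sum(valid) / len(valid) if valid else None,
        "max": max(valid) if valid else None,
        "min": min(valid) if valid else None,
    }
```

With four cases in a cell, one of which has missing data on the cell variable, the frequency is 4 while Count is 3, and the mean is taken over the three valid values only.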
Changing table definition. The menu command Change/Specification calls the dialogue box with the
active table definition. You can change variables for analysis, their nesting as well as requests for percentages
and univariate statistics. Clicking on OK replaces the active table by a new one.
39.3 Multidimensional Tables Window
After selection of variables and a click on OK, the Multidimensional Tables window appears in the WinIDAMS
document window. By default, frequencies and mean values for all cell variables are displayed. If page
variables are specified, code labels (or codes) of these variables are displayed on tabs at the bottom of the table.
A particular page can be accessed by a click on the required label (code).
Changing the page appearance. The appearance of each page can be changed separately, the changes
applying exclusively to the active page.
The following modifications are possible:
Increasing the font size - use the menu command View/Zoom In or the toolbar button Zoom In.
Decreasing the font size - use the menu command View/Zoom Out or the toolbar button Zoom Out.
Resetting default font size - use the menu command View/100% or the toolbar button 100%.
Increasing/Decreasing the width of a column - place the mouse cursor on the line which separates two
columns in the column heading until it becomes a vertical bar with two arrows and move it to the
right/left holding the left mouse button.
Minimizing the width of columns - mark the required column(s) and use the menu command
Format/Resize Columns.
Increasing/Decreasing the height of rows - place the mouse cursor on the line which separates two rows
in the row heading until it becomes a horizontal bar with two arrows and move it down/up holding
the left mouse button.
Minimizing the height of rows - mark the required row(s) and use the menu command Format/Resize
Rows.
Hiding columns/rows - decrease the width/height of a column/row to zero. To redisplay a hidden
column/row, place the mouse cursor on the line where it is hidden in the column/row heading until it
becomes a vertical/horizontal bar with two arrows and double-click the left mouse button.
In addition, the command Format/Style gives access to a number of table formatting possibilities such
as: selection of fonts, size of fonts, colours, etc. for the active cell or for all cells in the active line.
Bivariate statistics. Bivariate statistics (Chi-square, Phi coefficient, contingency coefficient, Cramer's V,
Taus, Gamma, Lambdas and Somers' D) are computed for each table (each page). Use the menu command
Show/Statistics to display them at the end of the table. If needed, this operation should be repeated for each
page separately. Formulas for calculating bivariate statistics can be found in the section Bivariate Statistics
of the Univariate and Bivariate Tables chapter.
Note that statistics are calculated only when there is one row and one column variable.
Printing a table page. The whole contents of the active page or desired parts only can be printed using
the File/Print command. If you want to print only some columns and/or rows, hide the other columns/rows
first. The displayed columns/rows will be printed.
Exporting a table page. The whole contents of the active page or desired parts only can be exported in
free format (comma or tabulation character delimited) or in HTML format. Use the File/Export command
and select the required format. If you want to export only some columns and/or rows, hide the other
columns/rows first. The displayed columns/rows will be exported.
39.4 Graphical Presentation of Univariate/Bivariate Tables
Frequencies displayed in a page of univariate/bivariate tables can be presented graphically using one of 24
graph styles at your disposal. Graph construction is initiated by the menu command Graph/Make. This
command calls the dialogue box to select the graph style for the active page. In addition, you may ask to
use logarithmic transformation of frequencies, and to provide a legend for colours and symbols used in the
graph.
Projected graphics cannot be manipulated. However, they can be saved in one of the two formats, namely:
JPEG file interchange format (.jpg) or Windows Bitmap format (.bmp), using the relevant commands in
the File menu. They can also be copied to the Clipboard (the command Edit/Copy, toolbar button Copy
or shortcut keys Ctrl+C) and pasted into any text editor.
It should be noted here again that only frequencies from displayed rows and columns, i.e. not from rows
and/or columns which have been hidden, are used for this presentation.
39.5 How to Make a Multidimensional Table
We will use the rucm dataset (rucm.dic is the Dictionary file and rucm.dat is the Data file) which is
in the default Data folder and which is installed with WinIDAMS.
We will build a three-way table with two nested row variables (SCIENTIFIC DEGREE and SEX), one
column variable (CM POSITION IN UNIT) and one cell variable (AGE) for which we will ask the mean,
maximum and minimum.
Click on Interactive/Multidimensional Tables. This command opens a dialogue for selecting an IDAMS
Data file.
Click on rucm.dic and Open. You now see a dialogue for specifying the variables that you want to use
in the multidimensional table.
Select variables SCIENTIFIC DEGREE and SEX as ROW VARIABLES, CM POSITION IN
UNIT as COLUMN VARIABLE and AGE as CELL VARIABLE.
Use the mouse Drag and Drop technique to move the variables (press the left mouse button on the
variable you want to move, hold down the mouse button while you move the variable and release on
the variable list where you want to move the variable). Several variables can be selected and moved
simultaneously from one list to the other (hold down the Ctrl key when selecting).
The order of the variables in the ROW VARIABLES and COLUMN VARIABLES lists specifies,
implicitly, the nesting order. The first variable in the list will be the outermost one. The variable order
in a list can be modied using the Drag and Drop mouse technique inside the same list.
After selecting the variables, the default options assigned to a variable can be changed by double-
clicking on the variable. A double-click on the variable AGE in the CELL VARIABLES list opens
the following dialogue:
Mean is marked by default. Mark Max and Min. Then click on OK here and on OK in the
Multidimensional Table Definition dialogue. You now see the multidimensional table.
39.6 How to Change a Multidimensional Table
Asking for separate tables. Suppose that now you wish to see a separate table for the men and the
women.
Click on Change/Specification and you get back the dialogue with your previous selection of variables.
Use the Drag and Drop technique to move the SEX variable from the ROW VARIABLES list to the
PAGE VARIABLES list and click on OK.
You see the first view which is the total for all values taken together (men and women). At the bottom
of the view you can see three tabs: Total, MALE and FEMALE. Total is the tab of the
current view.
To see the page for the men, click on tab MALE.
To see the page for women, click on tab FEMALE.
Asking for the percentages. While frequencies are displayed by default, any type of percentages must
be requested explicitly.
Click on Change/Specification and you get back the dialogue with your previous selection of variables.
Double-click on the row variable SCIENTIFIC DEGREE and you see a dialogue with boxes for
Frequency (marked by default), Row %, Column % and Total %. Mark all the percentage boxes as
follows:
Click on OK to accept this change and click on OK in the Multidimensional Table Definition
dialogue. You see the previous multidimensional table with all percentages.
Chapter 40
Graphical Exploration of Data
(GraphID)
40.1 Overview
GraphID is a component of WinIDAMS for interactive exploration of data through graphical visualization.
It accepts two kinds of input:
IDAMS datasets where the Dictionary and Data files must have the same name with extensions .dic
and .dat respectively,
IDAMS Matrix files where the extension must be .mat.
Only one dataset or one matrix file can be used at a time, i.e. opening of another file automatically closes
the one being used.
40.2 Preparation of Analysis
Selection of data. Use the menu command File/Open or click the toolbar button Open. Then, in the
Open dialogue box, choose your file. Setting Files of type: to IDAMS Data File (*.dat) or to IDAMS
Matrix File (*.mat) allows for filtering of files displayed.
Selection of case identification. If you have selected a dataset, you are asked to specify a case identification
which can be a variable or the case sequence number. A numeric or alphabetic variable can be selected
from a drop-down list.
Selection of variables. If you have selected a dataset, you are asked to specify the variables which you
want to analyse. Numeric variables can be selected from the Source list and moved to the Selected items
area. Moving variables between the lists can be done by clicking the buttons >, < (move only highlighted
variables), >>, << (move all variables). Note that alphabetic variables are not available here and that the
case identification variable is not allowed for analysis.
Missing data treatment. Two possibilities are proposed: (1) case-wise deletion, when a case is used in
analysis only if it has valid data on all selected variables; (2) pair-wise deletion, when a case is used if it has
valid data on both variables for each pair of variables separately.
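The two deletion rules can be contrasted with a short sketch. The function names are hypothetical and None stands in for a missing value; this illustrates the rules as described above and is not GraphID's own code.

```python
def casewise_valid(cases):
    """Case-wise deletion: keep only cases valid on every selected variable."""
    return [c for c in cases if all(v is not None for v in c)]

def pairwise_valid(cases, i, j):
    """Pair-wise deletion: keep cases valid on variables i and j only."""
    return [(c[i], c[j]) for c in cases if c[i] is not None and c[j] is not None]
```

With three variables, a case missing only the second variable is dropped everywhere under case-wise deletion, but under pair-wise deletion it still contributes to the plot of the first variable against the third.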
40.3 GraphID Main Window for Analysis of a Dataset
After selection of variables and a click on OK, the GraphID Main window displays the initial matrix of
scatter plots with 3 variables and the default properties of the matrix. This display can be manipulated
using various options and commands in the menus and/or equivalent toolbar icons.
40.3.1 Menu bar and Toolbar
File
Open Calls the dialogue box to select a new dataset/matrix file for analysis.
Close Closes all windows for the current analysis.
Save As Calls the dialogue box to save the graphical image of the active window in
Windows Bitmap format (*.bmp).
Save masked cases Saves, for subsequent use, the sequential numbers of the cases masked during
the session, following their sequence in the analysed Data file.
Print Calls the dialogue box to print the contents of the active window.
Print Preview Displays a print preview of the graphical image in the active window.
Print Setup Calls the dialogue box for modifying printing and printer options.
Exit Terminates the GraphID session.
The menu can also contain the list of recently opened files, i.e. files used in previous GraphID sessions.
Edit
The menu has only one command, Copy, to copy the graphic displayed in the active window to the Clipboard.
View
Configuration Calls the dialogue box for selecting symbols, colours, variables and the num-
ber of visible columns and rows in the matrix.
Scales Displays/hides graph scales for the active zoom window.
Toolbar Displays/hides toolbar.
Status Bar Displays/hides status bar.
Info Displays a window with relevant information about the dataset: number of
cases, number of variables, Data file name, etc.
Cell Info Displays a window with relevant information about the active plot: variable
names, their mean values, standard deviations, correlation and regression
coefficients.
Brush appearance Calls the dialogue box to select the symbol and colour for brushed cases.
Font for Scales Calls the dialogue box to select the font for scales for the active zoom window.
Font for Labels Calls the dialogue box to select the font for variable names.
Basic Colors Calls the dialogue box to select colours for the active window: margin colour,
grid colour and diagonal cell background.
Save Colors Saves modification of colours.
Save Fonts Saves modification of fonts.
Tools
In this menu you can find tools for manipulating the matrix of scatter plots and for calling other graphics
provided by GraphID.
Brush Sets/cancels brush mode.
Zoom Magnifies the active plot or the brush contents to full window.
Grouping Calls the dialogue box to specify creation of groups.
Cancel grouping Cancels grouping.
Histograms Calls the dialogue box to specify graphics to be shown in the diagonal cells
and their properties.
Smoothing Calls the dialogue box to specify types of regression lines (smoothing lines)
and their properties.
3D Scatter Plots Calls the dialogue box to select variables to be used as axes for 3D-scattering
and rotating.
Directed Mode Sets/cancels directed mode.
Box-Whisker Plots Calls the dialogue box to select variables and colours for displaying
Box-Whisker plots.
Jittering Performs jittering of projected cases.
Masking Mask the cases inside the brush.
Unmasking Restore step by step masked cases.
Apply saved masking Mask the cases which were masked and saved in the previous session.
Grouped plot Calls the dialogue box to select row and column variables for constructing
two-dimensional table, and X and Y variables for projecting their scatter
plots within the cells of the table.
Window
The menu contains the list of opened windows and Windows commands for arranging them.
Help
WinIDAMS Manual Provides access to the WinIDAMS Reference Manual.
About GraphID Displays information about the version and copyright of GraphID and a link
for accessing the IDAMS Web page at UNESCO Headquarters.
Toolbar icons
There are 21 buttons in the toolbar providing direct access to the same commands/options as the corre-
sponding menus. They are listed here as they appear from the left to the right.
Open Brush Box-Whisker plots
Save Zoom Cancel jittering
Copy Grouping Decrease jittering level
Print Histograms Increase jittering level
Basic colors Smoothed lines Mask the cases inside brush
Font for labels 3D scatter plots Restore step by step masked cases
Font for scales Directed mode Information about version of GraphID
40.3.2 Manipulation of the Matrix of Scatter Plots
Configuring the matrix of scatter plots. The current matrix of scatter plots can be changed using the
menu command View/Configuration.
Visible: Here you can set the number of columns and rows to be displayed on the screen (they do not need
to be equal). Other cells are made visible by scrolling.
Variables: The dialogue box carries two lists of variables: Source list and Selected items. Moving
variables between the lists can be done by clicking the buttons >, < (move only highlighted variables),
>>, << (move all variables).
Symbols: In this dialogue box, you can select the shape and colour of the symbols that are to be used to
represent each group of cases in the plots. If no groups are specified, then all the cases fall in a single
group by default and all will be represented by the same symbol (default is a small black rectangle).
One can either assign one symbol to one group or collapse groups by assigning the same symbol to two
or more groups.
The list of groups is given in the left-hand box. Two other boxes are for selecting colours and symbols.
To select a colour or symbol, just click on it. Its image will appear immediately in the button next to
the name of the highlighted group.
Directed mode. This option is useful when the order of cases on some column variables is meaningful, e.g.
when values of a column variable indicate time intervals. Linking the images sequentially by straight lines
can then, for example, help search for cyclical patterns.
To switch to directed plots or come back to scatter plots, press the toolbar button Directed mode or use the
menu command Tools/Directed mode.
Masking and Unmasking cases. You can mask cases projected in scatter plots. This feature can be
useful, for example, to remove outliers from the graphics.
Masking is available when the brush is active.
To mask cases included in the brush, click the toolbar button Mask. Masked cases are hidden in all the
scatter plots. Masking can be repeated several times.
All or part of the masked cases can be unmasked by clicking the toolbar button Restore.
Saving and re-using masked cases. The sequential numbers of currently masked cases can be saved in
a file corresponding to the analysed dataset using the command File/Save masked cases. This masking can
be restored in subsequent session(s) using the command Tools/Apply saved masking.
Grouping cases. This feature allows you to see how a variable partitions cases into groups in all plots.
The variable can be either qualitative or quantitative. In addition to selecting the grouping variable, the
user controls the way of grouping (by values, or by intervals and the number of groups).
The dialogue box for creation of groups is activated by clicking the toolbar button Grouping or by using the
menu command Tools/Grouping.
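The two ways of grouping can be illustrated with a minimal sketch. The function names are hypothetical, and the use of equal-width intervals over the observed range is an assumption; this section does not say how GraphID places interval boundaries.

```python
def group_by_values(values):
    """One group per distinct value, numbered from 1 in sorted order."""
    codes = {v: k for k, v in enumerate(sorted(set(values)), start=1)}
    return [codes[v] for v in values]

def group_by_intervals(values, n_groups):
    """Split the observed range into n_groups equal-width intervals (assumed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_groups
    if width == 0:
        # All values identical: everything falls in a single group.
        return [1] * len(values)
    # The maximum value is placed in the last group rather than a new one.
    return [min(int((v - lo) / width) + 1, n_groups) for v in values]
```

Grouping by values suits a qualitative variable with a few codes, while grouping by intervals turns a quantitative variable into a chosen number of groups.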
Exploration with the brush. The brush is a rectangle which can be (re)sized, moved and zoomed. As it
is moved over one scatter plot, the cases inside the brush are highlighted in brush colour and shape on all
the other scatter plots.
One of the applications is to determine if a crowding of cases in a scatter plot really represents a cluster in
the multidimensional space or whether the crowding is simply a property of the projection. For this purpose,
place the brush on a crowding in one scatter plot and observe how these cases are located on other scatter
plots. If the same crowding appears on other plots then the crowding may indeed indicate a real cluster.
Of course the scatter plots must be chosen so that the distances between cases are of the same order in the
different plots.
Another application of the brush is to study the conditional distributions. If the 4 corners of the brush are
given by x_min, x_max, y_min, y_max, then the cases inside the brush are those that satisfy the conditions:
x_min < x < x_max and y_min < y < y_max
and the cases satisfying these conditions can be studied in the other scatter plots.
Brush can also be used to mask and search for cases.
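The selection rule for cases inside the brush can be written as a short sketch. The function name and the list-of-pairs input are hypothetical; only the strict inequalities come from the text above.

```python
def cases_in_brush(points, x_min, x_max, y_min, y_max):
    """Return the indices of cases whose (x, y) lies strictly inside the brush."""
    return [
        i for i, (x, y) in enumerate(points)
        if x_min < x < x_max and y_min < y < y_max
    ]
```

The returned indices identify the same cases in every other scatter plot, which is what allows them to be highlighted there.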
To enter brush mode or cancel it, click the toolbar button Brush or use the menu command Tools/Brush.
To place the brush in the desired area, set the cursor at the edge, press the left mouse button, drag and
release at the other edge.
To move or resize the brush, set the cursor inside the brush rectangle or on its side, press the left button
and drag. Note: To move it quickly to another cell, place the cursor in the desired cell and press the left
mouse button.
Zooming. Zooming creates a new window to magnify the selected cell or, in brush mode, to magnify the
brush. Such a new zoom window has most of the properties of a matrix of scatter plots with one cell; for
example you can use brushing to identify a new set of cases and then zoom again.
If the parent matrix of scatter plots is in brush mode, modification of the brush is reflected immediately in
the zoom window; otherwise the zoom window reflects modifications introduced in the selected cell of the
parent matrix.
The menu command View/Scales allows you to display scales of variable values for the active zoom window.
Jittering. The function is useful when there are discrete or qualitative variables in the analysed data. In
this case, usual matrices of scatter plots may not be very informative since some or all 2D and 3D projections
present 2D or 3D grids and therefore it is impossible to determine visually how many cases coincide in the
same grid position and to which groups they belong.
The jittering is a random transformation of data. Data values (x) are modied by adding a noise (a*U)
where U is a uniformly distributed random value from the interval (-0.5, 0.5) and a is a factor to control
the jittering level.
To set the desired jittering level, use the toolbar buttons Decrease jittering level, Increase jittering level and
Cancel jittering.
Note that jittering can be performed only in the window of the matrix of scatter plots.
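The jittering transformation x + a*U described above can be sketched as follows. The function name and the optional rng argument are hypothetical, and the sketch says nothing about which values of a the toolbar buttons actually apply.

```python
import random

def jitter(values, a, rng=None):
    """Add noise a*U to each value, with U uniform on the interval (-0.5, 0.5).

    random() returns values in [0.0, 1.0), so U here lies in [-0.5, 0.5).
    """
    rng = rng or random.Random()
    return [x + a * (rng.random() - 0.5) for x in values]
```

With a = 0 the data are unchanged, which corresponds to cancelling the jittering; larger values of a spread coinciding cases further apart, by at most a/2 in either direction.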
40.3.3 Histograms and Densities
Histograms, normal densities and dot graphics, and three univariate statistics can be displayed in the diagonal
cells of the matrix of scatter plots.
To obtain these, click the toolbar button Histograms or use the menu command Tools/Histograms. In the
dialogue box presented you can select the desired graphics, the colour and the number of histogram bars.
With the option Statistics, the following statistics are provided: Skewness (Skew), Kurtosis (Kurt) and
Standard deviation (Std).
40.3.4 Regression Lines (Smoothed lines)
Up to 4 different regression lines can be displayed on each scatter plot:
MLE (Maximum Likelihood Estimation) linear regression (usual linear regression)
Local linear regression
Local mean
Local median.
Note that these are regression lines of Y versus X, where the X and Y variables are projected respectively
on the horizontal and vertical axis.
To get the lines, click the toolbar button Smoothed lines or use the menu command Tools/Smoothing. Then,
in the dialogue box select the desired lines, their colour and the smoothing parameter value.
The smoothing parameter is the number of neighbours. It defaults to 7. The value cannot be greater than
n/2 where n is the number of cases.
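As a rough illustration of the local mean and local median lines, the sketch below smooths Y at each observed X using the k cases nearest in X. The manual does not state how GraphID defines the neighbourhood, so the nearest-in-X rule, the function name local_smooth and the sample data are assumptions.

```python
from statistics import mean, median


def local_smooth(xs, ys, k=7, stat=mean):
    """Smooth ys at each x using the k cases nearest in x (assumed rule)."""
    n = len(xs)
    if k > n / 2:
        raise ValueError("the number of neighbours cannot be greater than n/2")
    smoothed = []
    for x0 in xs:
        # Indices of the k cases closest to x0 on the horizontal axis.
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        smoothed.append(stat([ys[i] for i in nearest]))
    return smoothed


xs = list(range(20))
ys = [2 * x for x in xs]
local_mean = local_smooth(xs, ys, k=3, stat=mean)
local_median = local_smooth(xs, ys, k=3, stat=median)
```

Passing stat=mean gives a local mean line and stat=median a local median line; a larger k produces a smoother curve.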
40.3.5 Box and Whisker Plots
This feature is especially useful if the cases have been partitioned into groups (see Grouping cases above).
Use the menu command Tools/Box-Whisker plots or click the toolbar button Box-Whisker plots to get
a dialogue box for specifying the number of visible columns and rows as well as colours for the Box and
Whisker plots window.
For each selected variable, a graphic image is displayed in the form of a set of boxes, each box corresponding
to one group of cases. The base of the box can be set to be proportional to the number of cases in the group,
and the upper and lower boundaries show the upper and lower quartiles respectively. The upper and lower
ends of vertical lines (whiskers) emerging from the box correspond to the maximum and minimum values
of the variable for the group. The lines inside a box are the mean (green line) of the variable in the group
and its median (dotted blue line). The left side of a rectangle shows the scale of the variable and its lower
margin shows the group numbers.
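The quantities drawn for each box can be computed as in the following sketch. It is illustrative only: the quartile convention used by GraphID is not documented here, so the "inclusive" method of Python's statistics.quantiles is an assumption, and the names box_summary and summaries_by_group are hypothetical.

```python
from statistics import mean, median, quantiles


def box_summary(values):
    """Statistics shown for one group in a Box and Whisker plot."""
    # Quartile rule is an assumption; other conventions give slightly different values.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),          # can drive the width of the box base
        "min": min(values),        # lower whisker end
        "q1": q1,                  # lower boundary of the box
        "median": median(values),  # dotted blue line
        "mean": mean(values),      # green line
        "q3": q3,                  # upper boundary of the box
        "max": max(values),        # upper whisker end
    }


def summaries_by_group(values, groups):
    """Split values by group number and summarise each group."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    return {g: box_summary(vs) for g, vs in sorted(by_group.items())}


example = summaries_by_group(
    [1, 2, 3, 4, 5, 10, 20, 30, 40, 100],
    [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
)
```

In the second group the mean (40) lies well above the median (30), which is the kind of asymmetry the two lines inside a box make visible.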
You may change colours and fonts of the graphics using appropriate buttons in the toolbar. These changes
can be saved as new defaults for subsequent windows and sessions.
The Colors button allows you to change colours of:
Boxes
Background
Whiskers
Median line
Mean line
Margins.
The Font buttons allow you to change fonts for scales and variable names.
Any cell of a Box-Whisker plot can be zoomed. Select the desired cell and click the toolbar button Zoom.
40.3.6 Grouped Plot
This feature allows projection of a two-dimensional scatter plot within cells of a two-dimensional table, and
thus visual analysis in 4 dimensions.
Use the menu command Tools/Grouped plot to get a dialogue box for specifying row and column variables
for table construction, and X and Y variables for scatter plots.
You are also requested to select the way of calculating the number of rows and columns. There are two
possibilities: they can be equal to the number of distinct variable values or to the user-specified number of
intervals. Calculated intervals are of the same length.
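For the second possibility, the assignment of a value to one of the equal-length intervals can be sketched as follows. The treatment of values lying exactly on an interval boundary is an assumption, and interval_index is a hypothetical helper, not a GraphID function.

```python
def interval_index(value, lo, hi, n_intervals):
    """Index (0-based) of the equal-length interval of [lo, hi] containing value."""
    if hi == lo:
        return 0
    width = (hi - lo) / n_intervals
    # The maximum value is placed in the last interval rather than in a new one.
    return min(int((value - lo) / width), n_intervals - 1)


# Row and column of the table cell for one case, with 5 intervals per variable.
row = interval_index(3.9, lo=0, hi=10, n_intervals=5)
col = interval_index(10, lo=0, hi=10, n_intervals=5)
```

Each case is then plotted, using its X and Y values, in the scatter plot located at cell (row, col).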
40.3.7 Three-dimensional Scatter Diagrams and their Rotation
To get a three-dimensional scatter diagram, click the toolbar button 3D scatter plots or use the menu
command Tools/3D Scatter Plots. The dialogue box lets you select three variables to be projected along
OX, OY and OZ axes. After OK, you get a new window with a three-dimensional scatter diagram for the
selected variables. If the parent matrix plot window is in brush mode, the cases included in the brush will
be displayed the same way in this diagram.
You can use the control elements of the dialogue box in the left pane of the window to change the graphical
image and to rotate it.
The button in the top left corner can be used to reset the graphics to the start position.
The button in the top right corner can be used to set the center for the cloud of points: either in the gravity
center or in the zero point.
The buttons in the group Rotate are used for rotating the scatter diagram around the corresponding axes
and the ones in the group Spread are used to move points from and towards the center.
The group Labels allows you to display or to hide variable names on the corresponding axes.
Finally, the 3D scatter diagram can be projected as three 2D scatter plots by requesting the 2D-view.
40.4 GraphID Window for Analysis of a Matrix
Once the file with matrices has been selected, you can click on Open or double-click on the file name to display
a 3D histogram with one bar for each cell of the first matrix in the file. The height of the bar represents the
value of the statistic from the matrix transformed using its range, i.e.
$$h = \frac{s_{val} - s_{min}}{s_{max} - s_{min}}$$
By default, negative values are shown in blue and positive values in red.
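The range transformation of the bar heights can be illustrated as follows. This is a sketch under the assumption that the minimum and maximum are taken over all cells of the displayed matrix; bar_heights is a hypothetical name.

```python
def bar_heights(matrix):
    """Scale every cell to h = (s_val - s_min) / (s_max - s_min)."""
    cells = [v for row in matrix for v in row]
    s_min, s_max = min(cells), max(cells)
    span = s_max - s_min
    # A constant matrix has no range; all heights are set to 0 in that case.
    return [[(v - s_min) / span if span else 0.0 for v in row] for row in matrix]


heights = bar_heights([[-1, 0], [1, 3]])
```

The smallest statistic gets height 0 and the largest gets height 1, whatever the original units.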
You can select colours for labels (names) and scales, negative and positive values, walls, floor and background.
Use the same technique as for Box and Whisker plots.
In the right part of the window you are presented with a list of matrices included in the file. Note that only
the first 16 characters of the matrix contents description are displayed. If there is no description, GraphID
displays "Untitled n". You can display the required matrix by clicking its contents description.
The display of the matrix can be manipulated using options and commands in the menu bar items and/or
equivalent toolbar icons.
40.4.1 Menu bar and Toolbar
File and Edit
The same commands as the corresponding menus in dataset analysis, except Close, are provided.
View
Toolbar Displays/hides toolbar.
Status Bar Displays/hides status bar.
Colors Calls the dialogue box to select colours for the active window: row/column
labels and scales, negative and positive values, walls, floor and background.
Font for Scales Calls the dialogue box to select the font for scales.
Font for Labels Calls the dialogue box to select the font for labels.
Window and Help
The same commands as corresponding menus in dataset analysis are available.
Toolbar icons
Buttons are available in the toolbar providing direct access to the same commands/options as the corre-
sponding menus. They are listed here as they appear from the left to the right.
Open
Save
Copy
Print
Colors
Font for Labels
Font for Scales
Information about the version of GraphID.
40.4.2 Manipulation of the Displayed Matrix
Similar to the manipulation of 3D scatter diagrams, you can use the control elements of the dialogue box in
the left pane of the window to change the graphical image and to rotate the displayed matrix.
The top button can be used to reset the graphic to the start position.
The Colors button lets you change colours of:
Bar (positive values)
Wall
Bar (negative values)
Floor
Background
Labels and scale.
Boxes of the group Hide/Show allow you to display or hide walls, scale, labels on the corresponding axes
and the diagonal if applicable.
The buttons in the group Rotate can be used for rotating the matrix around the vertical axis.
The buttons in the groups Columns and Rows can be used to change the size of columns and rows
respectively.
The buttons in the group Center allow you to move the graphic left, right, up and down.
Chapter 41
Time Series Analysis (TimeSID)
41.1 Overview
TimeSID is a component of WinIDAMS for time series analysis. It uses IDAMS datasets as input where the
dictionary and data files must have the same name with extensions .dic and .dat respectively.
Only one dataset can be used at a time, i.e. opening of another dataset automatically closes the one being
used.
41.2 Preparation of Analysis
Selection of data. Use the menu command File/Open or click the toolbar button Open. Then, in the
Open dialogue box, select your file. Setting Files of type: to IDAMS Data File (*.dat) displays only
IDAMS data files.
Selection of series. You are also asked to specify the series (variables) you want to analyse. Numeric
variables can be selected from the Accessible series list and moved to the Selected series area. Moving
variables between the lists can be done by clicking the buttons >, < (move only highlighted variables), >>,
<< (move all variables). Note that alphabetic variables are not available here.
Missing data treatment. Missing data values are excluded from transformations of series; they are also
excluded from the calculation of statistics and autocorrelations. For the other analyses, missing data values are
replaced by the overall mean.
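The replacement rule can be sketched as below. Missing values are represented here by None, which is an assumption made for the example (IDAMS datasets use missing data codes defined in the dictionary); fill_missing_with_mean is a hypothetical name.

```python
def fill_missing_with_mean(series):
    """Replace missing values (None in this sketch) by the mean of the valid ones."""
    present = [v for v in series if v is not None]
    overall_mean = sum(present) / len(present)
    return [overall_mean if v is None else v for v in series]


filled = fill_missing_with_mean([1.0, None, 3.0, 4.0])
```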
41.3 TimeSID Main Window
After selection of variables and a click on OK, the TimeSID Main window displays the graphic of the first
series from the list of selected series. The series can be manipulated and analysed using various options and
commands in the menus and/or equivalent toolbar icons.
41.3.1 Menu bar and Toolbar
File
Open Calls the dialogue box to select a new dataset for analysis.
Close Closes all windows for the current analysis.
Save As Calls the dialogue box to save the contents of the active pane/window.
Graphical images are saved in Windows Bitmap format (*.bmp). Data table
and tables with statistics are saved in text format.
Print Calls the dialogue box to print the contents of the active pane/window.
Print Preview Displays a print preview of the contents of the active pane/window.
Print Setup Calls the dialogue box for modifying printing and printer options.
Exit Terminates the TimeSID session.
The menu can also contain the list of recently opened files, i.e. files used in previous TimeSID sessions.
Edit
The menu has only one command, Copy, to copy the contents of the active pane/window to the Clipboard.
View
Toolbar Displays/hides toolbar.
Status Bar Displays/hides status bar.
OX Scale Displays/hides OX scale for the time series.
Font for Scales Calls the dialogue box to select the font for scales.
Basic Colors Calls the dialogue box to select colours for the margin and background.
Window
Data Table Calls the window with the data table. Columns of the data table are the
analyzed time series (including transformation results).
Besides Data Table, the menu contains the list of opened windows and Windows options for arranging them.
Help
WinIDAMS Manual Provides access to the WinIDAMS Reference Manual.
About TimeSID Displays information about the version and copyright of TimeSID and a link
for accessing the IDAMS Web page at UNESCO Headquarters.
The two other menus, Transformations and Analysis, are described in detail in sections "Transformation of
Time Series" and "Analysis of Time Series" below.
Toolbar icons
There are 9 active buttons in the toolbar providing direct access to the same commands/options as the
corresponding menu items. They are listed here as they appear from the left to the right.
Open
Copy
Print
Basic colors
Font for scales
Histograms, basic statistical characteristics
Auto-, cross-correlation
Auto-regression
Display information about TimeSID
41.3.2 The Time Series Window
The time series window is divided into 3 panes: the left one is for changing the window properties and for
selecting series (variables), the right upper is for displaying several time series and the right lower is for
displaying the current series.
Changing the pane appearance. The two panes for displaying time series are synchronized and they can
be changed using the controls provided in the left pane. By default, the right upper pane is empty and its
size is reduced. The right lower pane displays the current series, keeping scroll bar and scales visible. The
size of either pane can be changed using the mouse, and the OX scale can be hidden/displayed using the
OX Scale command of the menu View. Moreover, presentation of graphics can be modied as follows:
regulation of graphic compression degree - use the buttons under Compression of OX,
colours for background and margins - use the Colors button or View/Basic Colors command,
font for scales - use the Scale Font button or View/Font for Scales command.
Changing time series name. Select the required time series, click its name with the right mouse button
and select the Change name option. The active window presents the name for modication. Note that these
modications are temporary and they are kept only during the current session.
Selecting time series for display. A list of analysed time series is provided in the left pane. By double
clicking a variable in the list, you can choose the shape and colour of the line for projection. After OK, the
corresponding graphic is displayed in the upper pane. This operation can be repeated for different variables
and thus you can get several graphics displayed simultaneously in the upper pane. The right lower pane
always displays the current series.
Deleting time series from analysis. Select the required time series, click its name with the right mouse
button and select the Delete series option.
41.4 Transformation of Time Series
Time series data can be transformed by calculating differences, smoothing, trend suppression, using a number
of functions, etc. The menu Transformations contains commands for creating new time series based on
values of selected series. Note that variables displayed for selection are renumbered sequentially starting
from zero (0).
Average creates a new time series as an average of the specified series. Series to be taken for calculation
are selected in the dialogue box Selection of series (see section "Preparation of Analysis").
Paired arithmetic creates a set of time series by performing arithmetic operations on pairs of time series
specified in the dialogue box (each series specified in the first argument list with the second argument).
Differences, MA, ROC creates a set of time series based on transformations (sequential differences,
uncentered moving average, rate of change) of the series specified in the dialogue box. Parameters specific
to each transformation as well as the type of ROC transformation are set in the same dialogue box.
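The three transformations named in the last item can be illustrated with the following sketch. The manual does not give TimeSID's exact formulas, so the definitions used here (lagged differences, a trailing moving average over the last window values, and a relative rate of change) are common textbook forms adopted as assumptions; all function names are hypothetical.

```python
def differences(x, lag=1):
    """Sequential differences x[t] - x[t-lag]."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


def trailing_moving_average(x, window):
    """Uncentered moving average: mean of the current and preceding values."""
    return [sum(x[t - window + 1 : t + 1]) / window for t in range(window - 1, len(x))]


def rate_of_change(x, lag=1):
    """Relative rate of change (x[t] - x[t-lag]) / x[t-lag]; one possible ROC type."""
    return [(x[t] - x[t - lag]) / x[t - lag] for t in range(lag, len(x))]


series = [2.0, 4.0, 8.0, 16.0]
d1 = differences(series)
ma2 = trailing_moving_average(series, window=2)
roc = rate_of_change(series)
```

Each result is shorter than the input series because the first lag (or window - 1) positions have no complete history.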
41.5 Analysis of Time Series
Analysis features are activated through commands in the menu Analysis.
Statistics creates the table with mean, standard deviation, minimum and maximum values as well as the
table with statistics for testing the hypothesis of randomness versus trend for the selected time series.
It also displays a histogram for this series.
Auto-, cross-correlations creates a new window with a set of cells containing graphs of auto- and
cross-correlations for the set of specified time series.
Trend (parametric) creates a new time series as the estimation of a parametric trend model for the
specified time series. The trend model and the series are selected in a dialogue box.
Autoregression estimates the parameters of an auto-regression model for short-term prediction for the
specified time series.
Spectrum (spectral analysis) creates a table of spectrum values (frequency, period, density), graph of
spectrum estimation, and for DFT spectrum, graph of deviations of the cumulative spectrum from the
cumulative white-noise spectrum. It can use the fast discrete Fourier transformation (DFT) and/or
maximal entropy (MENT) method for the spectrum density estimation. In the DFT procedure, two
windows are used to get the improved estimation of spectral density: Welch data window in the time
domain and a polynomial smoothing in the frequency domain.
Cross-spectrum analyses a pair of stationary time series. It provides the values of cross-spectrum power,
phase and coherency function as well as their plots. Cross-spectrum is estimated using the Parzen
smoothing window.
Frequency filters procedure decomposes a time series into frequency components. It creates a new series
by applying one of the following filters: low frequency, high frequency, band-pass or band-cut. For a
low or high frequency filter, the frequency bound is equal to the value of the Frequency parameter.
For a band-pass or band-cut filter, the frequency bounds are determined by the interval (Frequency -
Window width, Frequency + Window width). The Detrend option allows you to detrend the time series
before filtering (the trend component is added to the filtering results).
References
Farnum, N.R., Stanton, L.W., Quantitative Forecasting Methods, PWS-KENT Publishing Company, Boston,
1989.
Kendall, M.G., Stuart, A., The Advanced Theory of Statistics, Volume 3 - Design and Analysis, and Time
Series, Second edition, Griffin, London, 1968.
Marple Jr, S.L., Digital Spectral Analysis with Applications, Prentice-Hall, Inc., 1987.
Part VI
Statistical Formulas and
Bibliographical References
Chapter 42
Cluster Analysis
Notation
x = values of variables
h, i, j, l = subscripts for objects
f, g = subscripts for variables
p = number of variables
c = subscript for cluster
k = number of clusters
$N_j$ = number of objects in cluster j
N = total number of cases.
42.1 Univariate Statistics
If the input is an IDAMS dataset, the following statistics are calculated for all variables used in the analysis:
a) Mean.
$$\bar{x}_f = \frac{\sum_{i} x_{if}}{N}$$
b) Mean absolute deviation.
$$s_f = \frac{\sum_{i} |x_{if} - \bar{x}_f|}{N}$$
42.2 Standardized Measurements
In the same situation, the program can compute standardized measurements, also called z-scores, given by:
$$z_{if} = \frac{x_{if} - \bar{x}_f}{s_f}$$
for each case i and each variable f using the mean value and the mean absolute deviation of the variable f
(see section 1 above).
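The following sketch applies the mean, the mean absolute deviation and the z-score formula to one variable. It is an illustration of the formulas only; the function name standardize is hypothetical.

```python
def standardize(column):
    """Return (mean, mean absolute deviation, z-scores) for one variable."""
    n = len(column)
    mean_f = sum(column) / n
    # Mean absolute deviation, not the standard deviation, is the scale used here.
    s_f = sum(abs(x - mean_f) for x in column) / n
    z = [(x - mean_f) / s_f for x in column]
    return mean_f, s_f, z


mean_f, s_f, z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
```

For the values 1 to 5 the mean is 3 and the mean absolute deviation is 1.2, so the first case gets a z-score of -2/1.2.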
42.3 Dissimilarity Matrix Computed From an IDAMS Dataset
The elements $d_{ij}$ of a dissimilarity matrix measure the degree of dissimilarity between cases i and j. The
$d_{ij}$ are calculated directly from the raw data, or from the z-scores if the variables are requested to be
standardized. One of two distances can be chosen: Euclidean or city block.
a) Euclidean distance.
$$d_{ij} = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}$$
b) City block distance.
$$d_{ij} = \sum_{f=1}^{p} |x_{if} - x_{jf}|$$
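Both distances, and the construction of the full dissimilarity matrix from a list of cases, can be sketched as follows. The function names are hypothetical and the three example cases are invented for illustration.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared differences over the p variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    """Sum of absolute differences over the p variables."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dissimilarity_matrix(cases, dist):
    """Symmetric matrix of dissimilarities d[i][j] between all pairs of cases."""
    return [[dist(ci, cj) for cj in cases] for ci in cases]


cases = [[0, 0], [3, 4], [6, 8]]
d_euclid = dissimilarity_matrix(cases, euclidean)
d_city = dissimilarity_matrix(cases, city_block)
```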
42.4 Dissimilarity Matrix Computed From a Similarity Matrix
If the input consists of a similarity matrix with elements $s_{ij}$, the elements $d_{ij}$ of the dissimilarity matrix are
calculated as follows:
$$d_{ij} = 1 - s_{ij}$$
42.5 Dissimilarity Matrix Computed From a Correlation Matrix
If the input consists of a correlation matrix with elements $r_{ij}$, the elements $d_{ij}$ of the dissimilarity matrix
are calculated using one of two formulas: SIGN or ABSOLUTE.
When using the SIGN formula, variables with a high positive correlation receive a dissimilarity coefficient
close to zero, whereas variables with a strong negative correlation will be considered very dissimilar.
$$d_{ij} = (1 - r_{ij})/2$$
When using the ABSOLUTE formula, variables with a high positive or strong negative correlation will be
assigned a small dissimilarity.
$$d_{ij} = 1 - |r_{ij}|$$
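The difference between the two formulas is easiest to see on a few correlation values. The sketch below is illustrative; the function names are hypothetical.

```python
def sign_dissimilarity(r):
    """SIGN formula: d = (1 - r) / 2, so r = -1 gives the largest dissimilarity."""
    return (1 - r) / 2


def absolute_dissimilarity(r):
    """ABSOLUTE formula: d = 1 - |r|, so r = -1 and r = +1 both give 0."""
    return 1 - abs(r)


table = {r: (sign_dissimilarity(r), absolute_dissimilarity(r)) for r in (-1.0, 0.0, 1.0)}
```

A correlation of -1 yields a dissimilarity of 1 under SIGN but 0 under ABSOLUTE, while uncorrelated variables (r = 0) yield 0.5 and 1 respectively.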
42.6 Partitioning Around Medoids (PAM)
The algorithm searches for k representative objects (medoids) which are centrally located in the clusters they
define. The representative object of a cluster, the medoid, is the object for which the average dissimilarity to
all the objects in the cluster is minimal. Actually, the PAM algorithm minimizes the sum of dissimilarities
instead of the average dissimilarity.
The selection of k medoids is performed in two phases. In the first phase, an initial clustering is obtained
by the successive selection of representative objects until k objects have been found. The first object is the
one for which the sum of the dissimilarities to all the other objects is as small as possible. (This is a kind of
"multivariate median" of the N objects, hence the term "medoid".) Subsequently, at each step, PAM selects
the object which decreases the objective function (sum of dissimilarities) as much as possible. In the second
phase, an attempt is made to improve the set of representative objects. This is done by considering all pairs
of objects (i, h) for which object i has been selected and object h has not, checking whether selecting h and
deselecting i reduces the objective function. In each step, the most economical swap is carried out.
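The two phases can be sketched as follows for a precomputed dissimilarity matrix d. This is a deliberately naive illustration that recomputes the objective function for every candidate; it is not the IDAMS implementation, and the names total_cost and pam are hypothetical.

```python
def total_cost(d, medoids):
    """Sum over all objects of the dissimilarity to the closest medoid."""
    return sum(min(d[i][m] for m in medoids) for i in range(len(d)))


def pam(d, k):
    n = len(d)
    # Phase 1: add, one at a time, the object that lowers the objective most.
    medoids = []
    while len(medoids) < k:
        candidates = [h for h in range(n) if h not in medoids]
        medoids.append(min(candidates, key=lambda h: total_cost(d, medoids + [h])))
    # Phase 2: carry out the most economical swap (i out, h in) until none helps.
    while True:
        best_cost, best_set = total_cost(d, medoids), None
        for pos in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids[:pos] + [h] + medoids[pos + 1:]
                cost = total_cost(d, trial)
                if cost < best_cost:
                    best_cost, best_set = cost, trial
        if best_set is None:
            break
        medoids = best_set
    # Each object is assigned to its closest medoid.
    labels = [min(medoids, key=lambda m: d[i][m]) for i in range(n)]
    return medoids, labels, total_cost(d, medoids) / n


points = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in points] for p in points]
medoids, labels, final_average_distance = pam(d, k=2)
```

On this example the first phase selects objects 2 and 4, and one swap in the second phase moves the first medoid to object 1, the centre of the left-hand group.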
a) Final average distance (dissimilarity). This is the PAM objective function, which can be seen as
a measure of "goodness" of the final clustering.
$$\text{Final average distance} = \frac{\sum_{i=1}^{N} d_{i,m(i)}}{N}$$
where m(i) is the representative object (medoid) closest to object i.
b) Isolated clusters. There are two types of isolated clusters: L-clusters and L*-clusters.
Cluster C is an L-cluster if for each object i belonging to C
$$\max_{j \in C} d_{ij} < \min_{h \notin C} d_{ih}$$
Cluster C is an L*-cluster if
$$\max_{i,j \in C} d_{ij} < \min_{l \in C,\, h \notin C} d_{lh}$$
c) Diameter of a cluster. The diameter of the cluster C is defined as the biggest dissimilarity between
objects belonging to C:
$$\text{Diameter}_C = \max_{i,j \in C} d_{ij}$$
d) Separation of a cluster. The separation of the cluster C is defined as the smallest dissimilarity
between two objects, one of which belongs to cluster C and the other does not.
$$\text{Separation}_C = \min_{l \in C,\, h \notin C} d_{lh}$$
e) Average distance to a medoid. If j is the medoid of cluster C, the average distance of all objects
of C to j is calculated as follows:
$$\text{Average distance}_j = \frac{\sum_{i \in C} d_{ij}}{N_j}$$
f) Maximum distance to a medoid. If object j is the medoid of cluster C, the maximum distance of
all objects of C to j is calculated as follows:
$$\text{Maximum distance}_j = \max_{i \in C} d_{ij}$$
g) Silhouettes of clusters. Each cluster is represented by a silhouette (Rousseeuw 1987), showing
which objects lie well within the cluster and which ones merely hold an intermediate position. For
each object, the following information is provided:
- the number of the cluster to which it belongs (CLU),
- the number of the neighbor cluster (NEIG),
- the value $s_i$ (denoted as S(I) in the printed output),
- the three-character identifier of object i,
- a line, the length of which is proportional to $s_i$.
For each object i the value $s_i$ is calculated as follows:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where $a_i$ is the average dissimilarity of object i to all other objects of the cluster A to which i belongs
and $b_i$ is the average dissimilarity of object i to all objects of the closest cluster B (neighbor of object
i). Note that the neighbor cluster is like the second-best choice for object i. When cluster A contains
only one object i, $s_i$ is set to zero ($s_i = 0$).
h) Average silhouette width of a cluster. It is the average of $s_i$ for all objects i in a cluster.
i) Average silhouette width. It is the average of $s_i$ for all objects i in the data, i.e. average silhouette
width for k clusters. This can be used to select the "best" number of clusters, by choosing that k
yielding the highest average of $s_i$.
Another coefficient, SC, called the silhouette coefficient, can be calculated manually as the
maximum average silhouette width over all k for which the silhouettes can be constructed. This
coefficient is a dimensionless measure of the amount of clustering structure that has been discovered
by the classification algorithm.
$$SC = \max_{k} \bar{s}_k$$
Rousseeuw (1987) proposed the following interpretation of the SC coefficient:
0.71 - 1.00   A strong structure has been found.
0.51 - 0.70   A reasonable structure has been found.
0.26 - 0.50   The structure is weak and could be artificial;
              please try additional methods on this data.
≤ 0.25        No substantial structure has been found.
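The silhouette value of each object and the average silhouette width can be computed from a dissimilarity matrix and a vector of cluster numbers, as in the sketch below. It assumes at least two clusters and that no object has both average dissimilarities equal to zero; the function name silhouette_values and the example data are hypothetical.

```python
def silhouette_values(d, labels):
    """Return s_i = (b_i - a_i) / max(a_i, b_i) for every object."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    s = []
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            s.append(0.0)  # singleton cluster: s_i is set to zero
            continue
        # a_i: average dissimilarity to the other objects of the same cluster.
        a = sum(d[i][j] for j in own if j != i) / (len(own) - 1)
        # b_i: average dissimilarity to the closest other cluster (the neighbor).
        b = min(
            sum(d[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != c
        )
        s.append((b - a) / max(a, b))
    return s


points = [0, 1, 2, 10, 11, 12]
d = [[abs(p - q) for q in points] for p in points]
s = silhouette_values(d, [1, 1, 1, 2, 2, 2])
average_width = sum(s) / len(s)
```

For this well-separated example every $s_i$ exceeds 0.8, so the average silhouette width falls in the band interpreted above as a strong structure.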
42.7 Clustering LARge Applications (CLARA)
Similarly to PAM, the CLARA method is also based on the search for k representative objects. But the
CLARA algorithm is designed especially for analyzing large data sets. Consequently, the input to CLARA
has to be an IDAMS dataset.
Internally, CLARA carries out two steps. First a sample is drawn from the set of objects (cases), and
divided into k clusters using the same algorithm as in PAM. Then, each object not belonging to the sample
is assigned to the nearest among the k representative objects. The quality of this clustering is defined as
the average distance between each object and its representative object. Five such samples are drawn and
clustered in turn, and the one is selected for which the lowest average distance was obtained.
The retained clustering of the entire data set is then analyzed further. The final average distance, the average
and maximum distances to each medoid are calculated the same way as in PAM (for all objects, and not
only for those in the selected sample). Silhouettes of clusters and related statistics are also calculated the
same way as in PAM, but only for objects in the selected sample (since the entire silhouette plot would be
too large to print).
42.8 Fuzzy Analysis (FANNY)
Fuzzy clustering is a generalization of partitioning, which can be applied to the same type of data as the
method PAM, but the algorithm is of a different nature. Instead of assigning an object to one particular
cluster, FANNY gives its degree of belonging (membership coefficient) to each cluster, and thus provides
much more detailed information on the structure of the data.
a) Objective function. The fuzzy clustering technique used in FANNY aims to minimize the objective
function
$$\text{Objective function} = \sum_{c=1}^{k} \frac{\sum_{i} \sum_{j} u_{ic}^2 u_{jc}^2 d_{ij}}{2 \sum_{j} u_{jc}^2}$$
where $u_{ic}$ and $u_{jc}$ are membership functions which are subject to the constraints
$$u_{ic} \ge 0 \quad \text{for } i = 1, 2, \ldots, N;\ c = 1, 2, \ldots, k$$
$$\sum_{c} u_{ic} = 1 \quad \text{for } i = 1, 2, \ldots, N$$
The algorithm minimizing this objective function is iterative, and stops when the function converges.
b) Fuzzy clustering (memberships). These are the membership values (membership coefficients $u_{ic}$)
which provide the smallest value of the objective function. They indicate, for each object i, how
strongly it belongs to cluster c. Note that the sum of membership coefficients equals 1 for each object.
c) Partition coefficient of Dunn. This coefficient, $F_k$, measures how "hard" a fuzzy clustering is. It
varies from the minimum of 1/k for a completely fuzzy clustering (where all $u_{ic} = 1/k$) up to 1 for an
entirely hard clustering (where all $u_{ic}$ = 0 or 1).
$$F_k = \sum_{i=1}^{N} \sum_{c=1}^{k} u_{ic}^2 / N$$
d) Normalized partition coefficient of Dunn. The normalized version of the partition coefficient of
Dunn always varies from 0 to 1, whatever value of k was chosen.
$$F'_k = \frac{F_k - (1/k)}{1 - (1/k)} = \frac{k F_k - 1}{k - 1}$$
e) Closest hard clustering. This partition (= "hard" clustering) is obtained by assigning each object
to the cluster in which it has the largest membership coefficient. Silhouettes of clusters and related
statistics are calculated the same way as in PAM.
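Given a matrix of membership coefficients, the partition coefficient of Dunn, its normalized version and the closest hard clustering can be computed as in this sketch. The membership values are invented for illustration and the function names are hypothetical; how ties between equal memberships are broken is an assumption (first cluster wins).

```python
def dunn_coefficients(u):
    """Return (F_k, normalized F_k) for memberships u[i][c] of N objects in k clusters."""
    n, k = len(u), len(u[0])
    f_k = sum(x * x for row in u for x in row) / n
    return f_k, (k * f_k - 1) / (k - 1)


def closest_hard_clustering(u):
    """Assign each object to the cluster with its largest membership coefficient."""
    return [max(range(len(row)), key=row.__getitem__) for row in u]


u = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
f_k, f_k_normalized = dunn_coefficients(u)
hard = closest_hard_clustering(u)
```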
42.9 AGglomerative NESting (AGNES)
This method can be applied to the same type of data as the methods PAM and FANNY. However, it is no
longer necessary to specify the number of clusters required. The algorithm constructs a tree-like hierarchy
which implicitly contains all values of k, starting with N clusters and proceeding by successive fusions until
a single cluster is obtained with all the objects.
In the first step, the two closest objects (i.e. with smallest inter-object dissimilarity) are joined to constitute
a cluster with two objects, whereas the other clusters have only one member. In each succeeding step, the
two closest clusters (with smallest inter-object dissimilarity) are merged.
a) Dissimilarity between two clusters. In the AGNES algorithm, the group average method of
Sokal and Michener (sometimes called "unweighted pair-group average method") is used to measure
dissimilarities between clusters.
Let R and Q denote two clusters and |R| and |Q| denote their number of objects. The dissimilarity
d(R, Q) between clusters R and Q is defined as the average of all dissimilarities $d_{ij}$, where i is any
object of R and j is any object of Q.
$$d(R, Q) = \frac{1}{|R|\,|Q|} \sum_{i \in R} \sum_{j \in Q} d_{ij}$$
b) Final ordering of objects and dissimilarities between them. In the first line, the objects are
listed in the order they will appear in the graphical representation of results. In the second line, the
dissimilarities between joining clusters are printed. Note that the number of dissimilarities printed is
one less than the number of objects N, because there are $N - 1$ fusions.
c) Dissimilarity banner. It is a graphical presentation of the results. A banner consists of stars and
stripes. The stars indicate links and the stripes are repetitions of identifiers of objects. A banner is
always read from left to right. Each line with stars starts at the dissimilarity between the clusters
being merged. There are fixed scales above and below the banner, going from 0.00 (dissimilarity 0) to
1.00 (largest dissimilarity encountered). The actual highest dissimilarity (corresponding to 1.00 in the
banner) is provided just below the banner.
d) Agglomerative coefficient. The average width of the banner is called the agglomerative coefficient
(AC). It describes the strength of the clustering structure that has been found.
$$AC = \frac{1}{N} \sum_{i} l_i$$
where $l_i$ is the length of the line containing the identifier of object i.
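The group average dissimilarity and the sequence of fusions can be illustrated with the naive sketch below, which recomputes all pairwise cluster dissimilarities at every step. It is not the IDAMS implementation and does not produce the banner; group_average and agnes_merges are hypothetical names.

```python
def group_average(d, r, q):
    """Average of all dissimilarities d[i][j] with i in cluster r and j in cluster q."""
    return sum(d[i][j] for i in r for j in q) / (len(r) * len(q))


def agnes_merges(d):
    """Merge the two closest clusters repeatedly; return (cluster, cluster, dissimilarity)."""
    clusters = [[i] for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        pairs = [
            (group_average(d, clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        height, a, b = min(pairs)
        merges.append((clusters[a], clusters[b], height))
        rest = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters = rest + [clusters[a] + clusters[b]]
    return merges


points = [0, 1, 5]
d = [[abs(p - q) for q in points] for p in points]
merges = agnes_merges(d)
```

For three objects there are two fusions: objects 0 and 1 join at dissimilarity 1, and object 2 then joins them at the group average (5 + 4) / 2 = 4.5.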
42.10 DIvisive ANAlysis (DIANA)
The method DIANA can be used for the same type of data as the method AGNES. Although AGNES and
DIANA produce similar output, DIANA constructs its hierarchy in the opposite direction, starting with one
large cluster containing all objects. At each step, it splits up a cluster into two smaller ones, until all clusters
contain only a single element. This means that for N objects, the hierarchy is built in $N - 1$ steps.
In the first step, the data are split into two clusters by making use of dissimilarities. In each subsequent
step, the cluster with the largest diameter (see 6.c above) is split in the same way. After $N - 1$ divisive
steps, all objects are apart.
a) Average dissimilarity to all other objects. Let A denote a cluster and |A| denote its number of
objects. The average dissimilarity between object i and all other objects in cluster A is defined as in
6.g above.
$$d_i = \frac{1}{|A| - 1} \sum_{j \in A,\, j \ne i} d_{ij}$$
b) Final ordering of objects and diameters of clusters. In the first line, the objects are listed
in the order they will appear in the graphical representation. The diameters of clusters are printed
below that. These two sequences of numbers together characterize the whole hierarchy. The largest
diameter indicates the level at which the whole data set is split. The objects on the left side of this
value constitute one cluster, and the objects on the right side constitute another one. The second
largest diameter indicates the second split, etc.
c) Dissimilarity banner. As for the AGNES method, it is a graphical presentation of the results. It
also consists of lines with stars, and the stripes which repeat the identifiers of objects. The banner is
read from left to right but the fixed scales above and below the banner now go from 1.00 (corresponding
to the diameter of the entire data set) to 0.00 (corresponding to the diameter of singletons). Each
line with stars ends at the diameter at which the cluster is split. The actual diameter of the data set
(corresponding to 1.00 in the banner) is provided just below the banner.
d) Divisive coefficient. The average width of the banner is called the divisive coefficient (DC). It
describes the strength of the clustering structure found.
$$DC = \frac{1}{N} \sum_{i} l_i$$
where $l_i$ is the length of the line containing the identifier of object i.
42.11 MONothetic Analysis (MONA)
The method MONA is intended for data consisting exclusively of binary (dichotomic) variables (which take
only two values, so that $x_{if} = 0$ or $x_{if} = 1$). Although the algorithm is of the hierarchical divisive type, it
does not use dissimilarities between objects, and therefore a matrix of dissimilarities is not computed. The
division into clusters uses the variables directly.
At each step, one of the variables (say, f) is used to split the data by separating the objects i for which
$x_{if} = 1$ from those for which $x_{if} = 0$. In the next step, each cluster obtained in the previous step is split
further, using values (0 and 1) of one of the remaining variables (different variables may be used in different
clusters). The process is continued until each cluster either contains only one object, or the remaining
variables cannot split it.
For each split, the variable most strongly associated with the other variables is chosen.
a) Association between two variables. The measure of association between two variables f and g is
defined as follows:
$$A_{fg} = |a_{fg} d_{fg} - b_{fg} c_{fg}|$$
where $a_{fg}$ is the number of objects i with $x_{if} = x_{ig} = 0$, $d_{fg}$ is the number of objects with $x_{if} = x_{ig} = 1$,
$b_{fg}$ is the number of objects with $x_{if} = 0$ and $x_{ig} = 1$, and $c_{fg}$ is the number of objects with $x_{if} = 1$
and $x_{ig} = 0$.
The measure $A_{fg}$ expresses whether the variables f and g provide similar divisions of the set of objects,
and can be considered as a kind of similarity between variables.
In order to select the variable most strongly associated with the other variables, the total measure $A_f$
is calculated for each variable f as follows:
$$A_f = \sum_{g \ne f} A_{fg}$$
b) Final ordering of objects. The objects are listed in the order they appear in the separation plot
(banner). The separation steps and the variables used for separation are printed under object identifiers.
c) Separation plot (banner). This graphical presentation is quite similar to the banner printed by
DIANA. The length of a row of stars is now proportional to the step number at which separation
was carried out. Rows of object identifiers correspond to objects. A row of identifiers which does not
continue to the right-hand side of the banner signals an object that became a singleton cluster at the
corresponding step. Rows of identifiers plotted between two rows of stars indicate objects belonging
to a cluster which cannot be separated.
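The selection of the splitting variable in MONA can be illustrated with a minimal Python sketch. It computes the association $A_{fg} = |a_{fg} d_{fg} - b_{fg} c_{fg}|$ and the total measure $A_f$ for a small binary data set. The function names, the list-of-rows data layout, and the rule for breaking ties (the first variable with the largest total is taken) are assumptions made for this illustration and are not taken from the program.

```python
def association(data, f, g):
    """Return A_fg = |a*d - b*c| for binary variables f and g (column indices)."""
    a = sum(1 for row in data if row[f] == 0 and row[g] == 0)
    d = sum(1 for row in data if row[f] == 1 and row[g] == 1)
    b = sum(1 for row in data if row[f] == 0 and row[g] == 1)
    c = sum(1 for row in data if row[f] == 1 and row[g] == 0)
    return abs(a * d - b * c)


def total_association(data, f):
    """Return A_f, the sum of A_fg over all variables g different from f."""
    n_vars = len(data[0])
    return sum(association(data, f, g) for g in range(n_vars) if g != f)


def best_split_variable(data):
    """Return the index of the variable with the largest total association.

    Ties are resolved in favour of the lowest index (an assumption).
    """
    n_vars = len(data[0])
    return max(range(n_vars), key=lambda f: total_association(data, f))


# Four objects described by three binary variables (illustrative data).
data = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
split_variable = best_split_variable(data)
```

In this example the first two variables divide the objects identically, so they are strongly associated with each other, while the third variable is unrelated to both.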
42.12 References
Kaufman, L., and Rousseeuw, P.J., Finding Groups in Data: An Introduction to Cluster Analysis, John
Wiley & Sons, Inc., New York, 1990.
Rousseeuw, P.J., Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis,
Journal of Computational and Applied Mathematics, 20, 1987.
Chapter 43
Configuration Analysis
Notation
Let $A_{(n,t)}$ be a rectangular matrix of n variables (rows) and t dimensions (columns). A variable or
point a has t coordinates, each one corresponding to one dimension.
$a_{is}$ = element of the matrix A in the i-th row and the s-th column
i, j = subscripts for variables (rows)
n = number of variables
s, l, m = subscripts for dimensions (columns)
t = number of dimensions.
43.1 Centered Configuration
The variables are centered within each dimension by subtracting the mean of each column from each element
in the column.
$$\text{Centered } a_{is} = a_{is} - \frac{\sum_{i} a_{is}}{n}$$
After application of this formula, the mean of the coordinates of the n variables is zero for each dimension.
43.2 Normalized Configuration
The sum of squares of all the elements of the matrix A divided by the number of variables n gives the mean
of second moments of the variables. Each element of the matrix is normalized by the square root of this
value (see denominator below).
$$\text{Normalized } a_{is} = \frac{a_{is}}{\sqrt{\sum_{i} \sum_{s} a_{is}^2 / n}}$$
After this normalization, the sum of squares of the $a_{is}$ elements is equal to n.
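Centering and normalization can be sketched in a few lines of Python. The sketch below stores the configuration as a list of rows (one row per variable, one column per dimension); the function names and this layout are assumptions made for the illustration.

```python
import math


def center(A):
    """Subtract the column mean from every element of each column."""
    n, t = len(A), len(A[0])
    means = [sum(A[i][s] for i in range(n)) / n for s in range(t)]
    return [[A[i][s] - means[s] for s in range(t)] for i in range(n)]


def normalize(A):
    """Divide every element by sqrt(sum of all squared elements / n)."""
    n = len(A)
    scale = math.sqrt(sum(a * a for row in A for a in row) / n)
    return [[a / scale for a in row] for row in A]


# Three variables in two dimensions (illustrative values).
A = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
C = center(A)
Z = normalize(C)
sum_of_squares = sum(a * a for row in Z for a in row)
```

After both steps the column means are zero and the sum of squares of all elements equals the number of variables, here 3.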
43.3 Solution with Principal Axes
The configuration is rotated so that successive dimensions account for maximum possible variance. Let A
be the configuration to be rotated and B be the configuration in its principal axis form.
Calculation of matrix B:
The eigenvalues and eigenvectors of the symmetric matrix $A'A$ are determined using Jacobi's
diagonalization method.
The matrix A is transformed into a matrix B of $b_{is}$ elements, such that $B = AT$, where the columns of T
are the eigenvectors of $A'A$, B having n rows and t columns like the matrix A.
43.4 Matrix of Scalar Products
$$SP_{ij} = \sum_{s} a_{is} a_{js}$$
The matrix SP of dimensions (n, n) is a square and symmetric matrix of scalar products of variables. The
scalar product of a variable by itself is its second moment. If each variable is centered and normalized (mean
= 0, standard deviation = 1), the matrix SP becomes a correlation matrix.
43.5 Matrix of Interpoint Distances
$$DIST_{ij} = \sqrt{\sum_{s} (a_{is} - a_{js})^2}$$
DIST is a square and symmetric matrix of Euclidean distances between variables.
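The two matrices of sections 43.4 and 43.5 can be computed directly from the configuration. The following Python sketch assumes that the configuration is held as a list of rows (one row per variable); the function names are illustrative only.

```python
import math


def scalar_products(A):
    """SP[i][j] = sum over dimensions s of a_is * a_js."""
    n = len(A)
    return [[sum(x * y for x, y in zip(A[i], A[j])) for j in range(n)]
            for i in range(n)]


def distances(A):
    """DIST[i][j] = Euclidean distance between variables i and j."""
    n = len(A)
    return [[math.sqrt(sum((x - y) ** 2 for x, y in zip(A[i], A[j])))
             for j in range(n)] for i in range(n)]


# Three variables in two dimensions (illustrative values).
A = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
SP = scalar_products(A)
DIST = distances(A)
```

Both results are square, symmetric matrices of order n; the diagonal of SP holds the second moments and the diagonal of DIST is zero.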
43.6 Rotated Configuration
The rotation can be performed only on two dimensions at a time. The user selects the two dimensions,
e.g. 2 and 5 (column 2 and column 5), and the angle of rotation $\theta$ in degrees.
New coordinates are calculated as follows:
$$a'_{il} = a_{il} \cos\theta + a_{im} \sin\theta$$
$$a'_{im} = -a_{il} \sin\theta + a_{im} \cos\theta$$
The calculation is performed for each value of i, i.e. once for each variable.
In the matrix A, the columns l and m become the vectors of the new coordinates calculated as indicated
above.
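A rotation of two selected dimensions can be sketched as follows. The sketch assumes the usual plane rotation, in which the new column l is $a_{il}\cos\theta + a_{im}\sin\theta$ and the new column m is $-a_{il}\sin\theta + a_{im}\cos\theta$; the function name and the zero-based column indices are choices made for this illustration.

```python
import math


def rotate(A, l, m, degrees):
    """Rotate columns l and m (zero-based) of A by the given angle in degrees."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    B = [row[:] for row in A]  # copy, so the input configuration is unchanged
    for i, row in enumerate(A):
        B[i][l] = row[l] * c + row[m] * s
        B[i][m] = -row[l] * s + row[m] * c
    return B


# Two variables in three dimensions; rotate dimensions 0 and 2 by 90 degrees.
A = [[1.0, 7.0, 0.0], [0.0, 8.0, 2.0]]
B = rotate(A, 0, 2, 90.0)
```

Columns that are not selected are left untouched, and the squared length of each row is preserved by the rotation.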
43.7 Translated Configuration
The translation can be performed only on one single dimension (one column) at a time. The user specifies
the constant T to be added to each element of the dimension, and the column l it applies to.
For all the coordinates of l (n coordinates since n variables):
$$a'_{il} = a_{il} + T$$
43.8 Varimax Rotation
(a) The elements $a_{is}$ of A are normalized by the square root of the communalities corresponding to each
variable, and one defines
$$b_{is} = \frac{a_{is}}{\sqrt{\sum_{s} a_{is}^2}}$$
(b) Having constructed $B = (b_{is})$, one looks for the best projection axes for the variables, after
equalization of their inertia. The function $V_c$ is maximized through successive rotations of two
dimensions at a time, until convergence is reached.
$$V_c = \sum_{s} \frac{n \sum_{i} b_{is}^4 - \left(\sum_{i} b_{is}^2\right)^2}{n^2}$$
The result matrix B of $b_{is}$ elements has the same number of rows and columns as the initial matrix A.
43.9 Sorted Configuration
This is the final configuration printed in a different format. Each dimension is printed as a row, with
elements for the dimension in ascending order.
43.10 References
Greenstadt, J., The determination of the characteristic roots of a matrix by the Jacobi method, Mathematical
Methods for Digital Computers, eds. A. Ralston and H.S. Wilf, Wiley, New York, 1960.
Harman, H.H., Modern Factor Analysis, University of Chicago Press, Chicago, 1967.
Kaiser, H.F., Computer program for varimax rotation in factor analysis, Educational and Psychological
Measurement, 3, 1959.
Chapter 44
Discriminant Analysis
Notation
x = values of variables
k = subscript for case
i, j = subscripts for variables
g = superscript for group
q = subscript for step
p = number of variables
w = value of the weight
$x_k^g$ = p-element vector corresponding to the case k in the group g
$y_q^g$ = vector with mean values of variables selected in the step q for the group g
$N_g$ = number of cases in the group g
$W_g$ = total sum of weights for the group g
$I_q$ = subset of indices for variables selected in the step q.
44.1 Univariate Statistics
These statistics, weighted if the weight is specified, are calculated for each group and for each analysis
variable, using the basic sample. The mean is calculated also for the whole basic sample (total mean).
a) Mean.
$$\bar{x}_i^g = \frac{\sum_{k=1}^{N_g} w_k^g x_{ki}^g}{W_g}$$
Note: the total mean is calculated using the analogous formula.
b) Standard deviation.
$$s_i^g = \sqrt{\frac{\sum_{k=1}^{N_g} w_k^g (x_{ki}^g)^2}{W_g} - (\bar{x}_i^g)^2}$$
44.2 Linear Discrimination Between 2 Groups
The procedure is based on the linear discriminant function of Fisher and uses the total covariance matrix
for calculating coefficients of this function. Classification of cases is done using the values of this
function, and not distances as such. The criterion applied for selecting the next variable is the $D^2$ of
Mahalanobis (Mahalanobis distance between two groups). After each step, the program provides the linear
discriminant function, the classification table and the percentage of correctly classified cases for both
the basic and test samples.
a) Linear discriminant function. Let us denote the function calculated in step q as
$$f_q(x) = \sum_{i \in I_q} b_{qi} x_i + a_q$$
The coefficients $b_{qi}$ of this function for the variables i included in step q correspond to the
elements of the unique eigenvector of the matrix
$$(y_q^1 - y_q^2)' \, T_q^{-1}$$
and the constant term is calculated as follows:
$$a_q = -\frac{1}{2} (y_q^1 - y_q^2)' \, T_q^{-1} \, (y_q^1 + y_q^2)$$
where $T_q$ is the matrix of total covariance (calculated for the cases from both groups) for the variables
included in step q, with the elements
$$t_{ij} = \frac{\sum_{k} w_k (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{W_1 + W_2}$$
b) Classification table for basic sample.
A case is assigned:
- to the group 1 if $f_q(x) > 0$,
- to the group 2 if $f_q(x) < 0$.
A case is not assigned if $f_q(x) = 0$.
The percentage of correctly classified cases is calculated as the ratio between the number of cases
on the diagonal and the total number of cases in the classification table.
c) Classification table for test sample.
Constructed in the same way as for the basic sample (see 2.b above).
d) Criterion for selecting the next variable. The Mahalanobis distance between the two groups is
used for this purpose. The variable selected in step q is the one which maximizes the value of $D_q^2$.
$$D_q^2 = (y_q^1 - y_q^2)' \, T_q^{-1} \, (y_q^1 - y_q^2)$$
e) Allocation and value of the linear discriminant function for the cases. These are calculated
and printed for the last step, or when the step precedes a decrease of the percentage of correctly
classified cases. The function value is calculated according to the formula described under point 2.a
above; the variables used in the calculation are those retained in the step. The assignment of cases to
the groups is done as described under point 2.b above.
The same formula and assignment rules are used for the basic sample, the group means, the test sample
and the anonymous sample.
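The two-group procedure can be illustrated with a minimal Python sketch. It assumes unit weights, takes the coefficient vector as $b = T^{-1}(y^1 - y^2)$ and the constant as $a = -\frac{1}{2} b'(y^1 + y^2)$, and obtains the Mahalanobis $D^2$ as $b'(y^1 - y^2)$. The small Gauss-Jordan solver, the function names and the data are all hypothetical and serve only to make the example self-contained; stepwise selection of variables is not shown.

```python
def solve(M, v):
    """Solve M z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[r]) + [v[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def means(cases):
    """Vector of variable means over a list of cases (unit weights)."""
    return [sum(c[i] for c in cases) / len(cases) for i in range(len(cases[0]))]


def total_covariance(group1, group2):
    """Total covariance matrix computed over the cases of both groups."""
    cases = group1 + group2
    xbar, W = means(cases), len(cases)
    p = len(xbar)
    return [[sum((c[i] - xbar[i]) * (c[j] - xbar[j]) for c in cases) / W
             for j in range(p)] for i in range(p)]


def discriminant(group1, group2):
    """Return coefficients b, constant a and the Mahalanobis D squared."""
    y1, y2 = means(group1), means(group2)
    diff = [u - v for u, v in zip(y1, y2)]
    b = solve(total_covariance(group1, group2), diff)
    a = -0.5 * sum(bi * (u + v) for bi, u, v in zip(b, y1, y2))
    d2 = sum(bi * di for bi, di in zip(b, diff))
    return b, a, d2


def classify(x, b, a):
    """Group 1 if f(x) > 0, group 2 if f(x) < 0, None if f(x) = 0."""
    f = sum(bi * xi for bi, xi in zip(b, x)) + a
    return 1 if f > 0 else 2 if f < 0 else None


group1 = [[2.0, 0.0], [4.0, 0.0], [3.0, 1.0], [3.0, -1.0]]
group2 = [[-2.0, 0.0], [-4.0, 0.0], [-3.0, 1.0], [-3.0, -1.0]]
b, a, d2 = discriminant(group1, group2)
```

With these data only the first variable separates the groups, so the second coefficient is zero and a case lying exactly between the two group means is left unassigned.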
44.3 Linear Discrimination Between More Than 2 Groups
The procedure for discrimination of 3 or more groups uses not only the total covariance matrix but also the
between groups covariance matrix. The criterion for selecting the next variable used here is the trace of a
product of these two matrices (generalization of Mahalanobis distance for two groups). After selecting the
new variable to be entered, discriminant factor analysis is performed and the program provides the overall
discriminant power and the discriminant power of the first three factors. Cases are classified according to
their distances from the centres of groups. In each step, the program calculates and prints the
classification table and the percentage of correctly classified cases for both the basic and test samples.
a) Classification table for basic sample. The distance of a case x from the centre of the group g in
the step q is defined as the linear function
$$v_{y_q^g}(x) = (y_q^g)' \, T_q^{-1} \, (y_q^g - 2x)$$
where $T_q$, as described under 2.a above, is the matrix of total covariance (calculated for the cases from
all groups) for the variables included in step q, with the elements
$$t_{ij} = \frac{\sum_{k} w_k (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{W}$$
A case is assigned to the group for which $v_{y_q^g}(x)$ has the smallest value (the smallest distance).
The percentage of correctly classified cases is calculated as the ratio between the number of cases
on the diagonal and the total number of cases in the classification table.
b) Classification table for test sample.
Constructed in the same way as for the basic sample (see 3.a above).
c) Criterion for selecting the next variable. The variable selected in the step q is the one which
maximizes the value of the trace of the matrix $T_q^{-1} B_q$, where $T_q$ is the total covariance matrix
used in step q (see 3.a above), and $B_q$ is the matrix of covariances between groups, with the elements
$$b_{ij} = \frac{\sum_{g} W_g (y_i^g - \bar{x}_i)(y_j^g - \bar{x}_j)}{W}$$
The following part of analysis (points 3.d - 3.h below) is performed in one of the three following
circumstances:
- when the step precedes a decrease of the percentage of correctly classified cases,
- when the percentage of correctly classified cases is equal to 100,
- when the step is the last one.
d) Allocation and distances of cases in the basic sample. The distances from each group are
calculated as described under point 3.a above; the variables used in the calculation are those retained
in the step. The assignment of cases to the groups is done as described under point 3.a above.
e) Discriminant factor analysis. The matrix $T_q^{-1} B_q$ described under 3.c above is analysed. The first
two eigenvectors corresponding to the two highest eigenvalues of this matrix are the two discriminant
factorial axes. The discriminant power of the factors is measured by the corresponding eigenvalues.
Since the program provides the discriminant power for the first three factors, the sum of eigenvalues
makes it possible to estimate the level of the remaining eigenvalues, i.e. those which are not printed.
f) Values of discriminant factors for all cases and group means.
For a case, the value of discriminant factor is calculated as the scalar product of the case vector
containing variables retained in the step by the eigenvector corresponding to the factor. Note that
these values are not printed, but they are used in a graphical representation of cases in the space of
the first two factors.
For a group mean, the value of discriminant factor is calculated in the same way replacing the case
vector by the group mean vector.
g) Allocation and distances of cases in the test sample. The distances from each group are
calculated in the same way, and assignment of cases to the groups is done following the same rules as
for the basic sample (see 3.d above).
h) Allocation and distances of cases in the anonymous sample. The distances from each group
are calculated the same way and assignment of cases to the groups is done following the same rules as
for the basic sample (see 3.d above).
44.4 References
Romeder, J.M., Méthodes et programmes d'analyse discriminante, Dunod, Paris, 1973.
Chapter 45
Distribution and Lorenz Functions
Notation
$p_i$ = value of the i-th break point
i = subscript for break point
s = number of subintervals
N = total number of cases.
45.1 Formula for Break Points
The number of break points is one less than the number of requested subintervals, e.g. medians imply two
subintervals and one break point.
$$p_i = V(\alpha) + \beta \, [V(\alpha + 1) - V(\alpha)]$$
where V is an ordered data vector, e.g. V(3) is the third item in the vector,
$$\alpha = \text{entier}\left(\frac{i(N + 1)}{s}\right), \qquad \beta = \frac{i(N + 1)}{s} - \alpha$$
and entier(x) is the greatest integer not exceeding x.
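The break point formula can be sketched in Python as below. The sketch reads the formula as a linear interpolation in the ordered vector, with the integer part of $i(N+1)/s$ giving the position and the fractional part giving the interpolation weight. The function name is hypothetical, and the special treatment of tied values described in section 45.2 is deliberately left out.

```python
import math


def break_points(values, s):
    """Return the s - 1 break points that divide the data into s subintervals.

    V is treated as 1-based in the formula, hence the index shift below.
    Tied values receive no special treatment in this sketch.
    """
    V = sorted(values)
    N = len(V)
    if N + 1 < s:
        raise ValueError("too few cases for the requested number of subintervals")
    points = []
    for i in range(1, s):
        position = i * (N + 1) / s
        alpha = math.floor(position)          # entier(i(N + 1) / s)
        beta = position - alpha               # fractional part
        low = V[alpha - 1]                    # V(alpha)
        high = V[alpha] if alpha < N else low  # V(alpha + 1), if it exists
        points.append(low + beta * (high - low))
    return points


median_odd = break_points([5, 1, 4, 2, 3], 2)
median_even = break_points([1, 2, 3, 4], 2)
quartiles = break_points(range(1, 8), 4)
```

For an even number of cases the median falls half way between the two middle values, as expected from ordinary linear interpolation.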
45.2 Distribution Function Break Points
There are four possible situations:
- If a break point falls exactly on a value and the value is not tied with any other value, then the value
itself is the break point.
- If a break point falls between two values and the two values are not the same, then the break point is
determined using ordinary linear interpolation.
- If a break point falls exactly on a value and the value is tied with one or more other values, then the
procedure involves computing new midpoints. Let k be the value, m be the frequency with which it
occurs and d be the minimum distance between items in the vector V. The interval $k \pm \min(d, 1)/2$ is
divided into m parts and midpoints are computed for these new intervals. The break point is then the
appropriate midpoint.
- If a break point falls between two values which are identical, the procedure involves both the calculation
of new midpoints and ordinary linear interpolation. Let k be the value, m be the frequency with which
it occurs and d be the minimum distance between items in the vector V. The interval $k \pm \min(d, 1)/2$
is divided into m parts and midpoints are computed for these new intervals. Then linear interpolation
is performed between the two appropriate new midpoints.
45.3 Lorenz Function Break Points
To determine Lorenz function break points, the ordered data vector is cumulated, and at each step the
cumulated total is divided by the grand total. Then the break points are found the same way as described
above.
45.4 Lorenz Curve
The Lorenz function plotted against the proportion of the ordered population gives a Lorenz curve, which
is always contained in the lower triangle of the unit square. The QUANTILE program uses ten subintervals
for the Lorenz curve.
Note that Lorenz function values are called "Fraction of wealth" on the printout.
45.5 The Gini Coefficient
The Gini coefficient represents twice the area between the Lorenz function and the diagonal plotted in the
unit square. It takes on values between 0 and 1. Zero (0) indicates perfect equality - all data values are
equal. One (1) indicates perfect inequality - there is one non-zero data value.
The program uses an approximation:
$$\text{Gini coefficient} = 1 - \frac{1}{s} - \frac{2}{s} \sum_{i=1}^{s-1} l_i$$
where $l_i$ is the i-th Lorenz function break point.
This approximation becomes more accurate as the number of break points is increased; it is recommended
that at least ten be used.
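The approximation is the trapezoidal rule applied to the Lorenz function, and a minimal Python sketch makes its behaviour at the two extremes visible. The function name is hypothetical, and the Lorenz function break points are passed in directly instead of being derived from raw data.

```python
def gini_from_lorenz(lorenz_break_points):
    """Approximate the Gini coefficient from the s - 1 Lorenz break points."""
    s = len(lorenz_break_points) + 1
    return 1 - 1 / s - (2 / s) * sum(lorenz_break_points)


# Perfect equality: the Lorenz function lies on the diagonal, l_i = i / s.
equal = gini_from_lorenz([i / 10 for i in range(1, 10)])

# Extreme inequality: every break point of the Lorenz function is zero.
unequal_10 = gini_from_lorenz([0.0] * 9)
unequal_100 = gini_from_lorenz([0.0] * 99)
```

With all break points equal to zero the approximation gives 1 - 1/s, that is 0.9 for ten subintervals and 0.99 for one hundred, which shows why a larger number of break points is recommended.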
45.6 Kolmogorov-Smirnov D Statistic
The Kolmogorov-Smirnov test is concerned with the agreement between two cumulative distributions. If
two sample cumulative distributions are too far apart at any point, it suggests that the samples come from
different populations. The test focuses on the largest difference between the two distributions.
Let $V_1$ and $V_2$ be the ordered data vectors for the first and the second variable respectively, and X
the vector of codes which appear in either distribution. The program creates the two cumulative step
functions $F_1(x)$ and $F_2(x)$ respectively. Then it looks for the maximum absolute difference between the
distributions,
$$D = \max_x |F_1(x) - F_2(x)|$$
and prints:
x : the value where the first maximum absolute difference occurs
$f_1$ : the value of $F_1$ associated with the x
$f_2$ : the value of $F_2$ associated with the x.
If the Ns for $V_1$ and $V_2$ are equal and less than 40, the program prints the K statistic, equal to the
difference in frequencies associated with the maximum difference. A table of critical values of the K
statistic, denoted $K_D$, can be consulted to determine the significance of the observed difference.
If the Ns for $V_1$ and $V_2$ are unequal or larger than 40, the program prints the following statistics:
$$\text{Unadjusted deviation} = D = |f_1 - f_2|$$
$$\text{Adjusted deviation} = D \sqrt{\frac{N_1 N_2}{N_1 + N_2}}$$
where $N_1$ and $N_2$ are equal to the number of cases in $V_1$ and $V_2$ respectively.
$$\text{Chi-squared approximation} = 4 D^2 \, \frac{N_1 N_2}{N_1 + N_2}$$
Note: The significance of the maximum directional deviation can be found by referring this chi-square value
to a chi-square distribution with two degrees of freedom.
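The statistics of this section can be reproduced with a short Python sketch. It builds the two cumulative step functions over the codes that occur in either vector, keeps the first code at which the absolute difference is largest, and then applies the two formulas above. The function name and the returned dictionary are assumptions made for the illustration; data are unweighted.

```python
import math


def ks_statistics(v1, v2):
    """Return x, f1, f2, D, the adjusted deviation and the chi-squared value."""
    n1, n2 = len(v1), len(v2)
    codes = sorted(set(v1) | set(v2))
    d, x_at_max, f1_at_max, f2_at_max = 0.0, None, 0.0, 0.0
    for x in codes:
        f1 = sum(1 for v in v1 if v <= x) / n1  # F1(x)
        f2 = sum(1 for v in v2 if v <= x) / n2  # F2(x)
        if abs(f1 - f2) > d:  # strict test keeps the first maximum
            d, x_at_max, f1_at_max, f2_at_max = abs(f1 - f2), x, f1, f2
    ratio = n1 * n2 / (n1 + n2)
    return {
        "x": x_at_max,
        "f1": f1_at_max,
        "f2": f2_at_max,
        "D": d,
        "adjusted": d * math.sqrt(ratio),
        "chi2": 4 * d * d * ratio,
    }


result = ks_statistics([1, 2, 3, 4], [3, 4, 5, 6])
```

In this example the largest difference, 0.5, is reached at several codes; the sketch reports the first of them, the code 2.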
45.7 Note on Weights
For distribution function break points, Lorenz function break points, and the Gini coefficients, data may be
weighted by an integer. If a weight is specified, each case is implicitly counted as w cases, where w is
the weight value for the case. The Kolmogorov-Smirnov test is always performed on unweighted data.
Chapter 46
Factor Analyses
Notation
x = values of variables
i = subscript for case
j, j' = subscripts for variables
a) Mean.
$$\bar{x}_j = \frac{\sum_{i=1}^{I1} w_i x_{ij}}{W}$$
b) Variance (estimated).
$$s_j^2 = \left(\frac{N}{N - 1}\right) \left[\frac{W \sum_{i=1}^{I1} w_i x_{ij}^2 - \left(\sum_{i=1}^{I1} w_i x_{ij}\right)^2}{W^2}\right]$$
c) Standard deviation (estimated).
$$s_j = \sqrt{s_j^2}$$
d) Coefficient of variability (C. Var.).
$$C_j = \frac{s_j}{\bar{x}_j}$$
e) Total (sum for $x_j$).
$$\text{Total}_j = \sum_{i=1}^{I1} w_i x_{ij}$$
f) Skewness.
$$g1_j = \frac{m3_j}{s_j^2 \sqrt{s_j^2}} \quad \text{where} \quad m3_j = \frac{\sum_{i=1}^{I1} w_i (x_{ij} - \bar{x}_j)^3}{W}$$
g) Kurtosis.
$$g2_j = \frac{m4_j}{(s_j^2)^2} - 3 \quad \text{where} \quad m4_j = \frac{\sum_{i=1}^{I1} w_i (x_{ij} - \bar{x}_j)^4}{W}$$
h) Weighted N. Number of principal cases if the weight is not specified, or weighted number of principal
cases (sum of weights).
46.2 Input Data
The data are printed for both principal and supplementary cases.
The first column of the table contains the values of the case ID variable (up to 4 digits). The second column
(Coef) contains the value of the weight assigned to each case ($w_i$). The third column (PI) is equal to the
weighted sum of principal variables values, for each case (weighted row totals).
$$P_i = \sum_{j=1}^{J1} w_i x_{ij}$$
The first line contains the first four characters of each variable name. The second line (PJ) is equal to the
weighted sum of principal cases values, for each variable (weighted column totals).
$$P_j = \sum_{i=1}^{I1} w_i x_{ij}$$
Note that the value of the Coef at the beginning of this line is equal to the weighted number of principal
cases, and the value of PI is equal to the overall Total (P) of the principal variables for the principal cases.
$$P = \sum_{i=1}^{I1} P_i = \sum_{j=1}^{J1} P_j = \sum_{i=1}^{I1} \sum_{j=1}^{J1} w_i x_{ij}$$
The rest of the input data table contains the values (with one decimal point) of principal and supplementary
variables.
46.3 Core Matrices (Matrices of Relations)
For each type of analysis, a core matrix is calculated and printed. This is a matrix of relationships between
variables. Note that for the printout, the values in the matrix are multiplied by a factor the value of which is
printed next to the matrix title. This factor is set to zero when some values in the matrix exceed 5 characters
(it may be the case of scalar products or covariances matrices).
For the analysis of correspondences, the elements $C_{jj'}$ of the core matrix are calculated as follows:
$$C_{jj'} = \frac{1}{\sqrt{P_j} \sqrt{P_{j'}}} \sum_{i=1}^{I1} \frac{(w_i x_{ij})(w_i x_{ij'})}{P_i}$$
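For the analysis of scalar products, the elements $SP_{jj'}$ of the core matrix are calculated as follows:
$$SP_{jj'} = \sum_{i=1}^{I1} w_i x_{ij} x_{ij'}$$
For the analysis of normed scalar products, the elements $NSP_{jj'}$ of the core matrix are calculated as
follows:
$$NSP_{jj'} = \frac{\sum_{i=1}^{I1} w_i x_{ij} x_{ij'}}{\sqrt{\left(\sum_{i=1}^{I1} w_i x_{ij}^2\right) \left(\sum_{i=1}^{I1} w_i x_{ij'}^2\right)}}$$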
For the analysis of covariances, the elements $COV_{jj'}$ of the core matrix are calculated as follows:
$$COV_{jj'} = \frac{\sum_{i=1}^{I1} w_i (x_{ij} - \bar{x}_j)(x_{ij'} - \bar{x}_{j'})}{W}$$
For the analysis of correlations, the elements $COR_{jj'}$ of the core matrix are calculated as follows:
$$COR_{jj'} = \frac{\sum_{i=1}^{I1} w_i (x_{ij} - \bar{x}_j)(x_{ij'} - \bar{x}_{j'})}{\sqrt{\sum_{i=1}^{I1} w_i (x_{ij} - \bar{x}_j)^2 \, \sum_{i=1}^{I1} w_i (x_{ij'} - \bar{x}_{j'})^2}}$$
46.4 Trace
Trace of the core matrix is calculated as a sum of its diagonal elements. Trace is also equal to the total
of eigenvalues (total inertia). Note that for the analysis of correlations and the analysis of normed scalar
products the total inertia is equal to the number of principal variables.
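$$\text{Trace} = \sum_{\alpha=1}^{J1} \lambda_\alpha$$
where $\lambda_\alpha$ is the eigenvalue corresponding to the factor $\alpha$.
$$\tau_\alpha = \frac{\lambda_\alpha}{\text{Trace}} \times 100$$
e) Cumul (cumulative percent). Contribution of the factors 1 through $\alpha$ to the total inertia (in terms
of percentages).
$$\text{Cumul}_\alpha = \tau_1 + \tau_2 + \dots + \tau_\alpha$$
$$QLT_j = \sum_{\alpha=1}^{m} COS2_{\alpha j}$$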
c) WEIG. Weight value of the variable. For all types of analysis, it is calculated as a ratio between
the total of the variable and the overall Total (see section 2 above), multiplied by 1000.
$$f_j = \frac{P_j}{P} \times 1000$$
Note that the weight (WEIG) printed in the last line of the table is equal to:
- the overall Total for the correspondence analysis,
- the weighted number of cases for other types of analysis.
d) INR. Inertia corresponding to the variable. It indicates the part of the total inertia related to the
variable in the space of factors.
For the analysis of correspondences, it is calculated as a ratio between the inertia of the variable
and the total inertia, multiplied by 1000. Note that the inertia of the variable depends on the variable
weight and that the Trace value used here does not include the trivial eigenvalue.
$$INR_j = \frac{f_j \sum_{\alpha=1}^{J1-1} F_{\alpha j}^2}{\text{Trace}} \times 1000$$
where $F_{\alpha j}$ is the ordinate of the variable j corresponding to the factor $\alpha$ (see 7.e below).
For the analysis of scalar products and the analysis of covariances, the inertia of the variable
does not depend on the variable weight.
$$INR_j = \frac{\sum_{\alpha=1}^{J1} F_{\alpha j}^2}{\text{Trace}} \times 1000$$
For the analysis of normed scalar products and the analysis of correlations, the inertia
of the variable depends only on the number of principal variables.
$$INR_j = \frac{1}{J1} \times 1000$$
Note that the inertia (INR) printed in the last line of the table is equal to 1000.
The three following columns are repeated for each factor.
e) #F. The ordinate of the variable in the factor space, denoted here by $F_{\alpha j}$.
f) COS2. Squared cosine of the angle between the variable and the factor. It is a measure of distance
between the variable and the factor. Values closer to 1 indicate shorter distances from the factor.
For the analysis of correspondences, it is calculated as follows:
$$COS2_{\alpha j} = \frac{F_{\alpha j}^2}{\sum_{\beta=1}^{J1-1} F_{\beta j}^2} \times 1000$$
For the analysis of scalar products and the analysis of covariances,
$$COS2_{\alpha j} = \frac{F_{\alpha j}^2}{\sum_{\beta=1}^{J1} F_{\beta j}^2} \times 1000$$
For the analysis of normed scalar products and the analysis of correlations,
$$COS2_{\alpha j} = F_{\alpha j}^2 \times 1000$$
g) CPF. Contribution of the variable to the factor.
For the analysis of correspondences,
$$CPF_{\alpha j} = \frac{f_j F_{\alpha j}^2}{\lambda_\alpha} \times 1000$$
For all the other types of analysis,
$$CPF_{\alpha j} = \frac{F_{\alpha j}^2}{\lambda_\alpha} \times 1000$$
Note that the contribution (CPF) printed in the last line of the table is equal to 1000.
46.8 Table of Supplementary Variables Factors
The table contains the same information as the one described under point 7. above, but for the supplementary
variables.
a) JSUP. Variable number for the supplementary variables.
b) QLT. Quality of representation of the variable in the space of m factors (see 7.b above).
c) WEIG. Weight value of the variable (see 7.c above).
d) INR. Inertia corresponding to the variable. Note that the supplementary variables do not contribute
to the total inertia. Thus, the inertia here indicates whether the variable could play any role in the
analysis if it would be used as a principal one. It is calculated in the same way as for the principal
variables in respective analyses (see 7.d above).
The inertia (INR) printed in the last line of the table is equal to the total INR over all the supplementary
variables.
The three following columns are repeated for each factor.
e) #F. The ordinate of the variable in the factor space, denoted here by $F_{\alpha j}$.
f ) COS2. Squared cosine of the angle between the variable and the factor. It is calculated in the same
way as for the principal variables in respective analyses (see 7.f above).
g) CPF. Contribution of the variable to the factor. Note that the supplementary variables do not
participate in the construction of the factor space. Thus, the contribution only indicates whether the
variable could play any role in the analysis if it would be used as a principal one. CPF is calculated in
the same way as for the principal variables in respective analyses (see 7.g above).
The contribution (CPF) printed in the last line of the table is equal to the total CPF over all the
supplementary variables.
46.9 Table of Principal Cases Factors
The table contains the ordinates of the principal cases in the factorial space, their squared cosines with each
factor and their contributions to each factor. In addition, it contains the quality of representation of these
cases, their weights and their inertia.
a) IPR. Case ID value for the principal cases.
b) QLT. Quality of representation of the case in the space of m factors is measured, for all types of
analysis, by the sum of the squared cosines (see 9.f below). Values closer to 1 indicate higher level of
representation of the case by the factors.
$$QLT_i = \sum_{\alpha=1}^{m} COS2_{\alpha i}$$
c) WEIG. Weight value of the case.
For the analysis of correspondences, it is calculated as a ratio between the (weighted) sum of
principal variables for this case and the overall Total (see section 2 above), multiplied by 1000.
$$f_i = \frac{P_i}{P} \times 1000$$
Note that the weight (WEIG) printed in the last line of the table is equal to the overall Total.
For all other types of analysis,
$$f_i = \frac{w_i}{P} \times 1000$$
Note that the weight (WEIG) printed in the last line of the table is equal to the weighted number of
cases.
d) INR. Inertia corresponding to the case. It indicates the part of the total inertia related to the case in
the space of factors.
For the analysis of correspondences, it is calculated as a ratio between the inertia of the case
and the total inertia, multiplied by 1000. Note that the inertia of the case depends on the case weight
and that the Trace value used here does not include the trivial eigenvalue.
$$INR_i = \frac{f_i \sum_{\alpha=1}^{J1-1} F_{\alpha i}^2}{\text{Trace}} \times 1000$$
For all other types of analysis,
$$INR_i = \left(\frac{w_i}{W \cdot \text{Trace}} \sum_{j=1}^{J1} z_{ij}^2\right) \times 1000$$
where
$$z_{ij} = \begin{cases}
x_{ij} & \text{for analysis of scalar products} \\
x_{ij} \Big/ \sqrt{\left(\sum_{i=1}^{I1} w_i x_{ij}^2\right) / W} & \text{for analysis of normed scalar products} \\
x_{ij} - \bar{x}_j & \text{for analysis of covariances} \\
(x_{ij} - \bar{x}_j) / s_j & \text{for analysis of correlations}
\end{cases}$$
and $s_j$ is the sample standard deviation of the variable j.
Note that the inertia (INR) printed in the last line of the table is equal to 1000.
The three following columns are repeated for each factor.
e) #F. The ordinate of the case in the factor space, denoted here by $F_{\alpha i}$.
f) COS2. Squared cosine of the angle between the case and the factor. It is a measure of distance
between the case and the factor. Values closer to 1 indicate shorter distances from the factor.
For the analysis of correspondences, it is calculated as follows:
$$COS2_{\alpha i} = \frac{F_{\alpha i}^2}{\sum_{\beta=1}^{J1-1} F_{\beta i}^2} \times 1000$$
For all other types of analysis,
$$COS2_{\alpha i} = \frac{F_{\alpha i}^2}{\sum_{\beta=1}^{J1} F_{\beta i}^2} \times 1000$$
g) CPF. Contribution of the case to the factor.
For the analysis of correspondences,
$$CPF_{\alpha i} = \frac{f_i F_{\alpha i}^2}{\lambda_\alpha} \times 1000$$
For all other types of analysis,
$$CPF_{\alpha i} = \frac{w_i F_{\alpha i}^2}{W \lambda_\alpha} \times 1000$$
Note that the contribution (CPF) printed in the last line of the table is equal to 1000.
46.10 Table of Supplementary Cases Factors
The table contains the same information as the one described under point 9. above, but for the
supplementary cases.
a) ISUP. Case ID value for the supplementary cases.
b) QLT. Quality of representation of the case in the space of m factors (see 9.b above).
c) WEIG. Weight value of the case (see 9.c above).
d) INR. Inertia corresponding to the case. Note that the supplementary cases do not contribute to the
total inertia. Thus, the inertia here indicates whether the case could play any role in the analysis if it
would be used as a principal one. It is calculated the same way as for the principal cases in respective
analyses (see 9.d above).
The inertia (INR) printed in the last line of the table is equal to the total INR over all the supplementary
cases.
The three following columns are repeated for each factor.
e) #F. The ordinate of the case in the factor space, denoted here by $F_{\alpha i}$.
f ) COS2. Squared cosine of the angle between the case and the factor. It is calculated the same way as
for the principal cases in respective analyses (see 9.f above).
g) CPF. Contribution of the case to the factor. Note that the supplementary cases do not participate
in the construction of the factor space. Thus, the contribution only indicates whether the case could
play any role in the analysis if it would be used as a principal one. CPF is calculated the same way as
for the principal cases in respective analyses (see 9.g above).
The contribution (CPF) printed in the last line of the table is equal to the total CPF over all the
supplementary cases.
46.11 Rotated Factors
Applied only for correlation analysis. The variable factors can be rotated once the factor analysis is
completed. The Varimax procedure used here is the same as the one used in the CONFIG program. Note that
the variable factors for principal variables may be treated as a configuration of J1 objects in
m-dimensional space.
46.12 References
Benzécri, J.-P. and F., Pratique de l'analyse de données, tome 1: Analyse des correspondances, exposé
élémentaire, Dunod, Paris, 1984.
Iagolnitzer, E.R., Présentation des programmes MLIFxx d'analyses factorielles en composantes principales,
Informatique et sciences humaines, 26, 1975.
Chapter 47
Linear Regression
Notation
y = value of the dependent variable
x = value of an independent (explanatory) variable
i, j, l, m = subscripts for variables
p = number of predictors
k = subscript for case
N = total number of cases
w = value of the weight multiplied by $N / W$
W = total sum of weights.
47.1 Univariate Statistics
These weighted statistics are calculated for all variables used in the analysis, i.e. dummy variables,
independent variables and the dependent variable.
a) Average.
$$\bar{x}_i = \frac{\sum_{k} w_k x_{ik}}{N}$$
b) Standard deviation (estimated).
$$s_i = \sqrt{\frac{N \sum_{k} w_k x_{ik}^2 - \left(\sum_{k} w_k x_{ik}\right)^2}{N (N - 1)}}$$
c) Coefficient of variation (C.var.).
$$C_i = \frac{100 \, s_i}{\bar{x}_i}$$
47.2 Matrix of Total Sums of Squares and Cross-products
It is calculated for all variables used in the analysis as follows:
$$\text{t.s.s.c.p.}_{ij} = \sum_{k} w_k x_{ik} x_{jk}$$
47.3 Matrix of Residual Sums of Squares and Cross-products
This matrix, sometimes called a matrix of squares and cross-products of deviation scores, is calculated for
all variables used in the analysis as follows:
$$\text{r.s.s.c.p.}_{ij} = \sum_{k} w_k x_{ik} x_{jk} - \frac{\left(\sum_{k} w_k x_{ik}\right) \left(\sum_{k} w_k x_{jk}\right)}{N}$$
47.4 Total Correlation Matrix
The elements of this matrix are calculated directly from the matrix of residual sums of squares and
cross-products. Note that if this formula is written out in detail, and the numerator and denominator are
both multiplied by N, it is a conventional formula for Pearson's r.
$$r_{ij} = \frac{\text{r.s.s.c.p.}_{ij}}{\sqrt{\text{r.s.s.c.p.}_{ii} \; \text{r.s.s.c.p.}_{jj}}}$$
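The chain from weights to correlation can be followed in a short Python sketch: the raw weights are rescaled by N/W so that they sum to N, the residual sum of squares and cross-products is formed, and the correlation is obtained from it. The function names are hypothetical and the data are illustrative.

```python
import math


def rescale_weights(raw_weights):
    """Multiply each weight by N / W, so that the rescaled weights sum to N."""
    n, total = len(raw_weights), sum(raw_weights)
    return [r * n / total for r in raw_weights]


def rsscp(x, y, w):
    """Residual sum of squares and cross-products of x and y (rescaled weights w)."""
    n = len(x)
    sum_xy = sum(wk * xk * yk for wk, xk, yk in zip(w, x, y))
    sum_x = sum(wk * xk for wk, xk in zip(w, x))
    sum_y = sum(wk * yk for wk, yk in zip(w, y))
    return sum_xy - sum_x * sum_y / n


def correlation(x, y, w):
    """Total correlation computed from the r.s.s.c.p. values."""
    return rsscp(x, y, w) / math.sqrt(rsscp(x, x, w) * rsscp(y, y, w))


w = rescale_weights([2.0, 2.0, 2.0, 2.0])
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
r_xy = correlation(x, y, w)
r_neg = correlation(x, [4.0, 3.0, 2.0, 1.0], w)
```

With equal weights the result is the ordinary Pearson coefficient, so a perfectly linear increasing relationship gives 1 and a perfectly linear decreasing one gives -1.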
47.5 Partial Correlation Matrix
The ij-th element of this matrix is the partial correlation coefficient between variable i and variable j, holding
constant specified variables. Partial correlations describe the degree of correlation that would exist between
two variables provided that variation in one or more other variables is controlled. They also describe the
correlation between independent (explanatory) variables which would be selected in a stepwise regression.
a) Correlation between $x_i$ and $x_j$ holding constant $x_l$ (first-order partial correlation coefficients).
$$r_{ij \cdot l} = \frac{r_{ij} - r_{il}\, r_{jl}}{\sqrt{1 - r_{il}^2}\,\sqrt{1 - r_{jl}^2}}$$
where $r_{ij}$, $r_{il}$, $r_{jl}$ are zero-order coefficients (Pearson's r coefficients).
b) Correlation between $x_i$ and $x_j$ holding constant $x_l$ and $x_m$ (second-order partial correlation
coefficients).
$$r_{ij \cdot lm} = \frac{r_{ij \cdot l} - r_{im \cdot l}\, r_{jm \cdot l}}{\sqrt{1 - r_{im \cdot l}^2}\,\sqrt{1 - r_{jm \cdot l}^2}}$$
where $r_{ij \cdot l}$, $r_{im \cdot l}$, $r_{jm \cdot l}$ are first-order coefficients.
Note: The program computes the partial correlations by working up step by step from zero-order
coefficients to first order, to second order, etc.
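The step-by-step build-up from zero-order to higher-order coefficients lends itself to a short recursive sketch. The function below is hypothetical (it is not the program's routine) and assumes a symmetric matrix of zero-order correlations indexed from 0.

```python
import math

def partial_corr(r, i, j, controls):
    """Partial correlation of variables i and j holding `controls` constant.

    r is a symmetric list-of-lists of zero-order (Pearson) correlations.
    Each recursion level removes one control variable, mirroring the
    zero-order -> first-order -> second-order progression.
    """
    if not controls:
        return r[i][j]
    *rest, m = controls
    r_ij = partial_corr(r, i, j, rest)
    r_im = partial_corr(r, i, m, rest)
    r_jm = partial_corr(r, j, m, rest)
    return (r_ij - r_im * r_jm) / (math.sqrt(1 - r_im ** 2) * math.sqrt(1 - r_jm ** 2))

# Hypothetical zero-order correlations among three variables.
r = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.4],
     [0.3, 0.4, 1.0]]
r_01_2 = partial_corr(r, 0, 1, [2])   # first-order coefficient
```

Passing two control variables, for example `partial_corr(r, i, j, [l, m])`, yields the second-order coefficient.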
47.6 Inverse Matrix
For a standard regression, this is the inverse of the correlation matrix of the independent (explanatory)
variables and the dependent variable. For a stepwise regression, this is the inverse of the correlation matrix
of the independent variables in the final equation. The program uses the Gaussian elimination method for
inverting.
47.7 Analysis Summary Statistics
a) Standard error of estimate. This is the standard deviation of the residuals.
$$\text{Standard error of estimate} = \sqrt{\frac{\sum_k (y_k - \hat{y}_k)^2}{df}}$$
where
$\hat{y}_k$ = the predicted value of the dependent variable for the k-th case
df = residual degrees of freedom (see 7.f below).
b) F-ratio for the regression. This is the F statistic for determining the statistical significance of the
model under consideration. The degrees of freedom are $p$ and $N - p - 1$.
$$F = \frac{R^2\, df}{p\,(1 - R^2)}$$
where $R^2$ is the fraction of explained variance (see 7.d below).
c) Multiple correlation coefficient. This is the correlation between the dependent variable and the
predicted score. It indicates the strength of relationship between the criterion and the linear function
of the predictors, and is similar to a simple Pearson correlation coefficient except that it is always
positive.
$$R = \sqrt{R^2}$$
R is not printed if the constant term is constrained to be zero.
d) Fraction of explained variance. $R^2$ can be interpreted as the proportion of variation in the
dependent variable explained by the predictors. Sometimes called the coefficient of determination, it
is a measure of the overall effectiveness of the linear regression. The larger it is, the better the fitted
equation explains the variation in the data.
$$R^2 = 1 - \frac{\sum_k (y_k - \hat{y}_k)^2}{\sum_k (y_k - \bar{y})^2}$$
where
$\hat{y}_k$ = the predicted value of the dependent variable for the k-th case
$\bar{y}$ = the mean of the dependent variable.
Like R, $R^2$ is not printed if the constant term is constrained to be zero.
e) Determinant of the correlation matrix. This is the determinant of the correlation matrix of
the predictors. It represents as a single number the generalized variance in a set of variables, and
varies from 0 to 1. Determinants near zero indicate that some or all explanatory variables are highly
correlated. A zero determinant indicates a singular matrix, which means that at least one of the
predictors is a linear function of one or more others.
f) Residual degrees of freedom.
If the constant is not constrained to be zero,
$$df = N - p - 1$$
If the constant is constrained to be zero,
$$df = N - p$$
g) Constant term.
$$A = \bar{y} - \sum_i B_i \bar{x}_i$$
where
$\bar{y}$ = the average of the dependent variable (see 1.a above)
$\bar{x}_i$ = the average of the predictor variable i (see 1.a above)
$B_i$ = the B coefficient for the predictor variable i (see 8.a below).
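The summary statistics of items a), b), d) and f) can be illustrated with a small sketch. It assumes unit weights and an estimated constant term (so df = N - p - 1); the function name and the data are hypothetical and are not taken from the program.

```python
import math

def summary_statistics(y, y_hat, p):
    """Standard error of estimate, R squared and F-ratio for a fitted model.

    Assumes unit weights and an estimated (non-zero) constant term.
    """
    n = len(y)
    df = n - p - 1                                      # residual degrees of freedom
    y_bar = sum(y) / n
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # unexplained variation
    tss = sum((a - y_bar) ** 2 for a in y)              # total variation
    r2 = 1 - rss / tss
    return {
        "df": df,
        "see": math.sqrt(rss / df),
        "r2": r2,
        "r": math.sqrt(r2),
        "f": r2 * df / (p * (1 - r2)),
    }

# Hypothetical observed and predicted values for a one-predictor model.
stats = summary_statistics([1.0, 2.0, 3.0, 4.0, 5.0], [1.2, 1.9, 3.0, 4.1, 4.8], p=1)
```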
47.8 Analysis Statistics for Predictors
a) B. These are unstandardized partial regression coefficients which are appropriate (rather than the
betas) to be used in an equation to predict raw scores. They are sensitive to the scale of measurement
of the predictor variable and to the variance of the predictor variable.
$$B_i = \frac{\beta_i\, s_y}{s_i}$$
where
$\beta_i$ = the beta weight for predictor i (see 8.c below)
$s_y$ = the standard deviation of the dependent variable (see 1.b above)
$s_i$ = the standard deviation of the predictor variable i (see 1.b above).
b) Sigma B. This is the standard error of B, a measure of the reliability of the coefficient.
$$\text{Sigma } B_i = (\text{standard error of estimate}) \sqrt{\frac{c_{ii}}{\text{r.s.s.c.p.}_{ii}}}$$
where $c_{ii}$ is the i-th diagonal element of the inverse of the correlation matrix of predictors in the
regression equation (see section 6 above).
c) Beta. These regression coefficients are also called standardized partial regression coefficients or
standardized B coefficients. They are independent of the scale of measurement. The magnitudes of
the squares of the betas indicate the relative contributions of the variables to the prediction.
$$\beta_i = R_{11}^{-1} R_{yi}$$
where
$R_{11}$ = correlation matrix of predictors in the equation
$R_{yi}$ = column vector of correlations of the dependent variable and predictors
indicated by the predictor i.
d) Sigma Beta. This is the standard error of the beta coefficient, a measure of the reliability of the
coefficient.
$$\text{Sigma } \beta_i = \text{sigma } B_i \,\frac{s_i}{s_y}$$
e) Partial r squared. These are partial correlations, squared, between predictor i and the dependent
variable, y, with the influence of the other variables in the regression equation eliminated. The partial
correlation coefficient squared is a measure of the extent to which that part of the variation in the
dependent variable which is not explained by the other predictors is explained by predictor i.
$$r^2_{yi \cdot jl\ldots} = \frac{R^2_{y \cdot ijl\ldots} - R^2_{y \cdot jl\ldots}}{1 - R^2_{y \cdot jl\ldots}}$$
where
$R^2_{y \cdot ijl\ldots}$ = multiple R squared with predictor i
$R^2_{y \cdot jl\ldots}$ = multiple R squared without predictor i.
f) Marginal r squared. This is the increase in variance explained by adding predictor i to the other
predictors in the regression equation.
$$\text{marginal } r_i^2 = R^2_{y \cdot ijl\ldots} - R^2_{y \cdot jl\ldots}$$
g) The t-ratio. It can be used to test the hypothesis that $\beta$, or B, is equal to zero; that is, that predictor
i has no linear influence on the dependent variable. Its significance can be determined from the table
of t, with $N - p - 1$ degrees of freedom.
$$t = \frac{\beta_i}{\text{sigma } \beta_i} = \frac{B_i}{\text{sigma } B_i}$$
47.9 Residuals
$$\frac{\sum_{k=2}^{N} (e_k - e_{k-1})^2}{\sum_{k=1}^{N} e_k^2}$$
47.10 Note on Stepwise Regression
Stepwise regression introduces the predictors step by step into the model, starting with the independent
variable most highly correlated with y. After the first step, the algorithm selects from the remaining independent
variables the one which yields the largest reduction in the residual (unexplained) variance of the
dependent variable, i.e. the variable whose partial correlation with y is the highest. The program then does
a partial F-test for entrance to see if the variable will take up a significant amount of variation over that
removed by variables already in the regression. The user can specify a minimum F-value for the inclusion
of any variable; the program evaluates whether or not the F-value obtained at a given step satisfies the
minimum, and if it does, enters the variable. Similarly, the program decides at each step whether or not
each previously-included variable still satisfies a minimum (also provided by the user), and if not, removes
it.
$$\text{Partial F-value for variable } i = \frac{\left(R^2_{y \cdot Pi} - R^2_{y \cdot P}\right)(df)}{1 - R^2_{y \cdot Pi}}$$
where
$R^2_{y \cdot Pi}$ = multiple R squared for the set of predictors (P) already in the
regression, with predictor i
$R^2_{y \cdot P}$ = multiple R squared for the set of predictors (P) already in the
regression
df = residual degrees of freedom.
At any step in the procedure, the results are the same as they would be for a standard regression using
the particular set of variables; thus, the final step of a stepwise regression shows the same coefficients as a
normal execution using the variables that survived the stepwise procedure.
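As a rough illustration of the entry test, the partial F-value and its comparison with a user-supplied minimum might look like the sketch below. The names `partial_f`, `enters` and `f_min` are hypothetical and do not correspond to REGRESSN parameters.

```python
def partial_f(r2_with, r2_without, df):
    """Partial F-value for adding predictor i to the set P.

    r2_with    : multiple R squared for P together with predictor i
    r2_without : multiple R squared for P alone
    df         : residual degrees of freedom
    """
    return (r2_with - r2_without) * df / (1 - r2_with)

def enters(r2_with, r2_without, df, f_min):
    """True when the candidate reaches the minimum F-value for inclusion."""
    return partial_f(r2_with, r2_without, df) >= f_min

# Hypothetical values: R squared rises from 0.5 to 0.6 with 20 residual df.
f_value = partial_f(0.6, 0.5, 20)
```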
47.11 Note on Descending Regression
Descending regression is like the stepwise regression, except that the algorithm starts with all the independent
variables and then drops and adds back variables in a stepwise manner.
47.12 Note on Regression with Zero Intercept
It is possible when using the REGRESSN program to request a zero regression intercept, i.e. that the
dependent variable is zero when all the independent variables are zero.
If a regression through the origin is specified, all statistics except those described in sections 1 through 4
above are based on a mean of zero. The multiple correlation coefficient and fraction of explained variance
(items 7.c and 7.d) are not printed at all. Statistics which are not centered about the mean can be very
different from what they would be if they were centered; thus, in a stepwise solution, variables may very well
enter the equation in a different order than they would if a constant were estimated.
In the REGRESSN program a matrix with elements
$$a_{ij} = \frac{\sum_k w_k x_{ik} x_{jk}}{\sqrt{\sum_k w_k x_{ik}^2}\,\sqrt{\sum_k w_k x_{jk}^2}}$$
is analyzed rather than R, the correlation matrix.
The B's, the unstandardized partial regression coefficients, are obtained by
$$B_i = \beta_i \sqrt{\frac{\sum_k w_k x_{ik}^2}{\sum_k w_k x_{jk}^2}}$$
Chapter 48
Multidimensional Scaling
Notation
x = element of the configuration
i, j, l, m = subscripts for variables
n = number of variables
s = subscript for dimension
t = number of dimensions.
48.1 Order of Computations
For a given number of dimensions, t, MDSCAL finds the configuration of minimum stress by using an iterative
procedure. The program starts with an initial configuration (provided by the user or by the program) and
keeps modifying it until it converges to the configuration having minimum stress.
48.2 Initial Configuration
If the user does not supply a starting configuration the program generates an arbitrary configuration by
taking the first n points from the following list (each expression in parentheses represents a point):
(1, 0, 0, . . . , 0),
(0, 2, 0, . . . , 0),
(0, 0, 3, . . . , 0),
.
.
.
(0, 0, 0, . . . , t),
(t + 1, 0, 0, . . . , 0),
(0, t + 2, 0, . . . , 0),
.
.
.
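A minimal sketch of how such a starting configuration could be generated, assuming the pattern of the list continues so that the k-th point carries the value k in coordinate (k - 1) mod t and zeros elsewhere. The function name is hypothetical and not part of MDSCAL.

```python
def initial_configuration(n, t):
    """First n points of the arbitrary starting list, in t dimensions."""
    config = []
    for k in range(1, n + 1):
        point = [0.0] * t
        point[(k - 1) % t] = float(k)   # value k on a cycling coordinate
        config.append(point)
    return config

# Four variables in three dimensions: the fourth point wraps to dimension 1.
start = initial_configuration(4, 3)
```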
48.3 Centering and Normalization of the Configuration
At the start of each iteration the configuration is centered and normalized.
If $x_{is}$ denotes the element in the i-th line and s-th column of the configuration, then
$$\text{Centered } x_{is} = x_{is} - \bar{x}_s$$
$$\text{Normalized } x_{is} = \frac{x_{is} - \bar{x}_s}{\text{n.f.}}$$
where
$$\bar{x}_s = \frac{\sum_i x_{is}}{n}$$
is the mean of dimension s and
$$\text{n.f.} = \sqrt{\frac{1}{n} \sum_i \sum_s x_{is}^2}$$
is the normalization factor.
Note that the total sum of squares of the elements of the normalized centered configuration is equal to n,
the number of variables.
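The centering and normalization step can be sketched as below. The sketch assumes that the normalization factor is the root mean square (over the n points) of the centered coordinates, which is the choice that makes the total sum of squares equal to n as the note above requires. The function name is hypothetical.

```python
import math

def center_and_normalize(config):
    """Center each dimension on zero, then rescale so the total sum of squares is n."""
    n = len(config)
    t = len(config[0])
    means = [sum(row[s] for row in config) / n for s in range(t)]
    centered = [[row[s] - means[s] for s in range(t)] for row in config]
    # Assumed normalization factor: sqrt of (sum of squared centered elements / n).
    nf = math.sqrt(sum(v * v for row in centered for v in row) / n)
    return [[v / nf for v in row] for row in centered]

# Hypothetical configuration of three variables in two dimensions.
cfg = center_and_normalize([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
```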
48.4 History of Computation
At the conclusion of each iteration, items 4.a through 4.h below are printed. This creates a history which, in
general, is of interest only when it is feared that convergence is not complete. However, at the end of history
the reason for stopping is printed. If the program does not stop because a minimum has been reached, it
may nonetheless be true that the solution reached is practically indistinguishable from the minimum that
would be reached after a few more iterations - in particular, if the stress is very small, this is generally the
case.
a) Stress. The measure of stress serves two functions. First, it is a measure of how well the derived
configuration matches the input data. Second, it is used in deciding how points should be moved on
the next iteration. There are two available formulas for calculating stress: SQDIST and SQDEV.
$$\text{Stress SQDIST} = \sqrt{\frac{\sum_i \sum_j (d_{ij} - \hat{d}_{ij})^2}{\sum_i \sum_j d_{ij}^2}}$$
$$\text{Stress SQDEV} = \sqrt{\frac{\sum_i \sum_j (d_{ij} - \hat{d}_{ij})^2}{\sum_i \sum_j (d_{ij} - \bar{d})^2}}$$
where
$d_{ij}$ = distance between variables i and j in the configuration (see 8.c below)
$\hat{d}_{ij}$ = those numbers which minimize the stress, subject to the constraint that
the $\hat{d}_{ij}$ have the same rank order as the input data (see 8.d below)
$\bar{d}$ = the mean of all the $d_{ij}$'s.
b) SRAT. Stress ratio. The user can stop the scaling procedure by specifying the stress ratio to be
reached. For the first iteration (numbered 0) its value is set to 0.800.
$$\text{SRAT} = \frac{\text{Stress}_{\text{present}}}{\text{Stress}_{\text{previous}}}$$
c) SRATAV. Average stress ratio. For the first iteration its value is equal to 0.800.
$$\text{SRATAV}_{\text{present}} = (\text{SRAT}_{\text{present}})^{0.33334}\,(\text{SRATAV}_{\text{previous}})^{0.66666}$$
d) CAGRGL. This is the cosine of the angle between the current gradient and the previous gradient.
$$\text{CAGRGL} = \cos\theta = \frac{\sum_i \sum_s g_{is}\, g'_{is}}{\sqrt{\sum_i \sum_s g_{is}^2}\,\sqrt{\sum_i \sum_s (g'_{is})^2}}$$
where
$g$ = present gradient
$g'$ = previous gradient.
The initial gradient is set to a constant:
$$\text{Initial } g_{is} = \sqrt{\frac{1}{t}}$$
e) COSAV. Average cosine of the angle between successive gradients. This is a weighted average. For
the first iteration, its value is set to 0.
$$\text{COSAV}_{\text{present}} = \text{CAGRGL}_{\text{present}} \times \text{COSAVW} + \text{COSAV}_{\text{previous}} \times (1.0 - \text{COSAVW})$$
where COSAVW is a weighting factor under the control of the user.
f) ACSAV. Average absolute value of the cosine of the angle between successive gradients. This is a
weighted average. For the first iteration, its value is set to 0.
$$\text{ACSAV}_{\text{present}} = |\text{CAGRGL}_{\text{present}}| \times \text{ACSAVW} + \text{ACSAV}_{\text{previous}} \times (1.0 - \text{ACSAVW})$$
where ACSAVW is a weighting factor under the control of the user.
g) SFGR. Scale factor of the gradient. As the computation proceeds, the scale factor of successive
gradients decreases. One way that the scaling procedure can stop is by reaching a user-supplied
minimum value of the scale factor of the gradient.
$$\text{SFGR} = \sqrt{\frac{1}{n} \sum_i \sum_s g_{is}^2}$$
where g is the present gradient.
h) STEP. Step size. In the step size formula, the two main determinants of the new step size are the
previous step size and angle factor. The step sizes used do not affect the final solution but they do
affect the number of iterations required to reach a solution.
$$\text{STEP}_{\text{present}} = \text{STEP}_{\text{previous}} \times \text{angle factor} \times \text{relaxation factor} \times \text{good luck factor}$$
where
$$\text{angle factor} = 4.0^{\text{COSAV}}$$
$$\text{relaxation (or bias) factor} = \frac{1.4}{A\,B}$$
$$A = 1 + (\min(1, \text{SRATAV}))^5$$
$$B = 1 + \text{ACSAV} - |\text{COSAV}|$$
$$\text{good luck factor} = \sqrt{\min(1, \text{SRAT})}$$
The first step size is computed as follows:
$$\text{STEP} = 50 \times \text{Stress} \times \text{SFGR}$$
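The step size update in item h) combines three factors. The sketch below follows the formulas as they can be read from the text (angle factor 4.0 raised to COSAV, relaxation factor 1.4/(AB), good luck factor as a square root); it is an illustration under that reading, and the function name is hypothetical.

```python
import math

def next_step(step_previous, srat, sratav, cosav, acsav):
    """Step size for the next iteration from the history quantities."""
    angle_factor = 4.0 ** cosav
    a = 1.0 + min(1.0, sratav) ** 5
    b = 1.0 + acsav - abs(cosav)
    relaxation_factor = 1.4 / (a * b)
    good_luck_factor = math.sqrt(min(1.0, srat))
    return step_previous * angle_factor * relaxation_factor * good_luck_factor

# Hypothetical state: no stress improvement and uncorrelated successive gradients.
step = next_step(1.0, srat=1.0, sratav=1.0, cosav=0.0, acsav=0.0)
```

Under these assumptions a run of gradients pointing the same way (COSAV near 1) lengthens the step, while a stalled stress ratio shortens it through the relaxation factor.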
48.5 Stress for Final Configuration
This is a reiteration of the last value of the Stress column of the history of computation (see 4.a above).
Here the Stress is a measure of how well the final configuration matches the input data.
Interpretation of the stress for the final configuration depends on the formula used in the calculations. Note
that the use of Stress SQDEV yields substantially larger values of stress for the same degree of goodness
of fit.
For the classical mode of using MDSCAL, Kruskal and Carmone give the following table for the usual range
of values of N (say from 10 to 30) and the usual range of dimensionality (say from 2 to 5):
Stress SQDIST Stress SQDEV
Poor 20.0 % 40.0 %
Fair 10.0 % 20.0 %
Good 5.0 % 10.0 %
Excellent 2.5 % 5.0 %
Perfect 0.0 % 0.0 %
48.6 Final Configuration
On each iteration the next configuration is formed by starting from the old configuration and moving along
the (negative) gradient of stress a distance equal to the step size.
$$\text{New configuration} = \text{old configuration} + \frac{\text{STEP}}{\text{SFGR}}\,(\text{gradient})$$
Each row of the final configuration matrix provides the coordinates of one variable of the configuration.
The orientation of the reference axes is arbitrary and thus one should look for rotated or even oblique axes
that may be readily interpretable. If an ordinary Euclidean distance was used, it is possible to rotate the
configuration so that its principal axes coincide with the coordinate axes. The CONFIG program can be
used for this purpose.
48.7 Sorted Configuration
This is the final configuration presented with each dimension sorted - the coordinates are reordered from
smallest to largest.
48.8 Summary
a) IPOINT, JPOINT. These are the variable subscripts (i, j) identifying the pair of variables to which
the three statistics below refer.
b) DATA. For each variable pair, it is the input index of similarity or dissimilarity as provided by the
user in the input data matrix.
c) DIST. This is the distance between points in the final configuration.
For Minkowski r-metric,
$$d_{ij} = \left(\sum_s |x_{is} - x_{js}|^r\right)^{1/r}$$
In the case of r = 2 it becomes an ordinary Euclidean distance
$$d_{ij} = \sqrt{\sum_s (x_{is} - x_{js})^2}$$
In the case of r = 1 it becomes a City block distance
$$d_{ij} = \sum_s |x_{is} - x_{js}|$$
d) DHAT. D-hats are the numbers which minimize the stress, subject to the constraint that the d-hats
have the same rank order as the input data; they are appropriate distances, estimated from the input
data.
They are obtained from
d
ij
=
j
d
ij
and
$$\hat{d}_{ij} \le \hat{d}_{lm} \quad \text{if} \quad p_{ij} \ge p_{lm} \ \text{(similarities)} \quad \text{or} \quad p_{ij} \le p_{lm} \ \text{(dissimilarities)}$$
where
$d_{ij}$ = distance between variables i and j in the configuration
$\hat{d}_{ij}$ = a monotonic transformation of the $p_{ij}$'s
$p_{ij}$ = the input index of similarity or dissimilarity between variables i and j.
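The DIST column can be reproduced for any pair of rows of the configuration with a Minkowski distance routine such as the sketch below (the function name is hypothetical). Setting r = 2 gives the Euclidean distance and r = 1 the City block distance.

```python
def minkowski(x_i, x_j, r=2.0):
    """Minkowski r-metric distance between two points of the configuration."""
    return sum(abs(a - b) ** r for a, b in zip(x_i, x_j)) ** (1.0 / r)

# Hypothetical pair of points in two dimensions.
euclidean = minkowski([0.0, 0.0], [3.0, 4.0], r=2.0)
city_block = minkowski([0.0, 0.0], [3.0, 4.0], r=1.0)
```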
48.9 Note on Ties in the Input Data
Ties in the input data, i.e. identical values in the input data matrix, can be treated in either of two ways -
the choice is up to the user.
The primary approach, DIFFER, treats ties in the input matrix as an indeterminate order relation, which
can be resolved arbitrarily so as to decrease dimensionality or stress.
The secondary approach, EQUAL, treats ties as implying an equivalence relation, which (insofar as possible)
is to be maintained (even if stress is increased).
If there are few ties, it does not make much difference which approach is chosen.
48.10 Note on Weights
The program provides for weighting, but it is not weighting in the usual IDAMS sense.
MDSCAL weighting may be used to assign differing importance to differing data values, that is, to assign
weights to cells of the input data matrix. This sort of weighting can be used, for instance, to accommodate
differing measurement variability among the data values.
If weights are used,
$$\text{Stress SQDIST} = \sqrt{\frac{\sum_i \sum_j w_{ij} (d_{ij} - \hat{d}_{ij})^2}{\sum_i \sum_j w_{ij}\, d_{ij}^2}}$$
$$\text{Stress SQDEV} = \sqrt{\frac{\sum_i \sum_j w_{ij} (d_{ij} - \hat{d}_{ij})^2}{\sum_i \sum_j w_{ij} (d_{ij} - \bar{d})^2}}$$
where
$$\bar{d} = \frac{\sum_i \sum_j w_{ij}\, d_{ij}}{\sum_i \sum_j w_{ij}}$$
and $w_{ij}$ indicates the value in the cell ij of the weight matrix.
48.11 References
Kruskal, J.B., Multidimensional scaling by optimizing goodness of fit to a non-metric hypothesis,
Psychometrika, 3, 1964.
Kruskal, J.B., Nonmetric multidimensional scaling: a numerical method, Psychometrika, 29, 1964.
Chapter 49
Multiple Classification Analysis
Notation
y = value of the dependent variable
w = value of the weight
k = subscript for case
i = subscript for predictor
j = subscript for category within a predictor
p = number of predictors
c = number of non-empty categories across all predictors
$a_{ij}$ = adjusted deviation of the j-th category of predictor i (see 2.c below)
$N_{ij}$ = number of cases in the j-th category of predictor i
N = total number of cases
W = total sum of weights
subscript ijk indicates that the case k belongs to the j-th category of the predictor i.
49.1 Dependent Variable Statistics
a) Mean. Grand mean of y.
$$\bar{y} = \frac{\sum_k w_k y_k}{W}$$
b) Standard deviation of y (estimated).
$$s_y = \sqrt{\left(\frac{N}{N-1}\right)\left(\frac{W \sum_k w_k y_k^2 - \left(\sum_k w_k y_k\right)^2}{W^2}\right)}$$
c) Coefficient of variation.
$$C_y = \frac{100\, s_y}{\bar{y}}$$
d) Sum of y.
$$\text{Sum of } y = \sum_k w_k y_k$$
e) Sum of y squared.
$$\text{Sum of } y^2 = \sum_k w_k y_k^2$$
f) Total sum of squares.
$$\text{TSS} = \sum_k w_k (y_k - \bar{y})^2$$
g) Explained sum of squares.
$$\text{ESS} = \sum_i \sum_j a_{ij} \left(\sum_k w_{ijk}\, y_{ijk}\right)$$
h) Residual sum of squares.
$$\text{RSS} = \text{TSS} - \text{ESS}$$
49.2 Predictor Statistics for Multiple Classification Analysis
a) Class mean. Mean of the dependent variable for cases in the j-th category of predictor i.
$$\bar{y}_{ij} = \frac{\sum_k w_{ijk}\, y_{ijk}}{\sum_k w_{ijk}}$$
b) Unadjusted deviation from grand mean.
$$\text{Unadjusted } a_{ij} = \bar{y}_{ij} - \bar{y}$$
c) Coefficient. Adjusted deviation $a_{ij}$ from grand mean. This is the regression coefficient for each
category of each predictor.
$$\text{Predicted } y_k = \bar{y} + \sum_i a_{ijk}$$
The values of $a_{ij}$ are obtained by an iterative procedure which stops when
$$\sum_k (y_k - \text{predicted } y_k)^2$$
reaches the minimum.
d) Adjusted class mean. This is an estimate of what the mean would have been if the group had been
exactly like the total population in its distribution over all the other predictor classifications. If there
were no correlation among predictors, the adjusted mean would equal the class mean.
$$\text{Adjusted } \bar{y}_{ij} = \bar{y} + a_{ij}$$
e) Standard deviation (estimated) of the dependent variable for the j-th category of the predictor i.
$$s_{ij} = \sqrt{\frac{\sum_k w_{ijk}\, y_{ijk}^2 - \left(\sum_k w_{ijk}\, y_{ijk}\right)^2 / \sum_k w_{ijk}}{\sum_k w_{ijk} - \left(\sum_k w_{ijk} / N_{ij}\right)}}$$
f) Coefficient of variation (C.var.).
$$C_{ij} = \frac{100\, s_{ij}}{\bar{y}_{ij}}$$
g) Unadjusted deviation SS. This is the sum of squares of unadjusted deviations for predictor i.
$$U_i = \sum_j \left(\sum_k w_{ijk}\right)\left(\bar{y}_{ij} - \bar{y}\right)^2$$
h) Adjusted deviation SS. This is the sum of squares of adjusted deviations for predictor i.
$$D_i = \sum_j \left(\sum_k w_{ijk}\right)\left(a_{ij}^2\right)$$
i) Eta squared for predictor i. Eta squared can be interpreted as the percent of variance in the
dependent variable that can be explained by predictor i all by itself.
$$\eta_i^2 = \frac{U_i}{\text{TSS}}$$
j) Eta for predictor i. It indicates the ability of the predictor, using the categories given, to explain
variation in the dependent variable.
$$\eta_i = \sqrt{\eta_i^2}$$
k) Eta squared for predictor i, adjusted for degrees of freedom.
$$\text{Adjusted } \eta_i^2 = 1 - A(1 - \eta_i^2)$$
where A is the adjustment for degrees of freedom (see 3.b below).
l) Eta for predictor i, adjusted.
$$\text{Adjusted } \eta_i = \sqrt{1 - A(1 - \eta_i^2)}$$
m) Beta squared for predictor i. Beta squared is the sum of squares attributable to the predictor,
after holding all other predictors constant, relative to the total sum of squares. This is not in terms
of percent of variance explained.
$$\beta_i^2 = \frac{D_i}{\text{TSS}}$$
n) Beta for predictor i. Beta provides a measure of ability of the predictor to explain variation in the
dependent variable after adjusting for the effect of all other predictors. Beta coefficients indicate the
relative importance of the various predictors (the higher the value the more variation is explained by
the corresponding beta).
$$\beta_i = \sqrt{\beta_i^2}$$
49.3 Analysis Statistics for Multiple Classification Analysis
a) Multiple R squared unadjusted. This is the multiple correlation coefficient squared. It indicates
the actual proportion of variance explained by the predictors used in the analysis.
$$R^2 = \frac{\text{ESS}}{\text{TSS}}$$
b) Adjustment for degrees of freedom.
$$A = \frac{N - 1}{N + p - c - 1}$$
c) Multiple R squared adjusted. It provides an estimate of the multiple correlation in the population
from which the sample was drawn. Note that it is an estimate of the multiple correlation which
would be obtained if the same predictors, but not necessarily the same coefficients, were used for the
population.
$$\text{Adjusted } R^2 = 1 - A(1 - R^2)$$
d) Multiple R adjusted. This is the multiple correlation coefficient adjusted for degrees of freedom. It
is an estimate of the R which would be obtained if the same predictors were applied to the population.
$$\text{Adjusted } R = \sqrt{1 - A(1 - R^2)}$$
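The degrees-of-freedom adjustment can be illustrated with a few lines of code. The sketch assumes the adjustment factor reads A = (N - 1)/(N + p - c - 1), which agrees with the one-way case A = (N - 1)/(N - c) when p = 1; the function names and the numbers are hypothetical.

```python
def adjustment(n, p, c):
    """Adjustment for degrees of freedom, assumed to be (N - 1) / (N + p - c - 1)."""
    return (n - 1) / (n + p - c - 1)

def adjusted_r2(r2, n, p, c):
    """Multiple R squared adjusted for degrees of freedom."""
    return 1 - adjustment(n, p, c) * (1 - r2)

# Hypothetical analysis: 101 cases, 3 predictors, 13 non-empty categories.
a = adjustment(101, 3, 13)
r2_adj = adjusted_r2(0.4, 101, 3, 13)
```

Because A is greater than 1 whenever c exceeds p, the adjusted value is always somewhat smaller than the unadjusted one.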
49.4 Summary Statistics of Residuals
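The residual for a case k is $r_k = y_k - \text{predicted } y_k$.
a) Mean.
$$\bar{r} = \frac{\sum_k w_k r_k}{W}$$
b) Variance (estimated).
$$s_r^2 = \left(\frac{N}{N-1}\right)\left(\frac{W \sum_k w_k r_k^2 - \left(\sum_k w_k r_k\right)^2}{W^2}\right)$$
c) Skewness. The skewness of the distribution of residuals is measured by
$$g_1 = \left(\frac{N}{N-2}\right)\left(\frac{m_3}{s_r^2 \sqrt{s_r^2}}\right)$$
where
$$m_3 = \frac{\sum_k w_k (r_k - \bar{r})^3}{W}$$
d) Kurtosis. The kurtosis of the distribution of residuals is measured by
$$g_2 = \left(\frac{N}{N-3}\right)\left(\frac{m_4}{(s_r^2)^2}\right) - 3$$
where
$$m_4 = \frac{\sum_k w_k (r_k - \bar{r})^4}{W}$$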
49.5 Predictor Category Statistics for One-Way Analysis of Variance
See the One-Way Analysis of Variance chapter for details.
49.6 One-Way Analysis of Variance Statistics
See the One-Way Analysis of Variance chapter for details. Note that the adjustment factor A used in the MCA
program for one-way analysis of variance is calculated differently than in the ONEWAY program, namely:
$$A = \frac{N - 1}{N - c}$$
49.7 References
Andrews, F.M., Morgan, J.N., Sonquist, J.A., and Klem, L., Multiple Classification Analysis, 2nd ed.,
Institute for Social Research, The University of Michigan, Ann Arbor, 1973.
Chapter 50
Multivariate Analysis of Variance
Notation
y = value of dependent variable or covariate
i, j = subscripts for categories of predictors
k = subscript for case
p = number of dependent variables
$df_h$ = degrees of freedom for the hypothesis
$df_e$ = degrees of freedom for error.
50.1 General Statistics
a) Cell means. Let $y_{ijk}$ represent a value of a dependent variable or covariate for the k-th case in the
(i, j)-th subclass of a two-way classification.
$$\bar{y}_{ij} = \frac{\sum_{k=1}^{N_{ij}} y_{ijk}}{N_{ij}}$$
where $N_{ij}$ is equal to the number of cases in the (i, j)-th subclass.
b) Basis of design. The design matrix is generated by first developing for each factor a one-way design
matrix (a one-way $K_f$ matrix) in accordance with the contrast type specified by the user for that factor.
The overall design matrix K is obtained from the one-way $K_f$ matrices by taking the Kronecker product
of the matrices.
The design matrix is always printed with the effects equations in columns, beginning with the grand
mean effect in the first column.
c) Intercorrelations among the normal equations coefficients. The basis of design is weighted by
the cell counts. The effect of unequal cell frequencies is to introduce correlations between columns of
the design matrix. These are those correlations. If the cell frequencies are equal, there will be 1's on
the diagonal and zeros elsewhere.
d) Solution of the normal equations. The parameters are estimated by least squares in the form
$$LX = (K'DK)^{-1} K'DY$$
where
L = the contrast matrix which has as rows the independent contrasts
in the parameters which are to be estimated and tested
X = the parameters to be estimated
K = the design matrix
D = a diagonal matrix with the number of cases in each cell
Y = a matrix of cell means with columns corresponding to variables.
When dealing with an orthogonal design and orthogonal contrasts, the contrasts have independent
estimates. For unequal cell frequencies, however, the K appropriate for orthogonal designs is no longer
orthogonal. It is required to transform K to orthogonality in the metric D. This is done by putting
$$T = SK'D^{1/2}$$
with
$$TT' = T'T = I = SK'DKS'$$
so
$$K'D^{1/2} = S^{-1}T$$
and
$$(K'DK)^{-1} = S'S$$
and, substituting in the first equation above,
$$(S')^{-1} LX = SK'DY$$
This last equation defines a new set of parameters which are linear functions of the contrasts, with the
matrix $SK'$ replacing $K'$; the matrix $(S')^{-1}$ is triangular.
e) Partitioning of matrices. In a univariate analysis of variance, each case has one dependent variable
y; in a multivariate analysis of variance, each case has a vector y of dependent variables. The multivariate
analogue of $y^2$ is the matrix product $yy'$.
$$S_b = Y_{\cdot}'\, D\, Y_{\cdot}$$
$$S_w = Y'Y - Y_{\cdot}'\, D\, Y_{\cdot}$$
where
$Y$ = the original $N \times p$ data matrix (N cases, p dependent variables)
$Y_{\cdot}$ = the $n \times p$ matrix of cell means (n cells, p dependent variables)
$D$ = a diagonal matrix with the number of cases in each cell.
The between-subclasses sum of products is partitioned further according to the effects in the model.
f) Error correlation matrix. In a multivariate analysis of variance, the error term is a variance-covariance
matrix. This is that error term reduced to a correlation matrix.
The correlation matrix is calculated using $S_w$, the within, or error, sum of products.
$$R_e = s_e^{-1}\, S_w\, s_e^{-1}$$
where
$S_w$ = the within-subclasses sum of products
$s_e^2$ = the diagonal entries of $S_w$.
$R_e$ is the matrix of correlation coefficients among the variates which estimate population values.
If the user specified that the within-subclasses sum of squares was to be augmented to form the error
term, augmentation takes place before the matrix is reduced to correlations.
g) Principal components of the error correlation matrix. This is a standard principal components
analysis of the matrix $R_e$. It indicates the factor structure of the variables found in the population
under study. The eigenvalues (or roots) are printed beneath the components.
h) Error dispersion matrix. This is the error term, a variance-covariance matrix, for the analysis. The
matrix is adjusted for covariates, if any. Each diagonal element of the matrix is exactly what would
appear in a conventional analysis of variance table as the within mean square error for the variable.
$$M_e = \frac{S_w}{df_e}$$
where
$S_w$ = the within-subclasses sum of products
$df_e$ = the degrees of freedom for error, adjusted for augmentation if that was requested.
If augmentation is not requested, the degrees of freedom for error equals the number of cases minus
the number of cells in the design.
i) Standard errors of estimation. They correspond to the square roots of the diagonal elements of
the matrix $M_e$.
50.2 Calculations for One Test in a Multivariate Analysis
The calculations are repeated for each test requested by the user. Results of internal calculations described
below under points a) to d) are not printed.
a) Sum of squares matrix due to hypothesis. The between-subclasses sum of squares is partitioned
according to the various effects in the model. For a given hypothesis to be tested, the program
determines the orthogonal estimates to be tested and computes the sum of squares due to hypothesis
($S_h$).
b) $S_w$ and $S_h$ reduced to mean squares and scaled to correlation space. The mean square matrix
for the hypothesis, $M_h$, is calculated analogously to the mean squares for error.
$$M_h = \frac{S_h}{df_h}$$
where
$S_h$ = the sum of squares matrix due to hypothesis (see above).
The degrees of freedom for the hypothesis depend on the test requested; for a test of main effect A,
where factor A has a levels, the degrees of freedom for hypothesis would be $a - 1$.
$M_h$ is a matrix of the between-subclass mean products associated with a main effect or interaction
hypothesis.
Both $M_e$ and $M_h$ are scaled to correlation space:
$$R_e = \sigma_e^{-1}\, M_e\, \sigma_e^{-1}$$
$$C_h = \sigma_e^{-1}\, M_h\, \sigma_e^{-1}$$
where
$R_e$ = the matrix of correlation coefficients among the variables estimating population values
$C_h$ = a matrix, which, although not a correlation matrix, does present the variances
and covariances for the variables as affected by the treatment
$M_e$ = the mean squares for error
$M_h$ = the mean squares for hypothesis
$\sigma_e$ = a diagonal matrix containing the standard errors of estimation.
The matrix $R_e$ is computed twice, once as described in the section Error correlation matrix and once
as described here. If no covariates were specified, the results are identical and the second $R_e$ matrix is
not printed. If one or more covariates were specified, the second $R_e$ matrix incorporates adjustments
for the covariate(s).
c) Solution of the determinantal equation. The usual method of computing Wilks' likelihood ratio
criterion is from the determinantal equation
$$|M_h - \lambda M_e| = 0$$
The above equation is pre- and post-multiplied by the diagonal matrix $\sigma_e^{-1}$
$$|\sigma_e^{-1}\, M_h\, \sigma_e^{-1} - \lambda R_e| = 0$$
Let
$$R_e = FF'$$
where
F = the matrix of principal components coefficients satisfying
$$|F^{-1} \sigma_e^{-1}\, M_h\, \sigma_e^{-1} (F^{-1})' - \lambda F^{-1}(FF')(F^{-1})'| = 0$$
or
$$|(\sigma_e F)^{-1}\, M_h\, ((\sigma_e F)^{-1})' - \lambda I| = 0$$
The last equation is then solved for the values $\lambda$.
d) Likelihood ratio criterion.
$$\Lambda = \prod_{q=1}^{s} \left(1 + \frac{df_h}{df_e}\,\lambda_q\right)^{-1}$$
where
$\lambda_q$ = the non-zero values of $\lambda$ from the last equation in the previous section.
e) F-ratio for likelihood ratio criterion. The program uses the F-approximation to the percentage
points of the null distribution of $\Lambda$.
$$F = \frac{1 - \Lambda^{1/k}}{\Lambda^{1/k}} \cdot \frac{k(2\,df_e + df_h - p - 1) - p(df_h) + 2}{2p(df_h)}$$
where
$$k = \sqrt{\frac{p^2 (df_h)^2 - 4}{p^2 + (df_h)^2 - 5}}$$
This is a multivariate test of significance of the effect for all the dependent variables simultaneously.
f) Degrees of freedom for the F-ratio.
$$p(df_h)$$
and
$$\frac{k(2\,df_e + df_h - p - 1) - p(df_h) + 2}{2}$$
If p = 1 or 2 and $df_h$ = 1 or 2, k is set to 1 in cases when $p(df_h) = 2$.
g) Canonical variances of the principal components of the hypothesis. These are the lambdas
calculated as described in the section Solution of the determinantal equation above. They are ordered
by decreasing magnitude. The number of non-zero lambdas for a given equation is equal to $df_h$ (the
number of degrees of freedom associated with $M_h$), or p, the number of dependent variables, whichever
is smaller.
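h) Coefficients of the principal components of the hypothesis. Solving equation
$$|(\sigma_e F)^{-1}\, M_h\, ((\sigma_e F)^{-1})' - \lambda I| = 0$$
gives rise to T, for which
$$F^{-1} \sigma_e^{-1}\, M_h\, \sigma_e^{-1} (F^{-1})' = T \operatorname{diag}(\lambda_i)\, T'$$
$$T' F^{-1} \sigma_e^{-1}\, X_h X_h'\, \sigma_e^{-1} (F^{-1})'\, T = \operatorname{diag}(\lambda_i)$$
The above equation is considered as
$$T' F^{-1} \sigma_e^{-1} X_h = S_h$$
where
$$S_h (S_h)' = \operatorname{diag}(\lambda_i)$$
and written in usual factor equation form, X = FS, is
$$\sigma_e^{-1} X_h = F\,T\,S_h$$
The coefficients of the principal components of the hypothesis, FT, are printed by the program.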
i) Contrast component scores for estimated effects. The rows of $S_h$ are the sets of factor scores,
attributable to hypothesis, that have as maximum variances the $\lambda_i$.
j) Cumulative Bartlett's tests on the roots. The tests can be used to determine the dimensionality of the configuration. The lambdas, or roots, are ordered in ascending order of magnitude. In the Bartlett's tests, all the roots are tested first. Then all but the first, then all but the first two, and so forth. The Chi-square test provides a test of the significance of the variance accounted for by the $n-k$ roots after the acceptance of the first $k$ roots.
First the lambdas are scaled
\[ \text{normed } \lambda_i = \frac{df_h}{df_e}\,\lambda_i \]
and then Chi-square is calculated
\[ \chi^2_{k+1} = \left[ df_e + df_h - \frac{df_h + p + 1}{2} \right] \left[ \sum_{i=k+1}^{s} \ln(\text{normed } \lambda_i + 1) \right] \]
where
k = the number of accepted roots ($k = 0, 1, \ldots, s-1$)
s = the number of roots.
The degrees of freedom are
\[ DF = (p - k)(g - k - 1) \]
where $g$ is equal to the number of levels of the hypothesis.
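The cumulative tests can be sketched as below. This is only an illustration of the two formulas above: the function name `bartlett_tests` is hypothetical, and the list `lambdas` is assumed to be supplied in the order in which the roots are to be accepted.

```python
import math

def bartlett_tests(lambdas, df_h, df_e, p, g):
    """Return [(k, chi_square, DF), ...] for k = 0, 1, ..., s-1."""
    # scale the roots: normed lambda_i = (df_h / df_e) * lambda_i
    normed = [df_h / df_e * lam for lam in lambdas]
    s = len(normed)
    multiplier = df_e + df_h - (df_h + p + 1) / 2
    results = []
    for k in range(s):
        # sum over the roots k+1, ..., s (0-based slice normed[k:])
        chi2 = multiplier * sum(math.log(v + 1) for v in normed[k:])
        dof = (p - k) * (g - k - 1)
        results.append((k, chi2, dof))
    return results
```

Each returned Chi-square value would then be referred to the Chi-square distribution with the corresponding DF.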
k) F-ratios for univariate tests. These are the diagonal elements of $\Delta_e^{-1} M_h \Delta_e^{-1}$. The F-ratio for variable $y$ is exactly the F-ratio which would be obtained for the given effect if a univariate analysis were performed with variable $y$ being the only dependent variable.
50.3 Univariate Analysis
If a single dependent variable is specified, the calculations are nonetheless performed as outlined above. Advantage, however, is taken of simplifications, e.g. the principal component of the error correlation matrix is set equal to one and no calculation is done.
The result of a univariate analysis of variance is a conventional ANOVA table with small differences. It contains a row for the grand mean but does not contain a row for the total. The grand mean is generally not interpretable. To obtain the total sum of squares, sum all the sums of squares except the sum for the grand mean.
50.4 Covariance Analysis
The formulas and discussion above do not, for the most part, take into account covariates. If one or more covariates were specified, it is the sums of products matrices, $S_e$ and $S_h$, which are adjusted. If there are $q$ covariates, the program begins by carrying them along with the $p$ dependent variables. There is a $(p+q) \times (p+q)$ sum of products of error matrix, $S_e$, and a $(p+q) \times (p+q)$ matrix $S_h$ for each hypothesis. The total matrix $S_t$ is computed. $S_e$ and $S_h$ are partitioned into sections corresponding to the dependent variables and covariates. Reduced $(p \times p)$ error and total matrices are obtained, and the reduced matrix for the hypothesis is then obtained by subtraction.
The error correlation matrix and the principal components of this matrix are computed after the adjustment to $S_e$ for covariates.
Chapter 51
One-Way Analysis of Variance
Notation
y = value of the dependent variable
w = value of the weight
k = subscript for case
i = subscript for category of the control variable
$N_i$ = number of cases in category i
$W_i$ = sum of weights for category i
N = total number of cases
W = total sum of weights
c = number of code categories of the control variable with non-zero degrees of freedom.
51.1 Descriptive Statistics for Categories of the Control Variable
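a) Mean.
\[ \bar{y}_i = \frac{\sum_k w_{ik}\, y_{ik}}{W_i} \]
b) Standard deviation (estimated).
\[ s_i = \sqrt{ \left( \frac{N_i}{N_i - 1} \right) \left( \frac{ W_i \sum_k w_{ik}\, y_{ik}^2 - \left( \sum_k w_{ik}\, y_{ik} \right)^2 }{ W_i^2 } \right) } \]
c) Coefficient of variation (C.var.).
\[ C_i = \frac{100\, s_i}{\bar{y}_i} \]
d) Sum of y.
\[ \text{Sum } y_i = \sum_k w_{ik}\, y_{ik} \]
e) Percent.
\[ \text{Percent}_i = \frac{\text{Sum } y_i}{\sum_i \text{Sum } y_i} \]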
f) Sum of y squared.
\[ \text{Sum } y_i^2 = \sum_k w_{ik}\, y_{ik}^2 \]
g) Total. The total row gives the statistics 1.a through 1.e above computed over all cases, except those in code categories with zero degrees of freedom.
h) Degrees of freedom for the category i.
\[ df_i = \frac{W_i (N_i - 1)}{N_i} \]
Categories with zero degrees of freedom are not included in the computation of summary statistics.
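The statistics for a single category can be sketched as follows. The function name `category_stats` and the dictionary keys are hypothetical; the sketch assumes the category has at least two cases and a non-zero mean.

```python
import math

def category_stats(y, w):
    """Statistics for one category i; y and w are equal-length lists."""
    n_i = len(y)
    w_i = sum(w)                                      # W_i
    swy = sum(wk * yk for wk, yk in zip(w, y))        # Sum y_i
    swy2 = sum(wk * yk * yk for wk, yk in zip(w, y))  # Sum y_i squared
    mean = swy / w_i
    # estimated variance: N_i/(N_i - 1) times the weighted variance
    var = (n_i / (n_i - 1)) * (w_i * swy2 - swy ** 2) / w_i ** 2
    sd = math.sqrt(var)
    return {"mean": mean, "sd": sd, "cvar": 100 * sd / mean,
            "sum": swy, "sum_sq": swy2, "df": w_i * (n_i - 1) / n_i}
```

For example, three cases with values 1, 2, 3 and unit weights give a mean of 2, an estimated standard deviation of 1 and 2 degrees of freedom.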
51.2 Analysis of Variance Statistics
a) Total sum of squares. The sums run over all cases.
\[ TSS = \sum_k w_{ik}\, y_{ik}^2 - \frac{\left( \sum_k w_{ik}\, y_{ik} \right)^2}{W} \]
b) Between means sum of squares. This is sometimes called the between groups (or inter-groups) sum of squares. In the first term the inner sums run over the cases of category i; in the second term the sum runs over all cases.
\[ BSS = \sum_i \left[ \frac{\left( \sum_k w_{ik}\, y_{ik} \right)^2}{\sum_k w_{ik}} \right] - \frac{\left( \sum_k w_{ik}\, y_{ik} \right)^2}{W} \]
c) Within groups sum of squares. This is sometimes called the intra-groups sum of squares.
\[ WSS = TSS - BSS \]
d) Eta squared. This measure can be interpreted as the proportion of variance in the dependent variable that can be explained by the control variable. It ranges from 0 to 1.
\[ \eta^2 = \frac{BSS}{TSS} \]
e) Eta. This is a measure of the strength of the association between the dependent variable and the control variable. It ranges from 0 to 1.
\[ \eta = \sqrt{\frac{BSS}{TSS}} \]
f) Eta squared adjusted. Eta squared adjusted for degrees of freedom.
\[ \text{Adjusted } \eta^2 = 1 - A(1 - \eta^2) \]
with adjustment factor
\[ A = \frac{W - 1}{W - c} \]
g) Eta adjusted.
\[ \text{Adjusted } \eta = \sqrt{\text{Adjusted } \eta^2} \]
h) F value. The F ratio can be referred to the F distribution with $c - 1$ and $N - c$ degrees of freedom. A significant F ratio means that mean differences, or effects, probably exist among the groups.
\[ F = \frac{BSS/(c - 1)}{WSS/(N - c)} \]
The F ratio is not computed if a weight variable was specified.
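A compact sketch of these statistics follows. The function name `one_way_anova` and the input layout (a dictionary mapping each category code to a list of `(y, w)` pairs) are assumptions made for illustration; categories with a single case are treated as having zero degrees of freedom and are skipped.

```python
import math

def one_way_anova(groups, weighted=False):
    """groups: {code: [(y, w), ...]}; returns the statistics of section 51.2."""
    # keep only categories with non-zero degrees of freedom (here: N_i > 1)
    used = [cases for cases in groups.values() if len(cases) > 1]
    n = sum(len(cases) for cases in used)
    c = len(used)
    W = sum(w for cases in used for _, w in cases)
    swy = sum(w * y for cases in used for y, w in cases)
    swy2 = sum(w * y * y for cases in used for y, w in cases)
    tss = swy2 - swy ** 2 / W
    bss = sum(sum(w * y for y, w in cases) ** 2 / sum(w for _, w in cases)
              for cases in used) - swy ** 2 / W
    wss = tss - bss
    eta2 = bss / tss
    eta2_adj = 1 - (W - 1) / (W - c) * (1 - eta2)
    # the F ratio is not computed when a weight variable is specified
    f = None if weighted else (bss / (c - 1)) / (wss / (n - c))
    return {"TSS": tss, "BSS": bss, "WSS": wss, "eta2": eta2,
            "eta": math.sqrt(eta2), "eta2_adj": eta2_adj, "F": f}
```

For two categories with values (1, 2, 3) and (4, 5, 6) and unit weights, this gives TSS = 17.5, BSS = 13.5, WSS = 4 and F = 13.5.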
Chapter 52
Partial Order Scoring
52.1 Special Terminology and Denitions
Let a set of elements be denoted by $V = \{a, b, c, \ldots\}$ and a binary relation defined on it by $R$.
a) Binary relation. A binary relation $R$ in $V$ is such that for any two elements $a, b \in V$
\[ aRb \]
For every binary relation $R$ in $V$ there exists a converse relation $R^{+}$ in $V$ such that
\[ bR^{+}a \]
b) Reflexive and anti-reflexive relation. A relation $R$ is reflexive when
\[ aRa \quad \text{for all } a \in V \]
and $R$ is anti-reflexive when
\[ \text{not}(aRa) \quad \text{for all } a \in V \]
c) Symmetric and anti-symmetric relation. A relation $R$ is symmetric when $R = R^{+}$, that is when
\[ aRb \Rightarrow bRa \quad \text{for all } a, b \in V \]
and $R$ is anti-symmetric when symmetry does not appear for all $a \neq b$.
d) Transitive relation. A relation $R$ is transitive when
\[ aRb \wedge bRc \Rightarrow aRc \quad \text{for all } a, b, c \in V \]
e) Equivalence relation. A relation $R$ defined on a set of elements $V$ is an equivalence relation when it is:
   reflexive,
   symmetric, and
   transitive.
Note that the commonly used "equality relation", (=), defined on the set of real numbers is an equivalence relation.
f) Strict partial order relation. A relation $R$ is called a strict partial order when it satisfies the conditions:
   $aRb$ and $bRa$ cannot hold simultaneously, and
   $R$ is transitive.
A strict partial order relation is denoted hereafter by $\prec$.
g) Partially ordered set. A set $V$ is called a partially ordered set if a strict partial order relation $\prec$ is defined on it. The fundamental properties of a partially ordered set are:
   $a \prec b \wedge b \prec c \Rightarrow a \prec c$ for all $a, b, c \in V$
   $a \prec b$ and $b \prec a$ cannot hold simultaneously.
h) Ordered set. A set $V$ is called an ordered set if there are two relations $\approx$ and $\prec$ defined on it and they satisfy the axioms of ordering:
   for any two elements $a, b \in V$, one and only one of the relations $a \approx b$, $a \prec b$, $b \prec a$ holds,
   $\approx$ is an equivalence relation, and
   $\prec$ is a transitive relation.
In other words, an ordered set is a partially ordered set with an additional equivalence relation defined on it, and where the conditions "neither $a \prec b$ nor $b \prec a$" and "$a \approx b$" are equivalent.
i) Subset of elements dominating an element a.
\[ G(a) = \{\, g \mid g \in V;\ a \prec g \,\} \]
j) Subset of elements dominated by an element a.
\[ L(a) = \{\, l \mid l \in V;\ l \prec a \,\} \]
k) Subset of comparable elements.
\[ C(a) = G(a) \cup L(a) \]
Note that $G(a) \cap L(a) = \emptyset$.
l) Strict dominance. An element $b$ strictly dominates an element $a$ if
\[ a \preceq b \ \text{ and } \ \text{not}(b \preceq a) \]
It can also be said that $b$ is strictly better than $a$, or that $a$ is strictly worse than $b$.
52.2 Calculation of Scores
Let a list of variables to be used in the analysis be denoted by
\[ \{x_1, x_2, \ldots, x_i, \ldots, x_v\} \]
and a priority list associated with them by
\[ \{p_1, p_2, \ldots, p_i, \ldots, p_v\}. \]
The partial order relation constructed on the basis of this collection of variables,
\[ a \preceq b \quad \text{for any cases } a \text{ and } b \]
is equivalent to the condition
\[ x_1(a) \le x_1(b),\ x_2(a) \le x_2(b),\ \ldots,\ x_v(a) \le x_v(b) \]
where $x_i(a)$ and $x_i(b)$ denote values of the $i$-th variable for cases $a$ and $b$ respectively.
When comparing two cases, the variables of highest priority (lowest LEVEL value) are considered first. If they unambiguously determine the relation, the comparison procedure ends. In the situation of equality,
the comparison is continued using variables of the next priority level. This procedure is repeated until the
relation is determined at one of the priority levels, or the end of the variable list is reached.
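One possible reading of this level-by-level comparison is sketched below. It is an interpretation, not the program's algorithm: the function name `compare_cases` and the `levels` layout are hypothetical, larger values are assumed to be better, and a contradictory result within one priority level is assumed to end the comparison with the verdict "not comparable".

```python
def compare_cases(a, b, levels):
    """Compare two cases level by level.

    a, b   : sequences of variable values for the two cases
    levels : list of lists of variable indices, highest priority first
    Returns 1 if a dominates b, -1 if b dominates a,
            0 if the cases are equal on every level, None if not comparable.
    """
    for indices in levels:
        a_le_b = all(a[i] <= b[i] for i in indices)
        a_ge_b = all(a[i] >= b[i] for i in indices)
        if a_le_b and a_ge_b:
            continue          # equality at this level: go to the next level
        if a_le_b:
            return -1         # b is at least as good everywhere, better somewhere
        if a_ge_b:
            return 1
        return None           # variables of this level disagree
    return 0                  # end of the variable list reached
```

For instance, `compare_cases((1, 5), (2, 3), [[0], [1]])` is decided by the first level alone, whereas placing both variables on one level, `[[0, 1]]`, makes the same two cases not comparable.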
For each case a from the analyzed set, the program calculates:
N(a) = the number of cases strictly dominating the case a
N(a) = the number of cases equivalent to the case a
N(a) = the number of cases strictly dominated by the case a
and then one (or two) of the following scores:
s
1
(a) = S
N(a)
N(a) +N(a) +N(a)
r
1
(a) = S s
1
(a)
s
2
(a) = S
N(a) +N(a)
N(a) +N(a) +N(a)
r
2
(a) = S s
2
(a)
s
3
(a) = S
N(a)
N
r
3
(a) = S
N(a) +N(a)
N
s
4
(a) = S
N(a) +N(a)
N
r
4
(a) = S
N(a)
N
where
N = total number of cases in the analyzed set
S = the value of the scale factor (see the SCALE parameter).
The values of the ORDER parameter select the score(s) as follows:
ASEA : $r_3(a)$
DEEA : $s_4(a)$
ASCA : $r_4(a)$
DESA : $s_3(a)$
ASER : $s_1(a)$, $r_1(a)$
DESR : $s_1(a)$, $r_1(a)$
ASCR : $s_2(a)$, $r_2(a)$
DEER : $s_2(a)$, $r_2(a)$.
52.3 References
Debreu, G., Representation of a preference ordering by a numerical function, Decision Process, eds. R.M.
Thrall, C.A. Coombs and R.L. Davis, New York, 1954.
Hunya, P., A Ranking Procedure Based on Partially Ordered Sets, Internal paper, JATE, Szeged, 1976.
Chapter 53
Pearsonian Correlation
Notation
x, y = values of variables
w = value of the weight
k = subscript for case
N = number of valid cases on both x and y
W = total sum of weights.
53.1 Paired Statistics
They are computed for variables taken by pair (x, y) on the subset of cases having valid data on both x and
y.
a) Adjusted weighted sum. The number of cases, weighted, with valid data on both x and y.
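b) Mean of x.
\[ \bar{x} = \frac{\sum_k w_k x_k}{W} \]
Note: the formula for mean of y is analogous.
c) Standard deviation of x (estimated).
\[ s_x = \sqrt{ \left( \frac{N}{N - 1} \right) \left( \frac{ W \sum_k w_k x_k^2 - \left( \sum_k w_k x_k \right)^2 }{ W^2 } \right) } \]
Note: the formula for standard deviation of y is analogous.
d) Correlation coefficient. Pearson's product moment coefficient r.
\[ r_{xy} = \frac{ W \sum_k w_k x_k y_k - \left( \sum_k w_k x_k \right) \left( \sum_k w_k y_k \right) }{ \sqrt{ \left[ W \sum_k w_k x_k^2 - \left( \sum_k w_k x_k \right)^2 \right] \left[ W \sum_k w_k y_k^2 - \left( \sum_k w_k y_k \right)^2 \right] } } \]
e) t-test. This statistic is used to test the hypothesis that the population correlation coefficient is zero.
\[ t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}} \]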
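The correlation coefficient and its t statistic can be sketched from the weighted sums as follows. The function name `weighted_pearson` is hypothetical, the inputs are assumed to contain only cases valid on both variables, and $|r| < 1$ is assumed so that t is finite.

```python
import math

def weighted_pearson(x, y, w=None):
    """Return (r_xy, t) for paired data with optional case weights."""
    n = len(x)
    w = [1.0] * n if w is None else w
    W = sum(w)
    sx = sum(wk * xk for wk, xk in zip(w, x))
    sy = sum(wk * yk for wk, yk in zip(w, y))
    sxx = sum(wk * xk * xk for wk, xk in zip(w, x))
    syy = sum(wk * yk * yk for wk, yk in zip(w, y))
    sxy = sum(wk * xk * yk for wk, xk, yk in zip(w, x, y))
    r = (W * sxy - sx * sy) / math.sqrt(
        (W * sxx - sx ** 2) * (W * syy - sy ** 2))
    # t uses N, the number of valid cases, not the sum of weights
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return r, t
```

For x = (1, 2, 3, 4) and y = (1, 3, 2, 4) with unit weights, r is 0.8.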
53.2 Unpaired Means and Standard Deviations
They are computed variable by variable for all variables included in the analysis using the formulas given in 1.a, 1.b and 1.c respectively, the potential difference in results being due to a different number of valid cases.
a) Adjusted weighted sum. The number of cases, weighted, with valid data on x.
b) Mean of x. Mean of variable x for all cases with valid data on x.
c) Standard deviation of x (estimated). Standard deviation of variable x for all cases with valid
data on x.
53.3 Regression Equation for Raw Scores
It is computed on all valid cases for the pair (x, y).
a) Regression coefficient. This is the unstandardized regression coefficient of y (dependent variable) on x (independent variable).
\[ B_{yx} = r_{xy} \left( \frac{s_y}{s_x} \right) \]
b) Constant term.
\[ A = \bar{y} - B_{yx}\, \bar{x}; \qquad \text{regression equation: } y = B_{yx}\, x + A \]
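As a small worked sketch, the two formulas translate directly into code. The function name `regression_line` and its argument names are hypothetical; the paired statistics are assumed to have been computed beforehand.

```python
def regression_line(r_xy, s_x, s_y, mean_x, mean_y):
    """Return (B_yx, A) of the regression equation y = B_yx * x + A."""
    b_yx = r_xy * (s_y / s_x)
    a = mean_y - b_yx * mean_x
    return b_yx, a
```

With r = 0.8, equal standard deviations and both means equal to 2.5, the coefficient is 0.8 and the constant term is 0.5.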
53.4 Correlation Matrix
The elements of this matrix are computed on the basis of the formula given under 1.d above. Note that
standard deviations output with correlation matrix are calculated according to the formula given under 1.c
above (estimated standard deviations).
53.5 Cross-products Matrix
It is a square matrix with the following elements:
\[ CP_{xy} = \sum_k w_k x_k y_k \]
53.6 Covariance Matrix
It is a matrix containing the following elements:
\[ COV_{xy} = r_{xy}\, s_x\, s_y \]
where
\[ s_x = \sqrt{ \frac{ W \sum_k w_k x_k^2 - \left( \sum_k w_k x_k \right)^2 }{ W^2 } } \]
and $s_y$ is calculated according to the analogous formula.
Note that the covariance matrix output by PEARSON does not contain diagonal elements. In order to allow their recalculation, standard deviations output with this matrix are calculated according to the above formula (unestimated standard deviations).
Chapter 54
Rank-ordering of Alternatives
Notation
i, j, l = subscripts for alternatives
m = number of alternatives
k = case index
n = number of cases
w = value of the weight.
54.1 Handling of Input Data
Let a set of alternatives be denoted by $A = \{a_1, a_2, \ldots, a_i, \ldots, a_m\}$ and the set of sources of information (called hereafter evaluations) be denoted by $E = \{e_1, e_2, \ldots, e_k, \ldots, e_n\}$.
In practice, data providing the primary information on the preference relations may appear in rather various
forms. The program accepts, however, two basic types of data: data representing a selection of alternatives
and data representing a ranking of alternatives. All other forms of data should be transformed by the user
prior to the execution of the RANK program.
a) Data representing a selection of alternatives. In this case the evaluations represent the choice of the most preferred alternatives and optionally their preference order. In other words, all the evaluations $e_k$ select a subset $A_k$ from $A$ and optionally order the elements of it. For this reason $A_k$ is a subset of alternatives (ordered or non-ordered), and the $A_k$'s constitute the primary individual data:
\[ A_k = \{ a_{k i_1}, a_{k i_2}, \ldots, a_{k i_{p_k}} \} \]
where
$p$ = maximum number of alternatives which could be selected in an evaluation
$p_k$ = number of alternatives actually selected in the evaluation $e_k$
and $p_k \le p < m$.
b) Data representing a ranking of alternatives. Here the evaluations represent the ranking of the alternatives within the whole set $A$, and the attribution to each of them of its rank number. Formally, all the evaluations $e_k$ give a rank number $\rho_k(a_i) = \rho_{ki}$ to all the alternatives. In this case the data are provided in the following form:
\[ P_k = \{ \rho_k(a_1), \rho_k(a_2), \ldots, \rho_k(a_m) \} \]
Note that an alternative $a_{k i_1}$ is strictly preferred to or strictly dominates another alternative $a_{k i_2}$ according to the data coming from the evaluation $e_k$ if the former has a rank higher than the latter.
Similarly, an alternative $a_{k i_1}$ is preferred to or dominates another alternative $a_{k i_2}$ according to the data coming from the evaluation $e_k$ if the rank of $a_{k i_1}$ is at least as high as the rank of $a_{k i_2}$. The value 1 is taken for the highest rank.
Only the data described in paragraph b) are directly processed by the program. The data depicted in a) are transformed into the form of b). This transformation makes a distinction between the strict and the weak preference.
The transformation rule, when dealing with data representing a completely ordered selection of alternatives (strict preference), is the following:
\[ \text{for } a_i \in A_k: \quad \rho_k(a_{i_1}) = 1,\ \rho_k(a_{i_2}) = 2,\ \ldots,\ \rho_k(a_{i_{p_k}}) = p_k \]
\[ \text{for } a_i \notin A_k: \quad \rho_k(a_i) = \frac{p_k + 1 + m}{2} \]
When dealing with data representing a non-ordered selection of alternatives (weak preference), it is assumed that all the selected alternatives are at the same level of preference. According to this assumption, the transformation rule is:
\[ \text{for } a_i \in A_k: \quad \rho_k(a_i) = \frac{p_k + 1}{2} \]
\[ \text{for } a_i \notin A_k: \quad \rho_k(a_i) = \frac{p_k + 1 + m}{2} \]
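The two transformation rules can be sketched as one small function. The function name `ranks_from_selection` is hypothetical, and alternatives are assumed to be identified by 0-based indices.

```python
def ranks_from_selection(selected, m, ordered=True):
    """Turn one evaluation's selection into a row of m rank numbers.

    selected : indices of the selected alternatives, in preference order
    m        : total number of alternatives
    ordered  : True for strict preference, False for weak preference
    """
    p_k = len(selected)
    # every non-selected alternative receives the mean of the ranks p_k+1..m
    ranks = [(p_k + 1 + m) / 2] * m
    for position, alt in enumerate(selected, start=1):
        # ordered: ranks 1..p_k; non-ordered: the mean of the ranks 1..p_k
        ranks[alt] = position if ordered else (p_k + 1) / 2
    return ranks
```

With five alternatives and the ordered selection (third, first), the resulting row is 2, 4, 1, 4, 4; in both variants the ranks sum to 1 + 2 + ... + m.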
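As a result of the transformations defined above, the preference (or priority choice) data are for the next steps of analyses in the form:
\[ P_{(n,m)} = \begin{bmatrix}
\rho_{11} & \rho_{12} & \cdots & \rho_{1i} & \cdots & \rho_{1m} \\
\rho_{21} & \rho_{22} & \cdots & \rho_{2i} & \cdots & \rho_{2m} \\
\vdots & \vdots & & \vdots & & \vdots \\
\rho_{k1} & \rho_{k2} & \cdots & \rho_{ki} & \cdots & \rho_{km} \\
\vdots & \vdots & & \vdots & & \vdots \\
\rho_{n1} & \rho_{n2} & \cdots & \rho_{ni} & \cdots & \rho_{nm}
\end{bmatrix} \]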
54.2 Method of Classical Logic Ranking
In this method the matrix P is used as the initial data for the analysis. Concerning the strict or weak
character of the preference relation it should be noted that it plays a role only in the steps leading to the
matrix P. In the further steps of the analysis, the procedure is controlled by other parameters, such as rank difference for concordance and rank difference for discordance (see below).
The classical logic ranking procedure consists of two major steps, namely: a) construction of the relations, and b) identification of cores.
a) Construction of the relations. In this step, two working relations (the concordance relation and the discordance relation) are constructed first. Then they are used to construct a final dominance relation.
i) The concordance and discordance relations are built from the matrix $P_{(n,m)}$, and the rules applied in this process are essentially the same for both relations.
Concordance relation. Two parameters are used in creating a relation which reflects the concordance of the collective opinion that "$a_i$ is preferred to $a_j$":
$d_c$ = the rank difference for concordance ($0 \le d_c \le m - 1$)
$p_c$ = the minimum proportion for concordance ($0 \le p_c < 1$).
Rank difference for concordance enables the user to influence the evaluation of data when constructing the individual preference matrices
\[ RC_k(d_c) = \left[ rc^k_{ij}(d_c) \right] \quad \text{where } i, j = 1, 2, \ldots, m. \]
The elements of $RC_k(d_c)$, which measure the dominance of $a_i$ over $a_j$ according to the evaluation $k$, are defined as follows:
\[ rc^k_{ij}(d_c) = \begin{cases} 1 & \text{if } \rho_{kj} - \rho_{ki} \ge d_c \\ 0 & \text{otherwise.} \end{cases} \]
The aggregation of these matrices measures the average dominance of $a_i$ over $a_j$ and has the form of a fuzzy relation described by the matrix
\[ RC(d_c) = \left[ rc_{ij}(d_c) \right] \]
where
\[ rc_{ij}(d_c) = \frac{\sum_k w_k\, rc^k_{ij}(d_c)}{\sum_k w_k} \]
Note that higher $d_c$ values lead to more rigorous construction rules, since $d^1_c < d^2_c$ implies
\[ rc^k_{ij}(d^1_c) \ge rc^k_{ij}(d^2_c) \quad \text{and} \quad rc_{ij}(d^1_c) \ge rc_{ij}(d^2_c) \]
Minimum proportion for concordance makes it possible to transform the fuzzy relation $RC(d_c)$ into a non-fuzzy one, called the concordance relation, described by the matrix
\[ RC(d_c, p_c) = \left[ rc_{ij}(d_c, p_c) \right] \]
the elements of which are defined as follows:
\[ rc_{ij}(d_c, p_c) = \begin{cases} 1 & \text{if } rc_{ij}(d_c) \ge p_c \\ 0 & \text{otherwise.} \end{cases} \]
The condition $rc_{ij}(d_c, p_c) = 1$ means that the collective opinion is in concordance with the statement "$a_i$ is preferred to $a_j$" at the level $(d_c, p_c)$.
It is clear again that by increasing the $p_c$ value one obtains stricter conditions for the concordance.
Discordance relation. The construction of the discordance relation follows the same way as was explained for the concordance. The two parameters controlling the construction are:
$d_d$ = the rank difference for discordance ($0 \le d_d \le m - 1$)
$p_d$ = the maximum proportion for discordance ($0 \le p_d \le 1$).
The individual discordance relations are determined first in the matrices
\[ RD_k(d_d) = \left[ rd^k_{ij}(d_d) \right] \quad \text{where } i, j = 1, 2, \ldots, m. \]
The elements of $RD_k(d_d)$, which measure the dominance of $a_j$ over $a_i$ according to the evaluation $k$, are defined as follows:
\[ rd^k_{ij}(d_d) = \begin{cases} 1 & \text{if } \rho_{ki} - \rho_{kj} \ge d_d \\ 0 & \text{otherwise.} \end{cases} \]
The aggregation of these matrices measures the average dominance of $a_j$ over $a_i$ and has the form of a fuzzy relation described by the matrix
\[ RD(d_d) = \left[ rd_{ij}(d_d) \right] \]
where
\[ rd_{ij}(d_d) = \frac{\sum_k w_k\, rd^k_{ij}(d_d)}{\sum_k w_k} \]
As for concordance, the second parameter (maximum proportion for discordance) enables the user to transform the fuzzy relation $RD(d_d)$ into a non-fuzzy one, called the discordance relation, described by the matrix
\[ RD(d_d, p_d) = \left[ rd_{ij}(d_d, p_d) \right] \]
the elements of which are defined as follows:
\[ rd_{ij}(d_d, p_d) = \begin{cases} 1 & \text{if } rd_{ij}(d_d) > p_d \\ 0 & \text{otherwise.} \end{cases} \]
The condition $rd_{ij}(d_d, p_d) = 1$ means that the collective opinion is in discordance with the statement "$a_i$ is preferred to $a_j$", i.e. supports the opposite statement "$a_j$ is preferred to $a_i$", at the level $(d_d, p_d)$. This can be interpreted as a collective veto against the statement "$a_i$ is preferred to $a_j$".
Note that higher values of $d_d$ and $p_d$ lead to less rigorous construction rules and thus to weaker conditions for discordance.
ii) The dominance relation is composed of the concordance and discordance relations. The basic idea is that the statement "$a_i$ is preferred to $a_j$" can be accepted if the collective opinion
   is in concordance with it, i.e. $rc_{ij}(d_c, p_c) = 1$, and
   is not in discordance with it, i.e. $rd_{ij}(d_d, p_d) = 0$;
otherwise this statement has to be rejected. So the dominance relation, being a function of four parameters, is described by the matrix $R$ of $m \times m$ dimensions
\[ R = \left[ r_{ij}(d_c, p_c, d_d, p_d) \right] \]
where the elements are obtained according to the expression
\[ r_{ij}(d_c, p_c, d_d, p_d) = \min \left[ rc_{ij}(d_c, p_c),\ 1 - rd_{ij}(d_d, p_d) \right] \]
The $r_{ij}$ is a monotonically decreasing function of the first two parameters, and a monotonically increasing function of the last two. This implies that:
   by increasing $d_c$, $p_c$ and/or decreasing $d_d$, $p_d$ one can diminish the number of connections in the dominance relation, and
   by changing the parameters in the opposite direction one can create more connections.
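The construction of the dominance relation from a rank matrix can be sketched as follows. This is an illustration of the definitions above only: the function name `dominance_relation` and its argument names are hypothetical, `P` is assumed to be a list of n rows of m rank numbers, and `w` the list of case weights.

```python
def dominance_relation(P, w, d_c, p_c, d_d, p_d):
    """Return the m x m 0/1 dominance matrix R."""
    m = len(P[0])
    total_w = sum(w)
    R = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            # weighted share of evaluations ranking a_i ahead of a_j by >= d_c
            rc = sum(wk for row, wk in zip(P, w)
                     if row[j] - row[i] >= d_c) / total_w
            # weighted share of evaluations ranking a_j ahead of a_i by >= d_d
            rd = sum(wk for row, wk in zip(P, w)
                     if row[i] - row[j] >= d_d) / total_w
            concordant = 1 if rc >= p_c else 0
            discordant = 1 if rd > p_d else 0
            R[i][j] = min(concordant, 1 - discordant)
    return R
```

For three evaluations with rank rows (1, 2, 3), (1, 3, 2) and (2, 1, 3), unit weights, $d_c = d_d = 1$, $p_c = 0.6$ and $p_d = 0.4$, the first alternative dominates the other two and the second dominates the third.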
b) Identification of cores. The cores are subsets of $A$ (set of alternatives) consisting of non-dominated alternatives. An alternative $a_j$ is non-dominated if and only if
\[ r_{ij} = 0 \quad \text{for all } i = 1, 2, \ldots, m. \]
i) According to this criterion the core of the set $A$ (the highest level core) is the subset
\[ C(A) = \{\, a_j \mid a_j \in A;\ r_{ij} = 0,\ i = 1, 2, \ldots, m \,\} \]
If $C(A) = \emptyset$ then all the alternatives are dominated.
If $C(A) = A$ then all the alternatives are non-dominated.
ii) In order to find the subsequent core, the elements of the previous core are removed from the dominance relation first. This means that the corresponding rows and columns are removed from the relational matrix. Then the search for a new core is repeated in the reduced structure.
The successive application of i) and ii) gives a series of cores $A^c_1, A^c_2, \ldots, A^c_q$. These cores represent consecutive layers of alternatives with decreasing ranks in the preference structure, while the alternatives belonging to the same core are assumed to be of the same rank.
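The successive extraction of cores can be sketched as below. The function name `find_cores` is hypothetical. What happens when a core is empty is not described here, so the sketch simply stops and returns the leftover alternatives separately; that choice is an assumption.

```python
def find_cores(R):
    """Return (layers, leftover) for a 0/1 dominance matrix R.

    layers   : list of cores, each a list of alternative indices
    leftover : alternatives remaining when an empty core is met
    """
    remaining = list(range(len(R)))
    layers = []
    while remaining:
        # a_j is non-dominated if r_ij = 0 for every remaining a_i
        core = [j for j in remaining
                if all(R[i][j] == 0 for i in remaining)]
        if not core:
            break  # all remaining alternatives are dominated
        layers.append(core)
        # removing the core deletes its rows and columns from the relation
        remaining = [a for a in remaining if a not in core]
    return layers, remaining
```

Applied to the matrix with rows (0, 1, 1), (0, 0, 1), (0, 0, 0), it yields three single-element cores in the order first, second, third alternative.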
54.3 Methods of Fuzzy Logic Ranking: the Input Relation
In the fuzzy logic ranking methods, the matrix $P_{(n,m)}$ is used to construct: a) individual preference relations, and b) the input relation (called also a fuzzy relation) on the set of alternatives $A$. Here the strict and weak character of the preference relation plays an important role.
a) Construction of individual preference relations. For each evaluation $e_k$ an individual preference relation, which is given implicitly in $P$, is transformed into the matrix of $m \times m$ dimensions:
\[ R_k = \left[ r^k_{ij} \right] \quad \text{where } i, j = 1, 2, \ldots, m \]
in which
\[ r^k_{ij} = \begin{cases} 1 & \text{if the statement "} a_i \text{ is preferred to } a_j \text{ in the evaluation } e_k \text{" is true;} \\ 0 & \text{if this statement is false.} \end{cases} \]
Depending on the preference type used, the statement "$a_i$ is preferred to $a_j$ in the evaluation $e_k$" is equivalent to the inequality
   $\rho_{ki} < \rho_{kj}$ (strict preference), or
   $\rho_{ki} \le \rho_{kj}$ (weak preference).
b) Construction of the input relation (fuzzy relation). The aggregation of the individual preference relation matrices provides the matrix representing a fuzzy relation on the set of alternatives $A$:
\[ R = \left[ r_{ij} \right] \]
where
\[ r_{ij} = \frac{\sum_k w_k\, r^k_{ij}}{\sum_k w_k} \]
Each component $r_{ij}$ of $R$ can be interpreted as the credibility of the statement "$a_i$ is preferred to $a_j$" in a global sense, and without referring to the single evaluation. Thus, the following general interpretation is possible:
   $r_{ij} = 1$ : $a_i$ is preferred to $a_j$ in all the evaluations,
   $r_{ij} = 0$ : $a_i$ is preferred to $a_j$ in no evaluation,
   $0 < r_{ij} < 1$ : $a_i$ is preferred to $a_j$ in a certain portion of the evaluations.
c) Characteristics of the input relation.
i) Fuzziness
   non-fuzzy : if $r_{ij} = 0$ or $r_{ij} = 1$ for all $i, j = 1, 2, \ldots, m$;
   fuzzy : otherwise.
ii) Symmetry
   symmetric : if $r_{ij} = r_{ji}$ for all $i, j = 1, 2, \ldots, m$;
   anti-symmetric : if $r_{ij} \neq 0$ implies $r_{ji} = 0$ for all $i \neq j$;
   asymmetric : otherwise.
iii) Reflexivity
   reflexive : if $r_{ii} = 1$ for all $i = 1, 2, \ldots, m$;
   anti-reflexive : if $r_{ii} = 0$ for all $i = 1, 2, \ldots, m$;
   irreflexive : otherwise.
iv) Trichotomy
   trichotome (normalized) : if $r_{ij} + r_{ji} = 1$ for all $i, j = 1, 2, \ldots, m$ and $i \neq j$;
   non-trichotome (non-normalized) : otherwise.
v) Coherence index. Its value, $C$, depends on the order of the rows and columns in $R$, i.e. on the order of the alternatives in $A$, and $-1 \le C \le 1$.
\[ C = \frac{\sum_{i<j} (r_{ij} - r_{ji})}{\sum_{i<j} (r_{ij} + r_{ji})} \]
Absolute coherence index is an order-independent modification of $C$. Its value, $C_a$, is the upper bound for $C$ and $0 \le C_a \le 1$.
\[ C_a = \frac{\sum_{i<j} |r_{ij} - r_{ji}|}{\sum_{i<j} (r_{ij} + r_{ji})} \]
Indices $C$ and $C_a$ are indicators of unanimity in the preference data. A full coherence is shown when $C = 1$, while $C_a = 0$ indicates a full lack of coherence. The value $-1$ of the index $C$ can be interpreted as an order of alternatives opposite to the order defined by the fuzzy relation.
vi) Intensity index. This index can be interpreted as an average credibility level of the statements "$a_i$ is preferred to $a_j$" or "$a_j$ is preferred to $a_i$". In general, its value $1 \le I \le 2$, while in the case of a strict preference $0 \le I \le 1$. Here $I = 1$ implies a normalized relation (see 3.c below) and means that in all the preference data one of the above statements is valid for all the pairs of alternatives.
\[ I = \frac{\sum_{i<j} (r_{ij} + r_{ji})}{m(m-1)/2} \]
vii) Dominance index. It is also an order-dependent index, and $-1 \le D \le 1$.
\[ D = \frac{\sum_{i<j} (r_{ij} - r_{ji})}{m(m-1)/2} \]
Absolute dominance index, similarly to the coherence index, is defined as the order-independent dominance index. Its value, $D_a$, is the upper bound for $D$ and $0 \le D_a \le 1$.
\[ D_a = \frac{\sum_{i<j} |r_{ij} - r_{ji}|}{m(m-1)/2} \]
The indices $D$ and $D_a$ indicate the average difference between the credibility of the statements "$a_i$ is preferred to $a_j$" and of their opposite statements "$a_j$ is preferred to $a_i$".
Note that $C$, $I$, $D$ and $C_a$, $I$, $D_a$ are not independent of one another, namely:
\[ C \cdot I = D \quad \text{and} \quad C_a \cdot I = D_a \]
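The five indices can be computed in one pass over the pairs $i < j$, as in the sketch below. The function name `relation_indices` and the dictionary keys are hypothetical; the sketch assumes that at least one pair has $r_{ij} + r_{ji} > 0$.

```python
def relation_indices(R):
    """Return the indices C, Ca, I, D and Da of a fuzzy relation matrix R."""
    m = len(R)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    diff = sum(R[i][j] - R[j][i] for i, j in pairs)
    abs_diff = sum(abs(R[i][j] - R[j][i]) for i, j in pairs)
    total = sum(R[i][j] + R[j][i] for i, j in pairs)
    n_pairs = m * (m - 1) / 2
    return {"C": diff / total, "Ca": abs_diff / total,
            "I": total / n_pairs,
            "D": diff / n_pairs, "Da": abs_diff / n_pairs}
```

Because C and D share the numerator, and I carries the ratio of the two denominators, the identities C I = D and C_a I = D_a hold for any input.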
d) Normalized matrix. A normalized matrix is obtained from the $R$ matrix using the following transformation:
\[ r'_{ij} = \begin{cases} \dfrac{r_{ij}}{r_{ij} + r_{ji}} & \text{if } i \neq j \text{ and } r_{ij} + r_{ji} \neq 0 \\ r_{ij} & \text{otherwise.} \end{cases} \]
54.4 Fuzzy Method-1: Non-dominated Layers
The fuzzy logic ranking methods assume a fuzzy preference relation with the membership function $\mu : A \times A \to [0, 1]$ on a given set $A$ of alternatives. This membership function is represented by the matrix $R$ (see section 3 above). The values $r_{ij} = \mu(a_i, a_j)$ are understood as the degrees to which the preferences expressed by the statements "$a_i$ is preferred to $a_j$" are true.
Another assumption is that:
   in the case of weak preference, $\mu$ is reflexive, i.e.
   \[ \mu(a_i, a_i) = r_{ii} = 1 \quad \text{for all } a_i \in A \]
   in the case of strict preference, $\mu$ is anti-reflexive, i.e.
   \[ \mu(a_i, a_i) = r_{ii} = 0 \quad \text{for all } a_i \in A \]
The fuzzy method-1 procedure looks for a set of non-dominated alternatives (denoted ND alternatives), considering such a set as the highest level core of alternatives. The reason for this is that ND
alternatives are either equivalent to one another, or are not comparable to one another on the basis of the
preference relation considered, and they are not dominated in a strict sense by others.
In order to determine a fuzzy set of ND alternatives, two fuzzy relations corresponding to the given preference relation $R$ are defined: fuzzy quasi-equivalence relation and fuzzy strict preference relation. Formally they are defined as follows:
   fuzzy quasi-equivalence relation $R_e$:
   \[ R_e = R \cap R^{-1} \]
   fuzzy strict preference relation $R_s$:
   \[ R_s = R \setminus R_e = R \setminus (R \cap R^{-1}) = R \setminus R^{-1} \]
where $R^{-1}$ is a relation opposite to the relation $R$.
Furthermore, the following membership functions are defined respectively for $R_e$ and $R_s$:
\[ \mu_e(a_i, a_j) = \min(r_{ij}, r_{ji}) \]
\[ \mu_s(a_i, a_j) = \begin{cases} r_{ij} - r_{ji} & \text{when } r_{ij} > r_{ji} \\ 0 & \text{otherwise.} \end{cases} \]
For any fixed alternative $a_j \in A$ the function $\mu_s(a_j, a_i)$ describes a fuzzy set of alternatives which are strictly dominated by $a_j$. The complement of this fuzzy set, described by the membership function $1 - \mu_s(a_j, a_i)$, is for any fixed $a_j$ the fuzzy set of all the alternatives which are not strictly dominated by $a_j$. Then the intersection of all such complement fuzzy sets (over all $a_j \in A$) represents the fuzzy set of those alternatives $a_i \in A$ which are not strictly dominated by any of the alternatives from the set $A$. This set is called the fuzzy set $\mu_{ND}$ of ND alternatives in the set $A$. Thus, according to the definition of intersection
\[ \mu_{ND}(a_i) = \min_{a_j \in A} \left( 1 - \mu_s(a_j, a_i) \right) = 1 - \max_{a_j \in A} \mu_s(a_j, a_i) \]
The value $\mu_{ND}(a_i)$ represents the degree to which the alternative $a_i$ is not strictly dominated by any of the alternatives from the set $A$.
The highest level core of alternatives contains those alternatives $a_i$ which have the greatest degree of non-dominance or, in other words, which give a value for $\mu_{ND}(a_i)$ that is equal to the value:
\[ M_{ND} = \max_{a_i \in A} \mu_{ND}(a_i) \]
The value of $M_{ND}$ is called the certainty level corresponding to the core defined by:
\[ C(A) = \{\, a_i \mid a_i \in A;\ \mu_{ND}(a_i) = M_{ND} \,\} \]
The subsequent cores are constructed by a repeated application of the procedure described above. The elements of the previous core are removed from the fuzzy relation first, i.e. the corresponding rows and columns are removed from the fuzzy relation matrix. Then the calculations are repeated in the reduced structure.
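The layer-by-layer procedure can be sketched as follows. The function name `nd_layers` is hypothetical, and the exact floating-point comparison used to select the core members is a simplification made for illustration.

```python
def nd_layers(R):
    """Return [(core, certainty_level), ...] for a fuzzy relation matrix R."""
    remaining = list(range(len(R)))
    layers = []
    while remaining:
        mu_nd = {}
        for i in remaining:
            # mu_s(a_j, a_i) = r_ji - r_ij when r_ji > r_ij, otherwise 0
            worst = max(max(R[j][i] - R[i][j], 0.0) for j in remaining)
            mu_nd[i] = 1.0 - worst
        level = max(mu_nd.values())                  # M_ND
        core = [i for i in remaining if mu_nd[i] == level]
        layers.append((core, level))
        # remove the core's rows and columns, then repeat
        remaining = [i for i in remaining if i not in core]
    return layers
```

Unlike the classical logic method, every pass yields a non-empty core, because at least one alternative always attains the maximum degree of non-dominance.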
54.5 Fuzzy Method-2: Ranks
The input relation to this method is the same as to the method-1, namely: the matrix $R$ which has to be reflexive or anti-reflexive. However, the question to be answered here is quite different.
The fuzzy method-2 procedure looks for the level of credibility, denoted $c_{jp}$, of statements "$a_j$ is exactly at the $p$-th place in the ordered sequence of the alternatives in $A$", denoted $T_{jp}$. The $c_{jp}$ values form a matrix $M$ of $m \times m$ dimensions representing a fuzzy membership function, in which the rows correspond to the alternatives and the columns to the possible positions in the sequence $1, 2, \ldots, m$.
In order to make possible the calculation of the $c_{jp}$'s they must be decomposed into already known credibility levels $r_{ij}$, and thus the statements $T_{jp}$ must be decomposed into elementary statements with known credibility levels $r_{ij}$. For that, further notations are introduced. Note that for an alternative $a_j$ "being exactly at the $p$-th place" means that it is preferred to $m - p$ alternatives and is preceded by the remaining $p - 1$
alternatives. When the subset of alternatives after $a_j$ is fixed, then
$A^j_{m-p}$ = the subset of those alternatives to which $a_j$ is preferred,
$A^j_{p-1}$ = the subset of alternatives which are preferred to $a_j$,
$A^j$ = the subset $A \setminus \{a_j\}$.
Obviously,
\[ A^j_{p-1} \cup A^j_{m-p} = A^j \qquad A^j_{p-1} \cap A^j_{m-p} = \emptyset \]
and the statement $T_{jp}$ is equivalent to a sequence of statements "$a_j$ is preferred to all the elements of $A^j_{m-p}$ and all the elements of $A^j_{p-1}$ are preferred to $a_j$", connected by the disjunctive operator of logic.
Furthermore, the statement "$a_j$ is preferred to all the elements of $A^j_{m-p}$" is a conjunction of the already known statements "$a_j$ is preferred to $a_l$", with the credibility level equal to $r_{jl}$, for all the elements $a_l$ of $A^j_{m-p}$.
Similarly, the statement "all the elements of $A^j_{p-1}$ are preferred to $a_j$" is a conjunction of the already known statements "$a_i$ is preferred to $a_j$", with the credibility level equal to $r_{ij}$, for all the elements $a_i$ of $A^j_{p-1}$.
Applying the corresponding fuzzy operators, the elements of the matrix $M$ can be obtained as follows:
\[ c_{jp} = \max_{A^j_{m-p} \subseteq A^j} \left[ \min \left( \min_{a_l \in A^j_{m-p}} r_{jl},\ \min_{a_i \in A^j_{p-1}} r_{ij} \right) \right] \]
The computation of the $c_{jp}$ values is performed using an optimization procedure which produces a series of subsets $A^j_{m-p}$ (while keeping $j$ and $p$ fixed) with strictly monotonically increasing values of the function to be maximized in successive steps.
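For small $m$ the max-min expression can be evaluated by plain enumeration of all subsets, as sketched below. This brute-force search is a substitute chosen for clarity, not the optimization procedure used by the program; the function name `credibility_matrix` is hypothetical and $m \ge 2$ is assumed.

```python
from itertools import combinations

def credibility_matrix(R):
    """Return M with M[j][p-1] = c_jp, by enumerating every subset A^j_{m-p}."""
    m = len(R)
    M = [[0.0] * m for _ in range(m)]
    for j in range(m):
        others = [a for a in range(m) if a != j]          # A^j
        for p in range(1, m + 1):
            best = 0.0
            for after in combinations(others, m - p):     # A^j_{m-p}
                before = [a for a in others if a not in after]  # A^j_{p-1}
                # conjunction (min) of all elementary statements
                values = ([R[j][l] for l in after] +
                          [R[i][j] for i in before])
                # disjunction (max) over the candidate subsets
                best = max(best, min(values))
            M[j][p - 1] = best
    return M
```

The number of subsets grows combinatorially with $m$, so this enumeration is only practical for a handful of alternatives.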
The program provides two ways of interpretation of the matrix $M$.
Fuzzy sets of ranks by alternatives.
For each alternative $a_j$, the fuzzy membership function values show the credibility of having this alternative at the $p$-th place ($p = 1, 2, \ldots, m$). Also, the most credible ranks (places) for each alternative are listed.
Fuzzy subsets of alternatives by ranks.
For each rank (place) $p$, a fuzzy membership function value shows the credibility of the alternative $a_j$ ($j = 1, 2, \ldots, m$) to be at this place. Also the most credible alternatives, candidates for the place, are listed.
54.6 References
Dussaix, A.-M., Deux méthodes de détermination de priorités ou de choix, Partie 1: Fondements mathématiques, Document UNESCO/NS/ROU/624, UNESCO, Paris, 1984.
Jacquet-Lagrèze, E., Analyse d'opinions valuées et graphes de préférence, Mathématiques et sciences humaines, 33, 1971.
Jacquet-Lagrèze, E., L'agrégation des opinions individuelles, Informatique et sciences humaines, 4, 1969.
Kaufmann, A., Introduction à la théorie des sous-ensembles flous, Masson, Paris, 1975.
Orlovski, S.A., Decision-making with a fuzzy preference relation, Fuzzy Sets and Systems, Vol.1, No 3, 1978.
Chapter 55
Scatter Diagrams
Notation
x = value of the variable to be plotted horizontally
y = value of the variable to be plotted vertically
w = value of the weight
k = subscript for case
N = total number of cases
W = total sum of weights.
55.1 Univariate Statistics
These unweighted statistics are calculated for all variables used in the execution.
a) Mean.
$$\bar{x} = \frac{\sum_k x_k}{N}$$
b) Standard deviation.
$$s_x = \sqrt{\frac{\sum_k x_k^2}{N} - \bar{x}^2}$$
55.2 Paired Univariate Statistics
They are calculated on the set of cases having valid data on both x and y. These are weighted statistics if a weight variable is specified.
a) Mean.
$$\bar{x} = \frac{\sum_k w_k x_k}{W}$$
Note: the formula for $\bar{y}$ is analogous.
b) Standard deviation.
$$s_x = \sqrt{\frac{\sum_k w_k x_k^2}{W} - \bar{x}^2}$$
Note: the formula for $s_y$ is analogous.
c) N. The number of cases, weighted, with valid data on both x and y.
55.3 Bivariate Statistics
They are calculated on the set of cases having valid data on both x and y.
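a) Pearson's product moment r.
$$r_{xy} = \frac{W \sum_k w_k x_k y_k - \left(\sum_k w_k x_k\right)\left(\sum_k w_k y_k\right)}{\sqrt{\left[W \sum_k w_k x_k^2 - \left(\sum_k w_k x_k\right)^2\right]\left[W \sum_k w_k y_k^2 - \left(\sum_k w_k y_k\right)^2\right]}}$$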
b) Regression statistics: constant A and coefficient B.
$$A = \frac{\sum_k w_k y_k - B \sum_k w_k x_k}{W}$$
where B is the unstandardized regression coefficient.
$$B = \frac{W \sum_k w_k x_k y_k - \left(\sum_k w_k x_k\right)\left(\sum_k w_k y_k\right)}{W \sum_k w_k x_k^2 - \left(\sum_k w_k x_k\right)^2}$$
The constant A and coefficient B can be used in the regression equation $y = Bx + A$ to predict y from x.
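The bivariate formulas above can be checked numerically with a short sketch. It is an illustration only, assuming that all cases passed in already have valid data on both variables; the function name `weighted_bivariate` is hypothetical and not part of the program.

```python
import math

def weighted_bivariate(x, y, w=None):
    """Return (r_xy, A, B) from the weighted sums used in the formulas above."""
    if w is None:
        w = [1.0] * len(x)          # unweighted case: every weight equals 1
    W = sum(w)
    sx = sum(wk * xk for wk, xk in zip(w, x))
    sy = sum(wk * yk for wk, yk in zip(w, y))
    sxx = sum(wk * xk * xk for wk, xk in zip(w, x))
    syy = sum(wk * yk * yk for wk, yk in zip(w, y))
    sxy = sum(wk * xk * yk for wk, xk, yk in zip(w, x, y))
    num = W * sxy - sx * sy
    r = num / math.sqrt((W * sxx - sx ** 2) * (W * syy - sy ** 2))
    B = num / (W * sxx - sx ** 2)   # unstandardized regression coefficient
    A = (sy - B * sx) / W           # regression constant
    return r, A, B

r_xy, A, B = weighted_bivariate([1, 2, 3, 4], [3, 5, 7, 9])
print(r_xy, A, B)
```

With the sample data, y is an exact linear function of x, so the sketch should report a correlation of 1 and the regression line y = 2x + 1.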
Chapter 56
Searching for Structure
Notation
y = value of the dependent variable
x = frequency (weighted) of the categorical dependent variable
or values (weighted) of dichotomous dependent variables
z = value of the covariate
w = value of the weight
k = subscript for case
j = subscript for category code of the dependent variable
or subscript for dichotomous dependent variables
m = number of codes of the dependent variable
or number of dichotomous dependent variables
g = subscript for group; g = 1 indicates the whole sample
i = subscript for final groups
t = number of final groups
$N_g$ = number of cases in group g
$W_g$ = sum of weights in group g
$N_i$ = number of cases in the final group i
$W_i$ = sum of weights in the final group i
N = total number of cases
W = total sum of weights.
56.1 Means analysis
This method can be used when analysing one dependent variable (interval or dichotomous) and several predictors. It aims at creating groups which would allow for the best prediction of the dependent variable values from the group average. In other words, the created groups should provide the largest differences in group means. Thus, the splitting criterion (explained variation) is based upon group means.
a) Trace statistics. These are the statistics calculated on the whole sample (for g = 1), and on tentative splits for parent groups as well as for each group resulting from the best split.
i) Sum (wt). Number of cases ($N_g$) if the weight variable is not specified, or weighted number of cases ($W_g$) in group g.
ii) Mean y. Mean value of the dependent variable y in group g.
$$\bar{y}_g = \frac{\sum_{k=1}^{N_g} w_k y_{gk}}{W_g}$$
iii) Var y. Variance of the dependent variable y in group g.
$$\sigma^2_{yg} = \frac{\sum_{k=1}^{N_g} w_k (y_{gk} - \bar{y}_g)^2}{W_g - \dfrac{W_g}{N_g}}$$
iv) Variation. Sum of squares of the dependent variable (as in one-way analysis of variance) in group g.
$$V_g = \sum_{k=1}^{N_g} w_k (y_{gk} - \bar{y}_g)^2$$
v) Var expl. Explained variation is measured by the difference between the variation in the parent group and the sum of variation in the two children groups. It provides, for each predictor, the amount of variation explained by the best split for this predictor, i.e. the highest value obtained over all possible splits for this predictor.
Let $g_1$ and $g_2$ denote two subgroups (children groups) obtained in a split of the parent group g, and $V_{g_1}$ and $V_{g_2}$ their respective variation. The variation explained by such a split of group g is calculated as follows:
$$EV_g = V_g - (V_{g_1} + V_{g_2})$$
Then, this value is maximized over all possible splits for the predictor.
vi) Explained variation. This is the percent of the total variation explained by the final groups.
$$Percent = 100 \, \frac{EV}{TV}$$
where EV and TV are, respectively, the variation explained by the final groups and the total variation (see 1.b below).
b) One-way analysis of final groups. These are one-way analysis of variance statistics calculated for the final groups.
i) Explained variation and DF. This is the amount of variation explained by the final groups and the corresponding degrees of freedom.
$$EV = TV - UV = TV - \sum_{i=1}^{t} V_i \qquad DF = t - 1$$
ii) Total variation and DF. Variation calculated for the whole sample, i.e. for group 1, and the corresponding degrees of freedom.
$$TV = V_1 \qquad DF = W - 1$$
iii) Error and DF. This is the amount of unexplained variation and the corresponding degrees of freedom.
$$UV = \sum_{i=1}^{t} V_i \qquad DF = W - t$$
c) Split summary table. The table provides group mean value, variance and variation of the dependent variable at each split as well as the variation explained by that split (see 1.a above).
d) Final group summary table. The table provides mean value, variance and variation of the dependent variable for the final groups (see 1.a above).
e) Percent of explained variation. The percent of total variation explained by the best split for each group is calculated as follows:
$$Percent_g = 100 \, \frac{EV_g}{TV}$$
Note that this value is equal to zero for the final groups (indicated by an asterisk).
f) Residuals. The residuals are the differences between the observed value and the predicted value of the dependent variable.
$$e_k = y_k - \hat{y}_k$$
As predicted value, a case is assigned the mean value of the dependent variable for the group to which it belongs, i.e.
$$\hat{y}_{ik} = \bar{y}_i$$
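The splitting criterion of means analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's search procedure: it assumes a single predictor with ordered codes and only tries splits of the form "code <= c" against "code > c"; the names `variation` and `best_split` are hypothetical.

```python
def variation(y, w):
    """Weighted sum of squares V_g of y about its weighted group mean."""
    W = sum(w)
    mean = sum(wk * yk for wk, yk in zip(w, y)) / W
    return sum(wk * (yk - mean) ** 2 for wk, yk in zip(w, y))

def best_split(y, w, codes):
    """Return (EV_g, threshold) for the best two-way split on one predictor."""
    parent = variation(y, w)
    best_ev, best_cut = 0.0, None
    for cut in sorted(set(codes))[:-1]:          # assumed: ordered predictor codes
        left = [k for k, c in enumerate(codes) if c <= cut]
        right = [k for k, c in enumerate(codes) if c > cut]
        v1 = variation([y[k] for k in left], [w[k] for k in left])
        v2 = variation([y[k] for k in right], [w[k] for k in right])
        ev = parent - (v1 + v2)                  # EV_g = V_g - (V_g1 + V_g2)
        if ev > best_ev:
            best_ev, best_cut = ev, cut
    return best_ev, best_cut

ev, cut = best_split([1.0, 1.0, 5.0, 5.0], [1.0] * 4, [1, 2, 3, 4])
print(ev, cut)
```

In the sample data the split between codes 2 and 3 separates the two values of y completely, so it explains the whole parent variation.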
56.2 Regression Analysis
This method can be used when analysing a dependent variable (interval or dichotomous) with one covariate and several predictors. It aims at creating groups which would allow for the best prediction of the dependent variable values from the group regression equation and the value of the covariate. In other words, the created groups should provide the largest differences in group regression lines. The splitting criterion (explained variation) is based upon group regression of the dependent variable on the covariate.
a) Trace statistics. These are the statistics calculated on the whole sample (for g = 1), and on tentative splits for parent groups as well as for each group resulting from the best split.
i) Sum (wt). Number of cases ($N_g$) if the weight variable is not specified, or weighted number of cases ($W_g$) in group g.
ii) Mean y,z. Mean value of the dependent variable y and the covariate z in group g (see 1.a.ii
above).
iii) Var y,z. Variance of the dependent variable y and the covariate z in group g (see 1.a.iii above).
iv) Slope. This is the slope of the dependent variable y on the covariate z in group g.
$$b_g = \frac{\sum_{k=1}^{N_g} w_k (y_{gk} - \bar{y}_g)(z_{gk} - \bar{z}_g)}{\sum_{k=1}^{N_g} w_k (z_{gk} - \bar{z}_g)^2}$$
v) Variation. This is the error or residual sum of squares from estimating the variable y by its regression on the covariate in group g, i.e. a measure of deviation about the regression line.
$$V_g = \sum_{k=1}^{N_g} w_k (y_{gk} - \bar{y}_g)^2 - b_g \sum_{k=1}^{N_g} w_k (y_{gk} - \bar{y}_g)(z_{gk} - \bar{z}_g)$$
where $b_g$ is the slope of the regression line in group g.
vi) Var expl. Explained variation (EV). See 1.a.v above for general information, and 2.a.v above for details on V (variation) used in regression analysis.
vii) Explained variation. This is the percent of the total variation explained by the final groups. See 1.a.vi above and 2.b below.
b) One-way analysis of final groups. These are the summary statistics for the final groups. See 1.b above for general information, and 2.a.v and 2.a.vi above for details on V and EV measures used in regression analysis.
c) Split summary table. The table provides group mean value, variance and variation of the dependent
variable at each split as well as the variation explained by that split. It also provides mean value and
variance of the covariate. See 2.a above for formulas. Moreover, the following regression statistics are
calculated for each split:
i) Slope. It is the slope of the dependent variable y on the covariate z in group g (see 2.a.iv above).
ii) Intercept. It is the constant term in the regression equation.
$$a_g = \bar{y}_g - b_g \bar{z}_g$$
where $b_g$ is the slope in group g.
iii) Corr. Pearson r correlation coefficient between the dependent variable y and the covariate z in group g.
$$r_g = \frac{\sum_{k=1}^{N_g} w_k (y_{gk} - \bar{y}_g)(z_{gk} - \bar{z}_g)}{\sqrt{\sigma^2_{yg} \, \sigma^2_{zg}}}$$
d) Final group summary table. The table provides the same information (except the explained variation) as in the Split summary table, but for final groups.
e) Percent of explained variation. The percent of total variation explained by the best split for each group (see 1.e and 2.a.vi above).
f) Residuals. The residuals are the differences between the observed value and the predicted value of the dependent variable.
$$e_k = y_k - \hat{y}_k$$
Predicted values are calculated as follows:
$$\hat{y}_{ik} = a_i + b_i z_{ik}$$
where $a_i$ and $b_i$ are regression coefficients for the final group i.
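The group regression statistics of this section (slope, intercept and residual variation) can be sketched for a single group as below. The function name `group_regression` is hypothetical, and the sketch assumes that the covariate is not constant within the group.

```python
def group_regression(y, z, w):
    """Return (slope b_g, intercept a_g, residual variation V_g) for one group."""
    W = sum(w)
    y_mean = sum(wk * yk for wk, yk in zip(w, y)) / W
    z_mean = sum(wk * zk for wk, zk in zip(w, z)) / W
    syy = sum(wk * (yk - y_mean) ** 2 for wk, yk in zip(w, y))
    szz = sum(wk * (zk - z_mean) ** 2 for wk, zk in zip(w, z))
    syz = sum(wk * (yk - y_mean) * (zk - z_mean) for wk, yk, zk in zip(w, y, z))
    b = syz / szz                 # slope of y on z
    a = y_mean - b * z_mean       # intercept
    v = syy - b * syz             # residual sum of squares about the regression line
    return b, a, v

b, a, v = group_regression([1.0, 3.0, 5.0, 7.0], [0.0, 1.0, 2.0, 3.0], [1.0] * 4)
print(b, a, v)
```

The predicted value for a case in the group is then `a + b * z`, and the residual is the observed y minus that prediction.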
56.3 Chi-square Analysis
This method can be used when analysing one dependent variable (nominal or ordinal) or a set of dichotomous dependent variables with several predictors. It aims at creating groups which would allow for the best prediction of the dependent variable category from its group distribution. In other words, the created groups should provide the largest differences in the dependent variable distributions. The splitting criterion (explained variation) is calculated on the basis of frequency distributions of the dependent variable. Note that multiple dependent dichotomous variables are treated as categories of one categorical variable.
a) Trace statistics. These are the statistics calculated on the whole sample (for g = 1), and on tentative splits for parent groups as well as for each group resulting from the best split.
i) Sum (wt). Number of cases ($N_g$) if the weight variable is not specified, or weighted number of cases ($W_g$) in group g.
ii) Variation. This is the entropy for group g, i.e. a measure of disorder in the distribution of the dependent variable.
$$V_g = -2 \sum_{j=1}^{m} x_{jg} \ln \frac{x_{jg}}{x_{\cdot g}}$$
where
$$x_{jg} = \sum_{k=1}^{N_g} x_{jgk} \qquad x_{\cdot g} = \sum_{j=1}^{m} x_{jg}$$
and $x_{jgk}$ is the frequency (coded 0 or 1) of code j (or value of variable j) of case k in group g.
iii) Var expl. Explained variation (EV). See 1.a.v above for general information, and 3.a.ii above for details on V (variation) used in chi-square analysis.
iv) Explained variation. This is the percent of the total variation explained by the final groups. See 1.a.vi above and 3.b below.
b) One-way analysis of final groups. These are the summary statistics for the final groups. See 1.b above for general information, and 3.a.ii and 3.a.iii above for details on V and EV measures used in chi-square analysis.
c) Split summary table. The table provides variation of the dependent variable at each split as well as the variation explained by that split. See 3.a.ii and 3.a.iii above for formulas.
d) Final group summary table. The table provides variation of the dependent variable for the final groups.
e) Percent of explained variation. The percent of total variation explained by the best split for each group (see 1.e and 3.a.iii above).
f) Percent distributions. A bivariate table showing percentage distributions of the dependent variable for all groups ($P_{jg}$).
g) Residuals. The residuals are the differences between the observed value and the predicted value of the dependent variable.
For analysis with one categorical dependent variable, residuals are calculated for each category of the variable. Thus, the number of residuals is equal to the number of categories.
$$e_{jk} = x_{jk} - \hat{x}_{jik}$$
Observed values, $x_{jk}$, are created as a series of dummy variables, coded 0 or 1.
As predicted value for category j, a case is assigned the proportion of cases being in this category for the group to which the case belongs, i.e.
$$\hat{x}_{jik} = P_{ji} / 100$$
For analysis with several dichotomous dependent variables, residuals are calculated for each variable. Thus, the number of residuals is equal to the number of dependent variables.
$$e_{jk} = x_{jk} - \hat{x}_{jik}$$
Observed values are calculated as follows:
$$x_{jk} = \frac{x_{jk}}{\sum_{j=1}^{m} x_{jk}}$$
As predicted value for variable j, a case is assigned the proportion of cases having value 1 for this variable in the group to which the case belongs, i.e.
$$\hat{x}_{jik} = P_{ji} / 100$$
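The entropy-based variation used as the splitting criterion in chi-square analysis can be sketched as follows. The sketch assumes the usual convention that an empty category contributes zero to the sum; the names `entropy_variation` and `explained_variation` are hypothetical.

```python
import math

def entropy_variation(counts):
    """V_g = -2 * sum_j x_jg * ln(x_jg / x_.g) for one group's category frequencies."""
    total = sum(counts)
    # Empty categories are skipped (assumed convention: 0 * ln 0 = 0).
    return -2.0 * sum(x * math.log(x / total) for x in counts if x > 0)

def explained_variation(parent, child1, child2):
    """EV_g = V_g - (V_g1 + V_g2) for a split of the parent group."""
    return entropy_variation(parent) - (
        entropy_variation(child1) + entropy_variation(child2)
    )

# A parent group with two equally frequent categories, split into two pure groups.
print(entropy_variation([2, 2]))
print(explained_variation([2, 2], [2, 0], [0, 2]))
```

A split into pure groups leaves no variation in the children, so all of the parent variation is explained.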
56.4 References
Morgan, J.N., Messenger, R.C., THAID: A Sequential Analysis Program for the Analysis of Nominal Scale Dependent Variables, Institute for Social Research, The University of Michigan, Ann Arbor, 1973.
Sonquist, J.A., Baker, E.L., Morgan, J.N., Searching for Structure, Revised ed., Institute for Social Research, The University of Michigan, Ann Arbor, 1974.
Chapter 57
Univariate and Bivariate Tables
Notation
x = value of the row variable in bivariate tables,
or value of the variable in univariate tables
y = value of the column variable in bivariate tables
w = value of the weight
k = subscript for case
i = subscript for row in bivariate tables
j = subscript for column in bivariate tables
r = number of rows in bivariate tables
c = number of columns in bivariate tables
$f_{i\cdot}$ = marginal frequency in the row i of a bivariate table
$f_{\cdot j}$ = marginal frequency in the column j of a bivariate table
N = total number of cases.
57.1 Univariate Statistics
a) Wtnum. The weight variable number, or zero if the weight variable is not specified.
b) Wtsum. Number of cases if the weight variable is not specified, or weighted number of cases (sum of weights).
c) Mode. The first category which contains the maximum frequency.
d) Median. The median is calculated as an n-tile with two requested subintervals. See the "Distribution and Lorenz Functions" chapter for details.
e) Mean.
$$\bar{x} = \frac{\sum_k w_k x_k}{\sum_k w_k}$$
f) Variance. This is an unbiased estimate of the population variance.
$$s_x^2 = \left(\frac{N}{N-1}\right) \frac{\sum_k w_k (x_k - \bar{x})^2}{\sum_k w_k}$$
g) Standard deviation. It should be noted that $s_x$ is not itself an unbiased estimate of the population standard deviation.
$$s_x = \sqrt{s_x^2}$$
h) Coefficient of variation (C.var.).
$$C_x = \frac{100 \, s_x}{\bar{x}}$$
i) Skewness. The skewness of the distribution of x is measured by
$$g_1 = \left(\frac{N}{N-2}\right) \left(\frac{m_3}{s_x^2 \sqrt{s_x^2}}\right) \qquad \text{where} \quad m_3 = \frac{\sum_k w_k (x_k - \bar{x})^3}{\sum_k w_k}$$
Skewness is a measure of asymmetry. Distributions which are skewed to the right, i.e. the tail is on the right, have positive skewness; distributions which are skewed to the left have negative skewness; a normal distribution has skewness equal to 0.0.
j) Kurtosis. The kurtosis of the distribution of x is measured by
$$g_2 = \left(\frac{N}{N-3}\right) \left(\frac{m_4}{(s_x^2)^2}\right) - 3 \qquad \text{where} \quad m_4 = \frac{\sum_k w_k (x_k - \bar{x})^4}{\sum_k w_k}$$
Kurtosis measures the peakedness of a distribution. A normal distribution has kurtosis equal to 0.0. A curve with a sharper peak has positive kurtosis; distributions less peaked than a normal distribution have negative kurtosis.
k) n-tiles. The n-tile break points are calculated the same way as in the QUANTILE program.
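The variance, skewness and kurtosis formulas of this section can be sketched in a few lines. This is an illustration only: it assumes that N is the plain number of cases (at least four, so that no correction factor divides by zero), and the function name `univariate_stats` is hypothetical.

```python
import math

def univariate_stats(x, w=None):
    """Return mean, variance, skewness g1 and kurtosis g2 as defined above."""
    n = len(x)                       # assumed: N is the unweighted number of cases
    if w is None:
        w = [1.0] * n
    sw = sum(w)
    mean = sum(wk * xk for wk, xk in zip(w, x)) / sw

    def moment(power):
        # Weighted central moment of the given order.
        return sum(wk * (xk - mean) ** power for wk, xk in zip(w, x)) / sw

    var = n / (n - 1) * moment(2)
    skew = n / (n - 2) * moment(3) / (var * math.sqrt(var))
    kurt = n / (n - 3) * moment(4) / var ** 2 - 3.0
    return {"mean": mean, "variance": var, "skewness": skew, "kurtosis": kurt}

stats = univariate_stats([1.0, 2.0, 3.0, 4.0, 5.0])
print(stats)
```

For the symmetric sample above the skewness is zero, and the flat shape of the data gives a negative kurtosis.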
57.2 Bivariate Statistics
a) Chi-square. Chi-square is appropriate for testing the significance of differences of distributions among independent groups.
$$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - E_{ij})^2}{E_{ij}}$$
where
$f_{ij}$ = the observed frequency in cell ij
$E_{ij}$ = the expected (calculated) frequency in cell ij; it is the product of the frequency in the row i times the frequency in the column j, divided by the total N.
For two by two tables, the $\chi^2$ is computed according to the following formula:
$$\chi^2 = \frac{N (|ad - bc| - N/2)^2}{(a+b)(c+d)(a+c)(b+d)}$$
where a, b, c, d represent the frequencies in the four cells.
b) Cramér's V. Cramér's V describes the strength of association in a sample. Its value lies between 0.0, reflecting complete independence, and 1.0, showing complete dependence of the attributes.
$$V = \sqrt{\frac{\chi^2}{N(L-1)}}$$
where $L = \min(r, c)$.
c) Contingency coefficient. Like Cramér's V, the coefficient of contingency is used to describe the strength of association in a sample. Its upper limit is a function of the number of categories. The index cannot attain 1.0.
$$CC = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$
d) Degrees of freedom.
$$df = (r-1)(c-1)$$
e) Adjusted N. This is the N used in the statistical computations, i.e. the number of cases with valid codes. It is weighted if a weight variable was specified.
f) S. S equals the number of agreements in order minus the number of disagreements in order. For a given cell in a table, all the cases in cells to the right and below are in agreement, all the cases to the left and below are in disagreement. S is the numerator of the tau statistics and of gamma.
$$S = \sum_{i=1}^{r-1} \sum_{j=1}^{c} f_{ij} \left[ \sum_{h=i+1}^{r} \sum_{l=j+1}^{c} f_{hl} - \sum_{m=i+1}^{r} \sum_{n=1}^{j-1} f_{mn} \right]$$
where $f_{ij}$, $f_{hl}$ and $f_{mn}$ are the observed frequencies in cells ij, hl and mn respectively.
g) Variance of S. This is the variance of S when ties exist. (A tie is present in the data if more than one case appears in a given row or column.)
$$\sigma_s^2 = \frac{N(N-1)(2N+5) - \sum_j f_{\cdot j}(f_{\cdot j}-1)(2f_{\cdot j}+5) - \sum_i f_{i\cdot}(f_{i\cdot}-1)(2f_{i\cdot}+5)}{18}$$
$$+ \; \frac{\left[\sum_j f_{\cdot j}(f_{\cdot j}-1)(f_{\cdot j}-2)\right]\left[\sum_i f_{i\cdot}(f_{i\cdot}-1)(f_{i\cdot}-2)\right]}{9N(N-1)(N-2)}$$
$$+ \; \frac{\left[\sum_j f_{\cdot j}(f_{\cdot j}-1)\right]\left[\sum_i f_{i\cdot}(f_{i\cdot}-1)\right]}{2N(N-1)}$$
h) Standard deviation of S.
$$\sigma_s = \sqrt{\sigma_s^2}$$
i) Normal deviation of S. It provides a large sample test of significance for tau or gamma with ties. The minus one in the numerator is a correction for continuity (if S is negative, unity is added). The value may be referred to a normal distribution table. The test is conditional on the distribution of ties.
$$Z = \frac{S - 1}{\sigma_s}$$
j) Tau a. Kendall's $\tau$ is a measure of association for ordinal data. Tau a assumes that there are no ties in the data, or that ties, if present, represent a measurement failure which is properly reflected by a reduced strength of relationship. Tau a can range from -1.0 to +1.0.
$$\tau_a = \frac{S}{\dfrac{N(N-1)}{2}}$$
k) Tau b. Tau b is like tau a except that ties are permitted, i.e. there may be more than one case in a given row or column of the bivariate table. Tau b can reach unity only when the number of rows equals the number of columns.
$$\tau_b = \frac{S}{\sqrt{\left[\dfrac{N(N-1)}{2} - T_1\right]\left[\dfrac{N(N-1)}{2} - T_2\right]}}$$
where
$$T_1 = \left[\sum_i f_{i\cdot}(f_{i\cdot}-1)\right] / \, 2$$
$$T_2 = \left[\sum_j f_{\cdot j}(f_{\cdot j}-1)\right] / \, 2$$
l) Tau c. Tau c (also known as Kendall-Stuart tau) is like tau b except that if the number of rows is not equal to the number of columns, tau b cannot attain the values ±1.0 while tau c can attain these values.
$$\tau_c = \frac{S}{\frac{1}{2} N^2 \, [(L-1)/L]}$$
where $L = \min(r, c)$.
m) Gamma. The Goodman-Kruskal $\gamma$ is another widely used measure of association that is closely related to Kendall's $\tau$. It can range from -1.0 to +1.0 and can be computed even though ties occur in the data.
$$\gamma = \frac{S}{S_+ + S_-}$$
where
$S = S_+ - S_-$
$S_+$ = the total number of pairs in like order
$S_-$ = the total number of pairs in unlike order.
n) Spearman's rho.
$$r_s = \frac{\sum x^2 + \sum y^2 - \sum d^2}{2\sqrt{\sum x^2 \sum y^2}}$$
where
$$\sum x^2 = \frac{N^3 - N}{12} - T_x$$
$$\sum y^2 = \frac{N^3 - N}{12} - T_y$$
$$\sum d^2 = \sum_k (X_k - Y_k)^2$$
$T_x$ = the sum of the T's for all rows with more than 1 case
$T_y$ = the sum of the T's for all columns with more than 1 case
$X_k$ = the rank of case k on the row variable
$Y_k$ = the rank of case k on the column variable.
Note that when more than one case occurs in a given row (or column), the value of the $X_k$'s (or $Y_k$'s) for the tied cases is the average of the ranks which would have been assigned if there had been no ties. For example, if there are 15 cases in the first row of a table, then those 15 cases would all be assigned a rank, i.e. X value, of 8.
o) Lambda symmetric. This lambda is a symmetric measure of the power to predict; it is appropriate when neither rows nor columns are specially designated as the thing predicted from, or known, first. Lambda has the range from 0 to 1.0.
$$\lambda_{sym} = \frac{\sum_i \max_j f_{ij} + \sum_j \max_i f_{ij} - \max_j f_{\cdot j} - \max_i f_{i\cdot}}{2N - \max_j f_{\cdot j} - \max_i f_{i\cdot}}$$
where
$f_{ij}$ = the observed frequency in cell ij
$\max_j f_{ij}$ = the largest frequency in row i
$\max_i f_{ij}$ = the largest frequency in column j
$\max_j f_{\cdot j}$ = the largest marginal frequency among the columns j
$\max_i f_{i\cdot}$ = the largest marginal frequency among the rows i.
p) Lambda A, row variable dependent. This lambda is appropriate when the row variable is the dependent variable. It is a measure of proportional reduction in the probability of error, when predicting the row variable, afforded by specifying the column category. The lambda row dependent has the range from 0 to 1.0.
$$\lambda_{rd} = \frac{\sum_j \max_i f_{ij} - \max_i f_{i\cdot}}{N - \max_i f_{i\cdot}}$$
See above for the definition of the terms in this formula.
q) Lambda B, column variable dependent. This lambda is appropriate when the column variable is the dependent variable. It has the range from 0 to 1.0.
$$\lambda_{cd} = \frac{\sum_i \max_j f_{ij} - \max_j f_{\cdot j}}{N - \max_j f_{\cdot j}}$$
See above for the definition of the terms in the formula.
r) Evidence Based Medicine (EBM) statistics. They are calculated for 2 x 2 tables where the first row represents frequencies of event (a) and no event (b) for cases in the treated group, and the second row represents frequencies of event (c) and no event (d) for cases in the control group.
The following statistics are calculated:
Experimental event rate
$$EER = a/(a+b)$$
Control event rate
$$CER = c/(c+d)$$
Absolute risk reduction (risk difference)
$$ARR = |CER - EER|$$
Relative risk reduction
$$RRR = ARR/CER$$
Number needed to treat
$$NNT = 1/ARR$$
Relative risk (risk ratio)
$$RR = EER/CER$$
and its 95% confidence interval
$$CI_{RR} = \exp\left[\ln(\text{estimator } RR) \pm 1.96\sqrt{T}\right]$$
where the estimated variance of ln(estimator RR) is
$$T = \frac{b/a}{a+b} + \frac{d/c}{c+d}$$
Relative odds (odds ratio)
$$OR = ad/bc$$
and its 95% confidence interval
$$CI_{OR} = \exp\left[\ln(\text{estimator } OR) \pm 1.96\sqrt{V}\right]$$
where the estimated variance of ln(estimator OR) is
$$V = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}$$
s) Fisher exact test. The Fisher exact probability test is a non-parametric technique for analyzing discrete data (either nominal or ordinal) from two independent samples. It is used when all the cases from two independent random samples fall into one or the other of two mutually exclusive categories. The test determines whether the two groups differ in the proportion with which they fall into the two classifications.
Probability of observed outcome is calculated as follows:
$$p = \frac{(a+b)! \, (c+d)! \, (a+c)! \, (b+d)!}{N! \, a! \, b! \, c! \, d!}$$
where a, b, c, d represent the frequencies in the four cells.
The TABLES program also gives both one-tailed and two-tailed exact probabilities, called "probability of outcome equal to or more extreme than observed" and "probability of outcome as extreme as observed in either direction" respectively.
t) Mann-Whitney test. The Mann-Whitney U test can be used to test whether two independent groups have been drawn from the same population. It is a useful alternative to the parametric t-test when the measurement is weaker than interval scaling. In the TABLES program it is required that the row variable be the dichotomous grouping variable.
Let
$n_1$ = the number of cases in the smaller of the two groups
$n_2$ = the number of cases in the second group
$R_1$ = sum of ranks assigned to group with $n_1$ cases
$R_2$ = sum of ranks assigned to group with $n_2$ cases.
Then
$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - R_2$$
and
$$U = \min(U_1, U_2)$$
If there are more than 10 cases in each group, the TABLES program provides Z approximation (normal approximation of U) calculated as follows:
$$Z = \frac{U - n_1 n_2 / 2}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$
u) Wilcoxon signed ranks test. The Wilcoxon test is a statistical test for two related samples; it uses information about both the direction and the relative magnitude of the differences within pairs of variables.
The sum of positive ranks, $T_+$, is obtained as follows:
The signed differences $d_k = x_k - y_k$ are calculated for all cases.
The differences $d_k$ are ranked without respect to their signs. The cases with zero $d_k$ are dropped. The tied $d_k$ are assigned the average of the tied ranks.
Each rank is affixed the sign (+ or -) of the d which it represents.
The Z approximation is calculated over the N retained cases as follows:
$$Z = \frac{T_+ - \mu_{T_+}}{\sigma_{T_+}}$$
where
$$\mu_{T_+} = \frac{N(N+1)}{4}$$
$$\sigma^2_{T_+} = \frac{N(N+1)(2N+1)}{24} - \frac{1}{2}\sum_{t=1}^{g} n_t (n_t - 1)(n_t - 2)$$
and
g = the number of groupings of different tied ranks
$n_t$ = the number of tied ranks in grouping t.
Note that the Z approximation is also adjusted for the tied ranks. This adjustment, however, produces no change in variance when there are no ties.
v) t-test. This t-ratio is appropriate for testing the difference between two independent means, i.e. two independent samples. The variance is pooled.
$$t = \frac{\bar{y}_i - \bar{y}_h}{\sqrt{\left(\dfrac{n_i s_i^2 + n_h s_h^2}{n_i + n_h - 2}\right)\left(\dfrac{n_i + n_h}{n_i n_h}\right)}}$$
where
$\bar{y}_i$ = the mean of the column variable for cases in row i
$\bar{y}_h$ = the mean of the column variable for cases in row h
$s_i^2$ = the sample variance of the column variable for cases in row i
$s_h^2$ = the sample variance of the column variable for cases in row h.
If t-tests are requested, sample standard deviations are calculated for the cases in each row as follows:
$$s_i = \sqrt{\frac{\sum y^2}{n_i} - \bar{y}_i^2}$$
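The ordinal measures of this section (S, the three tau statistics and gamma) can be sketched from a table of observed frequencies. This is an illustration under stated assumptions: the table is a list of rows whose categories are already in rank order, it contains at least one concordant or discordant pair, and the function name `ordinal_measures` is hypothetical.

```python
import math

def ordinal_measures(table):
    """Return S, tau a, tau b, tau c and gamma for an r x c frequency table."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    s_plus = s_minus = 0
    for i in range(r):
        for j in range(c):
            for h in range(i + 1, r):
                s_plus += table[i][j] * sum(table[h][j + 1:])   # below and right
                s_minus += table[i][j] * sum(table[h][:j])      # below and left
    s = s_plus - s_minus
    pairs = n * (n - 1) / 2
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    t1 = sum(f * (f - 1) for f in row_tot) / 2
    t2 = sum(f * (f - 1) for f in col_tot) / 2
    L = min(r, c)
    return {
        "S": s,
        "tau_a": s / pairs,
        "tau_b": s / math.sqrt((pairs - t1) * (pairs - t2)),
        "tau_c": s / (0.5 * n ** 2 * (L - 1) / L),
        "gamma": s / (s_plus + s_minus),
    }

result = ordinal_measures([[2, 0], [0, 2]])
print(result)
```

For the perfectly ordered table above, tau b, tau c and gamma all reach 1, while tau a stays below 1 because of the ties within rows and columns.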
57.3 Note on Weights
If bivariate statistics are requested and a weight variable is specified, a warning is printed and the statistics are computed using weighted values:
$$x_k = w_k x_k \qquad x_k^2 = w_k x_k^2$$
$$y_k = w_k y_k \qquad y_k^2 = w_k y_k^2$$
$$N = \sum_k w_k$$
$f_{ij}$ = the weighted frequency in cell ij.
Chapter 58
Typology and Ascending Classification
Notation
x = values of variables
k = subscript for case
v = subscript for variable
g, i, j = subscripts for groups
a = number of active variables (quantitative and dichotomized qualitative)
p = number of passive variables (quantitative and dichotomized qualitative)
t = number of initial groups
$N_i$ = number of cases in group i (weighted if the case weight is used)
$N_j$ = number of cases in group j (weighted if the case weight is used)
$\alpha$ = value of the variable weight
w = value of the case weight
W = total sum of case weights.
58.1 Types of Variables Used
The program accepts both quantitative and qualitative (categorical) variables, the latter being treated
as quantitative after full dichotomization of their respective categories, i.e. after the construction of as many
dichotomic (1/0) variables as the number of categories. The variables used by the program may be either
active or passive. The active variables are those on the basis of which the typology is constructed. The
passive variables do not participate in the construction of typology, but the program prints for them the
main statistics within the groups of typology.
A set of active variables is denoted here $X_a$, and a set of passive variables $X_p$.
58.2 Case Profile
The profile of the case k is a vector $P_k$ such that
$$P_k = (x_{k1}, x_{k2}, \ldots, x_{kv}, \ldots, x_{ka}) = (x_{kv})$$
where all $x_v \in X_a$.
If the active variables are requested to be standardized, the k-th case profile becomes
$$P_k = \left(\frac{x_{kv}}{s_v}\right)$$
where $s_v$ is the standard deviation of the variable $x_v$ (see 7.b below).
58.3 Group Profile
The profile of the group i, also called the barycenter of the group, is a vector $P_i$ such that
$$P_i = (\bar{x}_{i1}, \bar{x}_{i2}, \ldots, \bar{x}_{iv}, \ldots, \bar{x}_{ia}) = (\bar{x}_{iv})$$
and in the case of standardized data it becomes
$$P_i = \left(\frac{\bar{x}_{iv}}{s_v}\right)$$
where the numerator is the mean of the variable $x_v$ for the cases belonging to the group i and the denominator is the overall standard deviation of this variable.
58.4 Distances Used
There are three basic types of distances used in the program, namely: city block distance, Euclidean distance and Chi-square distance of Benzécri. They may be used to calculate distances between two cases, between a case and a group of cases and between two groups of cases. Below, these distances are defined as distances between two groups of cases (between two group profiles), but the other distances can easily be obtained by adapting the respective formulas. In these formulas $\alpha_v$ denotes the weight of the variable v.
a) City block distance.
$$d_{ij} = d(P_i, P_j) = \frac{\sum_{v=1}^{a} \alpha_v \, |\bar{x}_{iv} - \bar{x}_{jv}|}{\sum_{v=1}^{a} \alpha_v}$$
b) Euclidean distance.
$$d_{ij} = d(P_i, P_j) = \sqrt{\frac{\sum_{v=1}^{a} \alpha_v \, (\bar{x}_{iv} - \bar{x}_{jv})^2}{\sum_{v=1}^{a} \alpha_v}}$$
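c) Chi-square distance.
$$d_{ij} = d(P_i, P_j) = \sqrt{\sum_{v=1}^{a} \frac{1}{p_{\cdot v}} \left(\frac{p_{iv}}{p_{i\cdot}} - \frac{p_{jv}}{p_{j\cdot}}\right)^2}$$
where
$$p_{\cdot v} = \sum_{g=1}^{t} \bar{x}_{gv}, \qquad p_{i\cdot} = \sum_{v=1}^{a} \bar{x}_{iv}, \qquad p_{j\cdot} = \sum_{v=1}^{a} \bar{x}_{jv}$$
$$p_{iv} = \frac{\bar{x}_{iv}}{\sum_{g=1}^{t} \sum_{v=1}^{a} \bar{x}_{gv}}, \qquad p_{jv} = \frac{\bar{x}_{jv}}{\sum_{g=1}^{t} \sum_{v=1}^{a} \bar{x}_{gv}}$$
Moreover, the program provides a possibility of using a weighted distance, called displacement, which is defined as follows:
$$D_{ij} = D(P_i, P_j) = \frac{2 N_i N_j}{N_i + N_j} \, d_{ij}$$
Note that the displacement between two case profiles is equal to their distance since $N_i = N_j = 1$.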
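The city block distance, the Euclidean distance and the displacement can be sketched as below. The sketch takes profiles as plain lists of variable means and the variable weights as a list `alpha`; all function names are hypothetical, and the Chi-square distance is left out.

```python
import math

def city_block(p, q, alpha):
    """Weighted city block distance between two profiles."""
    return sum(a * abs(x - y) for a, x, y in zip(alpha, p, q)) / sum(alpha)

def euclidean(p, q, alpha):
    """Weighted Euclidean distance between two profiles."""
    return math.sqrt(
        sum(a * (x - y) ** 2 for a, x, y in zip(alpha, p, q)) / sum(alpha)
    )

def displacement(d, n_i, n_j):
    """Distance d weighted by the (possibly weighted) sizes of the two groups."""
    return 2.0 * n_i * n_j / (n_i + n_j) * d

p_i, p_j, alpha = [0.0, 0.0], [3.0, 4.0], [1.0, 1.0]
print(city_block(p_i, p_j, alpha), euclidean(p_i, p_j, alpha))
print(displacement(city_block(p_i, p_j, alpha), 2, 2))
```

With group sizes of 1 the displacement reduces to the distance itself, which matches the note on case profiles above.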
58.5 Building of an Initial Typology
a) Selection of an initial configuration. Before starting the process of aggregating the cases, the program selects the initial configuration, i.e. t initial group profiles, in one of the following ways:
- case profiles of t randomly selected cases (using random numbers) constitute the starting configuration; in order to obtain the initial configuration, the remaining cases are distributed into t groups as described below;
- case profiles of t cases selected in a stepwise manner constitute the starting configuration; in order to obtain the initial configuration, the remaining cases are distributed into t groups as described below;
- the initial configuration is a set of group profiles calculated for cases distributed across categories of a key variable;
- the initial configuration is a set of a priori group profiles provided by the user.
When the construction starts from t case profiles, the program considers this set of t vectors as a set of t starting cases and distributes the remaining cases according to their distance to each of the starting cases.
Let the set of t starting cases be denoted by
$$P^{starting} = \{P_{k_1}, P_{k_2}, \ldots, P_{k_t}\}$$
and the distance between groups and/or cases i and j by $D(P_i, P_j)$.
Note that $D(P_i, P_j)$ can be any distance defined in section 4 above.
For each case $i \notin P^{starting}$ the program calculates
$$\delta = \min_{1 \le j \le t} \left[D(P_i, P_{k_j})\right]$$
$$\Delta = \min \left[D(P_{k_1}, P_{k_2}), D(P_{k_1}, P_{k_3}), \ldots, D(P_{k_{t-1}}, P_{k_t})\right]$$
There are two possibilities:
- $\delta \le \Delta$: case i is assigned to the closest group $P_{k_j}$ and the profile of this group is recalculated
$$P_{k_j} = (P_{k_j} + P_i) / 2$$
- $\delta > \Delta$: case i forms a new group which is added to the set $P^{starting}$, and the two closest profiles $P_{k_j}$ and $P_{k_{j'}}$ are merged into a single profile
$$(P_{k_j} + P_{k_{j'}}) / 2$$
At the end of this procedure, the initial configuration is a set of t profiles
$$P^{initial} = \{P_1, P_2, \ldots, P_j, \ldots, P_t\}$$
where $P_j$ is the mean profile of all the cases belonging to the group j.
At this stage the program does not take into account weighting of cases, if any.
b) Stabilization of the initial configuration. The initial configuration is stabilized by an iteration process. During each iteration, the program redistributes the cases among initial groups taking into account their distances to each group profile.
Here again there are two possibilities:
- when case $i \in P_j$ and
$$D(P_i, P_j) = \min_{1 \le g \le t} \left[D(P_i, P_g)\right]$$
then this case remains in the group $P_j$;
- when case $i \in P_j$ but
$$D(P_i, P_{j'}) = \min_{1 \le g \le t} \left[D(P_i, P_g)\right]$$
then the case i is moved from the group $P_j$ to the group $P_{j'}$, and the profiles of those two groups are recalculated as follows:
$$P_j = (N_j P_j - P_i) / (N_j - 1)$$
$$P_{j'} = (N_{j'} P_{j'} + P_i) / (N_{j'} + 1)$$
After this operation, the group $P_j$ contains $N_j - 1$ cases and the group $P_{j'}$ contains $N_{j'} + 1$ cases.
Note that if the cases are weighted, then
$$N_j = N_j - w_i \qquad N_{j'} = N_{j'} + w_i \qquad P_i = w_i P_i$$
where $w_i$ is the weight of the case i, and $N_j$ and $N_{j'}$ are the weighted number of cases in the groups $P_j$ and $P_{j'}$ respectively.
Stability of groups is measured by the percentage of cases that do not change groups between two subsequent iterations.
The procedure is repeated until the groups are stabilized or until the number of iterations fixed by the user is reached.
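The stabilization loop can be sketched as follows. This is a simplified illustration, not the program's implementation: it assumes unweighted cases, an unweighted Euclidean distance, initial groups that are all non-empty, and it never moves the last remaining case of a group (a choice made here to avoid dividing by zero). The names `stabilize` and `dist` are hypothetical.

```python
import math

def dist(p, q):
    """Plain Euclidean distance between two profiles (assumed distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def stabilize(cases, groups, t, max_iter=10):
    """Move each case to its nearest group profile until nothing moves."""
    groups = list(groups)
    n = [groups.count(g) for g in range(t)]
    dim = len(cases[0])
    prof = [
        [sum(c[v] for c, g in zip(cases, groups) if g == j) / n[j] for v in range(dim)]
        for j in range(t)
    ]
    for _ in range(max_iter):
        moved = 0
        for k, case in enumerate(cases):
            j = groups[k]
            j2 = min(range(t), key=lambda g: dist(case, prof[g]))
            if j2 != j and n[j] > 1:
                # Incremental profile updates, as in the formulas above.
                prof[j] = [(n[j] * pv - cv) / (n[j] - 1) for pv, cv in zip(prof[j], case)]
                prof[j2] = [(n[j2] * pv + cv) / (n[j2] + 1) for pv, cv in zip(prof[j2], case)]
                n[j] -= 1
                n[j2] += 1
                groups[k] = j2
                moved += 1
        if moved == 0:               # all cases kept their group: stable
            break
    return groups, prof

groups_out, profiles = stabilize([[0.0], [1.0], [10.0], [11.0]], [0, 0, 0, 1], 2)
print(groups_out, profiles)
```

In the sample run the third case starts in the wrong group and is moved in the first iteration; the second iteration moves nothing, so the loop stops.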
58.6 Characteristics of Distances by Groups
a) N. The number of cases in each group of the initial typology.
b) Mean. Mean distance for each group, i.e. the mean of distances from the group prole over all cases
belonging to this group.
c) SD. Standard deviation of distance for each group.
d) Classication of distances. Distribution of cases, both in terms of frequency and percentages,
across 15 continuous intervals, which are dierent for each group.
e) Total count. Total number of cases participating in the building of the initial typology.
f ) Mean. Overall mean distance.
g) SD. Overall standard deviation of distance.
h) Classication of distances (same limits for each group). Same as 6.d above except that the
15 intervals are of the same range for all groups.
58.7 Summary Statistics for Quantitative Variables and for Qualitative Active Variables 407
58.7 Summary Statistics for Quantitative Variables and for Qualitative Active Variables
a) Mean. Mean of quantitative $x_v$ ($\in X_a \cup X_p$). For qualitative variable categories, it is a proportion of
cases in this category.
$$\bar{x}_v = \frac{\sum_k w_k x_{kv}}{W}$$
b) S. D. Standard deviation.
$$s_v = \sqrt{\frac{W \sum_k w_k x_{kv}^2 - \left( \sum_k w_k x_{kv} \right)^2}{W^2}}$$
c) Weight. The value of variable weight calculated for each variable as follows:
$$\alpha_v = \begin{cases}
0 & \text{for quantitative passive variables} \\
1 & \text{for quantitative active variables} \\
(c+1)/3c & \text{for categories of a qualitative active variable, where } c \text{ is the number of non-empty categories of the variable under consideration} \\
1 & \text{for categories of a qualitative active variable if Chi-square distance is used.}
\end{cases}$$
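The mean and standard deviation of 7.a and 7.b can be computed in one pass, as in the following Python sketch. It assumes that W is the sum of the case weights; the function name is hypothetical. A category of a qualitative variable can be passed as a 0/1 indicator, in which case the mean is the weighted proportion of cases in that category.

```python
from math import sqrt


def weighted_mean_sd(values, weights):
    """Weighted mean and standard deviation of one variable."""
    W = sum(weights)                                       # assumed: W = sum of w_k
    s1 = sum(w * x for w, x in zip(weights, values))       # sum of w_k * x_kv
    s2 = sum(w * x * x for w, x in zip(weights, values))   # sum of w_k * x_kv^2
    mean = s1 / W
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    sd = sqrt(max(W * s2 - s1 * s1, 0.0)) / W
    return mean, sd
```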
58.8 Description of Resulting Typology
At the end of the initial typology construction, and also at the end of each step of ascending classification,
all variables, i.e. active and passive, are evaluated by the amount of explained variance. It is a measure of
the discriminant power of each quantitative variable and each category of qualitative variables. This is followed
by an individual description of all groups of the typology.
a) Proportion of cases. Percentage, multiplied by 1000, of cases belonging to each group of the
typology.
b) Explained variance.
$$EV(x_v) = \frac{\sum_{i=1}^{t_g} N_i (\bar{x}_{iv} - \bar{x}_v)^2}{\sum_k w_k (x_{kv} - \bar{x}_v)^2} \times 1000$$
where
$t_g$ = number of groups in the typology
$\bar{x}_{iv}$ = mean of the variable $v$ in group $i$
$\bar{x}_v$ = grand mean of the variable $v$.
c) Grand mean.
For quantitative variables, mean values as described under 7.a above.
For each category of qualitative variables, percentage of cases in this category.
d) Statistics for each group of the typology.
For quantitative variables:
first line: mean values as described under 7.a above;
second line: standard deviations as described under 7.b above.
For each category of qualitative variables:
first line: column percentage of cases;
second line: row percentage of cases.
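The explained variance of 8.b is the ratio of between-group to total variation, scaled by 1000. A minimal Python sketch follows, assuming that N_i is the weighted number of cases in group i and that a variable with no variation at all is reported as 0; the function name is hypothetical.

```python
def explained_variance(values, weights, groups):
    """Explained variance of one variable for a given grouping, times 1000."""
    W = sum(weights)
    grand = sum(w * x for w, x in zip(weights, values)) / W

    size, total_in_group = {}, {}
    for x, w, g in zip(values, weights, groups):
        size[g] = size.get(g, 0.0) + w                       # N_i (weighted)
        total_in_group[g] = total_in_group.get(g, 0.0) + w * x

    between = sum(n * (total_in_group[g] / n - grand) ** 2
                  for g, n in size.items())
    total = sum(w * (x - grand) ** 2 for w, x in zip(weights, values))
    if total == 0:
        return 0.0          # assumption: a constant variable explains nothing
    return 1000.0 * between / total
```

A value of 1000 means that the groups separate the variable perfectly; a value of 0 means that all group means equal the grand mean.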
58.9 Summary of the Amount of Variance Explained by the Typology
Similarly to the description of the resulting typology, a summary table is printed at the end of the initial
typology construction and at the end of each step of ascending classification.
a) Variables explaining 80% of the variance. List of the most discriminating variables, i.e. those
variables which taken altogether are responsible for at least 80% of the explained variance, together
with the amount of variance explained by each of them individually (see 8.b above).
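b) Mean variance explained by active variables.
$$EV_{active} = \frac{\sum_{v=1}^{a} \alpha_v \, EV(x_v)}{\sum_{v=1}^{a} \alpha_v}$$
where $\alpha_v$ is the weight of the variable $v$ (see 7.c above).
c) Mean variance explained by all variables.
$$EV_{all} = \frac{\sum_{v=1}^{a+p} \alpha_v \, EV(x_v)}{\sum_{v=1}^{a+p} \alpha_v}$$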
d) Mean variance explained by the variables which explain 80% of the total variance. After
each regrouping, the program looks for variables which explain at least 80% of the total variance (see
9.a above) and prints mean variance explained by those variables before and after regrouping, and the
percentage of such variables.
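The summary quantities of this section can be sketched in Python as follows. The sketch assumes that "80% of the explained variance" refers to the sum of the individual explained variances over all variables, and that variables are taken in descending order of explained variance until that share is reached; both function names are hypothetical.

```python
def mean_explained_variance(ev, var_weights):
    """Mean of the explained variances, weighted by the variable weights."""
    return sum(a * e for a, e in zip(var_weights, ev)) / sum(var_weights)


def most_discriminating(ev_by_name, share=0.80):
    """Variables that together account for at least `share` of the summed EV."""
    total = sum(ev_by_name.values())
    picked, running = [], 0.0
    ranked = sorted(ev_by_name.items(), key=lambda item: item[1], reverse=True)
    for name, ev in ranked:
        if running >= share * total:
            break
        picked.append(name)
        running += ev
    return picked
```

For example, with explained variances of 500, 350, 100 and 50 for four variables, the first two already account for 85% of the sum, so only those two are listed.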
58.10 Hierarchical Ascending Classification
After creation of the initial typology, the program performs a sequence of regroupings, reducing the
number of groups one by one from the initial number down to the number specified by the user. At each regrouping,
the program selects the two closest groups, i.e. the two groups with the smallest distance or displacement (see
section 4 above), and calculates the profile for this new group.
a) Group i + j. Profile of the new group, printed for up to 15 active variables in descending order of
their deviation (see 10.d below). Note that if there are fewer than 15 active variables, or fewer than 15
variables with valid cases in aggregated groups, the program completes the list using passive variables.
b) Group i. Profile of the group i, printed for the same variables as above.
c) Group j. Profile of the group j, printed for the same variables as above.
d) Dev. Absolute value of the difference between profiles of groups i and j, printed for the same variables
as above.
$$Dev(x_v) = |\bar{x}_{iv} - \bar{x}_{jv}|$$
e) Weighted deviation. Deviation weighted by the variable weight and the variable standard deviation,
printed for the same variables as above.
$$WDev(x_v) = Dev(x_v) \, \frac{\alpha_v}{s_v}$$
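The table printed at each regrouping can be illustrated with the Python sketch below. The merged profile is taken here as the size-weighted average of the two group profiles, which is an assumption in line with the profile update used during stabilization; the weighted deviation is read as the deviation multiplied by the variable weight and divided by the variable standard deviation. The function name and the dictionary keys are hypothetical.

```python
def regrouping_table(profile_i, n_i, profile_j, n_j, var_weights, sds, top=15):
    """Rows for the merge of groups i and j, sorted by descending deviation."""
    rows = []
    for v, (xi, xj) in enumerate(zip(profile_i, profile_j)):
        dev = abs(xi - xj)                          # Dev(x_v)
        rows.append({
            "variable": v,
            "group_i_plus_j": (n_i * xi + n_j * xj) / (n_i + n_j),  # assumed form
            "group_i": xi,
            "group_j": xj,
            "dev": dev,
            "wdev": dev * var_weights[v] / sds[v],  # WDev(x_v)
        })
    rows.sort(key=lambda r: r["dev"], reverse=True)
    return rows[:top]
```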
58.11 References
Aimetti, J.P., SYSTIT: Programme de classification automatique, GSIE-CFRO, Paris, 1978.
Diday, E., Optimisation en classification automatique, RAIRO, Vol. 3, 1972.
Hall & Ball, A clustering technique for summarizing multivariate data, Behavioral Sciences, Vol. 12, No 2,
1967.
Appendix
Error Messages From IDAMS Programs
Overview
An effort has been made to make the error messages self-explanatory. Thus this Appendix essentially
describes the coding scheme used for error messages.
Errors and Warnings
Errors (E) always cause termination of IDAMS program execution, while warnings (W) alert the user to
possible abnormalities in the data and/or in the control statements, and also to possible misinterpretation
of results. Error and warning messages have the following format:
***E* aaannn text of error message
***W* aaannn text of warning message
where
nnn is a three-digit number, starting from 001 for warnings and from 101 for errors;
aaa indicates where the message comes from, according to the following rules:
Messages from programs: the first letter of the program name followed by the next two consonants in
the program name.
Messages from subroutines:
SYN general syntax errors;
RCD Recode (syntax) errors and warnings;
DTM data and dictionary errors, and warnings about data and dictionary files;
SYS errors and warnings from the Monitor;
FLM file management errors and warnings.
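The coding scheme above lends itself to simple automatic processing of a results file. The following Python sketch is only an illustration: the function names are hypothetical, the regular expression assumes exactly the layout shown above, and treating the letter Y as a consonant when deriving a program prefix is an assumption.

```python
import re

# Layout assumed from the format above: ***E* aaannn text
MESSAGE = re.compile(r"^\*\*\*([EW])\*\s+([A-Z]{3})(\d{3})\s+(.*)$")


def parse_message(line):
    """Split an IDAMS error or warning line into its parts, or return None."""
    m = MESSAGE.match(line.strip())
    if m is None:
        return None
    kind, origin, number, text = m.groups()
    return {"kind": kind, "origin": origin, "number": int(number), "text": text}


def program_prefix(program_name):
    """First letter of the program name plus the next two consonants."""
    name = program_name.upper()
    consonants = [c for c in name[1:] if c.isalpha() and c not in "AEIOU"]
    return name[0] + "".join(consonants[:2])
```

For instance, under these assumptions a message coming from the TABLES program would carry the prefix TBL.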
Fortran Run-Time Error Messages
When errors occur during execution (run time) of a program, the Visual Fortran run-time library (RTL) issues
diagnostic messages. They have the following format:
forrtl: severity (number): text
forrtl Identifies the source as the Visual Fortran RTL.
severity The severity levels are: severe (must be corrected), error (should be corrected), warning
(should be investigated), or info (for informational purposes only).
number This is the message number, also the IOSTAT value for I/O statements.
text Explains the event that caused the message.
The run-time error messages are self-explanatory and thus they are not listed here.
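Run-time messages in this format can be separated into their fields in the same way. The Python sketch below assumes only the layout and the four severity levels given above; the function name and the sample message used in the comment are illustrative.

```python
import re

# Layout assumed from the format above: forrtl: severity (number): text
FORRTL = re.compile(r"^forrtl:\s+(severe|error|warning|info)\s+\((\d+)\):\s+(.*)$")


def parse_forrtl(line):
    """Return severity, message number and text of a run-time message, or None."""
    m = FORRTL.match(line.strip())
    if m is None:
        return None
    severity, number, text = m.groups()
    return {"severity": severity, "number": int(number), "text": text}


# Illustrative call on a sample line:
# parse_forrtl("forrtl: severe (24): end-of-file during read")
```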
Index
aggregation of data, 45, 50, 97
alphabetic variables, 13
analysis
of correspondences, 193
of time series, 311, 315
of variance, 217, 231, 359, 371
analysis of variance
multivariate, 225
auto-correlation, 315
auto-regression, 315
binary splits, 261, 389, 391, 392
bivariate
statistics, 269, 294, 396
output by TABLES, 272
tables, 269, 293
graphical presentation, 294
output by TABLES, 272
blanks, 13
detection, 112
recoding, 29, 103
box and whisker plots, 307
C-records, 15
listing, 143
use in data validation, 109
case
creating several cases from one, 49
deletion, 127, 159
identification (ID)
correction, 127
listing, 127, 143, 163
principal, 193, 344
selection
with filter, 25
with Recode, 49
size limitations, 12
specifying number of records per case, 14
supplementary, 193, 346
categorical variables
in regression, 201
checking
codes, 58, 109
consistency, 59, 115
data structure, 58, 119
range of values, 58, 109
sort order, 159
chi-square
distance, 285, 404
test, 269, 294, 396
city block distance, 174, 215, 285, 320, 357, 404
classification of objects
based on fuzzy logic, 172, 322
based on hierarchical clustering, 172, 323, 324
based on partitioning, 171, 320, 322
cluster analysis, 171, 319
code
checking, 58, 109
labels, 15
coefficients
B, 203, 244, 257, 350, 378, 388
beta, 203, 219, 350, 361
constant term, 203, 244, 257, 350, 378, 388
eta, 219, 232, 361, 372
Gini, 189, 336
multiple correlation, 203, 349
of variation, 203, 219, 232, 269, 347, 359, 360,
371, 396
partial correlation, 203, 348
Pearson r, 243, 377
comments in IDAMS setup, 22
condition code
checking between programs, 21
setting for control statements errors, 21
configuration
analysis, 177, 327
centering, 327, 353
matrix, 327, 353, 356
input to CONFIG, 178
input to MDSCAL, 214
input to TYPOL, 284
output by CONFIG, 178
output by MDSCAL, 213
output by TYPOL, 283
normalization, 327, 353
projection, 178
rotation, 177, 327
transformation, 177, 328
varimax rotation, 178, 328
consistency checking, 59, 115
contingency
coefficient, 269, 294, 397
tables, 269
continuation line
control statements, 25
Recode statements, 33
control statements, 24
filter, 25
label, 26
parameters, 27
rules for coding, 25
copying
datasets, 159
correcting
case ID, 127
data, 58, 88, 127
dictionary, 86
variables, 127
correlation
analysis, 243, 377
coefficients, 243, 377
matrix, 341, 348, 378
input to CLUSFIND, 172
input to MDSCAL, 213
input to REGRESSN, 204
output by PEARSON, 244
output by REGRESSN, 202, 203
partial, 203, 348
correspondence analysis, 193
covariance matrix, 341, 378
output by PEARSON, 245
Cramér's V, 269, 294, 397
cross-spectrum, 316
crosstabulations, 269
data
aggregation, 97
correction, 58, 88, 127
editing, 14, 57, 103
entry, 88
export
in DIF format, 134
in free format, 90, 134
format in IDAMS, 12
import, 19
in DIF format, 135
in free format, 89, 135
in the input stream, 22
listing, 143
recoding, 59
sorting, 88
structure checking, 58, 119
transformation, 59, 163
validation, 57, 109, 115, 119
dataset
building, 103
copying, 159
definition in IDAMS, 11
merging, 147
subsetting, 159
ddname, 23
for dictionary and data files, 30
deciles, 189, 271, 335, 396
decimal places, specification, 15
defaults in IDAMS parameters, 27
deleting
cases, 127, 159, 163
variables, 159, 163
densities, 305
descriptive statistics, 97, 98, 194, 257, 269, 291, 292,
339, 387, 395
dictionary, 14
code label (C-records), 15
copying, 159
creation, 86, 103
descriptor record, 14
example, 16
in the input stream, 22
listing, 143
variable descriptor (T-record), 14
verification, 86
discriminant
analysis, 183, 331
factor analysis, 184, 333
function, 183, 332
distance
chi-square, 285, 404
city block, 174, 215, 285, 320, 357, 404
Euclidean, 174, 211, 215, 285, 320, 356, 404
Mahalanobis, 183, 332
distribution
frequencies, 269
function, 189, 335
dummy variables
creation with Recode, 46
used in regression, 201
duplicate
cases, deletion, 159, 161
records, detection and deletion, 120
Durbin-Watson (test), 203, 351
EBM statistics, 269, 400
editing
data, 57
non-numeric data values, 29, 103
text files, 93
eigenvalues, 341
eigenvectors, 341
ELECTRE ranking method, 249
error messages, 411
Euclidean distance, 174, 211, 215, 285, 320, 356, 404
export
of data, 90, 133
of datasets, 6
of matrices, 6, 133
of multidimensional tables, 294
F-test, 203, 219, 232, 349, 372
factor analysis, 184, 193, 333, 339
files
data file, 79
dictionary file, 79
matrix file, 79
merging, 147, 155
names, 79
results file, 79
setup file, 79
size limitations for IDAMS, 12
sorting, 155
specifying in IDAMS, 22
system files, 80
permanent, 80
temporary, 80
used in WinIDAMS, 79
user files, 79
filter
control statement, 25
local
in ONEWAY, 234
in QUANTILE, 192
in SCAT, 260
in TABLES, 274
placement, 25
rules for coding, 25
syntax verification, 91
with R-variables, 49
Fisher
exact test, 269, 400
F-test, 203, 219, 232, 349, 372
folders
default folders, 80
used in WinIDAMS, 80
frequency distributions, 269, 291
frequency lters, 316
fuzzy logic
classification of objects, 172, 322
ranking of alternatives, 249, 384, 385
gamma (statistic), 269, 294, 398
Gini (coefficient), 189, 336
graphical exploration of data, 301
grouping data cases, 97
hierarchical clustering
agglomerative, 172, 323
based on dichotomic variables, 172, 324
divisive, 172, 324
histograms, 305, 315
IDAMS
control statements, 24
dataset, 11
building, 103
dictionary, 14
error messages, 411
execution of programs, 92
matrix, 16
export, 133
import, 133
results handling, 92
setup, 21
preparation, 90
verification, 91
IDAMS commands, 21
$CHECK, 21
$COMMENT, 22
$DATA, 22
$DICT, 22
$FILES, 22
$MATRIX, 22
$PRINT, 22
$RECODE, 22
$RUN, 22
$SETUP, 23
import
of data, 133
of data files, 89
of datasets, 6
of matrices, 6, 133
interaction
definition, 217
detection and treatment, 217
inverse matrix, 203, 348
Kaiser criterion, 197
Kendall's taus, 269, 294, 398
keywords
for common parameters, 29
rules for coding, 28
types, 27
Kolmogorov-Smirnov (D test), 189, 192, 336
kurtosis, 340, 396
label
control statement, 26
for code categories, 15
for variables, 15
placement, 26
rules for coding, 27
lambda statistics, 269, 294, 399
listing
cases, 127, 143
data, 143, 163
dictionary, 143
Lorenz
curve, 336
function, 189, 336
Mahalanobis distance, 183, 332
Mann-Whitney (test), 269, 401
marginal distributions, 269
matrix
export (free format), 134
import (free format), 135
in the input stream, 22
inverse, 203, 348
of correlations, 341, 348, 378
input to CLUSFIND, 172
input to MDSCAL, 213
input to REGRESSN, 204
output by PEARSON, 244
output by REGRESSN, 202, 203
of covariances, 341, 378
output by PEARSON, 245
of cross-products, 203, 244, 347, 348, 378
of dissimilarities, 171, 320
input to CLUSFIND, 172
input to MDSCAL, 213
of distances, 178, 328
output by CONFIG, 178
of partial correlations, 203, 348
of relations, 193, 194, 249, 340, 382, 383
of scalar products, 178, 328, 341
of similarities
input to CLUSFIND, 172
input to MDSCAL, 213
of statistics, 269
output by TABLES, 272
of sums of squares, 203, 347, 348
projection, 308
rectangular, 18
square, 16
vector of means and SDs, 18
mean, 319, 331, 339, 347, 359, 360, 365, 371, 377,
378, 387, 395, 407
merging
datasets, 147
at different levels, 147
at the same level, 147
files, 155
Minkowski r-metric, 211, 356
missing data
case-wise deletion
in PEARSON, 243
in REGRESSN, 202
checking for with Recode, 45
codes
assignment by Recode, 50
specification, 13, 15
definition, 13
handling by Recode, 34
pair-wise deletion
in PEARSON, 243
to be used for checking, 30
multidimensional scaling, 211, 353
multidimensional tables, 293
multiple classification analysis, 217
multivariate analysis of variance, 225
n-tiles, 189, 271, 335, 396
non-numeric data values, 13
detection, 103
editing, 29, 103
non-parametric tests
Fisher (exact), 269, 400
Mann-Whitney, 269, 401
Wilcoxon (signed ranks), 269, 401
normalization
of configuration, 327, 353
of relation matrix, 249, 384
numeric variables, 103
coding rules, 12
outliers
definition, 222, 264
detection and elimination, 222
identification and printing, 262
parameters
common
BADDATA, 29
INFILE, 30
MAXCASES, 30
MDVALUES, 30
OUTFILE, 30
VARS, 30
WEIGHT, 30
default values, 27
parameter statements, 27
placement, 27
presentation in the Manual, 27
rules for coding, 28
types of keyword, 27
partial
correlation coefficients, 203, 348
order scoring, 235, 373
partitioning around medoids, 171, 320, 322
Pearson (correlation coefficient r), 243, 377, 388
Phi (statistic), 294
plotting scattergrams, 257
preference
data
example, 251
types of, 249, 379
strict, 250
weak, 250
principal components factor analysis, 193
printing IDAMS setup, 22
quantiles, 189, 271, 335, 396
random values
generation by Recode, 41
ranking analysis, 249, 379
classical logic, 249, 380
fuzzy logic, 249, 384, 385
Recode
accessing the Recode facility, 22
arithmetic functions, 36
constants
character, 35
numeric, 35
continuation line, 33
elements of language, 35
expressions, 36
arithmetic, 36
logical, 36
format of statements, 33
initialization of variable values, 34
logical functions, 44
missing data handling, 34
operands, 35
operators
arithmetic, 35
logical, 36
relational, 36
restrictions, 54
statements, 45
syntax verification, 91
testing, 34
V- and R-variables, 35
Recode, arithmetic functions
ABS, 37
BRAC, 37
COMBINE, 38
COUNT, 39
LOG, 39
MAX, 39
MD1, MD2, 40
MEAN, 40
MIN, 40
NMISS, 40
NVALID, 41
RAND, 41
RECODE, 41
SELECT, 42
SQRT, 42
STD, 43
SUM, 43
TABLE, 43
TRUNC, 44
VAR, 44
Recode, logical functions
EOF, 45
INLIST, 45
MDATA, 45
Recode, statements
assignment, 45
BRANCH, 48
CARRY, 50
CONTINUE, 48
DUMMY, 46
ENDFILE, 48
ERROR, 48
GO TO, 48
IF, 49
MDCODES, 50
NAME, 51
REJECT, 49
RELEASE, 49
RETURN, 49
SELECT, 47
recoding data, 31, 33, 59
example, 33, 51, 60
saving recoded variables, 163
record
duplicate record detection and deletion, 120
invalid record deletion, 119
missing record detection and padding, 120
regression, 201, 244, 257, 347, 378, 388
descending stepwise, 201, 352
lines, 306
multiple linear, 201, 347
stepwise, 201, 351
with categorical variables, 201, 206, 217
with dummy variables, 201, 206
with zero intercept, 352
repetition factor
in TABLES, 274
residuals, 351, 362, 391-393
output by MCA, 217, 219
output by REGRESSN, 202, 204
output by SEARCH, 261, 262
rotation of configuration, 177, 327
saving recoded variables, 163
scaling analysis, 211, 353
scatter plots, 257
3-dimensional, 308
grouped plot, 307
manipulation, 304
rotation, 308
scores
calculated by FACTOR, 194, 345, 346
calculated by POSCOR, 236, 375
scoring analysis, 235, 373
segmentation analysis, 261, 389
selecting cases with filter, 25
skewness, 340, 396
Somers' D, 294
sort order checking, 129, 159
sorting files, 88, 155
spatial analysis, 177, 327
Spearman's rho, 269, 398
spectrum, 315
standard deviation, 331, 339, 347, 359, 360, 371, 377,
378, 387, 388, 396, 407
standardization
of measurements, 171, 319
of variables, 404
Student (t-test), 269, 402
subset specifications
in POSCOR, 239
in QUANTILE, 191
in TABLES, 274
subsetting
cases, 25
datasets, 159
T-records, 14
t-tests of means, 269, 402
tau statistics, 269, 294, 398
test
chi-square, 269, 294, 396
D of Kolmogorov-Smirnov, 189, 192, 336
Durbin-Watson, 203, 351
Fisher (exact), 269, 400
Fisher F, 203, 219, 232, 349, 372
Mann-Whitney, 269, 401
t of Student, 269, 402
Wilcoxon (signed ranks), 269, 401
testing
program control statements, 30
recode statements, 34
time series
analysis, 311
transformation, 314
transformation
of configuration, 177, 328
of data, 59, 163
of time series, 314
trend estimation, 315
univariate
statistics, 97, 98, 194, 203, 257, 269, 291, 292,
305, 315, 339, 387, 395
tables, 269, 293
graphical presentation, 294
output by TABLES, 272
validation of data, 57, 109
variable
active, 281, 403
aggregated, 97, 98
alphabetic, 13
correction, 127
decimal, 12
descriptor record, 14
dummy, 46
name, 15, 51
number, 12, 15
numeric, 12
coding rules, 12
editing, 14, 103, 105
passive, 281, 403
principal, 193, 342
reference number, 15
supplementary, 193, 343
type, 15
variable list
rules for coding, 30
variance analysis, 231, 371
varimax rotation
of configuration, 178, 328
of factors, 194, 346
weighting data, 30
Wilcoxon (signed ranks test), 269, 401
WinIDAMS
files, 79
folders, 80
User Interface
customization of environment, 83