QualityStage User Guide
Version 7.0
August 2003
Important Notice
This document, and the software described or referenced in it, are confidential and proprietary to Ascential Software
Corporation ("Ascential"). They are provided under, and are subject to, the terms and conditions of a license agreement between
Ascential and the licensee, and may not be transferred, disclosed, or otherwise provided to third parties, unless otherwise
permitted by that agreement. No portion of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of
Ascential. The specifications and other information contained in this document for some purposes may not be complete, current,
or correct, and are subject to change without notice. NO REPRESENTATION OR OTHER AFFIRMATION OF FACT
CONTAINED IN THIS DOCUMENT, INCLUDING WITHOUT LIMITATION STATEMENTS REGARDING CAPACITY,
PERFORMANCE, OR SUITABILITY FOR USE OF PRODUCTS OR SOFTWARE DESCRIBED HEREIN, SHALL BE
DEEMED TO BE A WARRANTY BY ASCENTIAL FOR ANY PURPOSE OR GIVE RISE TO ANY LIABILITY OF ASCENTIAL
WHATSOEVER. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL ASCENTIAL BE LIABLE FOR
ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER
RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE. If you are acquiring this software on behalf of the U.S. government, the Government shall have only "Restricted
Rights" in the software and related documentation as defined in the Federal Acquisition Regulations (FARs) in Clause 52.227.19
(c) (2). If you are acquiring the software on behalf of the Department of Defense, the software shall be classified as "Commercial
Computer Software" and the Government shall have only "Restricted Rights" as defined in Clause 252.227-7013 (c) (1) of DFARs.
© 2003, 1999-2002 Ascential Software Corporation. All rights reserved.
QualityStage, QualityStage Designer, QualityStage Real Time, QualityStage SERP, QualityStage DPID Interface Solution for
ATLAS, QualityStage GeoLocator, QualityStage WAVES, QualityStage CASS, and QualityStage Z4Change are trademarks of
Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions.
Adobe and Acrobat are trademarks of Adobe Systems Incorporated.
Data Warehouse Center is a trademark; ISPF/PDF, TSO, IBM, and MVS are registered trademarks of International
Business Machines Corporation.
UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Ltd.
Windows and Windows NT are trademarks of Microsoft Corporation.
Winsock REXECD/NT is copyrighted by Denicomp Systems.
Other marks are the property of the owners of those marks.
Published by Ascential Software.
This Product may contain or utilize third-party components subject to the following (as applicable):
Copyright (c) 1995-2000 by the Hypersonic SQL Group. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
conditions are met: 1) Redistributions of source code must retain the above copyright notice, this list of conditions and the
following disclaimer. 2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions
and the following disclaimer in the documentation and/or other materials provided with the distribution. 3) All advertising
materials mentioning features or use of this software must display the following acknowledgment: "This product includes
Hypersonic SQL." 4) Products derived from this software may not be called "Hypersonic SQL" nor may "Hypersonic SQL"
appear in their names without prior written permission of the Hypersonic SQL Group. 5) Redistributions of any form
whatsoever must retain the following acknowledgment: "This product includes Hypersonic SQL." This software is provided
"as is" and any expressed or implied warranties, including, but not limited to, the implied warranties of merchantability
and fitness for a particular purpose are disclaimed. In no event shall the Hypersonic SQL Group or its contributors be
liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to,
procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on
any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out
of the use of this software, even if advised of the possibility of such damage. This software consists of voluntary
contributions made by many individuals on behalf of the Hypersonic SQL Group.
Copyright © 2002 Sun Microsystems, Inc. All rights reserved. Redistribution and use in source and binary forms, with or
without modification, are permitted provided that the following conditions are met: 1. Redistribution of source code must
retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistribution in binary form
must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or
other materials provided with the distribution. Neither the name of Sun Microsystems, Inc. nor the names of contributors
may be used to endorse or promote products derived from this software without specific prior written permission. You
acknowledge that this software is not designed, licensed or intended for use in the design, construction, operation or
maintenance of any nuclear facility.
This product includes software developed by the Apache Software Foundation (http://www.apache.org/). Copyright ©
1999-2000 The Apache Software Foundation. All rights reserved. Redistribution and use in source and binary forms, with
or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code
must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary
form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution. 3. The end-user documentation included with the redistribution, if
any, must include the following acknowledgment: "This product includes software developed by the Apache Software
Foundation (http://www.apache.org/)." Alternately, this acknowledgment may appear in the software itself, if and wherever
such third-party acknowledgments normally appear. 4. The names "Xerces" and "Apache Software Foundation" must not
be used to endorse or promote products derived from this software without prior written permission. For written
permission, please contact apache@apache.org. 5. Products derived from this software may not be called "Apache", nor may
"Apache" appear in their name, without prior written permission of the Apache Software Foundation.
TCL/TK License Terms. This software is copyrighted by the Regents of the University of California, Sun Microsystems,
Inc., Scriptics Corporation, and other parties. The following terms apply to all files associated with the software unless
explicitly disclaimed in individual files. The authors hereby grant permission to use, copy, modify, distribute, and license
this software and its documentation for any purpose, provided that existing copyright notices are retained in all copies and
that this notice is included verbatim in any distributions. No written agreement, license, or royalty fee is required for any
of the authorized uses. Modifications to this software may be copyrighted by their authors and need not follow the licensing
terms described here, provided that the new terms are clearly indicated on the first page of each file where they apply. IN
NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT,
SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, ITS
DOCUMENTATION, OR ANY DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE. THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE IS PROVIDED ON AN "AS
IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE NO OBLIGATION TO PROVIDE MAINTENANCE,
SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. GOVERNMENT USE: If you are acquiring this
software on behalf of the U.S. government, the Government shall have only "Restricted Rights" in the software and related
documentation as defined in the Federal Acquisition Regulations (FARs) in Clause 52.227-19 (c) (2). If you are acquiring
the software on behalf of the Department of Defense, the software shall be classified as "Commercial Computer Software"
and the Government shall have only "Restricted Rights" as defined in Clause 252.227-7013 (c) (1) of DFARs.
Notwithstanding the foregoing, the authors grant the U.S. Government and others acting in its behalf permission to use
and distribute the software in accordance with the terms specified in this license.
Copyright © 1997-1998 DUNDAS SOFTWARE LTD., all rights reserved.
Copyright © 2001 Ironring Software (http://www.ironringsoftware.com).
Copyright © 1987 Regents of the University of California. All rights reserved.
Copyright © 1996, 2000, 2001, Nara Institute of Science and Technology. All rights reserved. Redistribution and use in
source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1.
Redistribution of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2.
Redistribution in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the distribution. 3. All advertising materials mentioning
features or use of this software must display the following acknowledgments: This product includes software developed by
Nara Institute of Science and Technology. 4. The name Nara Institute of Science and Technology may not be used to endorse
or promote products derived from this software without specific prior written permission.
ANTLR 1989-2000 Developed by jGuru.com (MageLang Institute), http://www.ANTLR.org and http://www.jGuru.com.
LAPACK Users’ Guide, 3rd Edition, Society for Industrial and Applied Mathematics.
Copyright 1990, by Alfalfa Software Incorporated, Cambridge, Massachusetts. All rights reserved. Permission to use,
copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice
appear in supporting documentation, and that Alfalfa’s name not be used in advertising or publicity pertaining to
distribution of the software without specific, written permission.
ICU License - ICU 1.8.1 and later
COPYRIGHT AND PERMISSION NOTICE
Copyright (c) 1995-2002 International Business Machines Corporation and others
All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy,
modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is
furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the
Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT
OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR
PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote
the sale, use or other dealings in this Software without prior written authorization of the copyright holder.
All trademarks and registered trademarks mentioned herein are the property of their respective owners.
Jetty License Revision: 3.7
Preamble:
The intent of this document is to state the conditions under which the Jetty Package may be copied, such that the
Copyright Holder maintains some semblance of control over the development of the package, while giving the users of the
package the right to use, distribute and make reasonable modifications to the Package in accordance with the goals and
ideals of the Open Source concept as described at http://www.opensource.org.
It is the intent of this license to allow commercial usage of the Jetty package, so long as the source code is distributed or
suitable visible credit given or other arrangements made with the copyright holders. Additional information available at
http://jetty.mortbay.org
Definitions:
"Jetty" refers to the collection of Java classes that are distributed as a HTTP server with servlet capabilities and associated
utilities.
"Package" refers to the collection of files distributed by the Copyright Holder, and derivatives of that collection of files
created through textual modification.
"Standard Version" refers to such a Package if it has not been modified, or has been modified in accordance with the wishes
of the Copyright Holder.
"Copyright Holder" is whoever is named in the copyright or copyrights for the package.
Mort Bay Consulting Pty. Ltd. (Australia) is the "Copyright Holder" for the Jetty package.
"You" is you, if you're thinking about copying or distributing this Package.
"Reasonable copying fee" is whatever you can justify on the basis of media cost, duplication charges, time of people
involved, and so on. (You will not be required to justify it to the Copyright Holder, but only to the computing community at
large as a market that must bear the fee.)
"Freely Available" means that no fee is charged for the item itself, though there may be fees involved in handling the item.
It also means that recipients of the item may redistribute it under the same conditions they received it.
0. The Jetty Package is Copyright (c) Mort Bay Consulting Pty. Ltd. (Australia) and others. Individual files in this package
may contain additional copyright notices. The javax.servlet packages are copyright Sun Microsystems Inc.
1. The Standard Version of the Jetty package is available from http://jetty.mortbay.org.
2. You may make and distribute verbatim copies of the source form of the Standard Version of this Package without
restriction, provided that you include this license and all of the original copyright notices and associated disclaimers.
3. You may make and distribute verbatim copies of the compiled form of the Standard Version of this Package without
restriction, provided that you include this license.
4. You may apply bug fixes, portability fixes and other modifications derived from the Public Domain or from the Copyright
Holder. A Package modified in such a way shall still be considered the Standard Version.
5. You may otherwise modify your copy of this Package in any way, provided that you insert a prominent notice in each
changed file stating how and when you changed that file, and provided that you do at least ONE of the following:
a) Place your modifications in the Public Domain or otherwise make them Freely Available, such as by posting said
modifications to Usenet or an equivalent medium, or placing the modifications on a major archive site such as ftp.uu.net, or
by allowing the Copyright Holder to include your modifications in the Standard Version of the Package.
b) Use the modified Package only within your corporation or organization.
c) Rename any non-standard classes so the names do not conflict with standard classes, which must also be provided, and
provide a separate manual page for each non-standard class that clearly documents how it differs from the Standard
Version.
d) Make other arrangements with the Copyright Holder.
6. You may distribute modifications or subsets of this Package in source code or compiled form, provided that you do at
least ONE of the following:
a) Distribute this license and all original copyright messages, together with instructions (in the about dialog, manual page
or equivalent) on where to get the complete Standard Version.
b) Accompany the distribution with the machine-readable source of the Package with your modifications. The modified
package must include this license and all of the original copyright notices and associated disclaimers, together with
instructions on where to get the complete Standard Version.
c) Make other arrangements with the Copyright Holder.
7. You may charge a reasonable copying fee for any distribution of this Package. You may charge any fee you choose for
support of this Package. You may not charge a fee for this Package itself. However, you may distribute this Package in
aggregate with other (possibly commercial) programs as part of a larger (possibly commercial) software distribution
provided that you meet the other distribution requirements of this license.
8. Input to or the output produced from the programs of this Package do not automatically fall under the copyright of this
Package, but belong to whomever generated them, and may be sold commercially, and may be aggregated with this
Package.
9. Any program subroutines supplied by you and linked into this Package shall not be considered part of this Package.
10. The name of the Copyright Holder may not be used to endorse or promote products derived from this software without
specific prior written permission.
11. This license may change with each release of a Standard Version of the Package. You may choose to use the license
associated with the version you are using or the license of the latest Standard Version.
12. THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES,
INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE.
13. If any superior law implies a warranty, the sole remedy under such warranty shall be, at the Copyright Holder's option, either a) return of any price paid or b) use of reasonable endeavours to repair or replace the software.
14. This license shall be read under the laws of Australia.
The End
This license was derived from the Artistic license published on http://www.opensource.com
The Apache Software License, Version 1.1
This product includes software developed by the Apache Software Foundation (http://www.apache.org/).
Copyright (c) 2000 The Apache Software Foundation. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
disclaimer in the documentation and/or other materials provided with the distribution.
3. The end-user documentation included with the redistribution, if any, must include the following acknowledgment: "This
product includes software developed by the Apache Software Foundation (http://www.apache.org/)." Alternately, this
acknowledgment may appear in the software itself, if and wherever such third-party acknowledgments normally appear.
4. The names "Apache" and "Apache Software Foundation" must not be used to endorse or promote products derived from
this software without prior written permission. For written permission, please contact apache@apache.org.
5. Products derived from this software may not be called "Apache", nor may "Apache" appear in their name, without prior
written permission of the Apache Software Foundation.
THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT
NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR ITS
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
This software consists of voluntary contributions made by many individuals on behalf of the Apache Software Foundation.
For more information on the Apache Software Foundation, please see <http://www.apache.org/>.
Portions of this software are based upon public domain software originally written at the National Center for
Supercomputing Applications, University of Illinois, Urbana-Champaign.
3. REQUIREMENTS
A Contributor may choose to distribute the Program in object code form under its own license agreement, provided that:
a) it complies with the terms and conditions of this Agreement; and
b) its license agreement:
i) effectively disclaims on behalf of all Contributors all warranties and conditions, express and implied, including
warranties or conditions of title and non-infringement, and implied warranties or conditions of merchantability and fitness
for a particular purpose;
ii) effectively excludes on behalf of all Contributors all liability for damages, including direct, indirect, special, incidental
and consequential damages, such as lost profits;
iii) states that any provisions which differ from this Agreement are offered by that Contributor alone and not by any other
party; and
iv) states that source code for the Program is available from such Contributor, and informs licensees how to obtain it in a
reasonable manner on or through a medium customarily used for software exchange.
When the Program is made available in source code form:
a) it must be made available under this Agreement; and
b) a copy of this Agreement must be included with each copy of the Program.
Contributors may not remove or alter any copyright notices contained within the Program.
Each Contributor must identify itself as the originator of its Contribution, if any, in a manner that reasonably allows
subsequent Recipients to identify the originator of the Contribution.
4. COMMERCIAL DISTRIBUTION
Commercial distributors of software may accept certain responsibilities with respect to end users, business partners and
the like. While this license is intended to facilitate the commercial use of the Program, the Contributor who includes the
Program in a commercial product offering should do so in a manner which does not create potential liability for other
Contributors. Therefore, if a Contributor includes the Program in a commercial product offering, such Contributor
("Commercial Contributor") hereby agrees to defend and indemnify every other Contributor ("Indemnified Contributor")
against any losses, damages and costs (collectively "Losses") arising from claims, lawsuits and other legal actions brought
by a third party against the Indemnified Contributor to the extent caused by the acts or omissions of such Commercial
Contributor in connection with its distribution of the Program in a commercial product offering. The obligations in this
section do not apply to any claims or Losses relating to any actual or alleged intellectual property infringement. In order to
qualify, an Indemnified Contributor must: a) promptly notify the Commercial Contributor in writing of such claim, and b)
allow the Commercial Contributor to control, and cooperate with the Commercial Contributor in, the defense and any
related settlement negotiations. The Indemnified Contributor may participate in any such claim at its own expense.
For example, a Contributor might include the Program in a commercial product offering, Product X. That Contributor is
then a Commercial Contributor. If that Commercial Contributor then makes performance claims, or offers warranties
related to Product X, those performance claims and warranties are such Commercial Contributor's responsibility alone.
Under this section, the Commercial Contributor would have to defend claims against the other Contributors related to
those performance claims and warranties, and if a court requires any other Contributor to pay any damages as a result, the
Commercial Contributor must pay those damages.
5. NO WARRANTY
EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, THE PROGRAM IS PROVIDED ON AN "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED INCLUDING,
WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Each Recipient is solely responsible for
determining the appropriateness of using and distributing the Program and assumes all risks associated with its exercise
of rights under this Agreement, including but not limited to the risks and costs of program errors, compliance with
applicable laws, damage to or loss of data, programs or equipment, and unavailability or interruption of operations.
6. DISCLAIMER OF LIABILITY
EXCEPT AS EXPRESSLY SET FORTH IN THIS AGREEMENT, NEITHER RECIPIENT NOR ANY CONTRIBUTORS
SHALL HAVE ANY LIABILITY FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING WITHOUT LIMITATION LOST PROFITS), HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OR DISTRIBUTION OF THE PROGRAM
OR THE EXERCISE OF ANY RIGHTS GRANTED HEREUNDER, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.
7. GENERAL
If any provision of this Agreement is invalid or unenforceable under applicable law, it shall not affect the validity or
enforceability of the remainder of the terms of this Agreement, and without further action by the parties hereto, such
provision shall be reformed to the minimum extent necessary to make such provision valid and enforceable.
If Recipient institutes patent litigation against a Contributor with respect to a patent applicable to software (including a
cross-claim or counterclaim in a lawsuit), then any patent licenses granted by that Contributor to such Recipient under this
Agreement shall terminate as of the date such litigation is filed. In addition, if Recipient institutes patent litigation against
any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Program itself (excluding combinations of
the Program with other software or hardware) infringes such Recipient's patent(s), then such Recipient's rights granted
under Section 2(b) shall terminate as of the date such litigation is filed.
Table of Contents
Preface
About This Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxvii
Prerequisites for Using QualityStage . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xxviii
Related Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xxviii
Documentation Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxx
QualityStage Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxx
Additional Information and Assistance . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxxii
Chapter 1
Welcome
Using Re-engineered Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-1
Introducing QualityStage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-2
QualityStage and QualityStage Real Time. . . . . . . . . . . . . . . . . . . . . . . .1-2
Product Highlights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-3
Supports Data Quality Management Standards . . . . . . . . . . . . . . . . . . .1-3
Feature Highlights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-4
Benefits Highlights. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-4
How QualityStage Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-5
Investigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-5
Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-5
Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-6
Chapter 2
The Workflow for Creating Re-engineered Data
What is Re-engineered Data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-1
Overview of the Re-engineering Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-2
Overview of Phase One. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3
Overview of Phase Two . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-3
Overview of Phase Three . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-4
Overview of Phase Four . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-4
Phase One: Understand the Business Goals . . . . . . . . . . . . . . . . . . . . . . . . . .2-4
How High Quality Data Meets Business Goals . . . . . . . . . . . . . . . . . . . .2-5
Example of How Business Goals Determine Data Re-engineering Requirements . . . . . . . . . . . . . . . . . . .2-6
Phase Two: Understand the Source Data . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-6
Step One: Prepare for QualityStage . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-7
General Knowledge About the Source Data . . . . . . . . . . . . . . . . . . . .2-8
File Format of Source Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-8
Preparing Data for QualityStage . . . . . . . . . . . . . . . . . . . . . . . . . . .2-10
Step Two: Investigate the Source Data . . . . . . . . . . . . . . . . . . . . . . . . . .2-10
Organizing Source Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-11
Parsing Source Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-11
Classifying Source Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-12
Analyzing Patterns in Source Data. . . . . . . . . . . . . . . . . . . . . . . . . .2-12
Step Three: Evaluate the Results and Redefine the Project . . . . . . . . .2-12
Phase Three: Design and Develop the Re-engineering Application . . . . . . .2-13
Step One: Conditioning the Source Data . . . . . . . . . . . . . . . . . . . . . . . .2-14
About Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-14
Decisions You Make . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-14
Step Two: Matching the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-15
Example of Matching Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-16
Step Three: Determining Surviving Records and Formatting . . . . . . . .2-17
Keeping All Duplicate Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-17
Keeping Only One Record . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-17
Formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-18
Phase Four: Evaluating Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-18
Chapter 3
Using the QualityStage Development Environment
Installing QualityStage Designer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-2
Starting QualityStage Designer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-2
Using the QualityStage Main Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-4
Using the QualityStage Menu Bar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-4
File Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-5
Edit Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
View Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
Rules Menu. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
Help Menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7
Using the Left Pane of the QualityStage Main Window . . . . . . . . . . . . .3-7
Using the Right Pane of the QualityStage Main Window . . . . . . . . . . . .3-8
Using the QualityStage Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-10
Using QualityStage Dialog Boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-10
Selecting Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-11
Moving Items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-11
Additional Menus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-11
Browsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-12
Setting QualityStage Designer Options. . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-12
Local Working Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-13
Standardize Process Definition Directory . . . . . . . . . . . . . . . . . . . . . . . .3-13
Default Import Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-13
Preferred Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-14
Data Warehouse Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-14
How to Set Designer Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-14
Chapter 4
Working with Projects
Creating QualityStage Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-1
Adding a Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-2
Copying a Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-3
Deleting Projects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-3
Exporting Projects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-4
Exporting Datafile Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-6
Exporting Datafile Definitions via MetaBrokers . . . . . . . . . . . . . . . . . . .4-6
Exporting Datafile Definitions to MetaStage . . . . . . . . . . . . . . . . . . . . . .4-7
Importing Projects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-8
Chapter 5
Setting Up Run Profiles
Creating a Run Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-1
What Run Profiles Define . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-1
More About Run Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-2
Creating and Managing Run Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-2
Creating Run Profiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-2
Copying, Modifying, or Deleting Run Profiles . . . . . . . . . . . . . . . . . . . . .5-4
Defining an OS/390 Run Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-4
Defining a UNIX or Windows Run Profile . . . . . . . . . . . . . . . . . . . . . . . . . .5-10
Defining a Local Windows Run Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-16
Chapter 6
Building Jobs
Why You Use Jobs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-2
Building QualityStage Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6-3
Chapter 7
Deploying Jobs
About Deploying and Running Jobs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-1
Run Profiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-1
About Deploying Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-2
About Running Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-2
About Run Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-2
File Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-3
Data Stream Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-3
Parallel Extender Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-4
Comparing Run Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-5
Deploying a Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-5
How to Deploy a Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-6
Deploying Jobs in Data Stream Mode or Parallel Extender Mode . . . . .7-7
Deploying Jobs in File Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-7
Using the File Mode Execution Dialog Box for Deploying Jobs. . . . . . . .7-8
Deploying a Job Creates a Project File Structure . . . . . . . . . . . . . . . . . . . . . .7-9
Moving Input Data to the Correct Project Library Location . . . . . . . . . . . . .7-9
Deploying Jobs on an OS/390 Server. . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-9
Deploying Jobs on a UNIX Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-10
Deploying Jobs on a Windows Server . . . . . . . . . . . . . . . . . . . . . . . . . . .7-10
Deploying Jobs on a Local Windows Server . . . . . . . . . . . . . . . . . . . . . .7-11
Chapter 8
Running Jobs
About Running Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-1
Chapter 9
Defining Investigate Stages
Using an Investigate Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-2
Creating an Investigate Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-3
Using Character Investigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-5
Using the Pattern Reports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-5
Using Discrete Investigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-6
Using Concatenate Investigation . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-7
Using the Field Mask. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-7
Creating a Character Investigate Stage . . . . . . . . . . . . . . . . . . . . . . . . . .9-9
Using Word Investigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-11
Using Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-12
Using Pattern Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-13
Using Word Frequency Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-14
Using Word Classification Reports . . . . . . . . . . . . . . . . . . . . . . . . . .9-16
Specifying Advanced Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-18
Creating a Word Investigation Stage . . . . . . . . . . . . . . . . . . . . . . . . . . .9-21
Running Investigate Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-23
Running in Parallel Extender Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-26
Running in File Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9-27
Chapter 10
Defining Standardize Stages
Using the Standardize Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-2
About Rule Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-4
Standardization Processing Flow for U.S. Records . . . . . . . . . . . . . . . .10-4
Domain Pre-Processor Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-5
Why You Use the Domain Pre-Processor Rule Sets . . . . . . . . . . . . . . . .10-6
Preparing the Input File for the Domain Pre-Processor . . . . . . . . . . . .10-8
Domain-Specific Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-9
Validation Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-10
Standardized Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-10
Rules Overrides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-11
Defining Standardize Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-11
Defining the Input File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-11
Inserting Literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-11
Delimiter Literals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-12
Defining the Results File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-12
Creating a Standardize Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-13
Selecting Rule Sets, Fields, and Literals . . . . . . . . . . . . . . . . . . . . . . .10-15
Using the Append Field Selection Dialog Box . . . . . . . . . . . . . . . .10-20
Using the Data Selection for Reports Dialog Box . . . . . . . . . . . . .10-21
Specifying Case Formatting Options. . . . . . . . . . . . . . . . . . . . . . . . . . .10-22
Using Classification Tokens to Specify Fields for Case Formatting . . . . . . .10-23
Applying Case Formatting Rules . . . . . . . . . . . . . . . . . . . . . . . . . .10-23
Case Formatting Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-23
Running Standardize Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-24
Running in Data Stream Mode or Parallel Extender Mode. . . . . . . . .10-26
Running in File Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-28
Standardizing a Multinational Address File Using Standardize. . . . . . . .10-29
About the Country Identifier Rule Set . . . . . . . . . . . . . . . . . . . . . . . . .10-30
Using the Country Identifier Rule Set . . . . . . . . . . . . . . . . . . . . . . . . .10-31
Preparing the Input File for the Country Identifier. . . . . . . . . . . . . . .10-31
Managing the Rule Sets and Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-32
Accessing the Rules Management Dialog Box . . . . . . . . . . . . . . . . . . .10-33
Viewing or Modifying Rule Set Files and Tables . . . . . . . . . . . . . .10-34
Creating New Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .10-34
Chapter 11
Defining Multinational Standardize Stages
The Multinational Standardize Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-2
Which Countries Can Be Standardized . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-2
City-Level Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-2
Street-Level Standardization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-3
Modifying Standardization Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . .11-3
Input File Requirements and Recommendations . . . . . . . . . . . . . . . . . . . . .11-4
Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
Recommendations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-4
Input Field Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-5
Creating a Multinational Standardize Stage . . . . . . . . . . . . . . . . . . . . . . . .11-5
Running Multinational Standardize Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . .11-9
Running in Data Stream Mode or Parallel Extender Mode. . . . . . . . .11-12
Running in File Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-14
Multinational Standardize Output Fields . . . . . . . . . . . . . . . . . . . . . . . . . .11-15
About the Output File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11-18
Chapter 12
Defining Match Stages
About Matching Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-2
Using Match Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-3
One-To-One Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-4
Many-To-One Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-4
Matching for Unduplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-5
Blocking Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-5
The Strategy For Using Match Passes . . . . . . . . . . . . . . . . . . . . . . .12-5
Matching Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-6
About m-probability and u-probability . . . . . . . . . . . . . . . . . . . . . . .12-7
About Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-7
About Cutoffs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-8
About Unduplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-8
Reviewing the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-9
Extracting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-9
Defining Match Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-10
Defining Input Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-10
Defining Output Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-10
Defining a Match Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .12-11
Chapter 13
Working with Match Reports
About Match Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-2
Using the Default Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-2
Customizing a Match Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-3
Defining a Custom Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13-4
Chapter 14
Defining Survive Stages
Using the Survive Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-2
Grouping Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-3
Defining Survive Stage Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-4
Defining the Input File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-4
Defining the Results File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-4
Creating a Survive Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-4
Using the Survive Stage Wizard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-5
Defining Survive Rules. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-8
Defining Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-8
Defining a Simple Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-9
Defining a Complex Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-11
Using the Survivorship Rule Expression Builder . . . . . . . . . . . . .14-12
Adding the Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-13
Selecting Data for a Predefined QualityStage Report . . . . . . . . . . . . .14-15
Modifying and Maintaining Survivorship Rules . . . . . . . . . . . . . .14-16
Running Survive Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-18
Running in Data Stream Mode or Parallel Extender Mode. . . . . . . . .14-21
Running in File Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-22
Creating Rules Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-23
Rule Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-24
Rule Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14-25
Chapter 15
Working with QualityStage Reports
Using Stage Wizards to Prepare Data for Predefined QualityStage Reports . . . . . . .15-2
Creating and Running QualityStage Reports Using Unprepared Data . . .15-2
Preparing Your Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-3
Converting Output Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-3
Adding a .txt Extension to Flat Files . . . . . . . . . . . . . . . . . . . . . . . .15-3
File Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-3
Creating Customized Access Reports. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-4
Designing a Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-4
Creating a Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-4
Creating a Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-5
Creating a Query With the Query Wizard . . . . . . . . . . . . . . . . . . . . . . .15-5
Creating a Report in the Design View. . . . . . . . . . . . . . . . . . . . . . . . . . .15-6
Creating a Report with the Report Wizard . . . . . . . . . . . . . . . . . . . . . . .15-6
Testing and Debugging the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-7
Creating a Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-8
Generating and Viewing QualityStage Reports . . . . . . . . . . . . . . . . . . . . . .15-9
Specifying the Data Location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-10
ODBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-11
Microsoft Access Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-11
Flat Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-11
Specifying the Reports Database Location . . . . . . . . . . . . . . . . . . . . . .15-12
Selecting and Running a QualityStage Report . . . . . . . . . . . . . . . . . . .15-12
Saved Report Entries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-13
About Predefined QualityStage Reports . . . . . . . . . . . . . . . . . . . . . . . . . . .15-14
Reports Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15-14
Chapter 16
Using the QualityStage Data File and Report Viewer
Selecting the Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16-2
Appendix A
Importing Projects from MVS and UNIX into QualityStage Designer
Preparing Your Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
Collecting Information from the UNIX or MVS System . . . . . . . . . . . . A-2
For UNIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
For MVS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-2
Transferring the PDS Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
Updating the Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-3
Creating the IMF File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
Input File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-5
Job List File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
Control List File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-6
Data Definition List File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
Using the jcl_cnv Command. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-7
Conversion Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
Converting Data File Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . A-8
Sorts Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-9
Limited Operation Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . A-10
Understanding Conversion Problems . . . . . . . . . . . . . . . . . . . . . . . . . . A-10
Warnings Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-10
Fatal Error Messages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-11
Log Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-11
Things to Check After Import . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A-12
Appendix B
Match Comparisons
ABS_DIFF Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-1
AN_DINT Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-2
AN_INT Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-3
CHAR Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-4
CNT_DIFF Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B-5
Appendix C
Rule Set Files
Features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-1
Rule Set Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Dictionary File. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-2
Field Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-5
Classification Table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-6
Threshold Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-7
The Null Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-8
Pattern-Action File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-8
Pattern Matching Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-8
Tokenization and Classification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-9
Pattern-Action File Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-11
Rule Set Description File (.PRC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-14
Lookup Tables (.TBL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C-14
Appendix D
More About Using Rules
Country Identifier Rule Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-1
Input File: Country Code Delimiters. . . . . . . . . . . . . . . . . . . . . . . . . . . . D-2
Output File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-2
Domain Pre-Processor Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-3
Input File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-3
Why You Use the Domain Pre-Processor Rule Sets . . . . . . . . . . . . . . . . D-4
Domain Pre-Processor File Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-5
Domain Pre-Processor Dictionary File . . . . . . . . . . . . . . . . . . . . . . . . . . D-6
Domain Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-6
Reporting Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-7
User Flag Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-8
Domain Masks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-8
Upgrading Pre-Processor Rule Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-8
Domain-Specific Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-10
Domain-Specific File Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-11
Domain-Specific Dictionary Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-11
Business Intelligence Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-12
Matching Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-12
Reporting Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-13
Data Flag Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-13
Validation Rule Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-14
Validation File Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-14
VDATE Rule Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-15
Default Parsing Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-15
Input Date Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-15
Output Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-16
Business Intelligence Output Fields . . . . . . . . . . . . . . . . . . . . . . . . D-16
Error Reporting Output Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-17
VEMAIL Rule Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-17
Default Parsing Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-18
Parsing Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-18
Business Intelligence Output Fields . . . . . . . . . . . . . . . . . . . . . . . . D-18
Error Reporting Output Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . D-19
Appendix E
Customizing and Testing Rule Sets
Rule Set Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-1
Rule Set Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-2
Using Override Tables to Customize Rule Sets . . . . . . . . . . . . . . . . . . . . . . E-2
Domain Pre-Processor Override Tables . . . . . . . . . . . . . . . . . . . . . . . . . E-4
Domain-Specific Override Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-4
Validation Override Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-5
QualityStage WAVES/Multinational Address Override Tables . . . . . . E-5
Working with Multiple Projects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-5
Domain Pre-Processor Rule Set Process . . . . . . . . . . . . . . . . . . . . . . . . . E-6
Domain-Specific, Validation, and WAVES/Multinational Address Rule Set
Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-8
Using Overrides to Customize Rule Sets. . . . . . . . . . . . . . . . . . . . . . . . . . . . E-9
Domain Pre-Processor Overrides. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-11
Adding Classification Overrides . . . . . . . . . . . . . . . . . . . . . . . . . . . E-11
Adding Input Pattern and Field Pattern Overrides. . . . . . . . . . . . E-15
Adding Input Text and Field Text Overrides . . . . . . . . . . . . . . . . . E-19
Creating Domain-Specific, Validation, and WAVES/Multinational Address
Overrides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . E-22
Adding Classification Overrides . . . . . . . . . . . . . . . . . . . . . . . . . . . E-22
Adding Input Pattern and Unhandled Pattern Overrides. . . . . . . E-22
Appendix F
ISO Country Codes
Appendix G
Sharing Dictionary Fields and Variable Names Across Rule Sets
Scoping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-1
Log File Warning Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-2
Scoping For Dictionary Field Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-2
Backward Compatibility for Dictionary Field Name Scopes . . . . . . . . . G-3
Modifying a Previous Rule Set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-3
Scoping for Variable Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G-4
Backward Compatibility for Variable Name Scopes . . . . . . . . . . . . . . . G-5
Appendix H
Using AuditStage with QualityStage
Source File Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-2
Overview of Building a Source File from Data Tables . . . . . . . . . . . . . . H-2
Exporting a Sample Source File . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-3
Pre-Standardization Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-3
Validating at the Row Level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-3
Using Your AuditStage Results in QualityStage . . . . . . . . . . . . . . . . . . H-4
Tuning Standardization and Matching Jobs . . . . . . . . . . . . . . . . . . . . . . . . . H-4
Testing the Results Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-4
Accessing QualityStage Data for Testing . . . . . . . . . . . . . . . . . . . . . H-5
Sampling the Results Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-5
Creating and Using a Sample Data Set . . . . . . . . . . . . . . . . . . . . . . H-5
Maintaining Your QualityStage Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . H-6
Index
Related Documentation
In addition to this user guide, the QualityStage documentation
includes:
• QualityStage: Getting Started
Uses a simple real-life example to walk the reader through a data
re-engineering project. Topics include creating a project; defining
data files; and running jobs that use the Investigate, Standardize,
Match, and Survive stages.
• QualityStage OS/390 Server Guide
Describes how to install the server software, how to verify
installation, how to transfer jobs in a non-FTP environment, and
how to troubleshoot common problems. It also includes some
sample JCL.
Documentation Conventions
This guide uses the following conventions:
• User entries and book titles appear in italic typeface.
• Arrows represent a menu path, for example Select Edit ➤ Paste.
• Examples are represented by Arial font.
• Note: indicates nice-to-know information.
• Important: identifies vital information.
• Caution: warns you about actions that could cause damage to data
or unintentional termination of processing.
QualityStage Terminology
As of release 7.0 of QualityStage, certain INTEGRITY terms have
changed. This guide uses the new QualityStage terminology
throughout.
The following table lists the new equivalents to the older terminology
used in earlier versions of QualityStage:
Here are several examples of how the new terms are used:
Welcome
• Consolidated Billing.
• Ongoing Maintenance of Operational Data.
Introducing QualityStage
QualityStage is a client/server application. QualityStage Designer
provides a client interface for defining and customizing data
re-engineering jobs. QualityStage Designer runs on a Windows
workstation.
Whereas the QualityStage Designer defines how source data will be
processed, the QualityStage server accesses the source data and
processes it into the target re-engineered data. It maintains the
data fields and executes the QualityStage data re-engineering jobs
that you build with QualityStage Designer.
The QualityStage server runs on the following systems:
• Windows
• OS/390
• UNIX
For more information on the operating systems that the QualityStage
server runs on, see the QualityStage OS/390 Server Guide and the
QualityStage UNIX, Linux, and Windows Server Guide.
Product Highlights
This section highlights key capabilities and benefits.
Feature Highlights
QualityStage features:
• Robust data re-engineering solution – including data investigation,
parsing, matching, and reconciliation
• GUI/menu-driven, easy to learn
• Stages and tables that automate common data re-engineering
operations
• Over 24 match comparison algorithms providing a full spectrum of
fuzzy matching functions
• Callable libraries for real-time matching
• Requires few programming resources and reduces the need for
clerical staff
• Can be customized for your business rules
• Build once, run anywhere, run everywhere
Benefits Highlights
QualityStage benefits your company because it:
• Constructs consolidated customer views for the purpose of
cross-selling, up-selling, and customer retention
• Reduces time and cost to implement ERP (SAP, Baan, PeopleSoft,
JDE, etc.) initiatives
• Improves customer support and identifies your most profitable
customers
• Maximizes purchasing power for consolidated vendor projects
• Improves inventory control management via consolidated views of
inventory/product – to sell more products while increasing profit
margins
Investigation
Data investigation gives 100 percent visibility into the actual
condition of data, providing a sound understanding of the information
in legacy sources. This lets you identify and correct data problems
before they corrupt new systems.
Investigation parses and analyzes free-form and single domain fields,
determining the number and frequency of unique values and
classifying or assigning a business meaning to each occurrence of a
value within a field. As a result, investigation:
• Uncovers trends, potential anomalies, metadata discrepancies, and
undocumented business practices
• Identifies invalid or default values
• Reveals common terminology
• Verifies the reliability of fields proposed as matching criteria
Conditioning
Based on the understanding of the data gleaned from investigation,
conditioning standardizes and reformats data from multiple systems.
Matching
Matching ensures data integrity. Matching:
• Identifies duplicate entities (such as customers, suppliers,
products, parts, etc.) within one or more files
• Creates a consolidated view of an entity
• Performs householding
• Establishes cross-reference linkage
• Enriches existing data with new attributes from external sources
QualityStage applies probabilistic matching technology to any
relevant attribute — evaluating user-defined full fields, parts of fields,
or even individual characters — and assigns agreement weights and
disagreement weights to key data elements, based on a number of
factors such as frequency distribution, discriminating value, and
reliability.
It can also gauge the number of differences in a field, to account for
errors such as transpositions (for example, in Social Security
numbers). It can match records character by character exactly or find
and match even nonexact record matches (and provide a probable
likelihood that two records match), in the absence of common keys. It
matches records faster and more accurately than any visual inspection
can.
• You are aware of patterns and relationships within your data that
you can use to analyze your organization and forecast trends.
Figure 2-2
Phase Two: Understand the Nature and Content of the Source Data
By extracting this type of information about the source data, you set
the stage for importing the data into QualityStage. You will need to
define these four fields for QualityStage.
Note: In this example, the fields are not sequential; there are gaps in
the starting position. You do not have to extract all fields from a
record, just the ones you want. In this case, four fields are being
extracted, but the original record contains many more that are
not being dealt with at this time.
1. Working from the source data, create a flat data file and a file
definition (indicating metadata labels, field lengths, and starting
positions for each field).
2. Tell QualityStage where to find the data file.
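For step 1, a hypothetical four-field definition (the names, positions,
and lengths here are illustrative only) might look like this:

   Field     Start Position   Length
   NAME      1                25
   ADDRESS   30               30
   CITY      65               20
   ZIP       90               5

Note the gap between positions 26 and 29: as described in the note
above, fields do not have to be sequential.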
Now you’re ready to investigate the data using QualityStage.
Figure 2-3 Phase Three: Design and Develop the Data Re-engineering
Application
About Conditioning
Conditioning data involves moving free-form data into fixed fields and
manipulating data to conform to standard conventions. This process
identifies and corrects invalid values, standardizes spelling formats
and abbreviations, and validates the format and content of the data.
QualityStage uses the data classification generated during the
Investigation process to condition and standardize the data.
2. Specify how you want input records to match. You can indicate
which fields are important, how to group records, and which fields
to use for weights and penalties.
3. Define the output.
Depending on the source data, you can perform multiple matching
passes. You can also decide whether each pass is independent or
dependent. With a dependent pass, you can choose to exclude matches
from a previous pass in the succeeding pass.
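For example (a purely hypothetical setup), a first pass might group
records by ZIP code and match on name and street address; a
dependent second pass could then group by phone number and
exclude the pairs already matched in the first pass, so that only
still-unmatched records are compared.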
Your business rules determine these criteria decisions. And your data
determines how many passes you may need and how you need to
group the data. By evaluating the results of the previous phases, you
can determine the appropriate matching strategy for your application.
Formatting
The formatting part of the consolidating activity involves defining the
output. For example, you can:
• Define the order of fields in an output record.
• Create initial database load files or transaction input records.
• Create an initial production job stream.
The actual results of the formatting task depend on your goals, but
some possibilities are:
• Creating a file for each table.
• Creating from/to cross-reference tables.
• Creating data exception reports.
Sample projects and data files
QualityStage provides three sample projects and the data files used
with these projects. This chapter refers to these sample projects,
which you can use to learn how to work in QualityStage. When you
install the QualityStage Designer, QualityStage creates a directory,
DATA_Samples, in the directory where you installed QualityStage.
To install:
2. Click OK.
The QualityStage main window appears.
File Menu
The File menu lists the following commands:
• Open Repository. Use this command to open another
QualityStage repository. (When you launch QualityStage
Designer, QualityStage always opens the most recently used
repository.)
• Import. Use this command to import:
– A project from an IMF file
– Datafile definitions:
• From a COBOL copybook
• From an ODBC file definition
• From Visual Warehouse
• Via MetaBrokers
See “Importing Projects” on page 4-8 for information about
importing projects.
• Export. Use this command to export:
– A project to an IMF file
– A datafile definition via MetaBrokers
– A job to Visual Warehouse
See “Exporting Projects” on page 4-4 for information about
exporting projects.
• Run profiles. Use this command to create, modify, or delete run
profiles. See Chapter 5, “Setting Up Run Profiles” for information
about setting up run profiles.
• Reports. Use this command to set up and run QualityStage
reports. See Chapter 15, “Working with QualityStage Reports”, for
information about QualityStage reports.
Edit Menu
The Edit menu lists the following commands:
• Cut. Use this command to delete an item selected in the right pane.
• Copy. Use this command to copy the selected item to the clipboard.
• Paste. Use this command to paste the item currently in the
clipboard.
View Menu
The View menu lists the following commands:
• Tool Bar. Use this command to display or hide the QualityStage
tool bar.
• Status Bar. Use this command to display or hide the status bar.
• Ascential Banner. Use this command to display or hide the
Ascential banner just below the title bar.
Rules Menu
The Rules menu lists the following commands:
• Standardize Rules Management. Use this command to create,
modify, or delete Standardize stage rule sets. See “Managing the
Rule Sets and Files” on page 10-32 for information about
Standardize rule sets.
Help Menu
The Help menu lists the following commands:
• Online Help. Use this command to display the QualityStage
online help system.
• User Guide. Use this command to display the QualityStage
Designer User Guide in Acrobat Reader.
• Stages Guide. Use this command to display the QualityStage
Stages Reference Guide in Acrobat Reader.
• Getting Started. Use this command to display the QualityStage:
Getting Started guide in Acrobat Reader.
Click the plus sign next to a project folder to expand its contents. Each
project folder contains the following three subfolders:
• Datafile Definitions. When you select a Datafile Definitions
folder, the right pane displays all datafile definitions defined for
this project. Click the plus sign next to this folder to expand the list
of datafile definitions in the left pane.
– When you select a datafile definition in the left pane, the right
pane displays all fields defined for this datafile.
• Stages. When you select a Stages folder, the right pane displays all
stages defined for this project.
• Jobs. When you select a Jobs folder, the right pane displays all jobs
defined for this project. Click the plus sign next to this folder to
expand the list of jobs in the left pane.
Drag and drop
You can use drag-and-drop operations to copy or move items listed in
the right pane to appropriate folders that are listed in the left pane.
To drag and drop multiple adjacent items, hold down the SHIFT key
while you click items to select them.
To drag and drop multiple items that are not adjacent, hold down the
CTRL key while you click items to select them.
The following table lists the available drag-and-drop operations:
Note: You cannot directly copy or move a stage from one project to a job
in another project. To do this, first copy or move the stage from
the source project to the target project, and then add the stage
you copied or moved to a job in the target project.
Deleting items
In the right pane, you can delete datafile definitions, datafield
definitions, stages, jobs, and projects.
• Create new:
– Projects
– Datafile definitions
– Datafield definitions
– Stages
– Jobs
– Large icons
– Small icons
– Details
Selecting Items
There are two methods of selecting multiple items in a list:
• To select multiple nonadjacent items, hold down the CTRL key and
click the items to select them.
• To select multiple adjacent items, hold down the SHIFT key and
click the items to select them.
Moving Items
The drag and drop method of moving items can be used for some lists
in the following ways:
• You can drag and drop any item in a list to change its position in
the list.
• You can drag and drop from one list to another any item that is
highlighted when selected.
When you move an item, the item is highlighted in the list, and the
two items between which you are placing it appear in bold text.
A move icon indicates that the item can be moved. If the move icon
does not appear, you cannot alter the position of the item in the list
while in that dialog box.
In addition to using the drag and drop method in some dialog boxes,
you can also use the Move Up and Move Down buttons to move items.
Additional Menus
In addition to the menus in the menu bar, there are several menus
accessible from within the dialog boxes themselves. Right-click an
item to bring up any additional menus specific to that item.
Browsing
Some dialog boxes contain locations for a specific file or directory. You
can browse for the correct location by using the browse button on
the right side of the entry. Here is an example of the Designer Options
dialog box in which the browse button for Local Working Directory is
selected:
Preferred Editor
By default, QualityStage uses Notepad for its text file editor. If you
prefer to use another editor, enter the path to the editor of your choice.
The Designer Options dialog box appears and displays the General
tab:
Note: This is optional and not a project directory. Unless you are
very short on disk space, do not change this directory.
5. Under Preferred Editor, enter the file name of the editor you want
to use with QualityStage. You can:
• Enter the full path to the editor executable file.
• Enter the file name of any editor executable file located in a
directory included in your PATH environment variable.
the datafile definitions, stages, and jobs. You can then edit these
datafile definitions, stages, and jobs to create your own project.
If you are sharing a data repository with other QualityStage Designer
clients, only one user can access a project at a time, but different users
can access different projects at the same time.
Adding a Project
To add a project:
1. Do one of the following:
Copying a Project
To copy an existing project:
Deleting Projects
To delete a project:
Exporting Projects
When you want another user to work with the same project or one
created using a previous version of QualityStage, you can export the
project so that it is loaded onto the other user’s client machine. You
can export projects from one QualityStage Designer client and import
them as projects into another QualityStage Designer client.
During an export, QualityStage creates a single Interchange
Metadata Format (IMF) file of the project. You can then move this file
to any QualityStage Designer host running the same version or a later
version.
To export a project:
6. Click OK.
Importing Projects
When you need to use a project created in an earlier version of
QualityStage (or INTEGRITY), or someone else’s project, you import
that project. You can import QualityStage projects that were exported
from another QualityStage Designer client.
To import a project, you need to create an Interchange Metadata
Format (IMF) file of the project and make it available on the
QualityStage Designer client host. QualityStage uses the IMF file and
re-creates the data file definitions, the jobs, and the stage definitions.
If you are importing a project from an MVS/UNIX host, refer to the
QualityStage UNIX, Linux, and Windows Server Guide for instructions
on creating the IMF file.
To import a project converted to an IMF file:
6. Using the standard navigation, select the desired IMF file, and
then click Open.
Your imported project appears in the project list on the right pane
of the QualityStage main window.
You can also use a COBOL Copybook to ensure that your file
definitions are consistent and accurate; by importing the definitions,
you can avoid manual entry and any possible keying errors.
When QualityStage imports a COBOL Copybook, it automatically
creates a new project. You can then add the newly imported datafiles
to an existing project if you want to.
To import a COBOL Copybook:
6. Enter the name and description of the data file that the COBOL
Copybook will define.
7. Click OK.
File names, field names, and field lengths
When importing datafile definitions via a MetaBroker or from
MetaStage, the following conditions apply:
• If an imported file name is longer than 8 characters, QualityStage
Designer truncates the name to 8 characters and puts the full file
name in the file description.
• If an imported field name is longer than 7 characters, QualityStage
Designer truncates the name to 7 characters and puts the full field
name in the field description.
• If a file name or field name contains characters that QualityStage
cannot use (for example, _ (underscore)), QualityStage Designer
replaces the name with BADNAMEn and puts the original name in
the file or field description.
• If there are no field lengths defined in the import file, QualityStage
Designer sets the field length to 0.
When you re-export such datafile definitions via a MetaBroker, the
MetaBroker uses the original name.
4. Enter the name of the file to import, or use the browse button to
browse for the file.
See the technical bulletin for the MetaBroker you are using for
detailed information about completing this dialog box. When you
finish entering parameters, click OK.
The Status dialog box appears. The MetaBroker decodes datafile
definitions and writes them to a temporary file.
5. Do one of the following:
• Click Select All to select all datafile definitions.
• Click Filter to open the Meta Data Selection dialog box. You
can filter out some of the datafile definitions. For detailed
instructions on filtering, click Help.
When you are done, click OK. The Parameter Selection dialog box
appears.
6. Do the following:
a. Select the Verbose check box to make the Status dialog box
display the status of each imported datafile definition.
b. Next to Log File, enter the path of the log file to create, or click
the browse button to browse for the file.
c. Click OK.
7. When the import is complete, click Finish.
When you configure or build a job, you select files from a list of
available files to assign those files to the job. Therefore you must
define your files before you configure or build jobs.
All jobs require at least one input file, and some require two; most
jobs require one output (or results) file, and some require two. The
Match stage also requires additional files if you specify custom
reports or extracts.
Important: File names must be eight characters or less, and must not contain
extensions such as .txt.
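For example (hypothetical names), CUSTIN and MATCHOUT are
acceptable file names, whereas customerfile.txt is not: it is longer
than eight characters and carries an extension.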
About Arrays
The Match stage provides the ability to compare arrays of fields. An
array can comprise any number of fields, including one. To use arrays,
you need to define them using the Arrayfields dialog box.
Arrays allow you to reduce the number of cross comparisons you
would have to define with Match. For example, if you have a first
name, a middle name, and a last name field whose values might
appear in any order (for example, a first name in the last name
field), arrays let you compare all names without regard to order.
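To illustrate with hypothetical field names: if FNAME, MNAME, and
LNAME are defined as a single array, a record holding JOHN A SMITH
and another holding SMITH JOHN A can still be compared name for
name, even though the values sit in different fields.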
Check that any jobs and stages using this file now select this file
name.
When you are inserting a new field, if the starting position does
not correspond to the beginning of an existing field, an error
occurs. Click OK to return to the Datafield dialog box to modify
the Start Position value.
9. If you want the field to be tested by the Survive stage as an
integer, select Integer under Field Use Type.
You need to change a field to integer only if you are testing the
field by performing an arithmetic operation on it; for example,
adding or subtracting a value from an age.
Note: The field use type is used only by the Survive stage.
10. Under Field Data Type select the appropriate data type for your
field:
• Alphanumeric. This is the default value.
• Packed Decimal.
• Zoned Decimal.
• Binary Unsigned.
Any data type can be selected for either input or output files. All
numeric data types are right-justified when converted to
alphanumeric data types.
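For example (a hypothetical illustration of this rule), a packed
decimal value of 123 written to a six-character alphanumeric field is
right-justified as three spaces followed by 123.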
11. Do one of the following:
• Click Apply to add the datafield definition to the data file.
Information in the dialog box is cleared, letting you define
another datafield.
• Click OK to add the datafield definition to the data file and
exit the dialog box.
Note: The Total Record Length of the file is derived from your data
definitions. If you do not have a field defined that represents the
last column of the record, QualityStage may not correctly
calculate the Total Record Length of the file.
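For example (a hypothetical layout), if the last field you define starts
at position 61 with a length of 20, the derived Total Record Length is
80. If the records actually extend to column 100, define a field
covering the trailing columns so that the length is calculated
correctly.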
Defining Arrays
To define an array:
3. Under Available Fields, select the desired data field, and then
click the arrow button or drag and drop the field from the
Available Fields box into the Fields in the Arrayfield box.
The data field definition appears under Fields in the Arrayfield.
Configuring the Interface to Data Warehouse Center
To configure the Interface to Data Warehouse Center, use the
Designer Options dialog box:
1. From the QualityStage main window select File ➤ Designer
Options.
3. For detailed information about using this dialog box, see the
QualityStage Guide to the Data Warehouse Center Interface.
More information
Consult the following documentation for more information about
licensed stages:
• The QualityStage CASS and Z4Change Guide.
• The QualityStage WAVES User Guide.
• The QualityStage SERP User Guide.
2. Click New.
The Select Run Profile Template window appears:
Important: Some of the information required for the profile was defined
during the installation of the server software. Your system
administrator can supply any information that you require.
For each field listed under Next to, enter the following:
• Host Name: Either the IP address or the host name for the OS/390
system.
• QualityStage Procedure Lib: Up to 44 characters. Location for the
QualityStage job library that contains the TSOPROC job.
• Account: Valid account number used for charge back, audit, etc.
• TCP Port: TCP port number. The default is 23.
• User ID: Your user logon name.
• Password: Your user password.
• Email: (Optional) Full e-mail address for notification when a job
completes. Important: Do not enter an address if your e-mail
system does not support the SMTP protocol.
• Alternate Locale: The international locale you want QualityStage
to use at the server to process the data. This value needs to be set
only if you are processing data for a language that is not the
default language of the server. For example, the default language
for your server is French, and the data to be processed is Italian.
• User Information: (Optional) Any additional information that is
printed on the output banner page.
• Printer: (Optional) Local or remote printer to which output from a
batch job is routed. You can enter up to eight characters. The
default is LOCAL.
• Job Parameters: (Optional) Parameters passed to the system
through a jobcard. You can enter up to 60 characters.
• VSAM DASD Volume: Either use the default asterisk (*) or, if it is
not supported, enter a valid DASD volume (up to 6 characters).
Defines temporary space required by the Standardize and Match
stages.
• Region: Amount in multiples of 1024K of CPU storage used for a
region size when a job is submitted for batch processing. The
default is 0M. If you set this value to less than the default, you risk
causing an ABEND, nonzero return codes, or unexpected error
messages. See the QualityStage OS/390 Server Guide for more
information about this parameter.
• Execution Time: (Optional) Maximum allotted execution time. You
can enter any value between 0 and 1440, with 1440 indicating no
time limit.
• Execution Class: One character indicating the execution class. The
default is A. Check with your system administrator.
• Output Class: One character indicating the hold queue for the
TCP/IP connection. The default is H. Check with your system
administrator.
• Disk Space (Cylinders): (Optional) An integer from 1 through 5000.
Amount of primary space allocated for all of the QualityStage job’s
data sets. If this field is left empty, primary space to be allocated is
taken from the default settings in the SYMBOLS file.
• National Characters (valid in data set names): The 3 national
characters that are valid for data set names. The default is $#@,
which are the national characters that are valid in the US. In the
UK, for example, # and @ are valid, but $ is not, whereas the UK £
(pound sterling) sign is valid.
• Library Qualifiers: Up to 35 characters. First- and second-level
qualifiers for the ARD (alib), Control (clib), Reports (rlib),
Repository (tlib), Skeletons (slib), and Uni (ulib) libraries. These
are the libraries where QualityStage stores the staged information
from the client.
• Data First Qualifier: High-level qualifier for your input and output
data files.
• Data Second Qualifiers: Second-level qualifiers for your input and
output data files.
• VSAM Qualifiers: Up to 22 characters. High-level qualifiers for
VSAM files.
• Work File Name Qualifiers: Up to 22 characters. QualityStage jobs
create certain files that are cataloged but not required (except
perhaps for debugging purposes) once the job has completed
successfully. You can give such data sets a distinct high-level
qualifier to distinguish them from permanent data sets.
Furthermore, if you want to run the same QualityStage job more
than once in parallel, you may need to use different work file name
qualifiers to avoid contention on work file data sets. If you leave
this field empty, the default value is the data high-level qualifier.
Note: The UNIX and Remote Windows templates are similar, but they
provide different default values.
To define your profile, do the following for either the UNIX or the
Remote Windows server template:
For each field listed under Next to, enter the following:
• Host Name: Host name or IP address for the server.
• Host Server Path: Full path for the directory in which you installed
the server software. The default is:
On UNIX: /Ascential/QualityStageServer70/bin
On Windows: C:\Ascential\QualityStageServer70
• Master Project Directory: Full path for the project directory. This
is the directory in which the data, scripts, control members, and
logs are stored. The default is:
On UNIX: /Ascential/QualityStageServer70/Projects
On Windows: C:\Projects
If this directory does not exist, QualityStage automatically creates
it when you deploy your first job, provided the user has the
appropriate access for creating that directory.
• TCP Port: Port that the server is started on.
• Email: (Optional) Full e-mail address for notification when a job
finishes.
• Alternate Locale: The international locale you want QualityStage
to use at the server to process the data. This value needs to be set
only if you are processing data for a language that is not the
default language of the server. For example, the default language
for your server is French, and the data to be processed is Italian.
If you are running the Multinational Standardize or the WAVES
stage on a UNIX server, you must enter a German ISO locale. For
more information on locales, see the QualityStage UNIX, Linux,
and Windows Server Guide.
• Local Report Data Location: Full path to the location on the client
system where prepared QualityStage report data is stored.
With the Advanced Project Settings dialog box, you can define an
alternative location for the following directories:
• Data, where your input data files reside and the output data files
are created.
• Controls, where the control member files are created.
• Temp, where files are created and deleted during script execution.
• Logs, where the log files from running a job are created.
• Scripts, where the executable scripts are generated.
By default, QualityStage creates these directories under the directory
you specified as the Master Project Directory in the Profile Definition
dialog box. If you want any of these directories in another location, you
must specify the full path in this dialog box.
When finished, you can either:
• Click OK. Your newly defined profile appears:
– In the list on the Run Profiles dialog box, and
– In the Profile list on the Job Run Options dialog box
• Click the FTP Settings tab.
The FTP Settings tab looks like this:
• Login ID: Name of the user that owns the server directory and the
master project directory and that starts the QualityStage server.
• Password: Password for the Login ID.
• FTP Protocol: Choose either SFTP (secure FTP) or FTP.
• Port: FTP port number. The default is …
• Public Key
Defining a run profile for a local Windows server requires that you
enter the paths to your QualityStage server software and Master
Project Directory.
For each field listed under Next to, enter the following:
• Host Server Path: Full path to the directory in which you installed
the server software. The default is:
C:\Ascential\QualityStageServer70
• Master Project Directory: Full path to the project directory. The
default is: C:\Projects. You must create this directory before you
can deploy or run a project.
• Alternate Locale: The international locale you want QualityStage
to use at the server to process the data. This value needs to be set
only if you are processing data for a language that is not the
default language of the server. For example, the default language
for your server is French, and the data to be processed is Italian.
For more information, see the QualityStage UNIX, Linux, and
Windows Server Guide.
• Local Report Data Location: Full path to the location on the client
system where prepared QualityStage report data is stored.
Building Jobs
Use the QualityStage main window to create, modify, and run jobs.
Renaming a Job
To rename a job or to change its description:
1. On the left pane of the QualityStage main window, select the job
you want to add stages to.
2. Right-click anywhere on the right pane to display a list of stage
types.
Tip: You can also drag and drop stages to reorder the list.
2. On the right pane, right-click the stage you want to copy, and then
click Copy.
3. Right-click anywhere on the right pane, and then click Paste.
The Select a stage name dialog box appears.
Deploying Jobs
Run Profiles
Before you deploy and run a job for the first time, you must set up a
run profile. For information about setting up run profiles, see
Chapter 5, “Setting Up Run Profiles”.
Tip: Because QualityStage builds the JCL and shell script on the
server, you can deploy and run the same job on all server types.
File Mode
File mode should be familiar to all preexisting QualityStage users,
because until INTEGRITY version 3.6, it was the only processing
mode available.
If you are using file mode, QualityStage inputs your data file into the
first stage in your job and processes the entire file before handing the
results to the next stage in your job. QualityStage also generates
interim files while processing, which remain on your system.
File mode has several advantages:
• Allows you to run sections of jobs; this is a useful feature for
debugging.
• Allows you to generate Match reports and default extracts. See
“Working with Match Reports” on page 13-1 for more information.
• May be faster if you are processing with a single CPU (Central
Processing Unit).
Deploying a Job
When you first create a job, you must deploy it without running it.
When you first deploy a job, the project directory is created with the
appropriate subdirectories (Controls, Data, Logs, etc.). After all
directories in the project directory exist, you must move your data files
into the project directory or the data library. The data in these files is
used when you run the job.
If you have not set up a run profile, you must do so before you deploy
any jobs. For information about setting up run profiles, see Chapter 5,
“Setting Up Run Profiles”.
Important: On Windows systems, you must re-deploy jobs that were created
and deployed using versions of QualityStage or INTEGRITY
earlier than version 7.0. See the UNIX, Linux, and Windows Server
Guide for details.
The Deploy check box is selected, as shown in the preceding Job Run
Options dialog box.
Using the File Mode Execution Dialog Box for Deploying Jobs
In file mode, QualityStage creates data files that you can then use for
debugging a job. The File Mode Execution dialog box lists all the
stages in the job. Use this dialog box to set starting and ending points
for deploying the job.
By default, all stages listed are run from first to last. However, to
select a subset, do the following:
2. Click OK.
You must now move your input files into the appropriate project
directory. See “Moving Input Data to the Correct Project Library
Location” on page 7-9 for information on how to do this.
Optionally, you can specify the full path and directory in which your
data resides with the Advanced Project Settings tab of the Profile
Definition dialog box. You must put your data files in this location
before running the job.
Running Jobs
After you deploy your job and move your input data files into the
appropriate directory, you are ready to run the job.
This chapter describes the following:
• “Running a Job from QualityStage Designer” on page 8-2
• “Running a Job from the Command Line on UNIX Systems” on
page 8-8
• “Running a Job from the Command Line on Windows Systems” on
page 8-9
• “Restarting a Job” on page 8-12
• “Viewing Job Output Files” on page 8-13
Remote Servers
With OS/390, UNIX, Linux, and Windows servers, QualityStage ends the
connection from the client to the server after all files have been
transferred.
When the job is finished, you can receive an e-mail message
containing the job results. This message is sent to the e-mail address
you define in the run profile.
Optionally, you can keep the connection to the server open while the
job runs and receive status messages in the Status window.
You might want to follow the status of your job run during
development and testing of your project.
Important: You can clear the Deploy check box if you made no changes
(such as adding or modifying a stage, modifying a data file
definition, or editing a Pattern-Action file) to your job.
However, if you make
changes to your job or to any of its stages, you need to deploy
it again. For information about deploying jobs, see Chapter 7,
“Deploying Jobs”.
Advanced Run Options
8. (Optional) Click Advanced Run Options to see other options you
can set, depending upon whether:
• Your QualityStage server is running on an OS/390 system
• You are running in Parallel Extender mode
OS/390 job options
If you are running a job on an OS/390 server and you select
Advanced Run Options, the following screen appears:
For each parameter, enter the following value:
• Data First Qualifier: High-level qualifier for your input and output
data files.
• Data Second Qualifiers: Second-level qualifiers for your input and
output data files.
• VSAM Qualifiers: Up to 22 characters. High-level qualifiers for
VSAM files.
• Work File Name Qualifiers: Up to 22 characters. QualityStage jobs
create certain files that are cataloged but not required (except
perhaps for debugging purposes) once the job has completed
successfully. You can give such data sets a distinct high-level
qualifier to distinguish them from permanent data sets.
Furthermore, if you want to run the same QualityStage job more
than once in parallel, you may need to use different work file name
qualifiers to avoid contention on work file data sets. If you leave
this field empty, the default value is the data high-level qualifier.
• Disk Space (Cylinders): (Optional) An integer from 1 through 5000.
Amount of primary space allocated for all of the QualityStage job’s
data sets. If this field is left empty, primary space to be allocated is
taken from the default settings in the SYMBOLS file.
• Run Identifier: (Optional) A single uppercase letter (A through Z)
or a number from 0 through 9. The Run ID is suffixed to the name
of the MVS job on the server. Use Run IDs to distinguish among
two or more MVS jobs running at the same time on the server.
Parallel Extender job options
If you are running a job on a UNIX server, the following screen
appears:
If you are running a job using Parallel Extender, you can specify
the kind of sorting you want to use.
8. Click OK.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
Mode scripts
The base name of the script file is the name of the job. Its extension
identifies the run mode:
• File mode scripts end with .stp
• Data stream mode scripts end with .scr
• Parallel Extender mode scripts end with .par
For example, if the job name is TEST, the following three scripts are
created:
• TEST.stp
• TEST.scr
• TEST.par
How to run mode scripts
To run any of the scripts, use the following syntax at a UNIX shell
prompt:
scriptname -ipe.env proc_env_file -ipe.env proj_env_file
scriptname is the full or relative path of the run script.
proc_env_file is the full or relative path of the environment file
associated with the job. It is located in the Scripts directory. Its file
name is the name of the job with an .env extension.
proj_env_file is the full or relative path of the project environment file.
It is located in the project directory. Its file name is ipe.env.sh.
Example
To run the TEST.par Parallel Extender mode script, enter the
following command from the Scripts directory:
TEST.par -ipe.env TEST.env -ipe.env ../ipe.env.sh
Mode scripts
The base name of the script file is the name of the job. Its extension
identifies the run mode:
• File mode scripts end with .stp
• Data stream mode scripts end with .scr
For example, if the job name is TEST, the following two scripts are
created:
• TEST.stp
• TEST.scr
How to run mode scripts
To run any of the scripts, use the following syntax at an MKS bash or
ksh shell prompt:
scriptname -ipe.env proc_env_file -ipe.env proj_env_file
scriptname is the full or relative path of the run script.
proc_env_file is the full or relative path of the environment file
associated with the job. It is located in the Scripts directory. Its file
name is the name of the job with an .env extension.
proj_env_file is the full or relative path of the project environment file.
It is located in the project directory. Its file name is ipe.env.sh.
Example
To run the TEST.scr script, enter the following command from the
Scripts directory:
TEST.scr -ipe.env TEST.env -ipe.env ../ipe.env.sh
Example
To run the TEST.par Parallel Extender script on persistent data sets,
enter the following command from the Scripts directory:
TEST.par -ipe.env TEST.env -ipe.env ../ipe.env.sh -noimport 1
Restarting a Job
If your job halts before finishing, you can restart any job at a specific
stage.
To restart:
4. On the Job Run Options screen select the Run check box. Clear the
Deploy check box if necessary.
5. Click Execute File Mode.
6. On the File Mode Execution screen click Set Starting Stage to
move the job starting point to the stage that failed.
7. Click Run From Start to End.
QualityStage builds and then submits new JCL or script, which starts
executing at the stage that previously failed.
Note: On OS/390 systems, you can also manually edit the QualityStage
JCL and include a RESTART parameter on the JOB card. For a
description of how to do this, see the QualityStage OS/390 Server
Guide.
QualityStage reports
See Chapter 15, “Working with QualityStage Reports” for information
about how to create and generate QualityStage formatted reports.
QualityStage Data File and Report Viewer
See Chapter 16, “Using the QualityStage Data File and Report
Viewer” for information about how to use the QualityStage Data File
and Report Viewer.
Figure 9-1 Phase Two: Understand the Nature and Content of The
Source Data
1. Depending on the source data and what you’re trying to find, you
choose the type of investigation to perform:
• Word investigation, used on free-form fields.
• Character investigation, used on single-domain fields.
2. Specify the fields you want to investigate. Be sure these fields
were defined appropriately when you prepared the data for
QualityStage.
3. Choose the rule set to use in classifying tokens or words.
4. Run the investigation on each field.
The Investigate stage provides the following sets of reports:
• Pattern—contains the pattern analysis of the data entities.
• Word Frequency—shows the frequency distribution of the field
values.
• Word Classification.
In addition, the process generates a file with the name job.FRQ, which
displays the tokens and patterns for all records. This file is different
for the two options for Character investigation:
• For the CONCATENATE option, the first column contains the
frequency count, followed by the frequency percentage, the
pattern, and the entire field.
• For the DISCRETE option, the first column contains the field
name, followed by the frequency count, the frequency percentage,
the pattern, and the entire field.
8. Click Finish.
To modify the mask field and change the investigation type for any
selected field:
1. Under Selected Fields, select the field for which you want to
modify the investigation type.
2. Click Change Mask.
The Mask Field Selection dialog box appears.
3. Make the appropriate changes, and then click OK.
Class  Description
^      Numeric containing all digits, such as 1234
?      Unknown token containing one or more words, such as
       CHERRY HILL
>      Leading numeric containing numbers followed by one or more
       letters, such as 123A
<      Leading alpha containing letters followed by one or more
       numbers, such as A3
@      Complex mix containing alpha and numeric characters that
       do not fit into either of the above classes, such as 123A45 and
       ABC345TR
0      Null
-      Hyphen
/      Slash
&      Ampersand
#      Number sign
(      Left parenthesis
)      Right parenthesis
~      Special containing special characters that are not generally
       found in addresses, such as !, \, @, ~, %, etc.
Frequency
Count Percent Pattern Field Standardized Fields
00000051 24.519% ^?T [X]| 15423 COUSTEAU DR | 15423 COUSTEAU
00000037 17.788% ^? [X]| 6806 ROCKLEDGE | 6806 ROCKLEDGE
COVE COV
00000027 12.981% ^D>T [X]| 8541 W 72ND STREET | 8541 W 72ND
00000010 4.808% ^D?T [X]| 3625 SE HOWARD | 3625 SE HOWARD
DRIVE
00000010 4.808% ^D? [X]| 1304 N MAIN | 1304 N MAIN
00000009 4.327% ^D>S [X]| 4405 W 128TH ST | 4405 W 128th
00000006 2.885% ^D> [X]| 1537 E 37TH | 1537 E 37TH
Note that the vertical lines (|) are used to separate the field from the
pattern and from the standardized presentation. If you do not select
the standardize option, the standardized fields are not present.
The filenames for these reports are the first seven characters of the job
name with the following appended to it:
a.FRQ The Word Frequency report in ascending order by
token.
c.FRQ The Word Frequency report in descending order by
frequency count.
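For example, for a hypothetical job named CUSTOMER, the reports
would be named CUSTOMEa.FRQ and CUSTOMEc.FRQ, built from
the first seven characters of the job name.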
You have the option of including unclassified tokens along with the
classified tokens in these reports with the Advanced Options dialog
box. When you include the unclassified tokens, you generate a report
of all tokens in your input file.
Frequency
Count Token
0000000030 W
0000000018 ST
0000000017 DR
0000000015 DRIVE
0000000015 STREET
0000000013 BOX
0000000011 E
0000000009 COURT
0000000008 TERR
The Word Frequency report assists you in reviewing the quality and
content of your data. When sorted by frequency, the report allows you
to determine quickly the values present in your data. When sorted by
token, the report assists in identifying alternate representations, such
as misspellings, of your data.
Note: To generate the unclassified token report (the second file listed
above), you must specify Include Unclassified Alphas in Word
Frequency Files in the Advanced Options dialog box.
Frequency
Token Standardization Class Count
ALDEN ALDEN ? ;0000000001
ALVAMAR ALVAMAR ? ;0000000001
AMESBURG AMESBURG ? ;0000000001
ANN ANN ? ;0000000002
ANTIOCH ANTIOCH ? ;0000000001
ARLINGTON ARLINGTON ? ;0000000002
ARROWHEAD ARROWHEAD ? ;0000000001
AVON AVON ? ;0000000001
B B ? ;0000000001
BAKER BAKER ? ;0000000001
BARTON BARTON ? ;0000000001
The format of this report allows you to merge entries with existing
Classification tables (.CLS files). You can use this report to fine-tune
your rule sets for investigating and standardizing data by adding to
an existing Classification Table or creating a new one. See
Appendix E, “Customizing and Testing Rule Sets”, for details on
customizing your rule sets.
want to see house and apartment numbers, but you might want to
see numbers if you are investigating part numbers.
• Include Unclassified Alphas in Word Frequency Files.
This option includes all word tokens that are not in the
Classification Table in both Word reports. If you do not select this
option, the Word reports only include tokens from the
Classification Table.
• Include Mixed Types and Punctuated Words in Word Frequency
Files.
This option includes tokens with leading or trailing numerics,
such as 109TH and 42ND, in both Word reports.
You can choose to display the tokens in the reports in one of the
following forms:
• Standard Abbreviation — the standardized representation of the
token from the Classification table.
• Original Spelling — the form as the token appears in the data file.
• Correct Spelling — allows the Investigation process to correct any
misspellings if the Classification table has a weight assigned to the
token.
By default, QualityStage provides one sample for each unique pattern
in the Pattern report. You can increase the number of samples
displayed for each unique token; for example, you might want to see
four samples for each token.
You can also limit the frequencies that are displayed. You might not
want to see the low frequency patterns, the ones that appear only once
or twice. You can set a cutoff count for the frequency. You can change
either or both of these default settings through the Advanced Options
button.
You can also specify what characters separate tokens and whether
special characters are included as a token with the following options:
• Separator List.
This list includes all special characters that separate tokens.
• Strip List.
This list includes all special characters from the Separator List
that are not to be a token. For example, the pound sign (#) by
default is not part of this list; therefore, APT#3A is three tokens:
APT, #, and 3A.
You can edit these lists to add or remove special characters. Note that
the space special character is included in both lists.
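By the same logic, if you added the pound sign (#) to the Strip List,
APT#3A would yield only two tokens, APT and 3A: the # would still
separate the tokens but would no longer be kept as a token itself.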
In addition, you can standardize the samples generated for the
Pattern report with the Standardize Representative Records option.
When you select this option, the Investigate stage invokes the
Standardize stage. Note that you need to have a Pattern-Action file
and a Dictionary file in your rule set. See Chapter 10, “Defining
Standardize Stages”, and Appendix C, “Rule Set Files”,
for details.
Important: You must order the fields in the Standard Fields list in the
order used by the rule set. For example, for the Place rule set,
the field order must be: city, state, ZIP; and for the Names
rule set, the order must be: first name, middle, last name,
suffix.
4. When all desired fields for the rule set are listed, click Add Rule.
The rule set and fields to be standardized appear under Scheduled
Processes.
5. To change the investigation options, click Advanced Options.
The Advanced Options dialog box appears.
Note: To reset Separator List and the Strip List to the supplied
special characters, click Restore Defaults.
7. Click Finish.
Setting up the file structure
If this is the first job that you run in the project, you must set up the
file structure for the project. See Chapter 7, “Deploying Jobs”, for
instructions.
Once your data files are available to QualityStage, run the Investigate
stage:
How to run the stage
1. On the left pane of the QualityStage main window, select a Jobs
folder.
2. From the jobs list on the right pane, select the job you want to run.
3. Do one of the following:
The Job Run Options dialog box appears. It looks like this:
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
The File Mode Execution dialog box lists all the stages in the job. You
must run the same stages that you selected when deploying the job.
See “Deploying Jobs in File Mode” on page 7-7 for more information.
By default, all stages listed are run from first to last. However, you
can select a subset of stages to run. For information about how to
select a subset, see “To select a subset” on page 8-7.
When the run has successfully finished, a confirmation message
appears.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
Figure 10-1 Where you are in Phase Three: Design and Develop the
Data Re-engineering Application
Conditioning the input data ensures that each type of data has the
same type of content and format, that is, that it is internally consistent.
Conditioned data is also called standardized data. Standardized data
is important for:
• Effectively matching data (step two)
Free-form fields: Free-form fields can contain any alphanumeric
information of any length that is less than or equal to the maximum
field length defined for that field.
For example, an address field might contain address data that
includes numbers, letters, and special characters, such as 53 Main St.
#301, or 1416 West Road.
Fixed-formatted fields: Fixed-formatted fields, on the other hand,
contain only one specific type of information, such as only numeric,
only character, or only alphanumeric, in a specific format.
For example, a date of birth field such as 01/29/55 and a social security
number field such as 123-33-1234 both include numbers and special
symbols that appear in a specific format.
The Standardize stage parses both field types into single-domain
fields. This creates a consistent representation of the input data,
corrects any misspellings, and incorporates business and industry
standards.
This chapter explains:
• How to use the Standardize stage included with QualityStage
• How to create your own Standardize stages
It assumes that you have already prepared and specified the input
data files as described in Chapter 9, “Defining Investigate Stages”.
To correctly parse and identify each element or token, and place them
in the appropriate field in the output file, Standardize uses rule sets
that are designed to meet the name (individual and business) and
address conventions of a specific country. To see a list of country rule
sets available with QualityStage, scroll down the list of Available Rule
Sets in the Standardize Wizard – Command definition dialog box.
Additionally, the Standardize rule sets can standardize the
representation of any data, and append additional information from
the input data, such as sex.
The Standardize rule sets are the same as those used in the
Investigation process. You can run these rules out of the box or
customize them to handle data challenges not covered by the standard
rule sets.
Using a Standardize stage requires that you:
(Figure: an input file of U.S. records is processed by the U.S. Domain
Pre-Processor rule set, producing an intermediate file.)
These rule sets do not perform standardization but parse the fields in
each record and filter each token into one of the appropriate
Domain-Specific column sets, which are Name, Area, or Address.
Important: If you fail to enter at least one metadata delimiter for the input
record, you receive an error message in the output file.
See “Validation Rule Sets” on page D-14, for more information on the
standardized data structures output from the Validation rule sets.
Standardized Results
At the end of a run, Standardize:
• Creates a fixed-format file and
• Adds the fields to the data file definition
Depending on the type of rule set, each field contains one data element
from the input file. There may be additional data, such as SOUNDEX
phonetic or NYSIIS codes. You can use any of the additional data for
blocking and matching fields with Match or other matching jobs.
Using QualityStage Designer, you have the following options:
• Appending the input record to the end of the standardized output
record
• Appending none of the input fields
• Appending selected input fields as defined in the input data file
definition
Rules Overrides
The rule sets provided with QualityStage are designed to provide
optimum results. However, if the results are not satisfactory, you can
modify rule set behavior using rule override tables.
See Appendix E, “Customizing and Testing Rule Sets”, for information
about how to use override tables.
Inserting Literals
If the input records do not include critical entries, you can insert the
required values as a literal, which will appear in the output file. You
insert the literal using the Standardize Command Definition dialog
box as described in “Creating a Standardize Stage” on page 10-13.
For example, the input records lack a state entry because all records
are for the state of Vermont. To include the state in the standardized
records, you would insert the literal VT between the city name and the
ZIP code.
If input records have an apartment number field containing only an
apartment number, you could insert a # (pound sign) literal between
the unit type and the unit value.
Delimiter Literals
You must insert field delimiters using literals for the Domain
Pre-Processor rule sets. A delimiter literal does not appear in the
output record. Add a metadata delimiter literal in front of at least one
field.
The delimiters are:
Delimiter Description
ZQNAMEZQ Name delimiter
ZQADDRZQ Address delimiter
ZQAREAZQ Area delimiter
Important: We strongly suggest that you enter a delimiter for every field or
group of fields in a record.
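For example, with illustrative input fields NAME1, ADDR1, and CITY,
you might insert ZQNAMEZQ in front of NAME1, ZQADDRZQ in front of
ADDR1, and ZQAREAZQ in front of CITY, so that the Domain
Pre-Processor rule set knows which domain each group of fields
belongs to.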
Important: Do not add any fields to this data file definition. Standardize adds
the appropriate fields when you run the job.
5. (Optional) If you are using the Domain-Specific Name rule set, you
can choose the following process options:
This option is useful if you know what types of names your input
file contains. For instance, if you know that your file mainly
contains organization names, specifying Process All as
Organization enhances performance by eliminating the processing
steps of determining the name’s type.
6. (Optional) If you want to specify special case formatting rules,
select the With Case Formatting check box.
7. When all desired fields for the rule set are listed, click Add Rule.
If you selected the With Case Formatting check box, the Case
Formatting Options dialog box appears. For details about
specifying case formatting, see “Specifying Case Formatting
Options” on page 10-22.
Note: The order in which the rule sets appear in a scheduled process is
the order in which Standardize processes the fields.
A rule set can be scheduled only once in a Standardize process; for
example, you can specify only one USNAME rule set, one USAREA
rule set, and one USADDR rule set.
8. For additional rule sets, repeat step 1 through step 7.
9. (Optional) If you want to modify the case formatting rules for a
scheduled process, select it and click Edit Case Formatting
Options. The Case Formatting Options dialog box appears (see
“Specifying Case Formatting Options” on page 10-22).
10. When all rule sets are defined, do one of the following:
a. Click Finish if you selected Append All or No Append.
b. Click Next:
• If you selected Custom Append, the Append Field
Selection dialog box appears (see “Using the Append Field
Selection Dialog Box” on page 10-20).
• If you selected Map Input Fields for Report Use, the Data
Selection for Reports dialog box appears (see “Using the
Data Selection for Reports Dialog Box” on page 10-21).
1. Select the field to be appended to the output file and click Add to
Append Fields.
1. Select the field to serve as the record key for reporting, and then
click the add button.
The name of the field appears under RECKEY.
2. Repeat step 1 for up to six additional fields.
3. When all desired fields are selected, click Finish.
Name Description
UPPERALL UPPERCASE ALL
PRESERVE Preserve the case
LOWERALL lowercase all
UPPEREACHWORD Capitalize Every Word
CITY Case formatting of city names
ENGEN General rule for English: lowercasing of common
words plus capitalization of all others
NAMES Case formatting of personal names
TRADE Case formatting of companies and trademark names
You can also create your own customized case formatting rule sets.
For information about case formatting rule sets, see Chapter 1 of the
QualityStage Stages Reference Guide.
Setting up the file structure: If this is the first job that you run in the
project, you must set up the file structure for the project. See
Chapter 7, “Deploying Jobs”, for instructions.
Once your data files are available to QualityStage, run the
Standardize stage:
How to run the job:
1. On the left pane of the QualityStage main window, select a Jobs folder.
2. From the jobs list on the right pane, select the job you want to run.
3. Do one of the following:
The Job Run Options dialog box appears.
When the job finishes running, a confirmation message appears.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
The File Mode Execution dialog box lists all the stages in the job. You
must run the same stages that you selected when you deployed the
job. See “Deploying Jobs in File Mode” on page 7-7 for more
information.
By default, all stages listed are run from first to last. However, you
can select a subset of stages to run. For information about how to
select a subset, see “To select a subset” on page 8-7.
See “About Deploying and Running Jobs” on page 7-1 for more
information about deploying and running jobs.
b. (Optional) Select Prepare Report Data.
This specifies that prepared report data output will be put in
the Data directory for the project.
c. (Optional) Select Retrieve Report Data, and specify the
maximum file size to retrieve.
The output file will be copied to the location specified in the
run profile for local report data.
See Chapter 15, “Working with QualityStage Reports”, for more
information about preparing data for formatted reports.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
1. Run a Standardize job using the Country Identifier rule set, which
creates an intermediate file in which an ISO country code is
appended to each record.
2. Run a job using the Select stage to create an intermediate file that
contains records from a single country. The Select stage can use
the ISO country code field to create this file.
3. Run the Standardize job using the appropriate Domain
Pre-Processor rule set with the intermediate file.
4. Complete standardization of the file by running the Standardize
job using the appropriate Domain-Specific rule sets.
Flag Description
Y The rule set was able to identify the country.
N The rule set was not able to identify the country and used the default
value that you set as the default country delimiter.
After you create this output file, you can use a Select stage to create a
file comprising only one country, which can be used with a Domain
Pre-Processing rule set for the appropriate country.
(Figure: a Standardize job using the Country Identifier rule set assigns
each record in the input file a two-character ISO country code; the
resulting intermediate file contains the input records with their ISO
country codes and is split into ACCEPT and REJECT outputs.)
Important: If you fail to specify a country delimiter, the word ERROR appears
at the beginning of every line in each output record.
Note that the tree contains the rule set tables and files (CLS, PRC,
PAT, and DCT), and the Override tables.
This name also becomes the name of the directory where the files
reside.
4. Click OK.
QualityStage creates a directory in the rule set directory and adds
four empty files with the same name and the extensions: PRC,
PAT, DCT, and CLS.
5. Edit the rule set files to make the necessary changes.
Optionally, you can create a new rule set by copying an existing one.
When you copy a file, you create a copy of the file within its rule set
with a new name. When you move a file, you move it from one rule set
to another.
Tip: To copy, rename, or delete a rule set or file, you can also
right-click the file, and then select the desired action from the
shortcut menu.
Defining Multinational
Standardize Stages
City-Level Standardization
For more than 200 countries, the job does the following:
• Separates street-level address information from city-level
information (if necessary)
• Assigns City, Locality, Province/State, Postal Code, and Postal
Code add-on (ZIP4) to separate fields
• Assigns ISO country codes (2 and 3 byte versions)
Street-Level Standardization
For more than 50 countries, the job does the following in addition to
city-level standardization:
• Separates street information into discrete fields, including house
number, street name, and so on.
• Assigns floor and unit information to the Secondary Address
Information field.
• Assigns building, contact, and address type to their respective
fields.
• Assigns unhandled address information to a field.
• Assigns unprocessed street address information to the Unprocessed
Address field when no address standardization rules are available.
For a complete description of the fields in the Multinational
Standardize output file, see “Multinational Standardize Output
Fields” on page 11-15.
Requirements
Input files must be fixed-field, fixed-record-length data files. The total
line length of any input record can be no greater than 4096 columns;
the address data must occur within the first 3072 columns.
Each record must contain a country indicator, which may be the full
spelling, an abbreviation, or the 2- or 3-byte ISO country code (see “ISO
Country Codes” on page F-1). If the country indicator does not match
the expected country-level or street-level formats for the indicated
country, the data is not standardized and is output as unhandled. For
example, if the record identifier is U.S. and the address format is that
of France, the record is not standardized.
Recommendations
We strongly suggest that you use a preprocessor to remove any
nonaddress or noncontact data from the address fields. Any
information other than address information is not handled by the
standardization rules. This extraneous data should be removed from
the file before you run the job.
Addresses should include the following information:
• Street address
• City
• State or Province
• Postal code
• Country code or name (required)
See the next section, “Input Field Configuration”, for how this
information can be organized into fields in the input file.
For information about creating new jobs, see “Creating a New Job” on
page 6-4. For information about adding stages to jobs, see “Adding
Existing Stages to a Job” on page 6-6.
After adding your Multinational Standardize stage to a job, you can
run it.
Setting up the file structure: If this is the first job that you have run in
the project, you need to set up the file structure for the project. See
Chapter 7, “Deploying Jobs”, for instructions.
Once your data files are available to QualityStage, run the
Multinational Standardize stage:
How to run the job:
1. On the left pane of the QualityStage main window, select a Jobs folder.
2. From the jobs list on the right pane, select the job you want to run.
3. Do one of the following:
The Job Run Options dialog box appears.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
The File Mode Execution dialog box lists all the stages in the job. You
must run the same stages that you selected when you deployed the
job. See “Deploying Jobs in File Mode” on page 7-7 for more
information.
By default, all stages listed are run from first to last. However, you
can select a subset of stages to run. For information about how to
select a subset, see “To select a subset” on page 8-7.
See “About Deploying and Running Jobs” on page 7-1 for more
information about deploying and running jobs.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13.
You can view the output file you defined in your job as described in
“Using the QualityStage Data File and Report Viewer” on page 16-1.
Note that you can use the QualityStage report as described in the
“Standardization WAVES/Multinational Report” on page 15-28.
Multinational Standardize output fields (length in bytes and start position):

City-level domain:
  City: length 28, start 1.
  Neighborhood/Locality: length 40, start 29.
  State/Province: length 3, start 69.
  Postal Code: length 10, start 72. ZIP for U.S.
  ZIP4 or Additional Sorting/Routing Information: length 4, start 82.
  ISO Country Code (alpha 2): length 2, start 86. See “ISO Country Codes” on page F-1.
  ISO Country Code (alpha 3): length 3, start 88. See “ISO Country Codes” on page F-1.
  City NYSIIS: length 8, start 91. NYSIIS phonetic spelling; may be used in matching for deduplication.
  City RSNDX: length 4, start 99. Reverse SOUNDEX spelling; may be used in matching for deduplication.
  reserved: length 1, start 103.
  Area Verification Indicator: length 1, start 104. Used in QualityStage WAVES only.
  Area Match Pass: length 1, start 105. Used in QualityStage WAVES only.

Street-level domain:
  House #: length 15, start 106. Includes prefix # and suffix #.
  Prefix Directional: length 3, start 121. Such as N for North, S for South; language-specific for each country.
  Prefix Type: length 20, start 124. Includes highway and route prefixes, such as Rue de la Mer, where Rue is the prefix.
  Street Name: length 35, start 144.
  Suffix Type: length 15, start 179. Such as Ave., Rd.; language-specific for each country.
  Suffix Directional: length 3, start 194. Such as N for North, S for South; language-specific for each country.
  Box Type: length 15, start 197. Such as P.O. Box, Box; language-specific for each country.
  Box Value: length 10, start 212. Alphanumeric.
  Secondary Address Info: length 50, start 222. Unit, floor, and multi-unit information.
  Building Name: length 30, start 272. For example, Empire State Building, Rockefeller Plaza.
  Contact Info: length 60, start 302. Includes attention, C/O, and department information.
  Address Type Indicator: length 2, start 362. S = Street only, B = Box info, L = Building info, G = General Delivery, Y = Secondary address info, O = Other.
  Unhandled Text: length 50, start 364. Any information that does not conform to an expected address output field.
  reserved: length 10, start 414.
  reserved: length 10, start 424.
  Address Verification Indicator: length 1, start 434. Used in QualityStage WAVES only.
  Address Match Weight: length 7, start 435. Used in QualityStage WAVES only.
  Address Match Pass: length 1, start 442. Used in QualityStage WAVES only.
  reserved: length 6, start 443.
  Unhandled Address: length 150, start 449. Address data not standardized.
  NYSIIS of Street Name Root: length 8, start 599. The phonetic spelling of the street name root; for example, for Rue de la Mer, the root is Mer. This information can be used in matching for deduplication.
  Appended Input file: length n, start 607. The original input file.
Figure 12-1 Phase Three: Design and Develop the Data Re-engineering
Application
Matching data helps you identify duplicate entities within one or more
files, which you need to know later on, in Step Three. This step also
lets you establish cross-reference linkage and enrich existing data
with new attributes from external sources.
For the first two processes, the Match stage uses two files: FileA and
FileB. For the third process, the Match stage uses only one data file,
FileA.
One-To-One Matching
For the one-to-one matching process, the input files can be either
FileA or FileB. This matching process identifies all records on one file
that correspond to a record for the same individual, event, household,
street address, etc., on the second file. Only one record on FileB can
match a single record on FileA, because you are matching individual
events.
Many-To-One Matching
For the many-to-one matching process, multiple records on FileA can
match a single record on FileB. With these matching jobs, FileB is
considered a reference file. An example of many-to-one matching is
matching a transaction file to a master file, where you can have many
transactions for one person on the master file.
Another example of many-to-one matching processes is geographic
coding. These processes match a file containing street addresses either
to a Post Office ZIP code file to obtain ZIP codes, or to a Census Tiger
file to obtain latitude-longitude coordinates or census tract
information.
When matching, one or more fields on FileA must have equivalent
fields on FileB. For example, if you want to match on last name and
age, both FileA and FileB must have a field for last name and a field
for age. The location and length of the fields can be different in the two
files.
Blocking Phase
The blocking phase limits the number of record pairs being examined,
increasing the efficiency of the matching. This phase creates a subset
or block of records that have a high probability of being associated
with or linked to other records during the matching phase. Blocking
identifies pairs of records that have a low probability of matching, so
that they can be ignored during the matching phase.
During the blocking phase, all records having the same value in the
blocking fields are eligible for comparison during the matching phase.
For example, if LAST_NAME is a blocking field, all persons with the
same last name on the two files are included in the block for the
matching phase. Records having different last names are not included.
Matching Phase
After creating a block of records, the Match stage compares fields that
you specified as matching fields to determine the best match for a
record or, in the case of an unduplicating match, the master record
and associated duplicates. The Match stage provides over 20 types of
comparison, which are algorithms based on the type of data in the
fields, such as numeric data versus character strings or parts of street
addresses.
To determine whether a record is a match, the Match stage calculates
a weight for each comparison, according to the probability associated
with each field. The Match stage uses two probabilities for each field:
the m-probability and the u-probability.
For example, a sex field with the values M and F can pair up between
FileA and FileB in four ways, two of which (M with M, and F with F)
agree purely by chance:
FileA FileB
M M
M F
F M
F F
About Weights
For each matching field, the Match stage computes a weight. If the
comparison between a pair of fields agrees, the pair of fields receives
an agreement or positive weight, which is calculated as
log2(m-probability / u-probability). If the comparison disagrees, the
pair of fields receives a disagreement or negative weight, which is
calculated as log2((1 – m-probability) / (1 – u-probability)).
The Match stage sums the weights assigned to each field comparison
and obtains a composite weight. The agreement weight of each field
adds to the composite weight, and the disagreement weight subtracts
from the composite weight; that is, the higher the composite weight,
the greater the agreement.
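For example, assume a last name field with an m-probability of 0.9
and a u-probability of 0.1 (illustrative values). If the last names agree,
the field contributes an agreement weight of log2(0.9 / 0.1), or about
+3.17, to the composite weight; if they disagree, it contributes a
disagreement weight of log2(0.1 / 0.9), or about -3.17.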
About Cutoffs
The composite weights assigned to each record pair create a
distribution of scores that range from very high positive to very high
negative. Within the distribution of positive values, you want to define
a cutoff, referred to as the match cutoff, at which any record pair
receiving a weight equal to or greater than this cutoff is considered a
match.
Conversely, you want to define a lower cutoff, referred to as the
clerical cutoff, at which any record pair receiving a weight equal to or
less than this cutoff is considered a non-match. Any record pairs with
weights that fall between these two cutoff values are considered
clerical review cases.
If more than one record pair receives a composite weight higher than
the match cutoff weight, those records are declared duplicates. The
way in which duplicate records are handled is based on what type of
matching you selected (see “Defining a Match Stage” on page 12-11).
Any record pair that falls below the clerical cutoff becomes a residual
and is eligible for the next matching pass.
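For example, suppose the match cutoff is 9 and the clerical cutoff is 5
(illustrative values). A record pair with a composite weight of 10 is
declared a match, a pair with a weight of 7 becomes a clerical review
case, and a pair with a weight of 3 falls below the clerical cutoff and
becomes a residual.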
About Unduplication
When unduplicating a file, you are essentially grouping records that
share common attributes. You might unduplicate a file to group all
invoices for a customer or merge a mailing list.
With unduplication, the Match stage declares all records with weights
above the match cutoff as a set of duplicates. The Match stage then
identifies a master record by selecting the record within the set that
matches to itself with the highest weight. The master record is
associated with its set of duplicates.
Any records that are not part of a set of duplicates are declared
residuals. These and the master records are generally made available
for the next pass. Duplicates are not included in subsequent passes,
because you want them to belong to only one set.
When selecting fields for matching (including unduplication matches),
you want to include as many fields as possible. You should include all
fields in common on both files for the matching fields. Include fields
that are not very reliable and assign them low m-probabilities.
If you want the Match stage to do exact matching only, specify the
blocking fields and not the matching fields. This results in all record
pairs that agree on the blocking fields being declared matches or
duplicates.
Extracting Data
During the matching job, the Match stage stores the decisions made
on each record pair for each pass. If requested, the Match stage
creates output files of all matched records, of the clerical review
records, and the duplicate and residual records on both FileA and
FileB.
You can customize what output files are created and what records are
written to those files. You can also customize unduplication matches
to create sets of groups.
When you define these files, you need to specify only a name and
description for this file using the Add a New Datafile dialog box.
Important: Do not add any fields to this data file definition. This file
definition does not require or use field definitions.
With the Match stage, you can perform the following types of
matching:
• Match.
Matches a record on FileA to only one record on FileB. Any other
records that match are considered duplicates.
• Match Sets.
Matches duplicate sets of records on FileA with duplicate sets of
records on FileB. For example, if you have two sets of records on
both FileA and FileB for John Doe, the Match stage generates two
matches: one for each matching John Doe.
• Geomatch.
Geographic matching (geocoding) or many-to-one matching, in
which each FileA record can match more than one FileB record.
This type of matching is similar to matching with the Unijoin
stage.
Note: When the number of input records is very small compared to the
number of records in the reference database, we suggest you run
the Geomatch in File Mode. Doing so improves performance.
• Geomatch Multiple.
Multiple records on FileB having the same weight as the matched
pair are flagged as duplicate records. For example, if 101 Main St.
• Match Sets
• Geomatch
• Geomatch Multiple
• Geomatch Duplicates
• Undup
• Undup Independent
6. Under Data File A, select the file to be FileA.
7. If you selected either the Undup or the Undup Independent
option, click Next.
If you selected any other option, select the file to be FileB under
Data File B, and then click Next. Data File A must be a different
file from Data File B.
The Select Match Build Method dialog box appears.
5. Click OK.
2. Click the match specification activity you need. For the defined
Match stage, you can:
• Add or modify passes
• Define variable types
• Define a report
• Define an extract
You can also delete passes and rearrange the order in which they
are executed. The first pass listed becomes Pass 1, the second Pass
2, the third Pass 3, and so on.
Tip: By expanding the passes, you can view the blocking variables
assigned to each pass.
Defining a Pass
Defining a pass involves specifying which fields on your data files are
to be used for blocking and which for matching. All passes require that
you define the blocking fields. However, if you want exact matching,
you need to specify only your blocking fields.
Tip: Generally, you do not want to use this pass’s blocking fields as
matching fields. You can use blocking fields from other passes as
matching fields on this pass.
Cutoff Description
Match When a record pair receives a composite weight greater than or
equal to this weight, the pair is declared a match.
Clerical When a record pair receives a composite weight greater than or
equal to this weight and less than the Match cutoff, the pair is
declared a clerical review. This weight is equal to or less than
the Match cutoff weight.
If you do not want a clerical review, set the Clerical cutoff equal
to the Match cutoff.
Duplicate The lowest weight that a record pair can have to be considered
a duplicate. This cutoff weight is optional and must be higher
than the Match cutoff weight. Note that this cutoff is not used
with Undup.
Initially you could try setting the cutoffs very low, such as zero (0).
Run a match for the pass and generate a report ordered by weight.
The high weight matches are the best. As the weight goes down, your
confidence in the match should decrease. Assign cutoffs at a weight
that is appropriate for your project.
Using Arrays
The Match stage provides the ability to compare arrays of fields on
FileA to arrays of fields on FileB or arrays of fields on a single file for
an unduplication run. An array can comprise any number of fields,
including one. To use arrays, you need to define them (see “Defining
Arrays” on page 4-27).
Using arrays allows you to reduce the number of cross comparisons
you would have to define. For example, if you have first name,
middle name, and last name fields whose values might appear in any
order (such as a first name in the last name field), arrays compare all
names without regard to order.
Array matching is available with some comparisons. When you select
a comparison that supports array matching, the Array option is
available in the Match Wizard – Match Pass dialog box.
When calculating the weights for an array, the Match stage never
allows the weight for the array to exceed the weight that would result
if a single field were compared. This keeps the weights for array
comparisons from dominating weights of single fields.
3. Enter one, more than one, or all five of the weight overrides by:
Next to Enter
Agreement Weight (AW) An agreement weight if the values for the
field agree and are not missing.
Disagreement Weight (DW) A disagreement weight if the values for
the field disagree and are not missing.
FileA Missing Weight (AM) A weight when the value on FileA is
missing.
FileB Missing Weight (BM) A weight when the value on FileB is
missing.
Both Missing Weight (XM) A weight when values are missing on
both files.
You can specify negative values; if you are adding to the weights, a
negative value subtracts points from the calculated weight for the
field.
4. (Optional) Specify a value in the Conditional FileA Value or the
Conditional FileB Value, or both, as described here:
Next to Enter
Conditional FileA Value (AV) The value, enclosed in single quotes ('), in a field on FileA, or the word ALL.*
Conditional FileB Value (BV) The value, enclosed in single quotes ('), in a field on FileB, or the word ALL.*
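For example (field value for illustration only), entering 'F' as the
Conditional FileA Value applies the weight override only to record
pairs in which the FileA field contains the value F; the word ALL
applies the override regardless of the field value.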
Defining Vartypes
You can assign a field or an array special treatment, such as
specifying that a disagreement on the field would cause the record
pair automatically to be considered a nonmatch. You can assign a field
more than one special treatment. This treatment applies to all passes
for this match job. To define Vartypes:
Action Description
CLERICAL A disagreement on the field causes the record pair automatically to be considered a clerical review case, regardless of the weight.
CLERICAL MISSINGOK A missing value does not force the record pair into clerical review, but a disagreement does.
CRITICAL A disagreement on the field causes the record pair automatically to be considered a non-match.
CRITICAL MISSINGOK Missing values on one or both fields are acceptable; that is, the record pair is automatically rejected only if there is a non-missing disagreement.
NOUPDATE The m-probability is not changed after running mprob.
NOFREQ A frequency analysis is not run on the field. Used when a field has unique values, such as Social Security number.
CONCAT Concatenates up to four fields to form one frequency count.
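For example, assigning the CRITICAL action to a social security
number field (an illustrative choice) causes any record pair whose
social security numbers disagree to be rejected as a non-match, no
matter how well the remaining fields agree.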
Setting up the file structure: If this is the first job that you have run in
the project, you need to set up the file structure for the project. See
Chapter 5, “Setting Up Run Profiles”, for instructions.
Once your data files are available to QualityStage, run the Match job:
How to run the job:
1. On the left pane of the QualityStage main window, select a Jobs folder.
2. From the jobs list on the right pane, select the job you want to run.
3. Do one of the following:
The Job Run Options dialog box appears.
Match job options: When you run a Match job, a screen of
Match-specific options appears.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
The File Mode Execution dialog box lists all the stages in the job. You
must run the same stages that you selected when deploying the job.
See “Deploying Jobs in File Mode” on page 7-7 for more information.
By default, all stages listed are run from first to last. However, you
can select a subset of stages to run. For information about how to
select a subset, see “To select a subset” on page 8-7.
Match reports and extracts:
b. (Optional) Select the Report check box if you want to generate
a Match report.
If you select Report, select the passes for which you want to
run a report. Undup Independent reports only the last pass
run.
c. (Optional) Select the Extract check box if you want to generate
a Match extract.
If you have previously deployed and run your Match stage, you
can select Report, Extract, or both. Otherwise you must deploy
and run the job to generate a report or extract.
Once you have run your Match stage, you can generate a
report or an extract any time without also deploying and
running the stage.
See “Working with Match Reports” on page 13-1 for more
information about Match reports and extracts.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
Buffer sizes: The Match stage uses, by default, 2000 buffers of
1024 bytes each (about 2 megabytes) of memory or disk space,
depending on the server, for most of its processing. You might want to
decrease the number of buffers if your server has little memory, or
increase it if you have very large data files and the server has
extensive memory.
You might want to review the statistic summary section of the
statistics report to determine if you need to alter the number of buffers
allocated. See the “Summary Statistics Section” on page 13-30.
Frequency file size: By default, QualityStage includes up to 100 entries in a frequency file,
which means that for any field requiring frequency analysis, the 100
most frequent occurrences are included in the frequency file. You can
use the Maximum Frequency Entry field to increase the maximum
number of entries. You may want to do this if you are processing large
numbers of records.
Presort output location: The Match stage sorts the data file before
each match pass. If you have limited space where your data files are
located, you can specify another location for the sorted files.
Match debug file: You can request the statistics report by specifying
the base file name in the Match Debug File field. QualityStage
appends an underscore ( _ ), the pass number, and .OUT to this file
name, which is created in the Script directory. You can specify a full
path if you want the report in a different location.
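For example, an illustrative base name of matchdebug produces
matchdebug_1.OUT for pass 1 and matchdebug_2.OUT for pass 2 in
the Script directory.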
Important: QualityStage only imports XML Match Specs; it does not produce
them.
1. In the Select Match Build Method dialog box, click Match Spec
Library.
2. In the Match Spec Library dialog box, select the Match spec you
want to use to create your match.
Limitations
All characters that are used by the XML language must be escaped,
for example <> characters.
XML Overview
The table below describes the most important elements in the
QualityStage XML Match Spec.
Element Description
MATCHSPEC This is the main element in the DTD; it defines all of
the parameters in the match.
MATCHPASS This element defines the parameters for a particular
match pass. If you have multiple passes, you may have
multiple entries for this element.
BLOCKSPEC This element describes the blocking for a particular
match pass.
MATCHCOMMAND This element describes all of the parameters of the
match itself.
VARTYPE This element describes the parameters for the vartype
for the match.
4. Edit the necessary elements so that the XML Match Spec reflects
the match you want to create.
XML DTD
<!ELEMENT MATCHSPEC (MATCHDESCRIPTION, MATCHPASS+, VARTYPE*) >
<!ATTLIST MATCHSPEC UNDUP (Y | N) #REQUIRED>
<!ATTLIST MATCHSPEC TWOFILE (Y | N) #REQUIRED>
<!ELEMENT MATCHDESCRIPTION (#PCDATA) >
<!ELEMENT MATCHPASS (DESCRIPTION, BLOCKSPEC+, MATCHCOMMAND+, MATCHCUTOFF, CLERICALCUTOFF, DUPLICATECUTOFF?) >
<!ELEMENT DESCRIPTION (#PCDATA) >
<!ELEMENT BLOCKSPEC (FIELD1, FIELD2?) >
<!ATTLIST BLOCKSPEC BSCOMPARE (CHARACTER | NUMERIC) #REQUIRED>
<!ELEMENT MATCHCOMMAND (MCCOMPARISON, (FIELD1 | ARRAYFIELD1)+, (FIELD2 | ARRAYFIELD2)*, MPROB, UPROB, PARAM1?, PARAM2?, REVERSE?, WEIGHTOVERRIDE*) >
<!ATTLIST MATCHCOMMAND MCMODE (NONE | ZEROVALID | ZERONULL | EITHER | BASEDPREV) #REQUIRED >
<!ELEMENT MATCHCUTOFF (#PCDATA) >
<!ELEMENT CLERICALCUTOFF (#PCDATA) >
<!ELEMENT DUPLICATECUTOFF (#PCDATA) >
<!ELEMENT FIELD1 (#PCDATA) >
<!ELEMENT FIELD2 (#PCDATA) >
<!ELEMENT ARRAYFIELD1 (#PCDATA) >
<!ELEMENT ARRAYFIELD2 (#PCDATA) >
<!ELEMENT MCCOMPARISON (#PCDATA) >
<!ELEMENT MPROB (#PCDATA) >
<!ELEMENT UPROB (#PCDATA) >
<!ELEMENT PARAM1 (#PCDATA) >
<!ELEMENT PARAM2 (#PCDATA) >
<!ELEMENT REVERSE (#PCDATA) >
<!ELEMENT WEIGHTOVERRIDE (AW?, DW?, AV?, BV?, AM?, BM?, XM?) >
<!ATTLIST WEIGHTOVERRIDE WOTYPE (ADD | REPLACE | MULTIPLY) #REQUIRED>
<!ELEMENT AW (#PCDATA) >
<!ELEMENT DW (#PCDATA) >
<!ELEMENT AV (#PCDATA) >
<!ELEMENT BV (#PCDATA) >
<!ELEMENT AM (#PCDATA) >
<!ELEMENT BM (#PCDATA) >
<!ELEMENT XM (#PCDATA) >
<!ELEMENT VARTYPE ((FIELD1 | ARRAYFIELD1)) >
<!ATTLIST VARTYPE ACTION (CRITICAL | CRITICALMISSINGOK | CLERICAL | CLERICALMISSINGOK | NOFREQ | NOUPDATE | CONCAT) #REQUIRED>
Element Description

<!ELEMENT MATCHSPEC (MATCHDESCRIPTION, MATCHPASS+, VARTYPE*) >
    This is the match spec object. It defines all of the parameters of the match.

<!ATTLIST MATCHSPEC UNDUP (Y | N) #REQUIRED>
    Defines whether the match is an unduplication. This attribute cannot have an N value if the TWOFILE attribute also has an N value. All other combinations of the UNDUP and TWOFILE attributes are valid.

<!ATTLIST MATCHSPEC TWOFILE (Y | N) #REQUIRED>
    Defines whether the match is a single-file match or a two-file match. This attribute cannot have an N value if the UNDUP attribute also has an N value. All other combinations of the UNDUP and TWOFILE attributes are valid.

<!ELEMENT MATCHDESCRIPTION (#PCDATA) >
    Defines the match description. The description can have any desired length. However, we recommend less than fifty characters, because you will use this description to select the match in QualityStage. You cannot include characters in this description that are used in XML, such as <, >, and &.

<!ELEMENT MATCHPASS (DESCRIPTION, BLOCKSPEC+, MATCHCOMMAND+, MATCHCUTOFF, CLERICALCUTOFF, DUPLICATECUTOFF?) >
    Describes the parameters of the match pass.

<!ELEMENT DESCRIPTION (#PCDATA) >
    Defines the match pass description. This description is automatically truncated by QualityStage to under forty characters. You cannot include characters in this description that are used in XML, such as <, >, and &.

<!ELEMENT BLOCKSPEC (FIELD1, FIELD2?) >
    Describes the blocking fields for the match.

<!ATTLIST BLOCKSPEC BSCOMPARE (CHARACTER | NUMERIC) #REQUIRED>
    Defines the blocking comparison type.

<!ELEMENT MATCHCOMMAND (MCCOMPARISON, (FIELD1 | ARRAYFIELD1)+, (FIELD2 | ARRAYFIELD2)*, MPROB, UPROB, PARAM1?, PARAM2?, REVERSE?, WEIGHTOVERRIDE*) >
    Defines the match command parameters: the comparison type, the fields or arrays compared, and other matching parameters.

<!ATTLIST MATCHCOMMAND MCMODE (NONE | ZEROVALID | ZERONULL | EITHER | BASEDPREV) #REQUIRED >
    Defines the mode for comparison types that require a mode to be defined.

<!ELEMENT MATCHCUTOFF (#PCDATA) >
    Defines the cutoff number for the match.

<!ELEMENT CLERICALCUTOFF (#PCDATA) >
    Defines the clerical cutoff number for the match.

<!ELEMENT DUPLICATECUTOFF (#PCDATA) >
    Defines the duplicate cutoff number for the match.

<!ELEMENT FIELD1 (#PCDATA) >
    Defines a field from the first match file.

<!ELEMENT FIELD2 (#PCDATA) >
    Defines a field from the second match file. This element is not required for single-file matches.

<!ELEMENT ARRAYFIELD1 (#PCDATA) >
    Defines an array from the first match file.

<!ELEMENT ARRAYFIELD2 (#PCDATA) >
    Defines an array from the second match file. This element is not required for single-file matches.

<!ELEMENT MCCOMPARISON (#PCDATA) >
    Defines the comparison type of the match command.

<!ELEMENT MPROB (#PCDATA) >
    Defines the m-probability for the match. This must be an integer value: the value you would enter in QualityStage multiplied by 1000. For example, to enter an m-probability of .9, you would include a value of 900.

<!ELEMENT UPROB (#PCDATA) >
    Defines the u-probability for the match. This must be an integer value: the value you would enter in QualityStage multiplied by 1000. For example, to enter a u-probability of .01, you would include a value of 10.

<!ELEMENT PARAM1 (#PCDATA) >
    Defines parameter one for the match. This must be an integer value. This element is optional.

<!ELEMENT PARAM2 (#PCDATA) >
    Defines parameter two for the match. This must be an integer value. This element is optional.

<!ELEMENT REVERSE (#PCDATA) >
    Designates a match as a reverse match. This element is optional.

<!ELEMENT WEIGHTOVERRIDE (AW?, DW?, AV?, BV?, AM?, BM?, XM?) >
    Defines the parameters for weight overrides.

<!ATTLIST WEIGHTOVERRIDE WOTYPE (ADD | REPLACE | MULTIPLY) #REQUIRED>
    Defines the weight override type.

<!ELEMENT AW (#PCDATA) >
    Defines the agreement weight parameter. This element is optional.

<!ELEMENT DW (#PCDATA) >
    Defines the disagreement weight parameter. This element is optional.

<!ELEMENT AV (#PCDATA) >
    Defines the conditional FileA value parameter. This element is optional.

<!ELEMENT BV (#PCDATA) >
    Defines the conditional FileB value parameter. This element is optional.

<!ELEMENT AM (#PCDATA) >
    Defines the conditional FileA missing weight parameter. This element is optional.

<!ELEMENT BM (#PCDATA) >
    Defines the conditional FileB missing weight parameter. This element is optional.

<!ELEMENT XM (#PCDATA) >
    Defines the both-files missing weight parameter. This element is optional.

<!ELEMENT VARTYPE ((FIELD1 | ARRAYFIELD1)) >
    Defines the VARTYPE for the match.

<!ATTLIST VARTYPE ACTION (CRITICAL | CRITICALMISSINGOK | CLERICAL | CLERICALMISSINGOK | NOFREQ | NOUPDATE | CONCAT) #REQUIRED>
    Defines the type of action for the VARTYPE.
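For reference, the following minimal Match Spec instance conforms to
this DTD. The comparison name (CHAR), the field names (ZIP,
LASTNAME, SSN), and the probability and cutoff values are
illustrative placeholders only, not values prescribed by QualityStage;
substitute the comparison types and fields defined for your own files.

<MATCHSPEC UNDUP="N" TWOFILE="Y">
    <MATCHDESCRIPTION>Customer to master one-to-one match</MATCHDESCRIPTION>
    <MATCHPASS>
        <DESCRIPTION>Pass 1 block on ZIP</DESCRIPTION>
        <BLOCKSPEC BSCOMPARE="CHARACTER">
            <FIELD1>ZIP</FIELD1>
            <FIELD2>ZIP</FIELD2>
        </BLOCKSPEC>
        <MATCHCOMMAND MCMODE="NONE">
            <MCCOMPARISON>CHAR</MCCOMPARISON>
            <FIELD1>LASTNAME</FIELD1>
            <FIELD2>LASTNAME</FIELD2>
            <MPROB>900</MPROB>
            <UPROB>10</UPROB>
        </MATCHCOMMAND>
        <MATCHCUTOFF>9</MATCHCUTOFF>
        <CLERICALCUTOFF>5</CLERICALCUTOFF>
    </MATCHPASS>
    <VARTYPE ACTION="NOFREQ">
        <FIELD1>SSN</FIELD1>
    </VARTYPE>
</MATCHSPEC>

An MPROB value of 900 and a UPROB value of 10 correspond to an
m-probability of .9 and a u-probability of .01, following the
multiply-by-1000 convention described above.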
The Match stage can perform multiple matching passes. You may find
it useful to evaluate the results of one pass before performing the next
pass.
For example, your plan may be to first perform a pass
that matches on social security number and then perform a pass
that matches on date of birth. However, the results of the first pass
may be sufficient and you don’t need to perform the second pass, or the
results may indicate that you need to choose a different field for the
second pass.
The contents of the reports can help you choose the appropriate action.
2. Under Select Outputs, select the output file for the report type.
4. To use a different output file for another report, select the file from
the Select Outputs list, and then click the report type.
5. After defining all reports, click Next.
The Match Report Specification dialog box appears.
Undup run. You can specify any field that has been defined for the
files, including those not used for blocking or matching.
Statement To
LOW weight Specify the lowest weight to appear on report.
HIGH weight Specify the highest weight to appear on report.
With the following example, the report would show only those records
with weights within the range of 3.0 to 9.0.
LOW 3.0
HIGH 9.0
If you specified either only LOW or only HIGH, the range is
open-ended. If you specify neither, all records are reported. Weights do
not apply to residual records.
Argument Description
"literal" Any character string, which can be mixed case. Must be enclosed in quotation marks (" ").
TO HEADER The header that is printed at the top of each page. Only literals should be moved to the header line. All MOVE Literal statements, when entered through the Add button, are automatically placed in the header.
column The output position of the literal value in the header line. The first column is column 1, and report lines can be 150 characters long. For example, if a literal value “New Report” had a column value of 10, the literal value would start at column 10 in the header.
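For example, assuming the arguments combine in the order shown in
the table, the following statement places the literal New Report
starting at column 10 of the header line:
MOVE "New Report" TO HEADER 10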
Argument Description
fieldname You can move an array or field. To do so, click the Datafield A
or Datafield B button and select the field from the display
dialog box. To insert array fields, click the Arrayfield A or
Arrayfield B button and select the array from the display
dialog box.
You can select any field from FileA and FileB, even if you did
not use that field for blocking or matching.
TO LINEA The line for fields from FileA. This line is not displayed for
DUPA and RESA report types.
TO LINEB The line for fields from FileB. This line is not displayed for
DUPB and RESB report types.
column The output position of the field on line A or line B. The first
column is column 1, and report lines can be 150 characters
long.
length The length of the field to be displayed. The default is the
defined length of the field.
Argument Description
@variable One of the following variables, selected from the Special Variables drop-down list:
    @SET8, @SET9, @SET10 The group identifier. This is a number assigned to each group by Match. Every record in a group receives the same number. See Note on page 13-12.
    @WGT The match weight assigned during the match process. Available only for: MATCH, CLERICAL, DUPA, DUPB.
    @TYPE The type of the record: MA = Match on FileA; MB = Match on FileB; CA = Clerical review on FileA; CB = Clerical review on FileB; DA = Duplicate on FileA; DB = Duplicate on FileB; RA = Residual on FileA; RB = Residual on FileB; XA = Master record in a set of duplicate records (Undup only).
    @RECA8, @RECA9, @RECA10 The record number from FileA. See Note below.
    @RECB8, @RECB9, @RECB10 The record number from FileB. See Note below.
    @EXACT The exact match flag for fields that had values and matched exactly.
    @LR The left/right match flag; only for Geomatch comparisons of double intervals that set the left/right flag.
TO LINEA The location on line A for the variable. This line is not displayed for DUPA and RESA report types.
TO LINEB The location on line B for the variable. This line is not displayed for DUPB and RESB report types.
column The output position of the variable on line A or line B. The first column is column 1, and reports can be 150 characters long.
Note: The number appended to SET, RECA, and RECB indicates the
number of bytes allocated. For example, SET10 uses 10 bytes. It
is strongly recommended that when processing more than 100
million records, you use either 9- or 10-byte variables.
Tip: Generally, you should format LINEA and LINEB statements one
after the other so that you can see how each field compared.
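For example, assuming the MOVE @variable TO LINEA | TO LINEB
column format described under “MOVE-Variable” later in this chapter,
the following statements (column positions for illustration) place the
match weight and the record type side by side on line A:
MOVE @WGT TO LINEA 1
MOVE @TYPE TO LINEA 10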
Argument Description
left-field The left field (such as ZIP or city) from FileB; only
appears if matched to the left interval.
right-field The right field from FileB; only appears if matched to the
right interval.
TO LINEB The line B for the field.
column The output position of the field. The first column is
column 1, and report lines can be 150 characters long.
length The length of the field to be displayed. The default is the
defined length of the field.
The following example moves the left Census Tract ID to the output
line if the type L field matched to the left interval, otherwise the right
Census Tract ID is moved to the same location:
MOVELR LEFT_TRACT RIGHT_TRACT TO LINEB 25 6
which records are duplicates. You use the extract as input for the
Survive stage, as explained in Chapter 14, “Defining Survive Stages”.
When an extract is generated, Match writes the requested records to a
file, which you can use for subsequent stages of your data
re-engineering project.
You can generate an extract any time after you have staged and run
your Match job.
Match requires an extract specification that defines the content of the
extract file. QualityStage provides a default specification. Optionally,
you can create a customized specification for a job.
Match generates an extract file of any or all passes of a match run
when you either:
• Specify Extract in the File Mode Execution dialog box (when
running in file mode)
• Define a custom extract and run the job in data stream mode.
This section describes the default extract specifications and provides
instructions on customizing extract files.
Note: You can generate default extracts in file mode only. You can
generate custom extracts in file mode or data stream mode.
2. Under Select Outputs, select the output file for the extract type.
4. To use a different output file for another extract, select the file
from the Select Outputs list, and then click the extract type.
5. After defining all extracts, click Next.
The Match Extract Specification dialog box appears.
Creating Statements
You can easily generate a statement by selecting various options
available under the Enter the Data area at the top of the dialog box.
Depending on the statement type you select from the Statements list,
additional drop-down lists and/or text boxes display for each argument
you must specify for the selected statement.
To generate a statement:
1. In the Statement text box, enter the statement and its arguments.
2. Click Insert to add it to the statement list at the bottom of the
screen.
Maintaining Statements
To remove a statement from the statement list at the bottom of the
screen:
Argument Description
"literal" Any character string, which can be mixed case. Must be
enclosed in quotation marks (" ").
column The output Column position of the header on the report. The
first column is column 1, and report lines can be 150
characters long.
MOVE-Variable
The MOVE-Variable statement sets up the record format by moving
special variables to the output record. The MOVE-Variable statement
uses the following format:
MOVE @variable TO LINEA | TO LINEB column
Argument Description
@variable One of the following variables, selected from the Special Variables drop-down list:
@SEQNUM A unique sequence number for this match. In Geomatch runs, each match receives a unique sequence number.
Position The location on Line A or Line B for the variable. This line is not
displayed for DUPB and RESB extract types.
Column The output Column position of the variable on Line A or Line B. The
first column is column 1, and reports can be 150 characters long.
Note: The number appended to SET, RECA, and RECB indicates the
number of bytes allocated. For example, SET10 uses 10 bytes. It
is strongly recommended that when processing more than 100
million records, you use either 9- or 10-byte variables.
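For example, the following statement (column position for
illustration) writes the unique sequence number for the match at the
start of line A of each extract record:
MOVE @SEQNUM TO LINEA 1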
MOVE-FieldName
The MOVE-FieldName statement sets up the record format by moving
field values to the output record’s Line A. The MOVE-FieldName
statement uses the following format:
Argument Description
fieldname You can move an array or field. To do so, click the Datafield A
or Datafield B button and select the field from the display
dialog box. To insert array fields, click the Arrayfield A or
Arrayfield B button and select the array from the display
dialog box. You can select any field from FileA and FileB, even
if you did not use that field for blocking or matching.
You can also select array fields from FileA or FileB.
OF A FileA for the fieldname.
OF B FileB for the fieldname.
column The output Column position of the field on line A. The first
column is column 1, and report lines can be 150 characters
long.
length The Length of the field to be displayed. The default is the
defined length of the field.
Note: FileA and FileB move the data, but not the field definitions. If you
want these field definitions for the Survive stage, you need to
copy the field definitions before you run the extract.
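For example, assuming the arguments combine in the order listed in
the table and using an illustrative field name, the following statement
writes 30 bytes of the FileA field CUSTNAME starting at column 1 of
line A:
MOVE CUSTNAME OF A TO LINEA 1 30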
MOVELR
The MOVELR statement moves a field to the output record’s Line B,
depending on whether the left or right flag is set from any interval
comparison for parity match. You use this statement when executing
Geomatch (or Match) runs against a Census Bureau Tiger reference
file or similar file. The MOVELR statement uses the following format:
Argument Description
left-field The Left Field from FileB only if matched to the left interval.
right-field The Right Field from FileB only if matched to the right
interval.
column The output Column position of the field on line B. The first
column is column 1, and report lines can be 150 characters
long.
length The Length of the field to be displayed. The default is the
defined length of the field.
Statement Description
MOVEALL OF A Moves the entire FileA record to the output file.
MOVEALL OF B Moves the entire FileB record to the output file.
Item Description
VALUE The value whose weights are reported on this
line.
FREQ The number of times this value has occurred
on both files.
MAgree The number of times the value has agreed in
matched pairs. For the initial run, this value
is zero.
MPart The number of times the value has
participated in a matched pair. For the initial
run, this value is zero.
mprob The calculated m-probability used for this
value. The m-probability is the accuracy of
the particular value; for example, 0.9 means
there is a 10% error rate for this value in the
matched records.
uprob The calculated u-probability used for this
value. This probability is based on the
frequency counts.
type How the m-probability is determined; for
example, D means it is the default, derived
from a user-supplied value.
AGR WGT The agreement weight, the weight assigned
to a match on this value. The rare values
have higher weights than the more common
values.
DIS WGT The disagreement weight, the weight
assigned if the value appears on one of the
files and disagrees. Since two values are
involved in a disagreement, the matcher uses
the value from the table with the highest
frequency (most common value).
MISS WGT If one or both values are missing, the missing
weight is used. This is the midpoint between
the global agreement and disagreement
weights. This is printed before the individual
values for the variable are listed.
Histogram Section
Two histograms of the matching results are printed at the end of each
run. The first histogram shows the distribution of weights for all
comparisons. The second histogram shows the distribution of weights
for only paired records (that is, match, clerical, and duplicates).
The histogram shows how the weights are distributed for all
comparisons. The higher the weight, the more you can be confident of
a correct match. Each line indicates the weight (in 0.5 increments),
the frequency, and a graphic representation. Lines ending with a right
bracket (>) exceed the range.
You can use the histogram to decide where the cutoff values should be.
Notice in the example below, the match cases trail off around 6. Below
this cutoff there are some bumps. For this example, you want to make
the clerical review cutoff 6 and the unmatched cutoff 4.
An example is:
*
* HISTOGRAM
*
* Distribution of observed weights for all comparisons
* Scale based on mean frequency of: 8
*
* For weights with a frequency greater than the mean -
* The histogram shows an arrow in the last column
*
* WGT Freq
*
* 4.00 0
* 4.50 0
* 5.00 2 **
* 5.50 2 **
* 6.00 1 *
* 6.50 1 *
* 7.00 1 *
* 7.50 1 *
* 8.00 1 *
* 8.50 2 **
* 9.00 2 **
* 9.50 1 *
* 10.00 7 *******
* 10.50 3 ***
* 11.00 8 ********
* 11.50 11 ***********
* 12.00 9 *********
* 12.50 2 **
* 13.00 7 *******
*
* TOTAL COMPARISONS: 0000307
The frequencies for unmatched cases trail off as the weights go higher
and the frequencies for matched cases trail off as the weights go lower.
This forms two curves (or modes). These represent the unmatched and
the matched cases. The farther apart these are from each other, the
better the discrimination between the matched and unmatched
records. Try to draw a continuous curve from the histogram chart, and
examine the tails of the curve to decide where to make the cutoff
points.
Figure 14-1 Phase Three: Design and Develop the Data Re-engineering
Application
Defining survivorship:
• Resolves multiple occurrences of records
• Defines the format of the surviving data
Fields and field values: You select field values based on rules for
testing the fields. A rule contains a set of conditions and a list of
target fields. If a field tests
true against the conditions, the field value for that record becomes the
best candidate for the target. After testing each record in the group,
the fields declared best candidates combine to become the output
record for the group. Whether a field survives is determined by the
target. Whether a field value survives is determined by the rules.
Some approaches to selecting the best candidate are:
• Duration of record
Grouping Records
You often use the Survive stage after unduplicating a data file in
which you have identified groups of records that contain similar or
identical data. For example, the following portion of three records has
been identified as the same person using different representations for
the name:
JOHN SMITH JR
JOHNNY SMITHE
JOHN E SMITH
Group identifier
Once you identify each group of records, you must assign an identifier
to the group. For example, the Match stage lets you include a set
number with each record in a group when you extract the data. See
“Defining a Match Stage” on page 12-11 for more information.
Sorting grouped records
The Survive stage sorts your input data file on the group identifier to
ensure that all records in the group are together. However, you cannot
control the order in which records appear in a group.
Note: The field use type for each field in your input file must be
specified as either a string or an integer.
10. Select the Field Name that contains the group identifier you want.
11. Click Next.
The Survivorship Rules Definition Screen – SURVIVE appears.
Defining Targets
Targets are the data file fields you want to write to the output record.
Fields you do not include as targets are excluded from the output
record. You can easily create targets for individual fields, groups of
fields, or the entire record using the Specify Output Fields area of the
Survivorship Rules Definition Screen – SURVIVE.
To define a target:
Under Available Fields, drag each Field Name you want to include to
the Targets list, or use the add and remove buttons to add or remove a
selected field from the Targets list.
Technique                 Pattern
Shortest Field            SIZEOF(TRIM(c.<field>)) <= SIZEOF(TRIM(b.<field>))
Longest Field             SIZEOF(TRIM(c.<field>)) >= SIZEOF(TRIM(b.<field>))
Most Frequent             FREQUENCY
Most Frequent [Non-blank] FREQUENCY (skips missing values when counting the most frequent)
Equals                    c.<field> = <DATA>
Not Equals                c.<field> <> <DATA>
Greater Than              c.<field> >= <DATA>
Less Than                 c.<field> <= <DATA>
At Least One              1 (at least one record survives, regardless of other rules)
Tip: You can use standard Windows shortcut keys for copy, cut, and
paste commands (Ctrl+C, Ctrl+X, and Ctrl+V) on the text in the
Data box.
You can use the Survivorship Rule Expression Builder screen to help
you (for details, see “Using the Survivorship Rule Expression Builder”
on page 14-12).
For information about correct syntax and rules processing, see
“Creating Rules Syntax” on page 14-23.
Tip: You can use standard Windows shortcut keys for copy, cut, and
paste commands (Ctrl+C, Ctrl+X, and Ctrl+V) on the text within
the Complex Survivorship Expression box.
Tip: You can use standard Windows shortcut keys for copy, cut,
and paste commands (Ctrl+C, Ctrl+X, and Ctrl+V) on the text
within the Expression box.
• Click Next if you selected Map Input Fields for Report Use, to
display the Data Selection for Reports dialog box (see “Selecting
Data for a Predefined QualityStage Report” on page 14-15).
• Click Cancel to discard all changes made in this screen and return
to the Stage Wizard screen.
Tip: See “Creating rules similar to existing rules” on page 14-17.
1. Select the group identifier field, and then click the arrow button to
move the field name next to Match Type.
2. Select the record number field, and then click the arrow button to
move the field name next to Set Number.
3. Repeat steps 1 and 2 for up to six Name fields, up to six Address
fields, and up to six Area fields.
4. When all desired fields are selected, click Finish.
To work on a single rule, click anywhere in that rule’s row to select it,
or click on its selector box to highlight that row.
To select a group of consecutive rules, click on the first rule’s selector
box. Then Shift+Click on the last rule’s selector box; if necessary, use
the grid’s scroll bars to display the last rule before you Shift+Click.
To select a group of nonconsecutive rules, Ctrl+Click the selector box
of each rule you want to include.
Using Survivorship Rule to modify a rule
To modify an existing rule using the Survivorship Rule section at the
upper right of the screen:
1. Select the rule you want to modify.
2. Click Edit Survivorship Rule to temporarily move that rule from
the grid to the Survivorship Rule section.
3. Edit the rule as needed.
• Edit the entry directly within the text box, or
• Click Expression Builder to use the Survivorship Rule
Expression Builder screen to edit the expression.
See “Using the Survivorship Rule Expression Builder” on page
14-12.
4. Click Add Survivorship Rule to move the rule back to the
Survivorship Rules grid list.
Tip: You can easily reorder rules by using drag and drop. As you drag,
green and red arrows appear that indicate the position at which
the dragged record is inserted when you release the mouse
button.
Setting up the file structure
If this is the first job that you have run in the project, you need to set
up the file structure for the project. See Chapter 5, “Setting Up Run
Profiles”, for instructions.
Once your data files are available to QualityStage, run the Survive
stage.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
The File Mode Execution dialog box lists all the stages in the job. You
must run the same stages that you selected when deploying the job.
See “Deploying Jobs in File Mode” on page 7-7 for more information.
By default, all stages listed are run from first to last. However, you
can select a subset of stages to run. For information about how to
select a subset, see “To select a subset” on page 8-7.
See “About Deploying and Running Jobs” on page 7-1 for more
information about deploying and running jobs.
b. (Optional) Select Prepare Report Data.
This specifies that prepared report data output will be put in
the Data directory for the project.
c. (Optional) Select Retrieve Report Data, and specify the
maximum file size to retrieve.
The output file will be copied to the location specified in the
run profile for local report data.
See Chapter 15, “Working with QualityStage Reports”, for more
information about preparing data for formatted reports.
After processing is finished, you can view the results. See “Viewing
Job Output Files” on page 8-13 and “Working with QualityStage
Reports” on page 15-1 for more information.
Rule Format
The syntax for a rule is:
TARGETS: CONDITION;
Rule Operators
Survive supports the following operators for both string
and integer fields:
= Equal to
+ Add
– Subtract
* Multiply
Rule Processing
The Survive stage reads the first record in a group and evaluates the
record against all the rules. The fields for this record are the current
fields. For the first record in a group, there are no best fields. All rules
for each target are evaluated against the fields in the record. If any
target passes the test, its data fields become the best fields.
The job evaluates each subsequent record in the group; each becomes
the current record during its evaluation. When a target passes the test,
its data fields become the best fields, replacing any existing best
fields. If none of the current fields meet the conditions, the best fields
remain unchanged.
After all records in the group are evaluated, the values designated as
the best values combine to become the output record. Survive
continues the process with the next group.
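A minimal sketch of this loop, assuming a group is a list of field dictionaries and each rule is a (target fields, condition) pair; the structures are hypothetical and chosen only to mirror the description above (the rule shown mirrors the FIELD3 example that follows):

def survive(group, rules):
    # Track surviving values per rule; later rules take precedence
    # (see "Multiple Rules" below).
    best_per_rule = [dict() for _ in rules]
    for current in group:                        # first record to last
        for i, (targets, condition) in enumerate(rules):
            if condition(current, best_per_rule[i]):
                for field in targets:            # current fields become best fields
                    best_per_rule[i][field] = current[field]
    output = {}
    for values in best_per_rule:                 # later rules overwrite earlier ones
        output.update(values)
    return output

# FIELD3 survives if it has five or more characters and FIELD1 is non-blank
rules = [(["FIELD3"],
          lambda c, b: len(c["FIELD3"].strip()) >= 5
                       and len(c["FIELD1"].strip()) > 0)]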
For example, the following rule states that FIELD3 of the current
record should be retained if the field contains five or more characters
and FIELD1 has any contents.
FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM
c.FIELD1) > 0) ;
The first group has the following three records:
No. of characters in
Record FIELD1 FIELD3
1 3 2
2 5 7
3 7 5
The first record in the group has two characters in FIELD3 and three
characters in FIELD1. This record fails the test, because FIELD3 has
fewer than 5 characters. The next record has seven characters in
FIELD3 and five in FIELD1. This current record passes the conditions for
this rule. The current FIELD3 (from the second record) becomes the
best field. The third record, with five characters in FIELD3 and seven in
FIELD1, also passes the conditions, and FIELD3 from this record replaces
the best value as the new best value.
When you define multiple rules for the same target, the rule that
appears later in the list of rules has precedence. For example, if you
have two rules for the target FIELD1, the value from the record that
meets the conditions of the second rule as listed becomes the best
value. If no target passes the second rule, the best values from the
first rule become part of the output record.
Date as a Target
You compare the field for the greater value in the following way:
DATE : c.YEAR > b.YEAR
Multiple Targets
In the following example the rule has two target fields, NAME and
PHONE. The current values should survive if the current year is
greater than the best year:
NAME PHONE : c.YEAR > b.YEAR ;
Using Frequency
To specify that the most frequent value of a field within each group
survives you first specify the field followed by the keyword
FREQUENCY separated by a colon, as in this example:
FIELD1: FREQUENCY
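Conceptually, FREQUENCY behaves like this sketch (hypothetical values, for illustration only):

from collections import Counter

group_values = ["SMITH", "SMITHE", "SMITH"]  # hypothetical FIELD1 values in one group
survivor = Counter(group_values).most_common(1)[0][0]
print(survivor)  # SMITH, the most frequent value in the group, survives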
Multiple Rules
If you have multiple rules for a surviving field, the value that satisfies
the later rule survives. In the following example, RECORD is the entire
record, and TYPE, FREQ, FIRSTACC, and V9 are fields in the record.
Using the following rules:
RECORD : (c.TYPE<>”DD”) ;
RECORD : (c.FREQ>b.FREQ) ;
RECORD : (c.FIRSTACC = 1) ;
V9: (c.V9 > b.V9) ;
the following is true:
• If a record satisfies the last rule for the target RECORD, that is, the
value for field FIRSTACC is 1, the record becomes the best record
(b.RECORD).
• If more than one record in the group passes the last rule, the latest
record processed that satisfies the rule survives.
• If no records pass the FIRSTACC rule, the last record processed that
passes the c.FREQ>b.FREQ rule survives.
• If no records pass the FREQ rule, the last record processed that
passes the c.TYPE<>”DD” rule survives.
• If no records pass any of the rules, the surviving record is all
blanks.
However, in this set of rules there is a rule for one of the fields (V9) in
the RECORD to survive. Since the V9 rule appears later in the list of
rules than the rules for RECORD, it takes precedence over whatever
value survives for that field in RECORD.
In the following example, you have three records in a group with the
following field values (TYPE, FREQ, FIRSTACC, V9):
DD 4 1 19990401
The second input record survives the rule for the RECORD target,
because FIRSTACC=1, but the first input record provides the surviving
value for V9. If the FIRSTACC were not equal to 1 for any of these
records, the third record would survive the RECORD target since it has
the highest value for FREQ.
You can view input and output data from your jobs using:
• QualityStage Reports
• QualityStage Data File and Report Viewer
Chapter 16 describes how to use the QualityStage Data File and
Report Viewer.
This chapter describes how to work with QualityStage Reports. It
describes:
• How to use stage wizards to prepare data for a QualityStage Report
• How to create and run QualityStage Reports using unprepared
data
• How to create customized Access reports
• How to generate and view QualityStage Reports
• How to select and run a QualityStage Report
The final section of this chapter describes each of the predefined
QualityStage reports.
File Location
Since Microsoft Access must link to the input data, the data must be
accessible to the QualityStage Designer host either locally or
remotely.
Access retrieves all remote data across your network to generate the
report. Network performance affects the speed with which Access
generates the report.
Note: If you want to create your own customized reports, you must have
installed and be familiar with Microsoft Access 2000. The
instructions in this section for creating customized reports
assume that you are an experienced user of Access.
Designing a Report
The database to design a report can be:
• The Reports database (reports.mdb) shipped with QualityStage, or
• Any database that you create to hold report definitions.
Determine which fields to display in the report before you create the
report in Access.
Creating a Table
Tables that have the same name and fields in the Access reports
database and other relational databases are known as mirror tables. If
the data is in a relational table, the tables that are created must
mirror the structure of the relational tables.
To create a table:
1. Create (do not populate) tables in Access that mirror your data.
If the data is in a flat file, the tables should map to the fix-fielded
version of the data before the data is converted to delimited text.
Creating a Query
Use these instructions if you are familiar with SQL. The query will be
used on the relational database with the populated tables, but
compose the SELECT statement without including an IN clause, as if
the statement were to be used on the local Access database tables. At
run time the tables are linked to the Access Reports database and are
local to the reports database.
To create a query:
2. Click New.
3. Select Design View from the New Query dialog box and close the
Show Table dialog box.
4. Select Create query in Design view.
5. Change the view of the query to SQL and compose a SELECT
statement.
4. Click OK.
5. Select the table or query from the Table/Queries list.
6. Move your selected Available Fields to the Selected Fields.
7. Click Next.
8. Enter a query title in the Simple Query Wizard window.
9. Click the Finish button to complete the Query Wizard job. The
options you selected in the Query Wizard appear in the query that
is generated from your selections.
1. Select Reports from the Objects list in the local Access Reports
database.
Creating a Specification
For data in flat files, create a specification for the file in Access. The
specification stores information such as the delimiter type and the
field names needed by Access to link the flat file.
To create a specification for a flat file in Access:
1. Open the Reports database.
2. Select File ➤ Get External Data ➤ Link Tables to open the Link
dialog box.
3. Double-click the flat file for which you would like to create a
Specification.
The Link Text Wizard dialog box opens.
4. Click the Advanced button on the bottom of the Link Text Wizard
dialog box to open the Specification dialog box.
5. Choose the file format. For flat files, choose Delimited, and in
Field Delimiter type choose the delimiter type.
6. Enter the field names of the flat file in the Field Information grid.
Choose the appropriate data type for each field.
7. Once all the field names have been entered, click Save As.
8. Save the specification file as the name of the flat file. For example,
if the flat file is called INPUT, save the specification for that flat file
as INPUT.
9. After you save the specification, click Cancel to exit the Link Text
Wizard.
The specification for that flat file is saved in the Reports database.
10. Exit Microsoft Access.
11. Run the report from the Report Generator dialog box. For more
information, see the discussion of preparing report data as part of
the job definition on page 15-2.
ODBC
To access a remote relational database via ODBC:
1. Select ODBC.
The ODBC data source and driver must be preconfigured through
Windows.
2. Enter the Data Source Name (DSN), Username (optional),
Password (optional), and the Database name (required if there is
more than one database on a server).
Flat Files
To access a flat file:
Important: You must specify a printer in your Windows Printers folder before
you run a report. Failure to do so prevents processing. Before you
try to run the report, select Settings ➤ Printers from the Start
menu, and then specify a printer.
How to select and run a report
1. From the list of reports, select the report to run. (For a description
of the predefined QualityStage reports, see “About Predefined
QualityStage Reports” on page 15-14.)
2. Click Formatted Reports.
When you click Formatted Reports, your data files are linked to
the Reports database, and Microsoft Access generates the report.
After Access finishes running the report, the report opens in
preview mode in Access (not in QualityStage). You can:
• View the report
• Print or export the report
• Close the Access report window
• Run another report
• Close the Report Generator dialog box
3. Click Finish after you run your reports.
Running the report again
When you generate the same report at another time, QualityStage
removes the previous links to the data tables in the Reports database
and creates new links according to the data location you specify at this
time. This ensures that the same report does not run with the
incorrect data tables you linked earlier.
Reports Database
The Reports database included with QualityStage Designer is a
Microsoft Access database and contains predefined reports and
associated queries.
The Reports database also contains a table called REPORT_TABLES,
which has two fields:
• ReportName: Contains the name of a report.
• TableName: Contains the name of the data table or flat file that is
queried by the corresponding ReportName.
For example, if REPORT_TABLES contains:
• A report named Standardization US Summary Report that queries
the tables (or flat files) INPUT and US020000
• A report named Match Summary Report that queries one table (or
flat file) MTCHGRP
the REPORT_TABLES table would look like this:
ReportName TableName
Standardization US Summary Report INPUT
Standardization US Summary Report US020000
Match Summary Report MTCHGRP
Investigation Reports
For a given option you select in the Options box in the Stage
Definition Wizard, you run the appropriate report to display the
results.
For instance, if you select Investigation Character Discrete in the
Wizard, you run the Investigation Character Discrete report.
This table shows the Investigation option and the report you should
use with each option.
1. Use the Program stage to remove the .srt extension from the
output file name.
2. Use the Format Convert stage to convert the output file to a flat
file or an ODBC file (see “Converting Output Files” on page 15-3).
3. If you converted the output file to a flat file, use the Program stage
to add the .txt extension to the file name (see “Adding a .txt
Extension to Flat Files” on page 15-3).
Use this report when the field mask for each field is All C. The report
can be used to report on multiple single-domain fields that are
grouped separately.
Country Codes
The country-specific standardization report names use the country’s
two-character ISO abbreviation. These are the same abbreviations
used for the rule sets.
For example, the following countries use these abbreviations:
• US (United States)
• GB (Great Britain)
• CA (Canada)
• FR (France)
• DE (Germany)
• AU (Australia)
• IT (Italy)
• ES (Spain)
Thus the Standardization US Report is for the United States, the
Standardization GB Report is for Great Britain, and so on.
This chapter uses CC to represent the country code in the report
names.
Standardization CC Report
This report contains the input fields and preconfigured business
intelligence fields from the NAME, ADDR, and AREA rule sets.
• Processed.
• Fully processed.
• With unhandled data.
• That have no address, area, or name information.
• With additional name and address information.
• Broken down by address type (for example, street or box).
This report uses the CCA20000 table.
FIELD1
FIELD2
FIELD3
FIELD4
FIELD5
FIELD6
Matching Reports
The following reports are used with the output from Match jobs, as
defined by the match option you select in the Stage Definition Wizard:
Match Fields
The Match reports use the MTCHGRP table. The MTCHGRP table
contains the following fields:
Note: The Match extract you use for Match reports must contain all the
fields listed in the preceding Match Fields table.
Survivorship Report
The survivorship report provides before and after information for each
group of records, including the surviving record and any matching
records.
Input and output data files, and Match reports and extracts, can be
very large (on the order of gigabytes) and cannot be viewed easily with
traditional text editors such as vi or WordPad.
QualityStage includes a Data File and Report Viewer, which
lets you easily view and analyze the files used in stages without
having to switch to another application.
You view data files, reports, and extracts of a specific job. First you
select the job, and then you select the data file, report, or extract to
view.
Select the file you want to view and click View File to display the
selected file in the Report Viewer screen.
Note: You cannot edit the file from this screen. You must use an
external text editor for editing.
After you finish viewing the file, click Exit to return to the Choose
Datafile or Report to View dialog box.
Navigating
Use the controls at the bottom left of the screen to navigate through
the file.
As you scroll through the file, the total lines in the file and the current
line numbers being viewed are noted at the top. The end of the file is
indicated with an [EOF] line.
Note: If you entered a number greater than the number of lines in the
file, the last 27 lines of the file are shown.
Searching
The Report Viewer also lets you search for a specific word in the file,
starting from the line that is currently at the top of the screen.
Problem: The Report Viewer behaves unpredictably, either entirely
missing a report or printing out only parts of others.
Solution: The Report Viewer is designed to work with files that have
a fixed line length; each line must be the same size, because the
server uses a constant address lookup scheme. Pad your file with
spaces to make it fixed length. (Utilities to automate this process
are planned.)

Problem: The Server Reports and Datafiles button is grayed out when
I select a stage with the file I want to view.
Solution: The stage is being run on an OS390 host. The Report Viewer
is currently unavailable for OS390.

Problem: The Report Viewer comes up with an error message
complaining about FTP.
Solution: You may not have an FTP daemon running on the host
machine. Make sure FTP is running.

Problem: Connection is aborted due to timeout or other failure.
Solution: One of two possible problems:
• Your host is being loaded down by other programs running on the
host machine. Remove the load.
• You are running a version of the INTEGRITY server that is older
than version 3.8. Get version 3.8 (or later) of the server and retry.

Problem: Connection is forcefully rejected.
Solution: You do not have a server running on the host machine. Get
a 3.8 (or later) server version and retry.
For UNIX
If you are migrating from UNIX, you only need to create a list of the
full pathnames of the job files and the names of the directories
containing the control members.
For MVS
If you are migrating from MVS, you need to do more work. The MVS
runtime provides a command, UNIXPORT, that steps through one
individual job within an application and produces a PDS for the JCL
procs and a PDS for the control members, including the Unijoin source
code (Save Language source). The outputs from this command are:
• GETLIST
• UNIX.SLIB
• UNIX.CLIB
To use the UNIXPORT command:
1. Using the ISPF panels, access the application you want to
migrate.
Important: You must select the jobs in the order in which the application
runs.
The QualityStage server provides some stage options that are
currently not supported in the QualityStage Designer interface. These
options are:
• Transfer
– Skip Count
– Fill
– Fill Character
• Parse
– Capitalization Mode, such as CAPSON and CAPSOF
• Select
– Output Mode = TAB
– Select Key
– Convert Key
– Output Skip
– Output Length
• Unijoin
– Output Mode = NONE
– Interval equate fields
– The following EQUATE comparison codes:
Field Numeric (FN) Field Signed (FS)
If you have operations using these options, you need to review their
control files and determine if you need to edit and remove or change
these options.
Creating the IMF File
QualityStage Designer permits only one input data file to any stage,
except the Unijoin stage. If you have a stage, such as the Sort stage,
with two input files, you might want to create two separate Sort
stages, one for each input data file. Otherwise, when you create the
IMF file, only one input file will be used for the stage.
Tip: If you are using the Sort stage to concatenate files, select Append
to File when you select the output file.
Field Description
field1 Contains the full pathname of each job file to be converted.
field2 Contains the keyword JCL or ARD indicating what type of
job.
field3 Contains a default job name for the operations in the JCL or
ARD file if a job name is not found. Optional argument.
Note: If the conversion cannot locate any control files for an operation,
the conversion continues, but the IMF file will not contain any
operation specifications, only data file information. You will have
to define each stage in the job.
Argument Description
–p proc_list (Required) The full path of the job list file.
–i jcl_file The full path of a single JCL job file. Used in place
of -p proc_list.
–m ard_file The full path of a single ARD job file. Used in place
of -p proc_list.
–E imf_filename (Required) The full path for the output IMF file.
–d datadef_list (Optional) The full path of the data definition list
file.
–c control_list (Optional) The full pathname of the control list file.
If not specified, jcl_cnv looks in the current working
directory for control member files.
–n project_name (Optional) A default project name if no project name
is found within the job files.
–C check_log (Optional) Logs as a warning any occurrence of a
filename longer than eight characters.
–v (Optional) Turns on verbose-mode messages for
locating operations with record length conflicts.
Conversion Issues
This section describes some issues around data file definitions and the
Sort stage you might encounter after the conversion.
Important: Always check the file and field definitions using the Data File
Wizard and Data Field Wizard of the QualityStage Designer.
Update any definitions, if necessary, before running the project.
Sort Operations
Sort stages that precede a Build, a Unijoin, or a Collapse stage are
removed from the job if the Sort stage only sorts the fields used by the
stages and the input and output files are the same file. QualityStage
Designer automatically performs a Sort operation before a Build, a
Collapse, or a Unijoin stage.
Although these Sort operations are not listed as operations in the job,
they are added to the ARD when the job is compiled. Leaving in the
Sort operations from the original job would result in duplicate Sort
operations, which increases processing time.
If you have Sort operations on input files (and reference files) to the
Build, Collapse, and Unijoin stages that sort other fields or additional
fields, jcl_cnv issues a warning message indicating the number of such
sorts.
Warning Messages
Warning messages typically indicate invalid specifications in control
files or missing file or field information. The jcl_cnv program continues
converting the project files and creates an IMF file if only
warning-level errors are encountered.
You should review all warning messages to determine if specifications
need to be adjusted or file definitions need to be updated before
importing or running the project using QualityStage Designer. If you
are able, correct the cause of the warning messages with the input file
and rerun jcl_cnv to create a complete and accurate IMF of the project.
Log Files
If you receive a non-fatal error message, it may refer you to a log file.
The log file is stored in your system’s temporary directory (usually
C:/TEMP).
The log file is named IBTxxxxxx.LOG, where xxxxxx represents the date
the log file was created (in yymmdd format). For example, a file named:
IBT010319.LOG
indicates that the log file was created on March 19, 2001.
To find the log file on Windows:
1. Open the Control Panel.
2. Click System.
Match Comparisons
Note: For Undup runs, there is no FileB; matches are done among the
records on FileA.
ABS_DIFF Comparison
The ABS_DIFF comparison is an absolute difference comparison that
compares two numbers, such as age, and assigns the full agreement
weight if the numbers agree within the specified tolerance. If the
difference exceeds the tolerance, the full disagreement weight is
assigned. You can use this comparison with arrays.
You must specify the following two fields:
Field Description
varA The number from FileA
varB The number from FileB
Parameter Description
Param 1 The maximum absolute difference between the values of
the fields that can be tolerated.
For example, you are comparing ages and specify 5 for Param 1. If
the ages being compared are 24 and 26, the absolute difference is 2
and the full agreement weight is assigned. If the ages are 45 and 52,
the absolute difference is 7 and the full disagreement weight is
assigned.
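A minimal sketch of this behavior (illustration only, not the product's code; the tolerance is assumed to be inclusive):

def abs_diff(var_a, var_b, param1):
    # Full agreement weight if the absolute difference is within the
    # tolerance; full disagreement weight otherwise (no proration).
    return "agreement" if abs(var_a - var_b) <= param1 else "disagreement"

print(abs_diff(24, 26, 5))  # agreement (difference 2)
print(abs_diff(45, 52, 5))  # disagreement (difference 7)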
AN_DINT Comparison
The AN_DINT comparison is a double alphanumeric left/right interval
comparison that compares house numbers in Census Bureau Tiger
files, the Etak files, the GDT DynaMap files, or the U.S. Post Office
ZIP+4 files. A single house number, which might contain alpha
characters, is compared to two intervals.
One interval represents the left side of the street and the other
represents the right side of the street; for example, 123A to the
intervals 101-199 and 100-198. For a number to match to an interval,
both the parity (odd/even) and the range must agree. This comparison
causes a special flag to be set to indicate whether the left or the right
interval matched.
You specify the following five fields:
Field Description
varA The number on FileA
varB1 The beginning of the interval range for one side of the street
(such as from left) from FileB
varB2 The ending of the interval range for one side of the street
(such as to left) from FileB
varB3 The beginning of the interval range for the other side of the
street (such as from right) from FileB
varB4 The ending of the interval range for the other side of the
street (such as to right) from FileB
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number.
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values.
AN_INT Comparison
The AN_INT comparison is an alphanumeric odd/even interval
comparison that compares a single number on FileA to an interval or
range of numbers on FileB. These numbers can contain alphanumeric
suffixes or prefixes. The number must agree in parity with the low
range of the interval. For example, an interval such as 123A to 123C is
valid and contains the numbers 123A, 123B and 123C.
A single number on FileA is compared to an interval on FileB. If the
number on FileA is odd, the begin range number on FileB must also be
odd to be considered a match. Similarly, if the number on FileA is
even, the begin range on FileB must be even to be considered a match.
Field Description
varA The number on FileA
varB1 The beginning of the interval range from FileB
varB2 The ending of the interval range from FileB
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number.
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values.
The beginning number of the interval can be higher than the ending
number and still match; that is, the files can have a high address in
the FROM field and a low address in the TO field. For example, 153
matches both the range 200-100 and the range 100-200.
CHAR Comparison
The CHAR comparison is a character-by-character comparison. If one
field is shorter than the other, the shorter field is padded with trailing
blanks to match the length of the longer field. Any mismatched
character causes the disagreement weight to be assigned. You can use
the CHAR comparison with arrays and reverse matching.
Field Description
varA The character string from FileA
varB The character string from FileB
CNT_DIFF Comparison
The CNT_DIFF comparison counts keying errors in numeric fields,
such as dates, telephone numbers, file or record numbers, and Social
Security numbers. For example, you have the following birth dates
appearing on both files, and you suspect that these represent the same
birth date with a data entry error on the month (03 vs. 08):
19670301
19670801
You can use this comparison with arrays and reverse matching.
You specify the following two fields:
Field Description
varA The number from FileA
varB The number from FileB
Parameter Description
Param 1 Indicates how many keying errors will be tolerated.
If you specify 2, the errors are divided into thirds. One error results in
assigning the agreement weight minus 1/3 the weight range from
agreement to disagreement. Two errors would receive the agreement
weight minus 2/3 the weight range, and so on. Thus, the weights are
prorated according to the seriousness of the disagreement.
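A sketch of this proration, assuming keying errors are counted as differing character positions (the guide does not spell out the exact error-counting method):

def cnt_diff_weight(var_a, var_b, param1, agree_wgt, disagree_wgt):
    # Each error subtracts 1/(param1 + 1) of the range from the
    # agreement weight; more than param1 errors means full disagreement.
    errors = sum(a != b for a, b in zip(var_a, var_b))
    if errors > param1:
        return disagree_wgt
    return agree_wgt - errors * (agree_wgt - disagree_wgt) / (param1 + 1)

# 19670301 vs 19670801: one keying error, so with Param 1 = 2 the result
# is the agreement weight minus 1/3 of the weight range
print(cnt_diff_weight("19670301", "19670801", 2, 6.0, -3.0))  # 3.0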
D_INT Comparison
The D_INT comparison is a left/right interval comparison that
compares house numbers in Census Bureau Tiger files, the Etak files,
or the GDT DynaMap files. A single house number is compared to two
intervals. One interval represents the left side of the street and the
other represents the right side of the street. For a number to match to
an interval, both the parity (odd/even) and the range must agree.
You specify the following five field names:
Field Description
varA The number on FileA
varB1 The beginning range of the interval for one side of the
street (such as from left) from FileB
varB2 The ending range of the interval for one side of the
street (such as to left) from FileB
varB3 The beginning range of the interval for the other side
of the street (such as from right) from FileB
varB4 The ending range of the interval for the other side of
the street (such as to right) from FileB
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number.
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values.
D_USPS Comparison
The D_USPS comparison is a left/right USPS interval comparison
that processes United States Postal Service (USPS) ZIP+4 files or
other files that might contain non-numeric address ranges. The
D_USPS comparison requires the field names for the house number
(generally on FileA), two intervals for house number ranges on FileB,
and control fields that indicate the parity of the house number range.
You specify the following seven fields:
Field Description
varA The number from FileA
varB1 The beginning range of the interval for one side of the
street (such as from left) from FileB
varB2 The ending range of the interval for one side of the
street (such as to left) from FileB
varB3 The beginning range of the interval for the other side
of the street (such as from right) from FileB
varB4 The ending range of the interval for the other side of
the street (such as to right) from FileB
Bcontrol1 The odd/even parity for the range defined with varB1
and varB2
Bcontrol2 The odd/even parity for the range defined with varB3
and varB4
Control Description
O The range represents only odd house numbers.
E The range represents only even house numbers.
B The range represents all numbers (both odd and even)
in the interval.
U The parity of the range is unknown.
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number.
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values.
DATE8 Comparison
The DATE8 comparison allows tolerances in dates, taking into
account the number of days in a month and leap years. The supported
date format is yyyymmdd.
You can use this comparison with arrays and reverse matching. You
specify the following two fields:
Field Description
varA The date from FileA
varB The date from FileB
The DATE8 comparison requires at least one and can use two
parameters:
Parameter Description
Param 1 The number of days difference that can be tolerated. If
you only specify Param 1, this is the number of days
that can be tolerated for either varB greater than varA
or varA greater than varB.
If you specified both parameters, Param 1 is the
number of days tolerated for varB greater than varA.
Param 2 The number of days difference that can be tolerated
when varB is less than varA.
For example, you are matching on birth date and specified a 1 for
Param 1. If the birth dates differ by one day, the weight is the
agreement weight minus 1/2 of the weight range from agreement to
disagreement.
Two or more days difference results in a disagreement weight.
Similarly, if the value were 2, one day difference reduces the
agreement weight by 1/3 of the weight range and two days by 2/3.
An example is matching highway crashes to hospital admissions. A
hospital admission cannot occur before the accident date to be related
to the accident. You might specify a 1 for Param 1, which allows the
admission date to be one day later (greater) than the crash date, and a
0 for Param 2, which does not allow an admission date earlier than the
crash date.
DELTA_PERCENT Comparison
The DELTA_PERCENT comparison compares fields in which the
difference is measured in percentage, such as 10% difference in ages.
For example, a one-year difference for an 85-year-old is less significant
than for a 3-year-old, but a 10% difference for each is more
meaningful. You can use this comparison with arrays and reverse
matching.
You specify the following two fields:
Field Description
varA The value from FileA
varB The value from FileB
Parameter Description
Param 1 The percentage difference that can be tolerated. If you
only specify Param 1, this is the percentage that can
be tolerated for either varB greater than varA or varA
greater than varB.
If you specified both parameters, Param 1 is the
percentage tolerated for varB greater than varA.
Param 2 The maximum percentage difference that can be
tolerated when the value from varB is less than varA.
For example, you are comparing age in two files. If you want tolerance
of a ten percent difference in the values, specify 10 for Param 1. A one
percent difference subtracts 1/11 of the weight range (the difference
between the agreement and disagreement weight) from the agreement
weight. A 10 percent difference subtracts 10/11 of the difference in the
weight range.
You would specify Param 2 = 5 if you want a five percent tolerance
when varB is less than varA.
DISTANCE Comparison
The DISTANCE comparison computes the Pythagorean distance
between two points and prorates the weight on the basis of the
distance between the points. You can use this comparison for
matching geographic coordinates where the farther the points are
from each other, the less weight is applied.
You specify the following four fields:
Field Description
varA1 The X coordinate from FileA
varA2 The Y coordinate from FileA
varB1 The X coordinate from FileB
varB2 The Y coordinate from FileB
Parameter Description
Param 1 The maximum distance to be tolerated.
INT_TO_INT Comparison
The INT_TO_INT comparison matches if an interval on one file
overlaps or is fully contained in an interval in another file. You could
use this comparison type for comparing hospital admission dates to
see if hospital stays are partially concurrent, or for matching two
geographic reference files containing ranges of addresses. You can use
this comparison with arrays and reverse matching.
You specify the following four fields:
Field Description
varA1 The beginning range of the interval from FileA
varA2 The ending range of the interval from FileA
varB1 The beginning range of the interval from FileB
varB2 The ending range of the interval from FileB
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number.
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values.
INTERVAL_NOPAR Comparison
The INTERVAL_NOPAR comparison is an interval no-parity
comparison that compares a single number on FileA to an interval
(range of numbers) on FileB. Interval comparisons are primarily used
for geocoding applications, where FileB is the reference file. The single
number must be within the interval (including the end points) to be
considered a match. Otherwise, it is a disagreement.
You specify the following three fields:
Field Description
varA The number from FileA
varB1 The beginning of the range of the interval from FileB
varB2 The ending of the range of the interval from FileB
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values
The begin number of the intervals can be higher than the end number
and still match; that is, the files can have a high address in the
beginning range field and a low address in the ending range field. For
example, 153 matches both the range 200-100 and the range 100-200.
INTERVAL_PARITY Comparison
The INTERVAL_PARITY comparison is an odd/even interval
comparison that is identical to the INTERVAL_NOPAR comparison,
except that the number must agree in parity with the parity of the low
range of the interval. A single number on FileA is compared to an
interval on FileB. If the number on FileA is odd, the begin range
number on FileB must also be odd to be considered a match. Similarly,
if the number on FileA is even, the begin range on FileB must be even
to be considered a match.
You specify the following three fields:
Field Description
varA The number from FileA
varB1 The beginning range of the interval from FileB
varB2 The ending range of the interval from FileB
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values
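A sketch of the INTERVAL_PARITY test (illustration only; endpoints are included, the bounds may be given in either order, and parity is taken from the begin-range number):

def interval_parity(var_a, var_b1, var_b2):
    # var_a must fall within the interval and agree in odd/even
    # parity with the begin-range number var_b1.
    low, high = min(var_b1, var_b2), max(var_b1, var_b2)
    return low <= var_a <= high and (var_a % 2) == (var_b1 % 2)

print(interval_parity(123, 101, 199))  # True: odd number, odd begin range
print(interval_parity(124, 101, 199))  # False: parity differs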
LR_CHAR Comparison
The LR_CHAR comparison is a left/right character string comparison
that can compare place and ZIP code information in geocoding
applications. A single field on the user data file must be matched to
the two fields on FileB on a character-by-character basis.
Census Bureau Tiger files and other geographic reference files contain
a left ZIP code and a right ZIP code, a left city code and a right city
code. The left code applies if there was a match to the left address
range interval and the right code applies if there was a match to the
right address range.
Field Description
varA The field from FileA
varB1 The left field (ZIP, city, etc.) from FileB
varB2 The right field (ZIP, city, etc.) from FileB
Mode Description
EITHER VarA must match varB1 or varB2 (or both) to receive
the full agreement weight.
BASED_PREV Use the result of a previous D_INT comparison to
decide which field to compare.
If you specify the EITHER mode, varA must match one or both of varB1
and varB2 to receive an agreement weight. If you specify the
BASED_PREV mode, varA must match varB1 when the previous D_INT
(or similar double-interval) comparison matched the left interval, or
varB2 when the previous comparison matched the right interval. If
neither the left nor the right interval agrees, the missing weight for the
field is assigned.
LR_UNCERT Comparison
The LR_UNCERT comparison is a left/right uncertainty string
comparison that is used in conjunction with geocoding applications for
comparing place information. Census Bureau Tiger files and other
geographic reference files contain a left ZIP code and a right ZIP code,
a left city code and a right city code, etc.
Field Description
varA The field from FileA
varB1 The left field (city, for example) from FileB
varB2 The right field (city, for example) from FileB
Parameter Description
Param 1 The minimum threshold, which is a number between 0
and 900. Use the following guidelines:
900 The two strings are identical
850 The two strings can be considered the same
800 The two strings are probably the same
750 The two strings are probably different
700 The two strings are different
Mode Description
EITHER The contents of varA must match one or both of the
varB fields specified to receive the full agreement
weight.
BASED_PREV Use the result of a previous LR_UNCERT comparison
to decide which field to compare.
MULT_EXACT Comparison
The MULT_EXACT comparison compares all words on one record in
the field with all words in the second record. This comparison is
similar to array matching, except that the individual words are
considered to be the array elements. This type of comparison allows
matching of free-form text where the order of the words may not
matter and where there may be missing words or words in error. The
score is based on the similarity of the fields.
For example:
Building 5 Apartment 4-B
would match:
Apartment 4-B Building 5
You specify the following fields:
Field Description
varA The character string from FileA
varB The character string from FileB
MULT_RANGE Comparison
The MULT_RANGE comparison matches a single house number to a
list of house number ranges. Each range must be separated by a pipe
symbol (|). The tilde (~) is used to indicate the ranges, since the
hyphen may be a legitimate address suffix (123-A). The prefix “B:” can
be used to signify both odd and even numbers in the range. Otherwise,
the parity of the low number is used.
In this example:
101~199 | B:201~299 | 456 | 670 | 800-A~898-B
The following ranges are defined:
101 to 199, odd numbers only
201 to 299, both odd and even numbers
456 and 670 (single house numbers only)
800-A to 898-B, even numbers only (from the parity of the low number)
Field Description
varA The character string from FileA
varB The character string from FileB
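A rough sketch of parsing such a range list (a hypothetical helper; for simplicity it ignores alphanumeric suffixes such as 123-A, which the product handles):

def parse_ranges(spec):
    # Ranges are separated by '|'; low~high uses '~'; a 'B:' prefix
    # marks both parities, otherwise the low number's parity is used.
    ranges = []
    for part in spec.split("|"):
        part = part.strip()
        both = part.startswith("B:")
        if both:
            part = part[2:]
        low, high = part.split("~") if "~" in part else (part, part)
        parity = "both" if both else ("odd" if int(low) % 2 else "even")
        ranges.append((low, high, parity))
    return ranges

print(parse_ranges("101~199 | B:201~299 | 456"))
# [('101', '199', 'odd'), ('201', '299', 'both'), ('456', '456', 'even')]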
MULT_UNCERT Comparison
The MULT_UNCERT comparison is identical to MULT_EXACT,
except the uncertainty character comparison routine is used to match
the words. For more information on this uncertainty routine, see
“UNCERT Comparison” on page B-25.
For example:
Bilding 5 Apartment 4B
would be close to:
Apartment 4-B Building 5
You specify the following two fields:
Field Description
varA The character string from FileA
varB The character string from FileB
Parameter Description
Param 1 The cutoff threshold, which is a number between 0
and 900. Use the following guidelines:
900 The two strings are identical
850 The two strings can be safely considered to be
the same
800 The two strings are probably the same
750 The two strings are probably different
700 The two strings are almost certainly different
NAME_UNCERT Comparison
The NAME_UNCERT comparison compares first names, where one
name might be truncated. This comparison uses the shorter length of
the two names for the comparison and does not compare any
characters after that length.
For example, the following two sets of first names would be considered
exact matches:
AL ALBERT
W WILLIAM
This is different from CHAR where these two names would not match.
The length is computed by ignoring trailing blanks (spaces).
Embedded blanks are not ignored.
Field Description
varA The first name from FileA
varB The first name from FileB
Parameter Description
Param 1 The minimum threshold, which is a number between 0
and 900. Use the following guidelines:
900 The two strings are identical
850 The two strings can be safely considered to be
the same
800 The two strings are probably the same
750 The two strings are probably different
700 The two strings are almost certainly different
NUMERIC Comparison
The NUMERIC comparison is an algebraic numeric compare. Leading
spaces are converted to zeros and the numbers are compared. You can
use this comparison with arrays and reverse matching.
Field Description
varA The field from FileA
varB The field from FileB
PREFIX Comparison
The PREFIX comparison compares character strings, one of which
might be truncated. This comparison uses the shorter length of the
two strings for the comparison and does not compare any characters
after that length. You can use this comparison with reverse matching.
For example, a last name of ABECROMBY could be truncated to
ABECROM. The PREFIX comparison considers these two
representations to be an equal match. This is different from CHAR
where these two names would not match. The length is computed by
ignoring trailing blanks (spaces). Embedded blanks are not ignored.
You specify the following two fields:
Field Description
varA The string from FileA
varB The string from FileB
PRORATED Comparison
The PRORATED comparison allows numeric fields to disagree by an
absolute amount that you specify. A difference of zero between the two
fields results in the full agreement weight being assigned. A difference
greater than or equal to the absolute amount results in the
disagreement weight being assigned. Any difference between zero and
the specified absolute amount receives a weight proportional to the
size of the difference.
Field Description
varA The numeric field from FileA
varB The numeric field from FileB
The PRORATED comparison requires at least one and can use two
parameters:
Parameter Description
Param 1 The absolute value difference that can be tolerated. If
you only specify Param 1, this is the difference that can
be tolerated for either varB greater than varA or varA
greater than varB.
If you specified both parameters, Param 1 is the
difference tolerated for varB greater than varA.
Param 2 The absolute value difference that can be tolerated
when varB is less than varA.
For example, if you are comparing two dates and specify 5 for Param 1
and 7 for Param 2, the varB can exceed varA by 5 days, but the varA
can exceed varB by 7 days.
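A sketch of the proration (illustrative only; the exact interpolation QualityStage uses is not given in this guide, so a linear fall-off is assumed):

def prorated_weight(var_a, var_b, param1, agree_wgt, disagree_wgt, param2=None):
    # Weight falls linearly from full agreement (difference 0) to full
    # disagreement (difference >= tolerance); assumes tolerance > 0.
    # param1 tolerates var_b > var_a; param2, if given, var_a > var_b.
    if param2 is None:
        param2 = param1
    diff = var_b - var_a
    tolerance = param1 if diff >= 0 else param2
    fraction = min(abs(diff) / tolerance, 1.0)
    return agree_wgt - fraction * (agree_wgt - disagree_wgt)

# varB may exceed varA by up to 5; varA may exceed varB by up to 7
print(prorated_weight(10, 13, 5, 6.0, -3.0, 7))  # 3/5 of the range subtracted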
TIME Comparison
The TIME comparison compares times in hours and minutes or only
hours. The time must be in 24 hour format in which 0 is midnight and
2359 is 11:59 PM. Times can cross midnight since the difference is
always the shortest way around the clock. You can specify an
acceptable maximum time difference in minutes. You can use this
comparison with arrays.
A difference of zero between the two times results in the full
agreement weight being assigned. A difference greater than or equal
to the maximum results in the disagreement weight being assigned.
Any difference between zero and the specified maximum time
difference receives a weight proportionally equal to the difference.
For example, if the maximum time difference is 10 and the times
differ by 12 minutes, the comparison receives the full disagreement
weight. If the times differ by 5 minutes, the comparison receives a
weight between the agreement and disagreement weight. If you want
to specify unequal tolerance, you specify a second time allowance.
You specify the following two fields:
Field Description
varA The time from FileA
varB The time from FileB
The TIME comparison requires at least one and can use two
parameters:
Parameter Description
Param 1 The maximum time difference that can be tolerated. If
you only specify Param 1, this is the difference that can
be tolerated for either varA greater than varB or varB
greater than varA. If you specified both parameters,
Param 1 is the difference tolerated for varB greater
than varA.
Param 2 The maximum time difference that can be tolerated in
other direction when varB is less than varA.
For example, if you specify 20 for Param 1 and 14 for Param 2, varB
can exceed varA by 20 minutes, but varA can exceed varB by 14
minutes. The second parameter allows for minor errors in recording
the times.
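A sketch of how the time difference is computed (illustration only):

def time_diff_minutes(t_a, t_b):
    # Times are hhmm in 24-hour format; the difference is the shortest
    # way around the clock, so comparisons can cross midnight.
    to_min = lambda t: (t // 100) * 60 + (t % 100)
    d = abs(to_min(t_a) - to_min(t_b))
    return min(d, 24 * 60 - d)

print(time_diff_minutes(2355, 5))    # 10 minutes, crossing midnight
print(time_diff_minutes(900, 1730))  # 510 minutes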
UNCERT Comparison
The UNCERT comparison is a character comparison that uses an
information-theoretic character comparison algorithm to compare two
character strings. This comparison provides for phonetic errors,
transpositions, random insertion, deletion, and replacement of
characters within strings. You can use this comparison with arrays
and reverse matching.
The weight assigned is based on the difference between the two
strings being compared as a function of the string length (longer words
can tolerate more errors and still be recognizable than shorter words
can), the number of transpositions, and the number of unassigned
insertions, deletions, or replacements of characters within strings.
You specify the following two fields:
Field Description
varA The character string from FileA
varB The character string from FileB
Parameter Description
Param 1 The cutoff threshold, which is a number between 0
and 900. Use the following guidelines:
900 The two strings are identical
850 The two strings can be safely considered to be
the same
800 The two strings are probably the same
750 The two strings are probably different
700 The two strings are almost certainly different
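The information-theoretic algorithm itself is not documented in this guide. Purely to illustrate how a 0-900 score is compared against the cutoff threshold, a generic string-similarity ratio can stand in (this is not the UNCERT algorithm):

from difflib import SequenceMatcher

def uncert_score(var_a, var_b):
    # Stand-in similarity scaled to 0-900; NOT the UNCERT algorithm,
    # only an illustration of how scores relate to the cutoff.
    return round(SequenceMatcher(None, var_a, var_b).ratio() * 900)

cutoff = 750
score = uncert_score("ABECROMBY", "ABERCROMBIE")
print(score, score >= cutoff)  # scores at or above the cutoff count as agreement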
USPS Comparison
The USPS comparison processes United States Postal Service (USPS)
ZIP+4 files or other files that can contain non-numeric address ranges.
The USPS comparison requires that FileA contains the field names for
the house number and FileB contains a low house number range, a
high house number range, and a control field, indicating the parity of
the house number range.
You specify the following four fields:
Field Description
varA The house number from FileA
varB1 The ZIP+4 field primary low house number for the
beginning of the range from FileB
varB2 The ZIP+4 field primary high house number for the
ending of the range from FileB
Bcontrol The odd/even parity for the range defined with varB1
and varB2
Control Description
O The range represents only odd house numbers.
E The range represents only even house numbers.
B The range represents all numbers (both odd and even)
in the interval.
U The parity of the range is unknown.
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number.
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values.
USPS_DINT Comparison
The USPS_DINT comparison is an interval to double interval USPS
comparison that compares an interval on FileA to two intervals on
FileB. If the interval on FileA overlaps any part of either interval on
FileB and the parity flags agree, the results match.
Field Description
varA1 The beginning of the street address range from FileA
varA2 The ending of the street address range from FileA
varB1 The beginning of the street address range for one side
of the street (such as from left) from FileB
varB2 The ending of the street address range for one side of
the street (such as to left) from FileB
varB3 The beginning of the street address range for the other
side of the street (such as from right) from FileB
varB4 The ending of the street address range for the other
side of the street (such as to right) from FileB
Acontrol The odd/even parity for the range defined with varA1
and varA2
Bcontrol1 The odd/even parity for the range defined with varB1
and varB2
Bcontrol2 The odd/even parity for the range defined with varB3
and varB4
Control Description
O The range represents only odd house numbers.
E The range represents only even house numbers.
B The range represents all numbers (both odd and even)
in the interval.
U The parity of the range is unknown.
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number.
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values.
USPS_INT Comparison
The USPS_INT comparison is an interval to interval comparison that
compares an interval on FileA to an interval on FileB. If the interval
on FileA overlaps any part of the interval on FileB and the parity
agrees, the results match.
Both files require an address primary low number, an address
primary high number, and an address primary odd/even control, such
as the USPS ZIP+4 file control field.
You specify the following fields:
Field Description
varA1 The beginning of the street address range from FileA
varA2 The ending of the street address range from FileA
varB1 The beginning of the street address range from FileB
varB2 The ending of the street address range from FileB
Acontrol The odd/even parity for FileA
Bcontrol The odd/even parity for FileB
Control Description
O The range represents only odd house numbers.
E The range represents only even house numbers.
B The range represents all numbers (both odd and even)
in the interval.
U The parity of the range is unknown.
Mode Description
ZERO_VALID Indicates zero or blanks should be treated as any other
number.
ZERO_NULL Indicates zero or blank fields should be considered
null or missing values.
This appendix describes the rule set files used by the Standardize,
Multinational Standardize, and Investigate stages, as well as by the
WAVES stage.
Rule sets are fundamental to the standardization process. They
determine how fields in input records are parsed and classified into
tokens.
You can also create new rule sets using QualityStage Designer.
Features
The features and benefits offered by the rule set file architecture are:
• Support for your business intelligence objectives by maximizing the critical information contained within your data. The data structures created by the rules provide comprehensive addressability to all data elements necessary to meet data storage requirements and facilitate effective matching.
• A modular design that allows a “plug and play” approach to solving complex standardization challenges.
• A flexible approach to the input data file format. You do not need to organize the columns in any particular order.
Dictionary File
The Dictionary File defines the fields for the output file for this rule
set. The file holds a list of domain, matching, and reporting fields.
Each field is identified by a two-character abbreviation, for example
CN for City Name. The Dictionary also provides the data type
(character, for instance) and field offset and length information.
Format            Description
field-identifier  A two-character field name (case insensitive) that must be unique across all dictionaries. The first character must be an alpha character. The second character can be an alpha character or a digit. If this field is overlaid, enter two asterisks (**) for the field-identifier and put the field-identifier in the first two characters of the comments position.
field-type        The type of information in the field (see “Field Types” on page C-5).
field-length      The field length in characters.
missing value identifier  A missing value identifier. The possible values are:
                  S – spaces
                  Z – zero or spaces
                  N – negative number (for example, –1)
                  9 – all nines (for example, 9999)
                  X – no missing value
                  Generally, use X or S for this argument.
description       The field description that appears with the field in the Data File Wizard.
; comments        Optional comments, which must follow a semicolon (;). If this is an overlaid field (two asterisks (**) are the field-identifier), the comments must begin with the field-identifier. Comments can also be placed on a separate line if preceded by a semicolon.
Comment lines
The Dictionary File must also include the following two comment lines:
• ; Business Intelligence Fields. This comment line must immediately
precede the list of fields.
• ; Matching Fields. This comment line must immediately follow the list
of fields.
Important: If the Dictionary File does not include these two comment lines,
QualityStage Designer cannot display the list of fields.
Field order
The order of fields in the Dictionary File is the order in which the fields appear in the output file.
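As a minimal sketch, a Dictionary File fragment following this layout might look like the following; the field identifiers, types, and lengths shown are illustrative, not taken from a shipped rule set:
; Business Intelligence Fields
HN C 10 S House Number
SN C 25 S Street Name
; Matching Fields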
Field Types
The following field types are supported:
Classification Table
The Classification Table allows the standardization process to identify
and classify key words such as street name, street type, directions,
and so on, by providing:
• Standard abbreviations for each word; for example, HWY for
Highway.
• A list of single-character classification tokens that are assigned to
individual data elements during processing.
The header of the Classification Table includes the name of the rule
set and the classification legend, which indicates the classes and their
descriptions.
The Standardize stage uses the Classification Table to identify and
classify key words (or tokens), such as street types (AVE, ST, RD),
street directions (N, NW, S), and titles (MR, DR). The Classification
table also provides standardization for these words.
The format for the Classification Table file is:
token standard-value class [threshold-weights] [; comments]
Format          Description
token           Spelling of the word as it appears in the input file.
standard value  The standardized spelling or representation of the word in the output file. The standardization process converts the word to this value. The standardization can be multiple words, which must be enclosed in double quotation marks. You can use up to twenty-five characters. The standardization can be an abbreviation for the word; for example, the directions WEST, WST, and W are all converted to W. Optionally, the standardization can force an expansion of the word; for example, POB is converted to “PO BOX”.
class              A one-character tag indicating the class of the word. The class can be any letter from A to Z, or a zero (0), which indicates a null word.
threshold-weights  Specifies the degree of uncertainty that can be tolerated in the spelling of the word. The weights are:
                   900  Exact match
                   800  Strings are almost certainly the same
                   750  Strings are probably the same
                   700  Strings are probably different
                   Lower numbers tolerate more differences between the strings.
comments           Optional, and must follow a semicolon (;). Comments can also be placed on a separate line if preceded by a semicolon.
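For illustration, entries following this format might look like the following. These lines are a sketch, not taken from a shipped table: the standard values and threshold weight are examples, and while D (directions) and T (street types) follow the classification legend described in this guide, the class shown for POB is illustrative:
WEST W D
ROAD RD T 700 ; tolerate minor misspellings of ROAD
POB "PO BOX" B ; force expansion to multiple words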
Threshold Weights
The threshold weights specify the degree of uncertainty that can be
tolerated in the spelling of the token. An information-theoretic string
comparator is used that can take into account phonetic errors, random
insertion, deletion and replacement of characters, and transpositions
of characters.
The score is weighted by the length of the word, since small errors in
long words are less serious than errors in short words. In fact, the
threshold should be omitted for short words since errors generally
cannot be tolerated.
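For example, an entry with a threshold weight of 800 accepts spellings that are almost certainly the same word, such as a minor misspelling of a long street name, while a weight of 900 accepts only the exact spelling; omitting the threshold for a short token restricts it to exact matching.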
Pattern-Action File
The Pattern-Action file contains the rules for standardization; that is,
the actions to execute with a given pattern of tokens.
This section first describes the principles behind pattern matching,
tokenization, and classification. It concludes with a description of the
file itself.
For more detailed information about the Pattern-Action file, see the
QualityStage Pattern-Action Reference Guide.
Consider, for example, the address 123 No Cherry Hill Road. Its tokens are classified as follows:
123          ^  Numeric
No           D  Direction
Cherry Hill  ?  Unknown words
Road         T  Street type
The pattern represented by this address can be coded as:
^|D|?|T
The vertical lines separate the operands of a pattern. The address
above matches this pattern. The classification of D comes from the
token table. This has entries, such as NO, EAST, E, and NW, which are
all given a class of D to indicate that they generally represent
directions. Similarly, the token class of T is given to entries in the
table representing street types, such as ROAD, AVE, and PLACE.
The overall structure of the file is a sequence of patterns, each followed by its associated actions:
pattern
actions
pattern
actions
…
There are two special sections in the Pattern-action File. The first
section consists of post-execution actions within the \POST_START
and \POST_END lines. The post-execution actions are those actions
which should be executed after the pattern matching process is
finished for the input record.
Post-execution actions include computing Soundex codes, NYSIIS
codes, reverse Soundex codes, reverse NYSIIS codes, copying,
concatenating, and prefixing dictionary field value initials.
The second special section consists of specification statements within
the \PRAGMA_START and \PRAGMA_END lines. The only
specification statements currently allowed are SEPLIST and
STRIPLIST. The special sections are optional. If omitted, the header
and trailer lines should also be omitted.
Other than the special sections, the Pattern-action File consists of sets
of patterns and associated actions. The pattern requires one line. The
actions are coded one action per line. The next pattern can start on the
following line.
Blank lines can be used to increase readability. For example, it is
suggested that blank lines or comments separate one pattern-action
set from another.
Comments follow a semicolon. An entire line can be a comment line by
specifying a semicolon as the first non-blank character; for example:
;
; This is a standard address pattern
;
^ | ? | T ; 123 Maple Ave
As an illustration of the pattern format, consider post actions of
computing a NYSIIS code for street name and processing patterns to
handle:
123 N MAPLE AVE
123 MAPLE AVE
\POST_START
NYSIIS {SN} {XS}
\POST_END
^|D|?|T ; 123 N Maple Ave
COPY [1] {HN} ; Copy House number (123)
COPY_A [2] {PD} ; Copy direction (N)
COPY_S [3] {SN} ; Copy street name (Maple)
COPY_A [4] {ST} ; Copy street type (Ave)
EXIT
^|?|T
COPY [1] {HN}
COPY_S [2] {SN}
COPY_A [3] {ST}
EXIT
Note that this example Pattern-action File has a post section that
computes the NYSIIS code of the street name (in field {SN}) and
moves the result to the {XS} field.
The first pattern matches a numeric, followed by a direction, followed by one or more unknown words, followed by a street type (as in 123 N MAPLE AVE). The associated actions copy the house number (operand [1]) to {HN}, the direction to {PD}, the street name to {SN}, and the street type to {ST}, and then exit.
Override Tables
The override tables are designed to complement the Classification table and the Pattern-Action file by providing additional instructions during processing. The information in the override tables takes precedence over the contents of the rule set files. These tables enable you to adjust tokenization and standardization behavior if the results you are getting are incorrect or incomplete.
You use the Overrides dialog boxes in the QualityStage Designer to edit the contents of the override tables. See Appendix E, “Customizing and Testing Rule Sets” for more information.
Output File
The rule set creates an output file in which the following fields are added to the beginning of each input record:
• A two-byte ISO country code. The code is associated with the geographic origin of the record’s address and area information.
• An Identifier flag. The values are:
Flag  Description
Y     The rule set was able to identify the country.
N     The rule set was not able to identify the country and used the default value that you set as the default country delimiter.
Input File
The Domain Pre-Processor rule sets do not assume that a data domain is associated with a particular field position. Therefore, you must insert at least one metadata
delimiter for a field in your input record. It is strongly recommended
that you delimit every field or group of fields. The delimiter indicates
what kind of data you are expecting to find in the field based on one or
more of the following:
• Metadata description
• Investigation results
• An informed estimate
The delimiter names indicate the domain you expect to find in the field (for example, ZQNAMEZQ for name data and ZQADDRZQ for address data, as used in the override examples in Appendix E).
For example, here are the files in the United States Domain
Pre-Processor rule set:
USPREP.CLS Classification Table
USPREP.DCT Dictionary File
USPREP.PAT Pattern-Action File
USPREP.PRC Rule Set Description File
Domain Fields
Domain Pre-Processor rule sets move every input token to one of the
following domain fields:
Reporting Fields
Domain Pre-Processor rule sets provide reporting fields for quality
assurance and post-standardization investigation. All Domain
Pre-Processor rule sets have the following reporting fields:
See the next table for descriptions of the user flag fields.
Domain Masks
The Domain Pre-Processor attempts to assign a domain mask to each
input token. All pattern-actions retype tokens to one of the domain
masks, which are:
For example, here are the files in the United States NAME rule set:
USNAME.CLS Classification Table
USNAME.DCT Dictionary File
USNAME.PAT Pattern-Action File
USNAME.PRC Rule Set Description File
In addition to the standardized domain fields, these rule sets produce:
• Matching fields.
• Reporting fields.
Matching Fields
Domain-Specific rule sets create data structures that facilitate effective data matching. Different domains have different matching fields. The most common matching fields are phonetic keys (for example, NYSIIS codes) for the primary fields.
Reporting Fields
The reporting fields support quality assurance and post-standardization investigation. These rule sets have the following reporting fields:
Name                     Description
Unhandled Pattern {UP}   The pattern generated for the remaining tokens not processed by the rule set, based on the parsing rules, token classifications, and any additional manipulations by the pattern-action language.
Unhandled Data {UD}      The remaining tokens not processed by the rule set, with one character space between each token.
Input Pattern {IP}       The pattern generated for the stream of input tokens, based on the parsing rules and token classifications.
Exception Data {ED}      The tokens not processed by the rule set because they represent a data exception. Data exceptions may be tokens that do not belong to the domain of the rule set or are invalid or default values.
User Override Flag {UO}  A flag indicating what type of user override was applied to this record.
User Re-Code Dropped Data Flag {U5}  A flag indicating whether the current record was affected by a user re-code that specified the dropping (deleting) of one or more input tokens.
For example, here are the files in the DATE rule set:
VDATE.CLS Classification Table
VDATE.DCT Dictionary File
VDATE.PAT Pattern-Action File
VDATE.PRC Rule Set Description File
Format      Example
mmddccyy    09211991
mmmddccyy   OCT021983
mmmdccyy    OCT21983
mmddccyy    04101986
mm/dd/ccyy  10/23/1960
m/d/ccyy    1/3/1960
mm/d/ccyy   10/3/1960
m/dd/ccyy   1/13/1960
mm-dd-ccyy  04-01-1960
m-d-ccyy    1-3-1960
mm-d-ccyy   10-3-1960
m-dd-ccyy   1-13-1960
ccyy-mm-dd  1990-10-22
Output Example
• The default classification table for this rule set contains common domain qualifiers (for instance, ORG, COM, EDU, and GOV) and sub-domain qualifiers (for instance, country and state codes).
Parsing Examples
The parsing parameters parse the address into multiple tokens, as in the following examples. The @ and . characters are used to separate the data; they are removed during the parsing process.
• User {US}
• Domain {DM}
• Top-level Qualifier {TL}
• URL {RL}
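As an illustration (the address itself is hypothetical), an input such as JSMITH@ACME.COM would be separated at the @ and . characters into the User {US} value JSMITH, the Domain {DM} value ACME, and the Top-level Qualifier {TL} value COM.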
Parsing Examples
The following table shows examples of how phone numbers are
parsed:
The hyphen, space, and parentheses characters are used to separate the data. After the data is parsed, the hyphens, spaces, and parentheses are dropped.
Validation Logic
The VPHONE rule set validates patterns and values based on the
following criteria:
• The value has 7 or 10 numeric bytes. It can exceed 10 bytes when an extension is present.
• The first three bytes are not all zeros (000). If they are all zeros, they are replaced with blanks.
• The value is not listed in the invalid table (INVPHONE.TBL), shown here:
0000000000
1111111111
2222222222
3333333333
4444444444
5555555555
6666666666
7777777777
8888888888
9999999999
1234567
5551212
1111111
2222222
3333333
4444444
5555555
6666666
7777777
8888888
9999999
0000000
If the data value fails any one of the validation requirements, the Invalid Data and Invalid Reason fields are populated.
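For example, using illustrative values: the input (212) 555-9876 parses to the ten numeric bytes 2125559876; its first three bytes are not all zeros and the value is not listed in INVPHONE.TBL, so it passes validation. The seven-byte value 5551212, by contrast, appears in INVPHONE.TBL, so the Invalid Data and Invalid Reason fields are populated for that record.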
Examples
The following table shows sample input data and the output it produces:
Parsing Examples
The following table shows examples of how tax IDs and Social
Security numbers are parsed:
The hyphen, space, and parentheses characters are used to separate the data. After the data is parsed, the hyphens, spaces, and parentheses are deleted.
Validation Logic
The rule set validates patterns and values based on the following
criteria:
• The value has nine numeric characters.
• The first three bytes are not all zeros (000).
• The value is not listed in the invalid table (INVTAXID.TBL), shown here:
000000000
111111111
222222222
333333333
444444444
555555555
666666666
777777777
888888888
999999999
123456789
987654321
111223333
Examples
The following table shows sample input data and the output it produces:
Important: We strongly advise that you read “Rule Set Files” on page C-1
before you try to customize rule sets.
See “Using Override Tables to Customize Rule Sets” on page E-2 for
information on how to use rule override tables.
You can also make more complex modifications to Standardize rule sets using subroutines, which are described in “User Modification Subroutines” on page E-40.
[Flow diagram: Domain Pre-Processor override and modification processing order — (1B) input text overrides, (1C) input pattern overrides, (2) input user subroutine modifications, (3) continuation user subroutine modifications, (4) main pattern-action processing (part 1), (5) field text (5A) and field pattern (5B) overrides, (6) field user subroutine modifications, and (7) main pattern-action processing (part 2) with field typing into the NAME, ADDR, AREA, OTHR, and MIXED domain masks.]
[Flow diagram: Domain-Specific rule set processing order — (1C) input pattern overrides, (3) common patterns, (4) main pattern-action processing with user overrides, (5) unhandled text (5A) and unhandled pattern (5B) overrides, and (6) unhandled user subroutine modifications.]
Note: If the list of rules does not appear, select File ➤ Designer
Options from the QualityStage main window, and then
2. Select the rule set for which you want to specify an override.
The appropriate override dialog box appears. The dialog box tabs
let you choose the override table you wish to modify.
3. Under Input Token, enter the word token for which you want to
override the classification, such as SSTREET. Spell the word as it
appears in the input file.
4. Under Standard Form, enter the standardized spelling of the
token, such as ST.
5. From the Classification menu, select the one-character tag that
indicates the class of the token word, such as T-Street Types.
7. Click Add to add the override to the list box at the bottom of the
dialog box.
Note: To continue working with user overrides for the current rule set,
click Apply to save your edits without closing the dialog box.
3. From the Classification Legend list, select the first token, such as
N, and then click Append this code to current pattern.
This adds the token to the Enter Input Pattern text box. The token
also appears in the Current Pattern List with the default A
(Address Domain) Override Code.
4. For each token in the input pattern, repeat step 3, such as for
tokens ^, +, T, A, +, S, and ^.
Note: You can also type the tokens directly in the Enter Input
Pattern text box. Using the list ensures that you enter valid
values.
5. You can leave the default domain setting for a token, or you can
change it. To change a token’s override code:
a. Select the Token in the Current Pattern List.
b. Select a domain from the Dictionary Fields list.
For example, for the pattern N^+TA+S^, the N^+T can keep the
default A–Address Domain. Change A+S^ to the R–Area Domain.
Tip: Alternatively, you can use the Current Pattern List and
select the Token and Override Code you want. For example,
select A from the list of tokens, and then select R from the list
of override codes.
Note: To continue working with user overrides for the current rule set,
click Apply to save your edits without closing the dialog box.
Overriding the field pattern is done similarly, except that you click the
Field Pattern tab at Step 1 instead of the Input Pattern tab.
See “Modifying and Maintaining Overrides” on page E-32 to learn how
to add a new override based on an existing one.
3. Under Enter Input Tokens, enter the domain delimiter and the
text for which you want to override its tokens. For example, enter
ZQNAMEZQ MARTIN LUTHER KING ZQADDRZQ BLVD.
Each word you enter appears in the Current Token List with the
word itself in the Token column and the default domain, such as A
(Address Domain), in the Override Code column.
4. Select the Token for the current domain delimiter of the text you
want to override, and then select the override you want to apply
from the Dictionary Fields list.
When you run the customized rule set on the text, it is processed as
you specified. For example, when you run USPREP on the text
ZQNAMEZQ MARTIN LUTHER KING ZQADDRZQ BLVD, the entire text
string will be handled as an address.
Overriding the field text is done similarly, except that you click the
Field Text tab at Step 1 instead of the Input Text tab.
See “Modifying and Maintaining Overrides” on page E-32 to learn how
to add a new override based on an existing one.
Both Pattern dialog boxes contain the same fields and are used the
same way.
• The Input Pattern override allows you to specify rule overrides
based on the input pattern. Input Pattern overrides take
precedence over the pattern-action file. Input Pattern overrides
can only be specified for the entire input pattern. Partial pattern
matching is not allowed.
• The Unhandled Pattern override allows you to specify rule
overrides based on the unhandled pattern. Unhandled Pattern
overrides work on tokens not processed by the pattern-action file.
Unhandled pattern overrides can only be specified for the entire
unhandled pattern. Partial pattern matching is not allowed.
For example, you would override the Input Pattern table if you wanted to designate special handling for the following pattern:
^+T
To do so:
3. From the Classification Legend list, select the first token, such as
^, and then click Append this code to current pattern.
This adds the token to the Enter Input Pattern text box. The token
also appears under the Current Pattern List with the default
override code AA1 (Additional Address Information and code 1).
4. Repeat step 3 for the tokens + and T.
Note: You can also type the tokens directly in the Enter Input
Pattern text box. Using the list ensures that you enter valid
values.
10. Click Add to add the override to the Override Summary list.
11. Click OK to save your edits and close the dialog box.
Whenever you run the rule set, the pattern for which you have
specified overrides will be processed accordingly. For example,
when you next run the USADDR rule set, the pattern ^+T is
handled accordingly.
Note: To continue working with user overrides for the current rule set,
click Apply to save your edits without closing the dialog box.
from the Classification table and use their original data values. The
following shows the tokenized address before you add any overrides:
3. Under Input Text, enter the text string for which you want to
define a pattern override, such as 100 SUMMER STREET FLOOR 15.
Each text token appears in the Current Token List under the
Token column. Next to each token, the default code of AA
(Additional Address Information) plus action code 1 appears.
For a list of action codes, see the “Action Codes for
Domain-Specific, Validation, and WAVES/Multinational
Standardize Rule Sets” on page E-33.
4. Select the first text token, such as 100.
5. From the Dictionary Fields list, select the code you want, such as
HN - House Number.
The AA1 next to 100 in the Current Token List changes to HN1.
6. Repeat step 4 and step 5 for each of the remaining text tokens, for
example:
Text Type
Summer SN - Street Name
Street ST - Street Suffix Type
Floor FT - Floor Type
15 FV - Floor Value
7. Select the text token in the Current Token List, such as STREET
and then select Standard Value.
The row STREET ST1 changes to STREET ST2, indicating that the
standard value from the Classification table will be used for this
token in this pattern. The rest of the tokens are left as the original
value.
8. Repeat step 7 for each text token you wish to standardize. For
example, repeat step 7 for text token FLOOR.
10. Click OK to save your edits and close the dialog box.
Whenever you run the rule set on the text string, it is processed as you
specified. For example, when you next run the USADDR rule set, the
text string 100 SUMMER STREET FLOOR 15 is handled accordingly.
Note: To continue working with user overrides for the current rule set,
click Apply to save your edits without closing the dialog box.
Deleting Overrides
To delete overrides:
Modifying Overrides
To modify an existing override:
The title bar and Rule Set box display the name of the current rule set
you are testing.
At any time you can select another rule set to test from the Rule Set
list, which lists all available rule sets. If you select a different type of
rule set (Domain Pre-processor rather than Domain-Specific or
Validation), the screen resets itself to reflect the correct type. Any
data under Input String is maintained, but the results grid at the
bottom of the screen is cleared.
The Standardization Rules Analyzer supports international rules. If
no Locale is specified, the Standardization Rules Analyzer assumes
you want to use the default locale to which your computer is set. By
specifying a different locale, you can run data against rule sets that
are not designed for the default locale.
Important: You must set the delimiter for each input string; you cannot leave it at [None]. If you attempt to run the test without specifying a delimiter, the test cannot run.
When you access this screen, the Rule Analyzer first populates the
Input String box’s history. QualityStage maintains a separate history
log of up to five previously tested input strings for each rule set.
For Domain-Specific or Validation rule sets, you can enter only one
Input String to test.
To test a Domain-Specific or Validation rule set:
Subroutine Limitations
Subroutine modifications are not portable to the next upgrade of
QualityStage. Also, the syntax must be correctly written, or
unpredictable output may result. For these reasons, we strongly
advise that you use the Standardization Overrides to control
standardization output.
Input Modifications
Pattern-Action statements added to the input modifications
subroutine are performed before any other pattern-actions.
Modifications should be added here if you have determined that
certain conditions are completely mishandled or unhandled by the
rule set.
The subroutine section of the Pattern-Action file is delimited by a
header, as shown here:
;--------------------------------------------------
;Input_Modifications SUBROUTINE Starts Here
;--------------------------------------------------
\SUB Input_Modifications
Continuation Modifications
The logical flow of the Domain Pre-Processor begins with the isolation of each contiguous pair of delimited input fields to search for common domain patterns that span the pair of fields.
Field Modifications
The second step in the Domain Pre-Processor logical flow is to isolate
each delimited input field, one at a time, to search for common domain
patterns.
Pattern-action statements added to the field modifications subroutine
are performed before any other field pattern-actions.
The field modifications subroutine can be found in the Pattern-Action file at the beginning of the subroutine section or by searching for:
;--------------------------------------------------
;Field_Modifications SUBROUTINE Starts Here
;--------------------------------------------------
\SUB Field_Modifications
Input Modifications
Pattern-actions added to the input modifications subroutine are
performed before any other pattern-actions.
The input modification subroutine can be found in the Pattern-Action
file at the beginning of the subroutine section or by searching for:
;--------------------------------------------------
; Input_Modifications SUBROUTINE Starts Here
;--------------------------------------------------
\SUB Input_Modifications
Unhandled Modifications
Pattern-actions added to the unhandled modifications subroutine are
performed after all other pattern-actions.
The unhandled modification subroutine can be found in the
Pattern-Action file at the beginning of the subroutine section or by
searching for:
;--------------------------------------------------
; Unhandled_Modifications SUBROUTINE Starts Here
;--------------------------------------------------
\SUB Unhandled_Modifications
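As a minimal sketch of what might be added to this subroutine, the following pattern-action set reuses only constructs shown earlier in this appendix; the pattern and dictionary fields are illustrative, not part of a shipped rule set:
^|?|T ; for example, 88 ELM RD
COPY [1] {HN} ; copy the house number
COPY_S [2] {SN} ; copy the street name
COPY_A [3] {ST} ; copy the street type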
The following table lists the 2- and 3-character ISO country codes:
Country                Two-Character   Three-Character
AFGHANISTAN AF AFG
ALBANIA AL ALB
ALGERIA DZ DZA
ANDORRA AD AND
ANGOLA AO AGO
ANGUILLA AI AIA
ANTARCTICA AQ ATA
ARGENTINA AR ARG
ARMENIA AM ARM
ARUBA AW ABW
AUSTRALIA AU AUS
AUSTRIA AT AUT
AZERBAIJAN AZ AZE
BAHAMAS BS BHS
BAHRAIN BH BHR
BANGLADESH BD BGD
BARBADOS BB BRB
BELARUS BY BLR
BELGIUM BE BEL
BELIZE BZ BLZ
BENIN BJ BEN
BERMUDA BM BMU
BHUTAN BT BTN
BOLIVIA BO BOL
BOTSWANA BW BWA
BRAZIL BR BRA
BULGARIA BG BGR
BURUNDI BI BDI
CAMBODIA KH KHM
CAMEROON CM CMR
CANADA CA CAN
CHAD TD TCD
CHILE CL CHL
CHINA CN CHN
COLOMBIA CO COL
COMOROS KM COM
CONGO CG COG
CUBA CU CUB
CYPRUS CY CYP
DENMARK DK DNK
DJIBOUTI DJ DJI
DOMINICA DM DMA
ECUADOR EC ECU
EGYPT EG EGY
EL SALVADOR SV SLV
ERITREA ER ERI
ESTONIA EE EST
ETHIOPIA ET ETH
FIJI FJ FJI
FINLAND FI FIN
FRANCE FR FRA
GABON GA GAB
GAMBIA GM GMB
GEORGIA GE GEO
GERMANY DE DEU
GHANA GH GHA
GIBRALTAR GI GIB
GREECE GR GRC
GREENLAND GL GRL
GRENADA GD GRD
GUADELOUPE GP GLP
GUAM GU GUM
GUATEMALA GT GTM
GUINEA GN GIN
GUINEA-BISSAU GW GNB
GUYANA GY GUY
HAITI HT HTI
HONDURAS HN HND
HUNGARY HU HUN
ICELAND IS ISL
INDIA IN IND
INDONESIA ID IDN
IRAQ IQ IRQ
IRELAND IE IRL
ISRAEL IL ISR
ITALY IT ITA
JAMAICA JM JAM
JAPAN JP JPN
JORDAN JO JOR
KAZAKHSTAN KZ KAZ
KENYA KE KEN
KIRIBATI KI KIR
KUWAIT KW KWT
KYRGYZSTAN KG KGZ
LATVIA LV LVA
LEBANON LB LBN
LESOTHO LS LSO
LIBERIA LR LBR
LIECHTENSTEIN LI LIE
LITHUANIA LT LTU
LUXEMBOURG LU LUX
MACAU MO MAC
MADAGASCAR MG MDG
MALAWI MW MWI
MALAYSIA MY MYS
MALDIVES MV MDV
MALI ML MLI
MALTA MT MLT
MARTINIQUE MQ MTQ
MAURITANIA MR MRT
MAURITIUS MU MUS
MAYOTTE YT MYT
MEXICO MX MEX
MONACO MC MCO
MONGOLIA MN MNG
MONTSERRAT MS MSR
MOROCCO MA MAR
MOZAMBIQUE MZ MOZ
MYANMAR MM MMR
NAMIBIA NA NAM
NAURU NR NRU
NEPAL NP NPL
NETHERLANDS NL NLD
NICARAGUA NI NIC
NIGER NE NER
NIGERIA NG NGA
NIUE NU NIU
NORWAY NO NOR
OMAN OM OMN
PAKISTAN PK PAK
PALAU PW PLW
PANAMA PA PAN
PARAGUAY PY PRY
PERU PE PER
PHILIPPINES PH PHL
PITCAIRN PN PCN
POLAND PL POL
PORTUGAL PT PRT
QATAR QA QAT
REUNION RE REU
ROMANIA RO ROM
RWANDA RW RWA
SAMOA WS WSM
SENEGAL SN SEN
SEYCHELLES SC SYC
SINGAPORE SG SGP
SLOVENIA SI SVN
SOMALIA SO SOM
SPAIN ES ESP
SUDAN SD SDN
SURINAME SR SUR
SWAZILAND SZ SWZ
SWEDEN SE SWE
SWITZERLAND CH CHE
TAJIKISTAN TJ TJK
THAILAND TH THA
TOGO TG TGO
TOKELAU TK TKL
TONGA TO TON
TUNISIA TN TUN
TURKEY TR TUR
TURKMENISTAN TM TKM
TUVALU TV TUV
UGANDA UG UGA
UKRAINE UA UKR
URUGUAY UY URY
UZBEKISTAN UZ UZB
VANUATU VU VUT
VENEZUELA VE VEN
YEMEN YE YEM
YUGOSLAVIA YU YUG
Scoping
Scoping provides a precise, reliable way to guarantee that the correct dictionary field is referenced. Scoping also allows for the global scope references to dictionary fields and variable names shown in the following examples.
Example
In previous versions of QualityStage, the GEOCODE rule set’s
Pattern-Action file referenced the state abbreviation field ({SA}) in the
Dictionary File of the PLACE rule set:
[ {SA} = “PR” ]
In order for the above pattern to test true, it must be modified to
include a global scope reference to the PLACE rule set:
[ {SA of PLACE} = “PR” ]
All other references to {SA} in the GEOCODE Pattern-Action file must be changed to {SA of PLACE}. If the global scopes are not added, the executing Standardize stage does not terminate with an error; the patterns simply never test true, and therefore the associated actions are not performed.
Example
If the following pattern-action set appeared in USADDR.PAT, it would
move the token to the USADDR temp variable:
&
COPY [1] temp
And, if the following pattern-action set appeared in USADDR.PAT, it
would move the token to the USAREA temp variable:
&
COPY [1] temp<USAREA>
Important: Rule sets using global scope references for either dictionary fields
or variable names are not compatible with versions of
INTEGRITY before release 3.6.
Note: We recommend that you use a delimited text file format when
exchanging files between AuditStage and QualityStage.
Pre-Standardization Validation
Before running a Standardize stage in QualityStage, you can use
AuditStage to complement the QualityStage Investigate stage.
Identifying problem areas in the data before standardizing it ensures that the results are more meaningful and thus enhances the productivity of your QualityStage work session.
1. Use the Sample Clause function to create your sample file. See
Chapter 9, Types of Data Filter Checks in the AuditStage User’s
Guide for more information.
2. When defining the sample, do one of the following:
a. For standardization, create a random sample of records of
your QualityStage results file.
b. For matching, use a random sample of files based upon unique
Match Set IDs, rather than rows, of your results file. Extract
all the records with the same Match Set IDs for the sample.
3. Create a data filter for your sample QualityStage source file that
defines the columns you want to test.
Based on the results of your sample tests, you can modify your rule
sets to better meet the requirements of your source data.
A
Abbreviate stage 6-2
adding
  jobs 6-4
  projects 4-2
  stages to jobs 6-5, 6-6
  stages to user-defined jobs 6-9
add-on modules, using 4-28
Advanced Options dialog box
  for Character mode 9-10
  for Word mode 9-22
Append Field Selection dialog box 10-20
Arrayfields dialog box 4-27
arrays
  adding fields to 4-27
  assigning missing values 4-27
  assigning special treatment 12-40
  defining 4-27
  using with the Match stage 12-36
AuditStage
  aggregating data sources H-2
  documentation set H-1
  job maintenance H-6
  Pre-standardization validation H-3
  sampling data H-5
  tuning standardization and matching H-4
B
blocking 12-5
  specifying 12-27
Build stage 6-3
business names, standardizing with the Standardize stage 10-30, D-1
C
Character Discrete 15-15
Character mode
  about 9-5
  about Pattern reports 9-5
  Concatenate option 9-7
  creating job 9-9
  Discrete option 9-6
Character Mode dialog box 9-9
classification
  in Pattern-Action file C-9
  with rule sets C-9
Classification Table
  special classes C-8
  threshold weights C-7
COBOL Copybook, importing 4-9
Collapse stage 6-2
Command Definition dialog box 10-15
copying projects 4-3
  lookup C-14
  override C-14, C-15
threshold weights
  Classification Table C-7
tokens
  in Pattern-Action file C-9
  with rule sets C-9
Transfer stage 6-2
troubleshooting
  Report Viewer 16-6
U
unduplicating
  using Match 12-8
Unijoin stage 6-2
UNIX server
  advanced project settings 5-14
  defining directories 5-14
  defining run profile 5-10
  input data location 7-10
  location of data directory 5-14
u-probability
  defining 12-7
  specifying 12-35
US Standardization Before/After 15-20
US Standardization with PREP Summary Report 15-27
user overrides E-9
using
  add-on modules 4-28
  dialog boxes 3-10
    additional menus 3-11
    browsing 3-12
    moving items 3-11
    selecting items 3-11
  Multinational Standardize stage 10-24
  projects 4-1
  QualityStage 3-1–3-16
  QualityStage main window 3-4
V
Validation rule sets 10-10
Vartype dialog box 12-40
Vartypes, defining 12-40
W
Weight Override dialog box 12-38
weights
  calculating with Match 12-7
  specifying overrides for Match 12-36
Windows server
  advanced project settings 5-14
  defining directories for 5-14
  defining local Windows run profile 5-16
  defining run profile 5-10
  location of data directory 5-14
Word mode
  about 9-11
  about Pattern reports 9-13
  about rule sets 9-12
  about Word Classification reports 9-16
  about Word Frequency reports 9-14
  creating job 9-21
  setting advanced options 9-18
Word Mode dialog box 9-21
word report 15-16
workflow 2-1
working client directory 3-13
working with Match reports 13-1–13-32
working with QualityStage reports 15-1–15-32