TET2 2manual
www.pdflib.com
Reference Manual
PDFlib GmbH
Tal 40, 80331 München, Germany
www.pdflib.com
phone +49 • 89 • 29 16 46 87
fax +49 • 89 • 29 16 46 86
If you have questions check the PDFlib mailing list and archive at tech.groups.yahoo.com/group/pdflib
This publication and the information herein is furnished as is, is subject to change without notice, and
should not be construed as a commitment by PDFlib GmbH. PDFlib GmbH assumes no responsibility or liability
for any errors or inaccuracies, makes no warranty of any kind (express, implied or statutory) with respect
to this publication, and expressly disclaims any and all warranties of merchantability, fitness for particular
purposes and noninfringement of third party rights.
PDFlib and the PDFlib logo are registered trademarks of PDFlib GmbH. PDFlib licensees are granted the
right to use the PDFlib name and logo in their product documentation. However, this is not required.
Adobe, Acrobat, and PostScript are trademarks of Adobe Systems Inc. AIX, IBM, OS/390, WebSphere, iSeries,
and zSeries are trademarks of International Business Machines Corporation. ActiveX, Microsoft, Windows,
and Windows NT are trademarks of Microsoft Corporation. Apple, Macintosh and TrueType are trademarks
of Apple Computer, Inc. Unicode and the Unicode logo are trademarks of Unicode, Inc. Unix is a trademark
of The Open Group. Java and Solaris are trademarks of Sun Microsystems, Inc. Other company product
and service names may be trademarks or service marks of others.
The PDFlib Text Extraction Toolkit contains modified parts of the following third-party software:
Zlib compression library, Copyright © 1995-2002 Jean-loup Gailly and Mark Adler
Cryptographic software written by Eric Young, Copyright © 1995-1998 Eric Young (eay@cryptsoft.com)
Cryptographic software, Copyright © 1998-2002 The OpenSSL Project (www.openssl.org)
The PDFlib Text Extraction Toolkit contains the RSA Security, Inc. MD5 message digest algorithm.
Contents
0 First Steps with TET
0.1 Installing the Software
0.2 Applying the TET License Key
1 Introduction
1.1 TET Application Scenarios
1.2 TET Features
1.3 TET Command-Line Tool or TET Library?
1.4 The TET Plugin for Adobe Acrobat
5.3 Composite Data Structures and IDs
5.4 Path Syntax
5.5 Pseudo Objects
5.6 Encrypted PDF Documents
B Revision History
Index
0 First Steps with TET
0.1 Installing the Software
TET is delivered as an MSI installer package for Windows systems, and as a compressed
archive for all other supported operating systems. All TET packages contain the TET
command-line tool and the TET library/component, plus support files, documentation,
and examples. After installing or unpacking TET the following steps are recommended:
> Users of the TET command-line tool can use the executable right away. The available
options are discussed in Section 2.1, »Command-Line Options«, page 13, and are also
displayed when you execute the TET command-line tool without any options.
> Users of the TET library/component should read one of the sections in Chapter 3,
»TET Library Language Bindings«, page 19, corresponding to their preferred development
environment, and review the installed examples. On Windows, the TET programming
examples are accessible via the Start menu.
If you obtained a commercial TET license you must enter your TET license key according
to Section 0.2, »Applying the TET License Key«, page 6.
CJK configuration. In order to extract Chinese, Japanese, or Korean (CJK) text TET generally
requires the corresponding CMap files for mapping CJK encodings to Unicode.
The CMap files are contained in all TET packages, and will be installed in the resource/cmap
directory within the TET installation directory. On Windows systems simply
choose the full installation option when installing TET. The CMap files will be found
automatically via the registry.
On other systems you must manually configure the CMap files:
> For the TET command-line tool this can be achieved by supplying the name of the
directory holding the CMap files with the --searchpath option.
> For the TET library/component you can set the searchpath at runtime:
TET_set_option(tet, "searchpath=/path/to/resource/cmap");
As an alternative method for configuring access to the CJK CMap files you can set the
TETRESOURCEFILE environment variable to point to a UPR configuration file which
contains a suitable searchpath definition.
0.2 Applying the TET License Key
Restrictions of the evaluation version. The TET command-line tool and library can be
used as fully functional evaluation versions even without a commercial license. Unlicensed
versions support all features, but will only process PDF documents with up to 10
pages and 1 MB size. Evaluation versions of TET must not be used for production purposes,
but only for evaluating the product. Using TET for production purposes requires
a valid TET license.
Note TET license keys are platform-dependent, and can only be used on the platform for which they
have been purchased.
Entering the license key in the Windows installer. Windows users can enter the license
key when they install TET using the supplied installer. This is the recommended method
on Windows. If you do not have write access to the registry or cannot use the installer,
refer to one of the alternate methods below instead.
Setting the license key with a TET library call. If you use the TET library, add a line to
your script or program which sets the license key at runtime:
> In COM/VBScript:
oTET.set_option "license=...your license key..."
> In C:
TET_set_option(tet, "license=...your license key...");
> In RPG:
d licensekey s 20
d licenseval s 50
c eval licenseopt='license=... your license key ...'+x'00'
c callp TET_set_option(TET:licenseopt:0)
The license option must be set immediately after instantiating the TET object, i.e., after
calling TET_new( ) (in C, PHP 4) or creating a TET object (in C++, COM, .NET, Java, and
PHP 5).
Entering the license key in a license file. Set an environment (shell) variable which
points to a license file before TET functions are called. If you are using the TET library
you can alternatively set the path to the license file by setting the licensefile parameter
with the TET_set_option( ) function. The license file must be a text file according to the
sample below; you can use the license file template licensekeys.txt which is contained in
all TET distributions. Lines beginning with a ’#’ character contain comments, and will
be ignored; the second line contains version information for the license file itself:
# Licensing information for PDFlib GmbH products
PDFlib license file 1.0
TET 2.2 ...your license key...
This command can be specified in the startup program QSTRUP and will work for all
PDFlib GmbH products.
TET has been designed for standalone use, and does not require any third-party software.
It is robust and suitable for multi-threaded server use. The core library is written
in highly optimized C code for maximum performance and minimum overhead. Several
language bindings are available for use with all major programming languages.
Supported PDF input. TET has been tested against thousands of PDF test files from various
sources. It accepts all relevant flavors of PDF:
> PDF versions 1.0-1.7 (corresponding to Acrobat 1-8);
> all compression filters;
> all font and encoding combinations: base 14 fonts, TrueType, PostScript, OpenType,
single- and multi-byte CID fonts;
> documents encrypted with 40- or 128-bit keys (only if content extraction is allowed
by the document’s permission settings, or the master password is supplied).
Some PDF documents do not contain enough information for reliable Unicode mapping.
To extract the text from such documents nevertheless, TET offers various configuration
options which can be used to supply auxiliary information for proper Unicode
mappings. In order to facilitate writing the required mapping tables we make available
PDFlib FontReporter, a free plugin for Adobe Acrobat. This plugin can be used for analyzing
fonts, encodings, and glyphs in PDF.
CJK support. TET includes full support for extracting Chinese, Japanese, and Korean
text:
> All predefined CJK CMaps (encodings) are recognized; CJK text will be converted to
Unicode. CMap files are shipped with the TET distribution.
> Both horizontal and vertical writing modes are supported.
> CJK font names will be normalized to Unicode.
pCOS interface for simple access to PDF objects. TET includes pCOS (PDFlib Comprehensive
Object System) for retrieving arbitrary PDF objects. With pCOS you can retrieve
PDF metadata, hypertext (e.g. bookmark text, contents of form fields), or any other information
from a PDF document with a simple query interface.
Geometry. TET provides precise metrics for the text, such as the position on the page,
glyph widths, and text direction. Specific areas on the page can be excluded or included
in the text extraction process, e.g. to ignore headers and footers or margins.
Word detection and content analysis. TET can be used to retrieve low-level glyph information,
but also includes advanced algorithms for high-level content analysis:
> Detect word boundaries to retrieve words instead of characters.
> Recombine the parts of hyphenated words (dehyphenation).
> Remove duplicate instances of text, e.g. shadow and fake bold text.
> Recombine paragraphs into reading order.
> Reorder text which is scattered over the page.
> Reconstruct lines of text.
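To make the idea of dehyphenation more tangible, the following C sketch joins two text fragments when the first one ends with a hyphen. It is a deliberately naive illustration and not the algorithm used by TET, which is not described here. The function name join_line_fragments and the rule "hyphen after a letter, followed by a lowercase letter" are assumptions made for this example only.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the TET API): combine two fragments.
 * If first ends with a letter plus '-' and second starts with a lowercase
 * letter, the hyphen is dropped and the parts are joined; otherwise the
 * fragments are separated by a single space.
 * Returns 1 if the parts were joined, 0 if they were kept separate,
 * and -1 if the result does not fit into out.
 */
static int
join_line_fragments(const char *first, const char *second,
                    char *out, size_t outsize)
{
    size_t n1, n2;
    int joined;

    assert(first != NULL && second != NULL && out != NULL);

    n1 = strlen(first);
    n2 = strlen(second);

    joined = n1 >= 2 && first[n1 - 1] == '-' &&
             isalpha((unsigned char) first[n1 - 2]) &&
             islower((unsigned char) second[0]);

    if (joined)
        n1--;                           /* drop the trailing hyphen */

    if (n1 + (joined ? 0 : 1) + n2 + 1 > outsize)
        return -1;

    memcpy(out, first, n1);
    if (!joined)
        out[n1++] = ' ';
    memcpy(out + n1, second, n2 + 1);   /* includes the terminating null */
    return joined;
}
```

With these assumptions, the fragments "hyphen-" and "ated" are combined to "hyphenated", while "text" and "extraction" stay separate words. A real implementation would also have to consider the position of the fragments on the page.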
What is text? While TET deals with a large class of PDF documents, not all visible text
can successfully be extracted. The text must be encoded using PDF’s text and encoding
facilities (i.e., it must be based on a font). Although the following flavors of text may be
visible on the page they cannot be extracted with TET:
> Rasterized (pixel image) text, e.g. scanned pages
> Vectorized text
Note that metadata and text in hypertext elements (such as bookmarks, form fields,
notes, or annotations) can be retrieved with the pCOS interface. On the other hand, TET
may extract some text which is not visible on the page. This may happen in the following
situations:
> Text using PDF’s invisible attribute (however, there is an option to exclude this kind
of text from the text retrieval process)
> Text which is obscured or clipped by some other element on the page, e.g. an image.
> PDF layers are currently ignored; TET will retrieve the text from all layers regardless
of their visibility.
The TET command-line tool is built on top of the TET library. You can supply library options
using the --docopt, --tetopt, and --pageopt options according to the option list tables
in Chapter 6, »TET Library API Reference«, page 67. Table 2.1 lists all TET command-line
options (this list will also be displayed if you run the TET program without any options).
Note In order to extract CJK text you must configure access to the CMap files which are shipped with
TET according to Section 0.1, »Installing the Software«, page 5.
--docopt <option list>
    Additional option list for TET_open_document( ) (see Table 6.2, page 73)
--firstpage, -f <integer>
    The number of the page where text extraction will start. The keyword
    last can be used to specify the last page. Default: 1
--format utf8 | utf16
    Specifies the format for text output (default: utf8):
    utf8    UTF-8 with BOM (byte order mark)
    utf16   UTF-16 in native byte ordering with BOM
--inmemory
    Load the input file(s) into memory and process it from there. This can result
    in a significant performance gain on some systems at the expense of
    memory usage.
--lastpage, -l <integer>
    The number of the page where text extraction will finish. The keyword
    last can be used to specify the last page. Default: last
--outfile, -o <filename>
    Output file name. The file name »-« can be used to designate standard
    output. Default: name of the input file, with .pdf or .PDF replaced with .txt
    (for text output) or .xml (for XML output).
--pageopt <option list>
    Additional option list for TET_open_page( ) (see Table 6.4, page 77); the
    option granularity will always be set to page.
--targetdir, -t <dirname>
    Output directory name; the directory must exist. Default: .
--tetopt <option list>
    Additional option list for TET_set_option( ) (see Table 6.7, page 83). The option
    outputformat will be ignored (use --format instead).
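The default rule for the output file name (.pdf or .PDF replaced with .txt or .xml) can be sketched in C as follows. This is only an illustration of the rule stated for the --outfile option, not the code of the TET command-line tool. The function name default_outfile_name is hypothetical, and appending the suffix to input names without a .pdf/.PDF extension is an assumption, since this case is not covered by the description.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of TET): derive an output name from the
 * input name by replacing a trailing ".pdf" or ".PDF" with new_suffix,
 * e.g. ".txt" or ".xml".
 * Assumption: names without such an extension simply get new_suffix appended.
 * Returns 0 on success, or -1 if the result does not fit into out.
 */
static int
default_outfile_name(const char *infile, const char *new_suffix,
                     char *out, size_t outsize)
{
    size_t len, stem;

    assert(infile != NULL && new_suffix != NULL && out != NULL);

    len = strlen(infile);
    stem = len;

    if (len >= 4 &&
        (strcmp(infile + len - 4, ".pdf") == 0 ||
         strcmp(infile + len - 4, ".PDF") == 0))
        stem = len - 4;                 /* cut off the extension */

    if (stem + strlen(new_suffix) + 1 > outsize)
        return -1;

    memcpy(out, infile, stem);
    strcpy(out + stem, new_suffix);
    return 0;
}
```

Under these assumptions the input name report.pdf yields report.txt for text output, and SCAN.PDF yields SCAN.xml for XML output.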
Constructing TET command lines. The following rules must be observed for constructing
TET command lines:
> Input files will be searched for in all directories specified as searchpath.
> Short forms are available for some options, and can be mixed with long options.
> Long options can be abbreviated provided the abbreviation is unique (e.g. --tet instead
of --tetopt).
> Depending on the encryption status of the input file, a user or master password may
be required for successfully extracting text. It must be supplied with the --password
option. TET will check whether this password is sufficient for text extraction, and
will generate an error if it isn’t.
TET checks the full command line before processing any file. If an error is encountered
in the options anywhere on the command line, no files will be processed at all.
File names. File names which contain blank characters require some special handling
when used with command-line tools like TET. In order to process a file name with blank
characters you should enclose the complete file name with double quote " characters.
Wildcards can be used according to standard practice. For example, *.pdf denotes all files
in a given directory which have a .pdf file name suffix. Note that on some systems case
is significant, while on others it isn’t (i.e., *.pdf may be different from *.PDF). Also note
that on Windows systems wildcards do not work for file names containing blank characters.
Exit codes. The TET command-line tool returns with an exit code which can be used to
check whether or not the requested operations could be successfully carried out:
> Exit code 0: all command-line options could be successfully and fully processed.
> Exit code 1: one or more file processing errors occurred, but processing continued.
What’s included in the XML output? XML output created by TET will be encoded in
UTF-8 (on zSeries with USS or MVS: EBCDIC-UTF-8), and includes the following information:
> A Creation element showing the date and operating system platform for the TET execution,
plus the version number of TET
> A Document element with general information including PDF file name and size, PDF
version number, metadata (a Metadata element with XMP if present, or a DocInfo element
with document info fields otherwise)
> A Page element for each page of the PDF document, containing page size attributes
and the page contents
> For each page of the PDF document, a Structure element with the actual text and coordinates
according to the chosen granularity. For glyph and word granularity a Font
element will be written with font information. In glyph granularity a Glyph element
will contain the position and width of the corresponding glyph.
The XML output also includes relevant document- and page-related options which were
supplied to TET. A DTD (Document Type Definition) describing the TET XML output in
detail can be found at the following Web location:
www.pdflib.com/XML/tet2/tet-1.0.dtd
The following sample shows XML output for the same file, but in glyph mode:
TET signals such errors by returning a value of –1 as documented in the API reference.
Other events may be considered harmful, but will occur rather infrequently, e.g.
> running out of virtual memory;
> supplying wrong function parameters (e.g. an invalid document handle);
> supplying malformed option lists;
> a required resource (e.g. a CMap file for CJK text extraction) cannot be found.
When TET detects such a situation, an exception will be thrown instead of passing a special
error return value to the caller. In languages which support native exceptions,
the exception will be thrown using the standard means supplied by the language
or environment. For the C language binding TET supplies a custom exception
handling mechanism which must be used by clients (see Section 3.2, »C Binding«, page
20).
It is important to understand that processing a document must be stopped once an
exception has occurred. The only methods which can safely be called after an exception are
TET_delete( ), TET_get_apiname( ), TET_get_errnum( ), and TET_get_errmsg( ). Calling any
other method after an exception may lead to unexpected results. The exception will
contain the following information:
> A unique error number;
> The name of the API function which caused the exception;
> A descriptive text containing details of the problem.
Querying the reason for a failed function call. Some TET function calls, e.g.
TET_open_document( ) or TET_open_page( ), can fail without throwing an exception (they will return
-1 in case of an error). In this situation the functions TET_get_errnum( ), TET_get_errmsg( ),
and TET_get_apiname( ) can be called immediately after a failed function call in order to
retrieve details about the nature of the problem.
The following code fragment demonstrates these rules with the typical idiom for dealing
with TET exceptions in client code (a full sample can be found in the TET package):
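volatile int pageno;
...
if ((tet = TET_new()) == (TET *) 0)
{
    printf("out of memory\n");
    return(2);
}
TET_TRY(tet)
{
    for (pageno = 1; pageno <= n_pages; ++pageno)
    {
        /* process page */
    }
}
TET_CATCH(tet)
{
    /* only the error query functions and TET_delete( ) are safe here */
    printf("Error %d in %s() on page %d: %s\n",
        TET_get_errnum(tet), TET_get_apiname(tet), pageno, TET_get_errmsg(tet));
}
TET_delete(tet);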
Unicode handling for name strings. The C language does not natively support Unicode.
Some string parameters for API functions may be declared as name strings. These
are handled depending on the length parameter and the existence of a BOM at the beginning
of the string. In C, if the length parameter is different from 0 the string will be
interpreted as UTF-16. If the length parameter is 0 the string will be interpreted as UTF-8
if it starts with a UTF-8 BOM, or as EBCDIC UTF-8 if it starts with an EBCDIC UTF-8 BOM,
or as host encoding if no BOM is found (or ebcdic on all EBCDIC-based platforms).
Unicode handling for option lists. Strings within option lists require special attention
since they cannot be expressed as Unicode strings in UTF-16 format, but only as byte arrays.
For this reason UTF-8 is used for Unicode options. TET looks for a BOM at the beginning
of an option in order to determine the format of the string. More precisely,
interpreting a string option works as follows:
> If the option starts with a UTF-8 BOM (\xEF\xBB\xBF) it will be interpreted as UTF-8.
> If the option starts with an EBCDIC UTF-8 BOM (\x57\x8B\xAB) it will be interpreted as
EBCDIC UTF-8.
> If no BOM is found, the string will be treated as winansi (or ebcdic on EBCDIC-based
platforms).
Note The TET_utf16_to_utf8( ) utility function can be used to create UTF-8 strings from UTF-16
strings, which is useful for creating option lists with Unicode values.
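The BOM rules for option strings listed above can be expressed as a small C function. The sketch below only mirrors the three documented cases; it is not taken from the TET sources, and the names opt_bom and classify_option_bom are hypothetical. The BOM byte values are the ones quoted in the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this example (not part of the TET API). */
typedef enum
{
    OPT_BOM_UTF8,           /* starts with the UTF-8 BOM EF BB BF */
    OPT_BOM_EBCDIC_UTF8,    /* starts with the EBCDIC UTF-8 BOM 57 8B AB */
    OPT_BOM_NONE            /* no BOM: winansi, or ebcdic on EBCDIC platforms */
} opt_bom;

/* Classify an option string of len bytes by looking at its first bytes. */
static opt_bom
classify_option_bom(const char *s, size_t len)
{
    static const unsigned char utf8_bom[3] = { 0xEF, 0xBB, 0xBF };
    static const unsigned char ebcdic_utf8_bom[3] = { 0x57, 0x8B, 0xAB };

    assert(s != NULL);

    if (len >= 3 && memcmp(s, utf8_bom, 3) == 0)
        return OPT_BOM_UTF8;

    if (len >= 3 && memcmp(s, ebcdic_utf8_bom, 3) == 0)
        return OPT_BOM_EBCDIC_UTF8;

    return OPT_BOM_NONE;
}
```

A client could use such a check to verify that an option string built from Unicode data really carries the UTF-8 BOM before it is passed on. When writing the BOM as a C string literal, keep it in a separate literal such as "\xEF\xBB\xBF" "text", because a hexadecimal escape would otherwise absorb following hexadecimal digits.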
3.3 C++ Binding
In addition to the tetlib.h C header file, an object-oriented wrapper for C++ is supplied
for TET clients. It requires the tet.hpp header file, which in turn includes tetlib.h. The
corresponding tet.cpp module must be linked with the application in addition to the
generic TET C library.
The C++ object wrapper replaces the functional approach of the C binding, in which all
function names carry the TET_ prefix, with an object-oriented approach: a TET object
offers methods whose names no longer have the TET_ prefix.
The TET C++ binding will package Unicode text in standard C++ strings in UTF-16 format.
Clients must be prepared to process such strings appropriately.
Exception Handling. Exception handling for the TET COM component is done according
to COM conventions: when a TET exception occurs, a COM exception will be raised
and furnished with a clear-text description of the error. In addition the memory allocated
by the TET object is released. The COM exception can be caught and handled in the
TET client in whichever way the client environment supports for handling COM errors.
Using the TET COM Edition with .NET. As an alternative to the TET.NET edition (see
Section 3.6, ».NET Binding«, page 25) the COM edition of TET can also be used with .NET.
First, you must create a .NET assembly from the TET COM edition using the tlbimp.exe
utility:
tlbimp tet_com.dll /namespace:tet_com /out:Interop.tet_com.dll
You can use this assembly within your .NET application. If you add a reference to
tet_com.dll from within Visual Studio .NET an assembly will be created automatically. The
following code fragment shows how to use the TET COM edition with C#:
using TET_com;
...
static TET_com.ITET tet;
...
tet = new TET();
...
The TET Java package is contained in the tet.jar file and contains a single class called tet.
In order to supply this package to your application, you must add tet.jar to your
CLASSPATH environment variable, add the option -classpath tet.jar in your calls to the
Java compiler, or perform equivalent steps in your Java IDE. In the JDK you can configure
the Java VM to search for native libraries in a given directory by setting the
java.library.path property to the name of the directory, e.g.
java -Djava.library.path=. extractor
Exception handling. The TET language binding for Java will throw native Java excep-
tions of the class TETException. TET client code must use standard Java exception syntax:
TET tet = null;
try {
    tet = new TET();
    /* ...some TET instructions... */
} catch (TETException e) {
    System.err.print("TET exception occurred:\n");
    System.err.print("[" + e.get_errnum() + "] " + e.get_apiname() + ": " +
        e.get_errmsg() + "\n");
} catch (Exception e) {
    System.err.println(e.getMessage());
} finally {
    if (tet != null) {
        tet.delete(); /* delete the TET object */
    }
}
Since TET declares appropriate throws clauses, client code must either catch all possible
exceptions or declare those itself.
Note TET.NET requires the .NET framework 1.1 or above. It does not work with framework 1.0. If you
must work with framework 1.0 we recommend using the TET COM component as a .NET assembly
as detailed in Section 3.4, »COM Binding«, page 23.
TET.NET can be deployed in all environments that support the .NET Framework 1.1 or
above. The TET distribution package contains code samples for various .NET languages.
The TET MSI installer will install the TET assembly plus auxiliary data files, documentation
and samples on the machine interactively. The installer will also register TET
so that it can easily be referenced on the .NET tab in the Add Reference dialog box of
Visual Studio .NET.
Exception handling. TET.NET supports .NET exceptions, and will throw an exception
with a detailed error message when a runtime problem occurs. The client is responsible
for catching such an exception and reacting to it properly. Otherwise the .NET framework
will catch the exception and usually terminate the application.
In order to convey exception-related information TET defines its own exception
class TET_dotnet.TETException with the members get_errnum, get_errmsg, and get_apiname.
Installing the TET Edition for Perl. The Perl extension mechanism loads shared libraries
at runtime through the DynaLoader module. The Perl executable must have been
compiled with support for shared libraries (this is true for the majority of Perl configurations).
For the TET binding to work, the Perl interpreter must access the TET Perl wrapper
and the module file tetlib_pl.pm. In addition to the platform-specific methods described
below you can add a directory to Perl’s @INC module search path using the -I command
line option:
perl -I/path/to/tet extractor.pl
Unix. Perl will search for both tetlib_pl.so (on Mac OS X: tetlib_pl.dylib) and tetlib_pl.pm in
the current directory, or the directory printed by the following Perl command:
perl -e 'use Config; print $Config{sitearchexp};'
Perl will also search the subdirectory auto/tetlib_pl. Typical output of the above command
looks like
/usr/lib/perl5/site_perl/5.8/i686-linux
Windows. PDFlib supports the ActiveState port of Perl 5 to Windows, also known as
ActivePerl. Both tetlib_pl.dll and tetlib_pl.pm will be searched for in the current directory, or
the directory printed by the following Perl command:
perl -e "use Config; print $Config{sitearchexp};"
Exception Handling in Perl. When a TET exception occurs, a Perl exception is thrown. It
can be caught and acted upon using an eval sequence:
eval {
...some TET instructions...
};
die "Exception caught: $@" if $@;
PHP will search for the library in the directory specified in the extension_dir variable in
php.ini on Unix, and additionally in the standard system directories on Windows.
You can test which version of the PHP TET binding you have installed with the following
one-line PHP script:
<?phpinfo()?>
This will display a long info page about your current PHP configuration. On this page
check the section titled tet. If this section contains the phrase
PDFlib TET Support enabled
(plus the TET version number) you have successfully installed TET for PHP.
> Alternatively, you can load TET at runtime with one of the following lines at the start
of your script:
dl("libtet_php.dll"); # for Windows
dl("libtet_php.so"); # for Unix
dl("libtet_php.sl"); # for HP-UX
dl("libtet_php.dylib"); # for Mac OS X
File name handling in PHP. Unqualified file names (without any path component) and
relative file names are handled differently in Unix and Windows versions of PHP:
> PHP on Unix systems will find files without any path component in the directory
where the script is located.
Error handling in PHP 4. When a TET exception occurs, a PHP exception is thrown.
Since PHP 4 does not support structured exception handling there is no way to catch exceptions
and act appropriately. Do not disable PHP warnings when using TET, or you
will run into serious trouble.
Exception handling in PHP 5. Since PHP 5 supports structured exception handling, TET
exceptions will be propagated as PHP exceptions. You can use the standard try/catch
technique to deal with TET exceptions:
try {
Note that you can use PHP 5-style exception handling regardless of whether you work
with the old function-based TET interface, or the new object-oriented one.
If the TET source file library is not at the top of your library list you have to specify the
library as well:
d/copy tetsrclib/QRPGLESRC,TETLIB
Before you start compiling your ILE-RPG program you have to create a binding directory
that includes the TETLIB service program shipped with TET. The following example assumes
that you want to create a binding directory called TETLIB in the library TETLIB:
CRTBNDDIR BNDDIR(TETLIB/TETLIB) TEXT('TETlib Binding Directory')
After creating the binding directory you need to add the TETLIB service program to your
binding directory. The following example assumes that you want to add the service program
TETLIB in the library TETLIB to the binding directory created earlier.
ADDBNDDIRE BNDDIR(TETLIB/TETLIB) OBJ((TETLIB/TETLIB *SRVPGM))
Now you can compile your program using the CRTBNDRPG command (or option 14 in
PDM):
CRTBNDRPG PGM(TETLIB/EXTRACTOR) SRCFILE(TETLIB/QRPGLESRC) SRCMBR(*PGM) DFTACTGRP(*NO)
BNDDIR(TETLIB/TETLIB)
Exception Handling in RPG. TET clients written in ILE-RPG can use a limited form of
TET’s try/catch mechanism as follows:
c eval rtn=tet_try(tet)
c if TET_open_document(tet:in_filename:0:optlist)=-1
c or tet_catch(tet)=1
*
c callp TET_delete(tet)
c eval error='Couldn''t open input file '+
c %trim(out_filename)
c exsr exit
c endif
There is no one-to-one relationship between characters and glyphs. For example, a ligature
is a single glyph which is represented by two or more separate characters. On the
other hand, a specific glyph may be used to represent different characters depending on
the context (some characters look identical, see Figure 4.1).
Fig. 4.1 Relationship of glyphs and characters: the same glyph may represent the character
U+2167 ROMAN NUMERAL EIGHT or the character sequence U+0056 V, U+0049 I, U+0049 I, U+0049 I.
Text filtering. There are several situations where TET will modify the actual character
values found on the page in order to make the results more useful. Most of these steps
can be controlled via options. The following list gives an overview of all operations
which may modify the text:
> Dehyphenation will remove hyphen characters and combine the parts of a hyphenated
word. This can be disabled with the dehyphenate suboption of the contentanalysis
option for TET_open_page( ).
> Redundant text which creates only visual artifacts such as shadow effects or artificial
bold text will be removed. This can be disabled with the shadowdetect suboption of
the contentanalysis option for TET_open_page( ).
> Very small or very large text can be ignored. The limits can be controlled with the
fontsizerange option of TET_open_page( ).
> Unicode post-processing will replace certain Unicode characters with more familiar
ones. For example, Latin ligatures will be replaced with their constituent characters,
and fullwidth ASCII variants in CJK fonts will be replaced with the corresponding
non-fullwidth characters. For details see Table 4.1, page 37.
> Invisible text (text with textrendering=3) will be extracted by default, but this can be
changed with the ignoreinvisibletext option of TET_open_page( ).
> Glyphs which cannot be mapped to Unicode will be replaced with the Unicode character
defined in the unknownchar option of TET_open_document( ). See section
»Unmappable glyphs«, page 38.
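The Unicode post-processing step in the list above can be pictured with the following C sketch, which folds the two kinds of characters mentioned there. It is an illustration only: the replacements actually performed by TET are defined in Table 4.1, which is not reproduced here. The function name fold_char is hypothetical, and the sketch covers only the Latin ligatures U+FB00 to U+FB04 and the fullwidth ASCII variants U+FF01 to U+FF5E as defined in the Unicode standard.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the TET API): replace one UTF-32 value
 * with its more familiar equivalent. Up to three values are written to out;
 * the number of values written is returned.
 */
static size_t
fold_char(unsigned long uv, unsigned long out[3])
{
    /* constituent characters of U+FB00..U+FB04: ff, fi, fl, ffi, ffl */
    static const unsigned long ligatures[5][3] =
    {
        { 0x66, 0x66, 0 },
        { 0x66, 0x69, 0 },
        { 0x66, 0x6C, 0 },
        { 0x66, 0x66, 0x69 },
        { 0x66, 0x66, 0x6C }
    };
    static const size_t lengths[5] = { 2, 2, 2, 3, 3 };
    size_t i, n = 0;

    assert(out != NULL);

    if (uv >= 0xFB00UL && uv <= 0xFB04UL)
    {
        /* Latin ligature: emit the constituent characters */
        size_t idx = (size_t) (uv - 0xFB00UL);

        for (i = 0; i < lengths[idx]; i++)
            out[n++] = ligatures[idx][i];
    }
    else if (uv >= 0xFF01UL && uv <= 0xFF5EUL)
    {
        /* fullwidth ASCII variant: fixed offset to U+0021..U+007E */
        out[n++] = uv - 0xFEE0UL;
    }
    else
    {
        out[n++] = uv;                  /* everything else is kept as is */
    }
    return n;
}
```

For example, the ligature U+FB01 is expanded to the two characters U+0066 (f) and U+0069 (i), and the fullwidth letter U+FF21 becomes U+0041 (A).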
The first coordinate increases to the right, the second coordinate increases upwards. All
coordinates expected or returned by TET are interpreted in this coordinate system, regardless
of their representation in the underlying PDF document. See Section 5.1,
»Simple pCOS Examples«, page 53 to see how to determine the size of a PDF page.
Acrobat 5 or above (full version only, not the free Reader) has a helpful facility for
measuring distances on a PDF page. Simply choose Window, Info to display a measurement
palette which uses points as units. Note that the coordinates displayed refer to an
origin in the top left corner of the page, and not to an origin in the lower left corner as
used in TET.
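Converting a position measured from the top left corner into the lower left coordinate system used by TET only requires the page height. The following C sketch shows the arithmetic. The type point and the function topleft_to_lowerleft are hypothetical names for this example, and the sketch assumes that the measured y value increases downwards and that both systems use points as units.

```c
#include <assert.h>

/* Hypothetical point type for this example (not part of the TET API). */
typedef struct
{
    double x;
    double y;
} point;

/*
 * Convert a point given relative to the top left corner (y increasing
 * downwards, an assumption) to a point relative to the lower left corner
 * (y increasing upwards). page_height is the height of the page in points.
 */
static point
topleft_to_lowerleft(point p, double page_height)
{
    point q;

    assert(page_height >= 0.0);

    q.x = p.x;                      /* the horizontal position is unchanged */
    q.y = page_height - p.y;        /* flip the vertical axis */
    return q;
}
```

On a page which is 842 points high, a position measured 50 points below the top edge corresponds to y = 792 in the lower left coordinate system.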
Glyph metrics. Using TET_get_char_info( ) you can retrieve font and metrics information
for the characters which are returned for a particular glyph. The following values
are available for each character in the output (see Figure 4.2 and Table 6.6, page 82):
> The uv value contains the UTF-32 Unicode value of the current character, i.e. the character
for which details are retrieved. This field will always contain UTF-32, even in
language bindings that can deal only with UTF-16 strings in their native Unicode
strings. Accessing the uv field allows applications to deal with characters outside the
BMP without having to interpret surrogate pairs. Since surrogate pairs will be reported
as two separate characters, the uv field of the leading surrogate value will contain
the actual Unicode value (larger than U+FFFF). The trailing surrogate value will be
treated as an artificial character, and will have a uv value of 0.
> The type field specifies how the character was created. There are two groups: real and
artificial characters. The group of real characters comprises normal characters (i.e.
the complete result of a single glyph) and characters which start a multi-character
sequence that corresponds to a single glyph (e.g. the first character of a ligature). The
group of artificial characters comprises the continuation of a multi-character se-
quence (e.g. the second character of a ligature), the trailing value of a surrogate pair,
and inserted separator characters. For artificial characters the position (x, y) will
specify the endpoint of the most recent real character, the width will be 0, and all
other fields except uv will be those of the most recent real character. The endpoint is
the point (x, y) plus the width added in direction alpha (in horizontal writing mode)
or plus the fontsize in direction -90˚ (in vertical writing mode).
> The unknown field will usually be false (in C and C++: 0), but has a value of true (in C
and C++: 1) if the original glyph could not be mapped to Unicode and has therefore
been replaced with the character specified in the unknownchar option. Using this
field you can distinguish real document content from replaced characters if you
specified a common character as unknownchar, such as a question mark or space.
Fig. 4.2
Glyph metrics for horizontal and vertical writing mode (figure labels: (x, y), width, alpha)
> The (x, y) fields specify the position of the glyph’s reference point, which is the lower
left corner of the glyph rectangle in horizontal writing mode, and the top center in
vertical writing mode (see Section 4.3, »Support for Chinese, Japanese, and Korean
Text«, page 36 for details on vertical writing mode). For artificial characters, which do
not correspond to any glyph on the page, the point (x, y) specifies the end point of
the most recent real character.
> The width field specifies the width of a glyph according to the corresponding font
metrics and text output parameters, such as character spacing and horizontal scal-
ing. Since these parameters control the position of the next glyph, the distance be-
tween the reference points of two adjacent glyphs may be different from width. The
width may be zero for non-spacing characters. On the other hand, the outline may
actually be wider than the glyph’s width value, e.g. for slanted text.
The width will be 0 for artificial characters.
> The angle alpha provides the direction of inline text progression, specified as the de-
viation from the standard direction. The standard direction is 0˚ for horizontal writ-
ing mode, and -90˚ for vertical writing mode (see below for more details on vertical
writing mode). Therefore, the angle alpha will be 0˚ for standard horizontal text as
well as for standard vertical text.
> The angle beta specifies any skewing which has been applied to the text, e.g. for
slanted (italicized) text. The angle will be measured against the perpendicular of
alpha. It will be 0˚ for standard upright text (for both horizontal and vertical writing
mode). If the absolute value of beta is greater than 90˚ the text will be mirrored at
the baseline.
> The fontid field contains the pCOS ID of the font used for the glyph. It can be used to
retrieve detailed font information, such as the font name, embedding status, writing
mode (horizontal/vertical), etc. Section 5.1, »Simple pCOS Examples«, page 53 shows
sample code for retrieving font details.
> The fontsize field specifies the size of the text in points. It will be normalized, and
therefore always be positive.
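The relationship between a UTF-16 surrogate pair and the UTF-32 value reported in the uv field of the leading surrogate can be illustrated with a minimal sketch. It uses only the standard UTF-16 decoding arithmetic; the function names are hypothetical and not part of the TET API. Applications which read the uv field do not need this calculation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of TET. */
static int is_leading_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

static int is_trailing_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a leading (high) and trailing (low) surrogate into the
 * UTF-32 value of the character outside the BMP. */
static uint32_t utf32_from_surrogates(uint16_t high, uint16_t low)
{
    assert(is_leading_surrogate(high) && is_trailing_surrogate(low));
    return 0x10000u
        + (((uint32_t)(high - 0xD800) << 10) | (uint32_t)(low - 0xDC00));
}
```

For example, the pair U+D835, U+DC00 corresponds to U+1D400, which is the value the description above leads one to expect in uv for the leading surrogate, while the trailing surrogate is reported with uv=0.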
End points of glyphs and words. Using the start point coordinates x, y and the width
and alpha values returned by TET_get_char_info( ) you can determine the end point of a
glyph in horizontal writing mode as follows:
xend = x + width * cos(alpha)
yend = y + width * sin(alpha)
In the common case of horizontally oriented text (i.e. alpha=0) this reduces to
xend = x + width
yend = y
For CJK text with vertical writing mode the end point calculation must be performed as
follows:
xend = x
yend = y - fontsize
In order to calculate the end position of a word (e.g. for highlighting) determine the end
position of the last character in the word.
Area of text extraction. By default, TET will extract all text from the visible page area.
Using the clippingarea option of TET_open_page( ) (see Table 6.4, page 77) you can change
this to any of the PDF page box entries (e.g. TrimBox). With the keyword unlimited all
text regardless of any page boxes can be extracted.
The area of text extraction can be specified in more detail by providing an arbitrary
number of rectangular areas in the includebox and excludebox options of TET_open_
page( ). This is useful for extracting partial page content (e.g. selected columns), or for
excluding irrelevant parts (e.g. margins, headers and footers). The final clipping area is
constructed by determining the union of all rectangles specified in the includebox op-
tion, and subtracting the union of all rectangles specified in the excludebox option. A
character is considered inside the clipping area if its reference point is inside the clip-
ping area. This means that a character could be considered inside the clipping area even
if parts of it extend beyond the clipping area.
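The construction of the clipping area can be illustrated with a small sketch which tests a glyph reference point against the union of include rectangles minus the union of exclude rectangles. The rect type and function names are hypothetical, and the treatment of points exactly on a rectangle border is an assumption. If no includebox is used, the caller would pass the selected page box as the only include rectangle.

```c
#include <assert.h>
#include <stddef.h>

/* Rectangle given by its lower left and upper right corners. */
typedef struct { double llx, lly, urx, ury; } rect;

static int in_rect(const rect *r, double x, double y)
{
    return x >= r->llx && x <= r->urx && y >= r->lly && y <= r->ury;
}

/* Is the point inside the union of the supplied rectangles? */
static int in_any(const rect *boxes, size_t n, double x, double y)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (in_rect(&boxes[i], x, y))
            return 1;
    return 0;
}

/* A character is inside the clipping area if its reference point (x, y)
 * is inside some include box and outside all exclude boxes. */
static int in_clipping_area(const rect *inc, size_t ninc,
                            const rect *exc, size_t nexc,
                            double x, double y)
{
    assert(inc != NULL || ninc == 0);
    assert(exc != NULL || nexc == 0);
    return in_any(inc, ninc, x, y) && !in_any(exc, nexc, x, y);
}
```

Only the reference point is tested, which mirrors the rule that a character may count as inside even if parts of its outline extend beyond the clipping area.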
The PDF CMaps in turn cover all of the CJK character encodings which are in use today,
such as Shift-JIS, EUC, Big-5, KSC, and many others.
Note In order to extract CJK text you must configure access to the CMap files which are shipped with
TET according to Section 0.1, »Installing the Software«, page 5.
Several groups of CJK characters will be modified (see Table 4.1, page 37, for details):
> Fullwidth ASCII variants and fullwidth symbol variants will be mapped to the corre-
sponding halfwidth characters.
> CJK compatibility forms (prerotated glyphs for vertical text) and small form variants
will be mapped to the corresponding normal variants.
CJK font names which are encoded with locale-specific encodings (e.g. Japanese font
names encoded in Shift-JIS) will also be normalized to Unicode. The wordfinder will
treat all ideographic CJK characters as individual words, while Katakana characters will
not be treated as word boundaries (a sequence of Katakana will be treated as a single
word).
CJK text with vertical writing mode. TET supports both horizontal and vertical writing
modes, and performs all metrics calculations as appropriate for the respective writing
mode. Keep the following in mind when dealing with text in vertical writing mode:
> The glyph reference point in vertical writing mode is at the top center of the glyph
box. The text position will advance downwards as determined by the font size and
character spacing, regardless of the glyph width (see Figure 4.2).
> The angle alpha will be 0˚ for standard vertical text. In other words, fonts with verti-
cal writing mode and alpha=0° will progress downwards, i.e. in direction -90˚.
> Because of the differences noted above client code must take the writing mode into
account by using the pCOS code shown in Section 5.1, »Simple pCOS Examples«, page
53 for determining the writing mode of a font. Note that not all text which appears
vertically actually uses a font with vertical writing mode.
> Prerotated glyphs for Latin characters and punctuation will be mapped to the corre-
sponding unrotated Unicode character (see Table 4.1).
Post-processing for certain Unicode values. In some cases the Unicode values which
are determined as a result of font and encoding processing will be modified by a post-
processing step, e.g. to split ligatures. Table 4.1 lists all Unicode values which are affected
by post-processing.
U+D800-U+DBFF (high),           Leading (high) surrogates and trailing (low) surrogates will be maintained in
U+DC00-U+DFFF (low) surrogates  the UTF-16 output, and the corresponding UTF-32 value will be available in the
                                uv field (see »Characters outside the BMP and surrogate handling«, page 38).
U+E000-U+F8FF                   PUA characters will be kept or replaced according to the keeppua option (see
(Private Use Area, PUA)         Section »Unmappable glyphs«).
U+F600-U+F8FF (Adobe CUS)       Will be mapped to the corresponding characters outside the CUS.
U+FB00-U+FB17                   Latin and Armenian ligatures will be decomposed into their constituent
                                characters (see note 2 below).
U+FF01-U+FF5E,                  Fullwidth ASCII and symbol variants will be mapped to the corresponding
U+FFE0-U+FFE6                   non-fullwidth characters (see note 3 below).
U+FE30-U+FE6F                   CJK compatibility forms (prerotated glyphs for vertical text) and small form
                                variants will be mapped to the corresponding normal variants.
1. Characters inserted via the wordseparator, lineseparator, and zoneseparator options are not subject to this removal.
2. Ligatures in the Arabic and Hebrew presentation forms will not be decomposed.
3. The following characters will be left unchanged: halfwidth CJK punctuation (U+FF61-U+FF64), Katakana variants (U+FF65-U+FF9F),
Hangul variants (U+FFA0-U+FFDC), and symbol variants (U+FFE8-U+FFEE).
Unmappable glyphs. There are several reasons why text in a PDF cannot reliably be
mapped to Unicode. If the document does not contain a ToUnicode CMap with Unicode
mapping information for the codes used on the page, Unicode values can be missing for
various reasons:
> Type 1 fonts may contain unknown glyph names, and TrueType, OpenType, or CID
fonts may be addressed with glyph ids without any Unicode values in the font or
PDF. By default, TET will assign the Unicode Replacement Character U+FFFD to these
characters. You can select a different replacement character (e.g. the space character
U+0020) for unmappable glyphs with the unknownchar option of TET_open_
document( ).
However, if the keeppua option of TET_open_document( ) is true, unknown characters
will be mapped to increasing values in the Private Use Area (PUA), starting at
U+F200. The same glyph name used in different fonts will end up with the same PUA
value, while TrueType or OpenType glyph ids from different fonts will have different
PUA values assigned.
> If the font or PDF provides Unicode values, these may be contained in the Private Use
Area (PUA). Since PUA characters are generally not very useful, TET will replace them
with unknownchar (by default: U+FFFD). However, if the keeppua option of TET_open_
document( ) is true, PUA values will be returned without any modification. This may
be useful if you can deal with PUA values, e.g. for a specific font, or for all fonts from
a specific font vendor.
Since not all glyphs in a document may have proper Unicode values (e.g. custom sym-
bols), TET may have to map some glyphs to unknownchar. Your code should be prepared
for this character. If you don’t care about Unicode mapping problems you can simply ig-
nore it, or use the unknownchar option of TET_open_document( ) to set a different charac-
ter as a replacement for unmappable glyphs (e.g. the space character).
In order to check for unmappable glyphs you can use the unknown field returned by
TET_get_char_info( ).
Analyzing PDF documents with the PDFlib FontReporter plugin1. In order to obtain
the information required to create appropriate Unicode mapping tables you must ana-
lyze the problematic PDF documents.
PDFlib GmbH provides a free companion product to TET which assists in this situa-
tion: PDFlib FontReporter is an Adobe Acrobat plugin for easily collecting font, encod-
ing, and glyph information. The plugin creates detailed font reports containing the ac-
tual glyphs along with the following information:
> The corresponding code: the first hex digit is given in the left-most column, the sec-
ond hex digit is given in the top row. For CID fonts the offset printed in the header
must be added to obtain the code corresponding to the glyph.
> The glyph name if present.
> The Unicode value(s) corresponding to the glyph (if Acrobat can determine them).
These pieces of information play an important role for TET’s glyph mapping controls.
Figure 4.3 shows two pages from a sample font report. Font reports created with the
FontReporter plugin can be used to analyze PDF fonts and create mapping tables for
successfully extracting the text with TET. It is highly recommended to take a look at the
corresponding font report if you want to write Unicode mapping tables or glyph name
heuristics to control text extraction with TET.
1. The PDFlib FontReporter plugin is available for free download at www.pdflib.com/products/fontreporter
Precedence rules. TET will apply the glyph mapping controls in the following order:
> Codelist and ToUnicode CMap resources will be consulted first.
> If the font has an internal ToUnicode CMap it will be considered next.
> For glyph names TET will apply an external or internal glyph name mapping rule if
one is available which matches the font and glyph name.
> Lastly, a user-supplied glyph list will be applied.
Code list resources for all font types. Code lists are similar to glyph lists except that
they specify Unicode values for individual codes instead of glyph names. Although
multiple fonts from the same foundry may use identical code assignments, codes (also
called glyph ids) are generally font-specific. As a consequence, separate code lists will be
required for individual fonts. A code list is a text file where each line describes a Unicode
mapping for a single code according to the following rules:
> Text after a percent sign ’%’ will be ignored; this can be used for comments.
> The first column contains the glyph code in decimal or hexadecimal notation. This
must be a value in the range 0-255 for simple fonts, and in the range 0-65535 for CID
fonts.
> The remainder of the line contains up to 7 Unicode values for the code. The values
can be supplied in decimal notation or (with the prefix x or 0x) in hexadecimal nota-
tion.
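The code list rules above can be sketched as a small line parser. This is only an illustration of the file format, not TET's implementation: the type and function names are hypothetical, and the sketch assumes that the x/0x prefix convention for hexadecimal values also applies to the code in the first column. It checks only the upper limit for CID fonts (65535).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_UV 7   /* up to 7 Unicode values per code */

typedef struct {
    unsigned long code;
    unsigned long uv[MAX_UV];
    int count;
} codelist_entry;

/* Parse "65" (decimal) or "x41" / "0x41" (hexadecimal). */
static int parse_number(const char *tok, unsigned long *out)
{
    char *end;
    int base = 10;

    if (tok[0] == 'x' || tok[0] == 'X') {
        tok += 1; base = 16;
    } else if (tok[0] == '0' && (tok[1] == 'x' || tok[1] == 'X')) {
        tok += 2; base = 16;
    }
    if (!isalnum((unsigned char)*tok))   /* rejects empty tokens and signs */
        return 0;
    *out = strtoul(tok, &end, base);
    return *end == '\0';
}

/* Returns 1 for a mapping line, 0 for an empty or comment-only line,
 * and -1 for a malformed line. The line buffer is modified. */
static int parse_codelist_line(char *line, codelist_entry *e)
{
    char *tok;

    assert(line != NULL && e != NULL);
    line[strcspn(line, "%")] = '\0';          /* strip the comment */
    tok = strtok(line, " \t\r\n");
    if (tok == NULL)
        return 0;
    if (!parse_number(tok, &e->code) || e->code > 65535)
        return -1;
    e->count = 0;
    while ((tok = strtok(NULL, " \t\r\n")) != NULL) {
        if (e->count == MAX_UV || !parse_number(tok, &e->uv[e->count]))
            return -1;
        e->count++;
    }
    return e->count > 0 ? 1 : -1;
}
```

A line such as "0x61 0x0047 x006C 111" would yield the code 0x61 with the three Unicode values U+0047, U+006C, and U+006F.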
By convention, code lists use the file name suffix .cl. Code lists can be configured with
the codelist resource. If no code list resource has been specified explicitly, TET will search
for a file named <mycodelist>.cl (where <mycodelist> is the resource name) in the
searchpath hierarchy (see Section 4.7, »Resource Configuration and File Searching«, page 47 for
details). In other words: if the resource name and the file name (without the .cl suffix)
are identical you don’t have to configure the resource since TET will implicitly do the
equivalent of the following call (where name is an arbitrary resource name):
TET_set_option(tet, "codelist {name name.cl}");
The following sample demonstrates the use of code lists. Consider the mismapped logo-
type glyphs in Figure 4.4 where a single glyph of the font actually represents multiple
characters, and all characters together create the company logotype. However, the
glyphs are wrongly mapped to the characters a, b, c, d, and e. In order to fix this you
could create the following code list:
% Unicode mappings for codes in the GlobeLogosOne font
Then supply the codelist with the following option to TET_open_document( ) (assuming
the code list is available in a file called GlobeLogosOne.cl and can be found via the search
path):
glyphmapping {{fontname=GlobeLogosOne codelist=GlobeLogosOne}}
ToUnicode CMap resources for all font types. PDF supports a data structure called
ToUnicode CMap which can be used to provide Unicode values for the glyphs of a font.
If this data structure is present in a PDF file TET will use it. Alternatively, a ToUnicode
CMap can be supplied in an external file. This is useful when a ToUnicode CMap in the
PDF is incomplete, contains wrong entries, or is missing. A ToUnicode CMap will take
precedence over a code list. However, code lists use an easier format than ToUnicode
CMaps, so they are the preferred format.
By convention, CMaps don’t use any file name suffix. ToUnicode CMaps can be con-
figured with the cmap resource (see Section 4.7, »Resource Configuration and File
Searching«, page 47). The contents of a cmap resource must adhere to the standard
CMap syntax.1 In order to apply a ToUnicode CMap to all fonts in the Warnock family use
the following option to TET_open_document( ):
1. See partners.adobe.com/public/developer/en/acrobat/5411.ToUnicode.pdf
Glyph list resources for simple fonts. Glyph lists (short for: glyph name lists) can be
used to provide custom Unicode values for non-standard glyph names, or override the
existing values for standard glyph names. A glyph list is a text file where each line de-
scribes a Unicode mapping for a single glyph name according to the following rules:
> Text after a percent sign ’%’ will be ignored; this can be used for comments.
> The first column contains the glyph name. Any glyph name used in a font can be
used (i.e. even the Unicode values of standard glyph names can be overridden). In or-
der to use the percent sign as part of a glyph name the sequence \% must be used
(since the percent sign serves as the comment introducer).
> The remainder of the line contains up to 7 Unicode values for the glyph name. The
values can be supplied in decimal notation or (with the prefix x or 0x) in hexadeci-
mal notation.
> Unprintable characters in glyph names can be inserted by using escape sequences
for text files (see Section 4.7, »Resource Configuration and File Searching«, page 47).
By convention, glyph lists use the file name suffix .gl. Glyph lists can be configured with
the glyphlist resource. If no glyph list resource has been specified explicitly, TET will
search for a file named <myglyphlist>.gl (where <myglyphlist> is the resource name) in the
searchpath hierarchy (see Section 4.7, »Resource Configuration and File Searching«, page
47, for details). In other words: if the resource name and the file name (without the .gl
suffix) are identical you don’t have to configure the resource since TET will implicitly do
the equivalent of the following call (where name is an arbitrary resource name):
TET_set_option(tet, "glyphlist {name name.gl}");
Due to the precedence rules for glyph mapping glyph lists will not be consulted if the
font contains a ToUnicode CMap. The following sample demonstrates the use of glyph
lists:
% Unicode values for glyph names used in TeX documents
precedesequal 0x227C
similarequal 0x2243
negationslash 0x2044
union 0x222A
prime 0x2032
In order to apply a glyph list to all font names starting with CMSY use the following op-
tion for TET_open_document( ):
glyphmapping {{fontname=CMSY* glyphlist=tarski}}
Rules for interpreting numerical glyph names in simple fonts. Sometimes PDF docu-
ments contain glyphs with names which are not taken from some predefined list, but
are generated algorithmically. This can be a »feature« of the application generating the
PDF, or may be caused by a printer driver which converts fonts to another format: some-
times the original glyph names get lost in the process, and are replaced with schematic
names such as G00, G01, G02, etc. TET contains builtin glyph name rules for processing
numerical glyph names created by various common applications and drivers. Since the
same glyph names may be created for different encodings you can provide the
For example, if you determined (e.g. using PDFlib FontReporter) that the glyphs in the
fonts T1, T2, T3, etc. are named c00, c01, c02, ..., cFF where each glyph name corresponds to
the WinAnsi character at the respective hexadecimal position (00, ..., FF) use the follow-
ing option for TET_open_document( ):
glyphmapping {{fontname=T* glyphrule={prefix=c base=hex encoding=winansi} }}
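What such a glyph rule expresses can be sketched as follows: the numerical part of a glyph name is interpreted in the given base to obtain a code, which is then looked up in the specified encoding. The function below is a hypothetical illustration of the first step only; it is not TET's implementation, and the lookup in the WinAnsi encoding is not shown.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Derive a code in the range 0-255 from a schematic glyph name such as
 * "c41" (prefix "c", base 16). Returns -1 if the name does not match. */
static long code_from_glyph_name(const char *name, const char *prefix, int base)
{
    size_t plen;
    const char *p;
    long code;

    assert(name != NULL && prefix != NULL);
    assert(base == 10 || base == 16);

    plen = strlen(prefix);
    if (strncmp(name, prefix, plen) != 0)
        return -1;                    /* prefix does not match */
    name += plen;
    if (*name == '\0' || strlen(name) > 3)
        return -1;                    /* no digits, or too many for 0-255 */
    for (p = name; *p != '\0'; p++) {
        int ok = (base == 16) ? isxdigit((unsigned char)*p)
                              : isdigit((unsigned char)*p);
        if (!ok)
            return -1;
    }
    code = strtol(name, NULL, base);
    return code <= 255 ? code : -1;
}
```

With the rule from the example above, the glyph name c41 results in code 0x41, which is the position of the character A in WinAnsi encoding.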
External font files and system fonts. If a PDF does not contain sufficient information
for Unicode mapping and the font is not embedded, you can configure additional font
data which TET will use to derive Unicode mappings. Font data may come from a TrueType
or OpenType font file on disk, which can be configured with the fontoutline resource
category. As an alternative on Mac and Windows systems, TET can access fonts which
are installed on the host operating system. Access to these host fonts can be disabled
with the usehostfonts option in TET_open_document( ).
In order to configure a disk file for the WarnockPro font use the following call:
TET_set_option(tet, "fontoutline {WarnockPro=WarnockPro.otf}");
See Section 4.7, »Resource Configuration and File Searching«, page 47 for more details
on configuring external font files.
These operations will be discussed in more detail below, as well as options which pro-
vide some control over content processing.
By default, all content processing operations will be disabled for granularity=glyph, and
enabled for all other granularity settings. However, more fine-grained control is possible
via separate options (see below).
Word boundary detection. The wordfinder, which is enabled for all granularity modes
except glyph, creates logical words from multiple glyphs which may be scattered all over
the page in no particular order. Word boundaries are identified by two criteria:
> A sophisticated algorithm analyzes the geometric relationship among glyphs to find
character groups which together form a word. The algorithm takes into account a va-
riety of properties and special cases in order to accurately identify words even in
complicated layouts and for arbitrary text ordering on the page.
> Some characters, such as space and punctuation characters (e.g. colon, comma, full
stop, parentheses) will be considered a word boundary, regardless of their width and
position. Note that ideographic CJK characters will be considered word boundaries,
while Katakana characters will not be treated as word boundaries. If the punctuationbreaks
option in TET_open_page( ) is set to false, the wordfinder will no longer treat
punctuation characters as word boundaries:
contentanalysis={punctuationbreaks=false}
Ignoring punctuation characters for word boundary detection can, for example, be use-
ful for maintaining Web URLs where period and slash characters are usually considered
part of a word (see Figure 4.5).
Note Currently there is no dedicated support for right-to-left scripts and bidirectional text. Although
Unicode values and glyph metrics can be retrieved, the wordfinder does not apply any special
handling for right-to-left text.
Dehyphenation. Hyphenated words at the end of a line are usually not desired for ap-
plications which process the extracted text on a logical level. TET will therefore dehy-
phenate, or recombine the parts of a hyphenated word. More precisely, if a word at the
end of a line ends with a hyphen character and the first word on the next line starts
with a lowercase character, the hyphen will be removed and the first part of the word
Fig. 4.5
The default setting punctuationbreaks=true
will separate the parts of URLs (top), while
punctuationbreaks=false will keep the parts
together (bottom).
Note Hyphenated words at the end of a zone will not be identified, and consequently there won’t be
any dehyphenation (i.e. the hyphen will remain part of the text).
Shadow and fake bold text removal. PDF documents sometimes include redundant
text which does not contribute to the semantics of a page, but creates certain visual ef-
fects only. Shadow text effects are usually achieved by placing two or more copies of the
actual text on top of each other, where a small displacement is applied. Applying
opaque coloring to each layer of text provides a visual appearance where the majority
of the text in lower layers is obscured, while the visible portions create a shadow effect.
Similarly, word processing applications sometimes support a feature for creating ar-
tificial bold text. In order to create bold text appearance even if a bold font is not avail-
able, the text is placed repeatedly on the page in the same color. Using a very small dis-
placement the appearance of bold text is simulated.
Shadow simulation, artificial bold text, and similar visual artifacts create severe
problems when reusing the extracted text since redundant text contents which contrib-
ute only to the visual appearance will be processed although the text does not contrib-
ute to the page contents.
If the wordfinder is enabled, TET will identify and remove such redundant visual ar-
tifacts by default. This process can be disabled with the shadowdetect suboption for the
contentanalysis option of TET_open_page( ).
Zones and reading order. Zones can be thought of as text columns, although they may
sometimes cover other areas on the page, such as headers and footers, marginal notes,
or pagination artifacts. Conceptually, a zone is an »island of text«, consisting of text
lines which are placed close to each other, and surrounded by white space which sepa-
rates it from other zones. Technically, zones are a combination of logically connected
rectangular strips holding at most one line of text. A zone may contain multiple para-
graphs.
TET will arrange the zones identified on a page so that their ordering reflects the log-
ical (reading) order of the text. This process may not work perfectly for complex layouts.
Table 4.2 Resource categories (all file names must be specified in UTF-8)
category format explanation
hostfont key=value Name of a host font resource (key is the PDF font name; value is the UTF-8 encoded host
font name) to be used for an unembedded font
fontoutline key=value Font and file name of a TrueType or OpenType font to be used for an unembedded font
searchpath value Relative or absolute path name of directories containing data files
The UPR file format. UPR files are text files with a very simple structure that can easily
be written in a text editor or generated automatically. To start with, let’s take a look at
some syntactical issues:
> Lines can have a maximum of 255 characters.
> A backslash ’\’ escapes newline characters. This may be used to extend lines.
> An isolated period character ’ . ’ serves as a section terminator.
> Comment lines may be introduced with a percent ’%’ character, and terminated by
the end of the line.
> Whitespace is ignored everywhere except in resource names and file names.
UPR files consist of the following components:
> A section listing all resource categories described in the file. Each line describes one
resource category. The list is terminated by a line with a single period character.
> A section for each of the resource categories listed at the beginning of the file. Each
section starts with a line showing the resource category, followed by an arbitrary
number of lines describing available resources. The list is terminated by a line with a
single period character. Each resource data line contains the name of the resource
(equal signs have to be quoted). If the resource requires a file name, this name has to
be added after an equal sign. The searchpath (see below) will be applied when TET
searches for files listed in resource entries.
File searching and the searchpath resource category. In addition to relative or abso-
lute path names you can supply file names without any path specification to TET. The
searchpath resource category can be used to specify a list of path names for directories
containing the required data files. When TET must open a file it will first use the file
name exactly as supplied, and try to open the file. If this attempt fails, TET will try to
open the file in the directories specified in the searchpath resource category one after
another until it succeeds. Multiple searchpath entries can be accumulated, and will be
searched in reverse order (paths set at a later point in time will be searched before earlier
ones). In order to disable the search you can use a fully specified path name in the TET
functions.
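The search order described above can be sketched in C as follows. This is a simplified illustration, not TET's implementation: the function names are hypothetical, the directory list is passed in the order in which the searchpath entries were set, and a forward slash is assumed as directory separator.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the candidate path for a given attempt: attempt 0 is the file name
 * exactly as supplied, attempt k (k >= 1) uses the k-th most recently set
 * searchpath directory. Returns 0 if there is no such candidate or if the
 * buffer is too small. */
static int candidate_path(const char *name, const char **dirs, size_t ndirs,
                          size_t attempt, char *buf, size_t size)
{
    int n;

    assert(name != NULL && buf != NULL && size > 0);
    if (strlen(name) == 0 || attempt > ndirs)
        return 0;
    if (attempt == 0)
        n = snprintf(buf, size, "%s", name);
    else
        n = snprintf(buf, size, "%s/%s", dirs[ndirs - attempt], name);
    return n >= 0 && (size_t)n < size;
}

/* Try the name as supplied first, then the searchpath directories in
 * reverse order, until a file can be opened. */
static FILE *open_with_searchpath(const char *name, const char **dirs,
                                  size_t ndirs)
{
    char path[1024];
    size_t attempt;

    for (attempt = 0; attempt <= ndirs; attempt++) {
        FILE *fp;

        if (!candidate_path(name, dirs, ndirs, attempt, path, sizeof path))
            continue;
        fp = fopen(path, "rb");
        if (fp != NULL)
            return fp;
    }
    return NULL;
}
```

Since the name is tried exactly as supplied before any directory is consulted, a fully specified path name never depends on the searchpath entries if the file exists.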
On Windows TET will initialize the searchpath resource category with a value read
from the following registry key:
HKLM\SOFTWARE\PDFlib\TET\2.2\searchpath
This registry entry may contain a list of path names separated by a semicolon ’;’ char-
acter. The Windows installer will initialize the searchpath registry entry with the follow-
ing directory names (or similar if you installed TET in a custom directory):
C:\Program Files\PDFlib\TET 2.2\resource
C:\Program Files\PDFlib\TET 2.2\resource\cmap
On IBM iSeries the searchpath resource category will be initialized with the following
values:
/tet/2.2/resource
/tet/2.2/resource/cmap
Searching for the UPR resource file. If resource files are to be used you can specify
them via calls to TET_set_option( ) (see below) or in a UPR resource file. TET reads this file
automatically when the first resource is requested. The detailed process is as follows:
The value of this key (which will be created with the value <installdir>/tet.upr by the
TET installer, but can also be created by other means) will be taken as the name of the
resource file to be used. If this file cannot be read an exception will be thrown.
> The client can force TET to read a resource file at runtime by explicitly setting the
resourcefile option:
TET_set_option(tet, "resourcefile=/path/to/tet.upr");
This call can be repeated arbitrarily often; the resource entries will be accumulated.
Configuring resources at runtime. In addition to using a UPR file for the configuration,
it is also possible to directly configure individual resources at runtime via TET_set_
option( ). This function takes a resource category name and pairs of corresponding re-
source names and values as it would appear in the respective section of this category in
a UPR resource file, for example:
TET_set_option(tet, "glyphlist={myglyphnames=/usr/local/glyphnames.gl}");
Multiple resource names can be configured in a single option list for a resource category
option (but the same resource category option cannot be repeated in a single call to TET_
set_option( )). Alternatively, multiple calls can be used to accumulate resource settings.
Escape sequences for text files. Special character sequences can be used to include un-
printable characters in text files. All sequences start with a backslash ’\’ character:
> \x introduces a sequence of two hexadecimal digits (0-9, A-F, a-f), e.g. \x0D
> \nnn denotes a sequence of three octal digits (0-7), e.g. \015. The sequence \000 will be
ignored.
> The sequence \\ denotes a single backslash.
> A backslash at the end of a line will cancel the end-of-line character.
Escape sequences are supported in all text files except UPR configuration files and
CMap files.
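The escape sequences listed above can be illustrated with a small decoder. It is a sketch of the rules as stated here, not TET's parser: the function name is hypothetical, and treating any other character after a backslash as an error is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

static int is_octal(char c)
{
    return c >= '0' && c <= '7';
}

/* Decode escape sequences from in to out; out must provide room for
 * strlen(in) + 1 bytes. Returns the number of decoded bytes, or -1 for
 * a malformed sequence. */
static long decode_escapes(const char *in, char *out)
{
    long n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in != '\\') {
            out[n++] = *in++;
            continue;
        }
        in++;                                     /* skip the backslash */
        if (in[0] == 'x' && isxdigit((unsigned char)in[1])
                && isxdigit((unsigned char)in[2])) {
            out[n++] = (char)(hex_value((unsigned char)in[1]) * 16
                              + hex_value((unsigned char)in[2]));
            in += 3;
        } else if (is_octal(in[0]) && is_octal(in[1]) && is_octal(in[2])) {
            int v = (in[0] - '0') * 64 + (in[1] - '0') * 8 + (in[2] - '0');
            if (v > 255)
                return -1;
            if (v != 0)                           /* \000 will be ignored */
                out[n++] = (char)v;
            in += 3;
        } else if (in[0] == '\\') {
            out[n++] = '\\';
            in++;
        } else if (in[0] == '\n') {
            in++;                                 /* cancel the end of line */
        } else if (in[0] == '\r' && in[1] == '\n') {
            in += 2;
        } else {
            return -1;
        }
    }
    out[n] = '\0';
    return n;
}
```

For example, both \x0D and \015 decode to a single carriage return character.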
Optimizing performance. In certain situations, particularly for search engines, text ex-
traction speed may be crucial, and may play a more important role than optimal out-
put. The default settings of TET have been selected to achieve the best possible output,
but can be adjusted to speed up processing. Some tips for choosing options in TET_open_
page( ) to maximize text extraction throughput:
> contentanalysis={merge=0}
This will disable the expensive strip and zone merging step, and reduce processing
times for typical files to about 60% of those required with default settings. However, documents
where the contents are scattered across the pages in arbitrary order may result in
some text which is not extracted in logical order.
> contentanalysis={dehyphenate=false}
This will disable the combination of the parts of hyphenated words. If dehyphen-
ation is not required this option can slightly reduce processing times.
> contentanalysis={shadowdetect=false}
This will disable detection of redundant shadow and fake bold text, which can also
reduce processing times.
Multiple suboptions of the contentanalysis option must be combined into a single list,
for example:
contentanalysis={merge=0 shadowdetect=false}
Words vs. line layout vs. reflowable text. Different applications will prefer different
kinds of output (hyphenated words will always be dehyphenated with these settings):
> Individual words (ignore layout): a search engine may not be interested in any lay-
out-related aspects, but only the words comprising the text. In this situation use
granularity=word in TET_open_page( ) to retrieve one word per call to TET_get_text( ).
> Keep line layout: use granularity=page in TET_open_page( ) for extracting the full text
contents of a page in a single call to TET_get_text( ). Text lines will be separated with a
linefeed character to retain the existing line structure.
> Reflowable text: in order to avoid line breaks and facilitate reflowing of the extracted
text use contentanalysis={lineseparator=U+0020} and granularity=page in TET_open_
page( ). The full page contents can be fetched with a single call to TET_get_text( ).
Zones will be separated with a linefeed character, and a space character will be insert-
ed between the lines in a zone.
Which parts of a document? The text contained in a PDF document may be part of
various data structures:
> The actual page contents can be extracted with TET_get_text( ).
> Document info fields, bookmark text, form field contents, XMP metadata, and other
hypertext elements can be retrieved with TET_pcos_get_string( ) and
TET_pcos_get_stream( ) (see Section 5.1, »Simple pCOS Examples«, page 53).
> PDF documents may contain file attachments which are themselves PDF documents.
In order to extract the text of PDF file attachments you must first fetch the attach-
ments with TET_pcos_get_stream( ), and then feed the attachment to TET_open_docu-
ment_mem( ) (request sample code from PDFlib GmbH support for details).
> TET cannot extract text from raster images or vectorized text.
Unknown characters. TET may be unable to determine the appropriate Unicode mapping
for one or more characters, and will represent each of them with the Unicode
replacement character U+FFFD. If your application is not concerned about unmappable
characters you can simply discard all occurrences of this character. Applications
which require more fine-grained results could take the corresponding font into
account, and use it to decide on processing of unmappable characters.
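The two simple strategies mentioned above (detecting and discarding the replacement character) can be sketched in plain Java on a string that has already been extracted. The class and method names are illustrative assumptions; no TET calls are involved.

```java
// Hypothetical post-processing helper for already extracted text.
final class UnknownChars {
    /** Unicode replacement character used for unmappable glyphs. */
    static final char REPLACEMENT = '\uFFFD';

    /** Returns true if the text contains at least one unmappable character. */
    static boolean hasUnknown(String text) {
        return text.indexOf(REPLACEMENT) >= 0;
    }

    /** Returns the text with all occurrences of U+FFFD removed. */
    static String discardUnknown(String text) {
        return text.replace(String.valueOf(REPLACEMENT), "");
    }
}
```

An application with strict fidelity requirements would use hasUnknown( ) to stop processing, while a search indexer might simply call discardUnknown( ).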
Legal documents. When dealing with legal documents there is usually a zero tolerance
for wrong Unicode mappings since they might alter the content or interpretation of a
document. In many cases the text position is not required, and the text must be extract-
ed word by word. Recommendations:
> Use the granularity=word option in TET_open_page( ).
> Use the password=xxx option in TET_open_document( ) if you must process docu-
ments which require a password for opening, or if content extraction is not allowed
in the permission settings.
> For absolute text fidelity: stop processing as soon as the unknown field in the charac-
ter info structure returned by TET_get_char_info( ) is 1, or if the Unicode replacement
character U+FFFD is part of the string returned by TET_get_text( ). Do not set the
unknownchar option to any common character since you may be unable to distin-
guish it from correctly mapped characters without checking the unknown field. If
Processing documents with PDFlib+PDI. When using PDFlib+PDI to process PDF docu-
ments on a per-page basis you can integrate TET for controlling the splitting or merging
process. For example, you could split a PDF document based on the contents of a page. If
you have control over the creation process you can insert separator pages with suitable
processing instructions in the text. An example for splitting a PDF document with
PDFlib+PDI based on the page contents can be found in the TET distribution.
Legacy PDF documents with missing Unicode values. In some situations PDF docu-
ments created by legacy applications must be processed where the PDF may not contain
enough information for proper Unicode mapping. Using the default settings TET may
be unable to extract some or all of the text contents. Recommendations:
> Start by extracting the text with default settings, and analyze the results. Identify
the fonts which do not provide enough information for proper Unicode mapping.
> Write custom encoding tables and glyph name lists to fix problematic fonts. Use the
PDFlib FontReporter plugin for analyzing the fonts and preparing Unicode mapping
tables.
> Configure the custom mapping tables and extract the text again, using a larger num-
ber of documents. If there are still unmappable glyphs or fonts adjust the mapping
tables as appropriate.
> If you have a large number of documents with unmappable fonts PDFlib GmbH may
be able to assist you in creating the required mapping tables.
Convert PDF documents to another format. If you want to import the page contents of
PDF documents into your application, while retaining as much information as possible
you’ll need precise character metrics. Recommendations:
> Use TET_get_char_info( ) to retrieve precise character metrics and font names. Even if
you use the uv field to retrieve the Unicode values of individual characters, you must
also call TET_get_text( ) since it fills the char_info structure.
> Use granularity=glyph or word in TET_open_page( ), depending on what is better suited
to your application.
Corporate fonts with custom-encoded logos. In many cases corporate fonts contain-
ing custom logos have missing or wrong Unicode mapping information for the logos. If
you have a large number of PDF documents containing such fonts it is recommended to
create a custom mapping table with proper Unicode values.
Start by creating a font report (see »Analyzing PDF documents with the PDFlib Font-
Reporter plugin«, page 39) for a PDF containing the font, and locate mismapped glyphs
in the font report. Depending on the font type you can use any of the available configu-
ration tables to provide the missing Unicode mappings. See »Code list resources for all
font types«, page 40, for a detailed example of a code list for a logotype font.
fonts[...]/name string name of a font; the number of entries can be retrieved with length:fonts
Number of pages. The total number of pages in a document can be queried as follows:
pagecount = p.pcos_get_number(doc, "length:pages");
Document info fields. Document information fields can be retrieved with the follow-
ing code sequence:
objtype = p.pcos_get_string(doc, "type:/Info/Title");
if (objtype.equals("string"))
{
/* Document info key found */
title = p.pcos_get_string(doc, "/Info/Title");
}
Page size. Although the MediaBox, CropBox, and Rotate entries of a page can directly be
obtained via pCOS, they must be evaluated in combination in order to find the actual
size of a page. Determining the page size is much easier with the width and height keys
Listing all fonts in a document. The following sequence creates a list of all fonts in a
document along with their embedding status:
fontcount = p.pcos_get_number(doc, "length:fonts");
Writing mode. Using pCOS and the fontid value provided in the char_info structure
you can easily check whether a font uses vertical writing mode:
if (p.pcos_get_number(doc, "fonts[" + ci->fontid + "]/vertical"))
{
/* font uses vertical writing mode */
}
Encryption status. You can query the pcosmode pseudo object to determine the pCOS
mode for the document:
if (p.pcos_get_number(doc, "pcosmode") == 2)
{
/* full pCOS mode */
}
Text extraction status. By default, content extraction is possible with TET if the docu-
ment can successfully be opened. However, with infomode=true this is not necessarily
true. Depending on the nocopy permission setting, content extraction may or may not
be allowed in restricted pCOS mode (content extraction is always allowed in full pCOS
mode). The following expression can be used to check whether extraction is allowed:
if ((int) p.pcos_get_number(doc, "pcosmode") == 2 ||
((int) p.pcos_get_number(doc, "pcosmode") == 1 &&
(int) p.pcos_get_number(doc, "encrypt/nocopy") == 0))
{
/* text extraction allowed */
}
XMP meta data. A stream containing XMP meta data can be retrieved with the follow-
ing code sequence:
objtype = p.pcos_get_string(doc, "type:/Root/Metadata");
if (objtype.equals("stream"))
{
/* XMP meta data found */
metadata = p.pcos_get_stream(doc, "", "/Root/Metadata");
}
Numbers. Objects of type integer and real can be queried with TET_pcos_get_number( ).
pCOS doesn’t make any distinction between integer and floating point numbers.
Names and strings. Objects of type name and string can be queried with TET_pcos_get_
string( ). Name objects in PDF may contain non-ASCII characters and the # syntax (dec-
oration) to include certain special characters. pCOS deals with PDF names as follows:
> Name objects will be undecorated (i.e. the # syntax will be resolved) before they are
returned.
> Name objects will be returned as Unicode strings in most language bindings. How-
ever, in the C and C++ language bindings they will be returned as UTF-8.
Since the majority of strings in PDF are text strings TET_pcos_get_string( ) will treat them
as such. However, in rare situations strings in PDF are used to carry binary information.
In this case strings should be retrieved with the function TET_pcos_get_stream( ) which
preserves binary strings and does not modify the contents in any way.
Booleans. Objects of type boolean can be queried with TET_pcos_get_number( ) and will
be returned as 1 (true) or 0 (false). TET_pcos_get_string( ) can also be used to query bool-
ean objects; in this case they will be returned as one of the strings true and false.
Note pCOS does not support the following stream filters: CCITTFax, JBIG2, and JPX.
If there is at least one unsupported filter in a stream’s filter chain, the object type will be
reported as fstream (filtered stream). When retrieving the contents of an fstream object,
TET_pcos_get_stream( ) will remove the supported filters at the beginning of a filter
chain, but will keep the remaining unsupported filters and return the stream data with
the remaining unsupported filters still applied. The list of applied filters can be queried
from the stream dictionary, and the filtered stream contents can be retrieved with TET_
pcos_get_stream( ). Note that the names of supported filters will not be removed when
querying the names of the stream’s filters, so the client should ignore the names of sup-
ported filters.
Arrays. Arrays are one-dimensional collections of any number of objects, where each
object may have arbitrary type.
The contents of an array can be enumerated by querying the number N of elements
it contains (using the length prefix in front of the array’s path, see Table 5.2), and then it-
erating over all elements from index 0 to N-1.
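The enumeration scheme can be sketched as follows. To keep the example self-contained, the number query is passed in as a function; in a real application it would presumably wrap the pcos_get_number( ) call shown in the examples of this chapter. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper: builds the pCOS paths of all elements of an array.
final class PcosArrayPaths {
    /**
     * Queries "length:<arrayPath>" to obtain N, then returns the paths
     * arrayPath[0] ... arrayPath[N-1].
     * getNumber stands in for a pCOS number query (an assumption of this sketch).
     */
    static List<String> elementPaths(String arrayPath,
                                     ToDoubleFunction<String> getNumber) {
        int n = (int) getNumber.applyAsDouble("length:" + arrayPath);
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            paths.add(arrayPath + "[" + i + "]");
        }
        return paths;
    }
}
```

Each returned path can then be extended with a key, e.g. by appending /name to an element of the fonts array.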
pCOS IDs for dictionaries and arrays. Unlike PDF object IDs, pCOS IDs are guaranteed
to provide a unique identifier for an element addressed via a pCOS path (since arrays
and dictionaries can be nested an object can have the same PDF object ID as its parent
array or dictionary). pCOS IDs can be retrieved with the pcosid prefix in front of the dic-
tionary’s or array’s path (see Table 5.2).
The pCOS ID can therefore be used as a shortcut for repeatedly accessing elements
without the need for explicit path addressing. For example, this will improve perfor-
mance when looping over all elements of a large array. Use the objects[] pseudo object to
retrieve the contents of an element identified by a particular ID.
When a path component contains any of the characters /, [, ], or #, these must be ex-
pressed by a number sign # followed by a two-digit hexadecimal number.
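A minimal sketch of this escaping rule in Java is shown below. The helper name is hypothetical, and the use of uppercase hexadecimal digits is an assumption of the sketch, since the text only requires a two-digit hexadecimal number.

```java
// Hypothetical helper: escapes one component of a pCOS path.
final class PcosPathEscaper {
    /** Replaces /, [, ], and # by '#' followed by the two-digit hex code. */
    static String escapeComponent(String component) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < component.length(); i++) {
            char c = component.charAt(i);
            if (c == '/' || c == '[' || c == ']' || c == '#') {
                out.append('#').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The function must be applied to a single path component only; applying it to a complete path would also escape the separators.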
Path prefixes. Prefixes can be used to query various attributes of an object (as opposed
to its actual value). Table 5.2 lists all supported prefixes.
The length prefix and content enumeration via indices are only applicable to plain
PDF objects and pseudo objects of type array, but not any other pseudo objects. The
pcosid prefix cannot be applied to pseudo objects. The type prefix is supported for all
pseudo objects.
Universal pseudo objects. Universal pseudo objects are always available, regardless of
encryption and passwords. This assumes that a valid document handle is available,
which may require setting the option requiredmode suitably when opening the docu-
ment. Table 5.3 lists all universal pseudo objects.
encrypt (Dict) Dictionary with keys describing the encryption status of the document:
length (Number) Length of the encryption key in bits
algorithm (Number)
description (String) Encryption algorithm number or description:
-1 Unknown encryption
0 No encryption
1 40-bit RC4 (Acrobat 2-4)
2 128-bit RC4 (Acrobat 5)
3 128-bit RC4 (Acrobat 6)
4 128-bit AES (Acrobat 7)
5 Public key on top of 128-bit RC4 (Acrobat 5) (unsupported)
6 Public key on top of 128-bit AES (Acrobat 7) (unsupported)
7 Adobe Policy Server (Acrobat 7) (unsupported)
master (Boolean) True if the PDF requires a master password to change security settings
(permissions, user or master password), false otherwise
user (Boolean) True if the PDF requires a user password for opening, false otherwise
noaccessible, noannots, noassemble, nocopy, noforms, nohiresprint, nomodify, noprint
(Boolean) True if the respective access protection is set, false otherwise
plainmetadata
(Boolean) True if the PDF contains unencrypted meta data, false otherwise
xinfo (Boolean) True if and only if security settings were ignored when opening the PDF document; the client
must take care of honoring the document author’s intentions.
For TET the value will be true, and text extraction will be allowed, if all of the following conditions
are true: xinfo mode has been enabled (only possible under a special license agreement), the document
has a master password but this has not been supplied, the user password (if any) has been supplied, and
text extraction is not permitted.
pcosinterface (Number) Interface number of the underlying pCOS implementation. This specification describes inter-
face number 3. The following table details which product versions implement various pCOS interface
numbers:
1 TET 2.0, 2.1
2 pCOS 1.0
3 PDFlib+PDI 7, PPS 7, TET 2.2
pdfversion (Number) PDF version number multiplied by 10, e.g. 16 for PDF 1.6
version (String) Full library version string in the format <major>.<minor>.<revision>, possibly suffixed with addi-
tional qualifiers such as beta, rc, etc.
Pseudo objects for PDF objects, pages, and interactive elements. Table 5.4 lists pseudo
objects which can be used for retrieving object or page information, or serve as short-
cuts for various interactive elements.
Table 5.4 Pseudo objects for PDF objects, pages, and interactive elements
object name explanation
articles (Array of dicts) Array containing the article thread dictionaries for the document. The array will have
length 0 if the document does not contain any article threads. In addition to the standard PDF keys pCOS
supports the following pseudo key for dictionaries in the articles array:
beads (Array of dicts) Bead directory with the standard PDF keys, plus the following:
destpage (Number) Number of the target page (first page is 1)
bookmarks (Array of dicts) Array containing the bookmark (outlines) dictionaries for the document. In addition to
the standard PDF keys pCOS supports the following pseudo keys for dictionaries in the bookmarks array:
level (Number) Indentation level in the bookmark hierarchy
destpage (Number) Number of the target page (first page is 1) if the bookmark points to a page in the
same document, -1 otherwise.
fields (Array of dicts) Array containing the form fields dictionaries for the document. In addition to the stan-
dard PDF keys in the field dictionary and the entries in the associated Widget annotation dictionary pCOS
supports the following pseudo keys for dictionaries in the fields array:
level (Number) Level in the field hierarchy (determined by ».« as separator)
fullname (String) Complete name of the form field. The same naming conventions as in Acrobat 7 will
be applied.
names (Dict) A dictionary where each entry provides simple access to a name tree. The following name trees are
supported: AP, AlternatePresentations, Dests, EmbeddedFiles, IDS, JavaScript, Pages, Renditions,
Templates, URLS.
Each name tree can be accessed by using the name as a key to retrieve the corresponding value, e.g.:
names/Dests[0].key retrieves the name of a destination
names/Dests[0].val retrieves the corresponding destination dictionary
In addition to standard PDF dictionary entries the following pseudo keys for dictionaries in the Dests
names tree are supported:
destpage (number) Number of the target page (first page is 1) if the destination points to a page in the
same document, -1 otherwise.
In order to retrieve other name tree entries these must be queried directly via /Root/Names/Dests etc.
since they are not present in the name tree pseudo objects.
objects (Array) Address an element for which a pCOS ID has been retrieved earlier using the pcosid prefix. The ID
must be supplied as array index in decimal form; as a result, the PDF object with the supplied ID will be
addressed. The length prefix cannot be used with this array.
pages (Array of dicts) Each array element addresses a page of the document. Indexing it with the decimal repre-
sentation of the page number minus one addresses that page (the first page has index 0). Using the
length prefix the number of pages in the document can be determined. A page object addressed this way
will incorporate all attributes which are inherited via the /Pages tree. The entries /MediaBox and
/Rotate are guaranteed to be present. In addition to standard PDF dictionary entries the following pseudo
entries are available for each page:
colorspaces, extgstates, fonts, images, patterns, properties, shadings, templates
(Arrays of dicts) High-level page resources according to Table 5.5.
annots (Array of dicts) In addition to the standard PDF keys pCOS supports the following pseudo key
for dictionaries in the annots array:
destpage (Number; only for Subtype=Link and if a Dest entry is present) Number of the tar-
get page (first page is 1)
blocks (Array of dicts) Shorthand for pages[ ]/PieceInfo/PDFlib/Private/Blocks[ ], i.e. the
page’s block dictionary. In addition to the existing PDF keys pCOS supports the following
pseudo key for dictionaries in the blocks array:
rect (Rectangle) Similar to Rect, except that it takes into account any relevant
CropBox/MediaBox and Rotate entries and normalizes coordinate ordering.
height (Number) Height of the page. The MediaBox or the CropBox (if present) will be used to
determine the height. Rotate entries will also be applied.
isempty (Boolean) True if the page is empty, and false if the page is not empty
label (String) The page label of the page (including any prefix which may be present). Labels will be
displayed as in Acrobat. If no label is present (or the PageLabel dictionary is malformed), the
string will contain the decimal page number. Roman numbers will be created in Acrobat’s
style (e.g. VL), not in classical style which is different (e.g. XLV). If /Root/PageLabels doesn’t
exist, the document doesn’t contain any page labels.
width (Number) Width of the page (same rules as for height)
The following entries will be inherited: CropBox, MediaBox, Resources, Rotate.
pdfa (String) PDF/A conformance level of the document (e.g. PDF/A-1a:2005) or none
pdfx (String) PDF/X conformance level of the document (e.g. PDF/X-1a:2001) or none
tagged (Boolean) True if the PDF document is tagged, false otherwise
The following list details the two categories using the images resource type as an exam-
ple; the same scheme applies to all resource types listed in Table 5.5:
> A list of image resources in the document is available in images[ ].
> A list of image resources on each page is available in pages[ ]/images[ ].
Table 5.5 Pseudo objects for resource retrieval; each pseudo object P in this table creates two arrays with high-level
resources P[ ] and pages[ ]/P[ ].
object name explanation
colorspaces (Array of dicts) Array containing dictionaries for all color spaces on the page or in the document. In addi-
tion to the standard PDF keys in color space and ICC profile stream dictionaries the following pseudo keys
are supported:
alternateid
(Integer; only for name=Separation and DeviceN) Index of the underlying alternate color
space in the colorspaces[] pseudo object.
alternateonly
(Boolean) If true, the colorspace is only used as the alternate color space for (one or more)
Separation or DeviceN color spaces, but not directly.
baseid (Integer; only for name=Indexed) Index of the underlying base color space in the
colorspaces[] pseudo object.
colorantname
(Name; only for name=Separation) Name of the colorant
colorantnames
(Array of names; only for name=DeviceN) Names of the colorants
components
(Integer) Number of components of the color space
name (String) Name of the color space
csarray (Array; not for name=DeviceGray/RGB/CMYK) Array describing the underlying native color
space.
High-level color space resources will include all color spaces which are referenced from any type of object,
including the color spaces which do not require any native PDF resources (i.e. DeviceGray, DeviceRGB,
and DeviceCMYK).
extgstates (Array of dicts) Array containing the dictionaries for all extended graphics states (ExtGstates) on the page
or in the document
fonts (Array of dicts) Array containing dictionaries for all fonts on the page or in the document. In addition to
the standard PDF keys in font dictionaries, the following pseudo keys are supported:
name (String) PDF name of the font without any subset prefix. Non-ASCII CJK font names will be
converted to Unicode.
embedded (Boolean) Embedding status of the font
type (String) Font type
vertical (Boolean) true for fonts with vertical writing mode, false otherwise
images (Array of dicts) Array containing dictionaries for all images on the page or in the document. High-level
image resources will include all image XObjects and inline images, while native PDF resources contain
only image XObjects.
In addition to the standard PDF keys the following pseudo keys are supported:
bpc (Integer) The number of bits per component. This entry is usually the same as
BitsPerComponent, but unlike this it is guaranteed to be available.
colorspaceid
(Integer) Index of the image’s color space in the colorspaces[] pseudo object. This can be
used to retrieve detailed color space properties.
filterinfo (Dict) Describes the remaining filter for streams with unsupported filters or when retrieving
stream data with the keepfilter option set to true. If there is no such filter no filterinfo
dictionary will be available. The dictionary contains the following entries:
name (Name) Name of the filter
supported (Boolean) True if the filter is supported
decodeparms
(Dict) The DecodeParms dictionary if one is present for the filter
maskid (Integer) Index of the image’s mask in the images[] pseudo object if the image is masked,
otherwise -1
maskonly (Boolean) If true, the image is only used as a mask for (one or more) other images, but not
directly
patterns (Array of dicts) Array containing dictionaries for all patterns on the page or in the document
properties (Array of dicts) Array containing dictionaries for all properties on the page or in the document
shadings (Array of dicts) Array containing dictionaries for all shadings on the page or in the document. In addition
to the standard PDF keys in shading dictionaries the following pseudo key is supported:
colorspaceid
(Integer) Index of the underlying color space in the colorspaces[] pseudo object.
templates (Array of dicts) Array containing dictionaries for all templates (Form XObjects) on the page or in the doc-
ument
Full pCOS mode (mode 2). Encrypted PDFs can be processed without any restriction
provided the master password has been supplied upon opening the file. All objects will
be returned unencrypted. Unencrypted documents will always be opened in full pCOS
mode.
Restricted pCOS mode (mode 1). If the document has been opened without the appro-
priate master password and does not require a user password (or the user password has
been supplied) pCOS operations are subject to the following restriction: The contents of
objects with type string, stream, or fstream can not be retrieved with the following excep-
tions:
> The objects /Root/Metadata and /Info/* (document info keys) can be retrieved if
nocopy=false or plainmetadata=true.
> The objects bookmarks[...]/Title and annots[...]/Contents (bookmark and annotation
contents) can be retrieved if nocopy=false, i.e. if text extraction is allowed for the
main text on the pages.
Minimum pCOS mode (mode 0). Regardless of the encryption status and the availability
of passwords, the universal pCOS pseudo objects listed in Table 5.3 are always
available. For example, the encrypt pseudo object can be used to query a document's
encryption status. Encrypted objects cannot be retrieved in minimum pCOS mode.
Table 5.6 lists the resulting pCOS modes for various password combinations. De-
pending on the document’s encryption status and the password supplied when open-
ing the file, PDF object paths may be available in minimum, restricted, or full pCOS
mode. Trying to retrieve a pCOS path which is inappropriate for the respective mode
will raise an exception.
none of the passwords restricted pCOS mode if no user password is set, minimum pCOS mode
otherwise
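The mode descriptions above can be condensed into a small decision function. This is only a sketch derived from the prose in this section; the class, enum, and parameter names are hypothetical, and the mode is expressed as an enum rather than as the numeric pcosmode value.

```java
// Hypothetical model of the pCOS mode rules described in this section.
final class PcosModes {
    enum Mode { MINIMUM, RESTRICTED, FULL }

    /**
     * @param encrypted       true if the document is encrypted
     * @param masterSupplied  true if the correct master password was supplied
     * @param userPasswordSet true if a user password is required for opening
     * @param userSupplied    true if the correct user password was supplied
     */
    static Mode resultingMode(boolean encrypted, boolean masterSupplied,
                              boolean userPasswordSet, boolean userSupplied) {
        // Unencrypted documents and a supplied master password give full mode.
        if (!encrypted || masterSupplied) {
            return Mode.FULL;
        }
        // Without master password: restricted mode if opening is possible.
        if (!userPasswordSet || userSupplied) {
            return Mode.RESTRICTED;
        }
        // Neither password available although a user password is set.
        return Mode.MINIMUM;
    }
}
```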
Names and values, as well as multiple name/value pairs can be separated by arbitrary
whitespace characters (space, tab, carriage return, newline). The value may consist of a
list of multiple values. You can also use an equal sign ’=’ between name and value:
name=value
Simple values. Simple values may use any of the following data types:
> Boolean: true or false; if the value of a boolean option is omitted, the value true is as-
sumed. As a shorthand notation nofoo can be used instead of foo=false to disable op-
tion foo.
> String: these are plain ASCII strings which are generally used for non-localizable key-
words. Strings containing whitespace or ’=’ characters must be bracketed with { and }.
An empty string can be constructed with { }. The characters {, }, and \ must be preced-
ed by an additional \ character if they are supposed to be part of the string.
> Strings and name strings: these can hold Unicode content in various formats; see
Section 3.2, »C Binding«, page 20 for C- and C++-specific details regarding name
strings.
> Unichar: these are single Unicode characters, where several syntax variants are sup-
ported: decimal values (e.g. 173), hexadecimal values prefixed with x, X, 0x, 0X, or U+
(xAD, 0xAD, U+00AD), numerical or character references (see below), but without
the ’&’ and ’;’ decoration (shy, #xAD, #173). Alternatively, literal characters can be
supplied. Unichars must be in the range 0-65535 (0-xFFFF).
> Keyword: one of a predefined list of fixed keywords
> Float and integer: decimal floating point or integer numbers; point and comma can
be used as decimal separators for floating point values. Integer values can start with
x, X, 0x, or 0X to specify hexadecimal values. Some options (this is stated in the re-
spective function description) support percentages by adding a % character directly
after the value.
> Handle: several internal object handles, e.g., document or page handles. Technically
these are integer values.
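To illustrate the Unichar syntax variants listed above, the following Java sketch converts the numeric forms to a code point and checks the permitted range. It is not the TET option parser: the class name is hypothetical, character entity names such as shy are deliberately not handled, and a single digit is treated as a decimal value rather than as a literal character (an assumption of this sketch).

```java
// Hypothetical parser for the numeric Unichar forms described above.
final class UnicharParser {
    /** Parses e.g. 173, xAD, 0xAD, U+00AD, #xAD, #173, or one literal character. */
    static int parse(String s) {
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty Unichar");
        }
        int value;
        if (s.length() == 1 && !Character.isDigit(s.charAt(0))) {
            value = s.charAt(0);                           // literal character
        } else if (s.startsWith("U+") || s.startsWith("0x")
                || s.startsWith("0X") || s.startsWith("#x")) {
            value = Integer.parseInt(s.substring(2), 16);  // two-character hex prefix
        } else if (s.startsWith("x") || s.startsWith("X")) {
            value = Integer.parseInt(s.substring(1), 16);  // one-character hex prefix
        } else if (s.startsWith("#")) {
            value = Integer.parseInt(s.substring(1), 10);  // decimal reference
        } else {
            value = Integer.parseInt(s, 10);               // plain decimal value
        }
        if (value < 0 || value > 0xFFFF) {
            throw new IllegalArgumentException("Unichar out of range: " + s);
        }
        return value;
    }
}
```

Malformed input surfaces as a NumberFormatException, which is a subclass of IllegalArgumentException, so callers can catch a single exception type.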
Depending on the type and interpretation of an option additional restrictions may ap-
ply. For example, integer or float options may be restricted to a certain range of values;
List values. List values consist of multiple values, which may be simple values or list
values in turn. Lists are bracketed with { and }. Example:
TET_set_option( ): searchpath={/usr/lib/tet d:\tet}
Note The backslash \ character requires special handling in many programming languages
Rectangles. A rectangle is a list of four float values specifying the coordinates of the
lower left and upper right corners of a rectangle. Rectangle coordinates will be inter-
preted in the standard or user coordinate system (see »Coordinate system«, page 33). Ex-
ample:
TET_open_page( ): includebox = {{0 0 500 100} {0 500 500 600}}
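As a sketch of how such a list of rectangles could be assembled programmatically, the following Java helper formats each rectangle as four numbers and wraps all rectangles in an outer list. The class and method names are hypothetical; the sketch writes the option without blanks around the equal sign, which the option list syntax also permits.

```java
// Hypothetical helper: builds an includebox option from rectangles.
final class IncludeBox {
    /** Formats one rectangle given by its lower left and upper right corners. */
    static String rect(double llx, double lly, double urx, double ury) {
        if (llx > urx || lly > ury) {
            throw new IllegalArgumentException("lower left must not exceed upper right");
        }
        return "{" + num(llx) + " " + num(lly) + " " + num(urx) + " " + num(ury) + "}";
    }

    /** Wraps any number of formatted rectangles in the outer list. */
    static String option(String... rects) {
        return "includebox={" + String.join(" ", rects) + "}";
    }

    /** Prints whole numbers without a fraction, other values with a decimal point. */
    private static String num(double v) {
        return v == Math.rint(v) ? String.valueOf((long) v) : String.valueOf(v);
    }
}
```

For the two rectangles of the example above, option(rect(0, 0, 500, 100), rect(0, 500, 500, 600)) yields includebox={{0 0 500 100} {0 500 500 600}}.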
In addition to the HTML-style references above TET supports the custom character enti-
ty names for control characters (see Table 6.1).
Returns A handle to a TET object to be used in subsequent calls. If this function doesn’t succeed
due to unavailable memory it will return NULL.
Bindings This function is not available in object-oriented language bindings since it is hidden in
the TET constructor.
Details Deleting a TET object automatically closes all of its open documents. The TET object
must no longer be used in any function after it has been deleted.
Bindings In object-oriented language bindings this function is generally not required since it is
hidden in the TET destructor. However, in Java it is available nevertheless to allow ex-
plicit cleanup in addition to automatic garbage collection. In .NET Dispose( ) should be
called at the end of processing to clean up unmanaged resources.
utf8string String to be converted. It must contain a valid UTF-8 sequence (on EBCDIC
platforms it must be encoded in EBCDIC). If a Byte Order Mark (BOM) is present, it will
be removed.
size Pointer to a memory location where the length of the returned string (in bytes,
but excluding the terminating two null bytes) will be stored.
Bindings This function is not available in Unicode-capable language bindings. The memory used
for the converted string will be managed by TET, and must not be freed by the client.
Get the name of the API function which caused an exception or failed.
Returns The name of the function which threw an exception, or the name of the most recently
called function which failed with an error code. An empty string will be returned if
there was no error.
Get the text of the last thrown exception or the reason for a failed function call.
Returns Text containing the description of the last exception thrown, or the reason why the
most recently called function failed with an error code. An empty string will be returned
if there was no error.
Get the number of the last thrown exception or the reason for a failed function call.
Returns The number of an exception, or the error code of the most recently called function
which failed with an error code. This function will return 0 if there was no error.
C TET_TRY(tet)
C TET_CATCH(tet)
C TET_RETHROW(tet)
C TET_EXIT_TRY(tet)
Set up an exception handling block; catch or rethrow an exception; or inform the
exception machinery that a TET_TRY( ) block will be left without entering the
corresponding TET_CATCH( ) block.
Details (C language binding only) See Section 3.2, »C Binding«, page 20.
filename (Name string, but Unicode file names are only supported on Windows) Abso-
lute or relative name of the PDF input file to be processed. The file will be searched in all
directories specified in the searchpath resource category. On Windows it is OK to use
UNC paths or mapped network drives. In PHP Unicode filenames must be UTF-8.
len (Only C language binding) Length of filename (in bytes) for UTF-16 strings. If len = 0
a null-terminated string must be provided.
Details Within a single TET object an arbitrary number of documents may be kept open at the
same time. However, a single TET object must not be used in multiple threads simulta-
neously without any locking mechanism for synchronizing the access.
Encryption: if the document is encrypted its user password must be supplied in the
password option if the permission settings allow content extraction. The document’s
master password must be supplied if the permission settings do not allow content ex-
traction.
Supported file systems on iSeries: TET has been tested with PC type file systems only.
Therefore input and output files should reside in PC type files in the IFS (Integrated File
System). The QSYS.lib file system for input files has not been tested and is not supported.
Since QSYS.lib files are mostly used for record-based or database objects, unpredictable
behavior may be the result if you use TET with QSYS.lib objects. TET file I/O is always
stream-based, not record-based.
copy (Boolean; Only for TET_open_document_mem( ), and only useful for the C and C++ bindings) If true, TET
will immediately make an internal copy of the supplied PDF data. Otherwise the client is responsible for
keeping the data available until the corresponding call to TET_close_document( ). Default: false
encodinghint (String) The name of an encoding which will be used to determine Unicode mappings for glyph names
which cannot be mapped by standard rules, but only by a predefined internal glyph mapping rule. The
keyword none can be used to disable all predefined rules. Default: winansi
glyphmapping (List of option lists) A list of option lists where each option list describes a glyph mapping method for one
or more font/encoding combinations which cannot reliably be mapped with standard methods. The
mappings will be used in least-recently-set order. If the last option list contains the fontname wildcard
»*«, preceding mappings will no longer be used. Each rule consists of an option list according to Table 6.3
(default: predefined internal glyph rules will be applied).
keeppua (Boolean) If true, PUA (Private Use Area) values will be returned as such; otherwise they will be mapped
to the Unicode replacement character (see option unknownchar). Default: false
inmemory (Boolean; Only for TET_open_document( )) If true, TET will load the complete file into memory and process
it from there. This can result in a tremendous performance gain on some systems (especially MVS) at
the expense of memory usage. If false, individual parts of the document will be read from disk as needed.
Default: false
password (String; Maximum string length: 32 characters) The user or master password for encrypted documents. If
the document’s permission settings allow text copying then the user password is sufficient, otherwise the
master password must be supplied.
Note: vendors of search engines may want to locate a document without making available the actual
text contents to the user. Premium customers may obtain a custom version of TET under a special license
agreement which allows text retrieval even without the master password, assuming no user password
has been set.
See Section 5.6, »Encrypted PDF Documents«, page 65, to find out how to query a document’s encryption
status, and pCOS operations which can be applied even without knowing the user or master password.
repair (Keyword) Specifies how to treat damaged PDF documents. Repairing a document takes more time than
normal parsing, but may allow processing of certain damaged PDFs. Note that some documents may be
damaged beyond repair (default: auto):
force Unconditionally try to repair the document, regardless of whether or not it has problems.
auto Repair the document only if problems are detected while opening the PDF.
none No attempt will be made at repairing the document. If there are problems in the PDF the
function call will fail.
unknownchar (Unichar) The character to be used as a replacement for unknown characters which cannot be mapped to
Unicode (see Section 4.4, »Unicode Mapping«, page 37). Default: U+FFFD (Replacement Character)
usehostfonts (Boolean) If true, data for fonts which are not embedded, but are required for determining Unicode
mappings will be searched on the Mac or Windows host operating system. Default: true
codelist (String) Name of a codelist resource to be applied to the font. It will have higher priority than an embed-
ded ToUnicode CMap or encoding entry.
fontname (Name string) Prefix or full name of the font to which the rule will be applied (subset prefixes in the font
name must be excluded). Limited wildcards¹ are supported. Default: *
forceencoding (List with one or two strings²; if there are two names, the first must be winansi or macroman) Replace the
first encoding with the encoding resource specified by the second name. If only one entry is supplied, the
specified encoding will be used to replace all instances of MacRoman, WinAnsi, and MacExpert encoding.
forcettsymbolencoding (Keyword or string²) The name of an encoding which will be used to determine Unicode mappings for
embedded pseudo TrueType symbol fonts which are actually text fonts, or one of the following keywords
(default: auto):
auto If the font’s builtin encoding (see below) contains at least one Unicode character in the
symbolic range U+F000-U+F0FF, the encoding specified in the encodinghint option will be
used to map the pseudo symbol characters to real text characters. Otherwise encodinghint
will not be used, and the characters will be mapped according to the builtin keyword.
builtin Use the font’s builtin encoding, which results from the Unicode mappings of the glyph names
in the font’s post table.
The well-known TrueType fonts Wingdings* and Webdings* will always be treated as symbol fonts.
glyphrule (Option list) Mapping rule for numerical glyph names (in addition to the predefined rules). The option list
must contain the following suboptions:
prefix (String; may be empty) Prefix of the glyph names to which the rule will be applied.
base (Keyword) One of the keywords hex or dec for hexadecimal or decimal representation of
codes within a glyph name.
encoding (String) Name of an encoding resource which will be used for this rule, or the keyword none to
disable the rule.
tounicodecmap (String) Name of a ToUnicode CMap resource to be applied to the font; it will have higher priority than an
embedded ToUnicode CMap or encoding entry.
1. Limited wildcards: The standalone character »*« denotes all fonts; using »*« after a prefix (e.g. »MSTT*«) denotes all fonts starting
with the specified prefix.
2. The following predefined encoding names can be used without additional configuration: winansi, macroman, macroman_apple,
macroman_euro, ebcdic, ebcdic_37, iso8859-X, cpXXXX, and U+XXXX. Custom encodings can be defined as resources.
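The prefix and base suboptions of glyphrule can be illustrated with a small sketch. It only shows how a numerical code might be extracted from a glyph name once the prefix and the number base are known; it is not TET's implementation. The function name glyphrule_code and the sample glyph names are hypothetical, and the subsequent lookup of the code in the encoding resource is not shown.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the TET API: extract the numerical code
 * from a glyph name such as "G1A" (prefix "G", base 16) or "c65" (prefix "c",
 * base 10). Returns the code, or -1 if the name does not match the rule. */
long glyphrule_code(const char *glyphname, const char *prefix, int base)
{
    size_t plen;
    const char *digits;
    char *end;
    long code;

    assert(glyphname != NULL && prefix != NULL);
    if (base != 10 && base != 16)
        return -1;                      /* only dec and hex are described */

    plen = strlen(prefix);              /* the prefix may be empty */
    if (strncmp(glyphname, prefix, plen) != 0)
        return -1;                      /* prefix does not match */

    digits = glyphname + plen;
    /* Reject empty digit strings, signs and leading whitespace up front. */
    if (base == 16 ? !isxdigit((unsigned char) *digits)
                   : !isdigit((unsigned char) *digits))
        return -1;

    /* Note: for base 16 strtol also accepts an optional "0x" prefix. */
    code = strtol(digits, &end, base);
    if (*end != '\0')
        return -1;                      /* trailing characters after the code */
    return code;
}
```

With prefix G and base hex, the hypothetical glyph name G1A yields the code 26, which would then be looked up in the encoding named by the encoding suboption.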
data A reference to the data containing the PDF document. In C and C++ this is a pointer.
In Java and C# this is a byte array. In PHP this is a string. In COM this is a variant of
type byte.
size (Only for the C and C++ bindings) The length of the data in bytes.
Open a PDF document from a custom data source for content extraction.
opaque A pointer to some user data that might be associated with the input PDF docu-
ment. This pointer will be passed as the first parameter of the callback functions, and
can be used in any way. TET will not use the opaque pointer in any other way.
readproc A C callback function which copies size bytes to the memory pointed to by
buffer. If the end of the document is reached it may copy less data than requested. The
function must return the number of bytes copied.
seekproc A C callback function which sets the current read position in the document.
offset denotes the position from the beginning of the document (0 meaning the first
byte). If successful, this function must return 0, otherwise -1.
Bindings This function is only available in the C and C++ language bindings.
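As an illustration of the callback contract described above, the following sketch implements readproc and seekproc for a document that is already held in memory. The callback signatures follow the prototype of TET_open_document_callback( ); the mem_source structure and the names mem_read and mem_seek are hypothetical and chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user data behind the opaque pointer: a PDF held in memory. */
typedef struct {
    const unsigned char *data;  /* start of the document */
    size_t size;                /* total length in bytes */
    size_t pos;                 /* current read position */
} mem_source;

/* readproc: copy up to size bytes to buffer and return the number copied. */
size_t mem_read(void *opaque, void *buffer, size_t size)
{
    mem_source *src = (mem_source *) opaque;
    size_t avail;

    assert(src != NULL && buffer != NULL);
    avail = src->size - src->pos;
    if (size > avail)
        size = avail;           /* end of document: deliver less data */
    memcpy(buffer, src->data + src->pos, size);
    src->pos += size;
    return size;
}

/* seekproc: set the read position relative to the beginning of the
 * document; returns 0 if successful, otherwise -1. */
int mem_seek(void *opaque, long offset)
{
    mem_source *src = (mem_source *) opaque;

    assert(src != NULL);
    if (offset < 0 || (unsigned long) offset > src->size)
        return -1;
    src->pos = (size_t) offset;
    return 0;
}
```

A client would pass the address of a mem_source as opaque, its size member as filesize, and the two functions as readproc and seekproc. For data that is already in memory TET_open_document_mem( ) is the more direct route; the in-memory source is used here only because it keeps the sketch self-contained.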
Release a document handle and all internal resources related to that document.
Details Closing a document automatically closes all of its open pages. All open documents and
pages will be closed automatically when TET_delete( ) is called. It is good programming
practice, however, to close documents explicitly when they are no longer needed.
Closed document handles must no longer be used in any function call.
pagenumber The physical number of the page to be opened. The first page has page
number 1. The total number of pages can be retrieved with TET_pcos_get_number( ) and
the pCOS path length:pages.
Details Within a single document an arbitrary number of pages may be kept open at the same
time. The same page may be opened multiple times with different options. However, options
cannot be changed while processing a page.
Layer definitions (optional content groups) which may be present on the page are
not taken into account: all text on all layers of the page will be extracted, regardless of
the visibility of layers.
excludebox (List of rectangles) Exclude the combined area of the specified rectangles from content extraction.
Default: empty
fontsizerange (List of two floats) Two numbers specifying the minimum and maximum font size of text. Text with a size
outside of this interval will be ignored. The maximum can be specified with the keyword unlimited,
which means that no upper limit will be active. Default: { 0 unlimited }
ignoreinvisibletext (Boolean) If true, text with rendering mode 3 (invisible) will be ignored. Default: false (since invisible
text is mainly used for image+text PDFs containing scanned pages and the corresponding OCR text)
includebox (List of rectangles) Restrict content extraction to the combined area of the specified rectangles. Default:
the complete clipping area
granularity (Keyword) The granularity of the text fragments returned by TET_get_text( ); all modes except glyph will
enable the wordfinder. See »Text granularity«, page 44, for more details (default: word).
glyph A fragment contains the result of mapping one glyph, but may contain more than one
character (e.g. for ligatures).
word A fragment contains a word as determined by the wordfinder.
line A fragment contains a line of text, or the closest approximation thereof. Word separators will
be inserted between two consecutive words.
zone A fragment contains a graphical unit of text; depending on the layout this may be a column
or other entity. Word and line separators will be inserted between two consecutive words or
lines, respectively.
page A fragment contains the contents of a single page. Word, line, and zone separators will be
inserted as appropriate.
dehyphenate (Boolean) If true, hard hyphens (U+002D and U+2010) and soft hyphens (U+00AD) at the end of a line
will be removed, and the text fragments surrounding the hyphen will be combined. Default: true
includeboxorder (Integer) When multiple include boxes have been supplied (see option includebox), this option controls
how the order of boxes affects the wordfinder (default: 0):
0 Ignore include box ordering when analyzing the page contents.
The result will be the same as if all the text outside the include boxes was deleted. This is
useful for eliminating unwanted text (e.g. headers and footers) while not affecting the
wordfinder in any way.
1 Take include box ordering into account when creating words and zones, but not for zone
ordering.
A word will never belong to more than one box. The resulting zones will be sorted in logical
order. In case of overlapping boxes the text will belong to the box which is earlier in the list.
This is useful for extracting text from preprinted forms, extracting text from tables, or when
include boxes overlap for complicated layouts.
2 Consider include box ordering for all operations.
The contents of each include box will be treated independently from other boxes, and the
resulting text will be concatenated according to the order of the include boxes. This is useful
for extracting text from printed forms in a particular ordering, or extracting article columns
in a magazine layout in a predefined order. In all cases advance knowledge about the page
layout is required in order to specify the include boxes in appropriate order.
lineseparator (Unichar; Only for granularity=zone and page) Character to be inserted between lines¹. Default:
U+000A
shadowdetect (Boolean) If true, redundant instances of overlapping text fragments which create a shadow or fake bold
text will be detected and removed. Default: true
punctuationbreaks (Boolean) If true, punctuation characters which are placed close to a letter will be treated as word
boundaries, otherwise they will be included in the adjacent word. Default: true
wordseparator (Unichar; Only for granularity=line, zone, and page) Character to be inserted between words¹. Default:
U+0020
zoneseparator (Unichar; Only for granularity=page) Character to be inserted between zones¹. Default: U+000A
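The effect of the dehyphenate option can be pictured with a deliberately simplified sketch. It joins two consecutive lines of UTF-8 text: if the first line ends with one of the hyphens named above, the hyphen is dropped and the fragments are combined; otherwise a space (the default word separator) is inserted. This is an illustration only and not the algorithm used by the wordfinder; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of a trailing hyphen in UTF-8 text, or 0 if there is none. */
static size_t trailing_hyphen(const char *s, size_t n)
{
    if (n >= 1 && s[n - 1] == '-')
        return 1;                                   /* U+002D */
    if (n >= 2 && memcmp(s + n - 2, "\xC2\xAD", 2) == 0)
        return 2;                                   /* U+00AD soft hyphen */
    if (n >= 3 && memcmp(s + n - 3, "\xE2\x80\x90", 3) == 0)
        return 3;                                   /* U+2010 hyphen */
    return 0;
}

/* Hypothetical helper: join line1 and line2 into out (null-terminated).
 * Returns 0 on success, or -1 if out is too small. */
int join_lines(const char *line1, const char *line2, char *out, size_t outsize)
{
    size_t n1, n2, hyph, need;

    assert(line1 != NULL && line2 != NULL && out != NULL);
    n1 = strlen(line1);
    n2 = strlen(line2);
    hyph = trailing_hyphen(line1, n1);
    n1 -= hyph;                                     /* drop the hyphen, if any */

    need = n1 + (hyph ? 0 : 1) + n2 + 1;            /* separator + terminator */
    if (need > outsize)
        return -1;

    memcpy(out, line1, n1);
    if (!hyph)
        out[n1++] = ' ';                            /* word separator U+0020 */
    memcpy(out + n1, line2, n2 + 1);                /* includes the terminator */
    return 0;
}
```

Joining the lines "extrac-" and "tion" yields "extraction", while "text" and "extraction" yield "text extraction".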
Details All open pages of the document will be closed automatically when TET_close_document( )
is called. It is good programming practice, however, to close pages explicitly when they
are no longer needed. Closed page handles must no longer be used in any function call.
len (C language binding only) A pointer to a variable which will hold the length of the
returned string in UTF-16 values (not bytes!). To determine the number of bytes this value
must be multiplied by 2 if outputformat=utf16; the string length of the returned
null-terminated string must be used if outputformat=utf8.
Returns A string containing the next text fragment on the page. The length of the fragment is
determined by the granularity option of TET_open_page( ). Even for granularity=glyph the
string may contain more than one character (see Section 4.1, »Characters and Glyphs«,
page 31).
If all text on the page has been retrieved an empty string will be returned (in C: a
NULL pointer and *len=0). In this case TET_get_errnum( ) should be called to find out
whether there is no more text because of an error on the page, or because the end of the
page has been reached.
Bindings C language binding: the result will be provided as null-terminated UTF-8 (default) or
UTF-16 string according to the outputformat option of TET_set_option( ). On iSeries and
zSeries EBCDIC-encoded UTF-8 can also be selected, and is enabled by default. The returned
data buffer can be used until the next call to this function.
C++, COM, Java and .NET language bindings: the result will be provided as standard Unicode
string in UTF-16 format.
RPG language binding: the result will be provided as null-terminated ASCII- or EBCDIC-encoded
UTF-8 string, or as a null-terminated UTF-16 string according to the outputformat
option of TET_set_option( ).
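The rule for the len parameter in the C binding (a count of UTF-16 values, not bytes) can be written down as a small helper. This is a sketch under the assumption that the client knows which outputformat is active; the function name fragment_bytes and the is_utf16 flag are hypothetical and not part of the TET API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes occupied by a text fragment.
 * text     - the buffer returned for the fragment
 * len      - the value delivered through the len pointer (UTF-16 values)
 * is_utf16 - nonzero if outputformat=utf16 is active, zero for utf8 */
size_t fragment_bytes(const char *text, int len, int is_utf16)
{
    assert(text != NULL && len >= 0);
    if (is_utf16)
        return (size_t) len * 2;    /* two bytes per UTF-16 value */
    return strlen(text);            /* UTF-8: length of the null-terminated string */
}
```

For a four-character fragment with two accented letters, the helper would report 8 bytes for utf16 and 6 bytes for utf8, which shows why len alone cannot be used as a byte count.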
Get detailed information for the next character in the most recent text fragment.
Details This function can be called after TET_get_text( ). It will advance to the next character for
the current text fragment associated with the supplied page handle (or return 0 or NULL
if there are no more characters), and provide detailed information for this character.
There will be N successful calls to this function where N is the number of UTF-16 characters
in the text fragment returned by the most recent call to TET_get_text( ).
For granularities other than glyph this function will advance to the next character of
the string returned by the most recent call to TET_get_text( ). This way it is possible to retrieve
character metrics when the wordfinder is active and a text fragment may contain
more than one character. In order to retrieve all character details for the current text
fragment this function must be called repeatedly until it returns NULL or 0.
The character details in the structure or properties/fields are valid until the next call
to TET_get_char_info( ) or TET_close_page( ) with the same page handle (whichever occurs
first). Since there is only a single set of character info properties/fields per TET object,
clients must retrieve all character info before they call TET_get_char_info( ) again for the
same or another page or document.
Bindings C and C++ language bindings: If no more characters are available for the most recent
text fragment returned by TET_get_text( ), a NULL pointer will be returned. Otherwise, a
pointer to a TET_char_info structure containing information about a single character
will be returned. The members of the data structure are detailed in Table 6.6.
COM, Java and .NET language bindings: -1 will be returned if no more characters are
available for the most recent text fragment returned by TET_get_text( ), otherwise 1. Individual
character info can be retrieved from the TET properties/public fields according to
Table 6.6. All properties/fields will contain a value of -1 (the unknown field will contain
false) if they are accessed although the function reported that no more characters are available.
Perl language binding: 0 will be returned if no more characters are available for the
most recent text fragment returned by TET_get_text( ), otherwise a hash containing the
keys listed in Table 6.6. Individual character info can be retrieved with the keys in this
hash.
PHP language binding: 0 will be returned if no more characters are available for the
most recent text fragment returned by TET_get_text( ), otherwise an object containing
the fields listed in Table 6.6. Individual character info can be retrieved from the member
fields of this object. All fields will contain a value of -1 (the unknown field will contain
false) if they are accessed although the function returned 0. Integer fields in the
character info object are implemented as long in the PHP language binding.
uv (Integer) UTF-32 Unicode value of the current character. It will be 0 if the corresponding UTF-16 value is
the trailing value of a surrogate pair (i.e. if type=11).
type (Integer) Type of the character. The following types describe real characters which correspond to a glyph
on the page. The values of all other properties/fields are determined by the corresponding glyph:
0 Normal character which corresponds to exactly one glyph
1 Start of a sequence (e.g. ligature)
The following types describe artificial characters which do not correspond to a glyph on the page. The x
and y fields will specify the most recent real character’s endpoint, the width field will be 0, and all other
fields except uv will contain the values corresponding to the most recent real character:
10 Continuation of a sequence (e.g. ligature)
11 Trailing value of a surrogate pair; the leading value has type=0, 1, or 10.
12 Inserted word, line, or zone separator
unknown (Boolean, in C and C++: integer) Usually false (0), but will be true (1) if the original glyph could not be
mapped to Unicode and has been replaced with the character specified as unknownchar.
x, y (Double) Position of the glyph’s reference point. The reference point is the lower left corner of the glyph
box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters
the x, y coordinates will be those of the end point of the most recent real character.
width (Double) Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial
characters the width will be 0.
alpha (Double) Direction of inline text progression in degrees measured counter-clockwise. For horizontal writing
mode this is the direction of the text baseline; for vertical writing mode it is the deviation from the
standard -90° direction. The angle will be in the range -180° < alpha ≤ +180°. For standard horizontal text
as well as for standard text in vertical writing mode the angle will be 0°.
beta (Double) Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The
angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range
-180° < beta ≤ 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.
fontid (Integer) Index of the font in the fonts[] pseudo object (see Table 5.5). fontid is never negative.
fontsize (Double) Size of the font (always positive); the relation of this value to the actual height of glyphs is not
fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses
all ascenders (including accented characters) and descenders.
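The relation between the uv field and surrogate pairs (type=11) can be illustrated with standard UTF-16 arithmetic. The sketch below derives one uv entry per UTF-16 value: the leading value of a pair receives the full UTF-32 value, and the trailing value receives 0, mirroring the description of uv above. It is an illustration only, not TET code; the function name utf16_to_uv is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration: fill uv_out[i] for each of the n UTF-16 values.
 * Returns the number of trailing surrogate values found, or -1 if the input
 * contains an unpaired surrogate. */
int utf16_to_uv(const uint16_t *text, size_t n, uint32_t *uv_out)
{
    size_t i;
    int trailing = 0;

    assert(text != NULL && uv_out != NULL);
    for (i = 0; i < n; i++) {
        uint32_t u = text[i];

        if (u >= 0xD800 && u <= 0xDBFF) {           /* leading surrogate */
            uint32_t lo;

            if (i + 1 >= n)
                return -1;                          /* pair is incomplete */
            lo = text[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF)
                return -1;                          /* not a trailing surrogate */
            uv_out[i] = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
            uv_out[i + 1] = 0;                      /* trailing value: uv is 0 */
            i++;
            trailing++;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return -1;                              /* stray trailing surrogate */
        } else {
            uv_out[i] = u;                          /* BMP character */
        }
    }
    return trailing;
}
```

For the UTF-16 sequence 0041 D835 DC00 the resulting uv entries are U+0041, U+1D400, and 0, so a client iterating over character info sees three entries for two characters.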
optlist An option list specifying global options according to Table 6.7. If an option is
provided more than once the last instance will override all previous ones. In order to
supply multiple values for a single option (e.g. searchpath) supply all values in a list
argument to this option.
Details Multiple calls to this function can be used to accumulate values for those options
marked in Table 6.7. For unmarked options the new value will override the old one.
codelist¹ (List of name strings) A list of string pairs, where each pair contains the name and value of a codelist
resource (see Section 4.7, »Resource Configuration and File Searching«, page 47).
encoding¹ (List of name strings) A list of string pairs, where each pair contains the name and value of an encoding
resource (see Section 4.7, »Resource Configuration and File Searching«, page 47).
fontoutline¹ (List of name strings) A list of string pairs, where each pair contains the name and value of a FontOutline
resource (see Section 4.7, »Resource Configuration and File Searching«, page 47).
glyphlist¹ (List of name strings) A list of string pairs, where each pair contains the name and value of a glyphlist
resource (see Section 4.7, »Resource Configuration and File Searching«, page 47).
license (String) Set the license key. It must be set before the first call to TET_open_document*( ).
licensefile (String) Set the name of a file containing the license key(s). The license file can be set only once before the
first call to TET_open_document*( ). Alternatively, the name of the license file can be supplied in an
environment variable called PDFLIBLICENSEFILE or (on Windows) via the registry.
outputformat (Keyword; Only for the C and RPG language bindings) Specifies the format of the text returned by
TET_get_text( ) (default on zSeries with USS or MVS: ebcdicutf8; on all other systems: utf8):
utf8 Strings will be returned in null-terminated UTF-8 format (on both ASCII- and EBCDIC-based
systems).
ebcdicutf8 (Only available on EBCDIC-based systems) Strings will be returned in null-terminated EBCDIC-
encoded UTF-8 format. Code page 37 will be used on iSeries, code page 01047 on zSeries.
utf16 Strings will be returned in UTF-16 format in the machine’s native byte ordering (on Intel x86
architectures the native byte order is little-endian, while on Sparc and PowerPC systems it is
big-endian).
resourcefile (Name string) Relative or absolute file name of the UPR resource file. The resource file will be loaded
immediately. Existing resources will be kept; their values will be overridden by new ones if they are set
again. Explicit resource options will be evaluated after entries in the resource file.
The resource file name can also be supplied in the environment variable TETRESOURCEFILE or with a
Windows registry key (see Section 4.7, »Resource Configuration and File Searching«, page 47). Default:
tet.upr (on MVS: upr)
searchpath¹ (List of name strings) Relative or absolute path name(s) of a directory containing files to be read. The
search path can be set multiple times; the entries will be accumulated and used in least-recently-set order (see
Section 4.7, »Resource Configuration and File Searching«, page 47). An empty string deletes all existing
search path entries. On Windows the search path can also be set via a registry entry. Default: empty
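Because outputformat=utf16 delivers text in the machine's native byte ordering, a client that stores or transmits the data may need to know that ordering. The sketch below shows one conventional way to detect it and to read a single UTF-16 value from a byte buffer. It assumes the text is handed over as a buffer of bytes; the function names are hypothetical and not part of the TET API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on little-endian machines (e.g. Intel x86) and 0 on
 * big-endian machines (e.g. Sparc and PowerPC). */
int is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);      /* inspect the lowest-addressed byte */
    return first == 1;
}

/* Hypothetical helper: fetch the UTF-16 value at the given index from a
 * byte buffer in native byte order. memcpy avoids alignment problems that
 * a pointer cast could cause. */
uint16_t utf16_at(const char *buf, size_t index)
{
    uint16_t value;

    assert(buf != NULL);
    memcpy(&value, buf + 2 * index, sizeof value);
    return value;
}
```

A client that writes the text to a file could use is_little_endian( ) to decide which byte order to record, for example by choosing the matching Byte Order Mark.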
Returns The numerical value of the object identified by the pCOS path. For Boolean values 1 will
be returned if they are true, and 0 otherwise.
Get the value of a pCOS path with type name, string, or boolean.
Returns A string with the value of the object identified by the pCOS path. For Boolean values the
strings true or false will be returned.
Details This function will raise an exception if pCOS does not run in full mode and the type of
the object is string (see Section 5.6, »Encrypted PDF Documents«, page 65). As an exception,
the objects /Info/* (document info keys) can also be retrieved in restricted pCOS
Bindings C and C++ language bindings: The string will be returned in UTF-8 format.
C binding: The returned string can be used until the next call to this function.
C++ const unsigned char *pcos_get_stream(int doc, int *length, string optlist, string path)
C# Java final byte[ ] pcos_get_stream(int doc, String optlist, String path)
Perl PHP string TET_pcos_get_stream(resource tet, long doc, string path)
VB Function pcos_get_stream(doc as Long, optlist As String, path As String)
C const unsigned char *TET_pcos_get_stream(TET *tet, int doc, int *length, const char *optlist,
const char *path, ...)
Get the contents of a pCOS path with type stream, fstream, or string.
length (C and C++ language bindings only) A pointer to a variable which will receive
the length of the returned stream data in bytes.
optlist An option list specifying stream retrieval options according to Table 6.8.
Returns The unencrypted data contained in the stream or string. The returned data will be empty
(in C and C++: NULL) if the stream or string is empty.
If the object has type stream, all filters will be removed from the stream contents (i.e.
the actual raw data will be returned). If the object has type fstream or string the data will
be delivered exactly as found in the PDF file, with the exception of ASCII85 and ASCIIHex
filters which will be removed.
Details This function will throw an exception if pCOS does not run in full mode (see Section 5.6,
»Encrypted PDF Documents«, page 65). As an exception, the object /Root/Metadata can
also be retrieved in restricted pCOS mode if nocopy=false or plainmetadata=true. An exception
will also be thrown if path does not point to an object of type stream, fstream, or
string.
Despite its name this function can also be used to retrieve objects of type string. Unlike
TET_pcos_get_string( ), which treats the object as a text string, this function will not
modify the returned data in any way. Binary string data is rarely used in PDF, and cannot
be reliably detected automatically. The user is therefore responsible for selecting
the appropriate function for retrieving string objects as binary data or text.
Note This function can be used to retrieve embedded font data from a PDF. Users are reminded of
the fact that fonts are subject to the respective font vendor’s license agreement, and must not
be reused without the explicit permission of the respective intellectual property owners. Please
contact your font vendor to discuss the relevant license agreement.
Document Functions
Function prototype page
int TET_open_document(TET *tet, const char *filename, int len, const char *optlist) 73
int TET_open_document_mem(TET *tet, const void *data, size_t size, const char *optlist) 75
int TET_open_document_callback(TET *tet, void *opaque, size_t filesize, size_t (*readproc)(void *opaque,
void *buffer, size_t size), int (*seekproc)(void *opaque, long offset), const char *optlist) 76
void TET_close_document(TET *tet, int doc) 76
Page Functions
Function prototype page
int TET_open_page(TET *tet, int doc, int pagenumber, const char *optlist) 77
void TET_close_page(TET *tet, int page) 79
Option Handling
Function prototype page
void TET_set_option(TET *tet, const char *optlist) 83
pCOS Functions
Function prototype page
double TET_pcos_get_number(TET *tet, int doc, const char *path, ...) 85
const char *TET_pcos_get_string(TET *tet, int doc, const char *path, ...) 85
const unsigned char *TET_pcos_get_stream(TET *tet, int doc, int *length, const char *optlist, const char *path, ...) 86