Document Formats and Image Formats: James C. King


Document Formats and Image Formats

James C. King
PDF Architect/Senior Principal Scientist
Advanced Technology Laboratory
Adobe Systems Incorporated

1
Outline

 Some Fundamentals
 PDF Documents
 PDF Pages
 Synthesized Pages versus Scanned Pages
 PDF and JPEG2000
 PDF and ISO Standards

2
Some Fundamentals

3
Image Formats versus Document Formats

“Sampled” Image (e.g., JPEG2000)

Multi-page Compound Document (e.g., PDF)

4
Image Resolution and Size

lower resolution
display

subsample supersample

higher resolution
display (x2)

higher resolution
sampled image (x2)

5
Image Sampling (JPEG2000)

 Subsampling and supersampling tools are needed


 Size and resolution are different things

display or page
(arbitrary image size)

subsample / supersample

JPEG2000 Image
(multiple resolutions)
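
The subsample and supersample arrows above can be illustrated with a small sketch. It uses nearest-neighbor lookup, which is only one possible resampling method and is an assumption here, not something the slides prescribe; the function name `resample_nearest` is hypothetical.

```python
def resample_nearest(pixels, new_width, new_height):
    """Resample a 2-D grid of samples to a new size by nearest-neighbor lookup.

    A larger target size supersamples (samples are repeated);
    a smaller target size subsamples (samples are dropped).
    """
    old_height = len(pixels)
    old_width = len(pixels[0])
    result = []
    for y in range(new_height):
        # Map the target row back to a source row.
        src_y = min(old_height - 1, y * old_height // new_height)
        row = []
        for x in range(new_width):
            # Map the target column back to a source column.
            src_x = min(old_width - 1, x * old_width // new_width)
            row.append(pixels[src_y][src_x])
        result.append(row)
    return result


image = [[1, 2],
         [3, 4]]
supersampled = resample_nearest(image, 4, 4)        # x2 in each direction
subsampled = resample_nearest(supersampled, 2, 2)   # back to the original size
```

The sample count changes while the picture content stays the same, which is the sense in which size and resolution are different things.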

6
PDF Documents

7
PDF: Multi-page Compound Documents

 A Comprehensive Format for Representing Documents and Forms


 Page contents
 Images
 Graphics
 Fonts
 Colorspaces
 Metadata
 Annotations
 Links
 Digital signatures
 <and more>

 Not an image format like TIFF or JPEG


 High fidelity, high precision text layout and graphics features
 Platform and device independent definition
 Selective compression to reduce file size (e.g., image formats)
 Color Management (ICC support)

PDF 1.0 in 1993 … PDF 1.7 in 2006. Many enhancements!


8
Composite Documents

9
PDF Pages

Page Content Objects

10
Text, Graphics and Image

Typographic Text
Sampled Images
Vector Graphics

11
Coordinate Transforms

 x-scale, rotate/skew, rotate/skew, y-scale, x-pos, y-pos

2 0.8 0.7 2 10 210 cm


3 0.9 0.8 1 180 200 Tm

Text
2.5 0 0 -1 235 170 cm
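
As a worked example of the six operands listed above, the sketch below applies a PDF-style matrix `a b c d e f` to a point using x' = a·x + c·y + e and y' = b·x + d·y + f. The helper name `apply_matrix` is hypothetical; the operand values are taken from the first `cm` line on this slide.

```python
def apply_matrix(matrix, point):
    """Apply a PDF transformation matrix (a, b, c, d, e, f) to a point (x, y)."""
    a, b, c, d, e, f = matrix
    x, y = point
    # a and d scale, b and c rotate/skew, e and f translate.
    return (a * x + c * y + e, b * x + d * y + f)


cm = (2, 0.8, 0.7, 2, 10, 210)   # operands of "2 0.8 0.7 2 10 210 cm"

origin = apply_matrix(cm, (0, 0))   # lands at the x-pos, y-pos operands
unit_x = apply_matrix(cm, (1, 0))   # where the unit x vector ends up
unit_y = apply_matrix(cm, (0, 1))   # where the unit y vector ends up
print(origin, unit_x, unit_y)
```

The origin moves to (10, 210), and the unit vectors show the scaling and skew contributed by the first four operands.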

12
Clipping and Masking

Figure labels: Picture, Mask, Typographic Text
Panels: Clip to path (star); Mask off sky

13
Text as Text

The JPEG2000 image compression


technique has been cited by experts
as a new archiving format for digital
images. It is both a preservation and
delivery format, and has been seen
as a possible alternative to the TIFF
format which most institutions use
as a long-term archiving standard.
Produced by both imaging experts
and the Joint Photographic Experts
Group, it is now a recognised ISO
standard. The standard JPEG file
format which is so widely in use is
not yet an ISO standard.

Text as text
(using outline fonts) Text as image

14
Various Resolutions for Image Text

The JPEG2000 image compression


technique has been cited by experts
as a new archiving format for digital
images. (outlines)

9.5 in x 5.3 in
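
To make the effect of resolution on image text concrete, the following sketch computes the pixel dimensions needed to cover the 9.5 in x 5.3 in area at several sampling resolutions. The list of resolutions and the helper name `pixel_dimensions` are illustrative assumptions, not values from the slide.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Return (width, height) in pixels for a physical size sampled at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# Illustrative resolutions only; the slide does not list specific values.
for dpi in (72, 150, 300, 600):
    width_px, height_px = pixel_dimensions(9.5, 5.3, dpi)
    print(f"{dpi:>3} dpi: {width_px} x {height_px} pixels")
```

Text stored as outlines needs none of these choices, since it is rasterized at whatever resolution the output device has.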

15
Resolution Independence

16
Resolution Independence

17
Synthesized Pages
versus
Scanned Pages

18
Document Sources

 Born digital
   More compact
   Editable
   Device independent/resolution independent
   Zoom-able

 Scanned from paper
   Bulky
   Need to pick a sampling resolution
   Text and image need different treatment
   Can do OCR or DR (document recognition)

 Born digital is a luxury

19
OCR’d Text as Underlayer

OCR’d Text
• underlaid
• made invisible
• may have mistakes
• used for search
Scanned Text as Image

A PDF Page
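
One way to picture the underlayer is as a page content stream that contains the OCR'd text in text rendering mode 3 (neither filled nor stroked, so invisible) together with the scanned image. The sketch below builds such a stream as a string. The resource names `Im0` and `F1`, the function names, and the sample coordinates are hypothetical, and the sketch assumes plain ASCII text; a real scan-to-PDF tool would also emit the page, font, and image objects.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def scanned_page_stream(image_name, page_width, page_height, ocr_lines,
                        font_name="F1"):
    """Build a content stream: invisible OCR text plus the scanned image.

    ocr_lines is a list of (x, y, font_size, text) tuples in page units.
    """
    parts = ["BT", "3 Tr"]  # 3 Tr = invisible text, still searchable
    for x, y, size, text in ocr_lines:
        parts.append(f"/{font_name} {size} Tf")
        parts.append(f"1 0 0 1 {x} {y} Tm")  # position this line of text
        parts.append(f"({escape_pdf_string(text)}) Tj")
    parts.append("ET")
    # Paint the scanned image scaled to fill the whole page.
    parts += ["q", f"{page_width} 0 0 {page_height} 0 0 cm",
              f"/{image_name} Do", "Q"]
    return "\n".join(parts)


stream = scanned_page_stream("Im0", 612, 792,
                             [(72, 700, 12, "JPEG2000 (ISO)")])
print(stream)
```

Because the text is invisible, OCR mistakes do not change what the reader sees; they only affect search and text extraction.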

20
Image Text and Image Picture Require Different Treatment

The JPEG2000 image compression technique has been cited by experts as a new archiving format for digital images.
(image text: needs 1-bit per pixel black and white at 600 dpi)

(image picture: needs 24-bit per pixel color at 150 dpi)

The standard JPEG file format which is so widely in use is not yet an ISO standard.

 MRC (Mixed Raster Content)


 Both JPEG2000 and PDF support this
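
The arithmetic behind treating text and pictures differently can be sketched as follows. The figures are uncompressed sizes for one square inch, ignoring row padding and compression, which is a simplifying assumption; the helper name `raw_bytes` is hypothetical.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a region sampled at dpi (no row padding)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


text_layer = raw_bytes(1, 1, 600, 1)       # 1-bit black and white at 600 dpi
picture_layer = raw_bytes(1, 1, 150, 24)   # 24-bit color at 150 dpi
single_layer = raw_bytes(1, 1, 600, 24)    # everything as 24-bit color at 600 dpi

print(text_layer, picture_layer, single_layer)
```

Per square inch, the text treatment costs 45,000 bytes and the picture treatment 67,500 bytes, while sampling everything at 600 dpi in 24-bit color costs 1,080,000 bytes. Separating the two kinds of content is what lets each get the resolution and bit depth it needs.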

21
PDF and JPEG2000

22
PDF Support for JPEG2000

 JPEG2000 images can be included on PDF pages


 JPX Baseline is supported
 Enumerated color space 19 (CIEJab) is not supported
 Enumerated color space 12 (CMYK) is supported
 All four progressions supported: resolution, color depth, band, location
 Inappropriate progression will just cost time
 One global soft mask within the JPEG2000 supported
 JPEG2000 document features are not supported
 PDF's document features are more general and more flexible
 An image to display in a rectangle is obtained from the JPEG2000 stream
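
Before embedding data as a JPEG2000 image (PDF decodes such streams with the JPXDecode filter), a tool may want to check that the bytes really are JPEG2000. The sketch below looks only at the leading bytes: the 12-byte signature box that starts JP2-family files, or the SOC and SIZ markers that start a raw codestream. The function name `classify_jpeg2000` and its return labels are hypothetical, and the check says nothing about which features or color spaces the stream uses.

```python
# First 12 bytes of a JP2-family file: the JPEG 2000 signature box.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")
# A raw codestream starts with the SOC marker followed by the SIZ marker.
J2K_CODESTREAM_START = bytes.fromhex("FF4FFF51")


def classify_jpeg2000(data):
    """Classify bytes as a JP2-family file, a raw codestream, or neither."""
    if data.startswith(JP2_SIGNATURE):
        return "jp2-family file"
    if data.startswith(J2K_CODESTREAM_START):
        return "raw codestream"
    return None


print(classify_jpeg2000(JP2_SIGNATURE + b"\x00" * 8))
print(classify_jpeg2000(b"\xff\xd8\xff\xe0"))  # a classic JPEG header, not JPEG2000
```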

23
Software Support

 The tools available are key to the use of any image format or document format
 Tools for creation
 Support for advanced features
 Tools for presentation
 Tools for incorporating with other formats
 Ubiquity of viewing tools
 OCR and DR capabilities

24
Tools for Scan to PDF

 Tools that separate image text and image pictures (MRC)


 Adobe Acrobat Professional (Create PDF from Scanner)
 Adobe PDF Scan Library 3.0 (OEM product)
 CVision Technologies (www.cvisiontech.com) option for JPEG2000
 Canon desktop printers and multi-function devices (some)
 Iris (www.irislink.com) MRC called IHQ
 LuraTech (www.luratech.com) MRC using JPEG2000
 Nuance (www.nuance.com)
 JRAPublish (Jim Rile)
 Spigraph (www.spigraph.fr)
 VeryPDF (www.veryPDF.com)
 Of course the ubiquitous Adobe Reader presents them all

25
PDF and
ISO Standards

26
Establishing the ISO PDF Umbrella

PDF 1.7 (ISO 32000 in 2008)

Subset   Purpose        Standardization
PDF/A    archive        ISO 19005-1 (PDF 1.4)
PDF/X    graphic arts   ISO 15930-1 (PDF 1.4 & 1.6)
PDF/E    engineering    AIIM Committee --> ISO
PDF/UA   accessibility  AIIM Committee --> ISO

27
PDF/A

A PDF subset for archiving


ISO 19005-1

28
Long-term Preservation Needs for Electronic Documents

 Characteristics identified as objectives for PDF/A were


 Device Independent - Can be reliably and consistently rendered without regard to the hardware or software platform
 Self-contained - Contains all resources necessary for rendering
 Self-documenting - Contains its own description
 Unfettered - Absence of technical file protection mechanisms
 Available - Authoritative specification publicly available
 Adoption - Widespread use may be the best deterrent against preservation risk

29
PDF/A -- A PDF Subset of PDF 1.4
(Standard: ISO 19005-1)

 Some useful PDF features work against, and are incompatible with, preserving information over the long-term
 PDF/A
 PDF Subset: restricted from using some PDF features, for example
 Anything that would alter the visual appearance over time (forms)
 External references or embedded files
 Encryption
 PDF Subset: required to use some PDF features, for example
 Accessibility features for recoverable text (tagged PDF)
 Embed all fonts
 Specific metadata requirements
 Device independent color
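
A few of the restrictions above can be turned into a rough first-pass screen. The sketch below scans raw PDF bytes for names that commonly accompany encryption or embedded files. It is a naive byte search under stated assumptions, not a PDF/A validator: it can miss keys stored inside compressed object streams and can be fooled by the same bytes appearing in unrelated data. The function name `quick_pdfa_red_flags` and the flag wording are hypothetical.

```python
import re


def quick_pdfa_red_flags(pdf_bytes):
    """Return a list of rough warning strings for a would-be PDF/A file."""
    flags = []
    # Every PDF should begin with a %PDF-x.y header.
    if re.match(rb"%PDF-\d\.\d", pdf_bytes) is None:
        flags.append("missing %PDF header")
    # An /Encrypt entry in the trailer indicates encryption.
    if b"/Encrypt" in pdf_bytes:
        flags.append("possible encryption (/Encrypt found)")
    # /EmbeddedFile and /EmbeddedFiles indicate embedded files.
    if b"/EmbeddedFile" in pdf_bytes:
        flags.append("possible embedded files (/EmbeddedFile found)")
    return flags


sample = b"%PDF-1.4\ntrailer << /Root 1 0 R /Encrypt 5 0 R >>"
print(quick_pdfa_red_flags(sample))
```

An empty result only means none of these crude markers were found; requirements such as embedded fonts, metadata, and device independent color need a real conformance checker.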

30
Uses for PDF/A

 Archival storage of electronic documents


 Documents of record
 Government records
 Corporate records
 Distributing read only material
 Documents with assured accessibility (read to the blind)

31