Internal Document Structure of a PDF Format File: A Glance

November 28, 2022

2 minutes read

Portable Document Format (PDF) is a file format that can be read on virtually any computer or mobile device. Its primary purpose is to present content, such as text and images, with its formatting and layout preserved.

Typically, this file type is chosen when a document should not be easy to change but still needs to be shared or printed. Because a PDF can be opened in a web browser or any conventional reader, it is close to universal.

A PDF file cannot be edited the way a plain text file can. It is a structured binary file made up of a series of tokens built from 8-bit characters and separated by white space. These tokens describe the objects and their types, and mark where each of the PDF's four logical sections (header, body, cross-reference table, and trailer) begins and ends.

Header

In a PDF file, the header begins at byte 0. It is at least 8 bytes long and ends with an end-of-line marker. These 8 bytes identify the file as a PDF (%PDF-) and state which version of the standard the file complies with (e.g., %PDF-1.4).

If the PDF contains binary data, as most do these days, a second line follows, also beginning with the PDF comment character, %. After the %, this line holds at least four bytes with values of 128 or greater, which signals to other tools that the file should be treated as binary rather than text.
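The header check described above can be sketched in a few lines of Python. This is a minimal illustration, not a full validator: the function name `read_pdf_header` and the sample bytes are assumptions made for the example, and it only accepts single-digit version numbers such as 1.4.

```python
import re

# Matches "%PDF-" followed by a version such as 1.4 at the start of a line.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")

def read_pdf_header(data: bytes):
    """Return (version, has_binary_marker) for the first bytes of a PDF."""
    head = data[:1024].splitlines()
    if not head:
        raise ValueError("empty input")
    match = HEADER_RE.match(head[0])
    if match is None:
        raise ValueError("file does not start with a %PDF- header")
    version = match.group(1).decode("ascii")
    has_binary_marker = False
    # The optional second line is a comment holding four or more high bytes.
    if len(head) > 1 and head[1].startswith(b"%"):
        high_bytes = sum(1 for b in head[1][1:] if b >= 128)
        has_binary_marker = high_bytes >= 4
    return version, has_binary_marker

# Hypothetical sample: a header line, a binary marker line, then an object.
sample = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(read_pdf_header(sample))  # ('1.4', True)
```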

Body

The body of the file holds the document's objects, which are the basis on which PDF operates. Each object is one of nine types: null, Boolean, integer, real, name, string, array, dictionary, or stream.
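To make the nine types more concrete, here is a rough Python sketch that guesses the type of an object from how its text begins. The helper `classify_token` is hypothetical and deliberately simplistic: it looks only at leading characters and does not parse nested or malformed objects.

```python
import re

def classify_token(token: str) -> str:
    """Guess the PDF object type of a snippet from its leading characters."""
    token = token.strip()
    if token == "null":
        return "null"
    if token in ("true", "false"):
        return "boolean"
    if re.fullmatch(r"[+-]?\d+", token):
        return "integer"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", token):
        return "real"
    if token.startswith("/"):
        return "name"
    if token.startswith("<<"):
        # A stream is a dictionary followed by stream ... endstream.
        return "stream" if token.endswith("endstream") else "dictionary"
    if token.startswith("(") or token.startswith("<"):
        # Literal strings use parentheses, hexadecimal strings use <...>.
        return "string"
    if token.startswith("["):
        return "array"
    raise ValueError(f"unrecognised token: {token!r}")

examples = ["null", "true", "42", "-3.14", "/Type", "(Hello)",
            "[1 2 3]", "<< /Type /Catalog >>",
            "<< /Length 5 >>\nstream\nHello\nendstream"]
for example in examples:
    print(classify_token(example))
```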

Cross-Reference Table (xref)

This section is what makes random access to a PDF possible. The table lists the byte offset of each indirect object, measured from the beginning of the file. Because of this, a PDF reader can seek directly to any object and read it.

With the help of this random-access model, a PDF can be loaded and processed quickly without first having to load the entire file into RAM. However far apart two pages are, jumping from one to the other is quick.
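The seek-and-read step can be sketched as follows. This is an illustration under simplifying assumptions: the helper `read_object_at` is hypothetical, the tiny in-memory file is made up for the example, and the code assumes each object fits in one 4096-byte read and that the bytes "endobj" do not occur inside stream data.

```python
import io

def read_object_at(stream, offset: int) -> bytes:
    """Seek to a byte offset and return the object that starts there."""
    stream.seek(offset)
    chunk = stream.read(4096)
    end = chunk.find(b"endobj")
    if end == -1:
        raise ValueError("no endobj found within the chunk")
    return chunk[: end + len(b"endobj")]

# Build a tiny, incomplete PDF-like byte string and record where objects start.
header = b"%PDF-1.4\n"
obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
obj2 = b"2 0 obj\n(Hello)\nendobj\n"
data = header + obj1 + obj2
offsets = {1: len(header), 2: len(header) + len(obj1)}

# Only the bytes of object 2 are read; object 1 is never touched.
print(read_object_at(io.BytesIO(data), offsets[2]))
```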

xref
0 9
0000000000 65535 f
0000000015 00000 n
0000000034 00000 n
0000000393 00000 n
0000000432 00000 n
0000000542 00000 n
0000000601 00000 n
0000000631 00000 n
0000000698 00000 n

In its original form (PDF 1.0 to 1.4), the table consists of one or more cross-reference sections. Each section starts with a line giving the first object number and the number of entries (0 9 above), followed by one line per object stating its byte offset, its generation number, and whether it is in use (n) or free (f). The most common table type (shown above) consists of a single section listing all objects.
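As a sketch of how such a table might be read, the Python below parses the classic text form into one record per object. The names `XrefEntry` and `parse_xref_section` are assumptions made for this example, and the code does not handle the cross-reference streams introduced in later PDF versions.

```python
from typing import NamedTuple

class XrefEntry(NamedTuple):
    object_number: int
    offset: int        # byte offset (for a free entry: next free object number)
    generation: int
    in_use: bool       # True for "n" entries, False for "f" entries

def parse_xref_section(text: str) -> list[XrefEntry]:
    """Parse a classic text xref section into a list of entries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the 'xref' keyword")
    entries = []
    i = 1
    while i < len(lines) and lines[i] != "trailer":
        # Subsection header: first object number and entry count.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for n in range(count):
            offset, generation, flag = lines[i + n].split()
            entries.append(
                XrefEntry(first + n, int(offset), int(generation), flag == "n")
            )
        i += count
    return entries

table = """xref
0 9
0000000000 65535 f
0000000015 00000 n
0000000034 00000 n
0000000393 00000 n
0000000432 00000 n
0000000542 00000 n
0000000601 00000 n
0000000631 00000 n
0000000698 00000 n
"""
entries = parse_xref_section(table)
print(len(entries), entries[3])
```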

Trailer

When processing a document, it is helpful to read its trailer first. The trailer is essentially a dictionary whose keys and values describe various aspects of the document.

Here is what a simple trailer looks like:

trailer
<<
/Size 23
/Root 5 0 R
/ID [<E3FEB541622C4F35B45539A690880C71><E3FEB541622C4F35B45539A690880C71>]
/Info 6 0 R
>>

Size and Root are the two most important keys, and both are required. The Size key gives the expected number of entries in the xref table, while the Root key points to the PDF's catalogue dictionary, which is where the search for the document's objects should begin. Other standard keys include Encrypt, ID and Info.
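A rough Python sketch of pulling these keys out of a trailer is shown below. The function `parse_trailer` is hypothetical and relies on regular expressions rather than a real PDF parser, so it only copes with simple trailers like the sample; indirect references such as 5 0 R are returned as (object number, generation) pairs.

```python
import re

def parse_trailer(text: str) -> dict:
    """Extract Size, Root, Info, Prev and ID from a simple trailer."""
    result = {}
    for key in ("Size", "Prev"):
        number = re.search(rf"/{key}\s+(\d+)", text)
        if number:
            result[key] = int(number.group(1))
    for key in ("Root", "Info"):
        # An indirect reference has the form "<object> <generation> R".
        ref = re.search(rf"/{key}\s+(\d+)\s+(\d+)\s+R", text)
        if ref:
            result[key] = (int(ref.group(1)), int(ref.group(2)))
    ids = re.search(
        r"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]", text
    )
    if ids:
        result["ID"] = [ids.group(1), ids.group(2)]
    return result

trailer_text = """trailer
<<
/Size 23
/Root 5 0 R
/ID [<E3FEB541622C4F35B45539A690880C71><E3FEB541622C4F35B45539A690880C71>]
/Info 6 0 R
>>
"""
trailer = parse_trailer(trailer_text)
print(trailer["Size"], trailer["Root"], trailer["Info"])  # 23 (5, 0) (6, 0)
```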

Incremental Update

Incremental updating is one of PDF's most helpful features. It works by appending the changed objects, followed by a new cross-reference section and trailer, to the end of the document.

Modifications can therefore be saved quickly: nothing that already exists in the file has to be read, processed, or rewritten, because the changes are simply appended to the end of the PDF.

After the initial cross-reference section, each subsequent section lists only the new, modified, or deleted objects. Its trailer refers back to the previous cross-reference section through the Prev key, so a reader can follow the chain from the newest section to the oldest.
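The way a reader might combine these chained sections can be sketched in Python. This is a simplified model rather than a file parser: the function `merge_xref_chain`, the in-memory `sections` mapping, and all offsets in it are made up for illustration. Each section is stored under its own file offset together with the Prev value from its trailer.

```python
def merge_xref_chain(sections: dict, start: int) -> dict:
    """Walk the Prev chain from the newest section; newer entries win."""
    merged = {}
    visited = set()
    offset = start
    while offset is not None:
        if offset in visited:
            raise ValueError("Prev chain loops back on itself")
        visited.add(offset)
        entries, prev = sections[offset]
        for object_number, entry in entries.items():
            # Keep the first entry seen, which comes from the newest section.
            merged.setdefault(object_number, entry)
        offset = prev
    return merged

# Hypothetical file: original section at offset 700, one update at 1200.
sections = {
    700: ({0: ("f", 0), 1: ("n", 15), 2: ("n", 34)}, None),
    1200: ({2: ("n", 900), 3: ("n", 1000)}, 700),  # object 2 modified, 3 added
}
merged = merge_xref_chain(sections, start=1200)
print(merged[2])  # ('n', 900)
```

Object 2 resolves to the appended copy at offset 900, while object 1 still resolves to its original position, which matches the behaviour described above.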
