Before we get into the different kinds of tags, it’s important to understand the nature of reading order in PDF.
In HTML, text is simply the natural sequence of characters and spaces enclosed within the tags that structure the content. In this regard (if not in others), HTML could hardly be simpler.
PDF is different because the format’s purpose is to ensure that a document has precisely the same appearance no matter what software is used to view it. Achieving that reliability meant that, as a technical matter, PDF files are organized around the needs of display and printing software rather than accessibility needs.
In fact, that’s why accessible PDF depends on tags, a feature that wasn’t added until 2000, when PDF was already well-established.
In PDF the basic building block for words, lines and paragraphs is called a text run. Each text run might be a single character, two or three characters, one or more words, or portions of words. The way text runs split up the words and sentences is determined by the software that created the PDF.
Accessible PDF requires the text runs to be grouped in correct reading order within tags, and for the tags to be organized in correct reading order within the document.
Correct reading order is the most critical part of ensuring accessible PDF.
Not all software is capable of processing tagged PDF. To maximize the benefits for the largest possible group of users, PDF files should be consistent in certain specific ways.
Physical View (also “Page View”)
The original PDF page, displayed conventionally on a monitor or via a printer.
Logical View (also “Tags View”)
The document contents as read by screen readers and other assistive technology that uses tagged PDF.
Content Order View (also “Content View”)
The document contents as drawn by the authoring application. Software that doesn’t use tagged PDF relies on the content order to reflow PDF text for display on smaller screens, to build search engine indexes, and for other purposes.
Many PDF documents “look OK” when displayed or printed but are not accessible because the tags do not accurately and precisely reflect the logical order of the content. Checking and correcting (remediating) PDF tags and adjusting content order accordingly is key to making a PDF accessible to users with the widest possible variety of software.
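The gap between content order and logical order can be illustrated with a small Python sketch. It is only a simplified model, not how any PDF library or CommonLook represents a page: the `TextRun` and `Tag` classes, the `draw_index` field, and the sample page are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TextRun:
    text: str
    draw_index: int  # hypothetical: position at which the run was drawn on the page


@dataclass
class Tag:
    kind: str  # e.g. "H1" or "P"
    runs: list[TextRun] = field(default_factory=list)


def content_order_text(tags: list[Tag]) -> str:
    """Text as software that ignores tags would see it: runs in drawing order."""
    runs = [run for tag in tags for run in tag.runs]
    return "".join(run.text for run in sorted(runs, key=lambda r: r.draw_index))


def logical_order_text(tags: list[Tag]) -> str:
    """Text as tag-aware software would see it: tag by tag, run by run."""
    return "".join(run.text for tag in tags for run in tag.runs)


def orders_agree(tags: list[Tag]) -> bool:
    """True when drawing order and tag order yield the same text."""
    return content_order_text(tags) == logical_order_text(tags)


# A made-up page whose sidebar was drawn first but is tagged last.
page = [
    Tag("H1", [TextRun("Annual Report ", 1)]),
    Tag("P", [TextRun("Revenue grew. ", 2)]),
    Tag("P", [TextRun("Sidebar note. ", 0)]),
]

print(logical_order_text(page))
print(content_order_text(page))
print(orders_agree(page))
```

In this example, software that relies on content order would present the sidebar before the heading, while the tags place it last. Remediation aims to bring the two orders into agreement.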
Text runs are blocks of characters that range from a single letter or a single space to a few letters or a couple of words. You will soon notice that a single word may be broken across two or more text runs. How words and letters are divided among text runs within a tag has no impact on the ability of a screen reader to pronounce a word correctly.
Why are text runs this way?
You may wonder why text runs are divided in what often appears to be an arbitrary manner. The answer, broadly speaking, lies in the PDF specification itself, ISO 32000, but we’re not suggesting a deep dive there unless you are a developer or are otherwise technically minded. The ultra-concise, semi-technical answer is this: since PDF is fundamentally about consistent printed and onscreen display, text is divided into runs to serve typographic needs such as kerning, and those needs decide how letters are distributed among runs.
You could also say that without tags, a PDF has no concept of any meaning in its content beyond what and where to paint each letter and object.
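As one concrete illustration of the kerning point: the PDF content stream operator `TJ` shows an array of strings interleaved with numeric position adjustments, so a single word can arrive as several strings. The Python sketch below pulls the strings out of one such array. It assumes a simplified syntax (no escaped or nested parentheses, no hex strings), and treating each string as one text run is an assumption about how a tool might surface them, not something the specification mandates. The function name `split_tj_array` and the sample line are made up for this example.

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Simplification: escaped or nested parentheses and hex strings are not handled.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def split_tj_array(array_source: str):
    """Return (strings, adjustments) found in a simple TJ array."""
    strings, adjustments = [], []
    for match in TOKEN.finditer(array_source):
        text, number = match.groups()
        if text is not None:
            strings.append(text)  # a piece of text to paint
        else:
            adjustments.append(float(number))  # a spacing tweak, e.g. kerning
    return strings, adjustments


# The word "World" painted as two strings with a kerning adjustment between them.
sample = "[(W) 80 (orld)] TJ"
strings, adjustments = split_tj_array(sample)
word = "".join(strings)

print(strings)
print(adjustments)
print(word)
```

The adjustment between the two strings only affects where the next letters are painted. Nothing in the array says that the pieces form one word, which is why tags are needed to supply that meaning.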
What’s required for text runs and tags
To give text runs meaning to assistive technology, they are grouped into tags. Tags identify the sequence of text runs and the role those runs play, such as being part of a paragraph, heading or list. Accordingly, text runs must be in correct reading order within each tag, and the tags must also be in correct reading order with respect to each other.
Words may be split between text runs, but the text runs making up a single word must not be placed in different tags. To assistive technology, such a word would appear to be two separate fragments in different paragraphs.
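The effect of splitting a word across tags can be shown with a deliberately simplified Python model. Representing a tag as a `(kind, runs)` tuple and the helper name `blocks_as_read` are assumptions made for this sketch; real assistive technology does considerably more than join strings.

```python
def blocks_as_read(tags):
    """Return one string per tag, joining that tag's text runs in order."""
    return ["".join(runs) for _kind, runs in tags]


# "Accessible" is split across two runs, but both runs sit in the same <P> tag.
same_tag = [("P", ["Acces", "sible ", "PDF"])]

# The same runs, but the word is split across two <P> tags.
split_tags = [("P", ["Acces"]), ("P", ["sible ", "PDF"])]

print(blocks_as_read(same_tag))
print(blocks_as_read(split_tags))
```

In the first case the runs rejoin into a single paragraph containing the whole word. In the second, each tag is read as its own block, so the word reaches the listener as two unrelated fragments.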
Learn more about managing reading order.
Hide any artifacts in the page by clicking Hide Artifacts; this reduces clutter in the logical window.
To repair the structure of a PDF document, you need to match the order of the elements in the logical view to the physical view and ensure they are properly tagged. While the exact steps to be followed depend on the content of the document, the following list describes a typical remediation process for a page within the document.
If you select the text you want to work with in the physical view, the corresponding elements in the logical view are selected as well. You can then drag and drop these elements in the logical view to the appropriate location.
Select elements in the logical view and move them using drag and drop so that the text in the logical view flows in the same order as in the physical view.
To select content within an area in the physical view, position the mouse pointer at the top left corner of the area, press and hold the left mouse button, drag to the bottom right corner and release.
If the document includes tables, use the above drag and drop methods combined with the various ‘Cells’ and ‘Rows’ functions available under the Table menu to ensure that the logical structure of each table matches the physical view and is well structured; that is, the rows and columns must contain the correct cells as displayed in the physical view.
The Page > Fix Common Problems menu item corrects problems with text running together, hyphens and repeated characters.
If you encounter elements incorrectly tagged as a table, click Table > Linearize Table in the menu to transform the table tags into <P> tags.
Next, fix the tags and text runs to ensure the correct logical order and to confirm the tags accurately represent the document’s semantics.
The Page > Abbreviations and Acronyms menu item ensures that replacement text for abbreviations and acronyms in the page is specified and applied.
As you apply changes, CommonLook ensures the appropriate types of containers are created and automatically synchronizes the content and tags views behind the scenes. Any untagged elements from the underlying tags view are removed, with the following exceptions:
- If a user places a text run outside any tag, it is wrapped with a P container and kept untagged until the next time CommonLook PDF loads the page.
- If a user places an image outside any tag, it is wrapped with an Artifact container until the next time CommonLook PDF loads the page.
- If a user places an annotation outside any tag, it is kept untagged until the next time CommonLook PDF loads the page.
If these exceptions occur, the item is automatically retagged the next time CommonLook PDF loads the page.
In addition to facilitating the viewing and remediation of PDF structure, the Logical Structure Editor provides other functions that greatly simplify a number of tasks that make PDF documents more readable. These tasks are described under Semantics (Tag Types).