This article compares several popular electronic document file formats. The purpose is limited to identifying some inherent features and capabilities that each brings to addressing the needs of assistive technology users. Much is left unsaid! I know that, but this is an article, not a book, ok? Feel free to leave a comment.
HTML is easy. PDF… not so much
It’s generally considered that “HTML is accessible”. What is usually meant is that HTML is inherently accessible, and in a strictly limited sense, that’s pretty much the case.
Simply remove any CSS and inline styling and you’ve got plain text and markup. At least within the page content itself, text and graphics are generally in more-or-less the right order. If the website’s manager has done a good job, the site as a whole is readily “consumable” by the technologies that check, display, feed or otherwise deliver content to “end-users” with various needs, whether human or machine.
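To make "remove any CSS and inline styling" concrete, here is a minimal sketch using only Python's standard `html.parser` module. The class name `StyleStripper`, the helper `strip_styling` and the choice of which attributes count as styling are assumptions made for illustration; real-world pages would need more care (external stylesheets, void elements, malformed markup).

```python
from html import escape
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emits tags and text in document order, dropping styling."""

    # Attributes treated as presentation-only (an assumption for this sketch).
    STYLE_ATTRS = {"style", "class"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            # Skip embedded stylesheets entirely.
            self._in_style = True
            return
        kept = [(k, v) for k, v in attrs if k not in self.STYLE_ATTRS]
        attr_text = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False
            return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text inside <style> is CSS, not content, so it is dropped.
        if not self._in_style:
            self.parts.append(data)


def strip_styling(html_text):
    """Return the markup and text of html_text without CSS or inline styles."""
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


sample = (
    "<style>p {color: red}</style>"
    '<h1 class="big">Title</h1>'
    '<p style="font-size: 9px" id="intro">Hello</p>'
)
print(strip_styling(sample))  # <h1>Title</h1><p id="intro">Hello</p>
```

What remains after stripping is the structure (a heading, a paragraph) and the text, in reading order, which is the part assistive technology actually depends on.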
Any piece of software worthy of the name “assistive technology” can read vanilla HTML.
It’s been a longtime goal for both Microsoft and assistive technology (AT) developers to find ways for OOXML (the format used by current versions of MS Word) to play well with AT, and generally, it does. Correspondingly, OOXML is generally considered accessible, even if current implementations (both Microsoft’s Word and the free Open Office) suffer a few limitations.
The Saga of Accessible PDF
Historically, PDF has not enjoyed such friendly relations with the AT user or developer community. Technically, PDF is vastly more complex than HTML, and entirely distinct from OOXML. The specification for PDF, no longer owned by Adobe Systems and now known as ISO 32000, has included accessibility features since 2000 but offers little guidance for developers.
Consequently, the vast majority of PDF files are still untagged. Even most tagged PDF is still poorly tagged.
PDF/UA, the new International Standard for accessible PDF to be published in Q3 2012, will provide explicit rules that allow developers to build software that ensures PDF files are accessible.
Why just these features?
This article aims to show how appropriate each format is by comparing their relative strengths for assistive technology across a variety of relevant use cases.
|Document Features in the Assistive Technology Context|
|Essential role||Website||Authoring||Final Form (untagged)||Final form|
Readily restyle text size, color, font, etc
|Current software can readily restyle text||A||A||N/A||Soon?!?|
A reliable portable format that is basically static
|May be readily exported to clean HTML||A||A-||E||A+*|
|Can play well with search engines||A||A||E||A|
|Can make any source content accessible||N/A||N/A||E||A-|
|Portable (self-contained, works offline)||E||B-||A||A|
|Suitable for large documents||C-||C||A||A|
|Encourages distinguishing style from structure||C||C||N/A||A|
|Supports row headers in tables||A||D-**||N/A||A|
|Documents can comply with WCAG 2.0||A||N/A||N/A||A-***|
|Many vendors producing software||A||C||A-||Soon?!?|
|Windows AT support (JAWS, NVDA, ZoomText, etc)||A-||A-||E||A-****|
|Mac AT support (VoiceOver)||A||E||E||E?|
Footnotes to the table
- * – PDF/UA conforming PDF files may be readily exported to highly structured, consistent HTML, ideally suited for restyling.
- ** – While Office Open XML (OOXML) supports row headers in tables, MS Word and Open Office at this time do not.
- *** – PDF/UA is the normative technical accessibility standard for the PDF file format. PDF files may contain media, actions, scripts and content requiring consideration beyond the scope of PDF/UA in order to conform with WCAG 2.0.
- **** – Unlike WCAG 2.0, PDF/UA includes specifications for conforming AT and PDF readers. While current AT software such as NVDA and ZoomText does support tagged PDF, none yet support PDF/UA.
What’s being graded
While static implementations are not uncommon, HTML is fundamentally a format of the web. OOXML is almost exclusively used as a document authoring format in applications such as Microsoft Word and Open Office.
Untagged PDF, by contrast, is typical for “final form” documents intended for distribution to end users. While it’s certainly possible for tagged PDF documents to be accessible (that’s our business, after all), the new PDF/UA standard will provide assurance of high-quality, dependable PDF tags.
It’s an innate feature of HTML that cascading style-sheets (CSS) make it possible to “re-skin” HTML content easily and quickly. Word’s “Styles” accomplish a similar task.
Many users benefit from the ability to reflow text so they can change font size and typeface to make the document more readable. While HTML and Word are relatively easy to restyle, untagged PDF can’t be restyled at all. Tagged PDF can be restyled (see for example Adobe’s “Reflow” feature), but the implementation at this time is flawed and incomplete.
PDF, by contrast, is designed to function as “electronic paper”, to be faithful to the author’s intent, and portable above all. Obviously, this isn’t always desirable when actually reading the document. Within the limitations of document security settings, tagged PDF was invented precisely to allow users with different needs to be able to extract text and structure for reliable restyling.
Some AT users prefer to extract their current document to HTML where they can apply their own styling.
While HTML and OOXML files are easy for search engines to index properly, PDF poses a challenge. Tagged PDF allows search engines to process PDF files as if they were HTML.
HTML and OOXML facilitate authoring rather than the integration of content from heterogeneous sources. PDF, on the other hand, readily combines pages from desktop publishing, CAD, scanned pages and mainframe output all in the same document and provides a means of making them all accessible.
HTML typically operates from a web server; OOXML files have many local dependencies. PDF files are self-contained.
HTML has no concept of annotations. OOXML, as implemented in Word, has comprehensive review functionality but relatively limited annotation capabilities. PDF has a broad and deep annotations model.
HTML pages are rarely longer than a few thousand words; longer documents require collections of pages. OOXML is capable of very large (multi-thousand-page) documents. PDF, however, can handle almost any conceivable document, in terms of both file size and page count.
HTML, and OOXML applications such as MS Word and Open Office, allow the user to think of structure (headings) and style (appearance of text) in distinct terms, which is vital for maintaining navigability when used by assistive technology. Untagged PDF has no concept of structure at all. PDF/UA enforces the distinction between structure and style by requiring logical heading levels in conforming documents.
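As a rough illustration of what "logical heading levels" can mean in practice, the sketch below flags headings that jump more than one level deeper than the heading before them (for example, a level 1 heading followed directly by a level 3). This reading of "logical" is an assumption made for the example, and the function name `find_skipped_headings` is hypothetical; the PDF/UA text itself is the authority on what a conforming document requires.

```python
def find_skipped_headings(levels):
    """Return (index, previous_level, level) for each heading that skips a level.

    `levels` is the sequence of heading levels in reading order,
    e.g. [1, 2, 2, 3] for H1, H2, H2, H3.
    """
    problems = []
    previous = 0  # document start counts as level 0, so the first heading must be level 1
    for index, level in enumerate(levels):
        # Going deeper by more than one level skips a heading level.
        # Moving back up any number of levels is fine.
        if level > previous + 1:
            problems.append((index, previous, level))
        previous = level
    return problems


print(find_skipped_headings([1, 2, 3, 2, 3, 1]))  # []
print(find_skipped_headings([1, 3, 2, 4]))        # [(1, 1, 3), (3, 2, 4)]
```

A document whose headings were chosen for their visual size rather than their place in the outline tends to fail this kind of check, which is exactly the style-versus-structure confusion described above.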
Both HTML and tagged PDF support row headers in tables. While OOXML does include this support, both MS Word and Open Office, unaccountably, do not.
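For readers unfamiliar with row headers, the following sketch shows how they are marked up in HTML (a `th` cell with `scope="row"`) and how a program can pick them out, using only Python's standard `html.parser`. The class name `RowHeaderFinder` and the sample opening-hours table are invented for illustration.

```python
from html.parser import HTMLParser


class RowHeaderFinder(HTMLParser):
    """Collects the text of every <th scope="row"> cell."""

    def __init__(self):
        super().__init__()
        self.row_headers = []
        self._in_row_header = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "th" and dict(attrs).get("scope") == "row":
            self._in_row_header = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "th" and self._in_row_header:
            self.row_headers.append("".join(self._buffer).strip())
            self._in_row_header = False

    def handle_data(self, data):
        if self._in_row_header:
            self._buffer.append(data)


# Hypothetical sample table: column headers on top, row headers on the left.
TABLE = """
<table>
  <tr><th scope="col">Day</th><th scope="col">Status</th></tr>
  <tr><th scope="row">Monday</th><td>Open</td></tr>
  <tr><th scope="row">Tuesday</th><td>Closed</td></tr>
</table>
"""

finder = RowHeaderFinder()
finder.feed(TABLE)
finder.close()
print(finder.row_headers)  # ['Monday', 'Tuesday']
```

With row headers identified like this, software reading the cell "Closed" can tell the user that it belongs to "Tuesday"; without them, the user hears only a stream of cell values.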
WCAG 2.0 is oriented towards web content, and thus naturally applies to HTML. WCAG 2.0 is less applicable to Word and PDF, as these are not web formats. PDF files may comply with WCAG 2.0 by using PDF/UA as well as conforming to WCAG 2.0 provisions not covered by PDF/UA.
There are many web browsers and a couple of high-quality options for creating OOXML apart from Microsoft’s Word. While a wide range of PDF viewers is available, none other than Adobe Reader yet fully supports tagged PDF, let alone PDF/UA.
Getting this right is a baseline for accessibility for very practical reasons. Not all AT works with all features of Word or PDF (or CSS, for that matter). To qualify, the format has to be reliably capable of handling semantics (tags, in HTML and PDF vernacular).
HTML and Word obviously play generally well with AT – they are among the most important targets for accessibility technology after the operating systems themselves. The point of this row is simply to highlight that untagged PDF is completely unreliable when read with AT.
VoiceOver does a great job with HTML and Word files. Apple’s Preview, however, is an inferior PDF viewer that does not support PDF tags. Sadly, the Mac version of Adobe Reader does not (yet) work with VoiceOver, so Mac users cannot presently benefit from tagged PDF!