Auto-Tagging PDFs – What Works, What Doesn’t


This is good: more and more people, organizations, companies, and agencies are paying attention to the accessibility of their electronic content, including not only their websites but also the PDFs that they make available to people. Naturally, as more people pay attention to these things, more questions will arise. What’s not so good: there has also been an increase in incorrect or incomplete information about PDF accessibility.

We live in a society of “instant gratification” that looks for the easy way to get things done. We want it all, we want it now, and we don’t want to have to work too hard to get it. Unfortunately, that’s not always the way life works. It’s been said many times before, “Anything worth doing, and doing well, is probably going to take hard work.” This is as true today as it has been in the past, and PDF accessibility is one of those things. Achieving 100% accessibility takes some manual labor. There are tools to make the job easier, but it still takes work.

In an attempt to provide the “Easy Button,” some companies are now touting automated tools to tag PDFs and make them accessible. However, making a PDF accessible, conforming to standards, and reducing your legal risk can’t be done 100% through automation.

It Can’t Be All Bad, Right?

To be fair, tools that automatically tag PDFs will do just that. They’ll tag PDFs. What this means is that they’ll put the content in Tags so that assistive technologies like screen readers and refreshable braille displays can present it to people. However, that’s about where they stop. If you’ve got Adobe Acrobat Professional, then you already have tagging software at your disposal at no extra cost. Other companies will try to sell you this service, but why pay more when you already have a really good tool to do the job?

So, sure, they’ll add tags. If the document was constructed well, the tags might be in the correct reading order so that content is read logically. Then again, maybe not. This is one of those “manual labor” things that I mentioned earlier: having to verify, and perhaps correct, the order of the tags. In addition, if “Styles” (in Word, or “Headings” in other authoring software) are used correctly when creating a document, then maybe they will carry over to the PDF tags and end up being correct. Again, this is one of those things that needs to be manually verified.
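To make the heading check a little more concrete, here is a minimal sketch in Python. It assumes the tags have already been exported into a simple in-memory tree; the `Tag` class and the function names are hypothetical stand-ins for illustration, not the API of any real PDF tool. It walks the tags in order and lists headings that skip a level (for example, an H1 followed directly by an H3). It can only point a person at places to look. It can’t tell you whether a heading is the right heading.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tag:
    """Hypothetical, simplified stand-in for one node of a PDF tag tree."""
    kind: str                      # e.g. "Document", "H1", "P", "Figure"
    text: str = ""
    children: List["Tag"] = field(default_factory=list)


def walk(tag):
    """Yield tags in tag-tree order, the order assistive technology follows."""
    yield tag
    for child in tag.children:
        yield from walk(child)


def heading_level(kind):
    """Return 1-6 for H1..H6, or None for any other tag type."""
    if len(kind) == 2 and kind[0] == "H" and kind[1] in "123456":
        return int(kind[1])
    return None


def skipped_heading_levels(root):
    """Return (previous_level, heading_tag) pairs where a level was skipped."""
    problems = []
    previous = 0  # 0 means "no heading seen yet"
    for tag in walk(root):
        level = heading_level(tag.kind)
        if level is None:
            continue
        if level > previous + 1:
            problems.append((previous, tag))
        previous = level
    return problems


# Made-up example document: the H3 comes straight after an H1.
example = Tag("Document", children=[
    Tag("H1", "Plan Overview"),
    Tag("P", "Introductory text"),
    Tag("H3", "Copays"),
    Tag("H2", "Drug List"),
])

for previous, tag in skipped_heading_levels(example):
    print(f"{tag.kind} '{tag.text}' follows level {previous}: check by hand")
```

A check like this is a starting point for the manual review, not a replacement for it. A document can pass it and still have headings that are wrong for the content.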

But Sometimes Automation Falls Very Short of the Mark

Let’s say, for the sake of the conversation, that the tags ended up in the correct order so that content on the page is read in a logical manner. Let’s also say, just for fun, that the headings were tagged correctly. That’s great! However, unless your document is incredibly simple, automated tools are very likely to miss some things. After using automated tagging, here are some of the issues that send clients to us because, clearly, things need to be fixed and they want our software, training, or services (or some combination of them) to help them out:

The Color Checkpoint

Automation won’t catch or fix cases where color, format, location on the page, font, or other visual cues are the only way information is conveyed. Now, granted, some of these issues should never end up in the PDF anyway. They fail accessibility in a number of ways. For example, think about the Drug Formularies that you see during “Open Enrollment.” When insurance companies list the brand name drugs in all caps and the generic drugs in lowercase bold italics, then that’s technically a failure of the color checkpoint. If they provide a key to tell people with sight what the all caps or lowercase bold italic text means, then that’s good but, even then, assistive technology still won’t sufficiently convey that information to people using screen readers. That’s a fail.

Complex Reading Order

If the document has pull quotes, sidebars, text in multiple columns, or other similar “creative” layout features that make it more interesting to look at, this can wreak havoc on the reading order when using automation to tag a PDF. As mentioned before, reading order needs to be manually verified and, when the pages have a more complex layout, the reading order will most likely also need to be fixed. That’s a potential, and probable, fail.

Images and Links

Images and hyperlinks in a document can present some unique issues and this is especially true when automation is used for tagging.  Sometimes automated tools will put images in Figure tags (which might be appropriate for the image if it conveys important information), but they really fail when it comes to figuring out what the Alternative text should be.  Alternative text is what assistive technologies use to tell people what’s important about an image, or chart, graph, etc.  In addition, hyperlinks need to have accurate alternative text, too, and again, automation will often provide something for the Alt text, but is it correct?  Alternative text for images and links needs to be manually verified.

Furthermore, we’ve seen many instances where automation will put purely decorative images in Figure tags when, in fact, these pictures don’t really convey any important information at all.  In situations like these, it’s best to not tag those images so that assistive technology doesn’t read them – decorative images can be skipped over and no one loses any important information.  Double fail.
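As a rough illustration of what a manual review of images starts from, the sketch below lists Figure tags whose alternative text is missing or looks like filler. It assumes a simplified, hypothetical tag tree (the `Tag` class here is not any real library’s API), and the “suspicious” patterns are only examples we made up. It cannot judge whether alt text is accurate or whether an image is decorative. A person still has to decide that.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tag:
    """Hypothetical, simplified stand-in for one node of a PDF tag tree."""
    kind: str                          # e.g. "Document", "P", "Figure"
    alt: Optional[str] = None          # alternative text, if any
    children: List["Tag"] = field(default_factory=list)


# Illustrative patterns only: generic words ("Image 3") or file names.
SUSPICIOUS = re.compile(
    r"^(image|picture|graphic|figure|photo)\s*\d*$|\.(png|jpe?g|gif|tiff?)$",
    re.IGNORECASE,
)


def figures_needing_review(root):
    """Return (figure_tag, reason) pairs, in tag order, for a person to check."""
    flagged = []
    stack = [root]
    while stack:
        tag = stack.pop()
        if tag.kind == "Figure":
            alt = (tag.alt or "").strip()
            if not alt:
                flagged.append((tag, "missing alt text"))
            elif SUSPICIOUS.search(alt):
                flagged.append((tag, "placeholder-looking alt text"))
        # Push children reversed so they are popped in document order.
        stack.extend(reversed(tag.children))
    return flagged


# Made-up example: only the third figure has descriptive alt text.
example = Tag("Document", children=[
    Tag("Figure"),
    Tag("Figure", alt="IMG_0042.jpg"),
    Tag("Figure", alt="Bar chart of enrollment by year"),
    Tag("Figure", alt="Image 3"),
])

for tag, reason in figures_needing_review(example):
    print(f"Figure with alt={tag.alt!r}: {reason}")
```

Note that a figure that passes this check isn’t necessarily fine. Descriptive-looking alt text can still be wrong, and a decorative image with alt text may not belong in a Figure tag at all.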

Data Tables

Many authors love data tables.  They’re great for offering up a lot of related information in a little bit of space.  Some people try to cram as much information as they can into one table and, when these tables get complex, automated tagging really struggles.  In PDF, tagging tables is a pretty complex task that goes beyond the scope of this article (and some in the PDF tagging community are going to get that “scope” pun…) but, suffice it to say, the more complex the table, the worse automation does.

Of course, there’s also a “flip side” to the whole table situation.  That’s when authors use tables in their source documents to help with the layout, but the content doesn’t really belong in a table.  Auto-taggers, however, will “see” the encoding for the table and try to tag that content that way.  The problem with this is that people using assistive technologies will be told that there’s a data table on the page where, in reality, there isn’t one.  Double fail. Again.
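One simple signal that a table deserves a second look is a Table tag with no header cells (TH) anywhere inside it. That can mean a real data table that is missing its headers, or a layout table that shouldn’t be tagged as a table at all. The sketch below assumes a simplified, hypothetical tag tree (the `Tag` class is ours, not a real tool’s API) and ignores nested tables, so treat it as a way to build a review list rather than a verdict.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tag:
    """Hypothetical, simplified stand-in for one node of a PDF tag tree."""
    kind: str                      # e.g. "Table", "TR", "TH", "TD"
    text: str = ""
    children: List["Tag"] = field(default_factory=list)


def walk(tag):
    """Yield a tag and all of its descendants in tag-tree order."""
    yield tag
    for child in tag.children:
        yield from walk(child)


def tables_without_headers(root):
    """Return Table tags that contain no TH cell at any depth."""
    flagged = []
    for tag in walk(root):
        if tag.kind == "Table":
            kinds = [descendant.kind for descendant in walk(tag)]
            if "TH" not in kinds:
                flagged.append(tag)
    return flagged


# Made-up example: the first table has headers, the second is layout only.
data_table = Tag("Table", text="drug tiers", children=[
    Tag("TR", children=[Tag("TH", "Drug"), Tag("TH", "Tier")]),
    Tag("TR", children=[Tag("TD", "Example drug"), Tag("TD", "2")]),
])
layout_table = Tag("Table", text="two-column contact block", children=[
    Tag("TR", children=[Tag("TD", "Phone"), Tag("TD", "Address")]),
])
example = Tag("Document", children=[data_table, layout_table])

for table in tables_without_headers(example):
    print(f"Table '{table.text}' has no TH cells: check by hand")
```

Even when a table does have TH cells, this says nothing about whether the headers are associated with the right data cells, which is where the complex tables tend to go wrong.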

Tables of Contents

We’ve also seen automation really struggle with tagging Tables of Contents (TOC) – especially when there are nested (or sub) sections within a part of the TOC.  When the TOC is linked to the pages in the document, that can be a real nightmare.  In fact, we’ve seen some auto-tagging results where the tool didn’t even try to tag the TOC (or, maybe it tried but it failed miserably)!  The Table of Contents in your document is most likely going to need some serious manual labor help.  That’s a fail.

Conclusion

Automated tagging solutions can be helpful to get the process started, but, in the end, none of them are perfect, some are downright lousy, and you’re most likely going to have to at least manually verify some stuff and probably have to fix a lot, too.

Buyer Beware

Beware of what companies claim their auto-tagger will do, and how accurate and comprehensive they say it is. Ask the questions: “Will your solution get me all the way there?” “Will your tool make my PDF conform to WCAG 2.0 AA or PDF/UA?” “How much does it cost?” (Remember, if you’ve got Adobe Acrobat Pro, you’ve already got your auto-tagging tool!)

Here’s one final thought… Anyone who tells you that their tool will get you 60, 70, 80, or even 90% of the way to accessible is missing one very important thing. If someone in a wheelchair can get 90% of the way up the ramp and into the building, they’re 10% short of the mark and that building’s not accessible. If your PDF is 90% accessible, some people are still going to miss out on important information, your communication breaks down, and you’re taking on unnecessary legal risk.