To guarantee long-term archiving of PDF (Portable Document Format) documents, a family of ISO standards called PDF/A has been established, with several conformance levels. PDF/A is often required by publishers, especially for scientific articles and theses, cf. qucosa. An important premise is the absence of references to external resources, i.e., documents must be self-contained. Additionally, the use of JavaScript and encryption is not allowed. PDF/A-1 does not support transparency; PDF/A-2 allows it. A detailed description of the PDF/A levels can be found in the corresponding Wikipedia article.
When checking documents with confidential contents, the PDF files cannot be sent to an online validation service.
Offline validation is provided by, e.g., 3-Heights™ PDF Validator. You may generate a 30-day evaluation key and add it to the license manager. The library can be used through different interfaces. To access it from Java, VALA.jar has to be on the classpath and the system property java.library.path must point to the directory containing PDFValidatorAPI.dll. If you follow the recommended architecture and the apps provided in the directory Samples, the PDF file to be validated is passed as an argument. After the validation has run, the results are shown, grouped into categories and indicating the problem types and the document page on which each problem occurred. A conveniently formatted output may be achieved by:
System.out.printf("p%d: %s (%dx)\n", err.getPageNo(), err.getMessage(), err.getCount());
According to qucosa, the level to aim for is PDF/A-2a, followed by PDF/A-2b. Since support for PDF/A-2a in the workflow described below is labeled experimental, we focus on PDF/A-2b. The first step is to check whether included files, especially vector graphics, embed the fonts and color profiles they use, where applicable. This is best achieved by providing those files as PDF documents themselves. Strategies for some common validation problems are:
Replace /Interpolate true by /Interpolate false and check each particular case for visual side effects. Since this only treats the symptom, each affected graphics file should eventually be replaced by an unproblematic one; see also "Disable anti-aliasing in ps2pdf".
LinBiolinum with type TrueType and encoding Ansi, a further subset is embedded with type TrueType (CID) and encoding Identity-H when exporting a PDF with Inkscape. For each embedded subset with Identity-H, one of the mentioned warnings appears. The problem can be avoided by not using ligatures, by not using fonts supporting ligatures, or by converting texts to paths. For background information, see also "A very short and simplified introduction to fonts in PDF" and "PDF/A validation with embedded CID font subset caused by ligatures in vector image".
PDF/A compliance of the output of the actual TeX document can be achieved with the package pdfx. Specify the required level as a package option: \usepackage[a-2b]{pdfx}. Adobe Acrobat Reader then shows the blue indicator bar for PDF/A compliance. Unfortunately, this does not imply that all requirements are already fulfilled.
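As a rough orientation, a minimal document using this option might look like the following sketch. The document class, the body text and the choice to load pdfx directly after \documentclass are assumptions made for illustration; only the package name and the option a-2b come from the text above.

```latex
% Minimal sketch (assumed file name: thesis.tex).
\documentclass{article}

% Request PDF/A-2b. Loading pdfx before other packages is a cautious
% choice made for this example, so that later packages do not interfere
% with the settings pdfx makes.
\usepackage[a-2b]{pdfx}

\begin{document}
Hello, PDF/A.
\end{document}
```

A document compiled this way should still be checked with a validator, since claiming the level does not by itself make embedded graphics or fonts compliant.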
To provide the mandatory metadata, create a file with the same name as your TeX document but with the extension .xmpdata, such as thesis.xmpdata. The elements \Author, \Title, \Keywords, \Subject and \CopyrightURL should be sufficient to begin with, cf. "Add metadata in pdf as type pdf/a". At the same time, you should clean up the metadata specified within hypersetup. The package hyperref is a dependency of pdfx and does not have to be included explicitly.
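A metadata file with the elements named above could look roughly like this. All values are placeholders invented for illustration, and separating several keywords with \sep follows the convention of the pdfx documentation as far as we know.

```latex
% thesis.xmpdata: picked up by pdfx when compiling thesis.tex.
% Every value below is a placeholder; replace it with your own data.
\Title{An Example Thesis Title}
\Author{Jane Doe}
\Keywords{PDF/A\sep long-term archiving\sep LaTeX}
\Subject{Short description of the document's topic}
\CopyrightURL{https://example.org/license}
```

Elements that are not needed can simply be omitted from the file.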
A consequential error of fonts that are not embedded (cf. above) is: "The CharSet of the font <font name> must contain the name <character name>." Fix this by fixing the font embedding. Two other problems, which are fixed in pdfx version 1.5.6 but were reported in earlier versions, are: