Detecting empty pages

From time to time, it can be necessary to detect and process empty pages. For example, one might want to delete all blank pages or add a “This Page Intentionally Left Blank” stamp to each of them. However, detecting empty PDF pages can pose special challenges due to the nature of PDF files. Although PDF is often associated with formatted, paginated content, the internal structure does not contain the concept of a “page”, per se. Instead, a PDF is a container for objects that are tied together in various possible ways. When the PDF is accurately rendered, these objects are assembled into a whole that can be represented as the familiar paginated content most of us associate with PDF.

In order to detect empty pages, there are two possible approaches:

1. Analyze every object in a PDF to identify to which page it belongs.
2. Render each page as an image and then analyze those images (in your programming language of choice) to determine whether all pixels are white.

The first approach is potentially very resource-intensive, especially for complex PDF documents, and is likely to seriously hamper performance. On the other hand, it can detect both visible and invisible objects. By contrast, the second approach is far leaner, but will only find visible objects. If a page includes content hidden in layers or page-level JavaScript, that content will not be rendered and the page will be reported as empty. As such, the best solution depends on the types of documents being processed.
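As a rough illustration of the second approach, the pixel check itself might look like the following Python sketch. It assumes the page has already been rendered by whatever tool you use, and that the result is available as a raw buffer of 8-bit RGB samples (three bytes per pixel, no padding). The function name `is_blank_page` and its parameters `white_threshold` and `max_nonwhite_fraction` are hypothetical and chosen only for this example; they do not belong to any particular library.

```python
def is_blank_page(rgb: bytes,
                  white_threshold: int = 250,
                  max_nonwhite_fraction: float = 0.0) -> bool:
    """Return True if a rendered page is (nearly) all white.

    rgb is assumed to hold raw 8-bit RGB samples, three bytes per pixel.
    A pixel counts as white when all three channels are at or above
    white_threshold. The page counts as blank when the share of
    non-white pixels does not exceed max_nonwhite_fraction.
    """
    if len(rgb) == 0 or len(rgb) % 3 != 0:
        raise ValueError("expected a non-empty buffer of RGB triples")

    total_pixels = len(rgb) // 3
    nonwhite = 0
    for i in range(0, len(rgb), 3):
        # The darkest channel decides whether the pixel is white enough.
        if min(rgb[i], rgb[i + 1], rgb[i + 2]) < white_threshold:
            nonwhite += 1

    return nonwhite / total_pixels <= max_nonwhite_fraction


# Usage: a 4 x 4 all-white image, and a copy with one black pixel.
blank = bytes([255, 255, 255]) * (4 * 4)
marked = bytearray(blank)
marked[0:3] = b"\x00\x00\x00"

print(is_blank_page(blank))          # True
print(is_blank_page(bytes(marked)))  # False
```

A threshold slightly below 255 and a small non-zero `max_nonwhite_fraction` can help with scanned pages, where speckles or compression noise mean that a visually blank page is rarely pure white. Suitable values depend on the documents being processed.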

Updated on March 22, 2017

Tagged: tips and tricks