DocRaptor HTML TO PDF API

How To Evaluate HTML to PDF Converters

You’re the lucky developer who got assigned the PDF report project? You probably quickly discovered it's more complicated than building normal web pages. So which HTML to PDF converter should you use? How do you compare them?

Some elements of creating PDFs are simple, while others are complex, tricky, slow, or even impossible in some HTML to PDF conversion libraries. But no worries, this article will show which document elements you should test first.

You have many options, from open-source HTML to PDF libraries to online HTML to PDF APIs to commercial libraries overseen by the inventor of CSS. But they all vary in their ability to create advanced PDFs and in their conversion speed.

This article will guide your HTML-to-PDF library comparison process, saving you days of trial and error. Changing converters later can be painful, as libraries often use different CSS and HTML to accomplish the same result. It’s better to make the right choice the first time!

Differing Paradigms

The core problem is that the original HTML and CSS standards and web browsers focus on building a long, continuously scrolling web page—not making documents or books with many smaller pages. You’ll see this core difference between browser engines and PDF requirements pop up again and again throughout this article.

There are now some excellent proposed and draft standards for creating paged media with CSS (such as PDFs or presentations). Still, there’s not a lot of momentum within the browser development teams for implementing these standards.

Additionally, many PDF creation libraries are focused on creating printed materials and books, where conversion speed matters less than it does for web-based reports.

This has led to a wide variance in libraries’ ability to create advanced PDFs. Some open-source browser-based libraries have excellent JavaScript and CSS support but horrible PDF functionality. Commercial PDF converters have powerful PDF features but sometimes lag in JavaScript support. Non-browser open-source libraries are stuck in the middle and are often the least performant choice.

Known Problem Areas

Some elements of PDF generation are simple and pose no performance issues, but others are complex and can be slow or unreliable. Let's take a closer look at these tricky elements so you can be better prepared to address possible problems.

Page Break Handling

Most libraries allow for CSS-defined page breaks, but their ability to define page breaks automatically differs. Document elements such as tables, images, floating elements, and footnotes complicate where the page should break and how much whitespace should be left on the page. Some libraries will struggle with documents containing multiple pages and anything other than plain text.
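
One way to test automatic page breaking is to feed every candidate converter the same long, mixed-content document. The following is a minimal sketch in Python that builds such a fixture. The function name, class names, and sizes are illustrative assumptions, and the CSS uses the standard break properties alongside their older page-break-* equivalents, which some engines may still require.

```python
# Sketch of a page-break stress fixture. All names and sizes are illustrative.
PAGE_BREAK_CSS = """
.chapter + .chapter { break-before: page; page-break-before: always; }
h2 { break-after: avoid; page-break-after: avoid; }
figure, tr { break-inside: avoid; page-break-inside: avoid; }
p { orphans: 3; widows: 3; }
"""


def build_page_break_fixture(chapters: int = 3, paragraphs: int = 40) -> str:
    """Return an HTML document long enough to force automatic page breaks."""
    filler = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 8
    # A tall box that cannot fit in the space left at the bottom of most pages.
    figure = (
        '<figure><div style="height: 120mm; border: 1px solid"></div>'
        "<figcaption>Tall figure</figcaption></figure>"
    )
    sections = []
    for number in range(1, chapters + 1):
        body = "".join(f"<p>{filler}</p>" for _ in range(paragraphs))
        sections.append(
            f'<section class="chapter"><h2>Chapter {number}</h2>{body}{figure}</section>'
        )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{PAGE_BREAK_CSS}</style></head><body>"
        + "".join(sections)
        + "</body></html>"
    )
```

Write the returned string to a file, convert it with each library, and check whether each chapter starts on a new page, whether headings stay with their text, and whether the tall figure is split across pages.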

Differing Page Styles

You’re probably familiar with books and reports whose title page is styled differently from the content pages, or whose left pages have different margins than the right pages. Some PDFs use landscape layouts for tables and portrait layouts for text.

The CSS Paged Media specifications let you define different styles, layouts, and sizes through named pages. Unfortunately, browser engines have neglected to implement these specifications, so you cannot use them in many open-source HTML-to-PDF converters.
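
To check whether a converter honors named pages, a small fixture with a cover page, left and right margins, and one landscape section is usually enough. The sketch below is in Python and assumes A4 paper. The class names, page names, and helper functions are hypothetical, and the helper only catches typos in the fixture itself, not converter behavior.

```python
import re

# Named-page rules from CSS Paged Media. Page and class names are illustrative.
NAMED_PAGE_CSS = """
@page { size: A4 portrait; margin: 25mm 20mm; }
@page :left { margin-right: 30mm; }   /* wider inside (binding) margin */
@page :right { margin-left: 30mm; }
@page cover { margin: 0; }
@page wide { size: A4 landscape; }
.cover { page: cover; }
.wide-table { page: wide; }
"""


def undeclared_page_names(css: str) -> set:
    """Return page names used via `page:` that have no matching `@page name` rule."""
    declared = set(re.findall(r"@page\s+([A-Za-z][\w-]*)", css))
    # The lookbehind skips `@page :left` and properties such as `break-before`.
    used = set(re.findall(r"(?<![\w@-])page\s*:\s*([A-Za-z][\w-]*)", css))
    return used - declared


def build_named_page_fixture() -> str:
    """Return a three-section document: cover, portrait text, landscape table."""
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{NAMED_PAGE_CSS}</style></head><body>"
        '<section class="cover"><h1>Annual Report</h1></section>'
        "<section><p>Portrait body text.</p></section>"
        '<section class="wide-table"><table><tr><td>Wide data</td></tr></table></section>'
        "</body></html>"
    )
```

A converter with named-page support should produce a cover with no margins, portrait text pages, and a landscape page for the table. A converter without it will typically render everything with one page style.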

External Assets

Loading images, charts, JavaScript, CSS, etc. from external sources can dramatically increase your document generation times. Most HTML to PDF libraries can handle lots of external assets, but the time it takes can put additional strain on your infrastructure.
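
Before benchmarking, it helps to know how many network requests a document will trigger. The following Python sketch lists the external assets referenced by an HTML string using only the standard library. The class and function names are illustrative, and it deliberately ignores relative paths, data: URIs, and assets referenced from inside CSS.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalAssetCollector(HTMLParser):
    """Collect URLs of assets that a converter would fetch over the network."""

    # tag -> attribute that holds the asset URL
    ASSET_ATTRS = {"img": "src", "script": "src", "iframe": "src", "link": "href"}

    def __init__(self) -> None:
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.ASSET_ATTRS.get(tag)
        if attr_name is None:
            return
        attr_map = dict(attrs)
        # Only <link rel="stylesheet"> is treated as an asset in this sketch.
        if tag == "link" and "stylesheet" not in (attr_map.get("rel") or "").lower():
            return
        url = attr_map.get(attr_name)
        if not url:
            return
        parsed = urlparse(url)
        # http(s) and protocol-relative URLs need a network round trip.
        if parsed.scheme in ("http", "https") or (not parsed.scheme and parsed.netloc):
            self.assets.append(url)


def list_external_assets(html: str) -> list:
    """Return external asset URLs in document order."""
    collector = ExternalAssetCollector()
    collector.feed(html)
    collector.close()
    return collector.assets
```

If the list is long, consider testing each converter twice: once with the assets as they are, and once with them inlined or served locally, so you can separate rendering time from download time.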

Accessibility

Accessible PDFs, or "tagged" PDFs, contain hidden tags that describe the document to screen readers and other assistive devices. They're similar to HTML's alt attributes. Only some PDF generators support tagging your content, and they all offer differing levels of support for automatic tagging.

Tables

A small, simple table is supported by almost every converter. Problems can occur when your table spans multiple pages or columns, has captions, is floated, contains images or charts, uses row or column spans, or is incredibly long. Many engines—particularly those that are open-source and non-browser-based—are unable to handle these complex calculations or consume too much memory or time.
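
A generated table is an easy way to probe these limits. This Python sketch builds a long table with a caption, a header row, and row spans that will straddle page boundaries. The function name, column names, and default sizes are assumptions chosen for illustration.

```python
# Styles that commonly matter for multi-page tables. Support varies by engine.
TABLE_CSS = """
table { border-collapse: collapse; width: 100%; }
caption { caption-side: top; font-weight: bold; }
thead { display: table-header-group; }
tr { break-inside: avoid; page-break-inside: avoid; }
td, th { border: 1px solid #999; padding: 2mm; }
"""


def build_table_fixture(rows: int = 500, group_size: int = 5) -> str:
    """Return an HTML document with one long table that uses row spans."""
    body_rows = []
    for index in range(rows):
        cells = []
        if index % group_size == 0:
            # The first row of each group carries a cell spanning the whole group.
            span = min(group_size, rows - index)
            group_number = index // group_size + 1
            cells.append(f'<td rowspan="{span}">Group {group_number}</td>')
        cells.append(f"<td>Item {index + 1}</td><td>{(index + 1) * 1.5:.2f}</td>")
        body_rows.append("<tr>" + "".join(cells) + "</tr>")
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{TABLE_CSS}</style></head><body><table>"
        "<caption>Stress-test table</caption>"
        "<thead><tr><th>Group</th><th>Item</th><th>Amount</th></tr></thead>"
        "<tbody>" + "".join(body_rows) + "</tbody></table></body></html>"
    )
```

Increase rows until a converter slows down noticeably or fails, and check on each page whether the header row repeats and whether spanned cells break cleanly.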

DocRaptor has advanced table support, including defining row and column spans through CSS and selecting when and where table captions appear.

Columns

Though CSS columns have been well supported by browsers for years, some libraries still struggle with them. They can be especially problematic when the columns contain figures and tables or cross multiple pages.

Floats/Page Floats

Floating content left or right is commonplace in web development, but some libraries struggle with correctly selecting the page break location when a float is involved.

Often, you may want to float an object to the top or bottom of a page or even to the inside or outside of a page spread. This is only possible in the most advanced rendering engines such as DocRaptor, Prince, or PDFreactor.

Images

Almost all libraries offer basic support for raster images and most support vector (SVG) graphics. Some libraries will convert vector images to raster images, which destroys their ability to smoothly scale up or down. Other libraries may offer more advanced image quality options, letting you fine-tune the output size of your document.

Footnotes

Despite being exceedingly common in publishing, footnotes are one of the largest areas of differentiation between HTML to PDF libraries. Most conversion engines struggle to support even simple page-level footnotes, as this concept is completely unfamiliar to web browsers and standard CSS.

CSS-based footnote support was defined in the CSS Generated Content for Paged Media Module specification draft, but support is limited to more advanced HTML-to-PDF converters. These specifications allow you to use CSS to define how and where footnotes should be displayed.
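
As a rough illustration of that draft's footnote syntax, the Python sketch below wraps note text in a span and styles it with float: footnote. The class name, helper names, and styling values are assumptions, and exact support for the footnote pseudo-elements differs between converters, so check each engine's documentation.

```python
from html import escape

# Footnote rules following the CSS Generated Content for Paged Media draft.
FOOTNOTE_CSS = """
.fn { float: footnote; }
.fn::footnote-call { content: counter(footnote); font-size: 0.75em; vertical-align: super; }
.fn::footnote-marker { content: counter(footnote) ". "; }
@page { @footnote { border-top: 0.5pt solid #000; padding-top: 2mm; } }
"""


def with_footnote(text: str, note: str) -> str:
    """Return escaped text followed by an inline element holding the footnote."""
    return f'{escape(text)}<span class="fn">{escape(note)}</span>'


def build_footnote_fixture(paragraphs: int = 30) -> str:
    """Return a document where every paragraph carries one footnote."""
    body = "".join(
        "<p>" + with_footnote(f"Paragraph {n} of body text.", f"Note for paragraph {n}.") + "</p>"
        for n in range(1, paragraphs + 1)
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{FOOTNOTE_CSS}</style></head><body>{body}</body></html>"
    )
```

In a converter that supports footnotes, each note should move to the bottom of the page that contains its reference. In one that does not, the note text will usually appear inline in the paragraph.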

JavaScript

All browser-based engines and most commercial engines provide basic JavaScript support. Most non-browser-based open-source libraries do not.

Commercial engines provide the most PDF-specific JavaScript functionality. As an example, the DocRaptor and Prince engines allow you to access the page count through JavaScript and even directly access individual PDF box elements (which are different from the HTML elements). You can run JavaScript functions after document rendering is complete, and trigger a re-render if necessary.

This advanced level of document-level JavaScript support lets you easily create dynamic tables of contents with page numbers and other complex content manipulations.

Generated Content: Counters and Cross-References

Most libraries support basic page and chapter number counters. Some offer limited styling and placement options, while others allow you to insert the page counters wherever you want through the CSS content property.

Some libraries also support more advanced versions of generated content, such as cross-references. These let you create links such as "See Page 38 for More Details", where the "38" is automatically generated.
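
The page counters and cross-references described above can be sketched with a few CSS rules. The Python snippet below is a minimal example, assuming a converter that implements page-margin boxes and target-counter(). The class name and helper function are hypothetical, and the exact attr() syntax inside target-counter() varies between engines.

```python
from html import escape

# Page numbering, chapter numbering, and a page-number cross-reference.
COUNTER_CSS = """
@page { @bottom-right { content: "Page " counter(page) " of " counter(pages); } }
body { counter-reset: chapter; }
h2 { counter-increment: chapter; }
h2::before { content: "Chapter " counter(chapter) ": "; }
a.xref::after { content: " (page " target-counter(attr(href), page) ")"; }
"""


def xref(target_id: str, label: str) -> str:
    """Return a link whose page number is filled in by the converter's CSS."""
    return f'<a class="xref" href="#{escape(target_id, quote=True)}">{escape(label)}</a>'
```

For example, xref("details", "More Details") produces a link to the element with id "details". A converter that supports target-counter() appends the page that element lands on, and other converters render the link without a page number.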

Color Management / Printing

When your document is intended to be printed, it's critical to match the colors to your specifications in your printer's desired format. Some libraries support only RGB documents, but most offer at least CMYK support. More advanced generators allow you to define a specific ICC profile for your document.

Some advanced libraries also allow you to define printer's marks through CSS, which is helpful for documents intended to be trimmed by the printer.

Watermarks

Many watermarks are intended to be on every page, and this can be difficult to accomplish with browser-based engines. A workaround can normally be found, but watermarks will be easier in any commercial HTML-to-PDF generator or most non-browser-based open-source libraries.

Complexity Often Requires a Powerful Converter

We hope this guide has illuminated the areas of HTML to PDF conversion complexity. If none of these elements apply to your document, you can easily use any available library or API. But if your document contains these elements, we recommend that you be more selective and focus your initial implementation testing on these elements.
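
To focus that initial testing, a small harness can run every candidate converter against every fixture and record timing, output size, and failures. The sketch below is in Python. Each converter is assumed to be wrapped in a function that takes an HTML string and returns PDF bytes, and fake_converter is a stand-in rather than a real library call.

```python
import time
from statistics import median
from typing import Callable, Dict

# A converter is any callable that turns an HTML string into PDF bytes.
Converter = Callable[[str], bytes]


def time_converter(convert: Converter, html: str, runs: int = 3) -> dict:
    """Run one converter several times and report median time and output size."""
    durations = []
    pdf_size = 0
    for _ in range(runs):
        start = time.perf_counter()
        pdf = convert(html)
        durations.append(time.perf_counter() - start)
        pdf_size = len(pdf)
    return {"median_seconds": median(durations), "pdf_bytes": pdf_size}


def compare(converters: Dict[str, Converter], fixtures: Dict[str, str], runs: int = 3) -> dict:
    """Return results keyed by (converter name, fixture name)."""
    results = {}
    for converter_name, convert in converters.items():
        for fixture_name, html in fixtures.items():
            try:
                results[(converter_name, fixture_name)] = time_converter(convert, html, runs)
            except Exception as exc:  # A crash is a result worth recording too.
                results[(converter_name, fixture_name)] = {"error": repr(exc)}
    return results


def fake_converter(html: str) -> bytes:
    """Stand-in used only to demonstrate the harness. Replace with real wrappers."""
    return b"%PDF-1.4\n" + html.encode("utf-8")
```

Timing tells you only part of the story, so open each generated PDF and inspect the page breaks, tables, and footnotes by eye as well.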

You may find that you need a more powerful HTML-to-PDF generator. That's the situation we found ourselves in over a decade ago and why we built the DocRaptor HTML-to-PDF API. It's helped thousands of developers and could help you too!

Ready to get started? Try DocRaptor for free with unlimited test documents.