This is a specialized, highly reproducible fork of pdf2htmlEX designed specifically for Nix and NixOS users.
Upstream pdf2htmlEX relies heavily on internal, undocumented Poppler C++ headers (specifically CharCodeToUnicode.h and OutputDev) to achieve pixel-perfect HTML rendering. In late 2024 (Poppler 24.10.0+), the Poppler team permanently deleted the CharCodeToUnicode class, completely breaking pdf2htmlEX on modern Linux distributions.
This Nix Flake solves this by creating a hermetic time capsule. It uses nixpkgs-legacy (24.11) to pin the C++ toolchain to Poppler 24.02.0, and compiles both Poppler and FontForge statically from source to expose the necessary internal ABI symbols without polluting your host system.
If you have Nix with Flakes enabled, you can run this tool from anywhere on your system without installing any dependencies:
nix run github:GeniusTechnoMystic/pdf2htmlEX-nixos#pdf2htmlEX -- /path/to/your/document.pdf
To use it in your system configuration, add it to your flake inputs.
To build the executable locally from source:
git clone https://github.com/GeniusTechnoMystic/pdf2htmlEX-nixos.git
cd pdf2htmlEX-nixos
nix build .#pdf2htmlEX
./result/bin/pdf2htmlEX --version
This Nix Flake is strictly engineered and tested for:
- x86_64-linux: Fully optimized and cached.
- aarch64-linux: Supported via the nixpkgs-legacy (24.11) channel.
Note: Due to the static pinning of Poppler 24.02.0 and specific C++ dependencies, Darwin (macOS) is currently not supported by this flake.
For instructions on syncing with upstream or maintaining the Nix build, see MAINTAINERS.md.
This is my branch of pdf2htmlEX, which aims to enable open collaboration to help keep the project active. A number of changes and improvements have been incorporated from other forks:
- Lots of bug fixes, mostly for edge cases
- Integration of latest Cairo code
- Out of source building
- Rewritten handling of obscured/partially obscured text - now much more accurate
- Some support for transparent text
- Improvement of DPI settings - clamping of DPI to ensure output graphic isn't too big
--correct-text-visibility tracks the visibility of four sample points for each character (currently the four corners of the character's bounding box, inset slightly) to decide whether the character is visible.
It now has two modes: 1 handles fully occluded text (it is not put into the HTML layer), and 2 handles partially occluded text.
The default is now 1, so fully occluded text should no longer show through. In mode 2, a partially occluded character is drawn in the background layer, and the rendered DPI of the page is automatically increased to --covered-text-dpi (default: 300) to reduce the impact of rasterized text.
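As a minimal sketch of how mode 2 might be invoked, the script below assembles the options described above and only runs the converter when it is on the PATH and the input file exists. The file names input.pdf and output.html are placeholders, and the DPI value simply restates the documented default.

```shell
#!/bin/sh
# Visibility mode: 1 = fully occluded text is kept out of the HTML layer,
# 2 = partially occluded characters are drawn in the background layer.
MODE=2
# DPI used for the page background when covered text is rasterized
# (300 is the default stated in this README).
COVERED_DPI=300

# Placeholder file names (assumptions); substitute your own.
INPUT=input.pdf
OUTPUT=output.html

# Keep the options in one variable so they can be inspected before running.
ARGS="--correct-text-visibility $MODE --covered-text-dpi $COVERED_DPI"
echo "pdf2htmlEX $ARGS $INPUT $OUTPUT"

# Run only when the binary and the input file are actually present.
# $ARGS is intentionally unquoted so it splits into separate options.
if command -v pdf2htmlEX >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  pdf2htmlEX $ARGS "$INPUT" "$OUTPUT"
fi
```

Comparing the output of mode 1 and mode 2 on the same document is a quick way to see how much text ends up rasterized in the background.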
For maximum accuracy, I strongly recommend the output options --font-size-multiplier 1 --zoom 25, which circumvent rounding errors inside web browsers. You will then have to scale the resulting HTML page back down using an appropriate CSS "scale" transform.
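The recommendation above can be sketched as follows, assuming the scale-down factor is simply the inverse of the zoom (1/25 = 0.04). The file names and the .pdf-scale-wrapper selector are hypothetical: the selector stands for whatever element you wrap the generated page in, not something pdf2htmlEX emits.

```shell
#!/bin/sh
# Zoom factor passed to pdf2htmlEX (value from the recommendation above).
ZOOM=25
# Inverse factor that brings the enlarged page back to its nominal size
# (assumption: the correct scale is exactly 1 / zoom).
SCALE=$(awk -v z="$ZOOM" 'BEGIN { printf "%.4g", 1 / z }')

# Placeholder file names (assumptions); substitute your own.
INPUT=input.pdf
OUTPUT=output.html

# Convert only when the binary and the input file are present.
if command -v pdf2htmlEX >/dev/null 2>&1 && [ -f "$INPUT" ]; then
  pdf2htmlEX --font-size-multiplier 1 --zoom "$ZOOM" "$INPUT" "$OUTPUT"
fi

# Print a CSS rule for a hypothetical wrapper element around the page.
CSS_RULE=".pdf-scale-wrapper { transform: scale($SCALE); transform-origin: 0 0; }"
echo "$CSS_RULE"
```

Setting transform-origin to the top-left corner keeps the scaled page anchored where the wrapper starts; a transformed element still occupies its unscaled layout box, so the wrapper may also need an explicit size.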
If you are concerned about the file size of the resulting HTML, I recommend patching FontForge to prevent it from writing the current time into the dumped fonts, and then post-processing the pdf2htmlEX output to remove duplicate files; there will usually be many duplicate background images and fonts.
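One possible first step for that post-processing is to find byte-identical files by content hash. The sketch below assumes the fonts and background images were written as separate files into a directory hypothetically named out, and that GNU coreutils is available (uniq -w and --all-repeated are GNU extensions). It only reports duplicates; rewriting the references in the HTML and CSS is left out.

```shell
#!/bin/sh
# list_duplicates DIR
# Print groups of byte-identical files under DIR, one "hash  path" line per
# file, with a blank line between groups. Files without a twin are omitted.
list_duplicates() {
  find "$1" -type f -exec sha256sum {} + \
    | sort \
    | uniq -w 64 --all-repeated=separate   # compare only the 64-char hash
}

# "out" is a hypothetical output directory; adjust to your own layout.
if [ -d out ]; then
  list_duplicates out
fi
```

Without the FontForge patch mentioned above, font files that are otherwise identical will hash differently because of the embedded timestamp, so this approach would mostly catch duplicate background images.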
A beautiful demo is worth a thousand words
- Bible de Genève, 1564 (fonts and typography): HTML / PDF
- Cheat Sheet (math formulas): HTML / PDF
- Scientific Paper (text and figures): HTML / PDF
- Full Circle Magazine (read while downloading): HTML / PDF
- Git Manual (CJK support): HTML / PDF
pdf2htmlEX renders PDF files in HTML, utilizing modern Web technologies. Academic papers with lots of formulas and figures? Magazines with complicated layouts? No problem!
pdf2htmlEX is also an online publishing tool which is flexible for many different use cases.
Learn more about who should use pdf2htmlEX and why.
- Native HTML text with precise font and location.
- Flexible output: all-in-one HTML or on-demand page loading (needs JavaScript).
- Moderate file size, sometimes even smaller than PDF.
- Support for links, outlines (bookmarks), printing, SVG backgrounds, Type 3 fonts, and more.
pdf2htmlEX, as a whole package, is licensed under GPLv3+.
Some resource files are released under more permissive licenses; see LICENSE for details.
pdf2htmlEX is made possible thanks to the following projects:
pdf2htmlEX is inspired by the following projects:
- pdftohtml from poppler
- MuPDF
- PDF.js
- Crocodoc
- Google Doc
- Hongliang Tian
- Wanmin Liu

