From Document to Book

How to Easily Format Your Text Documents for Printing

Creating PDFs for Printing

Here are some instructions on how to get from text and a markup language to a PDF ready for printing or publishing, using a basic editor program.

I wanted to create my documents in a standard programming editor for a number of reasons. My programming editor starts faster and responds faster to commands than a typical word processor. It's available cross-platform. It works on low-resource computers (with little memory). Documents are stored in a human-readable format.

ASCII text alone wouldn't be enough to store a complicated book or document. However, with a markup language, it can do the job. There are several choices out there: LaTeX, DocBook and DITA, to name just some of them. HTML is not designed specifically as a language for formatting documents for print (although print extensions are being added to CSS). However, there is precedent for a book being produced in HTML using Prince, so it's not unheard of. I wanted to reuse my knowledge of HTML and, therefore, not have to learn a whole new XML or markup language subset to get results. XHTML also has a lot fewer tags to learn than alternatives like DocBook and DITA. I decided to go with XHTML as my document format of choice. Another advantage was that there were Open Source tools that could convert from HTML to PDF. Some of the other formats didn't always have good, easy-to-use Open Source tool chains in place to make the transformation.

The last piece of the puzzle was to convert a document to PDF format. PDF is accepted by most book publishers and is a good format for print documents. I could have used alternatives like Postscript. However, PDF seems to be more widely accepted. It's also easier to get a document into PDF than into many other specialty or proprietary formats for printing.

I looked at several programs to convert documents from HTML to Postscript or PDF. Postscript can easily be converted to PDF via an Open Source program such as Ghostscript.

One option is to use the print function of a browser and print to a file in Postscript format. One drawback is that most browsers don't give you What You See Is What You Get output. The document looks one way on the page and another when printed. The cascading style sheet specifications for output to a printer are not well implemented by most browsers at this time either. Most browsers don't even implement many of the CSS 3 standards. One program that makes use of the technique of printing from a browser to Postscript is wkhtmltopdf. However, when I first tried it, it had many of the same limitations as printing from a specific browser. Later versions show some nice improvements, which make it a good option.

I tested out Prince, which is what one book on HTML used in its own creation. The drawback is that it is not Open Source. Also, there's a watermark in the free trial version. It wasn't completely up-to-date on some of the CSS 3 standards, but it could certainly handle a lot of the CSS specifications.

I also tried several programs written in a variety of interpreted languages to translate from HTML to PDF. While a compiled option would be preferable due to issues such as execution time, there are several choices available written in interpreted languages, including implementations in PHP and Perl. Usually these options are further behind in implementing the latest HTML and CSS standards. Of the ones I tried, my preference was for a specific Perl-based program, html2ps, because I'm more familiar with the Perl language, already had Perl installed and use it for other applications, and the program was easier to work with than some of the other alternatives. It doesn't have full CSS coverage, but it's a start and it gets the job done.

So at this point, I have two program options for converting from XHTML to PDF: wkhtmltopdf and html2ps. There are some drawbacks to html2ps. It is harder to set up. It expects styles to be set up in html2psrc, not in CSS files. It's interpreted and thus can be slower on low-resource machines. The main drawback to the latest version of wkhtmltopdf that I tested was that the output doesn't come out as crisp and sharp as the output from html2ps. I'd recommend either as useful once you get them set up. However, if any better Open Source options come along, I'm always ready to switch to a better tool.

What You Need

If you know HTML or have a good HTML editor (possibly an Open Source option such as Kompozer), the formatting part is relatively easy. However, setting up the software to convert the formatted ASCII text to the final result in PDF can be a challenge. The following information gives a couple of methods for doing so. I'll continue my search for better methods and would be interested in hearing from anyone who's found some.

In order to create PDF files from XHTML or HTML files, you'll need the right software to help you. If you decide to use wkhtmltopdf, you can get by with just that one additional program, although I like to have Ghostscript installed as well. If you decide to use html2ps, you'll need both it and Perl to run the program. Html2ps uses other software to handle the graphics portions, so you'll need ImageMagick or GraphicsMagick or possibly netpbm if you want images. In the end, html2ps can be difficult to set up and install, but it does give sharp, clean output. Either program, html2ps or wkhtmltopdf, can output to Postscript format. For converting between Postscript and PDF, I use Ghostscript. Wkhtmltopdf can also render directly to PDF format, saving a step. Feel free to check the different outputs and see which you think looks better.

Installation

wkhtmltopdf

If you're on Windows, wkhtmltopdf comes with its own installer. Just run it and you're ready to go. You may have to add the program's install location to your path environment variable if the installer doesn't do it for you. Alternatively, you can specify the full path along with the program name when running the program. If you prefer portable apps, copy the files that were installed to your location of choice and work with them that way. If you're on another system, check whether the program is already available in a compiled format. If not, you'll need to compile and build it on your own. Instructions are at the wkhtmltopdf web site.

I noticed wkhtmltopdf on Windows appears to use the background color from your Windows settings. I usually set mine to grey. However, I don't want my documents to have grey backgrounds when I print them. If you don't like the background color of your output, you may need to go into a CSS file and add a style. You can also add the background color directly to the body tag of the HTML file, similar to the following:

<body style="background-color: #FFFFFF">

If you'd like to use the same styles for several documents, you can make use of the cascading functionality of style sheets. Create a generic style sheet with the modifications you need such as background color changes for the body or text-decoration and color settings for anchor tags. Add code within the head tags of each HTML document you want to format to include the generic style sheet after any other style sheets already listed in the head section of the document. Here's the code you'd need in order to add a generic style sheet called style.css.

<link rel="stylesheet" type="text/css" href="style.css" />
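
If you have many documents to update, adding the link by hand gets tedious. The following is a minimal sketch in Python of how the edit could be scripted. It is not part of the setup described here: the function name add_stylesheet_link is hypothetical, and the sketch assumes lowercase XHTML with a single closing head tag.

```python
def add_stylesheet_link(html_text, href="style.css"):
    """Insert a stylesheet link just before </head>, so it comes after
    any style sheets already listed in the head section."""
    link = '<link rel="stylesheet" type="text/css" href="%s" />' % href
    if link in html_text:
        # The link is already present; leave the document unchanged.
        return html_text
    if "</head>" not in html_text:
        raise ValueError("no </head> tag found")
    # Replace only the first closing head tag.
    return html_text.replace("</head>", link + "\n</head>", 1)


if __name__ == "__main__":
    sample = "<html><head><title>Glossary</title></head><body></body></html>"
    print(add_stylesheet_link(sample))
```

Placing the link last in the head section means the generic styles are applied after the document's own style sheets, which is what lets them override earlier settings.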

To force a page break, you can add the following in your HTML page where you want the break to occur:

<div style="page-break-after: always;"><span style="display: none;">&nbsp;</span></div>

If you're using the automated table of contents generation feature, you may want to generate your own XSL template for the table of contents. You can then edit the XSL file to add styles that handle issues such as background color changes, or to further customize the table of contents. The following command dumps the default XSL used to generate the table of contents to a file called toc.xsl.

wkhtmltopdf --dump-default-toc-xsl > toc.xsl

If you plan to use only wkhtmltopdf, you can skip the html2ps installation below and go straight to Ready to Run.

html2ps

Perl Installation

You'll need to install and set up Perl. On Windows, I use ActiveState Perl. If you're going to use ImageMagick or GraphicsMagick, you'll need to install the PerlMagick library so it will work through Perl. You'll want to install your graphics programs before trying to install the Perl libraries.

There are three ways to get the Perl ImageMagick libraries. The first is to download a version of ImageMagick or GraphicsMagick that gives you the option of installing the PerlMagick library during the setup process. On Windows, you'll want to choose a version of ImageMagick or GraphicsMagick that includes a setup program and is a dynamic (DLL) rather than static version. It will most likely have the PerlMagick installer built in. Watch for which version of Perl it says it requires. If the versions of Perl, ImageMagick (or GraphicsMagick) and PerlMagick do not match up well enough, html2ps will be unable to render graphics properly.

If installing with the graphics program doesn't work for you, you can try installing the Perl library yourself after ImageMagick or GraphicsMagick is set up. Use ppm to download the addon libraries (Perl modules) you need. If you're behind a proxy (possibly at work) and trying to get your addons to install, you can set the proxy in a batch file or via the command line before attempting to connect with ppm. For Windows, the command is:

set HTTP_proxy=http://zz.zzz.zz.zz:zzzz

Switch the z values to the appropriate numbers for your situation.

Usually you can have ppm query for a library you want instead of building everything from scratch. However, I was unable to locate the ImageMagick library by using search in ppm. If you know an archive with the Perl module you need, you can connect to it directly and install. For instance, I used the following command to download the PerlMagick library from an alternative site:

ppm install http://www.bribes.org/perl/ppm/Image-Magick.ppd 

I already had ImageMagick installed and in my path. The command was all I needed to retrieve and install the Perl library for ImageMagick.

If you still can't get the addons you want, you can download them directly. You should be able to find a tarball with source at CPAN or an alternative site. You'll need a make program and a compiler to install properly. Most POSIX machines will hopefully come with this preinstalled. On FreeBSD, you may need to add the GNU make and build tools. If you do so, use gmake instead of make. To install by building the source from CPAN, type the following at the command line in the directory where you've unarchived the Perl library source you're working with:

perl Makefile.PL
make
make install

This technique works well with many Perl libraries and may work for you, but I had trouble using it with the Perl ImageMagick library on my system. The build expected ImageMagick itself to have been compiled from source so that files such as the headers would be available, and I didn't have those files installed. ImageMagick also supplies the source for PerlMagick as part of its own source code, and you can follow the instructions that come with the ImageMagick source to build PerlMagick. On a POSIX compatible system, you'll need to set a flag when you run configure (--with-perl) to tell it you want PerlMagick, not just ImageMagick. However, it can be tricky to build on Windows systems. ImageMagick uses Visual C++ on Windows as its default compiler instead of an Open Source compiler choice. Personally, I prefer to stick with Open Source compilers to build my C/C++ code. However, if you really need to, you can always try Microsoft Visual Studio Express. I'd love to see a port of ImageMagick to a compiler such as MinGW (and a build environment such as msys) on Windows.

html2ps Customization

I tried running the setup script for html2ps and it didn't work properly. If the same happens to you, you can do the setup manually and tell the html2ps program where to find the other programs it needs on your system. Whether you use the setup script or set up html2ps manually, make sure the other programs you're using, such as ImageMagick and Ghostscript, are available in your path.

Search in the file html2ps for @html2ps and make sure any packages you're using are set to 1. If you only want to use netpbm, set PerlMagick and ImageMagick to 0. In my case, I modified that portion of the file to look like this:

@html2ps {
  package {
    PerlMagick: 1;
    ImageMagick: 1;
    pbmplus: 0;
    netpbm: 1;
    djpeg: 0;
    Ghostscript: 1;
    TeX: 0;
    dvips: 0;
    libwww-perl: 0;
    geturl: "";
    check: "";
    path: "";
  }

If you don't have an html2psrc file, create one. You can modify the settings to do many of the things a cascading style sheet for printing would automatically do. First, make sure it knows which programs you want to use with it. If you only want to use netpbm, set PerlMagick and ImageMagick to 0. My html2psrc file includes this:

@html2ps {
  package {
    geturl: "/bin/true";
    PerlMagick: 1;
    ImageMagick: 1;
    netpbm: 1;
    Ghostscript: 1;
  }
  

Next, set items like paper type you want to use and whether you want duplex or not. You can also set whether you want links to show as underlined or not. These are my current file settings in html2psrc:

  paper {
    type: letter;
  }

  option {
    underline: 1;
    duplex: 1;
  }

You can format the header and footer in a number of ways. You can add page numbering or display the document name. There's even an option to create a table of contents automatically. Here's how I currently have my page headers and footers set in html2psrc:

  header {
    font-weight: bold;
    font-size: 24pt;
    center: $T;
  }
  footer {
    center: $N;
  }
}

Finally, you can set the style of various html tags. Not every tag is supported, but there are several you can customize using standard cascading style sheet syntax. Here is how I have the body of the document set up in my html2psrc:

BODY {
       font-family: Helvetica;
       font-size: 12pt;
       text-align: left;
       background: white;
     }  

In some cases, html2ps couldn't find my html2psrc file. If necessary, you can edit the following line in html2ps and specify exactly where the file is (giving full path if needed):

$globrc='html2psrc';

If you'd like to use netpbm instead of ImageMagick or GraphicsMagick, you can install it and skip the PerlMagick libraries entirely (which is useful if you can't get the different versions of these programs and libraries working together nicely). Make sure netpbm is set to 1 in the appropriate places mentioned above and that ImageMagick and PerlMagick are set to 0. Otherwise, the Perl script appears to try to run ImageMagick in preference to netpbm. Also, make sure netpbm is in your path.

On Windows, netpbm can be downloaded from the gnuwin32 project at Sourceforge. The Windows port includes a bash shell script, anytopnm, that will only work if you have bash installed on your system. You can get bash with msys (from MinGW) or with the djgpp compiler suite. There's also a stand-alone version based on Cygwin available.

Since I didn't have bash set up on the Windows machine I was running this on (it wasn't my home system), I coded a quick work-around that avoids the anytopnm script. You can edit html2ps and use this work-around instead. In html2ps, where you see the following:

    } elsif($package{'pbmplus'} || $package{'netpbm'}) {
      if($pic=~/^GIF/) {
        &run("$giftopm $scr");
      } else {
        &run("anytopnm $scr");
      }

change it to:

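    } elsif($package{'pbmplus'} || $package{'netpbm'}) {
      if($pic=~/^GIF/) {
        &run("$giftopm $scr");
      } elsif($URL=~/jpe?g$/i) {
        # JPEG (jpg or jpeg): call the converter directly, bypassing anytopnm
        &run("jpegtopnm $scr");
      } elsif($URL=~/png$/i) {
        # PNG: call the converter directly, bypassing anytopnm
        &run("pngtopnm $scr");
      } else {
        # Any other format still needs the anytopnm shell script
        &run("anytopnm $scr");
      }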

Keep in mind, this will only work for netpbm, not for pbmplus.

Notes

You cannot make full use of cascading style sheets or JavaScript, because html2ps won't recognize most of the commands. You can, however, customize some of the styles in the html2psrc file using a syntax similar to cascading style sheets. You can also add page numbering and special formats for the header and footer using that file. You should even be able to add a table of contents.

Ready to Run

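To create a document with html2ps, run the program on that document. Html2ps writes its Postscript output to standard output by default, so redirect it to a file. For instance, with my glossary.htm file, I can type the following at the command line:

perl html2ps glossary.htm > glossary.ps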

You need to be in the directory where the HTML file is, and the directories for html2ps and for all the supporting programs it uses need to be in your path.
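
Before a long conversion run, it can help to confirm that the supporting programs really are on your path. Here is a minimal sketch in Python, offered only as an illustration. The function name missing_tools is hypothetical, and the executable names in the example list (perl, gs, convert) are assumptions that vary by platform and by which packages you enabled.

```python
import shutil


def missing_tools(names):
    """Return the tool names that cannot be found on the current path."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Example list only; adjust it to the packages enabled in html2ps
    # and to the executable names used on your system.
    tools = ["perl", "gs", "convert"]
    missing = missing_tools(tools)
    if missing:
        print("Not found on path: " + ", ".join(missing))
    else:
        print("All tools found.")
```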

To create a document with wkhtmltopdf, make sure it's in your path or give the full path when calling the program. With my same glossary.htm file, I can type the following at the command line to create a Postscript file:

wkhtmltopdf glossary.htm glossary.ps

To go straight to PDF, type:

wkhtmltopdf glossary.htm glossary.pdf

There are several other options you can use with later versions of wkhtmltopdf. To see some of them, type:

wkhtmltopdf -H

Some of the options I tried were:

-g --disable-smart-shrinking --no-background --header-center [title] --header-font-size 24 --footer-center [page]

These flags print the output in greyscale, disable shrinking of the output, turn off printing of the background, add a centered title in 24 point font size to the header and add a footer with the current page number centered.

To automatically generate a table of contents along with a document, try a command similar to the following:

wkhtmltopdf --enable-toc-back-links --footer-center [page] toc --xsl-style-sheet toc.xsl book.html book.pdf

The table of contents is automatically generated from the heading tags (h1, h2 and so on) in the document when you specify toc. To use a custom XSL file for table of contents generation as mentioned above, add the --xsl-style-sheet option after toc and specify the file you've generated and customized. The toc keyword and its options should go after as many other settings flags as possible and before the input and output file names. The --enable-toc-back-links flag automatically generates links from the heading tags back to the table of contents. This flag needs to be near the beginning of the settings flags list.
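
Because the order of the arguments matters, one way to keep it straight is to assemble the command in a small script. The following Python sketch is an illustration under stated assumptions, not a definitive wrapper: the function name build_wkhtmltopdf_command and its parameters are hypothetical, and it uses only the flags already shown in this article.

```python
def build_wkhtmltopdf_command(source, output, options=(), toc=False, toc_xsl=None):
    """Build the argument list in the order described above: settings
    flags first, then toc and its options, then input and output."""
    command = ["wkhtmltopdf"] + list(options)
    if toc or toc_xsl:
        command.append("toc")
        if toc_xsl:
            # A custom XSL file applies to the toc, so it follows the keyword.
            command += ["--xsl-style-sheet", toc_xsl]
    command += [source, output]
    return command


if __name__ == "__main__":
    command = build_wkhtmltopdf_command(
        "book.html",
        "book.pdf",
        options=["--enable-toc-back-links", "--footer-center", "[page]"],
        toc_xsl="toc.xsl",
    )
    print(" ".join(command))
    # To run the conversion (requires wkhtmltopdf on your path):
    # import subprocess
    # subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string means values such as [page] reach the program without being interpreted by a shell.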

If you want to view a Postscript file, you can use a Postscript viewer or use Ghostscript directly (with later versions of Ghostscript). Some useful Postscript viewers on POSIX compatible machines are gv and mgv. To use Ghostscript directly on Windows, the command line is:

GSWin32c -dSAFER -dBATCH %1

If you're not running from a batch file, be sure to replace the %1 with the actual file name you want to view. You should be able to do the equivalent on POSIX machines by invoking the right Ghostscript program for your machine (usually gs).

To convert the Postscript file to PDF on your machine, all you need is Ghostscript. On Windows, type the following at the command line:

ps2pdf.bat %1 %2

The %1 and %2 should be replaced by the input Postscript filename and the name of the output PDF file you want created. If you're on a POSIX machine, there's a shell script that does the same work. Leave off the .bat extension in the command.
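
If you script your conversions and move between Windows and POSIX machines, the choice between the two program names can be made automatically. This is a minimal Python sketch, assuming Ghostscript's ps2pdf wrapper is on your path. The function name ps2pdf_command is hypothetical.

```python
import sys


def ps2pdf_command(ps_file, pdf_file, platform=sys.platform):
    """Return the ps2pdf command line for the given platform."""
    # On Windows the wrapper is a batch file; elsewhere it is a shell script.
    program = "ps2pdf.bat" if platform.startswith("win") else "ps2pdf"
    return [program, ps_file, pdf_file]


if __name__ == "__main__":
    print(" ".join(ps2pdf_command("glossary.ps", "glossary.pdf")))
```

The resulting list could be handed to subprocess.run. On Windows, running a batch file that way may require shell=True, so treat that step as something to verify on your own machine.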

Once the file's in PDF, you can view it with a PDF viewer or with Ghostscript, using the same commands you used for viewing Postscript. Good PDF viewers include MuPDF and Sumatra on Windows and gv and epdfview on POSIX machines. Foxit is a freeware Windows alternative if you want to view PDFs embedded in a web page as well as standalone. MuPDF and Sumatra also have plug-ins for embedding that work with certain versions of Firefox. Evince is an interesting cross-platform option.

Tricks

If your conversion tool can handle CSS print options, you can set up one version of CSS for the web and another for printing. Wkhtmltopdf offers some support for that. Html2ps ignores the CSS file settings and lets you specify the styles for printing in html2psrc instead. Another way to customize an HTML page (so that you have one version for the web and one specifically for your PDF file and printing) is to make use of templates. You can use one template for the look and feel of your web page and another template to simplify the page and hide unnecessary features for printing. I've also written a short program, dwtmerge, for working with templates. We use Dreamweaver at work, but I'd rather have Open Source options, so it's a helpful alternative when you're stuck with Dreamweaver compatible templates but don't want to work with Dreamweaver.
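
The two-template idea can be sketched in a few lines of Python. This is only an illustration using the standard string.Template class. It is not how dwtmerge works, and the template contents, the file name web.css and the function name render are all hypothetical.

```python
from string import Template

# Web version: site style sheet plus a navigation bar.
WEB_TEMPLATE = Template(
    "<html><head><title>$title</title>"
    '<link rel="stylesheet" type="text/css" href="web.css" /></head>'
    '<body><div id="nav">Home | Articles</div>$content</body></html>'
)

# Print version: generic style sheet only, no navigation.
PRINT_TEMPLATE = Template(
    "<html><head><title>$title</title>"
    '<link rel="stylesheet" type="text/css" href="style.css" /></head>'
    "<body>$content</body></html>"
)


def render(template, title, content):
    """Fill one template with the shared title and body content."""
    return template.substitute(title=title, content=content)


if __name__ == "__main__":
    body = "<h1>Glossary</h1><p>Terms go here.</p>"
    print(render(WEB_TEMPLATE, "Glossary", body))
    print(render(PRINT_TEMPLATE, "Glossary", body))
```

The same content feeds both templates, so only the print version needs to be passed to the PDF conversion tool.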


The information on these pages is copyrighted by the author with all rights reserved. Reproduction of anything without the author's permission is in violation of copyright laws.
All original material is copyrighted:
(c) Copyright 2010 by Laura Michaels
All Rights Reserved
Last Update: 20111103