Python script to use HTMLDOC with UTF-8 files

As you may know, HTMLDOC is a good tool to complement txt2tags, especially for breaking an HTML file into multiple pages.

But the current version of HTMLDOC (1.8.x) has no Unicode support.

When you try to use it to convert or split a UTF-8 file, all the special (non-ASCII) characters come out wrong in the resulting HTML.

Unicode support is planned for version 1.9, which is still in beta.

If you can’t wait for the stable 1.9 release, or are stuck on an old version and just want a quick fix for your messed-up files, try my Python script:
fix-htmldoc-utf8.py

It restores the original UTF-8 characters that HTMLDOC has messed up.
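To give an idea of how such a repair can work, here is a minimal sketch. It is not the code of fix-htmldoc-utf8.py. It assumes the damage takes one particular form: HTMLDOC read each byte of a UTF-8 character as a separate ISO-8859-1 character and wrote it out as an HTML entity (so "é" became "&Atilde;&copy;"). The function name restore_utf8 is made up for this illustration.

```python
import re
from html.entities import name2codepoint

# One named or numeric entity, e.g. &Atilde; or &#195; or &#xC3;
ENTITY = re.compile(r"&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")
# A run of characters in the ISO-8859-1 high range
HIGH_RUN = re.compile("[\x80-\xff]+")


def entity_to_latin1(match):
    """Turn an entity in the range 0x80-0xFF into its character.

    Anything else (&amp;, &lt;, unknown names...) is returned unchanged.
    """
    ref = match.group(1)
    if ref[:2].lower() == "#x":
        code = int(ref[2:], 16)
    elif ref.startswith("#"):
        code = int(ref[1:])
    else:
        code = name2codepoint.get(ref)
    if code is not None and 0x80 <= code <= 0xFF:
        return chr(code)
    return match.group(0)


def redecode_run(match):
    """Reinterpret a run of high characters as UTF-8 bytes."""
    run = match.group(0)
    try:
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: keep the run as literal characters.
        return run


def restore_utf8(text):
    """Hypothetical helper: undo the assumed byte-per-entity damage."""
    return HIGH_RUN.sub(redecode_run, ENTITY.sub(entity_to_latin1, text))
```

For example, restore_utf8("caf&Atilde;&copy;") gives "café". Note the limits of this approach: a genuine ISO-8859-1 entity that is not part of a valid UTF-8 sequence (a lone &nbsp;, say) ends up as a literal character instead of an entity, and genuine entities that happen to form a valid UTF-8 sequence would be converted by mistake.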

You can use the script as a filter (it reads STDIN and writes the results to STDOUT):

cat myfile.html | fix-htmldoc-utf8 > myfile-ok.html

You can pass the file name as an argument and send the results to STDOUT:

fix-htmldoc-utf8 myfile.html > myfile-ok.html

Or you can use the -w option to fix the file in place:

fix-htmldoc-utf8 -w myfile.html
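For the curious, the three modes above could be wired up roughly like this. This is only a sketch of the command-line handling, not the actual script: the function name main, its fix parameter (any function that takes the broken text and returns the repaired text) and the choice to read input as ISO-8859-1 are assumptions made for the illustration.

```python
import argparse
import sys


def main(argv, fix, stdin=None, stdout=None):
    """Filter, file-to-STDOUT and in-place (-w) modes in one entry point.

    `fix` is a hypothetical callable: broken text in, repaired text out.
    """
    stdin = stdin or sys.stdin
    stdout = stdout or sys.stdout

    parser = argparse.ArgumentParser(prog="fix-htmldoc-utf8")
    parser.add_argument("-w", dest="write", action="store_true",
                        help="write the result back to the file")
    parser.add_argument("file", nargs="?",
                        help="input file (default: read STDIN)")
    args = parser.parse_args(argv)

    if args.write and not args.file:
        parser.error("-w needs a file name")

    # Read the input: a named file if given, otherwise STDIN.
    if args.file:
        # Assumption: the damaged file is read as ISO-8859-1.
        with open(args.file, encoding="latin-1") as handle:
            text = handle.read()
    else:
        text = stdin.read()

    fixed = fix(text)

    # Write the output: back to the file with -w, otherwise STDOUT.
    if args.write:
        with open(args.file, "w", encoding="utf-8") as handle:
            handle.write(fixed)
    else:
        stdout.write(fixed)
```

A real script would end with something like main(sys.argv[1:], fix=some_repair_function), where some_repair_function stands for whatever does the actual character restoring.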

Enjoy!

8 responses to “Python script to use HTMLDOC with UTF-8 files”

  1. Hi, I just needed this. Do you think I could use it on a remote (HTTP) web page? I’m using HTMLDOC on a Windows system, actually.

    I was looking for a “charset converting” proxy to set up in HTMLDOC, but I could try this other trick…

    What do you think?

    Ciao
    Marco

  2. Hi Marco,

    You can install Python on your Windows machine and use it locally before uploading your pages, or maybe you can do it on the fly on the server, if you have admin access to it. I guess Apache+mod_python could do it.

    Bye!

  3. Hi, Aurélio,

    I have tried to use your code for Simplified Chinese. The characters which you have set in the ‘mapping part’ of your code do not seem to cover the Chinese characters (CJK Unified Ideographs range: 4E00–9FCF).

    So, the code simply ignores these chars. Also, chars are being sent to the HTMLDOC PDF generator as hex (&#x4E2D, for example). Does it mean that I must complete your table with something like

    中 &#x4E2D

    to make it work?

    Thank you in advance

  4. Nikolai,

    Yes, you’re right in your assumptions. Just add the extra characters to the code and they’ll be converted. You may also need to change the CHARSET= line to something different from iso-8859-1.

    Bye!

  5. Hi

    I have just started using the HTMLDOC tool with txt2tags and the UTF-8 encoding is a problem for me too. This post sounds like a solution, but I have no idea how to install or run your Python script on my Mandriva system.

    Could you please explain step-by-step how to do it? Thanks a lot.

  6. iconv: another quick solution ;o)

    iconv -f UTF-8 -t ISO-8859-1 web.html > web-iso-8859-1.html

    And then execute htmldoc 1.8.x

    htmldoc -f web.pdf --webpage web-iso-8859-1.html

  7. This simple PHP script works for any characters (tested on Cyrillic):

    Enjoy!

  8. $stdin = fopen('php://stdin', 'r');
    $entities = stream_get_contents($stdin);
    print html_entity_decode($entities);
