Tuesday, November 18, 2008

Compressed HTML help (CHM) in Freepascal

What are CHM files?

CHM files are HTML archives compressed with the LZX compression algorithm.

The Idea

Sometime in the early 2000s there was a discussion about Lazarus and making the help system better. Many different ideas were thrown around, but the one that caught my eye was CHM; to me it made sense.

So I began to search for a way to make Lazarus help available as CHM files. The Freepascal documentation for the Run Time Library (RTL) and Free Component Library (FCL), and the Lazarus documentation for the Lazarus Component Library (LCL), were and still are all available in HTML format, either as a zip file to download or on a website for browsing. This was good because CHMs are just HTML files in a special archive, so in theory it's easy to make CHMs from the existing HTML files.

A bunch of History

I used the MS HTML Help Workshop to make some CHMs from the existing HTML files. This worked just fine, but already I was seeing a problem: CHM readers only exist for Windows.

I found a library, chmlib, which was capable of reading and decompressing CHM files. I started working with that, and before long I was able to extract any file from within a CHM file and display it. This was wonderful, but it was still not a perfect solution: if Lazarus provided a CHM help system this way, it would rely on an uncommon library.

So I searched Google for an explanation of the CHM file format and found a very detailed specification: http://www.speakeasy.org/~russotto/chm/
The real work began. I converted the LZX decompression code in chmlib from C to Pascal. At first it did not work, but with the help of Micha Nelissen I corrected the mistakes from the conversion, and the decompressor was finished.

Next came the part which required actual understanding of the innermost workings of a CHM archive. This is when I realized that reading a CHM meant understanding not just one file format but many. I'll explain: we already know that a CHM file is an archive that contains compressed HTML documents, but how do we know which HTML documents are included, where they are, or how big they are? CHMs have a document index which uses a tree structure to store information about the directories and files within. After some time I was able to read this structure and see a list of all files and directories in a CHM. Subsequently I realized that a CHM also has files stored within it that are added when it is made, and these contain information about the Table of Contents, the Index, which file is the “Home page” of the CHM, and many other uninteresting but important things. Understanding a CHM involves understanding these internal files.
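To make the "which files, where, and how big" question concrete, here is a minimal sketch of decoding a single directory entry. It is written in Python for brevity and is not the Pascal code from packages/chm. It assumes the entry layout as I understand it from the specification linked above: a name length, the name itself, then the content section, offset, and length, with each integer stored as a variable-length value (7 data bits per byte, most significant group first, high bit set on every byte except the last). The function names read_encint and read_entry are hypothetical, chosen only for this example.

```python
def read_encint(buf, pos):
    """Decode one variable-length integer starting at buf[pos].

    Returns (value, position of the byte after the integer).
    """
    value = 0
    while True:
        byte = buf[pos]
        pos += 1
        # Low 7 bits carry data; groups arrive most significant first.
        value = (value << 7) | (byte & 0x7F)
        # A clear high bit marks the final byte of the integer.
        if not (byte & 0x80):
            return value, pos


def read_entry(buf, pos):
    """Decode one directory entry (assumed layout, see the text above).

    Returns (entry dict, position of the byte after the entry).
    """
    name_len, pos = read_encint(buf, pos)
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len
    section, pos = read_encint(buf, pos)  # which content section holds the data
    offset, pos = read_encint(buf, pos)   # where the file starts in that section
    length, pos = read_encint(buf, pos)   # uncompressed size in bytes
    entry = {"name": name, "section": section,
             "offset": offset, "length": length}
    return entry, pos


if __name__ == "__main__":
    # Hand-built entry: "/index.html", section 1, offset 300, length 5000.
    raw = bytes([11]) + b"/index.html" + bytes([0x01, 0x82, 0x2C, 0xA7, 0x08])
    entry, _ = read_entry(raw, 0)
    print(entry)
```

A real reader would first locate the directory chunks through the file headers and then walk the entries in each chunk; this sketch covers only the innermost step.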

The specification I referred to above had all the information I needed about reading a CHM's contents, but not how to understand these other internal files. Google to the rescue: I found another website, http://www.nongnu.org/chmspec/latest/, that had information about the internal files! Not much later I was able to open, extract, and correctly display CHM files, all without any external library dependencies.

I was very pleased, but the CHMs made by the MS HTML Help Workshop did not have a very helpful Table of Contents or Index. Additionally, there was some discomfort among the Lazarus developers that the documentation could only be compiled on Windows PCs, since many of them only used Linux.

This is why, several months later, I began work on a CHM compiler! During a two-week period I converted more C code to Pascal so that an LZX compressor existed for Pascal, and wrote several units from scratch to write CHM files.

Now it is possible to write and read CHM files. The code is included with Freepascal in the packages/chm folder. Also, fpdoc, the Freepascal documentation tool, can now write CHM files.

A program (LHelp) and a Lazarus package were made to integrate CHM help. It did not gain popularity, however. Why? I'm not sure, but it could be because people didn't know it was there. It also required some setting up in Lazarus, which may have been too complicated for some. Regardless, two years went by.

Recent Progress

I had recently been distracted from Lazarus development by life and was not paying much attention to its progress. Then an email on the mailing list caught my eye. A Freepascal dev, Marco van de Voort, was working on integrating CHM help into the FP IDE. I immediately started reading the logs to find out what interest had been shown in CHMs.

So in the last few weeks I implemented something I had long wanted to do but had not yet been motivated to: creating and reading the search index that CHMs can contain.

What does this mean? It means that it is now possible to author searchable CHM files, and to read them, on any OS/platform that FreePascal supports, all with Pascal code and no external library requirements.

In the not too distant future, Lazarus may by default use these CHM files to display help when you need it.

6 comments:

Marc Lebrun said...

I like very much the idea of the CHM file(s) for Lazarus' help !
To me, one of the key points that made me learn Delphi so fast back in the mid 90's was that its help system was highly available in just a single click. If Lazarus could have such a nice help system it would be very very profitable to everyone.
Many thanks !!!

PeeDee said...

Where is the code for your conversion from chm to html? I was browsing the repository, but couldn't find it.

PeeDee said...

I'm working on a ruby version of a chm reader, and am stuck at the LZX/html conversion. Hoping your code is easier to read than the C.

Unknown said...

Look here: http://svn.freepascal.org/svn/fpc/trunk/packages/chm/src

Anonymous said...

there is program for Linux: KChmViewer, you may write better one for Linux, I think. using your code.

Jack said...

Brilliant idea of the CHM file(s) for Lazarus' help ! I must say that you thought very well. If it work fine , it would be very much beneficial to everyone. Hope for good !! Thanks a ton !!
