The Doc format is the de facto standard for large text documents on the Palm Computing Platform. It enjoys wide support in both software and content, but documentation is sparse. This document is an attempt to describe the Doc format for the edification of programmers who are interested in writing Doc-compatible software, and to encourage programmers not to break the format in incompatible ways.
This document is totally unofficial, and derived from examination of existing Doc files and applications.
A Doc-format e-text is an ordinary PalmPilot database, represented on the desktop by a file in the standard .prc/.pdb format. (Describing that format is currently beyond the scope of this document.) The database is divided into three sections, which appear in order:
- A header record
- A series of text records
- A series of bookmark records
Note that all multi-byte values are stored most significant byte (MSB) first, that is, big-endian, as is usual on the PalmPilot.
The Header Record
The first record in a Doc database is a header. Existing Doc creation programs create a 16-byte header, with contents as described below; many Doc readers extend this record once the database is installed, to hold additional reader-specific information.
|Doc Header Format|
|Field|Size|Description|
|version|2 bytes|0x0002 if data is compressed, 0x0001 if uncompressed|
|spare|2 bytes|purpose unknown (set to 0 on creation)|
|length|4 bytes|total length of text before compression|
|records|2 bytes|number of text records|
|record_size|2 bytes|maximum size of each record (usually 4096; see below)|
|position|4 bytes|currently viewed position in the document|
|sizes|2*records bytes|record size array|
The position field is not used by all readers; some store this information elsewhere.
AportisDoc (Reader and Mobile Edition) set spare to 0x0003, and overwrite the first two bytes of length with zeros (even if the document is more than 64k bytes in length!) upon first opening the document.
The sizes array is a list of two-byte unsigned integers giving the uncompressed size of each text record, in order. It is created by some readers (AportisDoc, TealDoc, Doc, and possibly others) when the document is first opened.
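The header layout above can be read with a few lines of Python. This is a minimal sketch rather than a reference implementation: the names `DocHeader` and `parse_doc_header` are made up for illustration, and the code assumes that any bytes following the 16-byte fixed part are the sizes array, which may not hold for readers that append other reader-specific data.

```python
import struct
from typing import NamedTuple, Tuple

# Fixed part of the header: big-endian, 2+2+4+2+2+4 = 16 bytes.
HEADER_FMT = ">HHIHHI"


class DocHeader(NamedTuple):
    version: int        # 2 = compressed, 1 = uncompressed
    spare: int
    length: int         # total text length before compression
    records: int        # number of text records
    record_size: int    # maximum (uncompressed) size of each text block
    position: int       # currently viewed position, if the reader uses it
    sizes: Tuple[int, ...]  # per-record sizes, empty if not present


def parse_doc_header(record: bytes) -> DocHeader:
    """Parse record 0 of a Doc database (hypothetical helper)."""
    fixed_size = struct.calcsize(HEADER_FMT)
    if len(record) < fixed_size:
        raise ValueError("header record is shorter than 16 bytes")
    fields = struct.unpack_from(HEADER_FMT, record, 0)
    records = fields[3]
    sizes: Tuple[int, ...] = ()
    # Assumption: if enough bytes follow the fixed part, they are the
    # two-byte-per-record sizes array that some readers add on first open.
    if len(record) >= fixed_size + 2 * records:
        sizes = struct.unpack_from(">%dH" % records, record, fixed_size)
    return DocHeader(*fields, sizes)
```

A caller that needs the true document length should not rely on `length` alone, since AportisDoc is described above as zeroing its upper two bytes.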
Following the header record is a series of text records, each one of which represents a text block no greater than record_size bytes in length. Most conversion software creates blocks of 4096 bytes (except for the last one); the format provides for other block sizes and for records of varying lengths, but it is likely that some Doc-handling software cannot deal with anything but fixed 4096-byte records.
In a version 1 database, each block of text is simply stored in a single record. In a version 2 database, each block of text is individually compressed, making the actual record size somewhat smaller -- note that the block size refers to the uncompressed size of a text block.
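As an illustration of the version 1 (uncompressed) layout, the following sketch splits a text into blocks and builds the matching header record. The function names are hypothetical, and the sketch produces only the record contents; wrapping them in a .prc/.pdb file is not covered here.

```python
import struct
from typing import List


def split_into_blocks(text: bytes, record_size: int = 4096) -> List[bytes]:
    """Cut the text into blocks of at most record_size bytes."""
    return [text[i:i + record_size] for i in range(0, len(text), record_size)]


def build_uncompressed_records(text: bytes, record_size: int = 4096) -> List[bytes]:
    """Return [header, block 1, block 2, ...] for a version 1 document."""
    blocks = split_into_blocks(text, record_size)
    header = struct.pack(
        ">HHIHHI",
        1,            # version: 1 = uncompressed
        0,            # spare: set to 0 on creation
        len(text),    # total length of the text
        len(blocks),  # number of text records
        record_size,  # maximum size of each record
        0,            # position: start of document
    )
    return [header] + blocks
```

Sticking to the default of 4096 bytes per block is the cautious choice, given that some Doc software may not cope with other sizes.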
Note: The original designer of the Doc compression format, Pat Beirne, has reposted one of his original messages describing the algorithm. If you are curious about why it works the way it does, check it out.
Each text block (in a version 2 database) is individually compressed using a simple one-pass algorithm. As I am far from an expert in compression algorithm design, I shall simply describe what the data looks like and refer anyone interested in more details to the code (which is readily available in a variety of places, such as in the source to txt2pdbdoc or the source to Pyrite).
The output of the compression algorithm is a stream of bytes, described here with the action taken by the decompressor when they are encountered:
|Compression Byte Codes|
|Byte code N|Decompressor action|
|0x00|Pass through as-is|
|0x01-0x08|Copy the following N bytes verbatim|
|0x09-0x7f|Pass through as-is|
|0x80-0xbf|Copy a sequence from a previous part of the block|
|0xc0-0xff|Insert a space followed by the character N xor 0x80|
When a copy-sequence byte code is encountered, it is used as the high byte of a two-byte quantity, along with the next byte in the data (resulting in a value from 0x8000 to 0xbfff). This value is then ANDed with 0x3fff, resulting in a 14-bit value from 0x0000 to 0x3fff. It is further subdivided into an offset (the upper 11 bits, shifted right by 3) and a length (the lower 3 bits). The data to copy is located by subtracting the offset from the current position in the decompressed data; the number of bytes copied is equal to the length plus 3, so a single copy sequence produces between 3 and 10 bytes.
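The decoding rules can be sketched in Python as follows. This is an illustrative decoder, not the code from txt2pdbdoc or Pyrite. The name `decompress_block` is hypothetical, and the sketch assumes that the literal-run codes are 0x01 through 0x08, with 0x00 and 0x09 through 0x7f passed through unchanged (so that a tab character, 0x09, survives compression).

```python
def decompress_block(data: bytes) -> bytes:
    """Decompress one version 2 text record (illustrative sketch)."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        i += 1
        if 0x01 <= c <= 0x08:
            # Copy the following c bytes verbatim.
            out += data[i:i + c]
            i += c
        elif c <= 0x7F:
            # 0x00 and 0x09-0x7f stand for themselves.
            out.append(c)
        elif c >= 0xC0:
            # A space followed by the byte with its top bit cleared.
            out.append(0x20)
            out.append(c ^ 0x80)
        else:
            # 0x80-0xbf: two-byte copy sequence.
            if i >= n:
                raise ValueError("truncated copy sequence")
            pair = ((c << 8) | data[i]) & 0x3FFF
            i += 1
            offset = pair >> 3            # upper 11 bits
            length = (pair & 0x07) + 3    # lower 3 bits, plus 3
            if offset == 0 or offset > len(out):
                raise ValueError("copy sequence points outside the block")
            # Copy byte by byte so that overlapping copies
            # (offset smaller than length) repeat recent output.
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)
```

For example, the five bytes `b"abc\x80\x18"` decode to `b"abcabc"`: the pair 0x8018 masks to 0x0018, giving an offset of 3 and a length field of 0, hence 3 bytes copied from 3 bytes back.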
Following the text records is an optional series of bookmark records. Each bookmark occupies a single record, and they are usually presented by the reader in the same order they appear in the database. The format of a bookmark record is rather simple:
|Field|Size|Description|
|name|16 bytes|bookmark name (up to 15 characters, null terminated)|
|position|4 bytes|bookmark position, from beginning of text|
Note that the bookmark name field is always 16 bytes wide, even if the name is shorter, and that the position is in actual text bytes before compression.
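A bookmark record is small enough to handle with a single `struct` format. The sketch below uses hypothetical helper names and leaves the name as raw bytes, since the character encoding of the name is not specified here.

```python
import struct
from typing import Tuple

# 16-byte name field followed by a big-endian 4-byte position: 20 bytes.
BOOKMARK_FMT = ">16sI"


def parse_bookmark(record: bytes) -> Tuple[bytes, int]:
    """Return (name, position) from a bookmark record."""
    name_field, position = struct.unpack_from(BOOKMARK_FMT, record, 0)
    # The name ends at the first null byte.
    name = name_field.split(b"\x00", 1)[0]
    return name, position


def build_bookmark(name: bytes, position: int) -> bytes:
    """Build a 20-byte bookmark record."""
    # Truncate to 15 bytes so a terminating null always fits;
    # struct pads the rest of the 16-byte field with nulls.
    return struct.pack(BOOKMARK_FMT, name[:15], position)
```

Because `parse_bookmark` reads only the first 20 bytes, it also tolerates records that a reader has extended with extra data.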
Because most Doc creation programs do not add bookmark records to their output, most Doc readers support an alternative method for authors to specify bookmark locations in a document. The reader scans the document the first time it is opened, looking for a specified string at the start of lines. Each time it is found, the reader adds a bookmark using the text on the rest of the line. By convention, the text to scan for is placed on the last line of the document, surrounded by angle brackets (< and >).
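The scanning convention might be implemented along these lines. This is a sketch under several assumptions that actual readers may not share: the bookmark position is taken to be the start of the marked line, the name is trimmed of surrounding whitespace and cut to 15 bytes, and trailing blank lines after the marker line are ignored. The function name `scan_bookmarks` is hypothetical.

```python
from typing import List, Tuple


def scan_bookmarks(text: bytes) -> List[Tuple[bytes, int]]:
    """Find auto-bookmarks using the marker given on the last line."""
    stripped = text.rstrip(b"\r\n \t")
    last_start = stripped.rfind(b"\n") + 1
    last_line = stripped[last_start:]
    # The last line must look like <marker>, with a non-empty marker.
    if len(last_line) < 3 or not (
        last_line.startswith(b"<") and last_line.endswith(b">")
    ):
        return []
    marker = last_line[1:-1]

    bookmarks = []
    pos = 0
    # Walk every line before the marker line, tracking byte offsets.
    for line in text[:last_start].splitlines(keepends=True):
        if line.startswith(marker):
            name = line[len(marker):].strip()[:15]
            bookmarks.append((name, pos))
        pos += len(line)
    return bookmarks
```

With a document ending in the line `<*>`, every line that begins with `*` yields a bookmark named after the rest of that line.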
The current TealDoc extensions are implemented by the use of HTML-like tags embedded in the text of the document. Although TealDoc tags look like HTML, TealDoc's parser is not as robust as that of a desktop web browser; the following limitations have been observed in practice:
- Tags, attributes, and keyword values must be in all upper case.
- Each tag must appear alone on a single line; attempting to embed a tag in the middle of a line of text will cause unpredictable results.
- Text attribute values should be surrounded by double quotes; keyword and numeric values should not be quoted.
Besides TealDoc, other Doc readers also extend the standard e-text database format. Some of these extensions will be more fully documented later; for the time being, this section contains a few notes in the hope that future developers will be able to avoid compatibility problems. This section should not be considered authoritative or complete; if you are developing Doc software, you should investigate these extensions for yourself.
QED, the Doc editor from Visionary 2000, adds an appinfo block, simultaneously marking the document with its version number (in the database header).
RichReader, the rich text document reader by Michael Arena, supports formatting control codes (font changes, indentation, etc.) embedded in the document text. When viewed on another reader, RichReader documents may appear to contain "garbage" characters, since many of the formatting codes use non-printable or extended ASCII characters.
Mobile LinkDoc, a reader from Mobile Generation Software, stores links between documents by adding extended bookmark records to the document being linked from.
Extensions Which Do Not Affect the Doc Format
A number of readers (nearly all of them, in fact) store additional information in databases separate from the documents themselves, leaving the documents unaltered. For example, category information is normally stored externally. These product-specific databases will not, at the present time, be documented here, because they do not affect the document format itself.