Doc File Format
Date: 2006-3-26 0:19:29

Note: this article has not been fully edited yet.

The Doc format is the de facto standard for large text documents on the Palm Computing Platform. It enjoys wide support in both software and content, but documentation is sparse. This document is an attempt to describe the Doc format for the edification of programmers who are interested in writing Doc-compatible software, and to encourage programmers not to break the format in incompatible ways.

This document is totally unofficial, and derived from examination of existing Doc files and applications.

Overview

A Doc-format e-text is an ordinary PalmPilot database, represented on the desktop by a file in the standard .prc/.pdb format. (Describing that format is currently beyond the scope of this document.) The database is divided into three sections, which appear in order:

  • A header record
  • A series of text records
  • A series of bookmark records

Note that all values are stored MSB first, as is usual on the PalmPilot.

The Header Record

The first record in a Doc database is a header. Existing Doc creation programs create a 16-byte header, with contents as described below; many Doc readers extend this record once the database is installed, to hold additional reader-specific information.

Doc Header Format
  Field        Size             Description
  version      2 bytes          0x0002 if data is compressed, 0x0001 if uncompressed
  spare        2 bytes          purpose unknown (set to 0 on creation)
  length       4 bytes          total length of text before compression
  records      2 bytes          number of text records
  record_size  2 bytes          maximum size of each record (usually 4096; see below)
  position     4 bytes          currently viewed position in the document
  sizes        2*records bytes  record size array

The position field is not used by all readers; some store this information elsewhere.

AportisDoc (Reader and Mobile Edition) set spare to 0x0003, and overwrite the first two bytes of length with zeros (even if the document is more than 64k bytes in length!) upon first opening the document.

The sizes array is a list of two-byte unsigned integers giving the uncompressed size of each text record, in order. It is created by some readers (AportisDoc, TealDoc, Doc, and possibly others) when the document is first opened.
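As a rough illustration of the layout above, the following Python sketch unpacks the 16-byte fixed header with big-endian fields. The names DocHeader and parse_doc_header are hypothetical. Treating any bytes after the fixed header as the sizes array is an assumption, since readers may store other data there.

```python
import struct
from typing import NamedTuple

# Field order and widths follow the header table; ">" means MSB first.
HEADER_STRUCT = struct.Struct(">HHIHHI")  # 2+2+4+2+2+4 = 16 bytes


class DocHeader(NamedTuple):
    version: int      # 2 = compressed, 1 = uncompressed
    spare: int        # purpose unknown
    length: int       # total text length before compression
    records: int      # number of text records
    record_size: int  # maximum uncompressed size of a text record
    position: int     # reading position (not used by all readers)
    sizes: tuple      # per-record sizes, empty if the record was never extended


def parse_doc_header(record: bytes) -> DocHeader:
    if len(record) < HEADER_STRUCT.size:
        raise ValueError("header record shorter than 16 bytes")
    fixed = HEADER_STRUCT.unpack_from(record, 0)
    records = fixed[3]
    sizes = ()
    # Assumption: if enough bytes follow the fixed header, they are the
    # sizes array (one unsigned 16-bit value per text record).
    if records and len(record) >= HEADER_STRUCT.size + 2 * records:
        sizes = struct.unpack_from(f">{records}H", record, HEADER_STRUCT.size)
    return DocHeader(*fixed, sizes)
```

A reader that wants to be tolerant of AportisDoc's rewriting of the length field would probably not rely on length alone after the document has been opened on a device.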

Text Records

Following the header record is a series of text records, each one of which represents a text block no greater than record_size bytes in length. Most conversion software creates blocks of 4096 bytes (except for the last one); the format provides for other block sizes and for records of varying lengths, but it is likely that some Doc-handling software cannot deal with anything but fixed 4096-byte records.

In a version 1 database, each block of text is simply stored in a single record. In a version 2 database, each block of text is individually compressed, making the actual record size somewhat smaller -- note that the block size refers to the uncompressed size of a text block.
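To make the block structure concrete, here is a minimal Python sketch of how a creation tool might cut a text into fixed-size blocks before storing (version 1) or compressing (version 2) each one. The function name split_into_blocks is hypothetical, and the default of 4096 simply reflects the usual record_size mentioned above.

```python
def split_into_blocks(text: bytes, record_size: int = 4096) -> list:
    """Split text into blocks of at most record_size uncompressed bytes.

    Every block except possibly the last is exactly record_size bytes long.
    """
    if record_size <= 0:
        raise ValueError("record_size must be positive")
    return [text[i:i + record_size] for i in range(0, len(text), record_size)]
```

With this split, the header's records field would be the number of blocks returned and length would be the length of the original text.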

Compression Algorithm

Note: The original designer of the Doc compression format, Pat Beirne, has reposted one of his original messages describing the algorithm. If you are curious about why it works the way it does, check it out.

Each text block (in a version 2 database) is individually compressed using a simple one-pass algorithm. As I am far from an expert in compression algorithm design, I shall simply describe what the data looks like and refer anyone interested in more details to the code (which is readily available in a variety of places, such as the source to txt2pdbdoc or the source to Pyrite).

The output of the compression algorithm is a stream of bytes, described here with the action taken by the decompressor when they are encountered:

 
Compression Byte Codes
  Byte code    Decompressor action (N is the value of the code byte)
  0x00         Pass through as-is
  0x01-0x08    Copy the following N bytes verbatim
  0x09-0x7f    Pass through as-is
  0x80-0xbf    Copy a sequence from a previous part of the block
  0xc0-0xff    Insert a space followed by N xor 0x80

When a copy-sequence byte code is encountered, it is used as the high byte of a two-byte quantity, along with the next byte in the data (resulting in a value from 0x8000 to 0xbfff). This value is then ANDed with 0x3fff, resulting in a value from 0x0000 to 0x3fff. It is further subdivided into an offset (the upper 11 bits, shifted right by 3) and a length (the lower 3 bits). The data to copy is located by subtracting the offset from the current position in the decompressed data; the number of bytes copied is equal to the length plus 3, that is, between 3 and 10 bytes.
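The decoding rules can be sketched in Python roughly as follows. This is an illustrative decoder, not code taken from any of the programs mentioned; the name decompress_doc_block is hypothetical. It treats 0x01-0x08 as length prefixes and 0x00 and 0x09-0x7f as literal bytes, which is how commonly available decoders such as txt2pdbdoc behave as far as I know. The error checks are an addition for safety.

```python
def decompress_doc_block(data: bytes) -> bytes:
    """Decompress one version 2 text record (illustrative sketch)."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        i += 1
        if 0x01 <= c <= 0x08:
            # Length prefix: copy the next c bytes verbatim.
            out += data[i:i + c]
            i += c
        elif c < 0x80:
            # 0x00 and 0x09-0x7f: literal byte.
            out.append(c)
        elif c >= 0xC0:
            # A space followed by the code byte xor 0x80.
            out.append(0x20)
            out.append(c ^ 0x80)
        else:
            # 0x80-0xbf: high byte of a two-byte copy-sequence code.
            if i >= n:
                raise ValueError("truncated copy-sequence code")
            pair = ((c << 8) | data[i]) & 0x3FFF
            i += 1
            offset = pair >> 3            # upper 11 bits
            length = (pair & 0x07) + 3    # lower 3 bits, plus 3
            if offset == 0 or offset > len(out):
                raise ValueError("copy sequence points outside the block")
            # Copy one byte at a time: the source range may overlap the
            # bytes being written (e.g. offset 1 repeats the last byte).
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)
```

The byte-at-a-time copy matters: a slice copy would read only bytes that existed before the copy started and would produce short output whenever the length exceeds the offset.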

Bookmark Records

Following the text records is an optional series of bookmark records. Each bookmark occupies a single record, and they are usually presented by the reader in the same order they appear in the database. The format of a bookmark record is rather simple:

  Field     Size      Description
  name      16 bytes  bookmark name (up to 15 characters, null terminated)
  position  4 bytes   bookmark position, from beginning of text

Note that the bookmark name field is always 16 bytes wide, even if the name is shorter, and that the position is in actual text bytes before compression.
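A bookmark record is small enough to handle with a single struct format. The sketch below is only an illustration; parse_bookmark and build_bookmark are hypothetical names, and decoding the name as Latin-1 is an assumption made so that any byte value can be decoded without error.

```python
import struct

# 16-byte null-padded name followed by a 4-byte big-endian position.
BOOKMARK_STRUCT = struct.Struct(">16sI")  # 20 bytes


def parse_bookmark(record: bytes):
    """Return (name, position) from a bookmark record."""
    raw_name, position = BOOKMARK_STRUCT.unpack_from(record, 0)
    # Keep only the bytes before the first null terminator.
    name = raw_name.split(b"\x00", 1)[0]
    return name.decode("latin-1"), position  # encoding is an assumption


def build_bookmark(name: str, position: int) -> bytes:
    """Build a 20-byte bookmark record; names are cut to 15 characters."""
    encoded = name.encode("latin-1")[:15]
    # The "16s" format pads short names with null bytes.
    return BOOKMARK_STRUCT.pack(encoded, position)
```

Because position counts uncompressed text bytes, it can be compared directly with offsets into the concatenated, decompressed text records.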

Common Conventions

Bookmark Autoscan

Because most Doc creation programs do not add bookmark records to their output, most Doc readers support an alternative method for authors to specify bookmark locations in a document. The reader scans the document the first time it is opened, looking for a specified string at the start of lines. Each time it is found, the reader adds a bookmark using the text on the rest of the line. By convention, the text to scan for is placed on the last line of the document, surrounded by angle brackets (< and >).
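The autoscan convention might be implemented along these lines. This Python sketch is a guess at the general approach rather than a description of any particular reader: the function name autoscan_bookmarks is hypothetical, and recording the start of the matching line as the bookmark position, as well as trimming names to 15 bytes, are assumptions.

```python
def autoscan_bookmarks(text: bytes) -> list:
    """Return (name, position) pairs found by scanning for the marker
    given in angle brackets on the last line of the document."""
    body = text.rstrip(b"\r\n")
    last_start = body.rfind(b"\n") + 1  # 0 if the text is a single line
    last_line = body[last_start:].strip()
    if len(last_line) < 3 or last_line[:1] != b"<" or last_line[-1:] != b">":
        return []  # no marker line, so nothing to scan for
    marker = last_line[1:-1]
    bookmarks = []
    position = 0
    # Scan only the lines before the marker line itself.
    for line in body[:last_start].split(b"\n"):
        if line.startswith(marker):
            # The rest of the line becomes the bookmark name.
            name = line[len(marker):].strip()[:15]
            bookmarks.append((name, position))
        position += len(line) + 1  # +1 for the newline removed by split
    return bookmarks
```

For example, a document ending in a line containing <*> would get one bookmark for every line that starts with an asterisk.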

TealDoc-Specific Extensions

The current TealDoc extensions are implemented by the use of HTML-like tags embedded in the text of the document. Although TealDoc tags look like HTML, TealDoc's parser is not as robust as that of a desktop web browser; the following limitations have been observed in practice:

  • Tags, attributes, and keyword values must be in all upper case
  • Each tag must appear alone on a single line; attempting to embed a tag in the middle of a line of text will cause unpredictable results.
  • Text attribute values should be surrounded by double quotes; keyword and numeric values should not be quoted.

Other Extensions

Besides TealDoc, other Doc readers also extend the standard e-text database format. Some of these extensions will be more fully documented later; for the time being, this section contains a few notes in the hopes that future developers will be able to avoid compatibility problems. Please note that the notes in this section should not be considered authoritative or complete; if you are developing Doc software, you should investigate this stuff for yourself.

QED Extensions

QED, the Doc editor from Visionary 2000, adds an appinfo block, simultaneously marking the document with its version number (in the database header).

RichReader Extensions

RichReader, the rich text document reader by Michael Arena, supports formatting control codes (font changes, indentation, etc.) embedded in the document text. When viewed on another reader, RichReader documents may appear to contain "garbage" characters, since many of the formatting codes use non-printable or extended ASCII characters.

LinkDoc Extensions

Mobile LinkDoc, a reader from Mobile Generation Software, stores links between documents by adding extended bookmark records to the document being linked from.

Extensions Which Do Not Affect the Doc Format

A number of readers (nearly all of them, in fact) store additional information in databases separate from the documents themselves, leaving the documents unaltered. For example, category information is normally stored externally. These product-specific databases will not, at the present time, be documented here, because they do not affect the document format itself.

