XmHTML Parser Description

Overview

This document describes XmHTML's HTML parser in detail and provides background information on the how and why of document verification and repair. It is targeted at programmers who want to make full use of the parser and document callback resources, as well as programmers who want to use the generated parser tree for other purposes.

XmHTML's HTML parser is fairly powerful: it can repair even severely malformed HTML documents and convert a document that does not conform to HTML 3.2 into one that does. These document verification and repair capabilities exist solely because XmHTML only works with fully balanced HTML documents. A balanced HTML document is one in which each terminated HTML element has its opening and closing members at the same level.

Parser Tree

When a document is loaded into XmHTML, the parser translates this document to a doubly linked list of objects (referred to as the Parser Tree). Each object contains either an HTML element (and its attributes) or plain text.
typedef struct _XmHTMLObject{
	htmlEnum id;		/* ID for this element */
	String element;		/* element text */
	String attributes;	/* attributes for this element, if any */
	Boolean is_end;		/* true when this is a closing element */
	Boolean terminated;	/* true when element has a closing counterpart */
	int line;		/* line number for this element */
	struct _XmHTMLObject *next;
	struct _XmHTMLObject *prev;
}XmHTMLObject;
The id field of this structure describes the type of element. The table at the end of this document lists all elements that XmHTML knows of.

When id is HT_ZTEXT, the element field contains plain text as read from the document (character escape sequences are not expanded). The attributes, is_end and terminated fields are meaningless in this case.

In all other cases, the element field contains the element name and the attributes field contains possible attributes for this element. When an element is terminated (that is, has a closing counterpart), the terminated field will be True, and the is_end field indicates whether the current element is an opening or a closing one. Only unterminated or opening elements can have attributes.
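
The field semantics above are enough to check whether a list of objects is balanced. The following is a minimal sketch of such a check, not XmHTML's own verification code. The typedefs for String, Boolean and htmlEnum are stand-ins for the toolkit types, and tree_is_balanced, MAX_DEPTH and the text_id parameter (the caller would pass HT_ZTEXT) are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the Xt/XmHTML types (assumption); a real program would
 * take these from the toolkit headers. */
typedef char *String;
typedef int Boolean;
typedef int htmlEnum;

typedef struct _XmHTMLObject {
	htmlEnum id;
	String element;
	String attributes;
	Boolean is_end;
	Boolean terminated;
	int line;
	struct _XmHTMLObject *next;
	struct _XmHTMLObject *prev;
} XmHTMLObject;

#define MAX_DEPTH 256	/* hypothetical nesting limit for this sketch */

/* Returns 1 when every terminated element is closed at the level on which
 * it was opened, 0 otherwise. text_id is the id used for plain text. */
static int tree_is_balanced(const XmHTMLObject *list, htmlEnum text_id)
{
	htmlEnum stack[MAX_DEPTH];
	int depth = 0;
	const XmHTMLObject *obj;

	assert(MAX_DEPTH > 0);

	for (obj = list; obj != NULL; obj = obj->next) {
		/* is_end and terminated are meaningless for plain text */
		if (obj->id == text_id || !obj->terminated)
			continue;
		if (!obj->is_end) {
			if (depth == MAX_DEPTH)
				return 0;	/* nested too deeply for this sketch */
			stack[depth++] = obj->id;
		} else if (depth == 0 || stack[--depth] != obj->id) {
			return 0;	/* closing element without a matching opener */
		}
	}
	return depth == 0;	/* anything left on the stack was never closed */
}
```

A stack of element ids is the simplest structure that captures "same level": an opening element pushes its id, and a closing element must match the id on top of the stack.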

The element and attributes fields are contained in the same memory buffer, where the latter is separated from the former by a NUL character. When freeing an object, freeing the element field will also free the attributes field, so the attributes field must never be freed separately.
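
The shared buffer layout can be illustrated with a small sketch. It uses the C library allocator; which allocator XmHTML itself uses is not stated here, so treat malloc and free as an assumption. The helpers pack_element and release_element are hypothetical names and not part of XmHTML.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copies name and attrs into one allocation laid out as "name\0attrs\0".
 * On success *element points at the start of the buffer and *attributes at
 * the byte after the separator (or is NULL when attrs is NULL).
 * Returns 1 on success, 0 when the allocation fails. */
static int pack_element(const char *name, const char *attrs,
	char **element, char **attributes)
{
	size_t name_len, attr_len;
	char *buf;

	assert(name != NULL && element != NULL && attributes != NULL);

	name_len = strlen(name);
	attr_len = attrs != NULL ? strlen(attrs) : 0;
	buf = malloc(name_len + 1 + attr_len + 1);
	if (buf == NULL)
		return 0;

	memcpy(buf, name, name_len + 1);	/* copies the NUL separator too */
	*element = buf;
	if (attrs != NULL) {
		memcpy(buf + name_len + 1, attrs, attr_len + 1);
		*attributes = buf + name_len + 1;
	} else {
		buf[name_len + 1] = '\0';
		*attributes = NULL;
	}
	return 1;
}

/* Releases both strings with a single free(); the attributes pointer lives
 * inside the same allocation and must not be freed on its own. */
static void release_element(char *element)
{
	free(element);
}
```

Because both pointers refer to one allocation, a consumer that copies an object must duplicate the whole buffer rather than the two strings independently.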

The line field contains the line number in the source document where the element is located.

The objects field in the XmHTMLDocumentCallbackStruct contains the starting point of the parser tree.
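
Starting from that pointer, the list can be walked with the next field. The sketch below prints one line per object and is only an illustration: the typedefs are stand-ins for the toolkit types, dump_tree and its text_id parameter (pass HT_ZTEXT) are hypothetical, and it assumes that the attributes field is NULL when an element has no attributes.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins for the Xt/XmHTML types (assumption); a real program would
 * take these from the toolkit headers. */
typedef char *String;
typedef int Boolean;
typedef int htmlEnum;

typedef struct _XmHTMLObject {
	htmlEnum id;
	String element;
	String attributes;
	Boolean is_end;
	Boolean terminated;
	int line;
	struct _XmHTMLObject *next;
	struct _XmHTMLObject *prev;
} XmHTMLObject;

/* Writes one line per object to out and returns the number of objects
 * visited. text_id is the id used for plain text objects. */
static int dump_tree(FILE *out, const XmHTMLObject *list, htmlEnum text_id)
{
	int count = 0;
	const XmHTMLObject *obj;

	assert(out != NULL);

	for (obj = list; obj != NULL; obj = obj->next, count++) {
		if (obj->id == text_id)
			fprintf(out, "%4d: text \"%s\"\n", obj->line, obj->element);
		else if (obj->terminated && obj->is_end)
			fprintf(out, "%4d: </%s>\n", obj->line, obj->element);
		else if (obj->attributes != NULL)
			fprintf(out, "%4d: <%s %s>\n", obj->line, obj->element,
				obj->attributes);
		else
			fprintf(out, "%4d: <%s>\n", obj->line, obj->element);
	}
	return count;
}
```

In a document callback, the call would look like dump_tree(stderr, cbs->objects, HT_ZTEXT), where cbs is the XmHTMLDocumentCallbackStruct pointer passed to the callback.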

Programmers who want to use the generated parser tree for other purposes might be interested in some of the XmHTML private functions for extracting attribute values and expanding character escape sequences.

Document Verification

Document Repair

XmNparserCallback

typedef struct
{
	int reason;		/* the reason the callback was called */
	XEvent *event;		/* always NULL for XmNparserCallback */
	int no;			/* total error count up to now */
	int line_no;		/* input line number where error was detected */
	int start_pos;		/* absolute index where error starts */
	int end_pos;		/* absolute index where error ends */
	parserError error;	/* type of error */
	int action;		/* suggested correction action */
	String err_msg;		/* error message */
}XmHTMLParserCallbackStruct, *XmHTMLParserCallbackStructPtr;
This table lists all possible values for the action field, together with a short description of what the parser response will be.

Action            Description
XmHTML_REMOVE     offending element will be removed
XmHTML_INSERT     insert missing element
XmHTML_SWITCH     switch offending and expected element
XmHTML_KEEP       keep offending element
XmHTML_IGNORE     ignore, proceed as if nothing happened
XmHTML_TERMINATE  terminate parser

Shown below are all possible values for the error field (default action is displayed in bold), allowed actions and the value of the err_msg field. When the action field is set to an action that is not allowed for an error, XmHTML will use the default action.

error: HTML_UNKNOWN_ELEMENT
actions: XmHTML_REMOVE, XmHTML_TERMINATE
err_msg: %s: unknown HTML identifier

error: HTML_UNKNOWN_ESCAPE
actions: XmHTML_REMOVE, XmHTML_TERMINATE
err_msg: %s: unknown character escape sequence

error: HTML_BAD
actions: XmHTML_REMOVE, XmHTML_IGNORE, XmHTML_TERMINATE
err_msg: Terrible HTML! element %s completely out of balance.

error: HTML_OPEN_BLOCK
actions: XmHTML_INSERT, XmHTML_REMOVE, XmHTML_KEEP
err_msg: A new block level element (%s) was encountered while %s is still open.

error: HTML_CLOSE_BLOCK
actions: XmHTML_REMOVE, XmHTML_INSERT, XmHTML_KEEP, XmHTML_TERMINATE
err_msg: A closing block level element (%s) was encountered while it was never opened.

error: HTML_OPEN_ELEMENT
actions: XmHTML_REMOVE, XmHTML_SWITCH, XmHTML_TERMINATE
err_msg: Unbalanced terminator: got %s while %s is required.

error: HTML_VIOLATION
actions: XmHTML_REMOVE, XmHTML_KEEP, XmHTML_TERMINATE
err_msg: %s may not occur inside %s

error: HTML_INTERNAL
actions: XmHTML_TERMINATE, XmHTML_IGNORE
err_msg: Internal parser error
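
A callback that overrides the action field may want to check its choice against the lists above first. The sketch below encodes those lists as bit masks. The SK_ enumerations are local stand-ins, because the numeric values of the real XmHTML_ and HTML_ constants are not given in this document, and action_allowed and pick_action are hypothetical helpers. When the wanted action is not allowed, the sketch keeps the action the parser suggested instead of guessing the default.

```c
#include <assert.h>

/* Local stand-ins for the XmHTML action and error constants (assumption:
 * the real values come from the XmHTML headers). */
typedef enum {
	SK_REMOVE, SK_INSERT, SK_SWITCH, SK_KEEP, SK_IGNORE, SK_TERMINATE
} sk_action;

typedef enum {
	SK_UNKNOWN_ELEMENT, SK_UNKNOWN_ESCAPE, SK_BAD, SK_OPEN_BLOCK,
	SK_CLOSE_BLOCK, SK_OPEN_ELEMENT, SK_VIOLATION, SK_INTERNAL,
	SK_NUM_ERRORS
} sk_error;

#define BIT(a) (1u << (a))

/* Allowed actions per error, transcribed from the lists in the text. */
static const unsigned sk_allowed[SK_NUM_ERRORS] = {
	/* SK_UNKNOWN_ELEMENT */ BIT(SK_REMOVE) | BIT(SK_TERMINATE),
	/* SK_UNKNOWN_ESCAPE  */ BIT(SK_REMOVE) | BIT(SK_TERMINATE),
	/* SK_BAD             */ BIT(SK_REMOVE) | BIT(SK_IGNORE) | BIT(SK_TERMINATE),
	/* SK_OPEN_BLOCK      */ BIT(SK_INSERT) | BIT(SK_REMOVE) | BIT(SK_KEEP),
	/* SK_CLOSE_BLOCK     */ BIT(SK_REMOVE) | BIT(SK_INSERT) | BIT(SK_KEEP) |
	                         BIT(SK_TERMINATE),
	/* SK_OPEN_ELEMENT    */ BIT(SK_REMOVE) | BIT(SK_SWITCH) | BIT(SK_TERMINATE),
	/* SK_VIOLATION       */ BIT(SK_REMOVE) | BIT(SK_KEEP) | BIT(SK_TERMINATE),
	/* SK_INTERNAL        */ BIT(SK_TERMINATE) | BIT(SK_IGNORE)
};

/* Returns nonzero when action is listed as allowed for error. */
static int action_allowed(sk_error error, sk_action action)
{
	assert((int)error >= 0 && (int)error < (int)SK_NUM_ERRORS);
	return (sk_allowed[error] & BIT(action)) != 0;
}

/* Returns wanted when it is allowed for error, otherwise the action the
 * parser suggested. */
static sk_action pick_action(sk_error error, sk_action suggested,
	sk_action wanted)
{
	return action_allowed(error, wanted) ? wanted : suggested;
}
```

In a real callback the result would be stored in the action field of the XmHTMLParserCallbackStruct, using the XmHTML_ constants instead of the stand-ins.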

XmNdocumentCallback

typedef struct
{
	int reason;		/* the reason the callback was called */
	XEvent *event;		/* always NULL for XmNdocumentCallback */
	Boolean html32;		/* True when document was HTML 3.2 conforming */
	Boolean verified;	/* True when document has been verified */
	Boolean balanced;	/* True when parser tree is balanced */
	int pass_level;		/* current parser level count. Starts at 0 */
	Boolean redo;		/* See below */
	XmHTMLObject *objects;	/* parser tree starting point */
}XmHTMLDocumentCallbackStruct;

Private Functions

XmHTML uses a number of functions to extract values from the attributes field of the XmHTMLObject structures. This section gives a brief overview of these functions, along with their prototypes. The functions are declared in the header file XmHTMLfuncs.h.
extern Boolean _XmHTMLTagCheck(char *attributes, char *tag);
Returns True when tag is present in the given attributes.

extern Boolean _XmHTMLTagCheckValue(char *attributes, char *tag, char *check);
Returns True when tag has the specified value check and False if not.

extern char *_XmHTMLTagGetValue(char *attributes, char *tag);
Returns the value of tag if found in the given attributes, NULL otherwise. The return value must be freed by the caller.

extern int _XmHTMLTagGetNumber(char *attributes, char *tag, int def);
Returns the numerical value of tag if found in the given attributes. def specifies the return value if tag is not found.
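
The contracts described above (a caller-owned return value, a default for missing tags) can be illustrated with simplified stand-ins. The sk_ functions below are hypothetical and are not the XmHTML implementations: they only handle attributes of the form name=value or name="value" separated by whitespace, do not allow whitespace around the equals sign, and assume that tag names are matched case-insensitively.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Finds tag as a whole word and returns a pointer just past it, or NULL.
 * Simplification: quoted values are not skipped, so a word inside a quoted
 * value that follows whitespace can match as well. */
static const char *sk_find_tag(const char *attributes, const char *tag)
{
	size_t len, i;
	const char *p;

	assert(tag != NULL);

	len = strlen(tag);
	if (attributes == NULL || len == 0)
		return NULL;
	for (p = attributes; *p != '\0'; p++) {
		if (p != attributes && !isspace((unsigned char)p[-1]))
			continue;	/* not at the start of a word */
		for (i = 0; i < len; i++)
			if (tolower((unsigned char)p[i]) !=
			    tolower((unsigned char)tag[i]))
				break;
		if (i == len && (p[len] == '\0' || p[len] == '=' ||
		    isspace((unsigned char)p[len])))
			return p + len;
	}
	return NULL;
}

/* Returns a malloc'd copy of the value of tag, or NULL when the tag is
 * absent or has no value. The caller must free the result. */
static char *sk_tag_get_value(const char *attributes, const char *tag)
{
	const char *p = sk_find_tag(attributes, tag);
	const char *end;
	char *value;
	size_t n;

	if (p == NULL || *p != '=')
		return NULL;
	p++;
	if (*p == '"') {
		p++;
		end = strchr(p, '"');
		if (end == NULL)
			end = p + strlen(p);	/* unterminated quote: take the rest */
	} else {
		for (end = p; *end != '\0' && !isspace((unsigned char)*end); end++)
			;
	}
	n = (size_t)(end - p);
	value = malloc(n + 1);
	if (value == NULL)
		return NULL;
	memcpy(value, p, n);
	value[n] = '\0';
	return value;
}

/* Returns the numerical value of tag, or def when the tag has no value. */
static int sk_tag_get_number(const char *attributes, const char *tag, int def)
{
	char *value = sk_tag_get_value(attributes, tag);
	int result = def;

	if (value != NULL) {
		result = atoi(value);
		free(value);	/* the lookup result is owned by the caller */
	}
	return result;
}
```

Note how sk_tag_get_number frees the intermediate string itself; code that calls _XmHTMLTagGetValue directly has the same obligation.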

The following function searches and expands any character escape sequences in the given string:

extern void _XmHTMLExpandEscapes(char *string);
This function recognizes all escape sequences from the ISO 8859-1 character set, as well as all &# character escapes below 160. Escape sequences are not required to have a terminating semicolon.
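
The general technique, expanding escapes in place with an optional terminating semicolon, can be sketched as follows. This is a simplified stand-in and not the XmHTML implementation: sk_expand_escapes is a hypothetical name, the table holds only five named escapes instead of the full ISO 8859-1 set, and numeric escapes are accepted for any value from 1 to 255 rather than only below 160.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* A deliberately small table of named escapes for this sketch. */
static const struct {
	const char *name;
	unsigned char ch;
} sk_escapes[] = {
	{ "amp", '&' }, { "lt", '<' }, { "gt", '>' }, { "quot", '"' },
	{ "copy", 169 }	/* ISO 8859-1 copyright sign */
};

/* Expands escapes in place. The result is never longer than the input, so
 * no extra storage is needed. Unknown escapes are copied unchanged. */
static void sk_expand_escapes(char *string)
{
	char *src = string, *dst = string;

	assert(string != NULL);

	while (*src != '\0') {
		size_t i, used = 0;
		int ch = 0;

		if (src[0] == '&' && src[1] == '#' &&
		    isdigit((unsigned char)src[2])) {
			const char *p = src + 2;
			int value = 0;

			while (isdigit((unsigned char)*p) && value < 256)
				value = value * 10 + (*p++ - '0');
			if (value > 0 && value < 256 &&
			    !isdigit((unsigned char)*p)) {
				ch = value;
				used = (size_t)(p - src);
			}
		} else if (src[0] == '&') {
			for (i = 0; i < sizeof sk_escapes / sizeof sk_escapes[0]; i++) {
				size_t len = strlen(sk_escapes[i].name);

				if (strncmp(src + 1, sk_escapes[i].name, len) == 0) {
					ch = sk_escapes[i].ch;
					used = 1 + len;
					break;
				}
			}
		}
		if (used > 0) {
			if (src[used] == ';')
				used++;		/* the semicolon is optional */
			*dst++ = (char)ch;
			src += used;
		} else {
			*dst++ = *src++;	/* ordinary character or unknown escape */
		}
	}
	*dst = '\0';
}
```

Because the prototype of _XmHTMLExpandEscapes returns void and takes a non-const string, the sketch assumes that the real function also rewrites its argument in place; callers that need the unexpanded text should copy it first.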

XmHTML Element Identifiers

This table lists the internal identifiers, the name of the corresponding HTML element and whether an element is terminated or not. It includes the complete set of HTML 3.2 elements, as well as a small number of extensions.
id            Element     Terminated  id             Element     Terminated
HT_DOCTYPE    !doctype    False       HT_A           a           True
HT_ADDRESS    address     True        HT_APPLET      applet      True
HT_AREA       area        False       HT_B           b           True
HT_BASE       base        False       HT_BASEFONT    basefont    False
HT_BIG        big         True        HT_BLOCKQUOTE  blockquote  True
HT_BODY       body        True        HT_BR          br          False
HT_CAPTION    caption     True        HT_CENTER      center      True
HT_CITE       cite        True        HT_CODE        code        True
HT_DD         dd          True        HT_DFN         dfn         True
HT_DIR        dir         True        HT_DIV         div         True
HT_DL         dl          True        HT_DT          dt          True
HT_EM         em          True        HT_FONT        font        True
HT_FORM       form        True        HT_FRAME       frame       True
HT_FRAMESET   frameset    True        HT_H1          h1          True
HT_H2         h2          True        HT_H3          h3          True
HT_H4         h4          True        HT_H5          h5          True
HT_H6         h6          True        HT_HEAD        head        True
HT_HR         hr          False       HT_HTML        html        True
HT_I          i           True        HT_IMG         img         False
HT_INPUT      input       False       HT_ISINDEX     isindex     False
HT_KBD        kbd         True        HT_LI          li          True
HT_LINK       link        False       HT_MAP         map         True
HT_MENU       menu        True        HT_META        meta        False
HT_NOFRAMES   noframes    True        HT_OL          ol          True
HT_OPTION     option      True        HT_P           p           True
HT_PARAM      param       False       HT_PRE         pre         True
HT_SAMP       samp        True        HT_SCRIPT      script      True
HT_SELECT     select      True        HT_SMALL       small       True
HT_STRIKE     strike      True        HT_STRONG      strong      True
HT_STYLE      style       True        HT_SUB         sub         True
HT_SUP        sup         True        HT_TAB         tab         False
HT_TABLE      table       True        HT_TD          td          True
HT_TEXTAREA   textarea    True        HT_TH          th          True
HT_TITLE      title       True        HT_TR          tr          True
HT_TT         tt          True        HT_U           u           True
HT_UL         ul          True        HT_VAR         var         True
HT_ZTEXT      plain text  False




©Copyright 1996-1997 by Ripley Software Development
Last update: September 19, 1997 by Koen