plan9port

fork of plan9port with libvec, libstr and libsdb

html.3 (29268B)


      1 .TH HTML 3
      2 .SH NAME
      3 parsehtml,
      4 printitems,
      5 validitems,
      6 freeitems,
      7 freedocinfo,
      8 dimenkind,
      9 dimenspec,
     10 targetid,
     11 targetname,
     12 fromStr,
     13 toStr
     14 \- HTML parser
     15 .SH SYNOPSIS
     16 .nf
     17 .PP
     18 .ft L
     19 #include <u.h>
     20 #include <libc.h>
     21 #include <html.h>
     22 .ft P
     23 .PP
     24 .ta \w'\fLToken* 'u
     25 .B
     26 Item*	parsehtml(uchar* data, int datalen, Rune* src, int mtype,
     27 .B
     28 	int chset, Docinfo** pdi)
     29 .PP
     30 .B
     31 void	printitems(Item* items, char* msg)
     32 .PP
     33 .B
     34 int	validitems(Item* items)
     35 .PP
     36 .B
     37 void	freeitems(Item* items)
     38 .PP
     39 .B
     40 void	freedocinfo(Docinfo* d)
     41 .PP
     42 .B
     43 int	dimenkind(Dimen d)
     44 .PP
     45 .B
     46 int	dimenspec(Dimen d)
     47 .PP
     48 .B
     49 int	targetid(Rune* s)
     50 .PP
     51 .B
     52 Rune*	targetname(int targid)
     53 .PP
     54 .B
     55 uchar*	fromStr(Rune* buf, int n, int chset)
     56 .PP
     57 .B
     58 Rune*	toStr(uchar* buf, int n, int chset)
     59 .SH DESCRIPTION
     60 .PP
     61 This library implements a parser for HTML 4.0 documents.
     62 The parsed HTML is converted into an intermediate representation that
     63 describes how the formatted HTML should be laid out.
     64 .PP
     65 .I Parsehtml
     66 parses an entire HTML document contained in the buffer
     67 .I data
     68 and having length
     69 .IR datalen .
     70 The URL of the document should be passed in as
     71 .IR src .
     72 .I Mtype
     73 is the media type of the document, which should be either
     74 .B TextHtml
     75 or
     76 .BR TextPlain .
     77 The character set of the document is described in
     78 .IR chset ,
     79 which can be one of
     80 .BR US_Ascii ,
     81 .BR ISO_8859_1 ,
     82 .B UTF_8
     83 or
     84 .BR Unicode .
     85 The return value is a linked list of
     86 .B Item
     87 structures, described in detail below.
     88 As a side effect, 
     89 .BI * pdi
     90 is set to point to a newly created
     91 .B Docinfo
     92 structure, containing information pertaining to the entire document.
     93 .PP
     94 The library expects two allocation routines to be provided by the
     95 caller,
     96 .B emalloc
     97 and
     98 .BR erealloc .
     99 These routines are analogous to the standard malloc and realloc routines,
    100 except that they should not return if the memory allocation fails.
    101 In addition,
    102 .B emalloc
    103 is required to zero the memory.
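A minimal sketch of the two allocators the caller must supply, written in portable C. The prototypes are an assumption (the page does not give them, so check html.h for the exact argument types), and aborting is only one way of satisfying "should not return" on failure.

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate n zeroed bytes; does not return if allocation fails. */
void*
emalloc(size_t n)
{
	void *p;

	p = calloc(1, n ? n : 1);	/* calloc zeroes the memory, as required */
	if(p == NULL){
		fprintf(stderr, "emalloc: out of memory\n");
		abort();
	}
	return p;
}

/* Resize p to n bytes; does not return if allocation fails. */
void*
erealloc(void *p, size_t n)
{
	void *q;

	q = realloc(p, n ? n : 1);
	if(q == NULL){
		fprintf(stderr, "erealloc: out of memory\n");
		abort();
	}
	return q;
}
```

Only emalloc has to zero its result; memory added by erealloc is left uninitialized in this sketch.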
    104 .PP
    105 For debugging purposes,
    106 .I printitems
    107 may be called to display the contents of an item list; individual items may
    108 be printed using the
    109 .B %I
    110 print verb, installed on the first call to
    111 .IR parsehtml .
    112 .I validitems
    113 traverses the item list, checking that all of the pointers are valid.
    114 It returns
    115 .B 1
    116 if everything is ok, and
    117 .B 0
    118 if an error was found.
    119 Normally, one would not call these routines directly.
    120 Instead, one sets the global variable
    121 .I dbgbuild
    122 and the library calls them automatically.
    123 One can also set
    124 .IR warn ,
    125 to cause the library to print a warning whenever it finds a problem with the
    126 input document, and
    127 .IR dbglex ,
    128 to print debugging information in the lexer.
    129 .PP
    130 When an item list is finished with, it should be freed with
    131 .IR freeitems .
    132 Then,
    133 .I freedocinfo
    134 should be called on the pointer returned in
    135 .BI * pdi\f1.
    136 .PP
    137 .I Dimenkind
    138 and
    139 .I dimenspec
    140 are provided to interpret the
    141 .B Dimen
    142 type, as described in the section
    143 .IR "Dimension Specifications" .
    144 .PP
    145 Frame target names are mapped to integer ids via a global, permanent mapping.
    146 To find the value for a given name, call
    147 .IR targetid ,
    148 which allocates a new id if the name hasn't been seen before.
    149 The name of a given, known id may be retrieved using
    150 .IR targetname .
    151 The library predefines
    152 .BR FTtop ,
    153 .BR FTself ,
    154 .B FTparent
    155 and
    156 .BR FTblank .
    157 .PP
    158 The library handles all text as Unicode strings (type
    159 .BR Rune* ).
    160 Character set conversion is provided by
    161 .I fromStr
    162 and
    163 .IR toStr .
    164 .I FromStr
    165 takes
    166 .I n
    167 Unicode characters from
    168 .I buf
    169 and converts them to the character set described by
    170 .IR chset .
    171 .I ToStr
    172 takes
    173 .I n
    174 bytes from
    175 .IR buf ,
    176 interpreted as belonging to character set
    177 .IR chset ,
    178 and converts them to a Unicode string.
    179 Both routines null-terminate the result, and use
    180 .B emalloc
    181 to allocate space for it.
    182 .SS Items
    183 The return value of
    184 .I parsehtml
    185 is a linked list of variant structures,
    186 with the generic portion described by the following definition:
    187 .PP
    188 .EX
    189 .ta 6n +\w'Genattr* 'u
    190 typedef struct Item Item;
    191 struct Item
    192 {
    193 	Item*	next;
    194 	int	width;
    195 	int	height;
    196 	int	ascent;
    197 	int	anchorid;
    198 	int	state;
    199 	Genattr*	genattr;
    200 	int	tag;
    201 };
    202 .EE
    203 .PP
    204 The field
    205 .B next
    206 points to the successor in the linked list of items, while
    207 .BR width ,
    208 .BR height ,
    209 and
    210 .B ascent
    211 are intended for use by the caller as part of the layout process.
    212 .BR Anchorid ,
    213 if non-zero, gives the integer id assigned by the parser to the anchor that
    214 this item is in (see section
    215 .IR Anchors ).
    216 .B State
    217 is a collection of flags and values described as follows:
    218 .PP
    219 .EX
    220 .ta 6n +\w'IFindentshift = 'u
    221 enum
    222 {
    223 	IFbrk =	0x80000000,
    224 	IFbrksp =	0x40000000,
    225 	IFnobrk =	0x20000000,
    226 	IFcleft =	0x10000000,
    227 	IFcright =	0x08000000,
    228 	IFwrap =	0x04000000,
    229 	IFhang =	0x02000000,
    230 	IFrjust =	0x01000000,
    231 	IFcjust =	0x00800000,
    232 	IFsmap =	0x00400000,
    233 	IFindentshift =	8,
    234 	IFindentmask =	(255<<IFindentshift),
    235 	IFhangmask =	255
    236 };
    237 .EE
    238 .PP
    239 .B IFbrk
    240 is set if a break is to be forced before placing this item.
    241 .B IFbrksp
    242 is set if a 1 line space should be added to the break (in which case
    243 .B IFbrk
    244 is also set).
    245 .B IFnobrk
    246 is set if a break is not permitted before the item.
    247 .B IFcleft
    248 is set if left floats should be cleared (that is, if the list of pending left floats should be placed)
    249 before this item is placed, and
    250 .B IFcright
    251 is set for right floats.
    252 In both cases, IFbrk is also set.
    253 .B IFwrap
    254 is set if the line containing this item is allowed to wrap.
    255 .B IFhang
    256 is set if this item hangs into the left indent.
    257 .B IFrjust
    258 is set if the line containing this item should be right justified,
    259 and
    260 .B IFcjust
    261 is set for center justified lines.
    262 .B IFsmap
    263 is used to indicate that an image is a server-side map.
    264 The low 8 bits, represented by
    265 .BR IFhangmask ,
    266 indicate the current hang into the left indent, in tenths of a tab stop.
    267 The next 8 bits, represented by
    268 .B IFindentmask
    269 and
    270 .BR IFindentshift ,
    271 indicate the current indent in tab stops.
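The packed indent and hang fields can be pulled out of state with the masks from the enum above. The following is a small sketch; the helper names stateindent, statehang and setindent are hypothetical and are not part of the library.

```c
/* Constants copied from the enum above; only those needed here. */
enum
{
	IFindentshift = 8,
	IFindentmask = (255<<IFindentshift),
	IFhangmask = 255
};

/* Current indent, in tab stops (bits 8 to 15 of state). */
int
stateindent(int state)
{
	return (state & IFindentmask) >> IFindentshift;
}

/* Current hang into the left indent, in tenths of a tab stop (bits 0 to 7). */
int
statehang(int state)
{
	return state & IFhangmask;
}

/* Return state with its indent field replaced by indent (0 to 255). */
int
setindent(int state, int indent)
{
	return (state & ~IFindentmask) | ((indent & 255) << IFindentshift);
}
```

The flag bits above bit 15 are untouched by these helpers, so they can be tested separately with expressions such as state & IFwrap.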
    272 .PP
    273 The field
    274 .B genattr
    275 is an optional pointer to an auxiliary structure, described in the section
    276 .IR "Generic Attributes" .
    277 .PP
    278 Finally,
    279 .B tag
    280 describes which variant type this item has.
    281 It can have one of the values
    282 .BR Itexttag ,
    283 .BR Iruletag ,
    284 .BR Iimagetag ,
    285 .BR Iformfieldtag ,
    286 .BR Itabletag ,
    287 .B Ifloattag
    288 or
    289 .BR Ispacertag .
    290 For each of these values, there is an additional structure defined, which
    291 includes Item as an unnamed initial substructure, and then defines additional
    292 fields.
    293 .PP
    294 Items of type
    295 .B Itexttag
    296 represent a piece of text, using the following structure:
    297 .PP
    298 .EX
    299 .ta 6n +\w'Rune* 'u
    300 struct Itext
    301 {
    302 	Item;
    303 	Rune*	s;
    304 	int	fnt;
    305 	int	fg;
    306 	uchar	voff;
    307 	uchar	ul;
    308 };
    309 .EE
    310 .PP
    311 Here
    312 .B s
    313 is a null-terminated Unicode string of the actual characters making up this text item,
    314 .B fnt
    315 is the font number (described in the section
    316 .IR "Font Numbers" ),
    317 and
    318 .B fg
    319 is the RGB encoded color for the text.
    320 .B Voff
    321 measures the vertical offset from the baseline; subtract
    322 .B Voffbias
    323 to get the actual value (negative values represent a displacement down the page).
    324 The field
    325 .B ul
    326 is the underline style:
    327 .B ULnone
    328 if no underline,
    329 .B ULunder
    330 for conventional underline, and
    331 .B ULmid
    332 for strike-through.
    333 .PP
    334 Items of type
    335 .B Iruletag
    336 represent a horizontal rule, as follows:
    337 .PP
    338 .EX
    339 .ta 6n +\w'Dimen 'u
    340 struct Irule
    341 {
    342 	Item;
    343 	uchar	align;
    344 	uchar	noshade;
    345 	int	size;
    346 	Dimen	wspec;
    347 };
    348 .EE
    349 .PP
    350 Here
    351 .B align
    352 is the alignment specification (described in the corresponding section),
    353 .B noshade
    354 is set if the rule should not be shaded,
    355 .B size
    356 is the height of the rule (as set by the size attribute),
    357 and
    358 .B wspec
    359 is the desired width (see section
    360 .IR "Dimension Specifications" ).
    361 .PP
    362 Items of type
    363 .B Iimagetag
    364 describe embedded images, for which the following structure is defined:
    365 .PP
    366 .EX
    367 .ta 6n +\w'Iimage* 'u
    368 struct Iimage
    369 {
    370 	Item;
    371 	Rune*	imsrc;
    372 	int	imwidth;
    373 	int	imheight;
    374 	Rune*	altrep;
    375 	Map*	map;
    376 	int	ctlid;
    377 	uchar	align;
    378 	uchar	hspace;
    379 	uchar	vspace;
    380 	uchar	border;
    381 	Iimage*	nextimage;
    382 };
    383 .EE
    384 .PP
    385 Here
    386 .B imsrc
    387 is the URL of the image source,
    388 .B imwidth
    389 and
    390 .BR imheight ,
    391 if non-zero, contain the specified width and height for the image,
    392 and
    393 .B altrep
    394 is the text to use as an alternative to the image, if the image is not displayed.
    395 .BR Map ,
    396 if set, points to a structure describing an associated client-side image map.
    397 .B Ctlid
    398 is reserved for use by the application, for handling animated images.
    399 .B Align
    400 encodes the alignment specification of the image.
    401 .B Hspace
    402 contains the number of pixels to pad the image with on either side, and
    403 .B Vspace
    404 the padding above and below.
    405 .B Border
    406 is the width of the border to draw around the image.
    407 .B Nextimage
    408 points to the next image in the document (the head of this list is
    409 .BR Docinfo.images ).
    410 .PP
    411 For items of type
    412 .BR Iformfieldtag ,
    413 the following structure is defined:
    414 .PP
    415 .EX
    416 .ta 6n +\w'Formfield* 'u
    417 struct Iformfield
    418 {
    419 	Item;
    420 	Formfield*	formfield;
    421 };
    422 .EE
    423 .PP
    424 This adds a single field,
    425 .BR formfield ,
    426 which points to a structure describing a field in a form, described in section
    427 .IR Forms .
    428 .PP
    429 For items of type
    430 .BR Itabletag ,
    431 the following structure is defined:
    432 .PP
    433 .EX
    434 .ta 6n +\w'Table* 'u
    435 struct Itable
    436 {
    437 	Item;
    438 	Table*	table;
    439 };
    440 .EE
    441 .PP
    442 .B Table
    443 points to a structure describing the table, described in the section
    444 .IR Tables .
    445 .PP
    446 For items of type
    447 .BR Ifloattag ,
    448 the following structure is defined:
    449 .PP
    450 .EX
    451 .ta 6n +\w'Ifloat* 'u
    452 struct Ifloat
    453 {
    454 	Item;
    455 	Item*	item;
    456 	int	x;
    457 	int	y;
    458 	uchar	side;
    459 	uchar	infloats;
    460 	Ifloat*	nextfloat;
    461 };
    462 .EE
    463 .PP
    464 The
    465 .B item
    466 points to a single item (either a table or an image) that floats (the text of the
    467 document flows around it), and
    468 .B side
    469 indicates the margin that this float sticks to; it is either
    470 .B ALleft
    471 or
    472 .BR ALright .
    473 .B X
    474 and
    475 .B y
    476 are reserved for use by the caller; these are typically used for the coordinates
    477 of the top of the float.
    478 .B Infloats
    479 is used by the caller to keep track of whether it has placed the float.
    480 .B Nextfloat
    481 is used by the caller to link together all of the floats that it has placed.
    482 .PP
    483 For items of type
    484 .BR Ispacertag ,
    485 the following structure is defined:
    486 .PP
    487 .EX
    488 .ta 6n +\w'Item; 'u
    489 struct Ispacer
    490 {
    491 	Item;
    492 	int	spkind;
    493 };
    494 .EE
    495 .PP
    496 .B Spkind
    497 encodes the kind of spacer, and may be one of
    498 .B ISPnull
    499 (zero height and width),
    500 .B ISPvline
    501 (takes on height and ascent of the current font),
    502 .B ISPhspace
    503 (has the width of a space in the current font) and
    504 .B ISPgeneral
    505 (for all other purposes, such as between markers and lists).
    506 .SS Generic Attributes
    507 .PP
    508 The genattr field of an item, if non-nil, points to a structure that holds
    509 the values of attributes not specific to any particular
    510 item type, as they occur on a wide variety of underlying HTML tags.
    511 The structure is as follows:
    512 .PP
    513 .EX
    514 .ta 6n +\w'SEvent* 'u
    515 typedef struct Genattr Genattr;
    516 struct Genattr
    517 {
    518 	Rune*	id;
    519 	Rune*	class;
    520 	Rune*	style;
    521 	Rune*	title;
    522 	SEvent*	events;
    523 };
    524 .EE
    525 .PP
    526 Fields
    527 .BR id ,
    528 .BR class ,
    529 .B style
    530 and
    531 .BR title ,
    532 when non-nil, contain values of correspondingly named attributes of the HTML tag
    533 associated with this item.
    534 .B Events
    535 is a linked list of events (with corresponding scripted actions) associated with the item:
    536 .PP
    537 .EX
    538 .ta 6n +\w'SEvent* 'u
    539 typedef struct SEvent SEvent;
    540 struct SEvent
    541 {
    542 	SEvent*	next;
    543 	int	type;
    544 	Rune*	script;
    545 };
    546 .EE
    547 .PP
    548 Here,
    549 .B next
    550 points to the next event in the list,
    551 .B type
    552 is one of
    553 .BR SEonblur ,
    554 .BR SEonchange ,
    555 .BR SEonclick ,
    556 .BR SEondblclick ,
    557 .BR SEonfocus ,
    558 .BR SEonkeypress ,
    559 .BR SEonkeyup ,
    560 .BR SEonload ,
    561 .BR SEonmousedown ,
    562 .BR SEonmousemove ,
    563 .BR SEonmouseout ,
    564 .BR SEonmouseover ,
    565 .BR SEonmouseup ,
    566 .BR SEonreset ,
    567 .BR SEonselect ,
    568 .B SEonsubmit
    569 or
    570 .BR SEonunload ,
    571 and
    572 .B script
    573 is the text of the associated script.
    574 .SS Dimension Specifications
    575 .PP
    576 Some structures include a dimension specification, used where
    577 a number can be followed by a
    578 .B %
    579 or a
    580 .B *
    581 to indicate
    582 percentage of total or relative weight.
    583 This is encoded using the following structure:
    584 .PP
    585 .EX
    586 .ta 6n +\w'int 'u
    587 typedef struct Dimen Dimen;
    588 struct Dimen
    589 {
    590 	int	kindspec;
    591 };
    592 .EE
    593 .PP
    594 Separate kind and spec values are extracted using
    595 .I dimenkind
    596 and
    597 .IR dimenspec .
    598 .I Dimenkind
    599 returns one of
    600 .BR Dnone ,
    601 .BR Dpixels ,
    602 .B Dpercent
    603 or
    604 .BR Drelative .
    605 .B Dnone
    606 means that no dimension was specified.
    607 In all other cases,
    608 .I dimenspec
    609 should be called to find the absolute number of pixels, the percentage of total,
    610 or the relative weight.
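As an illustration of how a caller might turn a dimension into pixels, the sketch below takes the values that dimenkind and dimenspec would return. It is self-contained, so the enum values are placeholders rather than the library's, and resolvedimen is a hypothetical helper. Sharing the available width in proportion to the relative weights is one possible policy, not something the library prescribes.

```c
/* Kinds as named by dimenkind; the numeric values here are placeholders. */
enum { Dnone, Dpixels, Dpercent, Drelative };

/*
 * Resolve a dimension to pixels.
 * kind and spec stand for the results of dimenkind(d) and dimenspec(d),
 * avail is the width available to the item, and totweight is the sum of
 * the relative weights competing for avail.
 * Returns -1 if no usable dimension was specified.
 */
int
resolvedimen(int kind, int spec, int avail, int totweight)
{
	switch(kind){
	case Dpixels:
		return spec;			/* absolute number of pixels */
	case Dpercent:
		return avail*spec/100;		/* percentage of total */
	case Drelative:
		if(totweight <= 0)
			return -1;
		return avail*spec/totweight;	/* share by relative weight */
	default:
		return -1;			/* Dnone or unknown kind */
	}
}
```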
    611 .SS Background Specifications
    612 .PP
    613 It is possible to set the background of the entire document, and also
    614 for some parts of the document (such as tables).
    615 This is encoded as follows:
    616 .PP
    617 .EX
    618 .ta 6n +\w'Rune* 'u
    619 typedef struct Background Background;
    620 struct Background
    621 {
    622 	Rune*	image;
    623 	int	color;
    624 };
    625 .EE
    626 .PP
    627 .BR Image ,
    628 if non-nil, is the URL of an image to use as the background.
    629 If this is nil,
    630 .B color
    631 is used instead, as the RGB value for a solid fill color.
    632 .SS Alignment Specifications
    633 .PP
    634 Certain items have alignment specifiers taken from the following
    635 enumerated type:
    636 .PP
    637 .EX
    638 .ta 6n
    639 enum
    640 {
    641 	ALnone = 0, ALleft, ALcenter, ALright, ALjustify,
    642 	ALchar, ALtop, ALmiddle, ALbottom, ALbaseline
    643 };
    644 .EE
    645 .PP
    646 These values correspond to the various alignment types named in the HTML 4.0
    647 standard.
    648 If an item has an alignment of
    649 .B ALleft
    650 or
    651 .BR ALright ,
    652 the library automatically encapsulates it inside a float item.
    653 .PP
    654 Tables, and the various rows, columns and cells within them, have a more
    655 complex alignment specification, composed of separate vertical and
    656 horizontal alignments:
    657 .PP
    658 .EX
    659 .ta 6n +\w'uchar 'u
    660 typedef struct Align Align;
    661 struct Align
    662 {
    663 	uchar	halign;
    664 	uchar	valign;
    665 };
    666 .EE
    667 .PP
    668 .B Halign
    669 can be one of
    670 .BR ALnone ,
    671 .BR ALleft ,
    672 .BR ALcenter ,
    673 .BR ALright ,
    674 .B ALjustify
    675 or
    676 .BR ALchar .
    677 .B Valign
    678 can be one of
    679 .BR ALnone ,
    680 .BR ALmiddle ,
    681 .BR ALbottom ,
    682 .B ALtop
    683 or
    684 .BR ALbaseline .
    685 .SS Font Numbers
    686 .PP
    687 Text items have an associated font number (the
    688 .B fnt
    689 field), which is encoded as
    690 .BR style*NumSize+size .
    691 Here,
    692 .B style
    693 is one of
    694 .BR FntR ,
    695 .BR FntI ,
    696 .B FntB
    697 or
    698 .BR FntT ,
    699 for roman, italic, bold and typewriter font styles, respectively, and size is
    700 .BR Tiny ,
    701 .BR Small ,
    702 .BR Normal ,
    703 .B Large
    704 or
    705 .BR Verylarge .
    706 The total number of possible font numbers is
    707 .BR NumFnt ,
    708 and the default font number is
    709 .B DefFnt
    710 (which is roman style, normal size).
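The encoding style*NumSize+size can be packed and unpacked as sketched below. The numbering is an assumption: styles and sizes are taken to count up from zero in the order listed above, and NumStyle and the helper names are introduced only for this example. The library's own constants in html.h are authoritative.

```c
/* Assumed numbering: consecutive from zero, in the order listed above. */
enum { FntR, FntI, FntB, FntT, NumStyle };
enum { Tiny, Small, Normal, Large, Verylarge, NumSize };
enum
{
	NumFnt = NumStyle*NumSize,	/* total number of font numbers */
	DefFnt = FntR*NumSize+Normal	/* roman style, normal size */
};

/* Combine a style and a size into a font number. */
int
mkfnt(int style, int size)
{
	return style*NumSize + size;
}

/* Recover the style from a font number. */
int
fntstyle(int fnt)
{
	return fnt / NumSize;
}

/* Recover the size from a font number. */
int
fntsize(int fnt)
{
	return fnt % NumSize;
}
```

A caller that keeps one loaded font per font number can index an array of NumFnt entries directly with the fnt field.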
    711 .SS Document Info
    712 .PP
    713 Global information about an HTML page is stored in the following structure:
    714 .PP
    715 .EX
    716 .ta 6n +\w'DestAnchor* 'u
    717 typedef struct Docinfo Docinfo;
    718 struct Docinfo
    719 {
    720 	// stuff from HTTP headers, doc head, and body tag
    721 	Rune*	src;
    722 	Rune*	base;
    723 	Rune*	doctitle;
    724 	Background	background;
    725 	Iimage*	backgrounditem;
    726 	int	text;
    727 	int	link;
    728 	int	vlink;
    729 	int	alink;
    730 	int	target;
    731 	int	chset;
    732 	int	mediatype;
    733 	int	scripttype;
    734 	int	hasscripts;
    735 	Rune*	refresh;
    736 	Kidinfo*	kidinfo;
    737 	int	frameid;
    738 
    739 	// info needed to respond to user actions
    740 	Anchor*	anchors;
    741 	DestAnchor*	dests;
    742 	Form*	forms;
    743 	Table*	tables;
    744 	Map*	maps;
    745 	Iimage*	images;
    746 };
    747 .EE
    748 .PP
    749 .B Src
    750 gives the URL of the original source of the document,
    751 and
    752 .B base
    753 is the base URL.
    754 .B Doctitle
    755 is the document's title, as set by a
    756 .B <title>
    757 element.
    758 .B Background
    759 is as described in the section
    760 .IR "Background Specifications" ,
    761 and
    762 .B backgrounditem
    763 is set to be an image item for the document's background image (if given as a URL),
    764 or else nil.
    765 .B Text
    766 gives the default foreground text color of the document,
    767 .B link
    768 the unvisited hyperlink color,
    769 .B vlink
    770 the visited hyperlink color, and
    771 .B alink
    772 the color for highlighting hyperlinks (all in 24-bit RGB format).
    773 .B Target
    774 is the default target frame id.
    775 .B Chset
    776 and
    777 .B mediatype
    778 are as for the
    779 .I chset
    780 and
    781 .I mtype
    782 parameters to
    783 .IR parsehtml .
    784 .B Scripttype
    785 is the type of any scripts contained in the document, and is always
    786 .BR TextJavascript .
    787 .B Hasscripts
    788 is set if the document contains any scripts.
    789 Scripting is currently unsupported.
    790 .B Refresh
    791 is the contents of a
    792 .B "<meta http-equiv=Refresh ...>"
    793 tag, if any.
    794 .B Kidinfo
    795 is set if this document is a frameset (see section
    796 .IR Frames ).
    797 .B Frameid
    798 is this document's frame id.
    799 .PP
    800 .B Anchors
    801 is a list of hyperlinks contained in the document,
    802 and
    803 .B dests
    804 is a list of hyperlink destinations within the page (see the following section for details).
    805 .BR Forms ,
    806 .B tables
    807 and
    808 .B maps
    809 are lists of the various forms, tables and client-side maps contained
    810 in the document, as described in subsequent sections.
    811 .B Images
    812 is a list of all the image items in the document.
    813 .SS Anchors
    814 .PP
    815 The library builds two lists for all of the
    816 .B <a>
    817 elements (anchors) in a document.
    818 Each anchor is assigned a unique anchor id within the document.
    819 For anchors which are hyperlinks (the
    820 .B href
    821 attribute was supplied), the following structure is defined:
    822 .PP
    823 .EX
    824 .ta 6n +\w'Anchor* 'u
    825 typedef struct Anchor Anchor;
    826 struct Anchor
    827 {
    828 	Anchor*	next;
    829 	int	index;
    830 	Rune*	name;
    831 	Rune*	href;
    832 	int	target;
    833 };
    834 .EE
    835 .PP
    836 .B Next
    837 points to the next anchor in the list (the head of this list is
    838 .BR Docinfo.anchors ).
    839 .B Index
    840 is the anchor id; each item within this hyperlink is tagged with this value
    841 in its
    842 .B anchorid
    843 field.
    844 .B Name
    845 and
    846 .B href
    847 are the values of the correspondingly named attributes of the anchor
    848 (in particular, href is the URL to go to).
    849 .B Target
    850 is the value of the target attribute (if provided) converted to a frame id.
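Given an item's anchorid, the matching hyperlink can be found by walking the list headed by Docinfo.anchors. The sketch below repeats the Anchor definition so that it compiles on its own; the Rune typedef is a stand-in for the Plan 9 type and findanchor is a hypothetical helper, not a library routine.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the Plan 9 Rune type */

typedef struct Anchor Anchor;
struct Anchor
{
	Anchor*	next;
	int	index;
	Rune*	name;
	Rune*	href;
	int	target;
};

/* Return the anchor whose index equals anchorid, or NULL if there is none. */
Anchor*
findanchor(Anchor *list, int anchorid)
{
	Anchor *a;

	if(anchorid == 0)	/* zero means the item is not inside an anchor */
		return NULL;
	for(a = list; a != NULL; a = a->next)
		if(a->index == anchorid)
			return a;
	return NULL;
}
```

A click handler could call findanchor(di->anchors, it->anchorid), with di and it being the caller's Docinfo and Item pointers, and follow href when the result is non-nil.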
    851 .PP
    852 Destinations within the document (anchors with the name attribute set)
    853 are held in the
    854 .B Docinfo.dests
    855 list, using the following structure:
    856 .PP
    857 .EX
    858 .ta 6n +\w'DestAnchor* 'u
    859 typedef struct DestAnchor DestAnchor;
    860 struct DestAnchor
    861 {
    862 	DestAnchor*	next;
    863 	int	index;
    864 	Rune*	name;
    865 	Item*	item;
    866 };
    867 .EE
    868 .PP
    869 .B Next
    870 is the next element of the list,
    871 .B index
    872 is the anchor id,
    873 .B name
    874 is the value of the name attribute, and
    875 .B item
    876 points to the item within the parsed document that should be considered
    877 to be the destination.
    878 .SS Forms
    879 .PP
    880 Any forms within a document are kept in a list, headed by
    881 .BR Docinfo.forms .
    882 The elements of this list are as follows:
    883 .PP
    884 .EX
    885 .ta 6n +\w'Formfield* 'u
    886 typedef struct Form Form;
    887 struct Form
    888 {
    889 	Form*	next;
    890 	int	formid;
    891 	Rune*	name;
    892 	Rune*	action;
    893 	int	target;
    894 	int	method;
    895 	int	nfields;
    896 	Formfield*	fields;
    897 };
    898 .EE
    899 .PP
    900 .B Next
    901 points to the next form in the list.
    902 .B Formid
    903 is a serial number for the form within the document.
    904 .B Name
    905 is the value of the form's name or id attribute.
    906 .B Action
    907 is the value of any action attribute.
    908 .B Target
    909 is the value of the target attribute (if any) converted to a frame target id.
    910 .B Method
    911 is one of
    912 .B HGet
    913 or
    914 .BR HPost .
    915 .B Nfields
    916 is the number of fields in the form, and
    917 .B fields
    918 is a linked list of the actual fields.
    919 .PP
    920 The individual fields in a form are described by the following structure:
    921 .PP
    922 .EX
    923 .ta 6n +\w'Formfield* 'u
    924 typedef struct Formfield Formfield;
    925 struct Formfield
    926 {
    927 	Formfield*	next;
    928 	int	ftype;
    929 	int	fieldid;
    930 	Form*	form;
    931 	Rune*	name;
    932 	Rune*	value;
    933 	int	size;
    934 	int	maxlength;
    935 	int	rows;
    936 	int	cols;
    937 	uchar	flags;
    938 	Option*	options;
    939 	Item*	image;
    940 	int	ctlid;
    941 	SEvent*	events;
    942 };
    943 .EE
    944 .PP
    945 Here,
    946 .B next
    947 points to the next field in the list.
    948 .B Ftype
    949 is the type of the field, which can be one of
    950 .BR Ftext ,
    951 .BR Fpassword ,
    952 .BR Fcheckbox ,
    953 .BR Fradio ,
    954 .BR Fsubmit ,
    955 .BR Fhidden ,
    956 .BR Fimage ,
    957 .BR Freset ,
    958 .BR Ffile ,
    959 .BR Fbutton ,
    960 .B Fselect
    961 or
    962 .BR Ftextarea .
    963 .B Fieldid
    964 is a serial number for the field within the form.
    965 .B Form
    966 points back to the form containing this field.
    967 .BR Name ,
    968 .BR value ,
    969 .BR size ,
    970 .BR maxlength ,
    971 .B rows
    972 and
    973 .B cols
    974 each contain the values of corresponding attributes of the field, if present.
    975 .B Flags
    976 contains per-field flags, of which
    977 .B FFchecked
    978 and
    979 .B FFmultiple
    980 are defined.
    981 .B Image
    982 is only used for fields of type
    983 .BR Fimage ;
    984 it points to an image item containing the image to be displayed.
    985 .B Ctlid
    986 is reserved for use by the caller, typically to store a unique id
    987 of an associated control used to implement the field.
    988 .B Events
    989 is the same as the corresponding field of the generic attributes
    990 associated with the item containing this field.
    991 .B Options
    992 is only used by fields of type
    993 .BR Fselect ;
    994 it consists of a list of possible options that may be selected for that
    995 field, using the following structure:
    996 .PP
    997 .EX
    998 .ta 6n +\w'Option* 'u
    999 typedef struct Option Option;
   1000 struct Option
   1001 {
   1002 	Option*	next;
   1003 	int	selected;
   1004 	Rune*	value;
   1005 	Rune*	display;
   1006 };
   1007 .EE
   1008 .PP
   1009 .B Next
   1010 points to the next element of the list.
   1011 .B Selected
   1012 is set if this option is to be displayed initially.
   1013 .B Value
   1014 is the value to send when the form is submitted if this option is selected.
   1015 .B Display
   1016 is the string to display on the screen for this option.
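When building the control for an Fselect field, the caller typically needs to know which options start out selected. The sketch below walks an Option list to find them. It repeats the structure so that it is self-contained; the Rune typedef is a stand-in for the Plan 9 type, and firstselected and countselected are hypothetical helpers.

```c
#include <stddef.h>

typedef unsigned int Rune;	/* stand-in for the Plan 9 Rune type */

typedef struct Option Option;
struct Option
{
	Option*	next;
	int	selected;
	Rune*	value;
	Rune*	display;
};

/* Return the first option marked selected, or NULL if none is. */
Option*
firstselected(Option *list)
{
	Option *o;

	for(o = list; o != NULL; o = o->next)
		if(o->selected)
			return o;
	return NULL;
}

/* Count the options marked selected; may exceed 1 for multiple-choice fields. */
int
countselected(Option *list)
{
	Option *o;
	int n;

	n = 0;
	for(o = list; o != NULL; o = o->next)
		if(o->selected)
			n++;
	return n;
}
```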
   1017 .SS Tables
   1018 .PP
   1019 The library builds a list of all the tables in the document,
   1020 headed by
   1021 .BR Docinfo.tables .
   1022 Each element of this list has the following format:
   1023 .PP
   1024 .EX
   1025 .ta 6n +\w'Tablecell*** 'u
   1026 typedef struct Table Table;
   1027 struct Table
   1028 {
   1029 	Table*	next;
   1030 	int	tableid;
   1031 	Tablerow*	rows;
   1032 	int	nrow;
   1033 	Tablecol*	cols;
   1034 	int	ncol;
   1035 	Tablecell*	cells;
   1036 	int	ncell;
   1037 	Tablecell***	grid;
   1038 	Align	align;
   1039 	Dimen	width;
   1040 	int	border;
   1041 	int	cellspacing;
   1042 	int	cellpadding;
   1043 	Background	background;
   1044 	Item*	caption;
   1045 	uchar	caption_place;
   1046 	Lay*	caption_lay;
   1047 	int	totw;
   1048 	int	toth;
   1049 	int	caph;
   1050 	int	availw;
   1051 	Token*	tabletok;
   1052 	uchar	flags;
   1053 };
   1054 .EE
   1055 .PP
   1056 .B Next
   1057 points to the next element in the list of tables.
   1058 .B Tableid
   1059 is a serial number for the table within the document.
   1060 .B Rows
   1061 is an array of row specifications (described below) and
   1062 .B nrow
   1063 is the number of elements in this array.
   1064 Similarly,
   1065 .B cols
   1066 is an array of column specifications, and
   1067 .B ncol
   1068 the size of this array.
   1069 .B Cells
   1070 is a list of all cells within the table (structure described below)
   1071 and
   1072 .B ncell
   1073 is the number of elements in this list.
   1074 Note that a cell may span multiple rows and/or columns, thus
   1075 .B ncell
   1076 may be smaller than
   1077 .BR nrow*ncol .
   1078 .B Grid
   1079 is a two-dimensional array of cells within the table; the cell
   1080 at row
   1081 .B i
   1082 and column
   1083 .B j
   1084 is
   1085 .BR Table.grid[i][j] .
   1086 A cell that spans multiple rows and/or columns will
   1087 be referenced by
   1088 .B grid
   1089 multiple times, but it occurs only once in
   1090 .BR cells .
   1091 .B Align
   1092 gives the alignment specification for the entire table,
   1093 and
   1094 .B width
   1095 gives the requested width as a dimension specification.
   1096 .BR Border ,
   1097 .B cellspacing
   1098 and
   1099 .B cellpadding
   1100 give the values of the corresponding attributes for the table,
   1101 and
   1102 .B background
   1103 gives the requested background for the table.
   1104 .B Caption
   1105 is a linked list of items to be displayed as the caption of the
   1106 table, either above or below depending on whether
   1107 .B caption_place
   1108 is
   1109 .B ALtop
   1110 or
   1111 .BR ALbottom .
   1112 Most of the remaining fields are reserved for use by the caller,
   1113 except
   1114 .BR tabletok ,
   1115 which is reserved for internal use.
   1116 The type
   1117 .B Lay
   1118 is not defined by the library; the caller can provide its
   1119 own definition.
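Because a spanning cell appears in several grid slots, code that walks grid usually needs to visit each cell once. The sketch below counts distinct cells by treating a slot as a cell's first occurrence only when neither the slot above nor the slot to the left holds the same pointer. This relies on the assumption that a span covers a rectangle of slots. The reduced Tablecell declaration and the countcells name are for illustration only.

```c
#include <stddef.h>

typedef struct Tablecell Tablecell;
struct Tablecell
{
	int	cellid;	/* the only field this sketch needs */
};

/*
 * Count the distinct cells in an nrow by ncol grid.
 * Within a rectangular span, only the top-left slot has a different
 * cell (or the table edge) both above it and to its left.
 */
int
countcells(Tablecell ***grid, int nrow, int ncol)
{
	int i, j, n;
	Tablecell *c;

	n = 0;
	for(i = 0; i < nrow; i++)
		for(j = 0; j < ncol; j++){
			c = grid[i][j];
			if(c == NULL)
				continue;	/* empty slot */
			if(i > 0 && grid[i-1][j] == c)
				continue;	/* continued from the row above */
			if(j > 0 && grid[i][j-1] == c)
				continue;	/* continued from the column to the left */
			n++;
		}
	return n;
}
```

If every cell of the table appears in the grid, the result should agree with ncell.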
   1120 .PP
   1121 The
   1122 .B Tablecol
   1123 structure is defined for use by the caller.
   1124 The library ensures that the correct number of these
   1125 is allocated, but leaves them blank.
   1126 The fields are as follows:
   1127 .PP
   1128 .EX
   1129 .ta 6n +\w'Point 'u
   1130 typedef struct Tablecol Tablecol;
   1131 struct Tablecol
   1132 {
   1133 	int	width;
   1134 	Align	align;
   1135 	Point	pos;
   1136 };
   1137 .EE
   1138 .PP
   1139 The rows in the table are specified as follows:
   1140 .PP
   1141 .EX
   1142 .ta 6n +\w'Background 'u
   1143 typedef struct Tablerow Tablerow;
   1144 struct Tablerow
   1145 {
   1146 	Tablerow*	next;
   1147 	Tablecell*	cells;
   1148 	int	height;
   1149 	int	ascent;
   1150 	Align	align;
   1151 	Background	background;
   1152 	Point	pos;
   1153 	uchar	flags;
   1154 };
   1155 .EE
   1156 .PP
   1157 .B Next
   1158 is only used during parsing; it should be ignored by the caller.
   1159 .B Cells
   1160 provides a list of all the cells in a row, linked through their
   1161 .B nextinrow
   1162 fields (see below).
   1163 .BR Height ,
   1164 .B ascent
   1165 and
   1166 .B pos
   1167 are reserved for use by the caller.
   1168 .B Align
   1169 is the alignment specification for the row, and
   1170 .B background
   1171 is the background to use, if specified.
   1172 .B Flags
   1173 is used by the parser; ignore this field.
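.PP
As a rough sketch of how a caller might follow the
.B nextinrow
links, the fragment below sums the column spans of the cells listed for one row.
The structures are cut-down stand-ins holding only the fields used here, not the real declarations from <html.h>, and
.B rowcolspan
is a hypothetical helper name.
.PP
```c
#include <assert.h>
#include <stddef.h>

/* Stand-in declarations: only the fields used below, not the real <html.h>. */
typedef struct Tablecell Tablecell;
struct Tablecell
{
	Tablecell*	nextinrow;
	int	colspan;
};

typedef struct Tablerow Tablerow;
struct Tablerow
{
	Tablecell*	cells;
};

/* Hypothetical helper: total colspan of the cells linked from r->cells. */
int
rowcolspan(Tablerow *r)
{
	Tablecell *c;
	int n;

	assert(r != NULL);
	n = 0;
	for(c = r->cells; c != NULL; c = c->nextinrow)
		n += c->colspan;
	return n;
}
```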
   1174 .PP
   1175 The individual cells of the table are described as follows:
   1176 .PP
   1177 .EX
   1178 .ta 6n +\w'Background 'u
   1179 typedef struct Tablecell Tablecell;
   1180 struct Tablecell
   1181 {
   1182 	Tablecell*	next;
   1183 	Tablecell*	nextinrow;
   1184 	int	cellid;
   1185 	Item*	content;
   1186 	Lay*	lay;
   1187 	int	rowspan;
   1188 	int	colspan;
   1189 	Align	align;
   1190 	uchar	flags;
   1191 	Dimen	wspec;
   1192 	int	hspec;
   1193 	Background	background;
   1194 	int	minw;
   1195 	int	maxw;
   1196 	int	ascent;
   1197 	int	row;
   1198 	int	col;
   1199 	Point	pos;
   1200 };
   1201 .EE
   1202 .PP
   1203 .B Next
   1204 is used to link together the list of all cells within a table
   1205 .RB ( Table.cells ),
   1206 whereas
   1207 .B nextinrow
   1208 is used to link together all the cells within a single row
   1209 .RB ( Tablerow.cells ).
   1210 .B Cellid
   1211 provides a serial number for the cell within the table.
   1212 .B Content
   1213 is a linked list of the items to be laid out within the cell.
   1214 .B Lay
   1215 is reserved for the user to describe how these items have
   1216 been laid out.
   1217 .B Rowspan
   1218 and
   1219 .B colspan
   1220 are the number of rows and columns spanned by this cell,
   1221 respectively.
   1222 .B Align
   1223 is the alignment specification for the cell.
   1224 .B Flags
   1225 is some combination of
   1226 .BR TFparsing ,
   1227 .B TFnowrap
   1228 and
   1229 .B TFisth
   1230 or'd together.
   1231 Here
   1232 .B TFparsing
   1233 is used internally by the parser, and should be ignored.
   1234 .B TFnowrap
   1235 means that the contents of the cell should not be
   1236 wrapped if they don't fit the available width;
   1237 instead, the table should be expanded if need be
   1238 (this is set when the nowrap attribute is supplied).
   1239 .B TFisth
   1240 means that the cell was created by the
   1241 .B <th>
   1242 element (rather than the
   1243 .B <td>
   1244 element),
   1245 indicating that it is a header cell rather than a data cell.
   1246 .B Wspec
   1247 provides a suggested width as a dimension specification,
   1248 and
   1249 .B hspec
   1250 provides a suggested height in pixels.
   1251 .B Background
   1252 gives a background specification for the individual cell.
   1253 .BR Minw ,
   1254 .BR maxw ,
   1255 .B ascent
   1256 and
   1257 .B pos
   1258 are reserved for use by the caller during layout.
   1259 .B Row
   1260 and
   1261 .B col
   1262 give the indices of the row and column of the top left-hand
   1263 corner of the cell within the table grid.
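.PP
Because a spanning cell appears in several grid slots, a caller walking the grid usually wants to act on each cell only once.
The sketch below does this by comparing each slot's indices with the cell's
.B row
and
.B col
fields, and separately counts header cells by following the
.B next
links.
The structures are reduced stand-ins, the flag values are assumed rather than copied from <html.h>, and
.B gridcells
and
.B headercells
are hypothetical helper names.
.PP
```c
#include <assert.h>
#include <stddef.h>

/* Assumed flag values, for illustration only; use the ones in <html.h>. */
enum { TFparsing = 1<<0, TFnowrap = 1<<1, TFisth = 1<<2 };

/* Stand-in declarations: only the fields used below. */
typedef struct Tablecell Tablecell;
struct Tablecell
{
	Tablecell*	next;
	int	rowspan;
	int	colspan;
	unsigned char	flags;
	int	row;
	int	col;
};

typedef struct Table Table;
struct Table
{
	int	nrow;
	int	ncol;
	int	ncell;
	Tablecell*	cells;
	Tablecell***	grid;
};

/* Count each cell once, at the slot holding its top left-hand corner. */
int
gridcells(Table *t)
{
	Tablecell *c;
	int i, j, n;

	assert(t != NULL);
	n = 0;
	for(i = 0; i < t->nrow; i++)
		for(j = 0; j < t->ncol; j++){
			c = t->grid[i][j];
			if(c != NULL && c->row == i && c->col == j)
				n++;
		}
	return n;
}

/* Count the cells created by <th>, following the table-wide list. */
int
headercells(Table *t)
{
	Tablecell *c;
	int n;

	assert(t != NULL);
	n = 0;
	for(c = t->cells; c != NULL; c = c->next)
		if(c->flags & TFisth)
			n++;
	return n;
}
```
.PP
For a well-formed table one would expect
.B gridcells
to agree with
.BR ncell ;
the nil check is a defensive assumption about empty slots.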
   1264 .SS Client-side Maps
   1265 .PP
   1266 The library builds a list of client-side maps, headed by
   1267 .BR Docinfo.maps ,
   1268 and having the following structure:
   1269 .PP
   1270 .EX
   1271 .ta 6n +\w'Rune* 'u
   1272 typedef struct Map Map;
   1273 struct Map
   1274 {
   1275 	Map*	next;
   1276 	Rune*	name;
   1277 	Area*	areas;
   1278 };
   1279 .EE
   1280 .PP
   1281 .B Next
   1282 points to the next element in the list,
   1283 .B name
   1284 is the name of the map (used to bind it to an image), and
   1285 .B areas
   1286 is a list of the areas within the image that comprise the map,
   1287 using the following structure:
   1288 .PP
   1289 .EX
   1290 .ta 6n +\w'Dimen* 'u
   1291 typedef struct Area Area;
   1292 struct Area
   1293 {
   1294 	Area*	next;
   1295 	int	shape;
   1296 	Rune*	href;
   1297 	int	target;
   1298 	Dimen*	coords;
   1299 	int	ncoords;
   1300 };
   1301 .EE
   1302 .PP
   1303 .B Next
   1304 points to the next element in the map's list of areas.
   1305 .B Shape
   1306 describes the shape of the area, and is one of
   1307 .BR SHrect ,
   1308 .B SHcircle
   1309 or
   1310 .BR SHpoly .
   1311 .B Href
   1312 is the URL associated with this area in its role as
   1313 a hypertext link, and
   1314 .B target
   1315 is the target frame it should be loaded in.
   1316 .B Coords
   1317 is an array of coordinates for the shape, and
   1318 .B ncoords
   1319 is the size of this array (number of elements).
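.PP
A caller handling a click on a mapped image might search the area list for the first area containing the point.
The sketch below is one way to do that under simplifying assumptions: coordinates are plain pixel integers rather than
.B Dimen
values, the shape constants are stand-ins with assumed values, polygons are not handled, and
.B findarea
is a hypothetical helper name.
The coordinate order (left, top, right, bottom for rectangles; centre x, centre y, radius for circles) follows the HTML 4.01 definition of the coords attribute.
.PP
```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, for illustration only; use the ones in <html.h>. */
enum { SHrect, SHcircle, SHpoly };

/* Stand-in declaration: coords are plain pixels here, not Dimen. */
typedef struct Area Area;
struct Area
{
	Area*	next;
	int	shape;
	int*	coords;
	int	ncoords;
};

/* Report whether (x, y) lies inside a rectangle or circle area. */
static int
inarea(Area *a, int x, int y)
{
	int *c, dx, dy;

	assert(a != NULL);
	c = a->coords;
	switch(a->shape){
	case SHrect:	/* left, top, right, bottom */
		if(a->ncoords < 4)
			return 0;
		return c[0] <= x && x <= c[2] && c[1] <= y && y <= c[3];
	case SHcircle:	/* centre x, centre y, radius */
		if(a->ncoords < 3)
			return 0;
		dx = x - c[0];
		dy = y - c[1];
		return dx*dx + dy*dy <= c[2]*c[2];
	}
	return 0;	/* SHpoly is not handled in this sketch */
}

/* Hypothetical helper: first area in list order containing (x, y), or NULL. */
Area*
findarea(Area *areas, int x, int y)
{
	Area *a;

	for(a = areas; a != NULL; a = a->next)
		if(inarea(a, x, y))
			return a;
	return NULL;
}
```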
   1320 .SS Frames
   1321 .PP
   1322 If the
   1323 .B Docinfo.kidinfo
   1324 field is set, the document is a frameset.
   1325 In this case, it is typical for
   1326 .I parsehtml
   1327 to return nil, as a document which is a frameset should have no actual
   1328 items that need to be laid out (such will appear only in subsidiary documents).
   1329 A malformed document may nevertheless yield items; the caller
   1330 should check for this and free any such items.
   1331 .PP
   1332 The
   1333 .B Kidinfo
   1334 structure itself reflects the fact that framesets can be nested within a document.
   1335 It is defined as follows:
   1336 .PP
   1337 .EX
   1338 .ta 6n +\w'Kidinfo* 'u
   1339 typedef struct Kidinfo Kidinfo;
   1340 struct Kidinfo
   1341 {
   1342 	Kidinfo*	next;
   1343 	int	isframeset;
   1344 
   1345 	// fields for "frame"
   1346 	Rune*	src;
   1347 	Rune*	name;
   1348 	int	marginw;
   1349 	int	marginh;
   1350 	int	framebd;
   1351 	int	flags;
   1352 
   1353 	// fields for "frameset"
   1354 	Dimen*	rows;
   1355 	int	nrows;
   1356 	Dimen*	cols;
   1357 	int	ncols;
   1358 	Kidinfo*	kidinfos;
   1359 	Kidinfo*	nextframeset;
   1360 };
   1361 .EE
   1362 .PP
   1363 .B Next
   1364 is only used if this structure is part of a containing frameset; it points to the next
   1365 element in the list of children of that frameset.
   1366 .B Isframeset
   1367 is set when this structure represents a frameset; if clear, it is an individual frame.
   1368 .PP
   1369 Some fields are used only for framesets.
   1370 .B Rows
   1371 is an array of dimension specifications for rows in the frameset, and
   1372 .B nrows
   1373 is the length of this array.
   1374 .B Cols
   1375 is the corresponding array for columns, of length
   1376 .BR ncols .
   1377 .B Kidinfos
   1378 points to a list of components contained within this frameset, each
   1379 of which may be a frameset or a frame.
   1380 .B Nextframeset
   1381 is only used during parsing, and should be ignored.
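.PP
Since framesets nest, a caller typically walks the
.B Kidinfo
tree recursively.
The following sketch counts the individual frames beneath a node; it uses a reduced stand-in structure with only the fields it needs, and
.B countframes
is a hypothetical helper name.
It follows
.B next
only within a frameset's list of children, never from the node it was handed.
.PP
```c
#include <assert.h>
#include <stddef.h>

/* Stand-in declaration: only the fields used below, not the real <html.h>. */
typedef struct Kidinfo Kidinfo;
struct Kidinfo
{
	Kidinfo*	next;
	int	isframeset;
	Kidinfo*	kidinfos;
};

/* Hypothetical helper: number of individual frames at or below k. */
int
countframes(Kidinfo *k)
{
	Kidinfo *kid;
	int n;

	if(k == NULL)
		return 0;
	if(!k->isframeset)
		return 1;
	n = 0;
	for(kid = k->kidinfos; kid != NULL; kid = kid->next){
		assert(kid != k);	/* a child that is its own parent would recurse forever */
		n += countframes(kid);
	}
	return n;
}
```
.PP
With a
.B Docinfo
pointer
.BR di ,
a call might look like
.BR countframes(di->kidinfo) .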
   1382 .PP
   1383 The remaining fields are used if the structure describes a frame, not a frameset.
   1384 .B Src
   1385 provides the URL for the document that should be initially loaded into this frame.
   1386 Note that this may be a relative URL, in which case it should be interpreted
   1387 using the containing document's URL as the base.
   1388 .B Name
   1389 gives the name of the frame, typically supplied via a name attribute in the HTML.
   1390 If no name was given, the library allocates one.
   1391 .BR Marginw ,
   1392 .B marginh
   1393 and
   1394 .B framebd
   1395 are the values of the marginwidth, marginheight and frameborder attributes, respectively.
   1396 .B Flags
   1397 can contain some combination of the following:
   1398 .B FRnoresize
   1399 (the frame had the noresize attribute set, and the user should not be allowed to resize it),
   1400 .B FRnoscroll
   1401 (the frame should not have any scroll bars),
   1402 .B FRhscroll
   1403 (the frame should have a horizontal scroll bar),
   1404 .B FRvscroll
   1405 (the frame should have a vertical scroll bar),
   1406 .B FRhscrollauto
   1407 (the frame should be automatically given a horizontal scroll bar if its contents
   1408 would not otherwise fit), and
   1409 .B FRvscrollauto
   1410 (the frame should be given a vertical scroll bar only if required).
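.PP
One possible reading of the scroll flags, for the vertical direction, is sketched below.
The flag values are assumed for illustration rather than taken from <html.h>, the precedence given to
.B FRnoscroll
over the other flags is this sketch's own choice, and
.B wantvscroll
is a hypothetical helper name.
.PP
```c
#include <assert.h>

/* Assumed flag values, for illustration only; use the ones in <html.h>. */
enum {
	FRnoresize = 1<<0,
	FRnoscroll = 1<<1,
	FRhscroll = 1<<2,
	FRvscroll = 1<<3,
	FRhscrollauto = 1<<4,
	FRvscrollauto = 1<<5
};

/*
 * Hypothetical helper: decide whether a frame of height frameh,
 * holding content of height contenth, gets a vertical scroll bar.
 */
int
wantvscroll(int flags, int contenth, int frameh)
{
	assert(contenth >= 0 && frameh >= 0);
	if(flags & FRnoscroll)
		return 0;	/* never */
	if(flags & FRvscroll)
		return 1;	/* always */
	if(flags & FRvscrollauto)
		return contenth > frameh;	/* only when the content overflows */
	return 0;
}
```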
   1411 .SH SOURCE
   1412 .B \*9/src/libhtml
   1413 .SH SEE ALSO
   1414 .MR fmt (1)
   1415 .PP
   1416 W3C World Wide Web Consortium,
   1417 ``HTML 4.01 Specification''.
   1418 .SH BUGS
   1419 The entire HTML document must be loaded into memory before
   1420 any of it can be parsed.