Unlike MAML topics, which must be well-formed XML, HTML files do not always follow such rules strictly: end tags may be missing, self-closing tags may omit the "/", and entities may be encoded incorrectly or not at all. For that reason, the HTML to MAML converter extracts the key parts of each HTML file and uses a series of regular expressions paired with match evaluators to replace the HTML elements with their MAML equivalents or remove them altogether.
While this approach will not guarantee a valid MAML topic, it will guarantee that no information is lost during the conversion. Building the converted topics reveals any missing end tags, unencoded entities, and similar problems, which are easily fixed by editing the topics.
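The pairing of a regular expression with a match evaluator can be illustrated with a minimal Python sketch, where a function passed to `re.sub` plays the role of the evaluator. The tag mapping, the `evaluate_tag` and `convert_tags` names, and the choice of Python are assumptions made for illustration only; they are not taken from the converter or its configuration file.

```python
import re

# Hypothetical mapping of HTML tag names to MAML element names.
# A value of None means "remove the tag but keep its content".
TAG_MAP = {
    "b": "legacyBold",
    "strong": "legacyBold",
    "i": "legacyItalic",
    "em": "legacyItalic",
    "p": "para",
    "font": None,
}

# Matches an opening or closing tag: optional "/", tag name, optional attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def evaluate_tag(match):
    """Evaluator called once per matched tag; returns the replacement text."""
    closing, name = match.group(1), match.group(2).lower()
    if name not in TAG_MAP:
        return match.group(0)      # no rule for this tag: leave it untouched
    maml_name = TAG_MAP[name]
    if maml_name is None:
        return ""                  # no MAML equivalent: drop the tag itself
    return f"<{closing}{maml_name}>"   # attributes are discarded in this sketch


def convert_tags(html):
    """Apply the evaluator to every tag found in the text."""
    return TAG_RE.sub(evaluate_tag, html)
```

Because the evaluator only rewrites the tags it matches, a missing end tag in the input (such as an unclosed `p` element) is still missing in the output. The text content is preserved, but the result is not necessarily well-formed.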
The conversion rules are stored in an XML configuration file. This file must reside in the same location as the converter's executable.
Conversion of a topic follows this general procedure using the rule definitions found in the configuration file:
1. Extract the key parts of the topic (metadata such as the title, attributes, and index keywords, and the body text).
2. Replace named entities with their numeric equivalent.
3. Replace markup wrapper HTML elements with a placeholder in the text so as not to alter their content in later processing.
4. Remove all HTML elements with no MAML equivalent.
5. Replace all HTML elements that do have MAML equivalents.
6. The following HTML elements are processed based on the context in which they are used:
   a - Links to topics and external URLs
   code - Inline code and code blocks
   h1-h6 - Section headings
   img - Image links
   see - See Also references
   These elements should not appear in any of the other rules.
7. Once done, the markup wrapper placeholders are replaced with their actual markup enclosed in a MAML markup element.
8. The converted topic is saved to the destination folder along with some supporting files.
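The body-text steps of the procedure above (entity replacement, placeholders, removal, replacement, and placeholder restoration) can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the converter's implementation: the rule sets `WRAPPED`, `REMOVED`, and `REPLACED`, the `@@MARKUPn@@` placeholder format, and the decision to leave XML's predefined entities alone are all hypothetical. Metadata extraction, the context-sensitive elements, and saving the file are omitted.

```python
import re
from html.entities import name2codepoint

# Assumption: entities predefined by XML are already valid and are left as-is.
XML_ENTITIES = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Hypothetical rule sets; the real rules live in the configuration file.
WRAPPED = ("object",)            # kept verbatim inside a MAML markup element
REMOVED = ("font", "span")       # no MAML equivalent: tags are dropped
REPLACED = {"b": "legacyBold", "i": "legacyItalic", "p": "para"}


def numeric_entities(text):
    """Replace named entities such as &eacute; with numeric ones (&#233;)."""
    def evaluate(match):
        name = match.group(1)
        if name in XML_ENTITIES or name not in name2codepoint:
            return match.group(0)            # leave unknown names untouched
        return f"&#{name2codepoint[name]};"
    return ENTITY_RE.sub(evaluate, text)


def convert(body):
    body = numeric_entities(body)

    # Swap wrapped elements for placeholders so later rules cannot alter them.
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"@@MARKUP{len(saved) - 1}@@"

    wrapper_re = re.compile(
        r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(WRAPPED), re.I | re.S)
    body = wrapper_re.sub(stash, body)

    # Remove tags that have no MAML equivalent (their content is kept).
    removed_re = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(REMOVED), re.I)
    body = removed_re.sub("", body)

    # Replace tags that do have a MAML equivalent.
    replaced_re = re.compile(r"<(/?)(%s)\b[^>]*>" % "|".join(REPLACED), re.I)
    body = replaced_re.sub(
        lambda m: f"<{m.group(1)}{REPLACED[m.group(2).lower()]}>", body)

    # Put the original markup back, enclosed in a MAML markup element.
    return re.sub(
        r"@@MARKUP(\d+)@@",
        lambda m: f"<markup>{saved[int(m.group(1))]}</markup>", body)
```

The ordering matters in this sketch: the `b` element inside the wrapped `object` element survives unchanged only because the placeholder is substituted before the removal and replacement rules run.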