The BodyExtract Element

This element defines the regular expression used to extract the body of the HTML document.

Rule Processing

By default, the expression extracts the content of the <body> element (.html files) or the <bodyText> element (.topic files). The part of the regular expression that matches the body must be a named group called Body.

You might want to modify the expression if, for example, your document bodies contain several sections wrapped in div elements. You can alter it to extract only the div that contains the body text, thus excluding the unwanted parts of the document.

Note

Because the expression is stored in an XML file, any special characters in it such as <, >, &, ", and ' must be encoded as shown in the div extract example below. The regular expression is matched case-insensitively.
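As a minimal sketch of the encoding step, the following Python snippet escapes a raw regular expression so that it can be pasted into the expression attribute. The helper name encode_for_attribute is hypothetical and is not part of the tool; the snippet relies only on the Python standard library.

```python
from xml.sax.saxutils import escape


def encode_for_attribute(pattern: str) -> str:
    """Escape &, <, >, " and ' so the pattern is safe inside an XML attribute.

    Hypothetical helper for illustration; it is not part of the tool itself.
    """
    # escape() handles &, < and > itself; the two quote characters are
    # passed in as extra replacements.
    return escape(pattern, {'"': "&quot;", "'": "&apos;"})


# The raw (unencoded) form of the div extract expression.
raw = r'<\s*div\s*class="Main"[^>]*?>(?<Body>.*?)<\s*/\s*div?\s*>'
encoded = encode_for_attribute(raw)

# Print the element on a single line, ready to paste into the XML file.
print(f'<BodyExtract expression="{encoded}" />')
```

Escaping the ampersand first, as escape() does, avoids double-encoding the entities that are added for the other characters.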

Example div Extract

<!-- Note: Lines wrapped for display purposes -->
<BodyExtract expression="&lt;\s*div\s*class=&quot;Main&quot;[^&gt;]*?&gt;
(?&lt;Body&gt;.*?)&lt;\s*/\s*div?\s*&gt;" />
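
To check what the encoded expression matches, it can help to decode it and run it against a sample document. The sketch below does this in Python rather than in the tool itself, so three things are assumptions: the sample HTML is made up, the named group is translated to Python's (?P<Body>...) syntax, and the "dot matches newline" option is enabled so that the body can span several lines. Only the case-insensitive option comes from the note above.

```python
import re
import xml.etree.ElementTree as ET

# The example element joined onto one line (the wrap shown above is for
# display only; a real line break inside the attribute would change the value).
ELEMENT = (
    r'<BodyExtract expression="&lt;\s*div\s*class=&quot;Main&quot;[^&gt;]*?&gt;'
    r'(?&lt;Body&gt;.*?)&lt;\s*/\s*div?\s*&gt;" />'
)

# The XML parser decodes the entities and returns the raw pattern.
pattern = ET.fromstring(ELEMENT).get("expression")

# Python spells named groups (?P<name>...), so translate the (?<Body>...) form.
py_pattern = pattern.replace("(?<Body>", "(?P<Body>")

# IGNORECASE follows the note above; DOTALL is an assumption made here so
# that "." also matches line breaks.
body_re = re.compile(py_pattern, re.IGNORECASE | re.DOTALL)

# Made-up sample document with the body text in the "Main" div.
html = """<html><body>
<div class="Header">Site navigation</div>
<DIV class="Main">
<p>Body text to keep.</p>
</DIV>
<div class="Footer">Copyright notice</div>
</body></html>"""

match = body_re.search(html)
body = match.group("Body").strip() if match else None
print(body)
```

Because the Body group uses a lazy match, it ends at the first closing div tag. If the target div contains nested div elements, the extracted body is cut short at the first inner closing tag, so the expression needs a more specific end marker in that case.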
