Tolerant HTML Parsing

Beautiful Soup is an absolutely terrific Python library for parsing HTML and XML. Its strength is its ability to offer a clean document tree even for bad markup (including such gory details as intelligently converting everything to Unicode).

Unfortunately, Beautiful Soup’s tolerance for lousy markup is limited. It has lots of clever heuristics for repairing broken nesting (<b><i>foo</b></i>) and guessing implicitly closed tags (<li>First<li>Second), but it uses Python’s standard HTML parser class, HTMLParser, to tokenize the markup. HTMLParser isn’t designed to accommodate syntactically malformed markup.

The case I’ve encountered in the wild involves “syntactically nested” tags: constructions of the form <a <br>href="foo">bar</a>, where a second tag opens before the first one has been closed. The nested tag is almost always a line break; presumably this is the result of a particularly lousy tool attempting to do its own text wrapping.

Although such markup is clearly atrocious, web browsers are fairly consistent in the way they render such fragments: anything up to the first > is part of the tag, and the rest is just text up to the < that opens the next tag. Both Safari and Firefox render my example as href="foo">bar, presumably wrapped in an a tag with no href attribute. Even TextMate’s HTML highlighter interprets the syntax in this way.
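The browser rule described above can be roughly sketched in a few lines of Python. This is only an illustration, not how any browser is actually implemented: the function name split_like_browser is made up here, and the sample fragment <a <br>href="foo">bar</a> is an assumed instance of the syntactically nested case.

```python
import re

# A tag runs from "<" to the first ">"; text runs up to the next "<".
# A stray "<" with no closing ">" is silently skipped by this sketch.
TOKEN = re.compile(r"<[^>]*>|[^<]+")

def split_like_browser(markup):
    """Return (kind, chunk) pairs, where kind is 'tag' or 'text'."""
    tokens = []
    for match in TOKEN.finditer(markup):
        chunk = match.group(0)
        kind = "tag" if chunk.startswith("<") else "text"
        tokens.append((kind, chunk))
    return tokens

print(split_like_browser('<a <br>href="foo">bar</a>'))
# [('tag', '<a <br>'), ('text', 'href="foo">bar'), ('tag', '</a>')]
```

The middle token is the text run that Safari and Firefox display.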

HTMLParser chokes on this syntax, however, making it impossible to use Beautiful Soup to process pages with such errors. HTMLParser.py uses the following regular expression to find the end of a tag it is parsing:

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

If whatever matches this expression is not followed by > or />, the check_for_whole_start_tag routine raises an exception.
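To see where the match stops, the expression can be run on its own. The sketch below copies the regular expression quoted above; the helper tag_end_status is a hypothetical stand-in that only reports whether the match is followed by > or />, and does not reproduce the real check_for_whole_start_tag logic. The malformed fragment is an assumed example of the nested-tag case.

```python
import re

# Copied from the excerpt above.
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

def tag_end_status(markup):
    """Return (matched_text, followed_by_tag_close), or None if no match."""
    match = locatestarttagend.match(markup)
    if match is None:
        return None
    rest = markup[match.end():]
    return match.group(0), rest.startswith((">", "/>"))

print(tag_end_status('<a href="foo">bar</a>'))      # ('<a href="foo"', True)
print(tag_end_status('<a <br>href="foo">bar</a>'))  # ('<a ', False)
```

In the second case the match ends right before the nested <br, so the next character is < rather than >.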

A general replacement for HTMLParser which is designed from the ground up to handle questionable syntax (presumably just turning anything it can’t parse into text runs) would be a useful utility, but for now I just wanted to fix the particular cases I’ve encountered in practice. Modifying check_for_whole_start_tag such that it no longer raises exceptions is one option, but that doesn’t match the behavior of the web browsers: the existing version of locatestarttagend stops matching at the beginning of any syntactically nested tag, while the web browsers stop at the end of the nested tag. As a quick hack, I modified HTMLParser to allow attribute names beginning with <. This involves changing both locatestarttagend and the expression for identifying attributes:

attrfind = re.compile(
    r'\s*([<a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[<a-zA-Z_][-.:a-zA-Z0-9_]*    # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
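The effect of the two patched expressions can be checked in isolation. This is a minimal sketch that assumes the malformed fragment looks like <a <br>href="foo">bar</a>; scan_start_tag is a hypothetical helper that only loosely imitates how a parser might walk the attributes, and unlike the real parser it leaves the quotes on attribute values.

```python
import re

# Patched expressions from above: attribute names may start with "<".
attrfind = re.compile(
    r'\s*([<a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[<a-zA-Z_][-.:a-zA-Z0-9_]*    # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

def scan_start_tag(markup):
    """Return (tag_name, [(attr, raw_value), ...]), or None if unparseable."""
    end = locatestarttagend.match(markup)
    if end is None or not markup.startswith((">", "/>"), end.end()):
        return None
    name = re.match(r"<([a-zA-Z][-.a-zA-Z0-9:_]*)", markup).group(1)
    attrs = []
    pos = 1 + len(name)                # skip "<" and the tag name
    while pos < end.end():
        match = attrfind.match(markup, pos)
        if match is None:
            break
        attrs.append((match.group(1), match.group(3)))
        pos = match.end()
    return name, attrs

print(scan_start_tag('<a <br>href="foo">bar</a>'))  # ('a', [('<br', None)])
print(scan_start_tag('<a href="foo">bar</a>'))      # ('a', [('href', '"foo"')])
```

The nested <br is swallowed as a valueless attribute of the a tag, so the tag now ends at the first >, and href="foo">bar is left over as text, which matches what the browsers show.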

Not the most robust solution (it’s easy to find examples that still break), but so far it’s been able to handle everything I’ve found on the web.

If you’re too lazy to add the two characters yourself, feel free to download my patched version of the HTML parser.