Unicode and HTML entities

2012-09-14

tagged programming, unicode, text-handling, python

In which we struggle with a cacophony of characters for the web.

Buried in the Python standard library, unicodedata contains most of the information needed to interrogate and translate unicode characters. Unfortunately, it's underdocumented. More accurately, the docs are a terse list of what it does, but not why you might want to use it or how. It's also a compiled library, which rules out reading the source. The following are some notes I compiled concerning unicodedata and beating information out of unicode characters. Throughout these examples, I'll use the characters 'a', 'é' (u'\xE9' in Python) and pi (ϖ, u'\u03D6' in Python).

The name for any of these characters can be obtained by passing them to name. This accepts a single unicode character (not a byte string):

>>> import unicodedata
>>> chars = [u'a', u'\xE9', u'\u03D6']
>>> for item in chars:
...     print unicodedata.name (item)
...
LATIN SMALL LETTER A
LATIN SMALL LETTER E WITH ACUTE
GREEK PI SYMBOL

This name can be used in a reverse lookup to get the character with lookup:

>>> unicodedata.lookup ('LATIN SMALL LETTER A')
u'a'
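As a rough sketch of name and lookup working as a pair, the snippet below tabulates a few characters. It is written for Python 3 (an assumption on my part; there every string is unicode and the u prefix is optional), and the helper name describe and the "<unnamed>" placeholder are made up for illustration.

```python
import unicodedata

def describe(text):
    """Return (character, code point, name) for each character in text."""
    rows = []
    for ch in text:
        # The second argument to name() is a default that is returned when a
        # character has no name (control characters, for instance) instead of
        # raising ValueError.
        rows.append((ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")))
    return rows

for row in describe("a\xe9\u03d6"):
    print(row)

# lookup() reverses name(): feed it the name, get the character back.
for ch in "a\xe9\u03d6":
    assert unicodedata.lookup(unicodedata.name(ch)) == ch
```

Passing a default to name is worth the extra argument whenever the input might contain newlines or other control characters, which have no name at all.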

The hexadecimal letter-digits (A to F) can be either case. Interestingly, while anything written in the hex '\x' notation can be rewritten in the unicode '\u' notation, the reverse doesn't hold. This is down to the fixed widths of the escapes: '\x' is limited to 2 hex digits, '\u' needs 4, and '\U' requires 8. So u'\x03d6' is read as u'\x03' followed by the letters 'd6', three characters in all:

>>> print unicodedata.name (u'\u00e9')
LATIN SMALL LETTER E WITH ACUTE
>>> print unicodedata.name (u'\x03d6')
...
<type 'exceptions.TypeError'>: need a single Unicode character as parameter
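To make the escape widths concrete, here is a small sketch, again assuming Python 3 string literals (which follow the same width rules as Python 2 unicode literals). The variable names are only for illustration.

```python
import unicodedata

two = "\xe9"          # \x consumes exactly 2 hex digits
four = "\u00e9"       # \u consumes exactly 4
eight = "\U000000e9"  # \U consumes exactly 8
assert two == four == eight

# With \x, anything after the first two hex digits is ordinary text,
# so this is "\x03" followed by "d" and "6".
broken = "\x03d6"
print(len(broken))     # 3
print(list(broken))

# The four-digit form is what was actually wanted.
print(unicodedata.name("\u03d6"))
```

That three-character string is why name complains about needing a single character.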

decimal, digit and numeric are potentially confusing calls. If the character passed is a numerical character, they return the integer (float for numeric) for that character; otherwise they raise ValueError:

>>> print unicodedata.decimal (u'1')
1
>>> print unicodedata.decimal (u'a')
...
<type 'exceptions.ValueError'>: not a decimal
>>> print unicodedata.digit (u'1')
1
>>> print unicodedata.digit (u'a')
...
<type 'exceptions.ValueError'>: not a digit
>>> print unicodedata.numeric (u'1')
1.0
>>> print unicodedata.numeric (u'a')
...
<type 'exceptions.ValueError'>: not a numeric character
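Where the three calls part ways is easier to see with characters that are numerical but not plain digits. The sketch below assumes Python 3 and uses a made-up helper, numeric_profile, which leans on the optional default argument that all three functions accept so that nothing raises.

```python
import unicodedata

def numeric_profile(ch):
    """Return (decimal, digit, numeric) for ch, with None where the call would raise."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

for ch in ["1", "\xb2", "\xbd", "a"]:   # 1, superscript two, one half, a
    print(repr(ch), numeric_profile(ch))
# '1'  -> (1, 1, 1.0)
# '²'  -> (None, 2, 2.0)      a digit, but not a decimal digit
# '½'  -> (None, None, 0.5)   numeric only
# 'a'  -> (None, None, None)
```

Roughly, decimal is the strictest of the three and numeric the most forgiving.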

The remaining functions within unicodedata are justifiably obscure, concerned with more typographical issues. category is passingly interesting, returning a short string of classification information for any character:

>>> unicodedata.category (u'A')
'Lu' # letter, uppercase
>>> unicodedata.category (u'a')
'Ll' # letter, lowercase
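One possible use for category, sketched under the assumption of Python 3: the first letter of the code gives the broad class (L for letters, N for numbers, P for punctuation, S for symbols, Z for separators), so counting on that letter gives a quick profile of a string. The function name category_counts is hypothetical.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count characters by major category (the first letter of the category code)."""
    return Counter(unicodedata.category(ch)[0] for ch in text)

counts = category_counts("Pi \u2248 3.14!")
print(sorted(counts.items()))
# [('L', 2), ('N', 3), ('P', 2), ('S', 1), ('Z', 2)]
```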

You can obtain the unicode character from its ordinal value (decimal or hex, but not a hexadecimal string) via unichr:

>>> unichr (233), unichr (0xe9)
(u'\xe9', u'\xe9')
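If all you have is a hexadecimal string, it can be run through int first. The following is a minimal sketch for Python 3, where unichr is spelled chr; the helper name char_from_hex and its handling of a "U+" prefix are my own assumptions, not anything the standard library provides.

```python
def char_from_hex(hex_string):
    """Turn a hex string such as 'e9', '0xE9' or 'U+03D6' into a character."""
    if hex_string.upper().startswith("U+"):
        hex_string = hex_string[2:]
    # int() with base 16 accepts an optional 0x prefix and either case.
    return chr(int(hex_string, 16))

print(char_from_hex("e9"), char_from_hex("0xE9"), char_from_hex("U+03D6"))

# ord() goes the other way, from character back to ordinal.
assert ord(char_from_hex("e9")) == 233
```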

Encoding characters for HTML is a small problem for unicode. First, the angle brackets < and > have to be escaped in text within HTML, so they aren't mistaken for markup. Therefore they are translated into HTML character entities, which can be safely placed in HTML and rendered. This can be done with the cgi.escape function:

>>> cgi.escape ('<>')
'&lt;&gt;'

This works with strings or unicode. cgi.escape also converts ampersands, as they are used in constructing the character entities:

>>> cgi.escape ('&')
'&amp;'
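For anyone following along on Python 3, the same job is done by html.escape, since cgi.escape was removed in Python 3.8. A short sketch, with sample strings invented for illustration:

```python
import html

# quote=False mirrors cgi.escape's default: only &, < and > are touched.
print(html.escape("<b>Fish & chips</b>", quote=False))
# &lt;b&gt;Fish &amp; chips&lt;/b&gt;

# By default html.escape also escapes quotes, which matters inside attributes.
print(html.escape('say "hi"'))
# say &quot;hi&quot;

# Escaping is not idempotent: the ampersand of an existing entity gets escaped again.
print(html.escape("&lt;"))
# &amp;lt;
```

The last case is the usual trap: escape once, at output time, rather than at every layer.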

Next, there is the problem of rendering extended characters in HTML. While character encoding information can be indicated by webpages, a safer fallback is to also translate extended characters into character entities. This can be done for all commonly used characters (and most uncommon ones). There isn't a direct translation facility in Python, but ElementTree does this with some clever pattern recognition:

_escape = re.compile (eval(r'u"[&<>\"\u0080-\uffff]+"'))

to catch the extended characters and then a simple substitution:

text = "&#%d;" % ord (char)

to form numerical character entities:

>>> "&#%d;" % ord (u'\xe9')
'&#233;'
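Putting the pattern and the substitution together gives something like the sketch below. This is my own simplified take for Python 3, not ElementTree's actual code: it turns the markup characters into numerical entities too, and it widens the range to U+10FFFF so characters beyond U+FFFF are caught. The name to_numeric_entities is hypothetical.

```python
import re

# Markup characters plus everything outside ASCII.
_ESCAPE = re.compile('[&<>"\u0080-\U0010ffff]')

def to_numeric_entities(text):
    """Replace markup and non-ASCII characters with numerical character entities."""
    return _ESCAPE.sub(lambda match: "&#%d;" % ord(match.group()), text)

print(to_numeric_entities("caf\xe9 <b>"))
# caf&#233; &#60;b&#62;
```

The result is pure ASCII, so it survives whatever encoding the page ends up being served with.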

The catch here is that this produces numerical entities, not the named entities that cgi.escape makes. Every character has a numerical entity, and many also have a named one. While these render equivalently, it would be preferable to produce the more readable named entities. To illustrate:

u'π' = unichr (960) = u'\u03c0' = '&pi;' = '&#960;'

and:

u'∑√Π' becomes '&sum;&radic;&Pi;' or '&#8721;&#8730;&#928;'

Unfortunately, there's no ready-made function for producing these named entities in Python; the nearest thing is the codepoint2name table in the htmlentitydefs module. (Irritatingly, Docutils has wide support but not in a form that looks useful for code.)
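As a rough sketch of how named entities could be produced from the standard library's entity table: this assumes Python 3, where the table lives at html.entities.codepoint2name (htmlentitydefs.codepoint2name in Python 2). The table only covers the HTML 4 entity names, so the helper below, whose name to_named_entities is made up for illustration, falls back to numerical entities for everything else outside ASCII.

```python
from html.entities import codepoint2name

def to_named_entities(text):
    """Use named entities where the table has one, numerical entities otherwise."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # Covers &, <, > and " as well as accented letters, Greek, symbols.
            out.append("&%s;" % codepoint2name[code])
        elif code > 127:
            out.append("&#%d;" % code)
        else:
            out.append(ch)
    return "".join(out)

print(to_named_entities("\u2211\u221a\u03a0"))
# &sum;&radic;&Pi;
print(to_named_entities("\xe9 & \u0416"))
# &eacute; &amp; &#1046;
```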