Selecting Fonts and Shapers
The basic algorithm for selecting fonts and shapers is currently:
For each character in the text, form a PangoFontDescription
with a list of families in the family field. This is done by
combining the attributes for that segment with global
settings. For instance, if the global font is "Times Italic",
and two properties have been applied to the text segment,
one specifying the family as "Utopia,Serif", and the other
specifying the weight as "Bold", then the final font
description would have the family list "Utopia,Serif", the
style "Italic", and the weight "Bold".
For each combination of family and style, look it up in
all the applicable font maps to find a matching font.
For each font in the resulting list, retrieve the
coverage map for the current language tag, and then select
the font which has the best PangoCoverageLevel for the
character. Ties are broken by ordering in the family list.
The shaper is found by calling the find_shaper() method
of the font with the character point and language tag.
The exact implementation of this is up to the font.
Most likely this is a straight call of the engine lookup
functions that the Pango core provides.
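The description merging and coverage-based selection described above can be sketched as follows. This is only an illustration, not Pango code: FontDescription, Font, merge, select_font and the font_map dictionary are hypothetical stand-ins, the sketch assumes that an attribute replaces the corresponding global field outright, and the coverage data is made up. Only the ordering of the coverage levels follows PangoCoverageLevel.

```python
from dataclasses import dataclass, field, replace
from enum import IntEnum


class CoverageLevel(IntEnum):
    """Stand-in for PangoCoverageLevel; a higher value means better coverage."""
    NONE = 0
    FALLBACK = 1
    APPROXIMATE = 2
    EXACT = 3


@dataclass
class FontDescription:
    families: tuple           # ordered family list, most preferred first
    style: str = "Normal"
    weight: str = "Normal"


@dataclass
class Font:
    family: str
    # Hypothetical coverage data: (language tag, character) -> CoverageLevel.
    coverage: dict = field(default_factory=dict)

    def level(self, lang, ch):
        return self.coverage.get((lang, ch), CoverageLevel.NONE)


def merge(global_desc, *attributes):
    """Apply each attribute (a dict of fields) on top of the global settings."""
    desc = global_desc
    for attr in attributes:
        desc = replace(desc, **attr)
    return desc


def select_font(desc, font_map, lang, ch):
    """Look up each family in turn and keep the font with the best coverage."""
    best, best_level = None, CoverageLevel.NONE
    for family in desc.families:
        font = font_map.get((family, desc.style, desc.weight))
        if font is None:
            continue
        level = font.level(lang, ch)
        # Strict comparison: on a tie, the earlier family in the list wins.
        if level > best_level:
            best, best_level = font, level
    return best


# The example from the text: global font "Times Italic", plus two attributes.
global_desc = FontDescription(families=("Times",), style="Italic")
desc = merge(global_desc, {"families": ("Utopia", "Serif")}, {"weight": "Bold"})

# Made-up font map; "\u3042" is HIRAGANA LETTER A.
font_map = {
    ("Utopia", "Italic", "Bold"): Font("Utopia", {("en", "a"): CoverageLevel.EXACT}),
    ("Serif", "Italic", "Bold"): Font("Serif", {
        ("en", "a"): CoverageLevel.EXACT,
        ("en", "\u3042"): CoverageLevel.FALLBACK,
    }),
}
```

With this data, select_font(desc, font_map, "en", "a") returns the Utopia font, because both fonts cover the character exactly and Utopia comes first in the family list. For "\u3042" only Serif has any coverage, so it is chosen. The shaper lookup is left out, since it is delegated to the font's find_shaper() method.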
The main problem with the above approach is that we don't get any
contextual coherency in font or shaper selection. If there is a
string of Japanese consisting of mixed Kanji characters and Kana,
then you might get Chinese variants of the Kanji mixed with the
Kana, if the string is not tagged.
If people tag their text with languages, or only use the language
corresponding to their locale, then all is fine. The problem comes
when somebody whose locale is, say, English looks
at an untagged Japanese document. My proposed solution
to this problem is to do an implicit language tagging pass.
(This is not an urgent problem, and can be implemented later.)
The information required for this is a map listing, for each Unicode
character, which language tags are valid. This information can
be semi-automatically extracted from locale definition files.
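As a rough sketch of what such a map might look like, the lookup below stores code point ranges together with the set of tags valid for them. The table contents are illustrative assumptions only (they are not extracted from any locale definition file), and the names RANGES and valid_tags are hypothetical.

```python
import bisect

# Hypothetical table: (first code point, last code point, valid language tags).
# Entries must be sorted by first code point and must not overlap.
RANGES = [
    (0x0041, 0x005A, frozenset({"en", "fr", "de"})),  # Latin capital letters
    (0x0061, 0x007A, frozenset({"en", "fr", "de"})),  # Latin small letters
    (0x3040, 0x309F, frozenset({"ja"})),              # Hiragana
    (0x30A0, 0x30FF, frozenset({"ja"})),              # Katakana
    (0x4E00, 0x9FFF, frozenset({"ja", "zh"})),        # CJK Unified Ideographs
]
_STARTS = [first for first, _, _ in RANGES]


def valid_tags(ch):
    """Return the set of language tags considered valid for one character."""
    cp = ord(ch)
    # Find the last range whose first code point is <= cp.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return frozenset()
```

A range table keeps the data small compared with one entry per character. Kana map to a single tag here while Kanji map to several, which is the ambiguity the tagging pass has to resolve.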
We start with the document tagged according to explicit language
tags and the user's locale.
In the first pass, we remove all tags where the tag is not
valid for the corresponding code points. (Actually, you want
to be more aggressive and remove all tags from all characters
in a "word" where one of the characters in the word is mistagged,
though, unfortunately, we don't have word break information
until we have the final language tags!)
In the second pass, all segments where only one language tag
is valid are tagged with that tag.
In the final pass, tags are extended to adjoining untagged
areas where the tag is valid.
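The passes above can be sketched as follows. This is a simplified illustration under stated assumptions: it works per character rather than per segment or word, the function implicit_tags and the demo_valid_tags table are hypothetical, and the final pass extends tags with a left-to-right sweep followed by a right-to-left sweep, which is one possible reading of "adjoining".

```python
def demo_valid_tags(ch):
    """Tiny made-up validity table, for illustration only."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana
        return {"ja"}
    if 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs
        return {"ja", "zh"}
    if ch.isascii() and ch.isalpha():
        return {"en"}
    return set()


def implicit_tags(text, tags, valid_tags):
    """Return a new per-character tag list; None means "untagged".

    tags holds the starting tag of each character: its explicit
    language tag if it has one, otherwise the user's locale.
    """
    tags = list(tags)

    # First pass: remove tags that are not valid for their character.
    for i, ch in enumerate(text):
        if tags[i] is not None and tags[i] not in valid_tags(ch):
            tags[i] = None

    # Second pass: untagged characters with exactly one valid tag get it.
    for i, ch in enumerate(text):
        if tags[i] is None:
            candidates = valid_tags(ch)
            if len(candidates) == 1:
                tags[i] = next(iter(candidates))

    # Final pass: extend tags into adjoining untagged areas where valid,
    # first from the left neighbour, then from the right neighbour.
    for i in range(1, len(text)):
        if tags[i] is None and tags[i - 1] in valid_tags(text[i]):
            tags[i] = tags[i - 1]
    for i in range(len(text) - 2, -1, -1):
        if tags[i] is None and tags[i + 1] in valid_tags(text[i]):
            tags[i] = tags[i + 1]
    return tags


# Two Kanji followed by two Kana, viewed under an English locale.
mixed = "\u6f22\u5b57\u304b\u306a"
mixed_tags = implicit_tags(mixed, ["en"] * 4, demo_valid_tags)

# Kanji only: nothing disambiguates between "ja" and "zh".
kanji_only = "\u6f22\u5b57"
kanji_tags = implicit_tags(kanji_only, ["en"] * 2, demo_valid_tags)
```

In the mixed string, the Kana receive "ja" in the second pass and the final pass carries that tag back over the Kanji, so mixed_tags is "ja" for all four characters. The Kanji-only string stays untagged, which is the kind of case where explicit tagging is still needed.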
Explicit tagging should be sufficient where this approach fails.
In general, getting automatic selection of language tag wrong is not
too dangerous, as long as you always favor tagging to the user's
global preference (locale).