Selecting Fonts and Shapers

The basic algorithm for selecting fonts and shapers is currently character-by-character:

  1. For each character in the text, form a PangoFontDescription with a list of families in the family field. This is done by combining the attributes for that segment with global settings. For instance, if the global font is "Times Italic", and two properties have been applied to the text segment, one specifying the family as "Utopia,Serif", and the other specifying the weight as "Bold", then the final font description would be:

    Family        Style   Variant  Weight  Stretch
    utopia,serif  italic  normal   bold    normal

  2. For each combination of family and style, look it up in all the applicable font maps to find a matching font.

  3. For each font in the resulting list, retrieve the coverage map for the current language tag, and then select the font which has the best PangoCoverageLevel for the character. Ties are broken by ordering in the family list.

  4. The shaper is found by calling the find_shaper() method of the font with the character's code point and language tag.

    The exact implementation of this is up to the font. Most likely it is a straightforward call to the engine lookup functions that the Pango core provides. (A sketch of the whole selection procedure follows this list.)
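
Below is a minimal sketch of steps 1 through 3 in C. It is illustrative only: the signatures follow the later Pango 1.x C API, a single font map is assumed where the real code would consult every applicable one, and the helper name select_font_for_char() is invented for the sketch.

    #include <pango/pango.h>

    /* Pick the best font for one character (steps 2 and 3).  `desc` is
     * the already-combined description from step 1, e.g. family
     * "Utopia,Serif", style italic, weight bold.  Ties in coverage are
     * broken by family-list order, since earlier families are tried
     * first and only strictly better coverage replaces the current
     * best. */
    static PangoFont *
    select_font_for_char (PangoFontMap         *fontmap,
                          PangoContext         *context,
                          PangoFontDescription *desc,
                          gunichar              wc,
                          PangoLanguage        *language)
    {
      gchar **families =
        g_strsplit (pango_font_description_get_family (desc), ",", -1);
      PangoFont *best = NULL;
      PangoCoverageLevel best_level = PANGO_COVERAGE_NONE;
      int i;

      for (i = 0; families[i] != NULL; i++)
        {
          PangoFontDescription *tmp = pango_font_description_copy (desc);
          PangoFont *font;
          PangoCoverage *coverage;
          PangoCoverageLevel level;

          /* Step 2: look this family/style combination up in the font map. */
          pango_font_description_set_family (tmp, families[i]);
          font = pango_font_map_load_font (fontmap, context, tmp);
          pango_font_description_free (tmp);
          if (!font)
            continue;

          /* Step 3: compare coverage of the character under the
           * current language tag. */
          coverage = pango_font_get_coverage (font, language);
          level = pango_coverage_get (coverage, wc);
          pango_coverage_unref (coverage);

          if (level > best_level)
            {
              if (best)
                g_object_unref (best);
              best = font;
              best_level = level;
            }
          else
            g_object_unref (font);
        }

      g_strfreev (families);
      return best;
    }

Step 4 is then a single call on the winning font:

    shaper = pango_font_find_shaper (best, language, wc);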

Language auto-tagging

The main problem with the above approach is that we don't get any contextual coherency in font or shaper selection. For instance, in an untagged string of Japanese mixing Kanji and Kana, you might get Chinese variants of the Kanji mixed in with the Kana.

If people tag the language, or only use the language corresponding to their locale, then all is fine. The problem comes when somebody whose locale is, say, English looks at an untagged Japanese document. My proposed solution to this problem is to do an implicit language tagging pass. (This is not an urgent problem, and can be implemented later.)

The information required for this is a map listing, for each Unicode character, which language tags are valid. This information can be semi-automatically extracted from locale definition files.
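
As an illustration, one plausible shape for this map is a sorted table of code-point ranges, each carrying the language tags attested for that range in the locale data. Everything below is hypothetical; no such table exists in Pango:

    #include <glib.h>

    /* Hypothetical character -> valid-languages map; ranges are sorted
     * by start code point so a lookup can binary-search.  The entries
     * would be generated semi-automatically from locale definition
     * files. */
    typedef struct {
      gunichar     start, end;   /* inclusive range of code points */
      const char **languages;    /* NULL-terminated list of language tags */
    } CharLangRange;

    static const char *kana_langs[]   = { "ja", NULL };
    static const char *han_langs[]    = { "ja", "ko", "zh", NULL };
    static const char *hangul_langs[] = { "ko", NULL };

    static const CharLangRange lang_map[] = {
      { 0x3040, 0x30ff, kana_langs },   /* Hiragana/Katakana: Japanese only */
      { 0x4e00, 0x9fff, han_langs },    /* Unified Han: ambiguous */
      { 0xac00, 0xd7a3, hangul_langs }, /* Hangul syllables: Korean only */
    };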

We start with the document tagged according to explicit language tags and the user's locale, and then make three passes (sketched in code after the list):

  1. In the first pass, we remove all tags where the tag is not valid for the corresponding code points. (Actually, you want to be more aggressive and remove the tags from all characters in a "word" where one of the characters in the word is mistagged; unfortunately, we don't have word break information until we have the final language tags!)

  2. In the second pass, all segments where only one language tag is valid are tagged with that tag.

  3. In the final pass, tags are extended to adjoining untagged areas where the tag is valid.
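
Here is a rough sketch of the three passes over an array of per-character tags, where tags[i] is NULL for untagged characters. All names are hypothetical: tag_is_valid() and unique_valid_tag() stand in for lookups in the map described above, and the word-at-a-time refinement of the first pass is omitted.

    #include <glib.h>

    static gboolean    tag_is_valid     (gunichar wc, const char *tag);
    static const char *unique_valid_tag (gunichar wc); /* NULL if 0 or >1 valid */

    static void
    auto_tag (const gunichar *text, const char **tags, int len)
    {
      int i;

      /* Pass 1: remove tags that are invalid for their code point. */
      for (i = 0; i < len; i++)
        if (tags[i] && !tag_is_valid (text[i], tags[i]))
          tags[i] = NULL;

      /* Pass 2: tag characters for which exactly one language is valid. */
      for (i = 0; i < len; i++)
        if (!tags[i])
          tags[i] = unique_valid_tag (text[i]);

      /* Pass 3: extend tags into adjoining untagged runs where they
       * remain valid; only the left-to-right sweep is shown, and a
       * mirror-image right-to-left sweep would follow it. */
      for (i = 1; i < len; i++)
        if (!tags[i] && tags[i - 1] && tag_is_valid (text[i], tags[i - 1]))
          tags[i] = tags[i - 1];
    }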

Explicit tagging should be sufficient where this approach fails. In general, getting the automatic selection of language tags wrong is not too dangerous, as long as you always favor tagging according to the user's global preference (locale).



Last modified 17-Feb-2000
Owen Taylor <otaylor@redhat.com>