Channel: maps for developers - Medium

Improving Arabic and Hebrew text in map labels

Mapbox GL JS now supports Arabic and Hebrew text with an optional plugin, and the next releases of our iOS and Android SDKs will include support out of the box. This also extends to languages like Persian (Farsi) and Urdu that use the Arabic script. In this post, I’d like to share why these scripts presented a technical challenge for us, and how we developed a solution.

Mashhad Before/After

OpenGL, Unicode Codepoints, and Glyphs

Incorporating support for Arabic and Hebrew text was a non-trivial project because our maps render everything with direct instructions to 3D graphics hardware using OpenGL, which doesn’t provide any support at all for rendering text. Instead of relying on the operating system or the browser to do the hard work for us, we’ve had to implement our own text rendering code.

Let’s go through how we’d render the label “Hey” on a map.

The input we get is encoded as a Unicode string that looks like:

U+0048 ("H")
U+0065 ("e")
U+0079 ("y")

Each of the numeric codes above is called a “codepoint,” and that’s the basic input we work with. A codepoint represents the concept of a character, while we call the actual visual representation of a character a “glyph”. To render “Hey,” we need to convert codepoints to glyphs, and then we need to tell the graphics card exactly where to place each of those glyphs (so the letters line up next to each other). Here’s roughly how we do it:

  • We ask the Mapbox font server to give us the glyphs for the codepoints “H,” “e,” and “y”.
  • The server looks up the glyphs for the codepoints in the appropriate font.
  • The server renders each glyph as something called a “signed distance field” (SDF), a graphical format that is easy to feed to OpenGL. Konstantin has written an in-depth description of our use of SDFs.
  • The server sends back the set of SDFs along with their widths.
  • We tell the graphics hardware to place the first glyph where we want our label to start, then we move to the right by the width of that glyph, place the next glyph, and repeat until we’re done.

Placing "hey"

Now let’s render “سلام” (Salaam), Arabic for “peace”.

The Unicode representation of the string looks like:

U+0633 ("Seen")
U+0644 ("Lam")
U+0627 ("Alef")
U+0645 ("Meem")

Here’s what happens when we render that:

Naive arabic placing

Fellow Mapbox employee Sam Kronick pointed me to the hall of shame we belong in: “nope, not Arabic!”

Bidirectional Text Layout (BiDi)

Arabic and Hebrew are both written right-to-left (RTL) instead of left-to-right (LTR). Under the hood, label strings are stored in “logical order,” which means the characters in the string are stored in the same order they would be read (you could also call it “first-to-last” order). Since our layout algorithm starts at the left and moves right, we’re implicitly converting “first-to-last” to “left-to-right,” and the text ends up backwards.

We can’t just reverse the order of the characters because we have to handle the case where LTR text is mixed with RTL text. Mixed LTR/RTL text is called “bidirectional text”. Bidirectional text can show up when you have a multilingual label, but it also shows up in the common case of an Arabic label that includes numbers, because Arabic numerals are actually written LTR (!). Once you start doing layout across multiple lines, “reversing just the RTL text” becomes surprisingly difficult to do. If the top line is RTL text, should you start laying out the next line from the right, even if the characters on the next line are mostly LTR text? What if the LTR characters are within parentheses and are followed by more RTL text?
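To get a feel for why mixed text is hard, it helps to look at the directional class that Unicode assigns to every codepoint. The sketch below uses Python's standard `unicodedata` module; the helper name `bidi_classes` is ours, for illustration only. Letters have a "strong" direction (`L`, `R`, or `AL` for Arabic letters), digits have their own classes (`EN`, `AN`), and spaces and parentheses are neutral, so their direction has to be resolved from context.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


# seen, space, digit 1, Arabic-Indic digit three, "(", Latin H, Hebrew shin
sample = "\u0633 1\u0663(H\u05e9"
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {cls}")
```

Resolving the neutral and digit classes against their surroundings is where most of the complexity of bidirectional layout comes from.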

The Unicode Consortium tells us exactly how to correctly handle bidirectional text layout, and luckily for us, the team working on International Components for Unicode (ICU) has already implemented this algorithm. Once we hook Mapbox up to ICU, it’s straightforward to hand ICU a “logical” input and a set of desired line breaks, and get back a set of lines with the characters placed in what’s called “visual order” (and here “visual” really means “left-to-right”).

Here is what the algorithm does with our simple test string:

processBidirectionalText("سلام",
    [no line breaks]) ->

U+0645 ("Meem")
U+0627 ("Alef")
U+0644 ("Lam")
U+0633 ("Seen")

Placing arabic with bidirectional layout

The characters are now in the correct right-to-left order (even though we printed them starting from the left). If we had rendered the Hebrew cognate “שָׁלוֹם” (Shalom), we’d be done by now, but in Arabic there’s still more work to do to make the characters legible.
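For a single-line label in an RTL script, the reordering can be approximated with a much simpler procedure than the full algorithm: reverse the whole string, then flip runs of Latin letters and European digits back so they still read left to right. The sketch below is a deliberately naive stand-in, not what ICU does; `naive_visual_order` is a hypothetical name, and it ignores line breaks, neutral characters between LTR words, brackets, and embedding levels.

```python
import unicodedata


def naive_visual_order(logical):
    """Approximate visual (left-to-right) order for an RTL label.

    Assumption: the label's base direction is RTL. This is NOT the full
    Unicode Bidirectional Algorithm.
    """
    out = []
    ltr_run = []
    for ch in reversed(logical):
        if unicodedata.bidirectional(ch) in ("L", "EN"):
            # Collect LTR characters so the run can be flipped back.
            ltr_run.append(ch)
        else:
            out.extend(reversed(ltr_run))
            ltr_run = []
            out.append(ch)
    out.extend(reversed(ltr_run))
    return "".join(out)


visual = naive_visual_order("\u0633\u0644\u0627\u0645")
print([f"U+{ord(ch):04X}" for ch in visual])
```

With the test string this gives meem, alef, lam, seen, matching the listing above. A label such as "سلام 25" keeps "25" readable as twenty-five, but anything more involved (several LTR words, parentheses, multiple lines) needs the real algorithm.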

Arabic Shaping

In printed Arabic, each character can have an “isolated,” “initial,” “medial,” and “final” form. As an example, here are the four forms for the Arabic letter “meem” (U+0645).

Meem forms. Copyright © 2015-2017 W3C® https://w3c.github.io/alreq/

The form you choose depends on the surrounding characters. If you select the right forms and place them next to each other, the words will appear gracefully connected, as if written in cursive.

Arabic fonts store all four of these glyphs for the single codepoint for “meem” (U+0645). Choosing the right glyph to display for the codepoint “meem” based on the surrounding codepoints is the core of the problem of complex text layout.

Normally, we wouldn’t be able to do complex text layout without using a library like HarfBuzz and having access to the “shaping tables” for the font, but in Arabic a fortuitous historical accident gives us an easy way out. When the Unicode encoding was standardized, one of its design goals was to provide an equivalent Unicode codepoint for every codepoint that existed in one of the then-current national encodings. Early Arabic encodings avoided the complex text shaping problem by assigning a codepoint for every single glyph (at the cost of making word processing a lot more complicated since editing one character also required editing surrounding characters). To support these original Arabic encodings, Unicode introduced what it calls the “presentation forms” of Arabic letters, where each codepoint represents exactly one form/glyph.
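The link between a "normal" letter and its presentation forms is recorded in the Unicode character database, so it can be inspected directly. The sketch below uses Python's standard `unicodedata` module to list the presentation forms of a letter from the "Arabic Presentation Forms-B" block (U+FE70 to U+FEFF); the helper `presentation_forms` is a name made up for this example.

```python
import unicodedata


def presentation_forms(base):
    """Map form name -> codepoint for the single-letter presentation forms
    of `base` found in the Arabic Presentation Forms-B block."""
    target = f"{ord(base):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        # Decompositions look like "<initial> 0645"; unassigned codepoints
        # return an empty string and are skipped.
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = cp
    return forms


for form, cp in presentation_forms("\u0645").items():  # meem
    print(f"{form}: U+{cp:04X} {unicodedata.name(chr(cp))}")
```

For meem this finds all four forms, while a letter like alef (U+0627), which never connects to the letter after it, only has isolated and final forms.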

These “presentation form” codepoints aren’t normally used in writing Arabic, but if we know the rules of Arabic, we can take any “normal” string of Arabic text and replace all of the codepoints with the appropriate “presentation form” codepoint. By doing so, we remove all ambiguity about which glyph goes with which codepoint. Again, we are lucky that ICU will do this transformation for us automatically.
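To make the replacement concrete, here is a toy shaper that only knows the four letters of our test string plus the lam-alef ligature. It is a sketch of the idea rather than ICU's implementation: the table, the function name `toy_arabic_shaping`, and the simplified joining rule (a letter connects to the next one only if it has an initial form) are all assumptions made for this example, and any other character raises a `KeyError`.

```python
# Forms are (isolated, final, initial, medial). None marks forms that a
# letter lacks because it never connects to the following letter.
FORMS = {
    "\u0633": ("\uFEB1", "\uFEB2", "\uFEB3", "\uFEB4"),  # seen
    "\u0644": ("\uFEDD", "\uFEDE", "\uFEDF", "\uFEE0"),  # lam
    "\u0645": ("\uFEE1", "\uFEE2", "\uFEE3", "\uFEE4"),  # meem
    "\u0627": ("\uFE8D", "\uFE8E", None, None),          # alef
}
LAM_ALEF = ("\uFEFB", "\uFEFC", None, None)  # lam + alef ligature


def toy_arabic_shaping(logical):
    """Replace letters (in logical order) with presentation-form codepoints."""
    # Step 1: merge each lam + alef pair into a single ligature entry.
    tokens = []
    i = 0
    while i < len(logical):
        if logical[i] == "\u0644" and logical[i + 1 : i + 2] == "\u0627":
            tokens.append(LAM_ALEF)
            i += 2
        else:
            tokens.append(FORMS[logical[i]])
            i += 1
    # Step 2: choose a form based on whether the neighbours connect.
    shaped = []
    for idx, forms in enumerate(tokens):
        joins_prev = idx > 0 and tokens[idx - 1][2] is not None
        joins_next = forms[2] is not None and idx + 1 < len(tokens)
        if joins_prev and joins_next:
            shaped.append(forms[3])  # medial
        elif joins_prev:
            shaped.append(forms[1])  # final
        elif joins_next:
            shaped.append(forms[2])  # initial
        else:
            shaped.append(forms[0])  # isolated
    return "".join(shaped)


shaped = toy_arabic_shaping("\u0633\u0644\u0627\u0645")
print([f"U+{ord(ch):04X}" for ch in shaped])
```

The result is still in logical order (seen first), so it would still need the bidirectional step before layout. A real shaper also has to handle every Arabic letter, diacritics, and non-joining characters, which is why we lean on ICU.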

Here we combine ICU’s Arabic shaping with the bidirectional transformation:

processBidirectionalText(
    applyArabicShaping("سلام"),
    [no line breaks]) ->

U+FEE1 ("Meem Isolated")
U+FEFC ("Lam with Alef Combined")
U+FEB3 ("Seen Initial")

Placing Arabic with BiDi and Shaping

Hooray!

Using ICU in Mapbox GL JS

ICU has twin C/C++ and Java implementations, which are widely used by both browsers (Chrome, Safari, etc.) and operating systems (Linux, Android, etc.). As a result, ICU has a well-tested and stable interface, and it was pretty straightforward to integrate into mapbox-gl-native. But there’s no equivalent to it for JavaScript. While there are a few JS projects out there that take a stab at implementing this functionality (including a recent PR for the iD editor), none that I know of implements the full Unicode Bidirectional Algorithm. Because rendering text in WebGL is still a relatively uncommon case, there’s just not (yet) the same demand for this kind of library.

While at some point it might make sense for us to directly port ICU (or some portion of it, since we don’t use the whole thing) to JavaScript, for now we’re using Emscripten to automatically convert ICU from C/C++ to JavaScript. This works well, but the resulting minified JavaScript weighs in at ~400KB. Although that’s not terrible compared to the ~300KB we needed to add ICU to mapbox-gl-native, it’s monster-sized for a JS dependency, and including it would nearly double the size of the mapbox-gl.js bundle.

Because of this added size, we’re doing our initial release of this functionality as an optional right-to-left text plugin.

Next up

The GL JS version of this functionality just shipped in v0.32.1, and the functionality will ship in the next versions of our mobile SDKs. If you’re interested in details, contact us or follow these developments on GitHub. After launch we’ll keep working on expanding the typographical capabilities of the map. Some of the features at the top of our list are:

  • Shaping support for Brahmic/Indic scripts
  • Shaping support for ligatures and kerning in Latin fonts
  • Improved language-aware line-breaking algorithms
  • Shaping support for more complex “calligraphic” fonts
