
Search should properly normalize strings before case folding and comparison

Following on from !244 (merged), as promised.

Feature request and progress

The current filename-based search implementation folds case for both the search term "needles" and the filename "haystack" strings. This is helpful, but it doesn't yet meet the needs of a globally aware file manager. Thunar should probably:

  • Easy: normalize the strings to NFD (at least) before case folding
  • Harder: optionally try to strip diacritics (like Firefox does with the right checkbox, might need some UI)
  • Harder still, maybe: real locale-aware comparisons, as g_utf8_collate() does them, so that characters like Turkish dotless I are handled correctly

Stage 1: simple Unicode normalization


This is the process of converting strings that contain combining characters to the fully composed form, or vice versa. It's done with g_utf8_normalize(). For example:

  • "à" can be represented in decomposed form <U+0061 (LATIN SMALL LETTER A) U+0300 (COMBINING GRAVE ACCENT)> or composed form <U+00E0 (LATIN SMALL LETTER A WITH GRAVE)>

This is needed for both needles and haystack because we're using the Unicode-unaware g_strrstr() for the substring matches. It should be performed before casefolding.
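The problem is easy to see with Python's stdlib unicodedata as a stand-in for g_utf8_normalize() (a sketch, not the Thunar code):

```python
import unicodedata

composed = "\u00e0"     # "à" as U+00E0 LATIN SMALL LETTER A WITH GRAVE
decomposed = "a\u0300"  # "a" followed by U+0300 COMBINING GRAVE ACCENT

# Byte-for-byte these differ, so a g_strrstr()-style substring search misses:
assert composed != decomposed

# After NFD normalization plus casefolding, the two forms compare equal:
nfd = lambda s: unicodedata.normalize("NFD", s).casefold()
assert nfd(composed) == nfd(decomposed)
```

The same holds in whichever direction we normalize, as long as needles and haystacks go through the same pass.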

Stage 2: locale-naïve diacritic stripping via decomposition


It would be super if users whose filenames contain a lot of diacritic characters could search for them by typing the unadorned form of the character. See the reference below by Turkish user "loki" on StackExchange for a quick user story in an unrelated program.
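A minimal sketch of that user story in stdlib Python, with unicodedata standing in for the GLib calls discussed below (the filename is made up):

```python
import unicodedata

def strip_diacritics(s):
    # NFKD-decompose, then drop anything with a non-zero combining class,
    # mirroring g_utf8_normalize() + g_unichar_combining_class().
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

needle, filename = "soupcon", "soupçon.txt"  # hypothetical filename
strip_diacritics(needle.casefold()) in strip_diacritics(filename.casefold())  # True
```

Typing the unadorned "soupcon" finds the adorned filename, which is the behaviour we want behind a Firefox-style checkbox.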

The way GLib anticipated we'd all be doing this is by providing g_str_match_string(), g_str_tokenize_and_fold(), and ultimately (and alarmingly!) g_str_to_ascii(), which is supposedly locale-aware. Except that it isn't, and it converts almost everything that isn't already ASCII or a known asciification into a "?". The concept of searching for, and amongst, sets of generated alternative forms is worthwhile, however, and I'll be returning to it.

Firefox is an interesting case here. It has UI for diacritic matching and also for whole-word matching (I'll be returning to that too!).

[Screenshot: the search bar that appears in Firefox when you press Ctrl+F]

One caveat: if we do this by normalizing to Normalization Form KD and stripping out the characters for which g_unichar_combining_class() returns non-zero, as in the Eevee reference below, it could change the meanings of certain texts. But it looks acceptable for the examples given here.

The transform cannot always be round-tripped, but that doesn't matter to us because it's supposed to be a mangling of the text. I remain to be convinced that applying it to both needles and haystacks would render the search unworkable by returning so many hilariously bad matches that a user literate in the language in question couldn't come up with a workaround. Particularly if we have a Firefox-like diacritics button in the search area…

Here's a sample implementation, in Python:

import locale
locale.setlocale(locale.LC_ALL, "")

import gi
from gi.repository import GLib

def normalize(s, diacritics=True, casefold=True):
    if diacritics:
        s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFKD)
        s = "".join(c for c in s if not GLib.unichar_combining_class(c))
    else:
        s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFKC)
    if casefold:
        s = GLib.utf8_casefold(s, -1)
    return s

s = '한글 イーブイ ャ IJsselmeer ٣³3➌ Ø İstanbul hayırlı Queensrÿche Spın̈al Tap soupçon n̶o̶t̶ ̶t̶h̶i̶s̶'
print(s)
s = normalize(s)
print(s)
print(GLib.str_to_ascii(s, None))

Produces

한글 イーブイ ャ IJsselmeer ٣³3➌ Ø İstanbul hayırlı Queensrÿche Spın̈al Tap soupçon n̶o̶t̶ ̶t̶h̶i̶s̶
한글 イーフイ ャ ijsselmeer ٣33➌ ø istanbul hayırlı queensryche spınal tap soupcon not this
?????? ???? ? ijsselmeer ?33? oe istanbul hay?rl? queensryche sp?nal tap soupcon not this

Convince me we shouldn't be doing this (well, not g_str_to_ascii() because that's terrible)! I need more examples in multiple languages. Is there ever a mangling that's going to return mismatched search results in a way that makes no sense to the user?

Another caveat here: this algorithm is completely unaware of what language is being written (because Unicode cannot encode it), and also of the current locale (which is probably what's most meaningful to the user). This is why it breaks even under LANG=tr_TR.UTF-8.
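To make the tr_TR breakage concrete, here's the same NFKD-plus-strip mangling in stdlib Python (same algorithm as the GLib sample above, not the proposed Thunar code):

```python
import unicodedata

def mangle(s):
    # NFKD-decompose, strip combining marks, then casefold.
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    return s.casefold()

# İ (U+0130) decomposes to I + COMBINING DOT ABOVE, so stripping happens to work:
assert mangle("İstanbul") == "istanbul"
# But ı (U+0131) has no decomposition, so an ASCII "i" needle can never match it:
assert mangle("hayırlı") != "hayirli"
```

A Turkish user searching for "hayirli" gets nothing, while "İstanbul" works only by accident of the decomposition tables.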

(And g_str_to_ascii(), which is supposed to be locale-aware, can't handle downsampling dotless ı to dotted ASCII "i", let alone the useless mess it makes of Japanese or Korean characters. Let's not use it, or its related functions g_str_match_string() and g_str_tokenize_and_fold().)

Stage 3 galaxy brain: locale aware comparisons


This is less worked through because it may be a bad idea.

Remember that Firefox can match by word? And that g_str_tokenize_and_fold() actually splits on whitespace? Could we do something like that, and then make use of the locale-specific, strcmp()able collation key that g_utf8_collate_key() produces? However, that doesn't seem to cope with dotless I in a Turkish locale either, and it still needs pre-casefolding (which g_utf8_casefold() can't do in a locale-aware way).
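A rough sketch of that word-level idea, using Python's locale.strxfrm as a stand-in for g_utf8_collate_key() (the tokenizer is naïve whitespace splitting, with all the CJK problems noted below):

```python
import locale
locale.setlocale(locale.LC_ALL, "")

def word_keys(s):
    # Tokenize on whitespace (like g_str_tokenize_and_fold()), pre-casefold,
    # then build a locale-aware collation key for each word.
    return [locale.strxfrm(w.casefold()) for w in s.split()]

def word_match(needle, haystack):
    # Whole-word matching: every needle word must collate equal to some
    # haystack word. Still subject to the casefold problems above.
    hay = set(word_keys(haystack))
    return all(k in hay for k in word_keys(needle))
```

Equal casefolded strings always produce equal collation keys, so this degrades gracefully; the Turkish ı problem lives in the casefold step, not the collation step.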

(Other big flaw with matching by word: word-splitting Japanese and Chinese is hard, because they don't use whitespace to divide words most of the time. This would have to have some UI.)

Another resource is the big database of Unicode confusables in Raymond Chen's reference below. However, I'm not sure how you'd start with that, and it's not present in unicode-data on my system.

A third half-baked idea would be to get the translators to do upstream's work for them, and have a localizable casefold override string like C_("searching:casefold-override", "") (for C/en_US) which the Turkish translators, say, could flesh out as "ı:i İ:i". And then apply the resultant mapping as a pre-pass before g_utf8_casefold() to work around its shortcomings.
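The override idea could be as small as this sketch (the msgctxt and the "from:to" mapping format are invented here, not existing Thunar or GLib API):

```python
def apply_casefold_override(s, override):
    # override is the translator-supplied string, e.g. "ı:i İ:i" for tr;
    # each space-separated "from:to" pair is applied before the generic
    # casefold pass (g_utf8_casefold() in the real code).
    for pair in override.split():
        src, _, dst = pair.partition(":")
        s = s.replace(src, dst)
    return s.casefold()

apply_casefold_override("hayırlı İstanbul", "ı:i İ:i")  # -> "hayirli istanbul"
```

An empty override string (the C/en_US default) makes the pre-pass a no-op, so untranslated locales lose nothing.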

Some light background reading

Edited by Andrew Chadwick