Search should properly normalize strings before case folding and comparison
Following on from !244 (merged), as promised.
Feature request and progress
The current filename-based search implementation folds case for both the search term "needles" and the filename "haystack" strings. This is helpful, but it doesn't yet meet the needs of a globally aware file manager. Thunar should probably:
- Easy: normalize the strings to NFD (at least) before case folding
- Harder: optionally try to strip diacritics (like Firefox does with the right checkbox; might need some UI)
- Harder still, maybe: real locale aware comparisons like g_utf8_collate() does, so that characters like the Turkish dotless I get casefolded correctly (see the sketch just below this list)
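To make that last point concrete, here's a quick illustration (Python via PyGObject, purely to demonstrate the problem) of why g_utf8_casefold() alone can't serve a Turkish user: Unicode case folding knows nothing about the current locale, so it always maps "I" to dotted "i".

```python
from gi.repository import GLib

# Unicode full case folding is locale-independent:
print(GLib.utf8_casefold("KAYIT", -1))     # "kayit" - a Turkish user expects "kayıt"
print(GLib.utf8_casefold("İstanbul", -1))  # "i̇stanbul" - U+0069 U+0307, i + combining dot above
```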
Stage 1: simple Unicode normalization
This is the process of converting strings with combining characters to the fully composed form, or vice versa. It's done with g_utf8_normalize(). For example:
- "à" can be represented in decomposed form <U+0061 (LATIN SMALL LETTER A) U+0300 (COMBINING GRAVE ACCENT)> or composed form <U+00E0 (LATIN SMALL LETTER A WITH GRAVE)>
This is needed for both needles and haystack because we're using the Unicode-unaware g_strrstr() for the substring matches. It should be performed before case folding.
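A minimal sketch of that pipeline, in Python via PyGObject for brevity (the real code would be C, and plain `in` stands in here for g_strrstr()):

```python
from gi.repository import GLib

def fold(s):
    # Normalize to NFD first, then case fold, so composed and decomposed
    # spellings of the same character compare equal.
    s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFD)
    return GLib.utf8_casefold(s, -1)

needle = "\u00e0"              # "à" precomposed, U+00E0
haystack = "a\u0300 la carte"  # "à la carte" with the decomposed U+0061 U+0300
print(needle in haystack)              # False: a byte-wise substring search misses it
print(fold(needle) in fold(haystack))  # True once both sides are normalized
```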
Stage 2: locale-naïve diacritic stripping via decomposition
It would be super if users whose filenames contain a lot of diacritic characters could search for them by typing the unadorned form of the character. See the reference below by Turkish user "loki" on StackExchange for a quick user story in an unrelated program.
The way GLib anticipated we'd all be doing this is by providing g_str_match_string(), g_str_tokenize_and_fold(), and ultimately (and alarmingly!) g_str_to_ascii(), which is supposedly locale aware. Except that it isn't, and it converts almost everything that isn't already ASCII or a known asciification into a "?". The concept of searching for, and amongst, sets of generated alternative forms is worthwhile, however, and I'll be returning to it.
Firefox is an interesting case here. It has UI for diacritic matching and also whole-word matching (I'll be returning to that too!).
One caveat: if we do this by normalizing to Normalization Form KD (NFKD) and stripping out the characters for which g_unichar_combining_class() returns non-zero, as in the Eevee reference below, this could change the meanings of certain texts. But for the examples given:
- Korean 한글 (Hangul) becomes 한글 (Hangul, but written in jamo salad)
  - Wiktionary (for example) is pretty relaxed about the distinction: https://en.wiktionary.org/wiki/한글
  - It's basically the same string, just decomposed into its constituent jamo.
  - It doesn't really matter whether it's intelligible, because we're not displaying this.
  - There's an algorithm for recomposing the salad back into precomposed Hangul syllables (see the sketch just after this list).
- Japanese イーブイ (Ībui, or Eevee) becomes イーフイ (Īfui)
  - Not round-trippable. Information has been lost, but this may not matter (I'm uncertain about this).
  - The misspelling is known to search engines, if that's any yardstick: https://duckduckgo.com/?q=%2B"イーフイ"&t=ffab&iax=images&ia=images
  - I don't know how well that corresponds to Japanese keyboards.
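(A quick check of the recomposition claim above, again assuming PyGObject: canonical composition, i.e. NFC, turns the jamo salad back into precomposed syllables.)

```python
from gi.repository import GLib

syllables = "한글"
jamo = GLib.utf8_normalize(syllables, -1, GLib.NormalizeMode.NFKD)  # decomposed jamo salad
recomposed = GLib.utf8_normalize(jamo, -1, GLib.NormalizeMode.NFC)  # back to syllables
print(len(syllables), len(jamo), len(recomposed))  # 2 6 2
print(recomposed == syllables)                     # True
```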
This cannot always be round-tripped, but that doesn't matter to us because it's supposed to be a mangling of the text. I remain to be convinced that this transform on both needles and haystacks would render the search unworkable by returning a huge number of hilariously bad matches that a user literate in the language in question couldn't come up with a workaround for. Particularly if we have a Firefox-like diacritics button in the search area…
Here's a sample implementation, in Python:

```python
import locale
locale.setlocale(locale.LC_ALL, "")

import gi
from gi.repository import GLib

def normalize(s, diacritics=True, casefold=True):
    if diacritics:
        # Decompose (NFKD), then drop every combining mark
        s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFKD)
        s = "".join(c for c in s if not GLib.unichar_combining_class(c))
    else:
        s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFKC)
    if casefold:
        s = GLib.utf8_casefold(s, -1)
    return s

s = '한글 イーブイ ャ IJsselmeer ٣³3➌ Ø İstanbul hayırlı Queensrÿche Spın̈al Tap soupçon n̶o̶t̶ ̶t̶h̶i̶s̶'
print(s)
s = normalize(s)
print(s)
print(GLib.str_to_ascii(s, None))
```
Produces:

```
한글 イーブイ ャ IJsselmeer ٣³3➌ Ø İstanbul hayırlı Queensrÿche Spın̈al Tap soupçon n̶o̶t̶ ̶t̶h̶i̶s̶
한글 イーフイ ャ ijsselmeer ٣33➌ ø istanbul hayırlı queensryche spınal tap soupcon not this
?????? ???? ? ijsselmeer ?33? oe istanbul hay?rl? queensryche sp?nal tap soupcon not this
```
Convince me we shouldn't be doing this (well, not g_str_to_ascii(), because that's terrible)! I need more examples in multiple languages. Is there ever a mangling that's going to return mismatched search results in a way that makes no sense to the user?
Another caveat here: this algorithm is completely unaware of what language is being written (because Unicode cannot encode it), and also of the current locale (which is probably what's most meaningful to the user). This is why it breaks even under LANG=tr_TR.UTF-8.
(And g_str_to_ascii(), which is supposed to be locale aware, can't handle downsampling dotless I to dotted ASCII i, let alone the useless mess it makes of Japanese or Korean characters. Let us not use this or its related functions g_str_match_string() or g_str_tokenize_and_fold().)
Stage 3 galaxy brain: locale aware comparisons
This is less worked through because it may be a bad idea.
Remember that Firefox can match by word? And that g_str_tokenize_and_fold() actually splits on whitespace? Could we do something like that, and then make use of the locale-specific, cmp()-able hash that something like g_utf8_collate_key() produces? However, this doesn't seem to cope with dotless I in a Turkish locale, and it still needs pre-casefolding (and g_utf8_casefold() can't do that).
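For the sake of argument, a rough sketch of how that might look (my own guess at the shape of it, not a worked design): split both strings into case-folded words and compare their g_utf8_collate_key() keys.

```python
from gi.repository import GLib

def word_keys(s):
    # Normalize and case fold, split on whitespace, then build the
    # locale-specific comparison keys that g_utf8_collate_key() returns.
    folded = GLib.utf8_casefold(GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFD), -1)
    return {GLib.utf8_collate_key(word, -1) for word in folded.split()}

def matches_by_word(needle, haystack):
    # Every word of the needle must appear, by collation key, in the haystack.
    return word_keys(needle) <= word_keys(haystack)

# Still wrong under LANG=tr_TR.UTF-8: g_utf8_casefold() has already mapped
# "I" to "i" rather than "ı" before the collate key is ever computed.
print(matches_by_word("dosya", "Önemli Dosya Adı"))  # True
```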
(Other big flaw with matching by word: wordsplitting Japanese and Chinese is hard because they don't use whitespace to divide words most of the time. This would have to have some UI.)
Another resource is the big database of Unicode confusables in Raymond Chen's reference below. However, I'm not sure how you'd start with that, and it's not present in unicode-data on my system.
A third half-baked idea would be to get the translators to do upstream's work for them, and have a localizable casefold override string like C_("searching:casefold-override", "") (for C/en_US), which the Turkish translators, say, could flesh out as "ı:i İ:i". And then apply the resultant mapping as a pre-pass before g_utf8_casefold() to work around its shortcomings.
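A sketch of what that pre-pass could look like, assuming a space-separated "from:to" pair format (the format and the helper are mine, for illustration, not existing Thunar or GLib API):

```python
from gi.repository import GLib

def casefold_with_overrides(s, overrides):
    # "overrides" is the translator-provided string, e.g. "ı:i İ:i",
    # applied as literal replacements before the normal case fold.
    for pair in overrides.split():
        frm, _, to = pair.partition(":")
        if frm:
            s = s.replace(frm, to)
    return GLib.utf8_casefold(s, -1)

print(casefold_with_overrides("İstanbul hayırlı", "ı:i İ:i"))  # "istanbul hayirli"
```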