Search should properly normalize strings before case folding and comparison
Following on from !244 (merged), as promised.
Feature request and progress
The current filename-based search implementation folds case for both the search term "needles" and the filename "haystack" strings. This is helpful, but it doesn't yet meet the needs of a globally aware file manager. Thunar should probably:
- Easy: normalize the strings to NFD (at least) before case folding
- Harder: optionally try to strip diacritics (like Firefox does with the right checkbox; might need some UI)
- Harder still, maybe: real locale aware comparisons like g_utf8_collate() does, so that characters like the Turkish dotless I get casefolded correctly (see the sketch just below this list)
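To make that last point concrete, here's a quick illustration (Python via PyGObject, purely to demonstrate the problem) of why g_utf8_casefold() alone can't serve a Turkish user: Unicode case folding knows nothing about the current locale, so it always maps "I" to dotted "i".

```python
from gi.repository import GLib

# Unicode full case folding is locale-independent:
print(GLib.utf8_casefold("KAYIT", -1))     # "kayit" - a Turkish user expects "kayıt"
print(GLib.utf8_casefold("İstanbul", -1))  # "i̇stanbul" - U+0069 U+0307, i + combining dot above
```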
Stage 1: simple Unicode normalization
This is the process of converting strings with combining characters to the fully composed form, or vice versa. It's done with g_utf8_normalize(). For example:
- "à" can be represented in decomposed form <U+0061 (LATIN SMALL LETTER A) U+0300 (COMBINING GRAVE ACCENT)> or composed form <U+00E0 (LATIN SMALL LETTER A WITH GRAVE)>
This is needed for both needles and haystack because we're using the Unicode-unaware g_strrstr() for the substring matches. It should be performed before case folding.
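A minimal sketch of that pipeline, in Python via PyGObject for brevity (the real code would be C, and plain `in` stands in here for g_strrstr()):

```python
from gi.repository import GLib

def fold(s):
    # Normalize to NFD first, then case fold, so composed and decomposed
    # spellings of the same character compare equal.
    s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFD)
    return GLib.utf8_casefold(s, -1)

needle = "\u00e0"              # "à" precomposed, U+00E0
haystack = "a\u0300 la carte"  # "à la carte" with the decomposed U+0061 U+0300
print(needle in haystack)              # False: a byte-wise substring search misses it
print(fold(needle) in fold(haystack))  # True once both sides are normalized
```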
Stage 2: locale-naïve diacritic stripping via decomposition
It would be super if users whose filenames contain a lot of diacritic characters could search for them by typing the unadorned form of the character. See the reference below by Turkish user "loki" on StackExchange for a quick user story in an unrelated program.
The way GLib anticipated we'd all be doing this is by providing g_str_match_string(), g_str_tokenize_and_fold(), and ultimately (and alarmingly!) g_str_to_ascii(), which is supposedly locale aware. Except that it isn't, and it converts almost everything that isn't already ASCII or a known asciification into a "?". The concept of searching for, and amongst, sets of generated alternative forms is worthwhile, however, and I'll be returning to it.
Firefox is an interesting case here. It has UI for diacritic matching and also whole-word matching (I'll be returning to that too!).
One caveat: if we do this by normalizing to Normalization Form KD (NFKD) and stripping out the characters for which g_unichar_combining_class() returns non-zero, as in the Eevee reference below, this could change the meanings of certain texts. But for the examples given:
- Korean 한글 (Hangul) becomes 한글 (Hangul, but written in jamo salad)
  - Wiktionary (for example) is pretty relaxed about the distinction: https://en.wiktionary.org/wiki/한글
  - It's basically the same string, just decomposed into its constituent jamo.
  - It doesn't really matter whether it's intelligible, because we're not displaying this.
  - There's an algorithm for recomposing the salad back into precomposed Hangul syllables (see the sketch just after this list).
- Japanese イーブイ (Ībui, or Eevee) becomes イーフイ (Īfui)
  - Not round-trippable. Information has been lost, but this may not matter (I'm uncertain about this).
  - The misspelling is known to search engines, if that's any yardstick: https://duckduckgo.com/?q=%2B"イーフイ"&t=ffab&iax=images&ia=images
  - I don't know how well that corresponds to Japanese keyboards.
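(A quick check of the recomposition claim above, again assuming PyGObject: canonical composition, i.e. NFC, turns the jamo salad back into precomposed syllables.)

```python
from gi.repository import GLib

syllables = "한글"
jamo = GLib.utf8_normalize(syllables, -1, GLib.NormalizeMode.NFKD)  # decomposed jamo salad
recomposed = GLib.utf8_normalize(jamo, -1, GLib.NormalizeMode.NFC)  # back to syllables
print(len(syllables), len(jamo), len(recomposed))  # 2 6 2
print(recomposed == syllables)                     # True
```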
This cannot always be round-tripped, but that doesn't matter to us because it's supposed to be a mangling of the text. I remain to be convinced that this transform on both needles and haystacks would render the search unworkable by returning a huge number of hilariously bad matches that a user literate in the language in question couldn't come up with a workaround for. Particularly if we have a Firefox-like diacritics button in the search area…
Here's a sample implementation, in Python:

```python
import locale
locale.setlocale(locale.LC_ALL, "")

import gi
from gi.repository import GLib

def normalize(s, diacritics=True, casefold=True):
    if diacritics:
        # Decompose (NFKD), then drop every combining mark
        s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFKD)
        s = "".join(c for c in s if not GLib.unichar_combining_class(c))
    else:
        s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFKC)
    if casefold:
        s = GLib.utf8_casefold(s, -1)
    return s

s = '한글 イーブイ ャ IJsselmeer ٣³3➌ Ø İstanbul hayırlı Queensrÿche Spın̈al Tap soupçon n̶o̶t̶ ̶t̶h̶i̶s̶'
print(s)
s = normalize(s)
print(s)
print(GLib.str_to_ascii(s, None))
```
Produces:

```
한글 イーブイ ャ IJsselmeer ٣³3➌ Ø İstanbul hayırlı Queensrÿche Spın̈al Tap soupçon n̶o̶t̶ ̶t̶h̶i̶s̶
한글 イーフイ ャ ijsselmeer ٣33➌ ø istanbul hayırlı queensryche spınal tap soupcon not this
?????? ???? ? ijsselmeer ?33? oe istanbul hay?rl? queensryche sp?nal tap soupcon not this
```
Convince me we shouldn't be doing this (well, not g_str_to_ascii(), because that's terrible)! I need more examples in multiple languages. Is there ever a mangling that's going to return mismatched search results in a way that makes no sense to the user?
Another caveat here: this algorithm is completely unaware of what language is being written (because Unicode cannot encode it), and also of the current locale (which is probably what's most meaningful to the user). This is why it breaks even under LANG=tr_TR.UTF-8.
(And g_str_to_ascii(), which is supposed to be locale aware, can't handle downsampling dotless I to dotted ASCII i, let alone the useless mess it makes of Japanese or Korean characters. Let us not use this or its related functions g_str_match_string() or g_str_tokenize_and_fold().)
Stage 3 galaxy brain: locale aware comparisons
This is less worked through because it may be a bad idea.
Remember that Firefox can match by word? And that g_str_tokenize_and_fold() actually splits on whitespace? Could we do something like that, and then make use of the locale-specific, cmp()-able hash that something like g_utf8_collate_key() produces? However, this doesn't seem to cope with dotless I in a Turkish locale, and it still needs pre-casefolding (and g_utf8_casefold() can't do that).
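For the sake of argument, a rough sketch of how that might look (my own guess at the shape of it, not a worked design): split both strings into case-folded words and compare their g_utf8_collate_key() keys.

```python
from gi.repository import GLib

def word_keys(s):
    # Normalize and case fold, split on whitespace, then build the
    # locale-specific comparison keys that g_utf8_collate_key() returns.
    folded = GLib.utf8_casefold(GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFD), -1)
    return {GLib.utf8_collate_key(word, -1) for word in folded.split()}

def matches_by_word(needle, haystack):
    # Every word of the needle must appear, by collation key, in the haystack.
    return word_keys(needle) <= word_keys(haystack)

# Still wrong under LANG=tr_TR.UTF-8: g_utf8_casefold() has already mapped
# "I" to "i" rather than "ı" before the collate key is ever computed.
print(matches_by_word("dosya", "Önemli Dosya Adı"))  # True
```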
(Other big flaw with matching by word: wordsplitting Japanese and Chinese is hard because they don't use whitespace to divide words most of the time. This would have to have some UI.)
Another resource is the big database of Unicode confusables in Raymond Chen's reference below. However, I'm not sure how you'd start with that, and it's not present in unicode-data on my system.
A third half-baked idea would be to get the translators to do upstream's work for them, and have a localizable casefold override string like C_("searching:casefold-override", "") (for C/en_US), which the Turkish translators, say, could flesh out as "ı:i İ:i". And then apply the resultant mapping as a pre-pass before g_utf8_casefold() to work around its shortcomings.
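A sketch of what that pre-pass could look like, assuming a space-separated "from:to" pair format (the format and the helper are mine, for illustration, not existing Thunar or GLib API):

```python
from gi.repository import GLib

def casefold_with_overrides(s, overrides):
    # "overrides" is the translator-provided string, e.g. "ı:i İ:i",
    # applied as literal replacements before the normal case fold.
    for pair in overrides.split():
        frm, _, to = pair.partition(":")
        if frm:
            s = s.replace(frm, to)
    return GLib.utf8_casefold(s, -1)

print(casefold_with_overrides("İstanbul hayırlı", "ı:i İ:i"))  # "istanbul hayirli"
```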