Search should properly normalize strings before case folding and comparison
Following on from !244, as promised. ## Feature request and progress The current filename-based search implementation folds case for both the search term "needles" and the filename "haystack" strings. This is helpful, but it doesn't yet meet the needs of a globally aware file manager. Thunar should probably: - [X] Easy: normalize the strings to NFD (at least) before case folding - [X] Harder: optionally try to strip diacritics (like Firefox does with the right checkbox, might need some UI) - [ ] Harder still maybe: real locale aware comparisons like `g_str_collate()` does it, so that characters like Turkish dotless I get casefolded correctly ### Stage 1: simple Unicode normalization <details><summary>Click to expand</summary> This is the process of converting strings with composing characters to the fully-composed version, or vice versa. This is done with `g_utf8_normalize()`. For example: * "à" can be represented in decomposed form <U+0061 (LATIN SMALL LETTER A) U+0300 (COMBINING GRAVE ACCENT)> or composed form <U+00E0 (LATIN SMALL LETTER A WITH GRAVE)> This is needed for both needles and haystack because we're using the Unicode-unaware `g_strrstr()` for the substring matches. It should be performed before casefolding. </details> ### Stage 2: locale-naïve diacritic stripping via decomposition <details><summary>Click to expand</summary> It would be super if users whose filenames contain a lot of diacritic characters could search for them by typing the unadorned form of the character. See the reference below by Turkish user "loki" on StackExchange for a quick user story in an unrelated program. The way GLib anticipated we'd all be doing this is by providing `g_str_match_string()`, `g_str_tokenize_and_fold()`, and ultimately (and alarmingly!) `g_str_to_ascii()` which is supposedly locale aware. Except that it isn't, and it converts almost everything that isn't already ASCII or a known asciification into a `?`. The concept of searching for, and amongst, sets of generated alternative forms is worthwhile, however, and I'll be returning to it Firefox is an interesting case here. It has UI for diacritic matching and also whole word matching (I'll be returning to that too!) ![Pic of the search bar that appears in firefox when you press Ctrl+F](/uploads/62a4b7853030d45bdc4150e373fa5715/image.png) One caveat: if we do this by decomposing by normalizing to [Normalization Form KD](https://www.unicode.org/reports/tr15/#Norm_Forms) and stripping out the characters for which `g_unichar_combining_class()` returns non-zero, as in the Eevee reference below, this could change the meanings of certain texts. But for the examples given: * Korean 한글 ([Hangul](https://en.wikipedia.org/wiki/Hangul)) becomes 한글 (Hangul, but written in jamo salad) - Wiktionary (for example) is pretty relaxed about the <https://en.wiktionary.org/wiki/한글> - It's basically the same string but decomposed into its constituent jamo. - Doesn't really matter if it's intelligible because we're not displaying this. - There's an [algorithm for recomposing the salad back to precomposed Hangul syllables](https://stackoverflow.com/a/42478456) * Japanese イーブイ (Ībui, or [Eevee](https://en.wikipedia.org/wiki/Eevee)) becomes イーフイ (Īfui) - Not round-trippable. Information _has_ been lost, but this may not matter (I'm uncertain about this) - The misspelling is known to search engines if that's any yardstick <https://duckduckgo.com/?q=%2B"イーフイ"&t=ffab&iax=images&ia=images> - I don't know how well that corresponds to Japanese keyboards This cannot be round tripped always, but that doesn't matter to us because it's _supposed_ to be a mangling of the text. I remain to be convinced that this transform on both needles and haystacks would render the search unworkable by returning a huge number of hilariously bad matches that a user literate in the language in question couldn't come up with a workaround for. Particularly if we have a Firefox-like diacritics button in the search area… Here's a sample implementation, in Python ```python import locale locale.setlocale(locale.LC_ALL, "") import gi from gi.repository import GLib def normalize(s, diacritics=True, casefold=True): if diacritics: s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFKD) s = "".join(c for c in s if not GLib.unichar_combining_class(c)) else: s = GLib.utf8_normalize(s, -1, GLib.NormalizeMode.NFKC) if casefold: s = GLib.utf8_casefold(s, -1) return s s = '한글 イーブイ ャ IJsselmeer ٣³3➌ Ø İstanbul hayırlı Queensrÿche Spın̈al Tap soupçon n̶o̶t̶ ̶t̶h̶i̶s̶' print(s) s = normalize(s) print(s) print(GLib.str_to_ascii(s, None)) ``` Produces ``` 한글 イーブイ ャ IJsselmeer ٣³3➌ Ø İstanbul hayırlı Queensrÿche Spın̈al Tap soupçon n̶o̶t̶ ̶t̶h̶i̶s̶ 한글 イーフイ ャ ijsselmeer ٣33➌ ø istanbul hayırlı queensryche spınal tap soupcon not this ?????? ???? ? ijsselmeer ?33? oe istanbul hay?rl? queensryche sp?nal tap soupcon not this ``` Convince me we shouldn't be doing this (well, not `g_str_to_ascii()` because that's terrible)! I need more examples in multiple languages. Is there ever a mangling that's going to return mismatched search results in a way that makes no sense to the user? Another caveat here: this algorithm is completely unaware of what language is being written (because Unicode cannot encode it), and also of the current locale (which is probably what's most meaningful to the user). This is why it breaks even under `LANG=tr_TR.UTF-8`. (and `g_str_to_ascii()`, which is supposed to be locale aware, can't handle downsampling dotless I to dotted ASCII i, let alone the useless mess it makes of Japanese or Korean characters. Let us not use this or its related functions `g_str_match_string()` or `g_str_tokenize_and_fold()`) </details> ### Stage 3 galaxy brain: locale aware comparisons <details><summary>Click to expand</summary> This is less worked through because it may be a bad idea. Remember that Firefox can match by word? And that `g_str_tokenize_and_fold()` actually splits on whitespace? Could we do something like that, and then make use of the locale-specific cmp()able hash that something like `g_utf8_collate_key()`? However, this doesn't seem to cope with dotless I in a Turkish locale, and it still needs pre-casefolding (and `g_utf8_casefold()` can't do that) (Other big flaw with matching by word: wordsplitting Japanese and Chinese is *hard* because they don't use whitespace to divide words most of the time. This would have to have some UI.) Another resource is the big database of Unicode confusables in Raymond Chen's reference below. However, I'm not sure how you'd start with that, and it's not present in unicode-data on my system. A third half-baked idea would be to get the translators to do upstream's work for them, and have a localizable casefold override string like `C_("searching:casefold-override", "")` (for C/en_US) which the Turkish translators, say, could flesh out as `"ı:i İ:i"`. And then apply the resultant mapping as a pre-pass before `g_utf8_casefold()` to work around its shortcomings. </details> ## Some light background reading * [Dark corners of Unicode by Eeevee](https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/) * [The [gnarly] world of stripping diacritics by Raymond Chen](https://devblogs.microsoft.com/oldnewthing/20141124-00/?p=43553) * [stackoverflow:42974173 - questioner loki explains some Turkish language search expectations](https://stackoverflow.com/questions/42974173)
issue