Save what you can in new encoding instead of aborting
Problem:
Mousepad refuses to save a file in another encoding if even one invalid byte exists in a (possibly massive) text file. There is no option to "Save anyway" and just let that one character out of a thousand get lost.
This problem affects me and others who make subtitles for Smart TVs. Even if only one character in a large transcript of movie dialog (e.g., a subtitle file) would get corrupted, we cannot choose to save the file in the new encoding and then compare the old and new files to see if anything substantial was lost.
Mousepad just says "blah, blah ... invalid byte ... blah, blah" and aborts the saving process. It will not allow me to save in another encoding at all.
Suggested Solution:
Mousepad tells the user about the problem and asks how to proceed (abort or save anyway). If the user chooses to save anyway, Mousepad saves the file with as many valid bytes as possible and gracefully discards the one or two characters that are outside the range of the new encoding or otherwise corrupted.
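The "save anyway" step could look roughly like the following. This is only a Python sketch of the idea, not Mousepad's code (Mousepad is written in C); the function name `encode_lossy` is made up for illustration, and Python's `cp949` codec is used here as a stand-in for UHC.

```python
from typing import Tuple


def _encodable(ch: str, encoding: str) -> bool:
    """Return True if a single character can be represented in the encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def encode_lossy(text: str, encoding: str = "cp949") -> Tuple[bytes, int]:
    """Encode text, substituting '?' for characters the encoding lacks.

    Returns the encoded bytes and the number of characters that were replaced,
    so the caller can report the loss to the user before writing the file.
    """
    try:
        # Strict encoding succeeds only if nothing would be lost.
        return text.encode(encoding), 0
    except UnicodeEncodeError:
        pass
    lost = sum(1 for ch in text if not _encodable(ch, encoding))
    return text.encode(encoding, errors="replace"), lost
```

An editor could call this after the user confirms "save anyway" and show the returned count in the confirmation dialog, so the user knows how many characters were replaced.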
Actual Behavior (the problem in action):
- Fire up Mousepad.
- Open a text file in UTF-8 encoding that has many characters outside of basic ASCII but that are all available in another text encoding (like 물). In my case, the text file has a bunch of Korean characters but also has one invalid byte or a whitespace character that is not in UHC encoding, or it includes a character from another script (like Ц) hidden somewhere in the massive text file.
- Try to save that text file with "Save As ..." in another encoding that is appropriate, like UHC.
- Mousepad throws up a warning message pop-up that says there is one or more invalid bytes in the file and that it is impossible to save in the new format, and then Mousepad aborts the saving process.
Expected Behavior (the solution in action):
- Fire up Mousepad.
- Open a text file in UTF-8 that is full of characters outside of basic ASCII but that are all available in another text encoding. In my case, this is the Korean script. Characters like 물 do not exist in ASCII, but they do exist in encodings like UHC. But there is one invalid byte or invalid whitespace or some character that is in Unicode but not in UHC.
- Try to save that text file with "Save As ..." in another encoding that is appropriate, like UHC.
- Mousepad should throw up a warning message pop-up that says there is one or more invalid bytes in the file, or highlight the problem character or line, and then ask the user whether to abort the saving process or to save the file in the new encoding even if one or more characters are lost in the re-encoding process. The user can always choose to abort and then save under a new filename in order to test how many characters get lost in the re-encoding process with a simple before-and-after look at the text file's character count.
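To illustrate the "highlight the problem character or line" part, here is a small Python sketch that lists where the unencodable characters are. It is an assumption about how such a check could work, not how Mousepad does or would do it; `find_unencodable` is a hypothetical name, and `cp949` again stands in for UHC.

```python
from typing import List, Tuple


def find_unencodable(text: str, encoding: str = "cp949") -> List[Tuple[int, int, str]]:
    """Return (line, column, character) for every character that the
    target encoding cannot represent. Line and column are 1-based."""
    problems = []
    # Split on "\n" only, so line numbers match what a text editor shows.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                problems.append((line_no, col_no, ch))
    return problems
```

With a list like this, an editor could jump to the first offending line or show the total count in the warning dialog, instead of leaving the user to search thousands of lines by hand.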
Explanation:
Because I sometimes use a USB stick to watch multimedia content on a TV, I noticed that my TV only supports ISO-8859-1 (Western European) and EUC-KR/UHC. The TV includes two fonts with glyphs for all of the printable characters in one or the other set. UTF-8 encoding will not work at all on most Smart TVs beyond the basic ASCII characters.
(ASCII printable characters, the space character, and <CR><LF> are kept at the same codepoints in ISO-8859-X, GBK/GB 18030, Big5/HKSCS, EUC-KR/UHC, Shift-JIS, and UTF-8.)
I save my subtitle files as two copies. One is in UTF-8 encoding because, of course, I use UTF-8. I then use "Save As ..." to create a new text file with [UHC] appended to its name in order to distinguish it. (This is just my preference.)
Next, I select "UHC" encoding instead of "UTF-8 (Default)". I then hit "Save".
One of my subtitle files, which is all Korean and ASCII characters, had one corrupted byte or illegal whitespace character somewhere, and Mousepad refused to re-save the file in the new encoding. Mousepad just said "blah, blah ... invalid byte ... blah, blah" in a warning message pop-up and aborted the saving process. I do not want to comb through 5,000 lines to find the one line with one bad character, or re-type everything because some illegal whitespace or corrupted byte appeared. (FFmpeg's text extractor does this.) I have no idea where to look or what to fix.
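For the "corrupted byte" case, where the file on disk is not valid UTF-8 to begin with, something like the following Python sketch could at least report where to look. The function name `find_invalid_utf8` is made up for illustration, and this is only one possible approach.

```python
from typing import List


def find_invalid_utf8(data: bytes) -> List[int]:
    """Return the byte offset of each sequence that is not valid UTF-8."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice data[pos:].
            offsets.append(pos + exc.start)
            pos += exc.end  # skip the offending bytes and keep scanning
    return offsets
```

Each offset can be turned into a line number by counting the newline bytes before it, for example `data.count(b"\n", 0, offset) + 1`.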
Please allow the user to do as KWrite and Kate do, which is to show a warning pop-up message that says "This document has one or more bytes that are invalid bytes in the new chosen encoding. Do you want to save anyway in the new encoding format even if some characters are rejected?"
For what it is worth, the latest versions of GNOME's text editors (gedit and GNOME Text Editor) have the same annoying rule of refusing to save files that are not 100% perfect text files.