== Unix and non-ASCII file names, a summary of issues

Unix/Linux file and directory names are binary byte C strings. Only the
null byte and the slash character (/) are forbidden inside a name;
nowhere does the kernel interpret the strings as meaningful or
printable.

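As a rough illustration of the rule above, the following Python sketch checks
whether a byte string could serve as a single file name component. The
function name +is_valid_name_component+ is made up for this example, and the
check only covers the two forbidden bytes (plus the empty name, which is an
addition of this sketch); it says nothing about length limits or restrictions
specific to a given file system.

```python
def is_valid_name_component(name: bytes) -> bool:
    """Return True if `name` could be one Unix file name component.

    Hypothetical helper: only rejects the empty name, the null byte
    and the slash. No encoding is assumed or checked.
    """
    return len(name) > 0 and b"\0" not in name and b"/" not in name


# The same visible name ("\u00e9t\u00e9", French for "summer") in two encodings.
latin1_name = "\u00e9t\u00e9".encode("iso-8859-1")  # b'\xe9t\xe9'
utf8_name = "\u00e9t\u00e9".encode("utf-8")         # b'\xc3\xa9t\xc3\xa9'

for candidate in (latin1_name, utf8_name, b"\xff\xfe\x01", b"a/b", b"a\0b"):
    print(candidate, is_valid_name_component(candidate))
```

Both encodings of the name pass, and so does a sequence of arbitrary bytes:
as far as this check is concerned, the names are just different byte strings.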

In the old days, all utilities that displayed file names to the user were
ASCII-based, and people used pure printable ASCII file names (even
using space characters inside names was a cause for trouble).
Non-alphanumeric characters were used exclusively for playing tricks on
colleagues. And all was well.

Then the devil came under the guise of accented 8-bit characters. The
system has no problem with them: file names are still binary C strings. But
the utilities have to display them or take them as input and, because
there is no encoding specification stored with the file names, they can
only do this according to the character encoding taken from the user's
current locale.

For example, fr_FR.UTF-8 and fr_FR.ISO8859-1 could be used simultaneously
on the same system (by different users), but they are completely
incompatible: ISO-8859-1 strings with accented characters are illegal when
viewed in a UTF-8 locale (they will display as question marks or some other
conventional error marker), and UTF-8 strings will display as gibberish in
an ISO-8859-1 locale.

This means that the file names created by a UTF-8 user are displayed as
garbage to the ISO-8859 one...

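The mismatch can be reproduced without touching the file system. The Python
sketch below encodes the same name in both encodings and then decodes each
byte string the "wrong" way. It is only an approximation of what a
locale-aware utility does: the use of the replacement character for invalid
input is an assumption, real tools may show a question mark or an escape
sequence instead.

```python
name = "\u00e9t\u00e9"  # "été"

latin1_bytes = name.encode("iso-8859-1")  # b'\xe9t\xe9'
utf8_bytes = name.encode("utf-8")         # b'\xc3\xa9t\xc3\xa9'

# ISO-8859-1 bytes are not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# What a UTF-8 user might see: each bad byte becomes U+FFFD.
shown_in_utf8_locale = latin1_bytes.decode("utf-8", errors="replace")

# What an ISO-8859-1 user sees: every byte is a character, so decoding
# never fails, but each accented letter turns into two unrelated ones.
shown_in_latin1_locale = utf8_bytes.decode("iso-8859-1")

print(strict_ok)
print(repr(shown_in_utf8_locale))
print(repr(shown_in_latin1_locale))
```

The first direction fails outright (or shows error markers), while the second
one silently produces the wrong characters.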

If you ever change your locale, your old files are still there and named
the same (in the binary sense), but the names display badly and you have
great trouble inputting them. If you add distributed (NFS) file system
issues, things become totally unmanageable. Also think about archives sent
from another system with a different encoding.

As far as Recoll is concerned:

- The file names inside recoll.conf are not transcoded: they are taken as
  binary strings (mostly, only +\n+ and +space+ are a bit special), and
  passed as is to the system. So if you edit 'recoll.conf' with a text
  editor, inside the same locale that is or has been used for file names,
  you'll be fine.
- Up to version 1.12, there was a bug in the GUI configuration tool: it
  should have transcoded between the internal Qt format and locale-dependent
  strings, but did not, or did it badly.
- There is also an exception for the +unac_except_trans+ variable: this
  *has* to be UTF-8, so if the rest of the file uses another encoding,
  you'll need to edit two separate files and concatenate them.

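For the last point, one possible way to do the concatenation is sketched
below in Python. It copies the fragments byte for byte, so neither encoding
is touched. The function name and the fragment file names in the usage note
are hypothetical; plain +cat+ would do the same job.

```python
from pathlib import Path


def concat_config_parts(part_paths, out_path):
    """Concatenate configuration fragments as raw bytes, without decoding."""
    with open(out_path, "wb") as out:
        for part in part_paths:
            data = Path(part).read_bytes()
            out.write(data)
            # Make sure the next fragment starts on its own line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
```

With, say, a fragment 'paths.part' edited in the locale used for the file
names and a fragment 'unac.part' edited as UTF-8, calling
+concat_config_parts(["paths.part", "unac.part"], "recoll.conf")+ would
produce the final file.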

As of version 1.13, Recoll uses local8Bit()/fromLocal8Bit() to convert
recoll.conf file names from/to QStrings (it uses UTF-8 for all string
values which are not file names).

The Qt file dialog is broken (or at least was; I have not checked this on
recent versions). It should consider file paths as almost-binary data, not
QStrings, but doesn't. As a consequence, things are even more broken than
necessary when seen from there:

With LANG="C", non-ASCII paths can't be used at all:

- Strings read from recoll.conf are stripped of 8-bit characters before display.
- Directory entries with 8-bit characters are not displayed at all in the
  selection dialog.

With LANG="fr_FR.UTF-8", only UTF-8 paths can be used:

- Strings read from recoll.conf are damaged when converted to QString
  (except those that were actually UTF-8).
- Only the UTF-8 directory entries are displayed in the selection dialog.

With LANG="fr_FR.iso8859-1", everything works:

- Strings read from recoll.conf are displayed with weird characters if
  they use another encoding such as UTF-8, but are correctly maintained
  and can be read back from the dialogs and rewritten without damage.
- Directory entries with 8-bit characters are displayed weirdly (normal),
  but can be manipulated without trouble (this includes UTF-8 names of
  course).

In conclusion, only the ISO-8859 locales can be used for handling
mixed-encoding situations. This is a possible workaround for people who need it.

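The three cases above can be mimicked with a small Python experiment. It is
only a model: it assumes that converting a path to a text string and back
behaves like a lossy decode followed by an encode, which is roughly what the
observations suggest but is not taken from the Qt sources. The function name
+survives_text_conversion+ is invented for the example.

```python
def survives_text_conversion(raw: bytes, encoding: str) -> bool:
    """True if `raw` is unchanged after a bytes -> text -> bytes round trip."""
    text = raw.decode(encoding, errors="replace")
    return text.encode(encoding, errors="replace") == raw


samples = {
    "ascii name": b"notes.txt",
    "latin1 name": "\u00e9t\u00e9".encode("iso-8859-1"),
    "utf8 name": "\u00e9t\u00e9".encode("utf-8"),
}

# One row per locale encoding: "ascii" stands in for LANG="C".
table = {
    encoding: {
        label: survives_text_conversion(raw, encoding)
        for label, raw in samples.items()
    }
    for encoding in ("ascii", "utf-8", "iso-8859-1")
}

for encoding, row in table.items():
    print(encoding, row)
```

Under these assumptions, only the ISO-8859-1 row keeps all three samples
intact, because every byte value maps to some character in that encoding.
The names may look wrong on screen, but no information is lost.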

More data about path encoding issues:
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html