Bug while enumerating unicode file names


#1

Hi, we’re using JUCE 1.45 and have run into the following problem:

We’re using File::findChildFiles() to create a list of files that we want to process. The problem is that when the code encounters a file name containing, for instance, an ‘ä’, HFS+ returns it fully decomposed (NFD), i.e. as LATIN SMALL LETTER A + COMBINING DIAERESIS. The code then converts the name to fully composed (NFC) UTF-8. The following stat on line 738 of juce_mac_Files.cpp then fails.
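To make the mismatch concrete, here is a minimal sketch of the two encodings of ‘ä’ as raw UTF-8 bytes. The function names are made up for illustration and are not part of JUCE; the point is only that the two forms look identical on screen but are different byte strings, so any call that compares names byte for byte treats them as different files.

```cpp
#include <string>

// "ä" as the single precomposed code point U+00E4 (NFC).
// UTF-8 bytes: C3 A4.
inline std::string composedAUmlaut()
{
    return "\xC3\xA4";
}

// "ä" as 'a' followed by U+0308 COMBINING DIAERESIS (NFD).
// UTF-8 bytes: 61 CC 88.
inline std::string decomposedAUmlaut()
{
    return "a\xCC\x88";
}

// A plain byte-wise comparison, with no Unicode normalization applied.
inline bool sameBytes (const std::string& a, const std::string& b)
{
    return a == b;
}
```

With these definitions, sameBytes (composedAUmlaut(), decomposedAUmlaut()) is false, and the NFD form is one byte longer than the NFC form.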

Is JUCE standardized on handling only NFC UTF-8 or is there any other reason for converting to NFC? I’m thinking of removing the conversion but I want to know what I’m doing :slight_smile:


#2

Well, I’ve used NFC all the way through because that’s what the filesystem seemed to require. Which stat did you say is failing? The one at line 745? That’s very odd…


#3

Yeah, it’s the one at line 745.

No, HFS+ does indeed give you fully decomposed UTF-8.

I was thinking of maybe adding code that fully decomposes the fullPath attribute of class File when paths are stored in it (using a platform-specific function)? I read that the default file name encoding for Win32 is NFC, so maybe an API name like juce_toFilenameEncoding or something?
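A rough sketch of what I have in mind, just to show the shape of it. The name toFilenameEncoding and the lookup table are hypothetical, and the table is a toy that only knows the six letters åäöÅÄÖ. A real Mac implementation would presumably call a proper normalization routine (CoreFoundation has CFStringNormalize with kCFStringNormalizationFormD, for example) rather than a hand-written table.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: decompose (NFC -> NFD) the UTF-8 letters listed in
// the toy table below and copy every other byte through unchanged.
// This is NOT a general Unicode normalizer.
inline std::string toFilenameEncoding (const std::string& utf8)
{
    // { precomposed UTF-8, base letter + combining mark in UTF-8 }
    static const std::vector<std::pair<std::string, std::string>> table = {
        { "\xC3\xA5", "a\xCC\x8A" },  // å -> a + U+030A COMBINING RING ABOVE
        { "\xC3\xA4", "a\xCC\x88" },  // ä -> a + U+0308 COMBINING DIAERESIS
        { "\xC3\xB6", "o\xCC\x88" },  // ö -> o + U+0308
        { "\xC3\x85", "A\xCC\x8A" },  // Å -> A + U+030A
        { "\xC3\x84", "A\xCC\x88" },  // Ä -> A + U+0308
        { "\xC3\x96", "O\xCC\x88" }   // Ö -> O + U+0308
    };

    std::string out;
    std::size_t i = 0;

    while (i < utf8.size())
    {
        bool replaced = false;

        for (const auto& entry : table)
        {
            // Does the precomposed sequence start at position i?
            if (utf8.compare (i, entry.first.size(), entry.first) == 0)
            {
                out += entry.second;
                i += entry.first.size();
                replaced = true;
                break;
            }
        }

        if (! replaced)
            out += utf8[i++];  // plain ASCII or a character we don't handle
    }

    return out;
}
```

A pure ASCII path such as /Users/markus/music comes back unchanged, so only the components that contain accented letters are affected.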


#4

Come to think of it, I actually need the other way around as well as we’re storing some paths into a database, which does some further not-very-smart UTF-8 conversion. I think it can only accept NFC UTF-8.
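For the database direction, the same idea in reverse would look roughly like this. Again, toDatabaseEncoding is a made-up name and the table is a toy limited to åäöÅÄÖ; it only illustrates composing NFD sequences back to NFC before the path is handed to something that expects NFC.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: compose (NFD -> NFC) the sequences listed in the toy
// table below and copy every other byte through unchanged.
// This is NOT a general Unicode normalizer.
inline std::string toDatabaseEncoding (const std::string& utf8)
{
    // { base letter + combining mark in UTF-8, precomposed UTF-8 }
    static const std::vector<std::pair<std::string, std::string>> table = {
        { "a\xCC\x8A", "\xC3\xA5" },  // a + U+030A -> å
        { "a\xCC\x88", "\xC3\xA4" },  // a + U+0308 -> ä
        { "o\xCC\x88", "\xC3\xB6" },  // o + U+0308 -> ö
        { "A\xCC\x8A", "\xC3\x85" },  // A + U+030A -> Å
        { "A\xCC\x88", "\xC3\x84" },  // A + U+0308 -> Ä
        { "O\xCC\x88", "\xC3\x96" }   // O + U+0308 -> Ö
    };

    std::string out;
    std::size_t i = 0;

    while (i < utf8.size())
    {
        bool replaced = false;

        for (const auto& entry : table)
        {
            // Does the decomposed sequence start at position i?
            if (utf8.compare (i, entry.first.size(), entry.first) == 0)
            {
                out += entry.second;
                i += entry.first.size();
                replaced = true;
                break;
            }
        }

        if (! replaced)
            out += utf8[i++];  // anything not in the table is left as it is
    }

    return out;
}
```

A name that is already NFC passes through untouched, so calling this on a path of unknown form should be harmless for the letters the table covers.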


#5

This is very confusing. HFS+ might give you decomposed unicode, but I had to add a whole bunch of code to fully compose the names before the other filesystem calls would work properly with it…


#6

I agree. According to http://www.venge.net/mtn-wiki/FileSystemIssues, NFD is returned when filenames are read, and no conversion is required when creating new files (the Darwin VFS layer performs that). The stat(2) call doesn’t seem to accept NFC on my system (at least not the NFC that the conversion routine on line 738 created).


#7

Ah, maybe it’s because the parent path isn’t composed. Try changing it to:

[code]
if (fnmatch (wildCardUTF8, de->d_name, 0) == 0)
{
    result = PlatformUtilities::convertToPrecomposedUnicode (String::fromUTF8 ((const uint8*) de->d_name));

    const String path (PlatformUtilities::convertToPrecomposedUnicode (parentDir) + result);
[/code]


#8

Sorry, it was to no avail… :frowning:

The problem is that it’s a folder name, " Annan Musik åäöÅÄÖ" (English: other music åäöÅÄÖ). The preceding part of the path is /Users/markus/music, which is plain ASCII, so there is no point in precomposing that part.

I did some debugging, and stat(2) is indeed being passed precomposed UTF-8, so it seems that precomposed names will not work with stat. I don’t know about open(2) or creat(2), though…


#9

BTW, the code works if I omit the conversion to NFC :-). What other parts break???


#10

Ok, I’ll need to take a look at this. Busy right now, but will investigate soon.


#11