How to convert a non-UTF-8 text file/String to UTF-8 format?


#1

When a text file isn’t in UTF-8 format and contains non-ASCII characters, File::loadFileAsString() returns a heap of garbled text. Is there an easy JUCE way to convert it to UTF-8?

Thanks!


#2

Depends what the format is.

For typical UTF-16, UTF-32 and codepage 1252, String::createStringFromData should be able to detect it, but if you’re using some crazy locale codepage then you’d need to do the conversion yourself.
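For reference, the detection it can do is essentially byte-order-mark sniffing: the Unicode BOMs are fixed byte sequences. A minimal portable sketch of that kind of check (just an illustration, not JUCE’s actual implementation):

```cpp
#include <cstddef>
#include <string>

// Identify an encoding from a leading byte-order mark, if any.
// UTF-32LE must be tested before UTF-16LE, because a UTF-32LE BOM
// begins with the same two bytes (FF FE) as a UTF-16LE BOM.
std::string detectBom (const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";

    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE
                  && data[2] == 0x00 && data[3] == 0x00)
        return "UTF-32LE";

    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00
                  && data[2] == 0xFE && data[3] == 0xFF)
        return "UTF-32BE";

    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";

    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";

    // No BOM: could be UTF-8 without a BOM, or some legacy code page.
    return "unknown";
}
```

A file with no BOM is where the guesswork starts - that’s the “do the conversion yourself” case.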


#3

Thanks for your quick reply, Jules. But I don’t understand what the ‘crazy locale codepage’ you mentioned is, or how to make it calm down.

On Windows, using Notepad, VS and all the other editors, if I input some text (including non-ASCII chars) and save it as a plain text file with the default settings, then read it with juce::File::loadFileAsString(), the result is garbled…

I’ll try createStringFromData(); hopefully that method will sort it out.


#4

No luck. I just get a heap of garbled text again…

MemoryBlock mb;
aTextFile.loadFileAsData (mb);

DBG (String::createStringFromData (mb.getData(), (int) mb.getSize()));

So, back to the above: how do I do the conversion myself?


#5

Well, maybe start by working out what the format actually is. If it’s some kind of esoteric far-eastern code-page then detecting it is beyond the scope of anything we have in juce.


#6

The editors all show the format as ‘system default’, Jules, but I think it must be codepage 936 (for Simplified Chinese).

detecting it is beyond the scope of anything we have in juce.

I have no idea how to do that, or what I should do afterwards… Could someone give me a clue?

Can’t JUCE do it yet? I think File::loadFileAsString() should be more compatible and intelligent…


#7

TBH, decent editors nowadays should default to Unicode; the old codepage stuff is a throwback to the 1980s.

Detecting and supporting all codepages isn’t something that’s worth us doing - there are already big libraries out there that do it, and it’s also becoming increasingly unnecessary.


#8

???

I have a few editors on my Windows machine, all the newest versions, including VS 2015/2017, Word 2016, WPS 2016 and EditPlus 4.2… Jules. Perhaps none of them is decent enough, then…? :)) When saving a file as a text file, the format defaults to ‘system default’. I don’t think this has anything to do with decent or crap.

Hope I’ll find a big lib to do it. That sentence is the most helpful info in this topic :) Succinct and efficient… :)

Thanks anyway…


#9

Sorry, but JUCE isn’t attempting to be a unicode library! For us to add support for hundreds of foreign-language ANSI code-pages would involve us embedding megabytes of character-mapping tables in our source code, which would be ridiculous!

It’s not our fault if someone else’s software makes a bad choice and uses a locale-specific encoding rather than a universally recognised one, but if you understand the problem you’ll see that it’s really unreasonable to expect us to deal with that. There are lots of encoding libraries out there where people have tackled this exact problem, so you’d need to look at those.


#10

???

You’re too emotional, Jules… I know you’ve been under a lot of stress for a long time, and JUCE 5 will be released soon.

Can or can’t, will or won’t. For a software engineer, a simple answer is enough.

Let me say a little more. Recently I’ve been developing an open-source app that depends heavily on JUCE. The first alpha version was released less than 50 days ago. Beyond my expectations, a lot of people like it very much; it already has more than 10,000 users.

This app has a feature for importing external text files (reading their content into a new file created by the app). Many people tell me that after importing, the text is completely garbled… So I thought JUCE could (or should) solve this, or that there was something important I hadn’t found yet. After all, JUCE is such a big library… But, as always, you use so many superfluous words; maybe you like to show some superiority whenever there’s a chance. However, I think that’s very harmful to your great work and your future career…

Back to the problem: I can save a text file in UTF-8 format myself, but how can I force users to do that every time??

Sorry if I’m wrong.


#11

If you don’t know the encoding of your file, then there’s no way to reliably decode it. See also Bush hid the facts.
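To see why guessing is unreliable: under a single-byte encoding like Latin-1, every possible byte sequence decodes “successfully”, so a wrong guess produces plausible-looking garbage rather than an error. A tiny self-contained demonstration (decodeAsLatin1 is a made-up helper name for illustration):

```cpp
#include <string>

// Interpret raw bytes as ISO-8859-1 (Latin-1): each byte maps directly
// to the Unicode code point of the same value, so decoding can never
// fail -- which is exactly why a mis-detected encoding goes unnoticed.
std::u32string decodeAsLatin1 (const std::string& bytes)
{
    std::u32string out;

    for (unsigned char b : bytes)
        out.push_back (static_cast<char32_t> (b));

    return out;
}
```

For example, the two bytes C3 A9 are “é” in UTF-8, but Latin-1 happily decodes them as “Ã©” (U+00C3 followed by U+00A9) - the classic mojibake pattern.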

On Windows, text editors usually save text files in the local “ANSI” code page. The easiest way to handle that is to use MultiByteToWideChar to convert the text to UTF-16, using CP_ACP as the code page.

@jules, is it possible on Windows to have String::createStringFromData try to load the data using the local code page instead of Latin-1? That will be the encoding of practically all the text files you’ll encounter.


#12

It may be possible, but it would probably require some platform-specific code to do the conversion on each OS. Not a 5-minute job, unfortunately.


#13

Hm, I know. But it may be worth figuring out. I think on many Linux systems everything is UTF-8; I’m not sure about OS X. But on Windows, if your program has to read text files and can’t read ANSI-encoded files, it basically won’t work for most of your users.


#14

Roeland, thank you very much!!
It’s the second time (maybe the third or more) that you’ve helped me with this, thank you!

My problem is solved now (though not perfectly). Here’s the code; hope it helps or inspires someone who’s confused by the same problem:

const String convertAnsiString (const File& ansiTextFile)
{
    MemoryBlock mb;
    ansiTextFile.loadFileAsData (mb);

#if JUCE_WINDOWS
    // Pass the size explicitly rather than -1, because the memory
    // block isn't guaranteed to be null-terminated.
    const char* chars = static_cast<const char*> (mb.getData());
    const int numWideChars = MultiByteToWideChar (CP_ACP, 0, chars, (int) mb.getSize(), nullptr, 0);

    HeapBlock<wchar_t> wideChars ((size_t) numWideChars);
    MultiByteToWideChar (CP_ACP, 0, chars, (int) mb.getSize(), wideChars, numWideChars);

    // The result isn't null-terminated either, so pass the length too.
    return String (CharPointer_UTF16 (wideChars.getData()), (size_t) numWideChars);

#else
    return mb.toString();

#endif
}


#15

What I would recommend is:

  1. Check if there is a UTF-16 byte order mark. If so, use UTF-16.
  2. Try UTF-8.
  3. Try the current ANSI code page. If you want, you can detect failure by passing in the flag MB_ERR_INVALID_CHARS and checking the return value of MultiByteToWideChar.

(I don’t know what the equivalent of (3) is on Mac OS or Linux, or what the most common encodings on those platforms are, but I think they’ve mostly switched to UTF-8.)

If the above fails or produces garbage, you have a problem. There’s no way to reliably guess the encoding. You can allow the user to try different encodings, in case they know which one it is.
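Steps (1) and (2) above need no platform APIs, so they can be sketched portably; only the step-(3) fallback is OS-specific. A rough illustration (the UTF-8 check is simplified: it validates the lead/continuation byte structure, but doesn’t reject overlong sequences or surrogate code points):

```cpp
#include <cstddef>
#include <string>

// Simplified UTF-8 validity check: verifies that every lead byte is
// legal and is followed by the right number of continuation bytes.
bool looksLikeUtf8 (const unsigned char* data, std::size_t size)
{
    std::size_t i = 0;

    while (i < size)
    {
        const unsigned char b = data[i];
        std::size_t trailing;

        if      (b < 0x80)           trailing = 0;  // ASCII
        else if ((b & 0xE0) == 0xC0) trailing = 1;  // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) trailing = 2;  // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) trailing = 3;  // 4-byte sequence
        else return false;                          // invalid lead byte

        if (i + 1 + trailing > size)
            return false;                           // truncated sequence

        for (std::size_t j = 1; j <= trailing; ++j)
            if ((data[i + j] & 0xC0) != 0x80)
                return false;                       // bad continuation byte

        i += 1 + trailing;
    }

    return true;
}

// Apply the three-step cascade and return a label naming the step that matched.
std::string guessEncoding (const unsigned char* data, std::size_t size)
{
    // (1) UTF-16 byte-order mark?
    if (size >= 2 && ((data[0] == 0xFF && data[1] == 0xFE)
                   || (data[0] == 0xFE && data[1] == 0xFF)))
        return "UTF-16";

    // (2) Valid as UTF-8? (Plain ASCII also lands here, which is harmless.)
    if (looksLikeUtf8 (data, size))
        return "UTF-8";

    // (3) Fall back to the local ANSI code page.
    return "ANSI";
}
```

When guessEncoding returns "ANSI", that’s the point where you’d hand the bytes to MultiByteToWideChar with CP_ACP and MB_ERR_INVALID_CHARS on Windows, as described above.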


#16

Thanks, Roeland.

At present my implementation is very ugly: the menu item ‘Import text file(s)’ has 2 sub-items:

  • import UTF-8
  • import ANSI

but it works :)