When a text file is not in UTF-8 format and includes non-ASCII characters, File::loadFileAsString() returns a heap of garbled text. Is there an easy JUCE way to convert it to UTF-8?
Thanks!
Depends what the format is.
For typical UTF16, 32 and codepage 1252, String::createStringFromData should be able to detect it, but if you're using some crazy locale codepage then you'd need to do the conversion yourself.
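For context, detection of Unicode files usually starts by checking the first few bytes for a byte order mark (BOM). The sketch below is plain C++ with a hypothetical helper named detectBom. It is not JUCE's actual implementation, only an illustration of why a file saved in a legacy code page, which carries no BOM, cannot be identified this way.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of JUCE): names the encoding implied by a
// leading byte order mark, or returns "unknown" when there is none.
std::string detectBom (const unsigned char* data, std::size_t size)
{
    // UTF-32 must be tested before UTF-16, because the UTF-32 LE mark
    // (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return "UTF-32LE";
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return "UTF-32BE";
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";

    return "unknown"; // no BOM: could be UTF-8 without a mark, or any code page
}
```

A file saved by Notepad as "ANSI" starts directly with the text bytes, so a check like this reports "unknown" and the caller has to assume or guess an encoding.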
Thanks for your quick reply, Jules. But I can't understand what the "crazy locale codepage" you mentioned is, or how to make it calm down.
On Windows, using Notepad, VS or any other editor, type some text (including non-ASCII chars) and save it as a default text file without changing anything, then read it with juce::File::loadFileAsString(): the result is garbled text...
I will try createStringFromData(); I hope this method will work.
No luck. I just get a heap of garbled text again...
MemoryBlock mb;
aTextFile.loadFileAsData (mb);
DBGX (String::createStringFromData (mb.getData(), (int) mb.getSize()));
So, back to the question above: how do I do the conversion myself?
Well, maybe start by working out what the format actually is. If it's some kind of esoteric far-eastern code-page then detecting it is beyond the scope of anything we have in juce.
The format is a text file, and the editors all show "system default", Jules. But I think it should be codepage 936 (for Chinese).
detecting it is beyond the scope of anything we have in juce.
I have no idea how to do this, or what I should do after that... Could someone give me a clue?
Can't juce do it yet? I think File::loadFileAsString() should be more compatible and intelligent...
TBH decent editors nowadays should default to unicode, the old codepage crap is a throwback to the 1980s.
Detecting and supporting all codepages isn't something it's worth us doing - there are already big libraries out there to do it, but it's also becoming increasingly unnecessary.
???
I have a few editors on my Windows machine. They are all the newest versions, including VS 2015/2017, Word 2016, WPS 2016, EditPlus 4.2... Jules, perhaps none of them is decent enough, then? :)) When saving a file as a text file, it will be in "system default" format by default. I think this has nothing to do with decent or crap.
I hope I'll find a big lib to do it. That sentence is the most helpful info in this topic. Succinct and efficient...
Thanks anyway...
Sorry, but JUCE isn't attempting to be a unicode library! For us to add support for hundreds of foreign-language ANSI code-pages would involve us embedding megabytes of character-mapping tables in our source code, which would be ridiculous!
It's not our fault if someone else's software makes a bad choice and uses a locale-specific encoding rather than a universally recognised one, but if you understand the problem you'll see that it's really unreasonable to expect us to deal with that. There are lots of encoding libraries out there where people have tackled this exact problem, so you'd need to look at those.
???
You're too emotional, Jules... I know you have been under a lot of stress for a long time, and JUCE 5 will be released soon.
Can or can't, will or won't. To a software engineer, a simple answer is enough.
Let me say a little more. Recently I've been developing an open source application that depends heavily on JUCE. The first alpha version was released less than 50 days ago. Beyond my expectations, a lot of people like it very much; it has more than 10 thousand users at present.
This app has a function: importing external text files (it reads their content and puts it into a new file created by the app). Many people tell me that after import, it's totally garbled... So I thought JUCE could (should) solve it, or that there was something important I hadn't found yet. After all, JUCE is such a big lib... But, as always, you use so many superfluous words; maybe you like to show some superiority and arrogance whenever there is a chance. However, I think that's very harmful to your great work and your future career...
Back to the problem: I can save a text file in UTF-8 format, but how can I force users to do that every time??
Sorry if I'm wrong.
If you don't know the encoding of your file, then there's no way to reliably decode it. See also Bush hid the facts.
On Windows text editors usually save text files in the local "ANSI" code page. The easiest way to handle that is to use MultiByteToWideChar to convert it to UTF-16. Use CP_ACP as the code page.
@jules, is it possible on Windows to have String::createStringFromData try to load the data using the local code page instead of latin-1? Because that will be the encoding for practically all text files you'll encounter.
May be possible, but would probably require some platform-specific code to do the conversion on each OS. Not a 5-minute job, unfortunately.
Hm I know. But it may be worth figuring out. I think on many Linux systems everything is UTF-8. Not sure about OS X. But on Windows, if your program has to read text files but it can't read ANSI encoded files, it basically won't work for most of your users.
Roeland, thank you very much!!
It's the second time (maybe the third or more) you've helped me with this, thank you!
My problem is solved now (not perfectly, though). Here is the code; I hope it will help or inspire someone who is confused by the same problem:
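// Needs <vector> and, on Windows, <windows.h> for MultiByteToWideChar.
String convertAnsiString (const File& ansiTextFile)
{
    MemoryBlock mb;
    ansiTextFile.loadFileAsData (mb);

   #if JUCE_WINDOWS
    // Pass the real byte count: the loaded data is not null-terminated,
    // so a length of -1 would read past the end of the buffer.
    const char* chars = static_cast<const char*> (mb.getData());
    const int numBytes = (int) mb.getSize();
    const int numWide = MultiByteToWideChar (CP_ACP, 0, chars, numBytes, nullptr, 0);

    if (numWide <= 0)
        return {};   // empty file or conversion failure

    std::vector<wchar_t> wide ((size_t) numWide + 1, 0);   // +1 keeps a terminating null
    MultiByteToWideChar (CP_ACP, 0, chars, numBytes, wide.data(), numWide);
    return String (CharPointer_UTF16 (wide.data()));
   #else
    return mb.toString();   // treats the data as UTF-8 on other platforms
   #endif
}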
What I would recommend is:
Pass MB_ERR_INVALID_CHARS and check the return value of MultiByteToWideChar.
(I don't know what the equivalent of (3) on Mac OS or Linux is though, or what the most common encodings on those platforms are, but I think they mostly switched to UTF-8.)
If the above fails or produces garbage, you have a problem. There's no way to reliably guess the encoding. You can allow the user to try different encodings, in case they know which one it is.
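One portable way to apply the "decode strictly and check for failure" idea is to test whether the bytes are well-formed UTF-8 before falling back to the local code page. The sketch below is plain C++ with a hypothetical helper named looksLikeUtf8. It is not a JUCE or Windows API, and it is only a heuristic: a short file in a legacy code page can occasionally pass by coincidence.

```cpp
#include <cstddef>

// Hypothetical helper (not a JUCE or Windows API): returns true only if the
// buffer is well-formed UTF-8, rejecting stray continuation bytes, truncated
// sequences, overlong forms, surrogates and code points above U+10FFFF.
bool looksLikeUtf8 (const unsigned char* data, std::size_t size)
{
    std::size_t i = 0;

    while (i < size)
    {
        const unsigned char lead = data[i];
        std::size_t extra = 0;       // continuation bytes expected after the lead byte
        unsigned long minValue = 0;  // smallest code point allowed for this length
        unsigned long cp = 0;        // decoded code point

        if (lead < 0x80)                { ++i; continue; }   // plain ASCII
        else if ((lead & 0xE0) == 0xC0) { extra = 1; minValue = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; minValue = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; minValue = 0x10000; cp = lead & 0x07; }
        else return false;                                   // invalid lead byte

        if (size - i <= extra)
            return false;                                    // sequence is cut off

        for (std::size_t k = 1; k <= extra; ++k)
        {
            if ((data[i + k] & 0xC0) != 0x80)
                return false;                                // not a continuation byte

            cp = (cp << 6) | (data[i + k] & 0x3Fu);
        }

        if (cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                                    // overlong, out of range, or surrogate

        i += extra + 1;
    }

    return true;
}
```

If the check passes, the data can be treated as UTF-8; if it fails, the file is more likely in the local code page, and offering the user a choice of encodings remains the only dependable fallback.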
Thanks, Roeland.
At present, my implementation is very ugly. The menu item "Import text file(s)" has 2 sub-items:
But it works.