CharPointer::sizeInBytes feature request


#1

Can you add a “const bool withZeroTerminator = true” parameter?
It’s just a small annoyance to see code like this (it just feels wrong, as the usual C-string strlen doesn’t count the trailing zero):

  strncat(buffer, (const char*)cp, cp.sizeInBytes() - 1); 
  // Equivalent C code would have read
  strncat(buffer, cp, strlen(cp));

  // It would be nicer to have
  strncat(buffer, (const char*)cp, cp.sizeInBytes(false)); 

That won’t break existing code.
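For illustration, a minimal sketch of what the requested signature could look like. The class below is a hypothetical stand-in, not JUCE's actual CharPointer_UTF8; only the defaulted parameter is the point here.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a non-owning UTF-8 pointer wrapper.
// This is NOT JUCE's CharPointer_UTF8; it only illustrates the request.
class Utf8PointerSketch
{
public:
    explicit Utf8PointerSketch (const char* text) noexcept : data (text) {}

    // Conversion to the raw pointer, so C library functions can take it.
    operator const char*() const noexcept { return data; }

    // Proposed signature: the default keeps the existing behaviour
    // (terminator included), so existing callers are unaffected.
    std::size_t sizeInBytes (bool withZeroTerminator = true) const noexcept
    {
        const std::size_t payload = std::strlen (data);   // O(n) scan
        return withZeroTerminator ? payload + 1 : payload;
    }

private:
    const char* data;
};
```

With this sketch, sizeInBytes() on "abc" gives 4 and sizeInBytes (false) gives 3, which is the value strlen would return.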


#2

Well, I think you’re probably misusing the method slightly. It’s not designed to be a strlen replacement.

If you’re definitely working with 8-bit strings, and are mixing them with C library string functions, then you probably shouldn’t use this at all - just get the raw const char* and call strlen or whatever you need directly on it. And since the CharPointer_UTF8 class provides an implicit cast, you can just write:

…with no need to worry about stuff like this.

The sizeInBytes method is designed for situations where you may be using a templated encoding class, and in those situations the size of the trailing zero is not always 1 byte, so it’s important for the method to include it.
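To make the point about the terminator's size concrete, here is a small generic sketch. The function name is hypothetical and this is not how JUCE implements it; it only shows that the trailing zero occupies one code unit, whose size depends on the encoding.

```cpp
#include <cstddef>

// Generic sketch: CharType is char for UTF-8, char16_t for UTF-16 and
// char32_t for UTF-32. The terminator is one code unit, so its size in
// bytes is sizeof (CharType), which is not always 1.
template <typename CharType>
std::size_t sizeInBytesIncludingTerminator (const CharType* text) noexcept
{
    std::size_t units = 0;

    while (text[units] != CharType())   // scan for the zero code unit
        ++units;

    return (units + 1) * sizeof (CharType);
}
```

On typical platforms this gives 4 bytes for "abc", 8 for u"abc" and 16 for U"abc", so subtracting 1 to drop the terminator is only correct for the 8-bit case.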


#3

Sure, it’s just that I’m fixing the case where String.length() was used instead of the correct sizeInBytes().
The example above is just purely random, to show the “mental twist” I have to do while reading the code.
BTW, using strlen() on a String class seems wrong to me. One of the major benefits of a string class is that it keeps track of the storage length.
Having to recompute it for each operation adds up and ends up being a performance loss, not to mention the ugly strlen() calls everywhere.

Yes, and in that case, having a parameter to specify if we want the last trailing zero counted or not would be better.

The main code that’s dealing with string buffers is crypto code (SHA/AES), and I’m not storing the trailing zero for obvious reasons. I guess I’m safe with sizeInBytes() - 1 as long as I’m using UTF8, but again it’s mentally disturbing.


#4

These are definitely not string classes! The CharPointer_xyz classes are nothing more than wrappers around a pointer. They don’t own or manage the pointer in any way, they just help you to parse its content. And they definitely don’t keep track of its length!

It feels disturbing because it’s not the right thing to be doing! The class has no benefits over strlen in this context, so you should just use strlen.


#5

Ok, then I’m back on point 0.
I guess that somewhere in the String machinery you store the actual size of the string buffer. How can I access it?
I used to use String::length(), but it’s wrong (unless I move to the 7-bit world, like the US).
I tried to use String::toUTF8().sizeInBytes(), but if it behaves like strlen, that is, scanning the buffer on each call, that’s bad.

So how do I do it?


#6

No, even the String class doesn’t currently store a length (other than knowing the total buffer space it has allocated).

I’ve considered adding it, of course, but it’s actually a far more complex and hazy problem than you’d think at first glance. Although there are some situations where it’d help, there are downsides in terms of the extra storage overhead and the work involved in maintaining its value correctly, which mean that it wouldn’t necessarily produce better overall performance. It may do, but I certainly wouldn’t bet on it, and there are certainly some situations where it’d be slower. There’s also the problem of whether you’d want to store the length in characters, or the length in bytes, or both…

So anyway, relax and just call strlen! Why the fuss, anyway, is it causing you some kind of performance issue, or are you just being paranoid?


#7

I’m surely paranoid.
I’m doing a lot of string operations (actually parsing/generating HTML).

I haven’t measured the String class speed, but I have measured the HTML parser speed. All the O(n) length computations performed for each operation add up to the HTML parser being 4x slower when using C-string functions than when using my read-only string class. As you said, it’s specific to my code and my use case.
For a rough idea, and a rough benchmark, one could expect 21x faster string concatenation with a string class storing length (see http://bstring.sourceforge.net/features.html ).
I don’t see such an improvement on my side, but clearly I do have 4x faster code with string length tracking.
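The cost being described can be sketched as follows. Both names are hypothetical, and the caller is assumed to provide a buffer with enough capacity. The first routine has to rescan the destination on every append, as strcat does, while the second remembers where the text ends.

```cpp
#include <cstddef>
#include <cstring>

// Appends by rescanning dest for its terminator each time, as strcat does.
// Returns the number of bytes inspected to find the end of dest.
std::size_t appendByRescanning (char* dest, const char* piece)
{
    const std::size_t scanned = std::strlen (dest);   // O(current length)
    std::strcpy (dest + scanned, piece);
    return scanned;
}

// Hypothetical minimal write buffer that tracks its own length.
// The caller owns the storage and guarantees it is large enough.
struct TrackedBuffer
{
    char* data;
    std::size_t length;   // bytes used, excluding the terminator

    void append (const char* piece, std::size_t pieceLength)
    {
        std::memcpy (data + length, piece, pieceLength);   // no rescan
        length += pieceLength;
        data[length] = '\0';
    }
};
```

Appending "<p>" ten times with the first routine inspects 0 + 3 + 6 + ... + 27 = 135 bytes just to find the end, and that total grows quadratically with the number of appends. The tracked version does no scanning at all. How much that matters in practice depends on the workload, so it is worth measuring.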

Anyway, it was just an annoyance to my eyes and I don’t want to enter a philosophical discussion about string classes, as no class fits them all.
I just conclude that I should use strlen instead of sizeInBytes() - 1.


#8

Sounds to me like the HTML parser should be re-factored into a class template with the option of the type of string to use for storage passed as a template parameter (with a default value of juce::String).


#9

Well, this approach only works if all your string classes have the same interface (or at least a common interface).
The class I’m using is called “VerySimpleReadOnlyString” for the simple reason that it supports only the strict minimum of features needed for parsing, but the few features it does support are optimized like crazy: a strlen-like function that’s 2x faster than the C library version because it searches for zeros one machine word at a time instead of byte by byte, Turbo Boyer-Moore with a pre-computed jump table for searching, and switch-table-based recognition of HTML tag and attribute names.
All in all, it’s very efficient.
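The word-at-a-time zero search mentioned above can be sketched roughly like this. It is not the poster's actual code, and the function name is made up. One assumption matters: the aligned word loads may read a few bytes past the terminator, so the sketch assumes the buffer is padded up to a word boundary. Whether it beats a given C library's strlen depends on the platform, since many implementations already use this trick or SIMD.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch of a word-at-a-time strlen. Assumes the buffer is readable up to
// the next word boundary after the terminator (e.g. a padded buffer).
std::size_t wordAtATimeLength (const char* text) noexcept
{
    const char* p = text;

    // Advance byte by byte until p is aligned to the word size.
    while (reinterpret_cast<std::uintptr_t> (p) % sizeof (std::uintptr_t) != 0)
    {
        if (*p == '\0')
            return static_cast<std::size_t> (p - text);
        ++p;
    }

    const std::uintptr_t ones  = ~std::uintptr_t (0) / 0xFF;   // 0x0101...01
    const std::uintptr_t highs = ones * 0x80;                   // 0x8080...80

    // Test one whole word per iteration: the expression below is non-zero
    // exactly when at least one byte of the word is zero.
    for (;;)
    {
        std::uintptr_t word;
        std::memcpy (&word, p, sizeof (word));

        if (((word - ones) & ~word & highs) != 0)
            break;

        p += sizeof (word);
    }

    // Locate the exact zero byte inside the word that contains it.
    while (*p != '\0')
        ++p;

    return static_cast<std::size_t> (p - text);
}
```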

I’ve benchmarked it against a C-library-based implementation (you know, with “strlen”, “strstr” and “if(strcmp(”), and it’s between 4x and 10x faster for very large HTML documents (500kB).

But this has no real link with the original idea, which was “I don’t like those ugly (sizeInBytes() - 1) in my code”. I’m currently working in the other direction, that is, writing an HTML document from a DOM tree very, very fast.
I’ve done the work with Juce String in a few hours (thanks Jules for this class), but now that I know more about the String internals I could try with a VerySimpleWriteOnlyString class to see if it’s faster.


#10

You can always make a simple wrapper class to expose the needed interface on an existing string implementation.
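As a rough sketch of that idea, assume the parser only needs data() and size() from its string type. All names below are hypothetical, and std::string stands in for juce::String to keep the example free of dependencies (a juce::String adapter would need its own wrapper).

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical adapter exposing data()/size() over a raw C string.
// The length is computed once at construction and stored.
class CStringView
{
public:
    explicit CStringView (const char* text) noexcept
        : start (text), length (std::strlen (text)) {}

    const char* data() const noexcept { return start; }
    std::size_t size() const noexcept { return length; }

private:
    const char* start;
    std::size_t length;
};

// Parser-like routine templated on the string type. It works with any
// type providing data() and size(), e.g. std::string or the adapter above.
template <typename StringType>
std::size_t countOpeningTags (const StringType& html)
{
    const char* p = html.data();
    const std::size_t n = html.size();
    std::size_t count = 0;

    for (std::size_t i = 0; i + 1 < n; ++i)
        if (p[i] == '<' && p[i + 1] != '/')   // '<' not starting a closing tag
            ++count;

    return count;
}
```

Calling countOpeningTags with either std::string ("<p><b>x</b></p>") or CStringView ("<p><b>x</b></p>") returns 2, so the parsing code stays the same whichever storage type sits underneath.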