尽管我很喜欢C和c++,但我还是忍不住对空结尾字符串的选择抓耳挠脑:

Length prefixed (i.e. Pascal) strings existed before C Length prefixed strings make several algorithms faster by allowing constant time length lookup. Length prefixed strings make it more difficult to cause buffer overrun errors. Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here. Pretty much every other language (i.e. Perl, Pascal, Python, Java, C#, etc) use length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings. C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation. Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

其中一些东西比C语言出现得更晚,所以C语言不知道它们是有道理的。然而,在C语言出现之前,有些语言就已经很简单了。为什么会选择空终止字符串,而不是明显更好的长度前缀?

编辑:因为有些人问了关于我上面提到的效率点的事实(他们不喜欢我已经提供的事实),他们源于以下几点:

使用空结尾字符串的Concat需要O(n + m)时间复杂度。长度前缀通常只需要O(m)。 使用空结尾字符串的长度需要O(n)时间复杂度。长度前缀为O(1)。 Length和concat是迄今为止最常见的字符串操作。在一些情况下,以空结尾的字符串可能更有效,但这种情况发生的频率要低得多。

从下面的答案,这些是一些情况下,空终止字符串更有效:

When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules. In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (Because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).

上面这些词都没有length和concat常见。

下面的答案中还有一个断言:

你需要把绳子的一端剪掉

但这个是不正确的——对于以null结尾的字符串和有长度前缀的字符串,它的时间是相同的。(以Null结尾的字符串只是在你想要的新结尾的地方插入一个Null,长度前缀只是从前缀中减去。)


当前回答

假设C以Pascal的方式实现字符串,通过前缀长度:7字符长字符串与3字符字符串的数据类型相同吗?如果答案是肯定的,那么当我将前者分配给后者时,编译器应该生成什么样的代码?字符串应该被截断,还是自动调整大小?如果调整大小,该操作是否应该被锁保护以使其线程安全?不管你喜不喜欢,C语言的方法回避了所有这些问题。

其他回答

在很多方面,C语言是原始的。我很喜欢。

它比汇编语言高了一步,用一种更容易编写和维护的语言提供了几乎相同的性能。

空结束符很简单,不需要语言的特殊支持。

现在回想起来,似乎并不是那么方便。但我在80年代使用汇编语言,当时它似乎非常方便。我只是认为软件在不断地发展,平台和工具也在不断地变得越来越复杂。

我认为,这是有历史原因的,我在维基百科上找到了这个:

At the time C (and the languages that it was derived from) were developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (though also used by early versions of BASIC), used a leading byte to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) (constant) time). But one byte limits the length to 255. This length limitation was far more restrictive than the problems with the C string, so the C string in general won out.

还有一点没有提到:当C语言被设计出来的时候,有很多机器的“char”不是8位的(即使是今天的DSP平台也不是8位的)。如果一个人决定字符串是长度前缀,应该使用多少'char'的长度前缀?使用two会人为地限制具有8位字符和32位寻址空间的机器的字符串长度,而在具有16位字符和16位寻址空间的机器上浪费空间。

If one wanted to allow arbitrary-length strings to be stored efficiently, and if 'char' were always 8-bits, one could--for some expense in speed and code size--define a scheme were a string prefixed by an even number N would be N/2 bytes long, a string prefixed by an odd value N and an even value M (reading backward) could be ((N-1) + M*char_max)/2, etc. and require that any buffer which claims to offer a certain amount of space to hold a string must allow enough bytes preceding that space to handle the maximum length. The fact that 'char' isn't always 8 bits, however, would complicate such a scheme, since the number of 'char' required to hold a string's length would vary depending upon the CPU architecture.

懒惰、寄存器节俭和可移植性考虑到任何语言的汇编核心,尤其是C语言,它比汇编高出一步(因此继承了大量汇编遗留代码)。 你会同意null字符在那些ASCII的日子里是无用的,它(可能和EOF控件字符一样好)。

让我们看看伪代码

function readString(string) // 1 parameter: 1 register or 1 stact entries
    pointer=addressOf(string) 
    while(string[pointer]!=CONTROL_CHAR) do
        read(string[pointer])
        increment pointer

共使用1个寄存器

案例2

 function readString(length,string) // 2 parameters: 2 register used or 2 stack entries
     pointer=addressOf(string) 
     while(length>0) do 
         read(string[pointer])
         increment pointer
         decrement length

共使用2个寄存器

这在当时似乎是短视的,但考虑到代码和寄存器的节俭(这在当时是PREMIUM,那时你知道,他们使用穿孔卡)。因此,更快(当处理器速度可以以kHz计),这个“黑客”是相当不错的,可轻松移植到无寄存器处理器。

为了便于讨论,我将实现2个常见的字符串操作

stringLength(string)
     pointer=addressOf(string)
     while(string[pointer]!=CONTROL_CHAR) do
         increment pointer
     return pointer-addressOf(string)

复杂度O(n),在大多数情况下PASCAL字符串是O(1),因为字符串的长度是预先挂起的字符串结构(这也意味着该操作必须在更早的阶段进行)。

concatString(string1,string2)
     length1=stringLength(string1)
     length2=stringLength(string2)
     string3=allocate(string1+string2)
     pointer1=addressOf(string1)
     pointer3=addressOf(string3)
     while(string1[pointer1]!=CONTROL_CHAR) do
         string3[pointer3]=string1[pointer1]
         increment pointer3
         increment pointer1
     pointer2=addressOf(string2)
     while(string2[pointer2]!=CONTROL_CHAR) do
         string3[pointer3]=string2[pointer2]
         increment pointer3
         increment pointer1
     return string3

复杂度O(n)和预先设置字符串长度不会改变操作的复杂性,而我承认它会减少3倍的时间。

另一方面,如果你使用PASCAL字符串将不得不重新设计您的API来考虑在长度和bit-endianness注册,帕斯卡字符串的众所周知的限制255字符(0 xff)因为中存储的长度是1个字节(8位),而且你想要更长的字符串(16位- >任何)你必须考虑在一层的架构代码,这意味着在大多数情况下不相容的字符串API如果你想要更长的字符串。

例子:

One file was written with your prepended string api on an 8 bit computer and then would have to be read on say a 32 bit computer, what would the lazy program do considers that your 4bytes are the length of the string then allocate that lot of memory then attempt to read that many bytes. Another case would be PPC 32 byte string read(little endian) onto a x86 (big endian), of course if you don't know that one is written by the other there would be trouble. 1 byte length (0x00000001) would become 16777216 (0x0100000) that is 16 MB for reading a 1 byte string. Of course you would say that people should agree on one standard but even 16bit unicode got little and big endianness.

当然,C也有它的问题,但它不会受到这里提出的问题的影响。

不知怎的,我把这个问题理解为C中没有编译器支持以长度为前缀的字符串。下面的例子显示,至少你可以开始你自己的C字符串库,其中字符串长度在编译时计算,使用这样的构造:

#define PREFIX_STR(s) ((prefix_str_t){ sizeof(s)-1, (s) })

typedef struct { int n; char * p; } prefix_str_t;

int main() {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}

然而,这不会带来任何问题,因为你需要小心什么时候特别释放字符串指针,什么时候它是静态分配的(字面字符数组)。

编辑:作为对这个问题更直接的回答,我的观点是,这是C既可以支持可用的字符串长度(作为编译时间常数)的方式,如果你需要它,但如果你只想使用指针和零终止,仍然没有内存开销。

当然,使用以零结尾的字符串似乎是推荐的做法,因为标准库一般不接受字符串长度作为参数,而且提取长度的代码不像char * s = "abc"那样简单,正如我的示例所示。