As much as I love C and C++, I can't help but scratch my head at the choice of null terminated strings:

- Length prefixed (i.e. Pascal) strings existed before C.
- Length prefixed strings make several algorithms faster by allowing constant time length lookup.
- Length prefixed strings make it more difficult to cause buffer overrun errors.
- Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory, making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here.
- Pretty much every other language (e.g. Perl, Pascal, Python, Java, C#) uses length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings.
- C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation.
- Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

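Several of these things came to light more recently than C, so it makes sense that C did not know of them. However, several were plain well before C came to be. Why were null terminated strings chosen instead of the obviously superior length prefixing?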

EDIT: Since some asked for facts on my efficiency point above (and didn't like the ones I already provided), they stem from a few things:

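- Concat using null terminated strings requires O(n + m) time complexity. Length prefixing often requires only O(m).
- Length using null terminated strings requires O(n) time complexity. Length prefixing is O(1).
- Length and concat are by far the most common string operations. There are several cases where null terminated strings can be more efficient, but these occur much less often.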
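The complexity claims above can be sketched in C. This is only an illustration under stated assumptions: `lp_string`, `lp_length`, `lp_from_cstr`, `lp_append`, and `cstr_append` are hypothetical names, not part of any standard library, and the length is kept in a struct field rather than in bytes stored in front of the characters. The O(m) figure for the append holds only when the destination already has spare capacity; otherwise `realloc` may copy the existing bytes as well.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-carrying string: the length is stored, not searched for. */
typedef struct {
    size_t len;   /* bytes currently in use */
    size_t cap;   /* bytes allocated for data */
    char  *data;  /* not null terminated */
} lp_string;

/* O(1): the length is read straight from the struct. */
size_t lp_length(const lp_string *s)
{
    return s->len;
}

/* Build an lp_string from a C string, reserving `spare` extra bytes.
   Returns 0 on success, -1 on allocation failure. */
int lp_from_cstr(lp_string *out, const char *cstr, size_t spare)
{
    size_t n = strlen(cstr);            /* O(n) scan, paid once at the boundary */
    size_t cap = n + spare;
    char *buf = malloc(cap ? cap : 1);  /* avoid a zero-sized allocation */
    if (buf == NULL)
        return -1;
    memcpy(buf, cstr, n);
    out->len = n;
    out->cap = cap;
    out->data = buf;
    return 0;
}

/* Append src to dst. With spare capacity only the m bytes of src are touched.
   Returns 0 on success, -1 on allocation failure. */
int lp_append(lp_string *dst, const lp_string *src)
{
    size_t m = src->len;
    if (dst->cap - dst->len < m) {
        size_t new_cap = dst->len + m;
        char *buf = realloc(dst->data, new_cap);  /* may copy the old bytes */
        if (buf == NULL)
            return -1;
        dst->data = buf;
        dst->cap = new_cap;
    }
    memcpy(dst->data + dst->len, src->data, m);   /* O(m) */
    dst->len += m;
    return 0;
}

/* Null terminated counterpart: dst is scanned to find its end first.
   The caller must guarantee that dst has room for the result. */
void cstr_append(char *dst, const char *src)
{
    size_t n = strlen(dst);       /* O(n) */
    size_t m = strlen(src);       /* O(m) */
    memcpy(dst + n, src, m + 1);  /* copy the terminator too */
}
```

The standard `strcat` behaves like `cstr_append` here: it has no way to find the end of the destination other than walking it.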

From the answers below, these are some cases where null terminated strings are more efficient:

- When you need to cut off the start of a string and pass it to some method. You can't really do this in constant time with length prefixing, even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules.
- In some cases where you're just looping through the string character by character, you might be able to save a CPU register. Note that this works only if you haven't dynamically allocated the string (because then you'd have to free it, which means using the CPU register you saved to hold the pointer you originally got from malloc and friends).

None of the above are nearly as common as length and concat.
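The first case (cutting off the start of a string) can be sketched as follows. The names `cstr_skip` and `pstr_skip_in_place` are hypothetical, and the in-band layout assumed here is a single length byte followed by the characters, in the style of classic Pascal strings. With a wider, aligned prefix the new length cannot simply be written at an arbitrary offset inside the string, which is the alignment point made above; this sketch sidesteps that by moving the remaining bytes instead.

```c
#include <stddef.h>
#include <string.h>

/* Null terminated: dropping the first n characters is pointer arithmetic.
   The result is itself a valid C string sharing the same storage.
   The caller must ensure n <= strlen(s). */
const char *cstr_skip(const char *s, size_t n)
{
    return s + n;  /* O(1) */
}

/* Assumed layout: p[0] holds the length, p[1..len] hold the characters.
   Because the prefix must stay directly in front of the data, one
   straightforward way to drop the first n characters is to move the
   remainder down, which costs O(len - n). */
void pstr_skip_in_place(unsigned char *p, size_t n)
{
    size_t len = p[0];
    if (n > len)
        n = len;
    memmove(p + 1, p + 1 + n, len - n);
    p[0] = (unsigned char)(len - n);
}
```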

There's one more case asserted in the answers below:

- When you need to cut off the end of a string

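but this one is incorrect: it takes the same amount of time for null terminated and length prefixed strings. (Null terminated strings just stick a null where you want the new end to be; length prefixed strings just subtract from the prefix.)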
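Both truncations can be sketched in a few lines of C. The names `cstr_truncate`, `counted_str`, and `counted_truncate` are hypothetical, and the counted variant keeps its length in a struct field; either way, the work is a single store.

```c
#include <stddef.h>

/* Null terminated: write a terminator at the new end.
   The caller guarantees new_len <= strlen(s) and that s is writable. */
void cstr_truncate(char *s, size_t new_len)
{
    s[new_len] = '\0';  /* O(1) */
}

/* Hypothetical counted string. */
typedef struct {
    size_t len;
    char  *data;
} counted_str;

/* Counted: shrink the stored length; the bytes themselves are untouched. */
void counted_truncate(counted_str *s, size_t new_len)
{
    if (new_len < s->len)
        s->len = new_len;  /* O(1) */
}
```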


Current answer

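I think the better question is why you think C owes you anything. C was designed to give you what you need, nothing more. You need to lose the mentality that the language must provide you with everything. Or just continue to use your higher level languages, which will give you the luxury of String, Calendar, and Containers; and in the case of Java you get each thing in tonnes of variety: multiple types of String, multiple types of unordered_map(s).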

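Too bad for you, this was not the purpose of C. C was not designed to be a bloated language that offers everything from a pin to an anchor. Instead you must rely on third party libraries or your own. And there is nothing easier than creating a simple struct that contains a string and its size.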

#include <stddef.h>  /* size_t */

struct String
{
    const char *s;  /* pointer to the characters */
    size_t len;     /* number of characters */
};

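You know what the problem is, though: it is not standard. Another language might decide to put the len before the string. Another might decide to use a pointer to the end instead. Another might decide to use six pointers to make the String more efficient. A null terminated string, however, is the most standard format for a string, and you can use it to interface with any language. Even Java JNI uses null terminated strings.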

Lastly, it is a common saying: the right data structure for the task. If you find that you need to know the size of a string more than anything else, then use a string structure that lets you do that optimally. But don't claim that this operation is used more than any other by everybody. Why, for instance, is knowing the size of a string more important than reading its contents? I find that reading the contents of a string is what I mostly do, so I use null terminated strings instead of std::string, which saves me 5 pointers on a GCC compiler. If I can save even 2 pointers, that is good.
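As a rough sketch of how the struct String from this answer might be used, the helpers below wrap an existing C string and compare two values. `string_from_cstr` and `string_equal` are hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <string.h>

struct String
{
    const char *s;
    size_t len;
};

/* Hypothetical helper: wrap an existing C string without copying it.
   The length is measured once here and is O(1) to read afterwards. */
struct String string_from_cstr(const char *cstr)
{
    struct String str = { cstr, strlen(cstr) };
    return str;
}

/* Hypothetical helper: the lengths are compared first, so strings of
   different sizes are rejected without reading their contents. */
int string_equal(struct String a, struct String b)
{
    return a.len == b.len && memcmp(a.s, b.s, a.len) == 0;
}
```

Because `s` still points at a null terminated buffer in this usage, the same value can be handed to any API that expects a plain C string.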

Other answers

From the horse's mouth

None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

Dennis M Ritchie, The Development of the C Language

GCC accepts the code below:

char s[4] = "abcd";

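and that's fine if we treat it as an array of chars rather than a string. That is, we can access it with s[0], s[1], s[2], and s[3], or even with memcpy(dest, s, 4). But we'll get garbage characters when we try puts(s), or worse, strcpy(dest, s), because the array has no room for the terminating null.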
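A minimal sketch of handling such an array safely follows. The names `four_chars`, `has_terminator`, and `copy_as_string` are hypothetical. Note that C permits this initializer and silently drops the terminator, whereas C++ rejects it, and some compilers may warn about it.

```c
#include <stddef.h>
#include <string.h>

/* Exactly four chars: there is no room for the terminating null. */
static const char four_chars[4] = "abcd";

/* Returns nonzero if a null byte appears within the first `size` bytes,
   i.e. if buf can safely be treated as a C string. */
int has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Copy the four bytes and add a terminator, so that dest becomes a
   proper C string. dest must have room for at least 5 bytes. */
void copy_as_string(char *dest)
{
    memcpy(dest, four_chars, sizeof four_chars);  /* size is known, no scan */
    dest[sizeof four_chars] = '\0';
}
```

After `copy_as_string(dest)`, calls such as `puts(dest)` or `strcpy(other, dest)` are well defined, which they are not for the original array.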

In C there are no strings. A "string" in C is just a pointer to char. So maybe you're asking the wrong question.

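"What is the rationale for leaving out a string type" might be more relevant. To that I would point out that C is not an object oriented language and has only basic value types. A string is a higher level concept that has to be implemented by combining values of other types in some way. C is at a lower level of abstraction.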

In light of the raging squall below:

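I just want to point out that I'm not trying to say this is a stupid or bad question, or that the C way of representing strings is the best choice. What I'm trying to clarify is that the question would be put more succinctly if you take into account the fact that C has no mechanism for differentiating a string as a datatype from a byte array. Is this the best choice in light of the processing and memory power of today's computers? Probably not. But hindsight is always 20/20 and all that.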

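Somehow I understood the question to imply that there is no compiler support for length-prefixed strings in C. The following example shows that you can at least start your own C string library, where string lengths are computed at compile time, with a construct like this: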

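#include <stdio.h>

/* Counted string: n is the length in bytes, p points at the characters. */
typedef struct { int n; char * p; } prefix_str_t;

/* sizeof on a string literal is a compile-time constant that includes the
   terminating null, hence the -1. Works for literals and arrays, not pointers. */
#define PREFIX_STR(s) ((prefix_str_t){ (int)(sizeof(s)-1), (s) })

int main(void) {
    prefix_str_t string1, string2;

    string1 = PREFIX_STR("Hello!");
    string2 = PREFIX_STR("Allows \0 chars (even if printf directly doesn't)");

    printf("%d %s\n", string1.n, string1.p); /* prints: "6 Hello!" */
    printf("%d %s\n", string2.n, string2.p); /* prints: "48 Allows " */

    return 0;
}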

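This is not without issues, however, because you need to be careful about when to free the string pointer and when it is statically allocated (a literal char array).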
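One possible way to keep track of that distinction is sketched below. It is an assumption layered on top of prefix_str_t, not part of the example above: `owned_str_t`, its `owned` flag, `OWNED_STR_LITERAL`, `owned_str_copy`, and `owned_str_release` are all hypothetical names.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    int   n;      /* length in bytes, as in prefix_str_t */
    char *p;      /* the characters */
    int   owned;  /* assumption: nonzero only when p came from malloc */
} owned_str_t;

/* Wrap a string literal: static storage, so it must never be freed. */
#define OWNED_STR_LITERAL(s) ((owned_str_t){ (int)(sizeof(s) - 1), (s), 0 })

/* Make a heap copy of n bytes (n must be >= 0); embedded nulls are kept.
   Returns { 0, NULL, 0 } if the allocation fails. */
owned_str_t owned_str_copy(const char *src, int n)
{
    owned_str_t r = { 0, NULL, 0 };
    char *buf = malloc((size_t)n + 1);
    if (buf == NULL)
        return r;
    memcpy(buf, src, (size_t)n);
    buf[n] = '\0';  /* terminator kept for interop with C APIs */
    r.n = n;
    r.p = buf;
    r.owned = 1;
    return r;
}

/* Free the buffer only if this value owns it, then reset the fields. */
void owned_str_release(owned_str_t *s)
{
    if (s->owned)
        free(s->p);
    s->n = 0;
    s->p = NULL;
    s->owned = 0;
}
```

The flag costs an extra field per string, which is the kind of overhead the zero-terminated convention avoids.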

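Edit: As a more direct answer to the question, my view is that this is how C could support both: the string length is available (as a compile time constant) should you need it, yet there is still no memory overhead if you want to use only pointers and zero termination.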

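Of course, working with zero-terminated strings seems to have been the recommended practice, since the standard library in general doesn't take string lengths as arguments, and since extracting the length isn't as straightforward as writing char * s = "abc", as my example shows.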