空终止字符串的基本原理是什么?

尽管我很喜欢C和c++，但我还是忍不住对空结尾字符串的选择抓耳挠脑:

Length prefixed (i.e. Pascal) strings existed before C Length prefixed strings make several algorithms faster by allowing constant time length lookup. Length prefixed strings make it more difficult to cause buffer overrun errors. Even on a 32 bit machine, if you allow the string to be the size of available memory, a length prefixed string is only three bytes wider than a null terminated string. On 16 bit machines this is a single byte. On 64 bit machines, 4GB is a reasonable string length limit, but even if you want to expand it to the size of the machine word, 64 bit machines usually have ample memory making the extra seven bytes sort of a null argument. I know the original C standard was written for insanely poor machines (in terms of memory), but the efficiency argument doesn't sell me here. Pretty much every other language (i.e. Perl, Pascal, Python, Java, C#, etc) use length prefixed strings. These languages usually beat C in string manipulation benchmarks because they are more efficient with strings. C++ rectified this a bit with the std::basic_string template, but plain character arrays expecting null terminated strings are still pervasive. This is also imperfect because it requires heap allocation. Null terminated strings have to reserve a character (namely, null), which cannot exist in the string, while length prefixed strings can contain embedded nulls.

其中一些东西比C语言出现得更晚，所以C语言不知道它们是有道理的。然而，在C语言出现之前，有些语言就已经很简单了。为什么会选择空终止字符串，而不是明显更好的长度前缀?

编辑:因为有些人问了关于我上面提到的效率点的事实(他们不喜欢我已经提供的事实)，他们源于以下几点:

使用空结尾字符串的Concat需要O(n + m)时间复杂度。长度前缀通常只需要O(m)。使用空结尾字符串的长度需要O(n)时间复杂度。长度前缀为O(1)。 Length和concat是迄今为止最常见的字符串操作。在一些情况下，以空结尾的字符串可能更有效，但这种情况发生的频率要低得多。

从下面的答案，这些是一些情况下，空终止字符串更有效:

When you need to cut off the start of a string and need to pass it to some method. You can't really do this in constant time with length prefixing even if you are allowed to destroy the original string, because the length prefix probably needs to follow alignment rules. In some cases where you're just looping through the string character by character you might be able to save a CPU register. Note that this works only in the case that you haven't dynamically allocated the string (Because then you'd have to free it, necessitating using that CPU register you saved to hold the pointer you originally got from malloc and friends).

上面这些词都没有length和concat常见。

下面的答案中还有一个断言:

你需要把绳子的一端剪掉

但这个是不正确的——对于以null结尾的字符串和有长度前缀的字符串，它的时间是相同的。(以Null结尾的字符串只是在你想要的新结尾的地方插入一个Null，长度前缀只是从前缀中减去。)

当前回答

与长度前缀相比，null终止的一个优点是字符串比较的简单性，这一点我没有看到任何人提到过。考虑比较标准，它返回小于、等于或大于的有符号结果。对于长度前缀，算法必须遵循以下几行:

Compare the two lengths; record the smaller, and note if they are equal (this last step might be deferred to step 3). Scan the two character sequences, subtracting characters at matching indices (or use a dual pointer scan). Stop either when the difference is nonzero, returning the difference, or when the number of characters scanned is equal to the smaller length. When the smaller length is reached, one string is a prefix of the other. Return negative or positive value according to which is shorter, or zero if of equal length.

将其与null终止算法进行对比:

扫描两个字符序列，在匹配的索引处减去字符[注意，移动指针处理得更好]。当差值非零时停止，返回差值。注意:如果一个字符串是另一个字符串的PROPER前缀，减法中的一个字符将为NUL，即零，比较将自然地停止在那里。如果差值为零，-only then-检查是否有字符为NUL。如果是，则返回0，否则继续到下一个字符。

以null结尾的情况更简单，并且非常容易用双指针扫描高效地实现。带长度前缀的大小写至少做同样多的工作，几乎总是更多。如果你的算法必须做大量的字符串比较[e。编译器!]，以null结尾的情况胜出。现在，这可能不那么重要了，但在过去，是的。

2021-09-14 14:57:16

其他回答

不一定是基本原理，而是长度编码的对应物

Certain forms of dynamic length encoding are superior to static length encoding as far as memory is concerned, it all depends on usage. Just look at UTF-8 for proof. It's essentially an extensible character array for encoding a single character. This uses a single bit for each extended byte. NUL termination uses 8 bits. Length-prefix I think can be reasonably termed infinite length as well by using 64 bits. How often you hit the case of your extra bits is the deciding factor. Only 1 extremely large string? Who cares if you're using 8 or 64 bits? Many small strings (Ie Strings of English words)? Then your prefix costs are a large percentage. Length-prefixed strings allowing time savings is not a real thing. Whether your supplied data is required to have length provided, you're counting at compile time, or you're truly being provided dynamic data that you must encode as a string. These sizes are computed at some point in the algorithm. A separate variable to store the size of a null terminated string can be provided. Which makes the comparison on time-savings moot. One just has an extra NUL at the end... but if the length encode doesn't include that NUL then there's literally no difference between the two. There's no algorithmic change required at all. Just a pre-pass you have to manually design yourself instead of having a compiler/runtime do it for you. C is mostly about doing things manually. Length-prefix being optional is a selling point. I don't always need that extra info for an algorithm so being required to do it for a every string makes my precompute+compute time never able to drop below O(n). (Ie hardware random number generator 1-128. I can pull from an "infinite string". Let's say it only generates characters so fast. So our string length changes all the time. But my usage of the data probably doesn't care how many random bytes I have. It just wants the next available unused byte as soon as it can get it after a request. I could be waiting on the device. But I could also have a buffer of characters pre-read. A length comparison is a needless waste of computation. A null check is more efficient.) Length-prefix is a good guard against buffer overflow? So is sane usage of library functions and implementation. What if I pass in malformed data? My buffer is 2 bytes long but I tell the function it's 7! Ex: If gets() was intended to be used on known data it could've had an internal buffer check that tested compiled buffers and malloc() calls and still follow spec. If it was meant to be used as a pipe for unknown STDIN to arrive at unknown buffer then clearly one can't know abut the buffer size which means a length arg is pointless, you need something else here like a canary check. For that matter, you can't length-prefix some streams and inputs, you just can't. Which means the length check has to be built into the algorithm and not a magic part of the typing system. TL;DR NUL-terminated never had to be unsafe, it just ended up that way via misuse. counter-counter point: NUL-termination is annoying on binary. You either need to do length-prefix here or transform NUL bytes in some way: escape-codes, range remapping, etc... which of course means more-memory-usage/reduced-information/more-operations-per-byte. Length-prefix mostly wins the war here. The only upside to a transform is that no additional functions have to be written to cover the length-prefix strings. Which means on your more optimized sub-O(n) routines you can have them automatically act as their O(n) equivalents without adding more code. Downside is, of course, time/memory/compression waste when used on NUL heavy strings. Depending on how much of your library you end up duplicating to operate on binary data, it may make sense to work solely with length-prefix strings. That said one could also do the same with length-prefix strings... -1 length could mean NUL-terminated and you could use NUL-terminated strings inside length-terminated. Concat: "O(n+m) vs O(m)" I'm assuming your referring to m as the total length of the string after concatenating because they both have to have that number of operations minimum (you can't just tack-on to string 1, what if you have to realloc?). And I'm assuming n is a mythical amount of operations you no longer have to do because of a pre-compute. If so, then the answer is simple: pre-compute. If you're insisting you'll always have enough memory to not need to realloc and that's the basis of the big-O notation then the answer is even more simple: do binary search on allocated memory for end of string 1, clearly there's a large swatch of infinite zeros after string 1 for us to not worry about realloc. There, easily got n to log(n) and I barely tried. Which if you recall log(n) is essentially only ever as large as 64 on a real computer, which is essentially like saying O(64+m), which is essentially O(m). (And yes that logic has been used in run-time analysis of real data structures in-use today. It's not bullshit off the top of my head.) Concat()/Len() again: Memoize results. Easy. Turns all computes into pre-computes if possible/necessary. This is an algorithmic decision. It's not an enforced constraint of the language. String suffix passing is easier/possible with NUL termination. Depending on how length-prefix is implemented it can be destructive on original string and can sometimes not even be possible. Requiring a copy and pass O(n) instead of O(1). Argument-passing/de-referencing is less for NUL-terminated versus length-prefix. Obviously because you're passing less information. If you don't need length, then this saves a lot of footprint and allows optimizations. You can cheat. It's really just a pointer. Who says you have to read it as a string? What if you want to read it as a single character or a float? What if you want to do the opposite and read a float as a string? If you're careful you can do this with NUL-termination. You can't do this with length-prefix, it's a data type distinctly different from a pointer typically. You'd most likely have to build a string byte-by-byte and get the length. Of course if you wanted something like an entire float (probably has a NUL inside it) you'd have to read byte-by-byte anyway, but the details are left to you to decide.

TL;DR您使用二进制数据吗?如果不是，那么null终止允许更多的算法自由。如果是，那么代码数量vs速度/内存/压缩是你的主要关注点。两种方法的混合或记忆可能是最好的。

2018-08-28 03:13:55

在很多方面，C语言是原始的。我很喜欢。

它比汇编语言高了一步，用一种更容易编写和维护的语言提供了几乎相同的性能。

空结束符很简单，不需要语言的特殊支持。

现在回想起来，似乎并不是那么方便。但我在80年代使用汇编语言，当时它似乎非常方便。我只是认为软件在不断地发展，平台和工具也在不断地变得越来越复杂。

2010-12-11 23:02:16

我不相信“C没有字符串”的答案。没错，C语言不支持内置的高级类型，但你仍然可以用C语言表示数据结构，这就是字符串。在C语言中，字符串只是一个指针，但这并不意味着前N个字节不能作为长度具有特殊意义。

Windows/COM开发人员将非常熟悉BSTR类型，它就像这样——一个有长度前缀的C字符串，其中实际的字符数据不是从字节0开始的。

因此，使用空终止符的决定似乎只是人们喜欢的，而不是语言的必要。

2020-02-11 19:04:26

Obviously for performance and safety, you'll want to keep the length of a string while you're working with it rather than repeatedly performing strlen or the equivalent on it. However, storing the length in a fixed location just before the string contents is an incredibly bad design. As Jörgen pointed out in the comments on Sanjit's answer, it precludes treating the tail of a string as a string, which for example makes a lot of common operations like path_to_filename or filename_to_extension impossible without allocating new memory (and incurring the possibility of failure and error handling). And then of course there's the issue that nobody can agree how many bytes the string length field should occupy (plenty of bad "Pascal string" languages used 16-bit fields or even 24-bit fields which preclude processing of long strings).

C语言让程序员选择是否/在哪里/如何存储长度的设计更加灵活和强大。当然，程序员必须聪明。C语言惩罚愚蠢的程序崩溃，慢慢停止，或者让你的敌人扎根。

2010-12-11 22:10:58

即使在32位机器上，如果允许字符串的大小与可用内存相同，带前缀的长度字符串也只比以空结尾的字符串宽3个字节。

首先，对于短字符串来说，额外的3个字节可能是相当大的开销。具体来说，零长度字符串现在占用的内存是原来的4倍。我们中的一些人正在使用64位机器，因此我们要么需要8个字节来存储零长度的字符串，要么字符串格式无法处理平台支持的最长字符串。

可能还需要处理对齐问题。假设我有一个包含7个字符串的内存块，比如“solo\0second\0\0four\0five\0\0seventh”。第二个字符串从偏移量5开始。硬件可能要求32位整数以4的倍数的地址对齐，因此您必须添加填充，从而进一步增加开销。相比之下，C表示非常节省内存。(内存效率很好;例如，它有助于缓存性能。)

2012-07-23 12:45:26

空终止字符串的基本原理是什么?

推荐文章

最新文章

标签