How do I truncate a string to 75 characters in Python?

This is how it is done in JavaScript:

var data="saddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsaddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsadddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
var info = (data.length > 75) ? data.substring(0, 75) + '..' : data;

info = (data[:75] + '..') if len(data) > 75 else data

Even more concisely:

data = data[:75]

If the string is shorter than 75 characters, nothing changes.
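Slicing past the end of a string is safe in Python, which is why no length check is needed here. A minimal sketch of that behaviour (the helper name `clip` is hypothetical and only used for illustration):

```python
def clip(text, limit=75):
    """Return at most `limit` leading characters of `text`."""
    # A slice never raises IndexError; it simply stops at the end of the string.
    return text[:limit]


short = "abc"
long_text = "x" * 100

print(clip(short))            # abc
print(len(clip(long_text)))   # 75
```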


Even shorter:

info = data[:75] + (data[75:] and '..')
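The expression works because an empty string is falsy: `data[75:] and '..'` evaluates to `''` when nothing remains past position 75, and to `'..'` otherwise. A small sketch, wrapped in a hypothetical `shorten_with_dots` function for illustration:

```python
def shorten_with_dots(data, limit=75):
    # data[limit:] is '' (falsy) when data has at most `limit` characters,
    # so `and` returns '' and nothing is appended.
    return data[:limit] + (data[limit:] and '..')


print(shorten_with_dots("short"))          # short
print(len(shorten_with_dots("a" * 80)))    # 77 (75 characters plus '..')
```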

>>> info = lambda data: len(data)>10 and data[:10]+'...' or data
>>> info('sdfsdfsdfsdfsdfsdfsdfsdfsdfsdfsdf')
'sdfsdfsdfs...'
>>> info('sdfsdf')
'sdfsdf'

With a regular expression:

import re
re.sub(r'^(.{75}).*$', r'\g<1>...', data)

Long strings are truncated:

>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{75}).*$', r'\g<1>...', data)
'111111111122222222223333333333444444444455555555556666666666777777777788888...'

Shorter strings are never truncated:

>>> data="11111111112222222222333333"
>>> re.sub(r'^(.{75}).*$', r'\g<1>...', data)
'11111111112222222222333333'

This way you can also "cut" the middle part of the string, which is nicer in some cases:

re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', data)

>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', data)
'11111...88888'
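The same middle cut can also be written with plain slicing. This is only a sketch with a hypothetical helper name, `cut_middle`; unlike the regex, it leaves a string alone when the shortened form would not actually be shorter.

```python
def cut_middle(data, keep=5, marker='...'):
    """Keep `keep` characters from each end and replace the middle with `marker`.

    Assumes keep >= 1.
    """
    # Only cut when the result is actually shorter than the input.
    if len(data) <= 2 * keep + len(marker):
        return data
    return data[:keep] + marker + data[-keep:]


data = "11111111112222222222333333333344444444445555555555666666666677777777778888888888"
print(cut_middle(data))           # 11111...88888
print(cut_middle("1234567890"))   # 1234567890
```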

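You can't actually "truncate" a Python string the way you can a dynamically allocated C string. Strings in Python are immutable. What you can do is slice a string, as described in the other answers, which yields a new string containing only the characters defined by the slice offsets and step. In some (non-practical) cases this can be a little annoying, such as when you choose Python as your interview language and the interviewer asks you to remove duplicate characters from a string. Doh.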
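To make the immutability point concrete: a slice returns a new object and leaves the original untouched, while item assignment raises `TypeError`. The duplicate-removal task can still be solved by building a new string; `dict.fromkeys` is used below as one possible approach, not as the only one.

```python
data = "hello world"
prefix = data[:5]          # a new string; `data` itself is not modified

print(prefix)              # hello
print(data)                # hello world

try:
    data[0] = "H"          # strings do not support item assignment
except TypeError as exc:
    print("TypeError:", exc)

# Removing duplicate characters means building a new string.
# dict.fromkeys keeps the first occurrence of each character, in order.
deduped = "".join(dict.fromkeys(data))
print(deduped)             # helo wrd
```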


For a Django solution (which the question does not mention):

from django.utils.text import Truncator
value = Truncator(value).chars(75)

Take a look at the source code of Truncator to understand the problem: https://github.com/django/django/blob/master/django/utils/text.py#L66

About truncation in Django: Django HTML truncation


Here is another solution. With True and False as the keys, you get a little feedback about the test at the end.

data = {True: data[:75] + '..', False: data}[len(data) > 75]

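There is no need for a regular expression, but you do want to use string formatting rather than the string concatenation used in the accepted answer.

This is probably the most canonical, Pythonic way to truncate the string data at 75 characters.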

>>> data = "11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> info = "{}..".format(data[:75]) if len(data) > 75 else data
>>> info
'111111111122222222223333333333444444444455555555556666666666777777777788888..'

This method does not use any if:

data[:75] + bool(data[75:]) * '..'
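This relies on `bool` being a subclass of `int`: multiplying a string by `True` (1) keeps it, and multiplying by `False` (0) gives an empty string. A short sketch of the idea, with a hypothetical function name and a `suffix` parameter added for illustration:

```python
def truncate_no_if(data, limit=75, suffix='..'):
    # bool(data[limit:]) is True only when characters remain past `limit`;
    # a string times True is the string itself, times False it is ''.
    return data[:limit] + bool(data[limit:]) * suffix


print(truncate_no_if("abc"))               # abc
print(truncate_no_if("abcdef", limit=3))   # abc..
```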


If you are using Python 3.4+, you can use textwrap.shorten from the standard library:

Collapse and truncate the given text to fit in the given width. First the whitespace in text is collapsed (all whitespace is replaced by single spaces). If the result fits in the width, it is returned. Otherwise, enough words are dropped from the end so that the remaining words plus the placeholder fit within width:

>>> textwrap.shorten("Hello world!", width=12)
'Hello world!'
>>> textwrap.shorten("Hello world!", width=11)
'Hello [...]'
>>> textwrap.shorten("Hello world", width=10, placeholder="...")
'Hello...'
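Two consequences of that description are worth seeing in action: whitespace is normalised even when nothing is cut, and because `textwrap.shorten` works on whole words, text without spaces can come back much shorter than the requested width. A small sketch (the sample strings are made up for illustration):

```python
import textwrap

# Whitespace is collapsed before the width check is applied.
print(textwrap.shorten("Hello   world", width=20))                   # Hello world

# Whole words are dropped from the end to make room for the placeholder.
print(textwrap.shorten("Hello world!", width=11))                    # Hello [...]
print(textwrap.shorten("Hello world", width=10, placeholder="..."))  # Hello...

# A single long word is not cut to fit, so the result can be much
# shorter than the requested width.
no_spaces = "d" * 100
result = textwrap.shorten(no_spaces, width=75)
print(len(result) <= 75)                                             # True
```

If character-exact truncation matters more than word boundaries, plain slicing is the more predictable tool.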


Just came up with this:

n = 8
s = '123'
print(s[:n-3] + (s[n-3:], '...')[len(s) > n])
s = '12345678'
print(s[:n-3] + (s[n-3:], '...')[len(s) > n])
s = '123456789'
print(s[:n-3] + (s[n-3:], '...')[len(s) > n])
s = '123456789012345'
print(s[:n-3] + (s[n-3:], '...')[len(s) > n])

123
12345678
12345...
12345...

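Here is a function I wrote as part of a new String class. It lets you add a suffix (only when the string is actually trimmed and the maximum length is long enough to hold it, although you do not have to enforce the absolute size).

I was in the middle of changing a few things, so there is some wasted logic (the _truncate checks, for example) that is no longer needed now that there is a return at the top.

Still, it is a good function for truncating data.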

##
## Truncate characters of a string after _len'nth char, if necessary... If _len is less than 0, don't truncate anything... Note: If you attach a suffix, and you enable absolute max length then the suffix length is subtracted from max length... Note: If the suffix length is longer than the output then no suffix is used...
##
## Usage: Where _text = 'Testing', _width = 4
##      _data = String.Truncate( _text, _width )                        == Test
##      _data = String.Truncate( _text, _width, '..', True )            == Te..
##
## Equivalent Alternates: Where _text = 'Testing', _width = 4
##      _data = String.SubStr( _text, 0, _width )                       == Test
##      _data = _text[  : _width ]                                      == Test
##      _data = ( _text )[  : _width ]                                  == Test
##
def Truncate( _text, _max_len = -1, _suffix = False, _absolute_max_len = True ):
    ## Length of the string we are considering for truncation
    _len            = len( _text )

    ## Whether or not we have to truncate
    _truncate       = ( False, True )[ _len > _max_len ]

    ## Note: If we don't need to truncate, there's no point in proceeding...
    if ( not _truncate ):
        return _text

    ## The suffix in string form
    _suffix_str     = ( '',  str( _suffix ) )[ _truncate and _suffix != False ]

    ## The suffix length
    _len_suffix     = len( _suffix_str )

    ## Whether or not we add the suffix
    _add_suffix     = ( False, True )[ _truncate and _suffix != False and _max_len > _len_suffix ]

    ## Suffix Offset
    _suffix_offset = _max_len - _len_suffix
    _suffix_offset  = ( _max_len, _suffix_offset )[ _add_suffix and _absolute_max_len != False and _suffix_offset > 0 ]

    ## The truncate point.... If not necessary, then length of string.. If necessary then the max length with or without subtracting the suffix length... Note: It may be easier ( less logic cost ) to simply add the suffix to the calculated point, then truncate - if point is negative then the suffix will be destroyed anyway.
    ## If we don't need to truncate, then the length is the length of the string.. If we do need to truncate, then the length depends on whether we add the suffix and offset the length of the suffix or not...
    _len_truncate   = ( _len, _max_len )[ _truncate ]
    _len_truncate   = ( _len_truncate, _max_len )[ _len_truncate <= _max_len ]

    ## If we add the suffix, add it... Suffix won't be added if the suffix is the same length as the text being output...
    if ( _add_suffix ):
        _text = _text[ 0 : _suffix_offset ] + _suffix_str + _text[ _suffix_offset: ]

    ## Return the text after truncating...
    return _text[ : _len_truncate ]

limit = 75
info = data[:limit] + '..' * (len(data) > limit)

info = data[:75] + ('..' if len(data) > 75 else '')

info = data[:min(len(data), 75)]

A simple and short helper function:

def truncate_string(value, max_length=255, suffix='...'):
    string_value = str(value)
    # Return early so strings that already fit are never shortened.
    if len(string_value) <= max_length:
        return string_value
    return string_value[:max_length - len(suffix)] + suffix

Usage examples:

# Example 1 (default):

long_string = ""
for number in range(1, 1000): 
    long_string += str(number) + ','    

result = truncate_string(long_string)
print(result)


# Example 2 (custom length):

short_string = 'Hello world'
result = truncate_string(short_string, 8)
print(result) # > Hello... 


# Example 3 (not truncated):

short_string = 'Hello world'
result = truncate_string(short_string)
print(result) # > Hello world


Coming in very late, I want to add my solution, which trims text at the character level and also handles whitespace properly.

def trim_string(s: str, limit: int, ellipsis='…') -> str:
    s = s.strip()
    if len(s) > limit:
        return s[:limit-1].strip() + ellipsis
    return s

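Simple, but it makes sure that hello world with limit=7 results in hello… rather than an ugly hello ….

It also removes leading and trailing whitespace, but not whitespace inside the string. If you also want to remove inner whitespace, check out this Stack Overflow post.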
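For the inner-whitespace case, one common approach is to collapse runs of whitespace with `split` and `join` before trimming. This is an assumption about what is wanted here, not necessarily what the linked post recommends, and the name `trim_collapsed` is hypothetical:

```python
def trim_collapsed(s: str, limit: int, ellipsis: str = '…') -> str:
    # " ".join(s.split()) strips both ends and collapses inner whitespace runs.
    s = " ".join(s.split())
    if len(s) > limit:
        # Reserve one position for the single-character ellipsis,
        # then drop any trailing space left by the cut.
        return s[:limit - 1].rstrip() + ellipsis
    return s


print(trim_collapsed("  hello    world  ", 7))    # hello…
print(trim_collapsed("  hello    world  ", 20))   # hello world
```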


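Here I use textwrap.shorten and handle a few more edge cases. It also includes part of the last word in case that word takes up more than 50% of the maximum width.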

import textwrap


def shorten(text: str, width=30, placeholder="..."):
    """Collapse and truncate the given text to fit in the given width.

    The text first has its whitespace collapsed. If it then fits in the *width*, it is returned as is.
    Otherwise, as many words as possible are joined and then the placeholder is appended.
    """
    if not text or not isinstance(text, str):
        return str(text)
    t = text.strip()
    if len(t) <= width:
        return t

    # textwrap.shorten also throws ValueError if placeholder too large for max width
    shorten_words = textwrap.shorten(t, width=width, placeholder=placeholder)

    # textwrap.shorten doesn't split words, so if the text contains a long word without spaces, the result may be too short without this word.
    # Here we use a different way to include the start of this word in case shorten_words is less than 50% of `width`
    if len(shorten_words) - len(placeholder) < (width - len(placeholder)) * 0.5:
        return t[:width - len(placeholder)].strip() + placeholder
    return shorten_words

Tests:

>>> shorten("123 456", width=7, placeholder="...")
'123 456'
>>> shorten("1 23 45 678 9", width=12, placeholder="...")
'1 23 45...'
>>> shorten("1 23 45 678 9", width=10, placeholder="...")
'1 23 45...'
>>> shorten("01 23456789", width=10, placeholder="...")
'01 2345...'
>>> shorten("012 3 45678901234567", width=17, placeholder="...")
'012 3 45678901...'
>>> shorten("1 23 45 678 9", width=9, placeholder="...")
'1 23...'
>>> shorten("1 23456", width=5, placeholder="...")
'1...'
>>> shorten("123 456", width=5, placeholder="...")
'12...'
>>> shorten("123 456", width=6, placeholder="...")
'123...'
>>> shorten("12 3456789", width=9, placeholder="...")
'12 345...'
>>> shorten("   12 3456789    ", width=9, placeholder="...")
'12 345...'
>>> shorten('123 45', width=4, placeholder="...")
'1...'
>>> shorten('123 45', width=3, placeholder="...")
'...'
>>> shorten("123456", width=3, placeholder="...")
'...'
>>> shorten([1], width=9, placeholder="...")
'[1]'
>>> shorten(None, width=5, placeholder="...")
'None'
>>> shorten("", width=9, placeholder="...")
''

Suppose that stryng is the string we wish to truncate and that nchars is the number of characters desired in the output string.

stryng = "sadddddddddddddddddddddddddddddddddddddddddddddddddd"
nchars = 10

We can truncate the string as follows:

def truncate(stryng:str, nchars:int):
    return (stryng[:nchars - 6] + " [...]")[:min(len(stryng), nchars)]

The results for some test cases are shown below:

s = "sadddddddddddddddddddddddddddddd!"
s = "sa" + 30*"d" + "!"

truncate(s, 2)                ==  sa
truncate(s, 4)                ==  sadd
truncate(s, 10)               ==  sadd [...]
truncate(s, len(s)//2)        ==  sadddddddd [...]

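My solution produces reasonable results for the test cases above.

However, some pathological cases are shown below (the trailing () is the call syntax of the class-based version given further down):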

truncate(s, len(s) - 3)()       ==  sadddddddddddddddddddddd [...]
truncate(s, len(s) - 2)()       ==  saddddddddddddddddddddddd [...]
truncate(s, len(s) - 1)()       ==  sadddddddddddddddddddddddd [...]
truncate(s, len(s) + 0)()       ==  saddddddddddddddddddddddddd [...]
truncate(s, len(s) + 1)()       ==  sadddddddddddddddddddddddddd [...
truncate(s, len(s) + 2)()       ==  saddddddddddddddddddddddddddd [..
truncate(s, len(s) + 3)()       ==  sadddddddddddddddddddddddddddd [.
truncate(s, len(s) + 4)()       ==  saddddddddddddddddddddddddddddd [
truncate(s, len(s) + 5)()       ==  sadddddddddddddddddddddddddddddd 
truncate(s, len(s) + 6)()       ==  sadddddddddddddddddddddddddddddd!
truncate(s, len(s) + 7)()       ==  sadddddddddddddddddddddddddddddd!
truncate(s, 9999)()             ==  sadddddddddddddddddddddddddddddd!

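Notably:

Problems can arise when the string contains newline characters (\n). Also, when nchars > len(s), we should print the string s as is rather than trying to print the " [...]" marker.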
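One way to address the second point is to return the string unchanged whenever it already fits. The following is a sketch rather than the author's code; the name `truncate_fixed` and the `marker` parameter are hypothetical.

```python
def truncate_fixed(stryng: str, nchars: int, marker: str = " [...]") -> str:
    # Nothing to do when the string already fits.
    if len(stryng) <= nchars:
        return stryng
    # Too little room for the marker: fall back to a plain cut.
    if nchars <= len(marker):
        return stryng[:max(nchars, 0)]
    return stryng[:nchars - len(marker)] + marker


s = "sa" + 30 * "d" + "!"
print(truncate_fixed(s, 4))                 # sadd
print(truncate_fixed(s, 10))                # sadd [...]
print(truncate_fixed(s, len(s)) == s)       # True
print(truncate_fixed(s, len(s) + 3) == s)   # True
```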

Below is some more code:

import io

class truncate:
    """
    Example of code which uses truncate:

        s = "\r<class\n 'builtin_function_or_method'>"
        s = truncate(s, 10)()
        print(s)

    Examples of inputs and outputs:

        truncate(s, 2)()   ==  \r
        truncate(s, 4)()   ==  \r<c
        truncate(s, 10)()  ==  \r<c [...]
        truncate(s, 20)()  ==  \r<class\n 'bu [...]
        truncate(s, 999)() ==  \r<class\n 'builtin_function_or_method'>

    Other notes:
        Returns a modified copy of string input
        Does not modify the original string
    """
    def __init__(self, x_stryng: str, x_nchars: int) -> None:
        """
        This initializer mostly exists to sanitize function inputs
        """
        try:
            stryng = repr("".join(str(ch) for ch in x_stryng))[1:-1]
            nchars = int(str(x_nchars))
        except BaseException as exc:
            invalid_stryng =  str(x_stryng)
            invalid_stryng_truncated = repr(type(self)(invalid_stryng, 20)())

            invalid_x_nchars = str(x_nchars)
            invalid_x_nchars_truncated = repr(type(self)(invalid_x_nchars, 20)())

            strm = io.StringIO()
            print("Invalid Function Inputs", file=strm)
            print(type(self).__name__, "(",
                  invalid_stryng_truncated,
                  ", ",
                  invalid_x_nchars_truncated, ")", sep="", file=strm)
            msg = strm.getvalue()

            raise ValueError(msg) from None

        self._stryng = stryng
        self._nchars = nchars

    def __call__(self) -> str:
        stryng = self._stryng
        nchars = self._nchars
        return (stryng[:nchars - 6] + " [...]")[:min(len(stryng), nchars)]

Here is a simple function that truncates a given string from either side:

def truncate(string, length=75, beginning=True, insert='..'):
    '''Shorten the given string to the given length.
    An ellipsis will be added to the section trimmed.

    :Parameters:
        length (int) = The maximum allowed length before truncating.
        beginning (bool) = Trim starting chars, else; ending.
        insert (str) = Chars to add at the trimmed area. (default: ellipsis)

    :Return:
        (str)

    ex. call: truncate('12345678', 4)
        returns: '..5678'
    '''
    if len(string)>length:
        if beginning: #trim starting chars.
            string = insert+string[-length:]
        else: #trim ending chars.
            string = string[:length]+insert
    return string

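If you want to do more sophisticated string truncation, you can adapt sklearn's approach, as implemented in:

sklearn.base.BaseEstimator.__repr__ (see the original full code at: https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/base.py#L262)

It adds benefits such as avoiding truncation in the middle of a word.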

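import re


def truncate_string(data, N_CHAR_MAX=70):
    # N_CHAR_MAX is the (approximate) maximum number of non-blank
    # characters to render. We pass it as an optional parameter to ease
    # the tests.

    # Strings with at most N_CHAR_MAX non-blank characters are returned
    # unchanged; without this guard re.match below can return None.
    if len("".join(data.split())) <= N_CHAR_MAX:
        return data

    lim = N_CHAR_MAX // 2  # approx. number of chars to keep on both ends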
    regex = r"^(\s*\S){%d}" % lim
    # The regex '^(\s*\S){%d}' % n
    # matches from the start of the string until the nth non-blank
    # character:
    # - ^ matches the start of string
    # - (pattern){n} matches n repetitions of pattern
    # - \s*\S matches a non-blank char following zero or more blanks
    left_lim = re.match(regex, data).end()
    right_lim = re.match(regex, data[::-1]).end()
    if "\n" in data[left_lim:-right_lim]:
        # The left side and right side aren't on the same line.
        # To avoid weird cuts, e.g.:
        # categoric...ore',
        # we need to start the right side with an appropriate newline
        # character so that it renders properly as:
        # categoric...
        # handle_unknown='ignore',
        # so we add [^\n]*\n which matches until the next \n
        regex += r"[^\n]*\n"
        right_lim = re.match(regex, data[::-1]).end()
    ellipsis = "..."
    if left_lim + len(ellipsis) < len(data) - right_lim:
        # Only add ellipsis if it results in a shorter repr
        data = data[:left_lim] + "..." + data[-right_lim:]
    return data
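As a much simpler illustration of the "do not cut inside a word" idea, here is a sketch that backs up to the last space before the limit. It is not sklearn's algorithm, and the name `truncate_at_word` and its parameters are hypothetical.

```python
def truncate_at_word(text, limit=75, suffix='...'):
    """Truncate `text` to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    keep = max(limit - len(suffix), 0)
    cut = text[:keep]
    # If the cut lands inside a word, back up to the previous space.
    # When there is no space at all, the hard cut is kept.
    if not text[keep].isspace():
        head, sep, _ = cut.rpartition(' ')
        if sep:
            cut = head
    return cut.rstrip() + suffix


print(truncate_at_word("the quick brown fox jumps", 15))   # the quick...
print(truncate_at_word("short text", 15))                  # short text
```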