How do I truncate a string to 75 characters in Python?

This is how it is done in JavaScript:

var data="saddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsaddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsadddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
var info = (data.length > 75) ? data.substring(0, 75) + '..' : data;

Current answer

If you want to do some more sophisticated string truncation, you can adapt the approach scikit-learn uses in:

sklearn.base.BaseEstimator.__repr__ (see the original full code at: https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/base.py#L262)

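It adds some benefits: it keeps both the start and the end of the string, counts only non-blank characters, and avoids awkward cuts when the two kept ends fall on different lines.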

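import re


def truncate_string(data, N_CHAR_MAX=70):
    # N_CHAR_MAX is the (approximate) maximum number of non-blank
    # characters to render. We pass it as an optional parameter to ease
    # the tests.

    # Strings with at most N_CHAR_MAX non-blank characters are returned
    # unchanged; without this guard (present in the sklearn original),
    # re.match below can return None for short inputs.
    if len("".join(data.split())) <= N_CHAR_MAX:
        return data

    lim = N_CHAR_MAX // 2  # apprx number of chars to keep on both ends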
    regex = r"^(\s*\S){%d}" % lim
    # The regex '^(\s*\S){%d}' % n
    # matches from the start of the string until the nth non-blank
    # character:
    # - ^ matches the start of string
    # - (pattern){n} matches n repetitions of pattern
    # - \s*\S matches a non-blank char following zero or more blanks
    left_lim = re.match(regex, data).end()
    right_lim = re.match(regex, data[::-1]).end()
    if "\n" in data[left_lim:-right_lim]:
        # The left side and right side aren't on the same line.
        # To avoid weird cuts, e.g.:
        # categoric...ore',
        # we need to start the right side with an appropriate newline
        # character so that it renders properly as:
        # categoric...
        # handle_unknown='ignore',
        # so we add [^\n]*\n which matches until the next \n
        regex += r"[^\n]*\n"
        right_lim = re.match(regex, data[::-1]).end()
    ellipsis = "..."
    if left_lim + len(ellipsis) < len(data) - right_lim:
        # Only add ellipsis if it results in a shorter repr
        data = data[:left_lim] + "..." + data[-right_lim:]
    return data
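
To see what the counting regex does on its own, here is a minimal sketch. `end_of_nth_nonblank` is a hypothetical helper name used only for illustration, not part of scikit-learn. It returns the index just past the n-th non-blank character, or None when the text has fewer than n non-blank characters, which is why very short inputs need separate handling.

```python
import re


def end_of_nth_nonblank(text, n):
    # ^(\s*\S){n} matches from the start of the string up to and
    # including the n-th non-blank character.
    match = re.match(r"^(\s*\S){%d}" % n, text)
    return None if match is None else match.end()


sample = "ab  cd e"
print(end_of_nth_nonblank(sample, 3))  # 5, i.e. sample[:5] == "ab  c"
print(end_of_nth_nonblank(sample, 6))  # None: only 5 non-blank characters
```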

Other answers

With a regular expression:

import re

re.sub(r'^(.{75}).*$', r'\g<1>...', data)

Long strings are truncated:

>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{75}).*$', r'\g<1>...', data)
'111111111122222222223333333333444444444455555555556666666666777777777788888...'

Shorter strings are never truncated:

>>> data="11111111112222222222333333"
>>> re.sub(r'^(.{75}).*$', r'\g<1>...', data)
'11111111112222222222333333'

This way you can also "cut out" the middle part of the string, which is nicer in some cases:

re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', data)

>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', data)
'11111...88888'
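
Two edge cases are worth knowing with the truncating pattern above: because `.*` can match the empty string, a string of exactly 75 characters still gets the ellipsis appended, and because `.` does not match newlines by default, multi-line strings may pass through untruncated. A possible variant, offered as a sketch rather than a drop-in replacement (`truncate_regex` and its parameters are names made up for illustration):

```python
import re


def truncate_regex(data, width=75, suffix="..."):
    # .+ (instead of .*) requires at least one character beyond `width`,
    # so strings of exactly `width` characters are left alone.
    # re.DOTALL lets . match newlines, so multi-line input is handled too.
    pattern = r"^(.{%d}).+$" % width
    # A function replacement avoids escape handling inside `suffix`.
    return re.sub(pattern, lambda m: m.group(1) + suffix, data, flags=re.DOTALL)


print(truncate_regex("a" * 80, width=5))  # aaaaa...
print(truncate_regex("a" * 5, width=5))   # aaaaa
```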

>>> info = lambda data: len(data)>10 and data[:10]+'...' or data
>>> info('sdfsdfsdfsdfsdfsdfsdfsdfsdfsdfsdf')
'sdfsdfsdfs...'
>>> info('sdfsdf')
'sdfsdf'
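
The `and ... or` idiom above works here because `data[:10]+'...'` is never empty, but in general it falls through to the `or` branch whenever the middle operand is falsy. A conditional expression says the same thing without that pitfall. `truncate` and its defaults below are illustrative names, not taken from the answer:

```python
def truncate(data, width=10, suffix="..."):
    # Keep the first `width` characters and append `suffix` only when
    # something was actually cut off.
    return (data[:width] + suffix) if len(data) > width else data


print(truncate("sdfsdfsdfsdfsdfsdfsdfsdfsdfsdfsdf"))  # sdfsdfsdfs...
print(truncate("sdfsdf"))                             # sdfsdf
```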

Here I use textwrap.shorten and handle more edge cases. It also includes part of the last word in case this word is more than 50% of the max width.

import textwrap


def shorten(text: str, width=30, placeholder="..."):
    """Collapse and truncate the given text to fit in the given width.

    The text first has its whitespace collapsed. If it then fits in the *width*, it is returned as is.
    Otherwise, as many words as possible are joined and then the placeholder is appended.
    """
    if not text or not isinstance(text, str):
        return str(text)
    t = text.strip()
    if len(t) <= width:
        return t

    # textwrap.shorten also throws ValueError if placeholder too large for max width
    shorten_words = textwrap.shorten(t, width=width, placeholder=placeholder)

    # textwrap.shorten doesn't split words, so if the text contains a long word without spaces, the result may be too short without this word.
    # Here we use a different way to include the start of this word in case shorten_words is less than 50% of `width`
    if len(shorten_words) - len(placeholder) < (width - len(placeholder)) * 0.5:
        return t[:width - len(placeholder)].strip() + placeholder
    return shorten_words

Tests:

>>> shorten("123 456", width=7, placeholder="...")
'123 456'
>>> shorten("1 23 45 678 9", width=12, placeholder="...")
'1 23 45...'
>>> shorten("1 23 45 678 9", width=10, placeholder="...")
'1 23 45...'
>>> shorten("01 23456789", width=10, placeholder="...")
'01 2345...'
>>> shorten("012 3 45678901234567", width=17, placeholder="...")
'012 3 45678901...'
>>> shorten("1 23 45 678 9", width=9, placeholder="...")
'1 23...'
>>> shorten("1 23456", width=5, placeholder="...")
'1...'
>>> shorten("123 456", width=5, placeholder="...")
'12...'
>>> shorten("123 456", width=6, placeholder="...")
'123...'
>>> shorten("12 3456789", width=9, placeholder="...")
'12 345...'
>>> shorten("   12 3456789    ", width=9, placeholder="...")
'12 345...'
>>> shorten('123 45', width=4, placeholder="...")
'1...'
>>> shorten('123 45', width=3, placeholder="...")
'...'
>>> shorten("123456", width=3, placeholder="...")
'...'
>>> shorten([1], width=9, placeholder="...")
'[1]'
>>> shorten(None, width=5, placeholder="...")
'None'
>>> shorten("", width=9, placeholder="...")
''
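
For comparison, this is roughly how the standard library's `textwrap.shorten` behaves without the wrapper: it drops whole words, so a single long word can leave very little text, and it raises `ValueError` when the placeholder does not fit in the width. The inputs are arbitrary examples.

```python
import textwrap

# The second word does not fit in 10 characters, so it is dropped entirely.
plain = textwrap.shorten("01 23456789", width=10, placeholder="...")
print(plain)  # 01...

# A placeholder longer than the width is rejected.
try:
    textwrap.shorten("hello world", width=2, placeholder="...")
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```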

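You can't actually "truncate" a Python string the way you can a dynamically allocated C string. Strings in Python are immutable. What you can do is slice a string as described in the other answers, which yields a new string containing only the characters defined by the slice offsets and step. In some (impractical) cases this can be a little annoying, such as when you choose Python as your interview language and the interviewer asks you to remove duplicate characters from a string. Doh.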
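
A small illustration of the point, using arbitrary sample data: slicing returns a new string and leaves the original untouched, while assigning into a string raises `TypeError`.

```python
original = "x" * 80
truncated = original[:75]  # a new string; `original` is unchanged

print(len(original), len(truncated))  # 80 75

# Slicing past the end is safe: short strings come back whole.
print("abc"[:75])  # abc

try:
    original[0] = "y"  # strings do not support item assignment
    mutated = True
except TypeError:
    mutated = False
print(mutated)  # False
```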

This approach doesn't use any if:

data[:75] + bool(data[75:]) * '.'
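
The expression relies on `bool` being a subclass of `int`: `bool(data[75:])` is `True` only when something remains past position 75, and multiplying a string by `True` or `False` yields the string once or the empty string. Wrapped in a function for illustration (`truncate_no_if` and its parameters are made-up names):

```python
def truncate_no_if(data, width=75, suffix="."):
    # data[width:] is non-empty only when data is longer than width;
    # bool(...) * suffix is then suffix, otherwise the empty string.
    return data[:width] + bool(data[width:]) * suffix


print(truncate_no_if("a" * 80, width=5))  # aaaaa.
print(truncate_no_if("abc", width=5))     # abc
```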