How do I truncate a string to 75 characters in Python?
This is how it is done in JavaScript:
var data="saddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsaddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddsadddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddd"
var info = (data.length > 75) ? data.substring(0, 75) + '..' : data;
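A direct Python counterpart of the JavaScript line can be sketched with slicing. This is a minimal sketch, not the only way to do it; the helper name `truncate` and its `limit` and `suffix` parameters are made up for illustration.

```python
def truncate(data, limit=75, suffix=".."):
    """Return data cut to `limit` characters, with `suffix` appended if it was cut."""
    # Slicing never raises on short strings, so the length check only
    # decides whether the suffix is needed.
    return data[:limit] + suffix if len(data) > limit else data


print(truncate("x" * 80))  # 75 x's followed by '..'
print(truncate("short"))   # short
```

Note that the result is up to `limit + len(suffix)` characters long, exactly as in the JavaScript version.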
With a regular expression:
re.sub(r'^(.{75}).*$', r'\g<1>...', data)
Long strings are truncated:
>>> import re
>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{75}).*$', r'\g<1>...', data)
'111111111122222222223333333333444444444455555555556666666666777777777788888...'
Strings shorter than 75 characters are left unchanged:
>>> data="11111111112222222222333333"
>>> re.sub(r'^(.{75}).*$', r'\g<1>...', data)
'11111111112222222222333333'
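Two edge cases of the pattern `^(.{75}).*$` are worth knowing. Because `.*` can match the empty string, a string of exactly 75 characters still gets the ellipsis appended, and because `.` does not match a newline by default, a multi-line string passes through untouched. The sketch below shows one possible adjustment, under the assumption that both behaviors are unwanted: `.+` requires at least one character beyond the limit, and `re.DOTALL` lets `.` cross newlines. The helper name `truncate_re` is made up for illustration.

```python
import re


def truncate_re(data, limit=75):
    # .{limit} keeps the first `limit` characters; .+ demands at least one
    # more, so strings of exactly `limit` characters are left alone.
    pattern = r'^(.{%d}).+$' % limit
    return re.sub(pattern, r'\g<1>...', data, flags=re.DOTALL)


exact = "a" * 75
multiline = "a" * 40 + "\n" + "b" * 40

# The original pattern lengthens a string that is exactly 75 characters long
print(len(re.sub(r'^(.{75}).*$', r'\g<1>...', exact)))      # 78
# ... and leaves a multi-line string untouched
print(len(re.sub(r'^(.{75}).*$', r'\g<1>...', multiline)))  # 81

print(len(truncate_re(exact)))      # 75
print(len(truncate_re(multiline)))  # 78
```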
This way you can also "cut" the middle part of the string, which is nicer in some cases:
re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', data)
>>> data="11111111112222222222333333333344444444445555555555666666666677777777778888888888"
>>> re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', data)
'11111...88888'
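The middle cut can also be written with plain slicing. One thing to watch with the regex form is that a string of 10 to 12 characters still matches and comes out longer than it went in. The following is a minimal sketch that only cuts when the result is really shorter; the helper name `truncate_middle` and its parameters are made up for illustration, and `keep` is assumed to be at least 1.

```python
import re


def truncate_middle(data, keep=5, placeholder="..."):
    # Cut only when head + placeholder + tail is shorter than the input.
    # Assumes keep >= 1 (data[-0:] would return the whole string).
    if len(data) <= 2 * keep + len(placeholder):
        return data
    return data[:keep] + placeholder + data[-keep:]


data = "11111111112222222222333333333344444444445555555555666666666677777777778888888888"
print(truncate_middle(data))          # 11111...88888

# The regex form lengthens a 10-character string; the slicing helper does not
print(re.sub(r'^(.{5}).*(.{5})$', r'\g<1>...\g<2>', "1234567890"))  # 12345...67890
print(truncate_middle("1234567890"))  # 1234567890
```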
If you are using Python 3.4+, you can use textwrap.shorten from the standard library:
Collapse and truncate the given text to fit in the given width.
First the whitespace in text is collapsed (all whitespace is replaced
by single spaces). If the result fits in the width, it is returned.
Otherwise, enough words are dropped from the end so that the remaining
words plus the placeholder fit within width:
>>> textwrap.shorten("Hello world!", width=12)
'Hello world!'
>>> textwrap.shorten("Hello world!", width=11)
'Hello [...]'
>>> textwrap.shorten("Hello world", width=10, placeholder="...")
'Hello...'
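Applied to the 75-character limit from the question, this could look as follows. It is a sketch under the assumption that collapsing whitespace is acceptable: unlike slicing, `textwrap.shorten` does not preserve runs of spaces or newlines, and the placeholder counts toward `width`. The sample text is made up for illustration.

```python
import textwrap

text = "word " * 30  # 30 short words, far more than 75 characters

short = textwrap.shorten(text, width=75, placeholder="...")
print(short)
print(len(short) <= 75)  # True: the placeholder counts toward width

# Whitespace is collapsed even when nothing has to be dropped
print(textwrap.shorten("Hello   world!\n", width=12))  # Hello world!
```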
If you want more sophisticated string truncation, you can adapt the approach scikit-learn uses in:
sklearn.base.BaseEstimator.__repr__
(See the original full code at: https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/base.py#L262)
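It adds a few niceties, such as counting only non-blank characters and, when the removed part spans several lines, restarting the right-hand side at a line break.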
import re

def truncate_string(data, N_CHAR_MAX=70):
    # N_CHAR_MAX is the (approximate) maximum number of non-blank
    # characters to render. We pass it as an optional parameter to ease
    # the tests.

    # Strings at or under the limit are returned unchanged. Without this
    # guard, re.match() below returns None for short inputs.
    n_nonblank = len("".join(data.split()))
    if n_nonblank <= N_CHAR_MAX:
        return data

    lim = N_CHAR_MAX // 2  # approx. number of chars to keep on both ends
    regex = r"^(\s*\S){%d}" % lim
    # The regex '^(\s*\S){%d}' % n
    # matches from the start of the string until the nth non-blank
    # character:
    # - ^ matches the start of string
    # - (pattern){n} matches n repetitions of pattern
    # - \s*\S matches a non-blank char following zero or more blanks
    left_lim = re.match(regex, data).end()
    right_lim = re.match(regex, data[::-1]).end()
    if "\n" in data[left_lim:-right_lim]:
        # The left side and right side aren't on the same line.
        # To avoid weird cuts, e.g.:
        # categoric...ore',
        # we need to start the right side with an appropriate newline
        # character so that it renders properly as:
        # categoric...
        # handle_unknown='ignore',
        # so we add [^\n]*\n which matches until the next \n
        regex += r"[^\n]*\n"
        right_lim = re.match(regex, data[::-1]).end()
    ellipsis = "..."
    if left_lim + len(ellipsis) < len(data) - right_lim:
        # Only add ellipsis if it results in a shorter repr
        data = data[:left_lim] + ellipsis + data[-right_lim:]
    return data