Best way to strip punctuation from a string

It seems like there should be a simpler way than this:

import string
s = "string. With. Punctuation?"  # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)  # Python 2 only

Is there?

Current answer

From an efficiency perspective, you're not going to beat this (Python 2):

s.translate(None, string.punctuation)

For later versions of Python (Python 3), use the following instead:

s.translate(str.maketrans('', '', string.punctuation))

It performs raw string operations in C with a lookup table. Short of writing your own C code, there isn't much that will beat that.
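To make the lookup table a bit more concrete, here is a minimal Python 3 sketch. It reuses the sample string from the question. The variable name `table` is just for illustration.

```python
import string

s = "string. With. Punctuation?"

# The third argument lists characters to delete.
# The result is a dict keyed by code point; None means "drop this character".
table = str.maketrans("", "", string.punctuation)

print(table[ord(".")])      # None
print(s.translate(table))   # string With Punctuation
```

Building the table once and reusing it across calls avoids rebuilding it for every string.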

If speed isn't a concern, another option is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

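This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or str.translate, as the timings below show. For this type of problem, doing the work at as low a level as possible pays off.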

Timing code (Python 2):

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

2008-11-05 18:36:11

Other answers

You can also do it like this (note that it only strips punctuation from the start and end of each word):

import string
' '.join(word.strip(string.punctuation) for word in 'text'.split())
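A small sketch of what that one-liner does on a real sentence. The function name `strip_word_edges` and the sample text are made up for illustration.

```python
import string

def strip_word_edges(text):
    # Hypothetical helper: split on whitespace, then strip punctuation
    # from both ends of every word. Punctuation inside a word is kept.
    return " ".join(word.strip(string.punctuation) for word in text.split())

print(strip_word_edges("Hello, world! It's e-mail."))  # Hello world It's e-mail
```

Whether keeping the apostrophe and the hyphen is what you want depends on the use case. Runs of whitespace in the input are also collapsed to single spaces by the split/join.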

2021-04-27 11:48:29

Why don't any of you use this?

 ''.join(filter(str.isalnum, s))

Is it too slow?
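One thing to watch with the filter approach is that spaces are not alphanumeric, so they get dropped too. A quick sketch using the sample string from the question follows. The second variant, which keeps whitespace, is my own assumption about what is usually wanted, not something the answer proposes.

```python
s = "string. With. Punctuation?"

# str.isalnum is False for spaces, so the words run together.
alnum_only = "".join(filter(str.isalnum, s))
print(alnum_only)   # stringWithPunctuation

# Variant that keeps whitespace as well as letters and digits.
keep_spaces = "".join(ch for ch in s if ch.isalnum() or ch.isspace())
print(keep_spaces)  # string With Punctuation
```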

2019-07-29 08:18:31

Take Unicode into account. This code was checked in Python 3.

from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))
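To show where this differs from string.punctuation, here is a small comparison sketch. The two helper names and the sample sentence are made up for illustration.

```python
import string
from unicodedata import category

def strip_unicode_punct(text):
    # Drops every character whose Unicode general category starts with "P"
    # (Po, Pi, Pf, Pd, ...), not just ASCII punctuation.
    return "".join(ch for ch in text if not category(ch).startswith("P"))

def strip_ascii_punct(text):
    # Only knows about the 32 ASCII characters in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

sample = "¿Qué tal? «bien»…"
print(strip_unicode_punct(sample))  # Qué tal bien
print(strip_ascii_punct(sample))    # ¿Qué tal «bien»…
```

The two are not strict supersets of each other: characters such as `$`, `+` and `^` are in string.punctuation, but Unicode classifies them as symbols (categories starting with "S"), so the category-based version leaves them in place.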

2020-06-04 05:08:05

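Try this one :) As far as I know it needs the third-party regex module, since the standard re module does not support \p{P}:

import regex
regex.sub(r'\p{P}', '', s)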

2020-09-02 07:51:45

This might not be the best solution, but this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

2011-07-05 04:30:07
