Best way to strip punctuation from a string

It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

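Is there?

Current answer

The question does not give many details, so I took the simplest interpretation of the problem: just remove the punctuation.

Note that the solutions provided do not account for contractions (e.g., you're) or hyphenated words (e.g., anal-retentive), and whether those marks should be treated as punctuation is debatable. They also do not account for non-English character sets or anything like that, because those details were not mentioned in the question. Someone argued that a space is punctuation, which is technically correct, but to me it makes no sense in the context of this question.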

import string

# using filter with a lambda
''.join(filter(lambda c: c not in string.punctuation, s))

# using a generator expression
''.join('' if c in string.punctuation else c for c in s)
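
To make the caveats above concrete, here is a minimal sketch of what this kind of character filter does to a contraction and a hyphenated word. The helper name strip_punctuation and the sample sentence are made up for illustration and are not part of the answer.

```python
import string

def strip_punctuation(text):
    # Hypothetical helper: drop every character listed in string.punctuation.
    return ''.join(c for c in text if c not in string.punctuation)

sample = "You're anal-retentive, aren't you?"
result = strip_punctuation(sample)
print(result)
```

The apostrophes and the hyphen are removed along with the comma and question mark, so the affected words are fused together rather than kept intact.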

2021-08-13 14:38:15

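Other answers

As an update, I rewrote @Brian's example in Python 3 and changed it to move the regex compile step inside the function. My thinking here was to time every single step needed to make the function work. Perhaps you are using distributed computing and cannot share a regex object between your workers, so the re.compile step has to run on each worker. I was also curious to time two different implementations of maketrans for Python 3: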

table = str.maketrans({key: None for key in string.punctuation})

table = str.maketrans('', '', string.punctuation)
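
As a quick sanity check, the two forms should build the same translation table, so any timing difference would come from how the table is constructed and not from what it does. A minimal sketch (the variable names are illustrative):

```python
import string

# Form 1: a dict mapping each punctuation character to None (None means "delete").
table_from_dict = str.maketrans({key: None for key in string.punctuation})

# Form 2: the three-argument form; the third argument lists characters to delete.
table_from_args = str.maketrans('', '', string.punctuation)

same_table = table_from_dict == table_from_args
cleaned = "string. With. Punctuation?".translate(table_from_args)
print(same_table, cleaned)
```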

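Plus, I added another method that uses sets, where I take advantage of the intersection function to reduce the number of iterations.

This is the complete code: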

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return s.translate(table)


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

These are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565

2018-05-07 13:42:17

You can also do it like this:

import string
' '.join(word.strip(string.punctuation) for word in 'text'.split())
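
Note that str.strip only trims characters from the two ends of each word, so punctuation inside a word survives. A minimal sketch with a made-up sentence:

```python
import string

text = "Hello, world! It's a well-known (fact)."
# strip() removes leading and trailing punctuation from each whitespace-separated word.
edges_stripped = ' '.join(word.strip(string.punctuation) for word in text.split())
print(edges_stripped)
```

This may actually be the desired behavior when contractions and hyphenated words should be preserved.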

2021-04-27 11:48:29

I was looking for a really simple solution. Here is what I came up with:

import re

s = "string. With. Punctuation?"
s = re.sub(r'[\W\s]', ' ', s)

print(repr(s))
# 'string  With  Punctuation '

2021-03-26 14:09:10

>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]', '', s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']

2016-08-24 05:43:58

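When working with Unicode strings, I recommend the PyPI regex module, because it supports both Unicode property classes (such as \p{X} / \P{X}) and POSIX character classes (such as [:name:]).

Just type pip install regex (or pip3 install regex) in a terminal and press Enter to install the package.

If you need to remove punctuation and symbols of any kind (that is, anything other than letters, digits and whitespace), you can use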

regex.sub(r'[\p{P}\p{S}]', '', text)  # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text)  # Same with a POSIX character class

A Python demo:

import regex

text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()

print(new_text)
# => भारत india 002

Here, I added a whitespace pattern to the character class.
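
If installing a third-party package is not an option, a rough approximation is possible with the standard library's unicodedata module. This is a different technique from the regex module used above, not this answer's method: it filters on each character's Unicode general category, on the assumption that the categories starting with P (punctuation) and S (symbol) correspond to \p{P} and \p{S}. The helper name strip_unicode_punct is made up for illustration.

```python
import unicodedata

def strip_unicode_punct(text):
    # Hypothetical helper: keep characters whose Unicode general category
    # does not start with 'P' (punctuation) or 'S' (symbol).
    return ''.join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ('P', 'S')
    )

text = 'भारत India <><>^$.,,! 002'
kept = strip_unicode_punct(text)
# Collapse the leftover runs of whitespace and lowercase, as in the demo above.
new_text = ' '.join(kept.split()).lower()
print(new_text)
```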

2021-12-01 14:37:52
