我有一个这样的字符串:
this is "a test"
我试图在Python中写一些东西,通过空格分割它,同时忽略引号中的空格。我想要的结果是:
['this', 'is', 'a test']
PS,我知道你会问“如果引号中有引号会发生什么,在我的应用程序中,这永远不会发生。
我有一个这样的字符串:
this is "a test"
我试图在Python中写一些东西,通过空格分割它,同时忽略引号中的空格。我想要的结果是:
['this', 'is', 'a test']
PS,我知道你会问“如果引号中有引号会发生什么,在我的应用程序中,这永远不会发生。
当前回答
不同答案的速度测试:
import re
import shlex
import csv
line = 'this is "a test"'
%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop
%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop
%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop
%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
其他回答
我看到这里的正则表达式方法看起来很复杂和/或错误。这让我很惊讶,因为正则表达式语法可以很容易地描述“空格或引号包围的东西”,而且大多数正则表达式引擎(包括Python的)都可以在正则表达式上进行拆分。所以如果你要使用正则表达式,为什么不直接说出你的意思呢?:
test = 'this is "a test"' # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
解释:
[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators
不过,Shlex可能提供更多的特性。
要解决某些Python 2版本中的unicode问题,我建议:
from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]
我建议:
测试字符串:
s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''
也要捕捉""和":
import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)
结果:
['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]
忽略空的""和":
import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)
结果:
['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']
不同答案的速度测试:
import re
import shlex
import csv
line = 'this is "a test"'
%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop
%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop
%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop
%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
你需要从内置的shlex模块中分离。
>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']
这应该是你想要的效果。
如果你想保留引号,那么你可以传递posix=False kwarg。
>>> shlex.split('this is "a test"', posix=False)
['this', 'is', '"a test"']