Is it worth using Python's re.compile?

Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

re.match('hello', 'hello world')

Current answer

Take the following example:

h = re.compile('hello')
h.match('hello world')

The match method used in the example above is not the same as the one used below:

re.match('hello', 'hello world')

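re.compile() returns a regular expression object, which means h is a regex object.

The regex object has its own match method, with the optional pos and endpos parameters:

regex.match(string[, pos[, endpos]])

pos

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

endpos

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

The regex object's search, findall, and finditer methods also support these parameters.

re.match(pattern, string, flags=0) does not support them, as you can see, and neither do the module-level search, findall, and finditer functions.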
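To make the difference between pos and slicing concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; only the documented pos and endpos behavior is relied on.

```python
import re

text = "say hello"
anchored = re.compile(r"^hello")
plain = re.compile(r"hello")

# pos=4 starts scanning at index 4, but '^' still refers to the real
# start of the string, so nothing matches.
with_pos = anchored.search(text, 4)

# Slicing builds a new string that begins at the old index 4,
# so '^' matches there.
with_slice = anchored.search(text[4:])

# Without the anchor, match() with pos=4 succeeds at index 4.
at_pos = plain.match(text, 4)

# endpos=7 behaves as if the string were only "say hel".
with_endpos = plain.search(text, 0, 7)

print(with_pos, with_slice, at_pos, with_endpos)
```

The module-level re.match and re.search take no such arguments, so this form is only available on a compiled object.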

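A match object has attributes that complement these parameters:

match.pos

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

match.endpos

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.

A regex object has two unique, possibly useful, attributes:

regex.groups

The number of capturing groups in the pattern.

regex.groupindex

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

Finally, a match object has this attribute:

match.re

The regular expression object whose match() or search() method produced this match instance.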
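As a rough illustration of these attributes, the following sketch uses a made-up pattern and input string; the group name "word" is an assumption chosen only for the example.

```python
import re

rx = re.compile(r"(?P<word>\w+)\s+(\d+)")
sample = "skip: total 42"

# Start searching at index 6, i.e. after "skip: ".
m = rx.search(sample, 6)

group_count = rx.groups        # number of capturing groups in the pattern
names = dict(rx.groupindex)    # symbolic group name -> group number

print(m.pos, m.endpos)         # the pos/endpos values the search ran with
print(group_count, names)
print(m.re is rx)              # the match remembers which regex produced it
```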

2013-03-10 23:03:59

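Other answers

My understanding is that those two examples are effectively equivalent. The only difference is that in the first case you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Here's a reference: http://diveintopython3.ep.io/refactoring.html

Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)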

2009-01-16 21:38:12

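"I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference"

The votes on the accepted answer lead to the assumption that what @Triptych says is true in all cases. This is not necessarily so. One big difference appears when you have to decide whether a function should accept a regex string or a compiled regex object as a parameter: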

>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: x.match(y)       # accepts compiled regex as parameter
... h=re.compile('hello')
... """, stmt="f(h, 'hello world')")
0.32881879806518555
>>> timeit.timeit(setup="""
... import re
... f=lambda x, y: re.compile(x).match(y)   # compiles when called
... """, stmt="f('hello', 'hello world')")
0.809190034866333

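It is always better to compile your regexes if you will need to reuse them.

Note that the timeit example above simulates creating the compiled regex object once, at import time, versus creating it "on the fly" each time a match is needed.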
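If a function has to accept both forms, one possible approach is sketched below. The helper name first_match is hypothetical; the sketch relies on re.compile returning an already compiled pattern unchanged when no flags are passed.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: accepts a regex string or a compiled regex."""
    # re.compile hands back an already compiled pattern as-is (as long as
    # no flags are given), so both kinds of argument end up as a regex object.
    rx = re.compile(pattern)
    return rx.match(text)

h = re.compile("hello")
from_object = first_match(h, "hello world")        # precompiled by the caller
from_string = first_match("hello", "hello world")  # compiled (or cached) per call

print(from_object.group(), from_string.group())
```

Callers who care about the per-call cost can pass a precompiled object, while casual callers can still pass a plain string.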

2017-01-04 13:20:09

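I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.

EDIT: After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all: only the time it takes to check the cache (a key lookup on an internal dict type).

From module re.py (comments are mine):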

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def _compile(*key):

    # Does cache check at top of function
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None: return p

    # ...
    # Does actual compilation on cache miss
    # ...

    # Caches compiled regex
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

I still often pre-compile regular expressions, but only to bind them to a nice, reusable name, not for any expected performance gain.
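A small sketch of that naming habit follows. The constant GREETING and both function names are made up for illustration; the two functions are meant to return the same results and differ only in where the pattern lives.

```python
import re

# Hypothetical module-level pattern, compiled once and given a readable name.
GREETING = re.compile("hello")

def starts_with_greeting(text):
    # Uses the named, precompiled pattern.
    return GREETING.match(text) is not None

def starts_with_greeting_inline(text):
    # Same result; the re module compiles the string on first use
    # and looks it up in its internal cache afterwards.
    return re.match("hello", text) is not None

print(starts_with_greeting("hello world"), starts_with_greeting_inline("hello world"))
```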

2009-01-16 21:42:57

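I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.

However, examination of the code shows that the cache is limited to 100 expressions. This raises the question: how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression, such as r'\w+\s+([0-9_]+)\s+\w*'.

Here's my test: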

#!/usr/bin/env python
import re
import time

def timed(func):
    def wrapper(*args):
        t = time.time()
        result = func(*args)
        t = time.time() - t
        print '%s took %.3f seconds.' % (func.func_name, t)
        return result
    return wrapper

regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
testString = "average    2 never"

@timed
def noncompiled():
    a = 0
    for x in xrange(1000000):
        m = re.match(regularExpression, testString)
        a += int(m.group(1))
    return a

@timed
def compiled():
    a = 0
    rgx = re.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiled():
    a = 0
    rgx = re.sre_compile.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a


@timed
def compiledInLoop():
    a = 0
    for x in xrange(1000000):
        rgx = re.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiledInLoop():
    a = 0
    for x in xrange(10000):
        rgx = re.sre_compile.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

r1 = noncompiled()
r2 = compiled()
r3 = reallyCompiled()
r4 = compiledInLoop()
r5 = reallyCompiledInLoop()
print "r1 = ", r1
print "r2 = ", r2
print "r3 = ", r3
print "r4 = ", r4
print "r5 = ", r5

And here is the output on my machine:

$ regexTest.py 
noncompiled took 4.555 seconds.
compiled took 2.323 seconds.
reallyCompiled took 2.325 seconds.
compiledInLoop took 4.620 seconds.
reallyCompiledInLoop took 4.074 seconds.
r1 =  2000000
r2 =  2000000
r3 =  2000000
r4 =  2000000
r5 =  20000

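The 'reallyCompiled' methods use the internal interface, which bypasses the cache. Note that the one that compiles on each loop iteration is only iterated 10,000 times, not one million.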
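The test above is written for Python 2. For readers on Python 3, a reduced sketch of the first two cases might look like the following. It deliberately leaves out the internal re.sre_compile interface, and the iteration count is an arbitrary assumption; timings will vary by machine and Python version, so none are shown.

```python
import re
import timeit

PATTERN = r'\w+\s+([0-9_]+)\s+\w*'
TEST_STRING = "average    2 never"
RGX = re.compile(PATTERN)

def noncompiled(n):
    # Goes through re.match, i.e. a cache lookup on every call.
    total = 0
    for _ in range(n):
        total += int(re.match(PATTERN, TEST_STRING).group(1))
    return total

def compiled(n):
    # Calls the method of the precompiled object directly.
    total = 0
    for _ in range(n):
        total += int(RGX.match(TEST_STRING).group(1))
    return total

if __name__ == "__main__":
    n = 100000  # assumed iteration count, smaller than the original test
    for func in (noncompiled, compiled):
        seconds = timeit.timeit(lambda: func(n), number=1)
        print('%s took %.3f seconds.' % (func.__name__, seconds))
```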

2010-04-14 04:40:24

Using re.compile() has one additional perk: you can add comments to your regex patterns with re.VERBOSE.

pattern = '''
hello[ ]world    # Some info on my pattern logic. [ ] to recognize space
'''

re.search(pattern, 'hello world', re.VERBOSE)

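Although this does not affect the speed of your code, I like to do it this way because it is part of my commenting habit. I thoroughly dislike spending time trying to remember the logic behind my code when I want to modify it.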
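The same commenting style also works when the flag is passed to re.compile, so that the annotated pattern is defined once under a name. This is a minimal sketch; the name GREETING and the sample input are assumptions for illustration.

```python
import re

# In verbose mode, whitespace in the pattern is ignored (except inside a
# character class) and '#' starts a comment that runs to the end of the line.
GREETING = re.compile(r'''
    hello    # the greeting itself
    [ ]      # a literal space, written as a class so it is not ignored
    world    # the thing being greeted
''', re.VERBOSE)

found = GREETING.search('say hello world')
print(found.group())
```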

2015-03-20 03:39:09
