Is it worth using Python's re.compile?

Is there any benefit to using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

re.match('hello', 'hello world')

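Current answer

As an alternative answer, since I don't see it mentioned before, I'll go ahead and quote the Python 3 docs:

Should you use these module-level functions, or should you get the pattern and call its methods yourself? If you're accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there's not much difference thanks to the internal cache.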
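The loop case described in the quote can be sketched as follows. This is only an illustration: the `lines` list and the `hello` variable name are made up for the example and do not come from the documentation.

```python
import re

# Hypothetical sample data, used only for illustration.
lines = ["hello world", "goodbye world", "hello again"]

# Compile once, outside the loop, and reuse the pattern object.
hello = re.compile(r"hello")
precompiled_hits = [line for line in lines if hello.match(line)]

# Module-level call: same result, but every call first looks the
# pattern up in the re module's internal cache.
module_level_hits = [line for line in lines if re.match(r"hello", line)]

print(precompiled_hits == module_level_hits)
```

Both forms return the same matches; the difference is only the per-call lookup that the module-level function performs.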

2019-06-02 01:05:00

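Other answers

I agree with Honest Abe that the match(...) calls in the given examples are different. They are not one-to-one comparisons, so the outcomes differ. To simplify my reply, I use A, B, C, D for the functions in question. Oh yes, we are dealing with 4 functions in re.py instead of 3.

Running this piece of code: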

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)

is the same as running this code:

re.match('hello', 'hello world')          # (C)

Because, looking into the source re.py, (A + B) means:

h = re._compile('hello')                  # (D)
h.match('hello world')

and (C) is actually:

re._compile('hello').match('hello world')

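So (C) is not the same as (B). In fact, (C) calls (B) after calling (D), which is also called by (A). In other words, (C) = (A) + (B). Therefore, running (A + B) inside a loop gives the same result as running (C) inside a loop.

George's regexTest.py proved this for us.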
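A small check of the (C) = (A) + (B) relationship, written as a sketch. It assumes CPython, where repeated re.compile calls with the same pattern and flags are served from the module's internal cache; the object identity shown here is an implementation detail, not a documented guarantee.

```python
import re

re.purge()  # start from an empty internal cache

first = re.compile('hello')    # (A): cache miss, the pattern is compiled
second = re.compile('hello')   # (A) again: served from the cache

# In CPython both names refer to the same cached pattern object.
same_object = first is second

# (C) and (A) + (B) produce equivalent match results.
via_module = re.match('hello', 'hello world')   # (C)
via_object = first.match('hello world')         # (B)
same_span = via_module.span() == via_object.span()

print(same_object, same_span)
```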

noncompiled took 4.555 seconds.           # (C) in a loop
compiledInLoop took 4.620 seconds.        # (A + B) in a loop
compiled took 2.323 seconds.              # (A) once + (B) in a loop

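What everyone is interested in is how to get the 2.323-second result. To make sure compile(...) is called only once, we need to keep the compiled regex object in memory. If we are using a class, we can store the object on the class and reuse it every time our function is called.

import re

class Foo:
    regex = re.compile('hello')  # compiled once, when the class is defined

    def my_function(self, text):
        # Reuse the class-level pattern on every call.
        return self.regex.match(text)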

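If we are not using a class (which is my requirement today), then I have no comment. I'm still learning how to use global variables in Python, and I know global variables are considered a bad thing.

One more point: I believe the (A) + (B) approach has the upper hand. Here are some facts as I observed them (please correct me if I'm wrong):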
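For the case without a class, one common option is a module-level constant. The sketch below is an assumption about how that could look, not something from George's test; `HELLO_RE` is a hypothetical name.

```python
import re

# Compiled once, when the module is imported. Reading a module-level
# name inside a function needs no `global` statement; `global` is only
# required when a function rebinds the name.
HELLO_RE = re.compile('hello')  # hypothetical name


def my_function(text):
    """Return a match object if text starts with 'hello', else None."""
    return HELLO_RE.match(text)


print(my_function('hello world') is not None)
```

Because the name is never reassigned, it behaves like a constant, which avoids most of the usual objections to global variables.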

Calling (A) once does one lookup in _cache followed by one sre_compile.compile() to create a regex object.

Calling (A) twice does two lookups and one compile, because the regex object is cached.

If _cache is flushed in between, the regex object is released from memory and Python has to compile it again. (Someone suggests that Python won't recompile.)

If we keep the regex object returned by (A), it still goes into _cache and may still be flushed at some point. But our code holds a reference to it, so the object is not released from memory. Thus, Python does not need to compile it again.

The 2-second difference in George's test between compiledInLoop and compiled is mainly the time required to build the key and search _cache. It is not the compile time of the regex.

George's reallycompile test shows what happens if the compile really is redone every time: it is 100x slower (he reduced the loop from 1,000,000 to 10,000).
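The point about holding a reference can be illustrated with re.purge(), which empties the module's cache. This is a sketch of the behavior described above, using only the public re API; the variable names are made up for the example.

```python
import re

kept = re.compile('hello')   # our own reference to the compiled pattern
re.purge()                   # empty the module's internal cache

# The reference we hold is unaffected by the purge.
still_works = kept.match('hello world') is not None

# Compiling again after the purge has to build a new object, because
# the cached entry is gone while `kept` is still alive.
rebuilt = re.compile('hello')
is_new_object = rebuilt is not kept

print(still_works, is_new_object)
```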

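Here are the cases where (A + B) is better than (C):

If we can cache a reference to the regex object inside a class. If we need to call (B) repeatedly (inside a loop or multiple times), we must cache the reference to the regex object outside the loop.

Cases where (C) is good enough:

We cannot cache a reference. We only use it once in a while. Overall, we don't have too many regexes (assuming the compiled ones never get flushed).

Just to recap, here are A, B, and C: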

h = re.compile('hello')                   # (A)
h.match('hello world')                    # (B)
re.match('hello', 'hello world')          # (C)

Thanks for reading.

2014-07-29 16:55:09

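I really respect all the answers above. In my opinion, yes! It is definitely worth using re.compile instead of compiling the regex again and again.

Using re.compile lets you reuse an already compiled regex instead of compiling it again and again. This benefits you in several ways:

It reduces processor effort and running time, it makes the regex reusable (the same object works with findall, search, and match), and it makes your program look cool.

Example: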

  example_string = "The room number of her room is 26A7B."
  find_alpha_numeric_string = re.compile(r"\b\w+\b")

Using it with findall

 find_alpha_numeric_string.findall(example_string)

Using it with search

  find_alpha_numeric_string.search(example_string)

Similarly, you can use it for match and substitute.
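For completeness, here is how the same compiled pattern might be used with match and sub. This is a minimal sketch that reuses the example string from above; the replacement text "X" is arbitrary.

```python
import re

example_string = "The room number of her room is 26A7B."
find_alpha_numeric_string = re.compile(r"\b\w+\b")

# match() only tries to match at the start of the string.
first_word = find_alpha_numeric_string.match(example_string).group()

# sub() replaces every match; here each word becomes "X".
masked = find_alpha_numeric_string.sub("X", example_string)

print(first_word)
print(masked)
```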

2017-03-14 12:31:47

FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

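So, if you're going to be using the same regex a lot, it may be worth doing re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity or straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings: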

% python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 0.661 usec per loop

% python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 0.285 usec per loop

% python -m timeit -s "import re" "h=re.compile('hello'); h.match('hello world')"
1000000 loops, best of 3: 0.65 usec per loop

% python --version
Python 3.6.5 :: Anaconda, Inc.

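I also added a case (notice the quotation mark differences between the last two runs) showing that re.match(x, ...) is [roughly] equivalent to re.compile(x).match(...): both pay the same per-call overhead of looking the pattern up in the module's internal cache, whereas a compiled object stored ahead of time skips that lookup.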

2009-01-16 21:42:37

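Although the two approaches are comparable in terms of speed, you should know that there is still a small per-call time difference that can matter if you are dealing with millions of iterations.

The following speed test: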

import re
import time

SIZE = 100_000_000

start = time.time()
foo = re.compile('foo')
[foo.search('bar') for _ in range(SIZE)]
print('compiled:  ', time.time() - start)

start = time.time()
[re.search('foo', 'bar') for _ in range(SIZE)]
print('uncompiled:', time.time() - start)

gives these results:

compiled:   14.647532224655151
uncompiled: 61.483458042144775

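On my PC (with Python 3.7.0), the compiled approach is consistently about 4 times faster.

As explained in the documentation:

If you're accessing a regex within a loop, pre-compiling it will save a few function calls. Outside of loops, there's not much difference thanks to the internal cache.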
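The same comparison can also be written with the standard timeit module, which avoids building a large result list. This is a sketch rather than a rigorous benchmark: `N` is a hypothetical, much smaller iteration count than SIZE above, and the lambda wrappers add the same small call overhead to both measurements.

```python
import re
import timeit

foo = re.compile('foo')
N = 100_000  # hypothetical iteration count, chosen to keep the run short

# Each statement is executed N times; timeit returns total seconds.
compiled_time = timeit.timeit(lambda: foo.search('bar'), number=N)
uncompiled_time = timeit.timeit(lambda: re.search('foo', 'bar'), number=N)

print(f'compiled:   {compiled_time:.4f} s')
print(f'uncompiled: {uncompiled_time:.4f} s')
```

Absolute numbers depend on the machine and the Python version, so only the relative difference is meaningful.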

2021-07-16 09:30:40

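Here is an example where using re.compile is over 50 times faster, as requested.

The point is the same as the one I made in the comment above: using re.compile can be a significant advantage when your usage does not benefit much from the compilation cache. This happens in at least one particular case (which I ran into in practice), namely when all of the following are true:

You have a lot of regex patterns (more than re._MAXCACHE, whose default is currently 512), and you use these regexes many times, and consecutive uses of the same pattern are separated by more than re._MAXCACHE other regexes, so that each one gets flushed from the cache between consecutive uses.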

import re
import time

def setup(N=1000):
    # Patterns 'a.*a', 'a.*b', ..., 'z.*z'
    patterns = [chr(i) + '.*' + chr(j)
                    for i in range(ord('a'), ord('z') + 1)
                    for j in range(ord('a'), ord('z') + 1)]
    # If this assertion below fails, just add more (distinct) patterns.
    # assert(re._MAXCACHE < len(patterns))
    # N strings. Increase N for larger effect.
    strings = ['abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'] * N
    return (patterns, strings)

def without_compile():
    print('Without re.compile:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(re.search(pat, s))
    return count

def without_compile_cache_friendly():
    print('Without re.compile, cache-friendly order:')
    patterns, strings = setup()
    print('searching')
    count = 0
    for pat in patterns:
        for s in strings:
            count += bool(re.search(pat, s))
    return count

def with_compile():
    print('With re.compile:')
    patterns, strings = setup()
    print('compiling')
    compiled = [re.compile(pattern) for pattern in patterns]
    print('searching')
    count = 0
    for s in strings:
        for regex in compiled:
            count += bool(regex.search(s))
    return count

start = time.time()
print(with_compile())
d1 = time.time() - start
print(f'-- That took {d1:.2f} seconds.\n')

start = time.time()
print(without_compile_cache_friendly())
d2 = time.time() - start
print(f'-- That took {d2:.2f} seconds.\n')

start = time.time()
print(without_compile())
d3 = time.time() - start
print(f'-- That took {d3:.2f} seconds.\n')

print(f'Ratio: {d3/d1:.2f}')

我在笔记本电脑上获得的示例输出(Python 3.7.7):

With re.compile:
compiling
searching
676000
-- That took 0.33 seconds.

Without re.compile, cache-friendly order:
searching
676000
-- That took 0.67 seconds.

Without re.compile:
searching
676000
-- That took 23.54 seconds.

Ratio: 70.89

I didn't bother with timeit as the difference is so stark, but I get qualitatively similar numbers each time. Note that even without re.compile, using the same regex multiple times and moving on to the next one wasn't so bad (only about 2 times as slow as with re.compile), but in the other order (looping through many regexes), it is significantly worse, as expected. Also, increasing the cache size works too: simply setting re._MAXCACHE = len(patterns) in setup() above (of course I don't recommend doing such things in production as names with underscores are conventionally “private”) drops the ~23 seconds back down to ~0.7 seconds, which also matches our understanding.
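If pre-building a list of compiled patterns is inconvenient, one alternative to touching the private re._MAXCACHE is to keep your own cache. The sketch below is an assumption about how that could be done with functools.lru_cache; the helper names `compiled` and `count_hits` are hypothetical and not part of the re module.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded: every distinct pattern is kept
def compiled(pattern, flags=0):
    """Hypothetical helper: compile a pattern once and remember it."""
    return re.compile(pattern, flags)


def count_hits(patterns, strings):
    """Count (string, pattern) pairs where the pattern is found."""
    count = 0
    for s in strings:
        for pat in patterns:
            count += bool(compiled(pat).search(s))
    return count


# Small usage example with made-up inputs.
sample_patterns = ['a.*b', 'z.*a', 'q.*q']
sample_strings = ['abcdefghijklmnopqrstuvwxyz'] * 2
print(count_hits(sample_patterns, sample_strings))
```

An unbounded cache trades memory for speed, so it only makes sense when the set of distinct patterns is known to be limited.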

2020-05-04 23:30:30
