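Regular expression to match balanced parentheses

I need a regular expression to select all the text between two outer parentheses.

Example: START_TEXT(text here(possible text)text(possible text(more text)))END_TXT

Result: (text here(possible text)text(possible text(more text)))

Current answer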

"""
Here is a simple python program showing how to use regular
expressions to write a paren-matching recursive parser.

This parser recognises items enclosed by parens, brackets,
braces and <> symbols, but is adaptable to any set of
open/close patterns.  This is where the re package greatly
assists in parsing. 
"""

import re


# The pattern below recognises a sequence consisting of:
#    1. Any characters not in the set of open/close strings.
#    2. One of the open/close strings.
#    3. The remainder of the string.
# 
# There is no reason the opening pattern can't be the
# same as the closing pattern, so quoted strings can
# be included.  However quotes are not ignored inside
# quotes.  More logic is needed for that....


pat = re.compile(r"""
    ( .*? )
    ( \( | \) | \[ | \] | \{ | \} | \< | \> |
                           \' | \" | BEGIN | END | $ )
    ( .* )
    """, re.X)

# The keys to the dictionary below are the opening strings,
# and the values are the corresponding closing strings.
# For example "(" is an opening string and ")" is its
# closing string.

matching = { "(" : ")",
             "[" : "]",
             "{" : "}",
             "<" : ">",
             '"' : '"',
             "'" : "'",
             "BEGIN" : "END" }

# The procedure below matches string s and returns a
# recursive list matching the nesting of the open/close
# patterns in s.

def matchnested(s, term=""):
    lst = []
    while True:
        m = pat.match(s)

        if m.group(1) != "":
            lst.append(m.group(1))

        if m.group(2) == term:
            return lst, m.group(3)

        if m.group(2) in matching:
            item, s = matchnested(m.group(3), matching[m.group(2)])
            lst.append(m.group(2))
            lst.append(item)
            lst.append(matching[m.group(2)])
        else:
            raise ValueError("After <<%s %s>> expected %s not %s" %
                             (lst, s, term, m.group(2)))

# Unit test.

if __name__ == "__main__":
    for s in ("simple string",
              """ "double quote" """,
              """ 'single quote' """,
              "one'two'three'four'five'six'seven",
              "one(two(three(four)five)six)seven",
              "one(two(three)four)five(six(seven)eight)nine",
              "one(two)three[four]five{six}seven<eight>nine",
              "one(two[three{four<five>six}seven]eight)nine",
              "oneBEGINtwo(threeBEGINfourENDfive)sixENDseven",
              "ERROR testing ((( mismatched ))] parens"):
        print("\ninput", s)
        try:
            lst, s = matchnested(s)
            print("output", lst)
        except ValueError as e:
            print(str(e))
    print("done")

2016-09-01 05:40:18

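Other answers

Regular expressions are the wrong tool for the job because you are dealing with nested structures, i.e. recursion.

But there is a simple algorithm to do this, which I described in more detail in my answer to a previous question. The gist is to write code that scans the string while keeping a counter of the opening parentheses that have not yet been matched by a closing parenthesis. When that counter returns to zero, you know you have reached the final closing parenthesis.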
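The counter-based scan described above can be sketched in Python as follows. This is a minimal illustration, not the code from the linked answer; the function name first_balanced and its parameters are hypothetical.

```python
def first_balanced(text, open_ch="(", close_ch=")"):
    """Return the first outermost balanced span (delimiters included), or None."""
    depth = 0    # opening delimiters not yet matched by a closing one
    start = -1   # index of the outermost opening delimiter
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                start = i
            depth += 1
        elif ch == close_ch and depth > 0:
            depth -= 1
            if depth == 0:
                # counter is back to zero: this is the final closing delimiter
                return text[start:i + 1]
    return None  # no opening delimiter, or the outermost one was never closed


sample = "some text(text here(possible text)text(possible text(more text)))end text"
print(first_balanced(sample))
# → (text here(possible text)text(possible text(more text)))
```

A closing delimiter that appears before any opening one is ignored in this sketch; stricter handling (for example raising an error) would be an equally reasonable choice.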

2009-02-13 15:55:10

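Actually, it is possible to do this with .NET regular expressions, but it is not trivial, so read carefully.

You can read a nice article here. You may also need to read up on .NET regular expressions. You can start reading here.

Angle brackets <> were used because they do not require escaping.

The regular expression looks like this: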

<
[^<>]*
(
    (
        (?<Open><)
        [^<>]*
    )+
    (
        (?<Close-Open>>)
        [^<>]*
    )+
)*
(?(Open)(?!))
>

2011-09-23 18:22:38

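Because JS regex does not support recursive matching, I could not get balanced parentheses matching to work.

So here is a simple JavaScript loop version that turns a "method(arg)" string into an array: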

push(number) map(test(a(a()))) bass(wow, abc)
$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)

const parser = str => {
  let ops = []
  let method, arg
  let isMethod = true
  let open = []

  for (const char of str) {
    // skip whitespace
    if (char === ' ') continue

    // append method or arg string
    if (char !== '(' && char !== ')') {
      if (isMethod) {
        (method ? (method += char) : (method = char))
      } else {
        (arg ? (arg += char) : (arg = char))
      }
    }

    if (char === '(') {
      // nested parenthesis should be a part of arg
      // (arg may still be undefined here, e.g. for "map((a))")
      if (!isMethod) arg = (arg || '') + char
      isMethod = false
      open.push(char)
    } else if (char === ')') {
      open.pop()
      // check end of arg
      if (open.length < 1) {
        isMethod = true
        ops.push({ method, arg })
        method = arg = undefined
      } else {
        arg += char
      }
    }
  }

  return ops
}

// const test = parser(`$$(groups) filter({ type: 'ORGANIZATION', isDisabled: { $ne: true } }) pickBy(_id, type) map(test()) as(groups)`)
const test = parser(`push(number) map(test(a(a()))) bass(wow, abc)`)

console.log(test)

The result looks like this:

[ { method: 'push', arg: 'number' },
  { method: 'map', arg: 'test(a(a()))' },
  { method: 'bass', arg: 'wow,abc' } ]

[ { method: '$$', arg: 'groups' },
  { method: 'filter',
    arg: '{type:\'ORGANIZATION\',isDisabled:{$ne:true}}' },
  { method: 'pickBy', arg: '_id,type' },
  { method: 'map', arg: 'test()' },
  { method: 'as', arg: 'groups' } ]

2019-10-20 11:29:54

In addition to bobble bubble's answer, there are other regex flavors that support recursive constructs.

Lua

Use %b() (%b{} / %b[] for curly braces / square brackets):

for s in string.gmatch("Extract (a(b)c) and ((d)f(g))", "%b()") do print(s) end (see demo)

Raku (formerly Perl 6):

Non-overlapping multiple balanced parentheses matches:

my regex paren_any { '(' ~ ')' [ <-[()]>+ || <&paren_any> ]* }
say "Extract (a(b)c) and ((d)f(g))" ~~ m:g/<&paren_any>/;
# => (｢(a(b)c)｣ ｢((d)f(g))｣)

Overlapping multiple balanced parentheses matches:

say "Extract (a(b)c) and ((d)f(g))" ~~ m:ov:g/<&paren_any>/;
# => (｢(a(b)c)｣ ｢(b)｣ ｢((d)f(g))｣ ｢(d)｣ ｢(g)｣)

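See demo.

Python non-regex solution

See poke's answer to "How to get an expression between balanced parentheses".

Java customizable non-regex solution

Here is a customizable solution that allows single-character literal delimiters in Java: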

import java.util.ArrayList;
import java.util.List;

public static List<String> getBalancedSubstrings(String s, Character markStart,
                                                 Character markEnd, Boolean includeMarkers) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;               // current nesting depth
    int lastOpenDelimiter = -1;  // start index of the current outermost group
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenDelimiter = (includeMarkers ? i : i + 1);
            }
        }
        else if (c == markEnd) {
            if (level == 1) {
                // closing the outermost group: record it
                subTreeList.add(s.substring(lastOpenDelimiter, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}

Sample usage:

String s = "some text(text here(possible text)text(possible text(more text)))end text";
List<String> balanced = getBalancedSubstrings(s, '(', ')', true);
System.out.println("Balanced substrings:\n" + balanced);
// => [(text here(possible text)text(possible text(more text)))]

2016-05-13 10:40:20

I want to add this answer for quick reference. Feel free to update.

.NET Regex using balancing groups:

\((?>\((?<c>)|[^()]+|\)(?<-c>))*(?(c)(?!))\)

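where c is used as the depth counter.

Demo at Regexstorm.com

Stack Overflow: Using RegEx to balance match parenthesis
Wes' Puzzling Blog: Matching Balanced Constructs with .NET Regular Expressions
Greg Reinacker's Weblog: Nested Constructs in Regular Expressions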

PCRE using a recursive pattern:

\((?:[^)(]+|(?R))*+\)

Demo at regex101; or without alternation:

\((?:[^)(]*(?R)?)*+\)

Demo at regex101; or unrolled for performance:

\([^)(]*+(?:(?R)[^)(]*)*+\)

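Demo at regex101; the pattern is pasted at (?R), which represents (?0).

Perl, PHP, Notepad++, R: perl=TRUE, Python: PyPI regex module with (?V1) for Perl behaviour. (Newer versions of the PyPI regex package already default to this → DEFAULT_VERSION = VERSION1)

Ruby using subexpression calls:

With Ruby 2.0, \g<0> can be used to call the full pattern.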

\((?>[^)(]+|\g<0>)*\)

Demo at Rubular; Ruby 1.9 only supports capturing group recursion:

(\((?>[^)(]+|\g<1>)*\))

Demo at Rubular (atomic grouping since Ruby 1.9.3)

JavaScript API (XRegExp.matchRecursive):

XRegExp.matchRecursive(str, '\\(', '\\)', 'g');

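Java: an interesting idea using forward references by @jaytea.

Without recursion, up to 3 levels of nesting (JS, Java and other regex flavors):

To prevent runaway backtracking on unbalanced input, the * quantifier is applied to the character class [^)(] only at the innermost level.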

\((?:[^)(]|\((?:[^)(]|\((?:[^)(]|\([^)(]*\))*\))*\))*\)

Demo at regex101; or unrolled for better performance (preferred).

\([^)(]*(?:\([^)(]*(?:\([^)(]*(?:\([^)(]*\)[^)(]*)*\)[^)(]*)*\)[^)(]*)*\)

Demo at regex101; deeper nesting has to be added as required.
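Because the fixed-depth pattern uses no recursion, it also works with Python's standard re module. The sketch below builds the non-unrolled pattern for a chosen nesting depth instead of writing it out by hand; the helper name nested_pattern and its depth parameter are hypothetical, and with depth=3 it produces the same pattern as the first one shown above.

```python
import re


def nested_pattern(depth):
    """Build a regex for a parenthesized group with up to `depth` nested levels inside it."""
    pattern = r"\([^)(]*\)"  # innermost level: no further nesting allowed
    for _ in range(depth):
        # wrap the previous level: single non-parenthesis characters or a nested group
        pattern = r"\((?:[^)(]|" + pattern + r")*\)"
    return pattern


text = "some text(text here(possible text)text(possible text(more text)))end text"
match = re.search(nested_pattern(3), text)
print(match.group(0))
# → (text here(possible text)text(possible text(more text)))
```

Input nested more deeply than the pattern allows is not rejected: the search falls back to an inner group that fits, so "(((((x)))))" with depth 3 yields "((((x))))".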

Reference - What does this regex mean?

Recursive Regular Expressions; Regular-Expressions.info - Regular Expression Recursion; Mastering Regular Expressions - Jeffrey E.F. Friedl

2016-02-08 13:37:00
