如何在Python中获得一个字符串与另一个字符串相似的概率?

我想要得到一个十进制值,比如0.9(意思是90%)等等。最好是标准的Python和库。

e.g.

similar("Apple","Appel") #would have a high prob.

similar("Apple","Mango") #would have a lower prob.

当前回答

解决方案#1:内置Python

使用difflib中的SequenceMatcher

优点: 本机python库,不需要额外的包。 缺点:太有限了,有很多其他的字符串相似度的好算法。

例子

:
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75

解决方案#2:水母库

这是一个非常好的图书馆,覆盖面广,问题少。 它支持: - Levenshtein距离 -达默罗-利文斯坦距离 ——Jaro Distance - Jaro-Winkler距离 -匹配评级方法比较 -汉明距离

优点: 易于使用,支持的算法的范围,测试。 缺点:不是本地库。

例子:

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637
>>> jellyfish.damerau_levenshtein_distance(u'jellyfish', u'jellyfihs')
1

其他回答

还添加了Spacy NLP库;

@profile
def main():
    str1= "Mar 31 09:08:41  The world is beautiful"
    str2= "Mar 31 19:08:42  Beautiful is the world"
    print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio()) 
    print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))

if __name__ == '__main__':

    #python3 -m spacy download en_core_web_sm
    #nlp = spacy.load("en_core_web_sm")
    nlp = spacy.load("en_core_web_md")
    main()

使用Robert Kern的line_profiler运行

kernprof -l -v ./python/loganalysis/testspacy.py

NLP Similarity= 0.9999999821467294
Diff lib similarity 0.5897435897435898
Jellyfish lib similarity 0.8561253561253562

然而,时间的启示

Function: main at line 32

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                           @profile
    33                                           def main():
    34         1          1.0      1.0      0.0      str1= "Mar 31 09:08:41  The world is beautiful"
    35         1          0.0      0.0      0.0      str2= "Mar 31 19:08:42  Beautiful is the world"
    36         1      43248.0  43248.0     99.1      print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    37         1        375.0    375.0      0.9      print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio()) 
    38         1         30.0     30.0      0.1      print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))

包装距离包括Levenshtein距离:

import distance
distance.levenshtein("lenvestein", "levenshtein")
# 3

内置的SequenceMatcher在大输入时非常慢,下面是如何用diff-match-patch完成的:

from diff_match_patch import diff_match_patch

def compute_similarity_and_diff(text1, text2):
    dmp = diff_match_patch()
    dmp.Diff_Timeout = 0.0
    diff = dmp.diff_main(text1, text2, False)

    # similarity
    common_text = sum([len(txt) for op, txt in diff if op == 0])
    text_length = max(len(text1), len(text2))
    sim = common_text / text_length

    return sim, diff

我想你们可能在寻找一种描述字符串之间距离的算法。这里有一些你可以参考的:

汉明距离 Levenshtein距离 Damerau-Levenshtein距离 Jaro-Winkler距离

Python3.6 + = 没有导入图书馆 在大多数情况下工作良好

在堆栈溢出,当你试图添加一个标签或发布一个问题,它会带来所有相关的东西。这是如此方便,正是我正在寻找的算法。因此,我编写了一个查询集相似度过滤器。

def compare(qs, ip):
    al = 2
    v = 0
    for ii, letter in enumerate(ip):
        if letter == qs[ii]:
            v += al
        else:
            ac = 0
            for jj in range(al):
                if ii - jj < 0 or ii + jj > len(qs) - 1: 
                    break
                elif letter == qs[ii - jj] or letter == qs[ii + jj]:
                    ac += jj
                    break
            v += ac
    return v


def getSimilarQuerySet(queryset, inp, length):
    return [k for tt, (k, v) in enumerate(reversed(sorted({it: compare(it, inp) for it in queryset}.items(), key=lambda item: item[1])))][:length]
        


if __name__ == "__main__":
    print(compare('apple', 'mongo'))
    # 0
    print(compare('apple', 'apple'))
    # 10
    print(compare('apple', 'appel'))
    # 7
    print(compare('dude', 'ud'))
    # 1
    print(compare('dude', 'du'))
    # 4
    print(compare('dude', 'dud'))
    # 6

    print(compare('apple', 'mongo'))
    # 2
    print(compare('apple', 'appel'))
    # 8

    print(getSimilarQuerySet(
        [
            "java",
            "jquery",
            "javascript",
            "jude",
            "aja",
        ], 
        "ja",
        2,
    ))
    # ['javascript', 'java']

解释

compare takes two string and returns a positive integer. you can edit the al allowed variable in compare, it indicates how large the range we need to search through. It works like this: two strings are iterated, if same character is find at same index, then accumulator will be added to a largest value. Then, we search in the index range of allowed, if matched, add to the accumulator based on how far the letter is. (the further, the smaller) length indicate how many items you want as result, that is most similar to input string.