如何在Python中获得一个字符串与另一个字符串相似的概率?

我想要得到一个十进制值,比如0.9(意思是90%)等等。最好是标准的Python和库。

e.g.

similar("Apple","Appel") #would have a high prob.

similar("Apple","Mango") #would have a lower prob.

当前回答

还添加了Spacy NLP库;

@profile
def main():
    str1= "Mar 31 09:08:41  The world is beautiful"
    str2= "Mar 31 19:08:42  Beautiful is the world"
    print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio()) 
    print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))

if __name__ == '__main__':

    #python3 -m spacy download en_core_web_sm
    #nlp = spacy.load("en_core_web_sm")
    nlp = spacy.load("en_core_web_md")
    main()

使用Robert Kern的line_profiler运行

kernprof -l -v ./python/loganalysis/testspacy.py

NLP Similarity= 0.9999999821467294
Diff lib similarity 0.5897435897435898
Jellyfish lib similarity 0.8561253561253562

然而,时间的启示

Function: main at line 32

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                           @profile
    33                                           def main():
    34         1          1.0      1.0      0.0      str1= "Mar 31 09:08:41  The world is beautiful"
    35         1          0.0      0.0      0.0      str2= "Mar 31 19:08:42  Beautiful is the world"
    36         1      43248.0  43248.0     99.1      print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    37         1        375.0    375.0      0.9      print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio()) 
    38         1         30.0     30.0      0.1      print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))

其他回答

Textdistance:

TextDistance - python库,用于通过多种算法比较两个或多个序列之间的距离。它有Textdistance

30 +算法 纯python实现 简单的使用 两个以上的序列比较 有些算法在一个类中有多个实现。 可选的numpy使用最高速度。

例二:

import textdistance
textdistance.hamming('test', 'text')

输出:

1

Example2:

import textdistance

textdistance.hamming.normalized_similarity('test', 'text')

输出:

0.75

谢谢,干杯!

我想你们可能在寻找一种描述字符串之间距离的算法。这里有一些你可以参考的:

汉明距离 Levenshtein距离 Damerau-Levenshtein距离 Jaro-Winkler距离

这是内置的。

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

使用它:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0

还添加了Spacy NLP库;

@profile
def main():
    str1= "Mar 31 09:08:41  The world is beautiful"
    str2= "Mar 31 19:08:42  Beautiful is the world"
    print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio()) 
    print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))

if __name__ == '__main__':

    #python3 -m spacy download en_core_web_sm
    #nlp = spacy.load("en_core_web_sm")
    nlp = spacy.load("en_core_web_md")
    main()

使用Robert Kern的line_profiler运行

kernprof -l -v ./python/loganalysis/testspacy.py

NLP Similarity= 0.9999999821467294
Diff lib similarity 0.5897435897435898
Jellyfish lib similarity 0.8561253561253562

然而,时间的启示

Function: main at line 32

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                           @profile
    33                                           def main():
    34         1          1.0      1.0      0.0      str1= "Mar 31 09:08:41  The world is beautiful"
    35         1          0.0      0.0      0.0      str2= "Mar 31 19:08:42  Beautiful is the world"
    36         1      43248.0  43248.0     99.1      print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    37         1        375.0    375.0      0.9      print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio()) 
    38         1         30.0     30.0      0.1      print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))

TheFuzz是一个用python实现Levenshtein距离的包,在某些情况下,当你希望两个不同的字符串被认为是相同的时,它带有一些帮助函数来提供帮助。例如:

>>> fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    91
>>> fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear")
    100