我正在寻找一个NLP项目,在任何编程语言(尽管Python将是我的偏好)。

我想取两个文档并确定它们有多相似。


当前回答

这里有一个小应用程序让你开始…

import difflib as dl

a = file('file').read()
b = file('file1').read()

sim = dl.get_close_matches

s = 0
wa = a.split()
wb = b.split()

for i in wa:
    if sim(i, wb):
        s += 1

n = float(s) / float(len(wa))
print '%d%% similarity' % int(n * 100)

其他回答

这是一个老问题了,但我发现斯派西可以很容易地解决这个问题。读取文档后,可以使用简单的api相似性来查找文档向量之间的余弦相似性。

首先安装包并下载模型:

pip install spacy
python -m spacy download en_core_web_sm

然后用like so:

import spacy
nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u'Hello hi there!')
doc2 = nlp(u'Hello hi there!')
doc3 = nlp(u'Hey whatsup?')

print (doc1.similarity(doc2)) # 0.999999954642
print (doc2.similarity(doc3)) # 0.699032527716
print (doc1.similarity(doc3)) # 0.699032527716

如果您对测量两段文本的语义相似性更感兴趣,我建议您看看这个gitlab项目。你可以把它作为服务器运行,也有一个预先构建的模型,你可以很容易地使用它来测量两段文本的相似性;尽管它主要用于测量两个句子的相似度,但你仍然可以在你的情况下使用它。它是用java编写的,但您可以将其作为RESTful服务运行。

另一个选择是DKPro Similarity,这是一个库,有各种算法来测量文本的相似性。然而,它也是用java编写的。

代码示例:

// this similarity measure is defined in the dkpro.similarity.algorithms.lexical-asl package
// you need to add that to your .pom to make that example work
// there are some examples that should work out of the box in dkpro.similarity.example-gpl 
TextSimilarityMeasure measure = new WordNGramJaccardMeasure(3);    // Use word trigrams

String[] tokens1 = "This is a short example text .".split(" ");   
String[] tokens2 = "A short example text could look like that .".split(" ");

double score = measure.getSimilarity(tokens1, tokens2);

System.out.println("Similarity: " + score);

I am combining the solutions from answers of @FredFoo and @Renaud. My solution is able to apply @Renaud's preprocessing on the text corpus of @FredFoo and then display pairwise similarities where the similarity is greater than 0. I ran this code on Windows by installing python and pip first. pip is installed as part of python but you may have to explicitly do it by re-running the installation package, choosing modify and then choosing pip. I use the command line to execute my python code saved in a file "similarity.py". I had to execute the following commands:

>set PYTHONPATH=%PYTHONPATH%;C:\_location_of_python_lib_
>python -m pip install sklearn
>python -m pip install nltk
>py similarity.py

similar .py的代码如下:

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk, string
import numpy as np
nltk.download('punkt') # if necessary...

stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

corpus = ["I'd like an apple", 
           "An apple a day keeps the doctor away", 
           "Never compare an apple to an orange", 
           "I prefer scikit-learn to Orange", 
           "The scikit-learn docs are Orange and Blue"]  

vect = TfidfVectorizer(tokenizer=normalize, stop_words='english')
tfidf = vect.fit_transform(corpus)   
                                                                                                                                                                                                                    
pairwise_similarity = tfidf * tfidf.T

#view the pairwise similarities 
print(pairwise_similarity)

#check how a string is normalized
print(normalize("The scikit-learn docs are Orange and Blue"))

你可能想尝试一下cos文档相似度的在线服务http://www.scurtu.it/documentSimilarity.html

import urllib,urllib2
import json
API_URL="http://www.scurtu.it/apis/documentSimilarity"
inputDict={}
inputDict['doc1']='Document with some text'
inputDict['doc2']='Other document with some text'
params = urllib.urlencode(inputDict)    
f = urllib2.urlopen(API_URL, params)
response= f.read()
responseObject=json.loads(response)  
print responseObject

Generally a cosine similarity between two documents is used as a similarity measure of documents. In Java, you can use Lucene (if your collection is pretty large) or LingPipe to do this. The basic concept would be to count the terms in every document and calculate the dot product of the term vectors. The libraries do provide several improvements over this general approach, e.g. using inverse document frequencies and calculating tf-idf vectors. If you are looking to do something copmlex, LingPipe also provides methods to calculate LSA similarity between documents which gives better results than cosine similarity. For Python, you can use NLTK.