维基百科上的余弦相似度文章
你能在这里(以列表或其他形式)显示向量吗? 然后算一算,看看是怎么回事?
维基百科上的余弦相似度文章
你能在这里(以列表或其他形式)显示向量吗? 然后算一算,看看是怎么回事?
这是我在c#中的实现。
using System;
namespace CosineSimilarity
{
class Program
{
static void Main()
{
int[] vecA = {1, 2, 3, 4, 5};
int[] vecB = {6, 7, 7, 9, 10};
var cosSimilarity = CalculateCosineSimilarity(vecA, vecB);
Console.WriteLine(cosSimilarity);
Console.Read();
}
private static double CalculateCosineSimilarity(int[] vecA, int[] vecB)
{
var dotProduct = DotProduct(vecA, vecB);
var magnitudeOfA = Magnitude(vecA);
var magnitudeOfB = Magnitude(vecB);
return dotProduct/(magnitudeOfA*magnitudeOfB);
}
private static double DotProduct(int[] vecA, int[] vecB)
{
// I'm not validating inputs here for simplicity.
double dotProduct = 0;
for (var i = 0; i < vecA.Length; i++)
{
dotProduct += (vecA[i] * vecB[i]);
}
return dotProduct;
}
// Magnitude of the vector is the square root of the dot product of the vector with itself.
private static double Magnitude(int[] vector)
{
return Math.Sqrt(DotProduct(vector, vector));
}
}
}
我猜你更感兴趣的是深入了解余弦相似度工作的“为什么”(为什么它提供了一个很好的相似性指示),而不是“如何”计算它(用于计算的具体操作)。如果您对后者感兴趣,请参阅Daniel在这篇文章中指出的参考文献,以及相关的SO问题。
为了解释如何,甚至是为什么,首先,简化问题并只在二维空间中工作是有用的。一旦你在2D中得到了这个,在三维中考虑它就更容易了,当然在更大的维度中很难想象,但到那时我们可以用线性代数来进行数值计算,也可以帮助我们在n维中考虑直线/向量/“平面”/“球”,尽管我们不能画出来。
So, in two dimensions: with regards to text similarity this means that we would focus on two distinct terms, say the words "London" and "Paris", and we'd count how many times each of these words is found in each of the two documents we wish to compare. This gives us, for each document, a point in the the x-y plane. For example, if Doc1 had Paris once, and London four times, a point at (1,4) would present this document (with regards to this diminutive evaluation of documents). Or, speaking in terms of vectors, this Doc1 document would be an arrow going from the origin to point (1,4). With this image in mind, let's think about what it means for two documents to be similar and how this relates to the vectors.
VERY similar documents (again with regards to this limited set of dimensions) would have the very same number of references to Paris, AND the very same number of references to London, or maybe, they could have the same ratio of these references. A Document, Doc2, with 2 refs to Paris and 8 refs to London, would also be very similar, only with maybe a longer text or somehow more repetitive of the cities' names, but in the same proportion. Maybe both documents are guides about London, only making passing references to Paris (and how uncool that city is ;-) Just kidding!!!.
现在,不太相似的文件可能也会提到两个城市,但比例不同。也许文档2只会提到巴黎一次,伦敦七次。
回到我们的x-y平面,如果我们画出这些假设的文档,我们会看到当它们非常相似时,它们的向量会重叠(尽管有些向量可能更长),当它们的共同点开始减少时,这些向量开始发散,它们之间的角度会更大。
通过测量向量之间的夹角,我们可以很好地了解它们的相似性,为了让事情变得更简单,通过计算这个夹角的余弦,我们有一个很好的0到1或-1到1的值,这表明了这种相似性,这取决于我们解释什么以及如何解释。角度越小,余弦值越大(越接近1),相似度也越高。
在极端情况下,如果Doc1只引用巴黎,而Doc2只引用伦敦,那么这些文档绝对没有任何共同之处。Doc1的向量在x轴上,Doc2的向量在y轴上,角度是90度,cos0。在这种情况下,我们会说这些文档彼此正交。
Adding dimensions: With this intuitive feel for similarity expressed as a small angle (or large cosine), we can now imagine things in 3 dimensions, say by bringing the word "Amsterdam" into the mix, and visualize quite well how a document with two references to each would have a vector going in a particular direction, and we can see how this direction would compare to a document citing Paris and London three times each, but not Amsterdam, etc. As said, we can try and imagine the this fancy space for 10 or 100 cities. It's hard to draw, but easy to conceptualize.
我将用几句话来结束这个公式本身。正如我所说,其他参考文献提供了关于计算的良好信息。
首先是二维空间。两个向量夹角余弦的公式是由三角函数差(角a和角b之间)推导出来的:
cos(a - b) = (cos(a) * cos(b)) + (sin (a) * sin(b))
这个公式看起来很类似于点积公式:
Vect1 . Vect2 = (x1 * x2) + (y1 * y2)
其中cos(a)对应x值,sin(a)对应y值,对于第一个向量,等等。唯一的问题是,x, y等并不完全是cos和sin值,因为这些值需要在单位圆上读取。这就是公式分母的作用所在:通过除以这些向量长度的乘积,x和y坐标就变得标准化了。
这里有两篇很短的文章供比较:
朱莉比琳达更爱我 简喜欢我胜过朱莉爱我
我们想知道这些文本有多相似,纯粹从字数的角度来看(忽略词序)。我们首先列出了这两篇文章中的单词:
me Julie loves Linda than more likes Jane
现在我们来数一下这些单词在每篇文章中出现的次数:
me 2 2
Jane 0 1
Julie 1 1
Linda 1 0
likes 0 1
loves 2 1
more 1 1
than 1 1
我们对单词本身不感兴趣。我们只对 这两个垂直向量的计数。例如,有两个实例 每条短信里都有“我”。我们要决定这两篇文章之间的距离 另一种方法是计算这两个向量的一个函数,即cos 它们之间的夹角。
这两个向量是
a: [2, 0, 1, 1, 0, 2, 1, 1]
b: [2, 1, 1, 0, 1, 1, 1, 1]
两者夹角的余弦值约为0.822。
这些向量是8维的。使用余弦相似度的一个优点是显而易见的 它将一个超出人类想象能力的问题转化为一个问题 那是可以的。在这种情况下,你可以认为这个角大约是35度 度是指距离零或完全一致的距离。
为了简单起见,我化简了向量a和b:
Let :
a : [1, 1, 0]
b : [1, 0, 1]
那么余弦相似度(Theta)
(Theta) = (1*1 + 1*0 + 0*1)/sqrt((1^2 + 1^2))* sqrt((1^2 + 1^2)) = 1/2 = 0.5
cos0.5的逆是60度。
这段Python代码是我实现算法的快速而肮脏的尝试:
import math
from collections import Counter
def build_vector(iterable1, iterable2):
counter1 = Counter(iterable1)
counter2 = Counter(iterable2)
all_items = set(counter1.keys()).union(set(counter2.keys()))
vector1 = [counter1[k] for k in all_items]
vector2 = [counter2[k] for k in all_items]
return vector1, vector2
def cosim(v1, v2):
dot_product = sum(n1 * n2 for n1, n2 in zip(v1, v2) )
magnitude1 = math.sqrt(sum(n ** 2 for n in v1))
magnitude2 = math.sqrt(sum(n ** 2 for n in v2))
return dot_product / (magnitude1 * magnitude2)
l1 = "Julie loves me more than Linda loves me".split()
l2 = "Jane likes me more than Julie loves me or".split()
v1, v2 = build_vector(l1, l2)
print(cosim(v1, v2))
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
/**
*
* @author Xiao Ma
* mail : 409791952@qq.com
*
*/
public class SimilarityUtil {
public static double consineTextSimilarity(String[] left, String[] right) {
Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>();
Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>();
Set<String> uniqueSet = new HashSet<String>();
Integer temp = null;
for (String leftWord : left) {
temp = leftWordCountMap.get(leftWord);
if (temp == null) {
leftWordCountMap.put(leftWord, 1);
uniqueSet.add(leftWord);
} else {
leftWordCountMap.put(leftWord, temp + 1);
}
}
for (String rightWord : right) {
temp = rightWordCountMap.get(rightWord);
if (temp == null) {
rightWordCountMap.put(rightWord, 1);
uniqueSet.add(rightWord);
} else {
rightWordCountMap.put(rightWord, temp + 1);
}
}
int[] leftVector = new int[uniqueSet.size()];
int[] rightVector = new int[uniqueSet.size()];
int index = 0;
Integer tempCount = 0;
for (String uniqueWord : uniqueSet) {
tempCount = leftWordCountMap.get(uniqueWord);
leftVector[index] = tempCount == null ? 0 : tempCount;
tempCount = rightWordCountMap.get(uniqueWord);
rightVector[index] = tempCount == null ? 0 : tempCount;
index++;
}
return consineVectorSimilarity(leftVector, rightVector);
}
/**
* The resulting similarity ranges from −1 meaning exactly opposite, to 1
* meaning exactly the same, with 0 usually indicating independence, and
* in-between values indicating intermediate similarity or dissimilarity.
*
* For text matching, the attribute vectors A and B are usually the term
* frequency vectors of the documents. The cosine similarity can be seen as
* a method of normalizing document length during comparison.
*
* In the case of information retrieval, the cosine similarity of two
* documents will range from 0 to 1, since the term frequencies (tf-idf
* weights) cannot be negative. The angle between two term frequency vectors
* cannot be greater than 90°.
*
* @param leftVector
* @param rightVector
* @return
*/
private static double consineVectorSimilarity(int[] leftVector,
int[] rightVector) {
if (leftVector.length != rightVector.length)
return 1;
double dotProduct = 0;
double leftNorm = 0;
double rightNorm = 0;
for (int i = 0; i < leftVector.length; i++) {
dotProduct += leftVector[i] * rightVector[i];
leftNorm += leftVector[i] * leftVector[i];
rightNorm += rightVector[i] * rightVector[i];
}
double result = dotProduct
/ (Math.sqrt(leftNorm) * Math.sqrt(rightNorm));
return result;
}
public static void main(String[] args) {
String left[] = { "Julie", "loves", "me", "more", "than", "Linda",
"loves", "me" };
String right[] = { "Jane", "likes", "me", "more", "than", "Julie",
"loves", "me" };
System.out.println(consineTextSimilarity(left,right));
}
}
以@Bill Bell为例,在[R]中有两种方法
a = c(2,1,0,2,0,1,1,1)
b = c(2,1,1,1,1,0,1,1)
d = (a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
或者利用crossprod()方法的性能…
e = crossprod(a, b) / (sqrt(crossprod(a, a)) * sqrt(crossprod(b, b)))
这是一个简单的Python代码,实现余弦相似度。
from scipy import linalg, mat, dot
import numpy as np
In [12]: matrix = mat( [[2, 1, 0, 2, 0, 1, 1, 1],[2, 1, 1, 1, 1, 0, 1, 1]] )
In [13]: matrix
Out[13]:
matrix([[2, 1, 0, 2, 0, 1, 1, 1],
[2, 1, 1, 1, 1, 0, 1, 1]])
In [14]: dot(matrix[0],matrix[1].T)/np.linalg.norm(matrix[0])/np.linalg.norm(matrix[1])
Out[14]: matrix([[ 0.82158384]])
简单的JAVA代码计算余弦相似度
/**
* Method to calculate cosine similarity of vectors
* 1 - exactly similar (angle between them is 0)
* 0 - orthogonal vectors (angle between them is 90)
* @param vector1 - vector in the form [a1, a2, a3, ..... an]
* @param vector2 - vector in the form [b1, b2, b3, ..... bn]
* @return - the cosine similarity of vectors (ranges from 0 to 1)
*/
private double cosineSimilarity(List<Double> vector1, List<Double> vector2) {
double dotProduct = 0.0;
double normA = 0.0;
double normB = 0.0;
for (int i = 0; i < vector1.size(); i++) {
dotProduct += vector1.get(i) * vector2.get(i);
normA += Math.pow(vector1.get(i), 2);
normB += Math.pow(vector2.get(i), 2);
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
两个向量A和B存在于二维空间或三维空间中,它们之间的夹角为cos相似度。
如果角度更大(可以达到最大180度),即Cos 180=-1,最小角度为0度。cos0 =1意味着向量是对齐的,因此向量是相似的。
cos 90=0(这足以得出向量A和B根本不相似,因为距离不能为负,余弦值将在0到1之间。因此,更多的角度意味着降低相似性(视觉化也有意义)
下面是一个简单的计算余弦相似度的Python代码:
import math
def dot_prod(v1, v2):
ret = 0
for i in range(len(v1)):
ret += v1[i] * v2[i]
return ret
def magnitude(v):
ret = 0
for i in v:
ret += i**2
return math.sqrt(ret)
def cos_sim(v1, v2):
return (dot_prod(v1, v2)) / (magnitude(v1) * magnitude(v2))
让我试着用Python代码和一些图形数学公式来解释这一点。
假设我们的代码中有两个非常短的文本:
texts = ["I am a boy", "I am a girl"]
我们想要比较下面的查询文本,看看查询与上面的文本有多接近,使用快速余弦相似度评分:
query = ["I am a boy scout"]
我们应该如何计算余弦相似度分数?首先,让我们用Python为这些文本构建一个tfidf矩阵:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)
接下来,让我们检查tfidf矩阵及其词汇表的值:
print(tfidf_matrix.toarray())
# output
array([[0.57973867, 0.81480247, 0. ],
[0.57973867, 0. , 0.81480247]])
这里,我们得到一个tfidf矩阵,tfidf值为2 x 3,或者2个文档/文本x 3个术语。这是我们的tfidf文档术语矩阵。让我们通过调用vectorizer.vocabulary_来看看这3项是什么
print(vectorizer.vocabulary_)
# output
{'am': 0, 'boy': 1, 'girl': 2}
这告诉我们,tfidf矩阵中的3项是“am”,“boy”和“girl”。'am'在第0列,'boy'在第1列,'girl'在第2列。术语“I”和“a”已被矢量器删除,因为它们是停止词。
现在我们有了tfidf矩阵,我们想要比较我们的查询文本和我们的文本,看看我们的查询有多接近我们的文本。为此,我们可以计算查询的余弦相似度分数与文本的tfidf矩阵。但首先,我们需要计算查询的tfidf:
query = ["I am a boy scout"]
query_tfidf = vectorizer.transform([query])
print(query_tfidf.toarray())
#output
array([[0.57973867, 0.81480247, 0. ]])
Here, we computed the tfidf of our query. Our query_tfidf has a vector of tfidf values [0.57973867, 0.81480247, 0. ], which we will use to compute our cosine similarity multiplication scores. If I am not mistaken, the query_tfidf values or vectorizer.transform([query]) values are derived by just selecting the row or document from tfidf_matrix that has the most word matching with the query. For example, row 1 or document/text 1 of the tfidf_matrix has the most word matching with the query text which contains "am" (0.57973867) and "boy" (0.81480247), hence row 1 of the tfidf_matrix of [0.57973867, 0.81480247, 0. ] values are selected to be the values for query_tfidf. (Note: If someone could help further explain this that would be good)
在计算query_tfidf之后,我们现在可以将query_tfidf向量与文本tfidf_matrix矩阵相乘或点积,以获得余弦相似度分数。
回想一下,余弦相似度分数或公式等于以下:
cosine similarity score = (A . B) / ||A|| ||B||
这里,A =我们的query_tfidf向量,B =我们的tfidf_matrix的每一行
注意:A。B = A * B^T,或者A点积B = A乘以B转置。
了解了公式之后,让我们手动计算query_tfidf的余弦相似度分数,然后将我们的答案与sklearn提供的值进行比较。度量cosine_similarity函数。让我们手动计算:
query_tfidf_arr = query_tfidf.toarray()
tfidf_matrix_arr = tfidf_matrix.toarray()
cosine_similarity_1 = np.dot(query_tfidf_arr, tfidf_matrix_arr[0].T) /
(np.linalg.norm(query_tfidf_arr) * np.linalg.norm(tfidf_matrix_arr[0]))
cosine_similarity_2 = np.dot(query_tfidf_arr, tfidf_matrix_arr[1].T) /
(np.linalg.norm(query_tfidf_arr) * np.linalg.norm(tfidf_matrix_arr[1]))
manual_cosine_similarities = [cosine_similarity_1[0], cosine_similarity_2[0]]
print(manual_cosine_similarities)
#output
[1.0, 0.33609692727625745]
我们手动计算的余弦相似度分数给出了[1.0,0.33609692727625745]的值。让我们用sklearn提供的答案值来检查手动计算的余弦相似度分数。度量cosine_similarity函数:
from sklearn.metrics.pairwise import cosine_similarity
function_cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix)
print(function_cosine_similarities)
#output
array([[1.0 , 0.33609693]])
输出值都是相同的!手动计算余弦相似度值与函数计算余弦相似度值相同!
因此,这个简单的解释说明了余弦相似值是如何计算的。希望这个解释对你有帮助。