The cosine similarity article on Wikipedia

Can you show the vectors here (in a list or some other form), and then do the math and see how it works?


Current answer

Here are two very short texts to compare:

Julie loves me more than Linda loves me
Jane likes me more than Julie loves me

We want to know how similar these texts are, purely in terms of word counts (ignoring word order). We start by making a list of the words from both texts:

me Julie loves Linda than more likes Jane

Now we count the number of times each of those words appears in each text:

   me   2   2
 Jane   0   1
Julie   1   1
Linda   1   0
likes   0   1
loves   2   1
 more   1   1
 than   1   1
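The counting step above can be sketched in Python with `collections.Counter` (a minimal sketch; the vocabulary here is sorted alphabetically, so the row order differs from the table above, but the cosine is unaffected by ordering):

```python
from collections import Counter

text1 = "Julie loves me more than Linda loves me"
text2 = "Jane likes me more than Julie loves me"

counts1 = Counter(text1.split())
counts2 = Counter(text2.split())

# A shared vocabulary so both count vectors line up position by position;
# Counter returns 0 for words a text does not contain.
vocab = sorted(set(counts1) | set(counts2))
a = [counts1[w] for w in vocab]
b = [counts2[w] for w in vocab]

print(vocab)  # ['Jane', 'Julie', 'Linda', 'likes', 'loves', 'me', 'more', 'than']
print(a)      # [0, 1, 1, 0, 2, 2, 1, 1]
print(b)      # [1, 1, 0, 1, 1, 2, 1, 1]
```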

We are not interested in the words themselves, only in those two vertical vectors of counts. For instance, there are two instances of "me" in each text. We are going to decide how close these two texts are to each other by computing one function of those two vectors, namely the cosine of the angle between them.

The two vectors are:

a: [2, 0, 1, 1, 0, 2, 1, 1]

b: [2, 1, 1, 0, 1, 1, 1, 1]

The cosine of the angle between them is about 0.822.

These vectors are 8-dimensional. A virtue of using cosine similarity is clearly that it converts a question that is beyond human ability to visualize into one that can be visualized. In this case you can think of the angle as about 35 degrees, which is some "distance" from zero, i.e. from perfect agreement.
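Both the 0.822 figure and the roughly 35 degree angle can be verified with a few lines of plain Python, using the two vectors above:

```python
import math

a = [2, 0, 1, 1, 0, 2, 1, 1]
b = [2, 1, 1, 0, 1, 1, 1, 1]

# Dot product and the lengths (Euclidean norms) of the two vectors.
dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(x * x for x in b))

cos_sim = dot / (norm_a * norm_b)
angle = math.degrees(math.acos(cos_sim))

print(round(cos_sim, 3))  # 0.822
print(round(angle))       # 35
```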

Other answers

Here is simple Python code that implements cosine similarity:

import numpy as np

# Word-count vectors for the two texts
a = np.array([2, 1, 0, 2, 0, 1, 1, 1])
b = np.array([2, 1, 1, 1, 1, 0, 1, 1])

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # about 0.8216

Using @Bill Bell's example, here are two ways of doing this in [R]:

a = c(2,1,0,2,0,1,1,1)

b = c(2,1,1,1,1,0,1,1)

d = (a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

Or taking advantage of crossprod()'s performance…

e = crossprod(a, b) / (sqrt(crossprod(a, a)) * sqrt(crossprod(b, b)))

Simple Java code to calculate cosine similarity:

/**
   * Method to calculate cosine similarity of vectors
   * 1 - exactly similar (angle between them is 0)
   * 0 - orthogonal vectors (angle between them is 90)
   * @param vector1 - vector in the form [a1, a2, a3, ..... an]
   * @param vector2 - vector in the form [b1, b2, b3, ..... bn]
   * @return - the cosine similarity of the vectors (from -1 to 1 in general; 0 to 1 for non-negative count vectors)
   */
  private double cosineSimilarity(List<Double> vector1, List<Double> vector2) {

    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < vector1.size(); i++) {
      dotProduct += vector1.get(i) * vector2.get(i);
      normA += Math.pow(vector1.get(i), 2);
      normB += Math.pow(vector2.get(i), 2);
    }
    return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
  }

This is my implementation in C#.

using System;

namespace CosineSimilarity
{
    class Program
    {
        static void Main()
        {
            int[] vecA = {1, 2, 3, 4, 5};
            int[] vecB = {6, 7, 7, 9, 10};

            var cosSimilarity = CalculateCosineSimilarity(vecA, vecB);

            Console.WriteLine(cosSimilarity);
            Console.Read();
        }

        private static double CalculateCosineSimilarity(int[] vecA, int[] vecB)
        {
            var dotProduct = DotProduct(vecA, vecB);
            var magnitudeOfA = Magnitude(vecA);
            var magnitudeOfB = Magnitude(vecB);

            return dotProduct/(magnitudeOfA*magnitudeOfB);
        }

        private static double DotProduct(int[] vecA, int[] vecB)
        {
            // I'm not validating inputs here for simplicity.            
            double dotProduct = 0;
            for (var i = 0; i < vecA.Length; i++)
            {
                dotProduct += (vecA[i] * vecB[i]);
            }

            return dotProduct;
        }

        // Magnitude of the vector is the square root of the dot product of the vector with itself.
        private static double Magnitude(int[] vector)
        {
            return Math.Sqrt(DotProduct(vector, vector));
        }
    }
}

I'm guessing you are more interested in getting some insight into "why" cosine similarity works (why it provides a good indication of similarity), rather than "how" it is calculated (the specific operations used for the calculation). If your interest is in the latter, see the reference indicated by Daniel in this post, as well as the related SO questions.

To explain the how, and even more so the why, it is useful, first, to simplify the problem and to work in two dimensions only. Once you get this in 2D, it is easier to think of it in three dimensions, and of course harder to imagine in many more dimensions, but by then we can use linear algebra to do the numeric computations and also to help us think in terms of lines / vectors / "planes" / "spheres" in n dimensions, even though we cannot draw these.

So, in two dimensions: with regards to text similarity this means that we would focus on two distinct terms, say the words "London" and "Paris", and we'd count how many times each of these words is found in each of the two documents we wish to compare. This gives us, for each document, a point in the x-y plane. For example, if Doc1 had Paris once, and London four times, a point at (1,4) would represent this document (with regards to this diminutive evaluation of documents). Or, speaking in terms of vectors, this Doc1 document would be an arrow going from the origin to point (1,4). With this image in mind, let's think about what it means for two documents to be similar and how this relates to the vectors.

VERY similar documents (again with regards to this limited set of dimensions) would have the very same number of references to Paris, AND the very same number of references to London, or maybe they could have the same ratio of these references. A document, Doc2, with 2 refs to Paris and 8 refs to London, would also be very similar, only with maybe a longer text or somehow more repetitive of the cities' names, but in the same proportion. Maybe both documents are guides about London, only making passing references to Paris (and how uncool that city is ;-) Just kidding!!!

Now, less similar documents may also mention both cities, but in different proportions. Maybe Doc2 would cite Paris only once and London seven times.

Back to our x-y plane: if we draw these hypothetical documents, we see that when they are very similar their vectors overlap (though some of the vectors may be longer), and as they start to have less in common these vectors begin to diverge, with a wider angle between them.

By measuring the angle between the vectors, we can get a good idea of their similarity, and, to make things even easier, by taking the cosine of this angle, we have a nice 0-to-1 (or -1-to-1) value that is indicative of this similarity, depending on what and how we account for it. The smaller the angle, the bigger (closer to 1) the cosine value, and also the higher the similarity.

At the extreme, if Doc1 cites only Paris and Doc2 cites only London, the documents have absolutely nothing in common. Doc1 would have its vector on the x axis, Doc2 on the y axis, the angle would be 90 degrees, and the cosine would be 0. In this case we would say that these documents are orthogonal to one another.
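These 2D cases are easy to check numerically. The counts below are the hypothetical Paris/London counts from the discussion above (Doc1 with Paris once and London four times, and so on):

```python
import math

def cosine(v, w):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

# Doc1: Paris once, London four times; a doc with the same proportions, twice as long.
same_direction = cosine((1, 4), (2, 8))

# A less similar document: Paris once, London seven times.
diverging = cosine((1, 4), (1, 7))

# The extreme case: only Paris vs. only London -- orthogonal vectors.
orthogonal = cosine((1, 0), (0, 1))

print(same_direction, diverging, orthogonal)
```

The first value comes out as 1 (same direction, identical ratio), the second a bit below 1 (the vectors diverge), and the last exactly 0 (orthogonal, nothing in common).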

Adding dimensions: With this intuitive feel for similarity expressed as a small angle (or large cosine), we can now imagine things in 3 dimensions, say by bringing the word "Amsterdam" into the mix, and visualize quite well how a document with two references to each would have a vector going in a particular direction, and we can see how this direction would compare to a document citing Paris and London three times each, but not Amsterdam, etc. As said, we can try and imagine this fancy space for 10 or 100 cities. It's hard to draw, but easy to conceptualize.

I'll finish with a few words about the formula itself. As I said, other references provide good information about the calculations.

First, in two dimensions. The formula for the cosine of the angle between two vectors is derived from the trigonometric difference formula (for angles a and b):

cos(a - b) = (cos(a) * cos(b)) + (sin (a) * sin(b))

This formula looks very similar to the dot product formula:

Vect1 . Vect2 =  (x1 * x2) + (y1 * y2)

where cos(a) corresponds to the x value and sin(a) to the y value, for the first vector, etc. The only problem is that x, y, etc. are not exactly the cos and sin values, for these values need to be read on the unit circle. That's where the denominator of the formula kicks in: by dividing by the product of the lengths of these vectors, the x and y coordinates become normalized.
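The connection between the two formulas can be checked numerically: for unit vectors (cos a, sin a) and (cos b, sin b) the dot product is exactly cos(a - b), and for scaled vectors, dividing by the product of the lengths restores it. A small sketch with two arbitrarily chosen angles:

```python
import math

a, b = 0.9, 0.3  # two arbitrary angles, in radians

# Unit vectors at angles a and b: the dot product IS cos(a - b).
v1 = (math.cos(a), math.sin(a))
v2 = (math.cos(b), math.sin(b))
dot = v1[0] * v2[0] + v1[1] * v2[1]
print(abs(dot - math.cos(a - b)) < 1e-12)  # True

# Scaling the vectors changes the dot product, but dividing by the
# lengths normalizes it back to cos(a - b).
w1 = (3 * v1[0], 3 * v1[1])
w2 = (5 * v2[0], 5 * v2[1])
dot_w = w1[0] * w2[0] + w1[1] * w2[1]
norm1 = math.hypot(*w1)
norm2 = math.hypot(*w2)
print(abs(dot_w / (norm1 * norm2) - math.cos(a - b)) < 1e-12)  # True
```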