The cosine similarity article on Wikipedia

Can you show the vectors here (in a list or something), and then do the math and let us see how it works?


Current answer

Here is some simple Python code that computes cosine similarity:

import math

def dot_prod(v1, v2):
    # Sum of the pairwise products of the two vectors' components.
    ret = 0
    for i in range(len(v1)):
        ret += v1[i] * v2[i]
    return ret

def magnitude(v):
    # Euclidean length: square root of the sum of squared components.
    ret = 0
    for i in v:
        ret += i**2
    return math.sqrt(ret)

def cos_sim(v1, v2):
    # Cosine similarity: dot product over the product of the magnitudes.
    return dot_prod(v1, v2) / (magnitude(v1) * magnitude(v2))
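
For example, using the vectors a = [1, 1, 0] and b = [1, 0, 1] that are worked out by hand in the next answer:

print(cos_sim([1, 1, 0], [1, 0, 1]))
# 0.4999999999999999, i.e. 0.5 up to floating-point rounding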

Other answers

For simplicity, I am reducing the vectors a and b:

Let :
    a : [1, 1, 0]
    b : [1, 0, 1]

Then the cosine similarity, cos(Theta), is:

cos(Theta) = (1*1 + 1*0 + 0*1) / (sqrt(1^2 + 1^2) * sqrt(1^2 + 1^2)) = 1/2 = 0.5

And the inverse cosine of 0.5, arccos(0.5), is 60 degrees.
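
As a quick sanity check of that angle, a minimal sketch using Python's standard math module:

import math

print(math.degrees(math.acos(0.5)))
# ~60.0, the angle whose cosine is 0.5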

Here is my implementation in C#.

using System;

namespace CosineSimilarity
{
    class Program
    {
        static void Main()
        {
            int[] vecA = {1, 2, 3, 4, 5};
            int[] vecB = {6, 7, 7, 9, 10};

            var cosSimilarity = CalculateCosineSimilarity(vecA, vecB);

            Console.WriteLine(cosSimilarity);
            Console.Read();
        }

        private static double CalculateCosineSimilarity(int[] vecA, int[] vecB)
        {
            var dotProduct = DotProduct(vecA, vecB);
            var magnitudeOfA = Magnitude(vecA);
            var magnitudeOfB = Magnitude(vecB);

            return dotProduct/(magnitudeOfA*magnitudeOfB);
        }

        private static double DotProduct(int[] vecA, int[] vecB)
        {
            // I'm not validating inputs here for simplicity.            
            double dotProduct = 0;
            for (var i = 0; i < vecA.Length; i++)
            {
                dotProduct += (vecA[i] * vecB[i]);
            }

            return dotProduct;
        }

        // Magnitude of the vector is the square root of the dot product of the vector with itself.
        private static double Magnitude(int[] vector)
        {
            return Math.Sqrt(DotProduct(vector, vector));
        }
    }
}

I'm guessing you are more interested in getting some insight into "why" cosine similarity works (why it provides a good indication of similarity), rather than "how" it is calculated (the specific operations used for the calculation). If your interest is in the latter, see the reference indicated by Daniel in this post, as well as the related SO question.

To explain the how, and even more so the why, it is useful, at first, to simplify the problem and to work in two dimensions only. Once you get it in 2D, it is easier to think of it in three dimensions, and of course harder to imagine in many more dimensions; but by then we can use linear algebra to do the numeric calculations, and also to help us think in terms of lines / vectors / "planes" / "spheres" in n dimensions, even though we can't draw them.

So, in two dimensions: with regards to text similarity this means that we would focus on two distinct terms, say the words "London" and "Paris", and we'd count how many times each of these words is found in each of the two documents we wish to compare. This gives us, for each document, a point in the x-y plane. For example, if Doc1 had Paris once, and London four times, a point at (1,4) would represent this document (with regards to this diminutive evaluation of documents). Or, speaking in terms of vectors, this Doc1 document would be an arrow going from the origin to point (1,4). With this image in mind, let's think about what it means for two documents to be similar and how this relates to the vectors.

VERY similar documents (again with regards to this limited set of dimensions) would have the very same number of references to Paris, AND the very same number of references to London, or maybe they could have the same ratio of these references. A document, Doc2, with 2 refs to Paris and 8 refs to London, would also be very similar, only with maybe a longer text or somehow more repetitive of the cities' names, but in the same proportion. Maybe both documents are guides about London, only making passing references to Paris (and how uncool that city is ;-) Just kidding!!!

Now, less similar documents may also mention both cities, but in different proportions. Maybe Doc2 would only cite Paris once and London seven times.

Back to our x-y plane, if we draw these hypothetical documents, we see that when they are very similar their vectors overlap (though some vectors may be longer), and as they start to have less in common, these vectors begin to diverge, with a wider angle between them.

By measuring the angle between the vectors, we can get a good idea of their similarity, and to make things even easier, by taking the cosine of this angle we have a nice value between 0 and 1 (or -1 and 1) that is indicative of this similarity, depending on what and how we account for it. The smaller the angle, the bigger (closer to 1) the cosine value, and the higher the similarity.

At the extreme, if Doc1 cites only Paris and Doc2 cites only London, the documents have absolutely nothing in common. Doc1 would have its vector on the x-axis, Doc2 on the y-axis, the angle would be 90 degrees, and the cosine 0. In this case we would say that these documents are orthogonal to one another.
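
To make this concrete, here is a minimal sketch (reusing the cos_sim function from the answer above) with the hypothetical document vectors just described, where the (Paris, London) counts are the x and y coordinates:

doc1 = (1, 4)              # 1 ref to Paris, 4 to London
doc2_same_ratio = (2, 8)   # same proportions, longer text
doc2_skewed = (1, 7)       # different proportions
paris_only = (1, 0)
london_only = (0, 4)

print(cos_sim(doc1, doc2_same_ratio))    # ~1.0: the vectors point the same way
print(cos_sim(doc1, doc2_skewed))        # ~0.995: the angle starts to open up
print(cos_sim(paris_only, london_only))  # 0.0: orthogonal, nothing in common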

Adding dimensions: With this intuitive feel for similarity expressed as a small angle (or large cosine), we can now imagine things in 3 dimensions, say by bringing the word "Amsterdam" into the mix, and visualize quite well how a document with two references to each would have a vector going in a particular direction, and we can see how this direction would compare to a document citing Paris and London three times each, but not Amsterdam, etc. As said, we can try and imagine this fancy space for 10 or 100 cities. It's hard to draw, but easy to conceptualize.

I will finish with a few words about the formula itself. As I said, other references provide good information about the calculations.

First, in two dimensions: the formula for the cosine of the angle between two vectors is derived from the trigonometric difference (between angle a and angle b):

cos(a - b) = (cos(a) * cos(b)) + (sin(a) * sin(b))

This formula looks very similar to the dot product formula:

Vect1 . Vect2 =  (x1 * x2) + (y1 * y2)

where cos(a) corresponds to the x value and sin(a) to the y value, for the first vector, etc. The only problem is that x, y, etc. are not exactly the cos and sin values, for these values would need to be read on the unit circle. That's where the denominator of the formula kicks in: by dividing by the product of the lengths of these vectors, the x and y coordinates become normalized.
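
Here is a small numeric check of that connection, using two sample 2D vectors: the angle of each vector is obtained with math.atan2, and cos(a - b) is compared against the dot product normalized by the vector lengths:

import math

x1, y1 = 1, 4   # first vector
x2, y2 = 1, 7   # second vector

a = math.atan2(y1, x1)
b = math.atan2(y2, x2)

lhs = math.cos(a - b)
rhs = (x1 * x2 + y1 * y2) / (math.hypot(x1, y1) * math.hypot(x2, y2))

print(lhs, rhs)  # both ~0.9947: the identity and the normalized dot product agree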

Let me try to explain this in terms of Python code and some graphical math formulas.

Suppose we have the following two very short texts in our code:

texts = ["I am a boy", "I am a girl"] 

and we want to compare the following query text against the texts above, to see how close the query is to those texts, using fast cosine similarity scoring:

query = ["I am a boy scout"]

How should we compute the cosine similarity scores? First, let's build a tfidf matrix in Python for these texts:

from sklearn.feature_extraction.text import TfidfVectorizer
    
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

Next, let's inspect the values of our tfidf matrix and its vocabulary:

print(tfidf_matrix.toarray())
# output 
array([[0.57973867, 0.81480247, 0.        ],
       [0.57973867, 0.        , 0.81480247]])

Here, we get a tfidf matrix with tfidf values of 2 x 3, i.e. 2 documents/texts x 3 terms. This is our tfidf document-term matrix. Let's see what the 3 terms are by calling vectorizer.vocabulary_:

print(vectorizer.vocabulary_)
# output
{'am': 0, 'boy': 1, 'girl': 2}

This tells us that the 3 terms in our tfidf matrix are 'am', 'boy' and 'girl': 'am' is at column 0, 'boy' at column 1, and 'girl' at column 2. The terms 'I' and 'a' have been dropped by the vectorizer: by default it uses no stop word list, but its default token pattern ignores single-character tokens.
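
If you did want to keep those single-character tokens, you could pass a custom token_pattern to the vectorizer. This is just a sketch; the default pattern, r"(?u)\b\w\w+\b", only keeps tokens of 2 or more word characters:

vectorizer_keep_all = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
vectorizer_keep_all.fit(texts)
print(sorted(vectorizer_keep_all.vocabulary_))
# ['a', 'am', 'boy', 'girl', 'i'] - 'i' and 'a' now get columns too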

Now that we have our tfidf matrix, we want to compare our query text to our texts and see how close our query is to them. To do that, we can compute the cosine similarity scores of the query against the tfidf matrix of the texts. But first, we need to compute the tfidf of our query:

query = ["I am a boy scout"]

query_tfidf = vectorizer.transform(query)
print(query_tfidf.toarray())
#output
array([[0.57973867, 0.81480247, 0.        ]])

Here, we computed the tfidf of our query. Our query_tfidf is a vector of tfidf values, [0.57973867, 0.81480247, 0.], which we will use to compute our cosine similarity scores. To clarify where these values come from: vectorizer.transform(query) tokenizes the query, keeps only the terms that are in the fitted vocabulary ("am" and "boy"; "scout" was never seen during fit, so it is simply ignored), weights them with the idf values learned from the texts, and L2-normalizes the result. Since the query's in-vocabulary terms are exactly those of the first text, "I am a boy", the query vector comes out identical to the first row of the tfidf_matrix, [0.57973867, 0.81480247, 0.].
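
You can confirm that "scout" is simply ignored: transforming the query with and without the out-of-vocabulary word yields the same vector:

print(vectorizer.transform(["I am a boy scout"]).toarray())
print(vectorizer.transform(["I am a boy"]).toarray())
# both print [[0.57973867 0.81480247 0.        ]]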

After computing query_tfidf, we can now multiply (dot product) our query_tfidf vector with each row of our texts' tfidf_matrix to obtain the cosine similarity scores.

Recall that the cosine similarity score, or formula, is the following:

cosine similarity score = (A . B) / (||A|| * ||B||)

Here, A = our query_tfidf vector, and B = each row of our tfidf_matrix.

Note: A . B = A * B^T, that is, the dot product of A and B equals A multiplied by the transpose of B.

Knowing the formula, let's manually compute the cosine similarity scores for query_tfidf, and then compare our answers with the values provided by the sklearn.metrics.pairwise cosine_similarity function. Let's compute manually:

import numpy as np

query_tfidf_arr = query_tfidf.toarray()
tfidf_matrix_arr = tfidf_matrix.toarray()

cosine_similarity_1 = np.dot(query_tfidf_arr, tfidf_matrix_arr[0].T) / (
    np.linalg.norm(query_tfidf_arr) * np.linalg.norm(tfidf_matrix_arr[0]))
cosine_similarity_2 = np.dot(query_tfidf_arr, tfidf_matrix_arr[1].T) / (
    np.linalg.norm(query_tfidf_arr) * np.linalg.norm(tfidf_matrix_arr[1]))

manual_cosine_similarities = [cosine_similarity_1[0], cosine_similarity_2[0]]
print(manual_cosine_similarities)    
#output
[1.0, 0.33609692727625745]

Our manually computed cosine similarity scores give the values [1.0, 0.33609692727625745]. Let's check these manually computed scores against the answer values provided by the sklearn.metrics.pairwise cosine_similarity function:

from sklearn.metrics.pairwise import cosine_similarity

function_cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix)
print(function_cosine_similarities)
#output
array([[1.        , 0.33609693]])

The output values are the same! The manually computed cosine similarity values are identical to the function-computed ones!

So this simple explanation shows how the cosine similarity values are computed. Hope you found it helpful.