维基百科上的余弦相似度文章

你能在这里(以列表或其他形式)显示向量吗? 然后算一算,看看是怎么回事?


当前回答

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * 
* @author Xiao Ma
* mail : 409791952@qq.com
*
*/
  public class SimilarityUtil {

public static double consineTextSimilarity(String[] left, String[] right) {
    Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>();
    Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>();
    Set<String> uniqueSet = new HashSet<String>();
    Integer temp = null;
    for (String leftWord : left) {
        temp = leftWordCountMap.get(leftWord);
        if (temp == null) {
            leftWordCountMap.put(leftWord, 1);
            uniqueSet.add(leftWord);
        } else {
            leftWordCountMap.put(leftWord, temp + 1);
        }
    }
    for (String rightWord : right) {
        temp = rightWordCountMap.get(rightWord);
        if (temp == null) {
            rightWordCountMap.put(rightWord, 1);
            uniqueSet.add(rightWord);
        } else {
            rightWordCountMap.put(rightWord, temp + 1);
        }
    }
    int[] leftVector = new int[uniqueSet.size()];
    int[] rightVector = new int[uniqueSet.size()];
    int index = 0;
    Integer tempCount = 0;
    for (String uniqueWord : uniqueSet) {
        tempCount = leftWordCountMap.get(uniqueWord);
        leftVector[index] = tempCount == null ? 0 : tempCount;
        tempCount = rightWordCountMap.get(uniqueWord);
        rightVector[index] = tempCount == null ? 0 : tempCount;
        index++;
    }
    return consineVectorSimilarity(leftVector, rightVector);
}

/**
 * The resulting similarity ranges from −1 meaning exactly opposite, to 1
 * meaning exactly the same, with 0 usually indicating independence, and
 * in-between values indicating intermediate similarity or dissimilarity.
 * 
 * For text matching, the attribute vectors A and B are usually the term
 * frequency vectors of the documents. The cosine similarity can be seen as
 * a method of normalizing document length during comparison.
 * 
 * In the case of information retrieval, the cosine similarity of two
 * documents will range from 0 to 1, since the term frequencies (tf-idf
 * weights) cannot be negative. The angle between two term frequency vectors
 * cannot be greater than 90°.
 * 
 * @param leftVector
 * @param rightVector
 * @return
 */
private static double consineVectorSimilarity(int[] leftVector,
        int[] rightVector) {
    if (leftVector.length != rightVector.length)
        return 1;
    double dotProduct = 0;
    double leftNorm = 0;
    double rightNorm = 0;
    for (int i = 0; i < leftVector.length; i++) {
        dotProduct += leftVector[i] * rightVector[i];
        leftNorm += leftVector[i] * leftVector[i];
        rightNorm += rightVector[i] * rightVector[i];
    }

    double result = dotProduct
            / (Math.sqrt(leftNorm) * Math.sqrt(rightNorm));
    return result;
}

public static void main(String[] args) {
    String left[] = { "Julie", "loves", "me", "more", "than", "Linda",
            "loves", "me" };
    String right[] = { "Jane", "likes", "me", "more", "than", "Julie",
            "loves", "me" };
    System.out.println(consineTextSimilarity(left,right));
}
}

其他回答

以@Bill Bell为例,在[R]中有两种方法

a = c(2,1,0,2,0,1,1,1)

b = c(2,1,1,1,1,0,1,1)

d = (a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

或者利用crossprod()方法的性能…

e = crossprod(a, b) / (sqrt(crossprod(a, a)) * sqrt(crossprod(b, b)))

为了简单起见,我化简了向量a和b:

Let :
    a : [1, 1, 0]
    b : [1, 0, 1]

那么余弦相似度(Theta)

 (Theta) = (1*1 + 1*0 + 0*1)/sqrt((1^2 + 1^2))* sqrt((1^2 + 1^2)) = 1/2 = 0.5

cos0.5的逆是60度。

这里有两篇很短的文章供比较:

朱莉比琳达更爱我 简喜欢我胜过朱莉爱我

我们想知道这些文本有多相似,纯粹从字数的角度来看(忽略词序)。我们首先列出了这两篇文章中的单词:

me Julie loves Linda than more likes Jane

现在我们来数一下这些单词在每篇文章中出现的次数:

   me   2   2
 Jane   0   1
Julie   1   1
Linda   1   0
likes   0   1
loves   2   1
 more   1   1
 than   1   1

我们对单词本身不感兴趣。我们只对 这两个垂直向量的计数。例如,有两个实例 每条短信里都有“我”。我们要决定这两篇文章之间的距离 另一种方法是计算这两个向量的一个函数,即cos 它们之间的夹角。

这两个向量是

a: [2, 0, 1, 1, 0, 2, 1, 1]

b: [2, 1, 1, 0, 1, 1, 1, 1]

两者夹角的余弦值约为0.822。

这些向量是8维的。使用余弦相似度的一个优点是显而易见的 它将一个超出人类想象能力的问题转化为一个问题 那是可以的。在这种情况下,你可以认为这个角大约是35度 度是指距离零或完全一致的距离。

这段Python代码是我实现算法的快速而肮脏的尝试:

import math
from collections import Counter

def build_vector(iterable1, iterable2):
    counter1 = Counter(iterable1)
    counter2 = Counter(iterable2)
    all_items = set(counter1.keys()).union(set(counter2.keys()))
    vector1 = [counter1[k] for k in all_items]
    vector2 = [counter2[k] for k in all_items]
    return vector1, vector2

def cosim(v1, v2):
    dot_product = sum(n1 * n2 for n1, n2 in zip(v1, v2) )
    magnitude1 = math.sqrt(sum(n ** 2 for n in v1))
    magnitude2 = math.sqrt(sum(n ** 2 for n in v2))
    return dot_product / (magnitude1 * magnitude2)


l1 = "Julie loves me more than Linda loves me".split()
l2 = "Jane likes me more than Julie loves me or".split()


v1, v2 = build_vector(l1, l2)
print(cosim(v1, v2))

两个向量A和B存在于二维空间或三维空间中,它们之间的夹角为cos相似度。

如果角度更大(可以达到最大180度),即Cos 180=-1,最小角度为0度。cos0 =1意味着向量是对齐的,因此向量是相似的。

cos 90=0(这足以得出向量A和B根本不相似,因为距离不能为负,余弦值将在0到1之间。因此,更多的角度意味着降低相似性(视觉化也有意义)