我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。


当前回答

你可以用pandas.DataFrame这样做。相关系数:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

这给了

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

其他回答

从Python 3.10开始,Pearson的相关系数(statistics.correlation)可以直接在标准库中获得:

from statistics import correlation

# a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
# b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
correlation(a, b)
# 0.1449981545806852

你可以看看scipy.stats:

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

您可能想知道如何在寻找特定方向的相关性(负相关或正相关)的上下文中解释您的概率。这是我写的一个函数。它甚至可能是正确的!

这是基于我从http://www.vassarstats.net/rsig.html和http://en.wikipedia.org/wiki/Student%27s_t_distribution上收集到的信息,感谢这里发布的其他答案。

# Given (possibly random) variables, X and Y, and a correlation direction,
# returns:
#  (r, p),
# where r is the Pearson correlation coefficient, and p is the probability
# that there is no correlation in the given direction.
#
# direction:
#  if positive, p is the probability that there is no positive correlation in
#    the population sampled by X and Y
#  if negative, p is the probability that there is no negative correlation
#  if 0, p is the probability that there is no correlation in either direction
def probabilityNotCorrelated(X, Y, direction=0):
    x = len(X)
    if x != len(Y):
        raise ValueError("variables not same len: " + str(x) + ", and " + \
                         str(len(Y)))
    if x < 6:
        raise ValueError("must have at least 6 samples, but have " + str(x))
    (corr, prb_2_tail) = stats.pearsonr(X, Y)

    if not direction:
        return (corr, prb_2_tail)

    prb_1_tail = prb_2_tail / 2
    if corr * direction > 0:
        return (corr, prb_1_tail)

    return (corr, 1 - prb_1_tail)

你可以看看这篇文章。这是一个使用pandas库(适用于Python)根据多个文件的历史外汇货币对数据计算相关性的示例,然后使用seaborn库生成热图图。

http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/

如果你不喜欢安装scipy,我使用了这个快速的hack,稍微修改了Programming Collective Intelligence:

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(xi*xi for xi in x)
  sum_y_sq = sum(yi*yi for yi in y)
  psum = sum(xi*yi for xi, yi in zip(x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den