我正在寻找一个函数,它将两个列表作为输入,并返回Pearson相关性,以及相关性的重要性。
当前回答
嗯,很多回复的代码都很长,很难读…
我建议在处理数组时使用numpy及其漂亮的特性:
import numpy as np
def pcc(X, Y):
''' Compute Pearson Correlation Coefficient. '''
# Normalise X and Y
X -= X.mean(0)
Y -= Y.mean(0)
# Standardise X and Y
X /= X.std(0)
Y /= Y.std(0)
# Compute mean product
return np.mean(X*Y)
# Using it on a random example
from random import random
X = np.array([random() for x in xrange(100)])
Y = np.array([random() for x in xrange(100)])
pcc(X, Y)
其他回答
你可以看看scipy.stats:
from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)
>>>
Help on function pearsonr in module scipy.stats.stats:
pearsonr(x, y)
Calculates a Pearson correlation coefficient and the p-value for testing
non-correlation.
The Pearson correlation coefficient measures the linear relationship
between two datasets. Strictly speaking, Pearson's correlation requires
that each dataset be normally distributed. Like other correlation
coefficients, this one varies between -1 and +1 with 0 implying no
correlation. Correlations of -1 or +1 imply an exact linear
relationship. Positive correlations imply that as x increases, so does
y. Negative correlations imply that as x increases, y decreases.
The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets. The p-values are not entirely
reliable but are probably reasonable for datasets larger than 500 or so.
Parameters
----------
x : 1D array
y : 1D array the same length as x
Returns
-------
(Pearson's correlation coefficient,
2-tailed p-value)
References
----------
http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
一个替代方法可以是一个来自linreturn的本地scipy函数,它计算:
斜率:回归线的斜率 截距:回归线的截距 R-value:相关系数 p值:零假设为斜率为零的假设检验的双面p值 stderr:估计的标准错误
这里有一个例子:
a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)
会回复你:
LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
你可以用pandas.DataFrame这样做。相关系数:
import pandas as pd
a = [[1, 2, 3],
[5, 6, 9],
[5, 6, 11],
[5, 6, 13],
[5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()
这给了
0 1 2
0 1.000000 0.745601 0.916579
1 0.745601 1.000000 0.544248
2 0.916579 0.544248 1.000000
如果你不喜欢安装scipy,我使用了这个快速的hack,稍微修改了Programming Collective Intelligence:
def pearsonr(x, y):
# Assume len(x) == len(y)
n = len(x)
sum_x = float(sum(x))
sum_y = float(sum(y))
sum_x_sq = sum(xi*xi for xi in x)
sum_y_sq = sum(yi*yi for yi in y)
psum = sum(xi*yi for xi, yi in zip(x, y))
num = psum - (sum_x * sum_y/n)
den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
if den == 0: return 0
return num / den
计算相关:
相关性-衡量两个不同变量的相似性
使用皮尔逊相关
from scipy.stats import pearsonr
# final_data is the dataframe with n set of columns
pearson_correlation = final_data.corr(method='pearson')
pearson_correlation
# print correlation of n*n column
使用斯皮尔曼相关
from scipy.stats import spearmanr
# final_data is the dataframe with n set of columns
spearman_correlation = final_data.corr(method='spearman')
spearman_correlation
# print correlation of n*n column
使用Kendall相关
kendall_correlation=final_data.corr(method='kendall')
kendall_correlation
推荐文章
- 如何读取文件的前N行?
- 如何删除matplotlib中的顶部和右侧轴?
- 解析.py文件,读取AST,修改它,然后写回修改后的源代码
- Visual Studio Code:如何调试Python脚本的参数
- 使用元组/列表等等。从输入vs直接引用类型如list/tuple/etc
- 结合conda环境。Yml和PIP requirements.txt
- 将命名元组转换为字典
- 如何使x轴和y轴的刻度相等呢?
- Numpy在这里函数多个条件
- 在Python中,使用argparse只允许正整数
- 如何排序mongodb与pymongo
- 不可变与可变类型
- 列表是线程安全的吗?
- 操作系统。makdirs在我的路径上不理解“~”
- 如何在Django模板中获得我的网站的域名?