在scikit-learn线性回归中找到p值(显著性)

如何找到每个系数的p值(显著性)?

lm = sklearn.linear_model.LinearRegression()
lm.fit(x,y)

当前回答

在多变量回归的情况下，@JARH的答案可能有错误。 (我没有足够的声誉来评论。)

在下面一行:

p_values = [2 * (1-stats.t.cdf (np.abs(我),(len (newX) 1)))我在ts_b),

t值遵循degree len(newX)-1的卡方分布，而不是遵循degree len(newX)-len(newX.columns)-1的卡方分布。

所以这应该是:

p_values = [2 * (1-stats.t.cdf (np.abs(我),(len (newX) len (newX.columns) 1)))我在ts_b)

(详见OLS回归的t值)

2018-09-17 14:50:50

其他回答

elyase的答案https://stackoverflow.com/a/27928411/4240413中的代码实际上不起作用。注意，sse是一个标量，然后它尝试遍历它。下面的代码是修改后的版本。不是特别干净，但我觉得差不多能用。

class LinearRegression(linear_model.LinearRegression):

    def __init__(self,*args,**kwargs):
        # *args is the list of arguments that might go into the LinearRegression object
        # that we don't know about and don't want to have to deal with. Similarly, **kwargs
        # is a dictionary of key words and values that might also need to go into the orginal
        # LinearRegression object. We put *args and **kwargs so that we don't have to look
        # these up and write them down explicitly here. Nice and easy.

        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False

        super(LinearRegression,self).__init__(*args,**kwargs)

    # Adding in t-statistics for the coefficients.
    def fit(self,x,y):
        # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column
        # of constants.

        # Not totally sure what 'super' does here and why you redefine self...
        self = super(LinearRegression, self).fit(x,y)
        n, k = x.shape
        yHat = np.matrix(self.predict(x)).T

        # Change X and Y into numpy matricies. x also has a column of ones added to it.
        x = np.hstack((np.ones((n,1)),np.matrix(x)))
        y = np.matrix(y).T

        # Degrees of freedom.
        df = float(n-k-1)

        # Sample variance.     
        sse = np.sum(np.square(yHat - y),axis=0)
        self.sampleVariance = sse/df

        # Sample variance for x.
        self.sampleVarianceX = x.T*x

        # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root.  ugly)
        self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I)

        # Standard erros for the difference coefficients: the diagonal elements of the covariance matrix.
        self.se = self.covarianceMatrix.diagonal()[1:]

        # T statistic for each beta.
        self.betasTStat = np.zeros(len(self.se))
        for i in xrange(len(self.se)):
            self.betasTStat[i] = self.coef_[0,i]/self.se[i]

        # P-value for each beta. This is a two sided t-test, since the betas can be 
        # positive or negative.
        self.betasPValue = 1 - t.cdf(abs(self.betasTStat),df)

2015-01-16 00:46:25

你可以用scipy表示p值。此代码来自scipy文档。

>>> from scipy import stats >>>导入numpy为np x = np.random.random(10) y = np.random.random(10) >>>斜率，截距，r_value, p_value, std_err = stats. linreturn (x,y)

2017-10-24 13:57:03

在多变量回归的情况下，@JARH的答案可能有错误。 (我没有足够的声誉来评论。)

在下面一行:

p_values = [2 * (1-stats.t.cdf (np.abs(我),(len (newX) 1)))我在ts_b),

t值遵循degree len(newX)-1的卡方分布，而不是遵循degree len(newX)-len(newX.columns)-1的卡方分布。

所以这应该是:

p_values = [2 * (1-stats.t.cdf (np.abs(我),(len (newX) len (newX.columns) 1)))我在ts_b)

(详见OLS回归的t值)

2018-09-17 14:50:50

scikit-learn的线性回归不计算这个信息，但你可以很容易地扩展类来做:

from sklearn import linear_model
from scipy import stats
import numpy as np


class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculate t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit()
    are `t` and `p` which are of the shape (y.shape[1], X.shape[1])
    which is (n_features, n_coefs)
    This class sets the intercept to 0 by default, since usually we include it
    in X.
    """

    def __init__(self, *args, **kwargs):
        if not "fit_intercept" in kwargs:
            kwargs['fit_intercept'] = False
        super(LinearRegression, self)\
                .__init__(*args, **kwargs)

    def fit(self, X, y, n_jobs=1):
        self = super(LinearRegression, self).fit(X, y, n_jobs)

        sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1])
        se = np.array([
            np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X))))
                                                    for i in range(sse.shape[0])
                    ])

        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1]))
        return self

从这里偷来的。

您应该看一看Python中用于这种统计分析的统计模型。

2015-01-13 17:54:43

你可以用pingouin来写一行字。线性回归函数(免责声明:我是Pingouin的创建者)，它使用NumPy数组或Pandas DataFrame与单/多变量回归一起工作，例如:

import pingouin as pg
# Using a Pandas DataFrame `df`:
lm = pg.linear_regression(df[['x', 'z']], df['y'])
# Using a NumPy array:
lm = pg.linear_regression(X, y)

输出是一个数据框架，其中包含每个预测器的beta系数、标准误差、t值、p值和置信区间，以及拟合的R^2和调整后的R^2。

2020-04-28 23:05:21

在scikit-learn线性回归中找到p值(显著性)

推荐文章

最新文章

标签