Contrasts Overview¶
[1]:
import numpy as np
import statsmodels.api as sm
This document is based heavily on this excellent resource from UCLA: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
A categorical variable of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. This amounts to a linear hypothesis on the level means. That is, each test statistic for these variables amounts to testing whether the mean for that level is statistically significantly different from the mean of the base category. This dummy coding is called Treatment coding in R parlance, and we will follow this convention. There are, however, different coding methods that amount to different sets of linear hypotheses.
In fact, dummy coding is not technically a contrast coding. This is because the dummy variables add to one and are not functionally independent of the model's intercept. On the other hand, a set of contrasts for a categorical variable with k levels is a set of k-1 functionally independent linear combinations of the factor level means that are also independent of the sum of the dummy variables. Dummy coding is not wrong per se. It captures all of the coefficients, but it complicates matters when the model assumes independence of the coefficients, such as in ANOVA. Linear regression models do not assume independence of the coefficients, so dummy coding is often the only coding taught in this context.
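As a minimal sketch of this point (assuming a toy three-level factor), the full set of dummy columns sums to one in every row, exactly reproducing an intercept column:

```python
import numpy as np

dummies = np.eye(3)  # full dummy coding for a hypothetical 3-level factor
print(dummies.sum(axis=1))  # [1. 1. 1.] -- identical to the intercept column
```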
To have a look at the contrast matrices in Patsy, we will use data from UCLA ATS. First let's load the data.
Example Data¶
[2]:
import pandas as pd
url = "https://stats.idre.ucla.edu/stat/data/hsb2.csv"
hsb2 = pd.read_table(url, delimiter=",")
[3]:
hsb2.head(10)
[3]:
|   | id | female | race | ses | schtyp | prog | read | write | math | science | socst |
|---|----|--------|------|-----|--------|------|------|-------|------|---------|-------|
| 0 | 70 | 0 | 4 | 1 | 1 | 1 | 57 | 52 | 41 | 47 | 57 | 
| 1 | 121 | 1 | 4 | 2 | 1 | 3 | 68 | 59 | 53 | 63 | 61 | 
| 2 | 86 | 0 | 4 | 3 | 1 | 1 | 44 | 33 | 54 | 58 | 31 | 
| 3 | 141 | 0 | 4 | 3 | 1 | 3 | 63 | 44 | 47 | 53 | 56 | 
| 4 | 172 | 0 | 4 | 2 | 1 | 2 | 47 | 52 | 57 | 53 | 61 | 
| 5 | 113 | 0 | 4 | 2 | 1 | 2 | 44 | 52 | 51 | 63 | 61 | 
| 6 | 50 | 0 | 3 | 2 | 1 | 1 | 50 | 59 | 42 | 53 | 61 | 
| 7 | 11 | 0 | 1 | 2 | 1 | 2 | 34 | 46 | 45 | 39 | 36 | 
| 8 | 84 | 0 | 4 | 2 | 1 | 1 | 63 | 57 | 54 | 58 | 51 | 
| 9 | 48 | 0 | 3 | 2 | 1 | 2 | 57 | 55 | 52 | 50 | 51 | 
It will be instructive to look at the mean of the dependent variable, write, for each level of race (1 = Hispanic, 2 = Asian, 3 = African American and 4 = Caucasian).
[4]:
hsb2.groupby("race")["write"].mean()
[4]:
race
1    46.458333
2    58.000000
3    48.200000
4    54.055172
Name: write, dtype: float64
Treatment (Dummy) Coding¶
Dummy coding is likely the most well known coding scheme. It compares each level of the categorical variable to a base reference level. The base reference level is the value of the intercept. It is the default contrast in Patsy for unordered categorical factors. The Treatment contrast matrix for race would be
[5]:
from patsy.contrasts import Treatment
levels = [1, 2, 3, 4]
contrast = Treatment(reference=0).code_without_intercept(levels)
print(contrast.matrix)
[[0. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Here we used reference=0, which implies that the first level, Hispanic, is the reference category against which the other level effects are measured. As mentioned above, the columns do not sum to zero and are thus not independent of the intercept.
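A quick check (a small sketch reusing the `contrast` object from the cell above) confirms the nonzero column sums:

```python
# Each column of the treatment contrast matrix sums to one, not zero,
# so the columns are not independent of the intercept.
print(contrast.matrix.sum(axis=0))  # [1. 1. 1.]
```

To be explicit, let's look at how this would encode the race variable.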
[6]:
hsb2.race.head(10)
[6]:
0    4
1    4
2    4
3    4
4    4
5    4
6    3
7    1
8    4
9    3
Name: race, dtype: int64
[7]:
print(contrast.matrix[hsb2.race - 1, :][:20])
[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]
[8]:
pd.get_dummies(hsb2.race.values, drop_first=False)
[8]:
|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | False | False | False | True | 
| 1 | False | False | False | True | 
| 2 | False | False | False | True | 
| 3 | False | False | False | True | 
| 4 | False | False | False | True | 
| ... | ... | ... | ... | ... | 
| 195 | False | True | False | False | 
| 196 | False | False | False | True | 
| 197 | False | False | False | True | 
| 198 | False | False | False | True | 
| 199 | False | False | False | True | 
200 rows × 4 columns
This is a bit of a trick, as the race category conveniently maps to zero-based indices. If it did not, this conversion would happen under the hood, so this will not work in general, but it is nonetheless a useful exercise to fix ideas.
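For level values that are not already zero-based, the equivalent mapping can be sketched with pandas categorical codes (an illustration of the under-the-hood step, not Patsy's literal implementation):

```python
# pd.Categorical assigns 0-based codes to the sorted levels, which can then
# index rows of the contrast matrix directly.
codes = pd.Categorical(hsb2.race).codes
print(contrast.matrix[codes[:5], :])  # matches the first rows of the slice above
```

The following illustrates the output obtained using the three contrasts above.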
[9]:
from statsmodels.formula.api import ols
mod = ols("write ~ C(race, Treatment)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  46.4583      1.842     25.218      0.000      42.825      50.091
C(race, Treatment)[T.2]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Treatment)[T.3]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Treatment)[T.4]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         8.25
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We explicitly gave the contrast for race; however, since Treatment is the default, we could have omitted it.
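A quick sketch confirms that the shorter formula gives an identical fit:

```python
# C(race) with no explicit contrast falls back to treatment coding,
# so the parameter estimates match the summary above.
res_default = ols("write ~ C(race)", data=hsb2).fit()
print(res_default.params)
```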
Simple Coding¶
Like Treatment coding, Simple coding compares each level to a fixed reference level. However, with simple coding, the intercept is the grand mean of all the levels of the factors. Patsy does not include the simple contrast, but you can easily define your own contrasts. To do so, write a class that contains a code_with_intercept and a code_without_intercept method that return a patsy.contrast.ContrastMatrix instance
[10]:
from patsy.contrasts import ContrastMatrix
def _name_levels(prefix, levels):
    return ["[%s%s]" % (prefix, level) for level in levels]
class Simple(object):
    def _simple_contrast(self, levels):
        # Fill the matrix with -1/k, then set the (shifted) diagonal to
        # (k-1)/k so that each column contrasts one level with the first
        # (reference) level and every column sums to zero.
        nlevels = len(levels)
        contr = -1.0 / nlevels * np.ones((nlevels, nlevels - 1))
        contr[1:][np.diag_indices(nlevels - 1)] = (nlevels - 1.0) / nlevels
        return contr
    def code_with_intercept(self, levels):
        # Prepend a column of ones so the intercept is coded explicitly.
        contrast = np.column_stack(
            (np.ones(len(levels)), self._simple_contrast(levels))
        )
        return ContrastMatrix(contrast, _name_levels("Simp.", levels))
    def code_without_intercept(self, levels):
        contrast = self._simple_contrast(levels)
        return ContrastMatrix(contrast, _name_levels("Simp.", levels[:-1]))
[11]:
hsb2.groupby("race")["write"].mean().mean()
[11]:
np.float64(51.67837643678162)
[12]:
contrast = Simple().code_without_intercept(levels)
print(contrast.matrix)
[[-0.25 -0.25 -0.25]
 [ 0.75 -0.25 -0.25]
 [-0.25  0.75 -0.25]
 [-0.25 -0.25  0.75]]
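Unlike the treatment columns checked earlier, these columns do sum to zero, which is what makes them a proper set of contrasts:

```python
print(contrast.matrix.sum(axis=0))  # [0. 0. 0.]
```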
[13]:
mod = ols("write ~ C(race, Simple)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Simple)[Simp.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Simple)[Simp.2]     1.7417      2.732      0.637      0.525      -3.647       7.131
C(race, Simple)[Simp.3]     7.5968      1.989      3.820      0.000       3.675      11.519
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Sum (Deviation) Coding¶
Sum coding compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others.
[14]:
from patsy.contrasts import Sum
contrast = Sum().code_without_intercept(levels)
print(contrast.matrix)
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [-1. -1. -1.]]
[15]:
mod = ols("write ~ C(race, Sum)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Sum)[S.1]    -5.2200      1.631     -3.200      0.002      -8.437      -2.003
C(race, Sum)[S.2]     6.3216      2.160      2.926      0.004       2.061      10.582
C(race, Sum)[S.3]    -3.4784      1.732     -2.008      0.046      -6.895      -0.062
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         6.72
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This corresponds to a parameterization that forces all the coefficients to sum to zero. Notice that the intercept here is the grand mean, where the grand mean is the mean of means of the dependent variable by each level.
[16]:
hsb2.groupby("race")["write"].mean().mean()
[16]:
np.float64(51.67837643678162)
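As a consistency check (a small sketch reusing the fitted Sum model `res` from above), the effect implied for the omitted level 4 is the negative sum of the three reported coefficients, and it equals the level-4 mean minus the grand mean:

```python
implied_4 = -res.params.iloc[1:].sum()  # sum-to-zero constraint on the effects
grand_mean = hsb2.groupby("race")["write"].mean().mean()
print(implied_4)                                              # ~2.3768
print(hsb2.loc[hsb2.race == 4, "write"].mean() - grand_mean)  # ~2.3768
```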
Backward Difference Coding¶
In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
[17]:
from patsy.contrasts import Diff
contrast = Diff().code_without_intercept(levels)
print(contrast.matrix)
[[-0.75 -0.5  -0.25]
 [ 0.25 -0.5  -0.25]
 [ 0.25  0.5  -0.25]
 [ 0.25  0.5   0.75]]
[18]:
mod = ols("write ~ C(race, Diff)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Diff)[D.1]    11.5417      3.286      3.512      0.001       5.061      18.022
C(race, Diff)[D.2]    -9.8000      3.388     -2.893      0.004     -16.481      -3.119
C(race, Diff)[D.3]     5.8552      2.153      2.720      0.007       1.610      10.101
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         8.30
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
For example, here the coefficient on level 1 is the mean of write at level 2 compared with the mean at level 1. That is,
[19]:
res.params["C(race, Diff)[D.1]"]
hsb2.groupby("race").mean()["write"][2] - hsb2.groupby("race").mean()["write"][1]
[19]:
np.float64(11.541666666666664)
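More generally (a small sketch using the fitted Diff model `res`), each coefficient reproduces one backward difference of the level means:

```python
means = hsb2.groupby("race")["write"].mean()
print(means.diff().dropna().values)  # [11.5417, -9.8, 5.8552]
print(res.params.iloc[1:].values)    # the three Diff coefficients, identical
```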
Helmert Coding¶
Our version of Helmert coding is sometimes referred to as reverse Helmert coding. The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Hence, the name "reverse" is sometimes applied to differentiate it from forward Helmert coding. This comparison does not make much sense for a nominal variable such as race, but we will use the Helmert contrast like so
[20]:
from patsy.contrasts import Helmert
contrast = Helmert().code_without_intercept(levels)
print(contrast.matrix)
[[-1. -1. -1.]
 [ 1. -1. -1.]
 [ 0.  2. -1.]
 [ 0.  0.  3.]]
[21]:
mod = ols("write ~ C(race, Helmert)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.107
Model:                            OLS   Adj. R-squared:                  0.093
Method:                 Least Squares   F-statistic:                     7.833
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.78e-05
Time:                        15:50:53   Log-Likelihood:                -721.77
No. Observations:                 200   AIC:                             1452.
Df Residuals:                     196   BIC:                             1465.
Df Model:                           3
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                51.6784      0.982     52.619      0.000      49.741      53.615
C(race, Helmert)[H.2]     5.7708      1.643      3.512      0.001       2.530       9.011
C(race, Helmert)[H.3]    -1.3431      0.867     -1.548      0.123      -3.054       0.368
C(race, Helmert)[H.4]     0.7923      0.372      2.130      0.034       0.059       1.526
==============================================================================
Omnibus:                       10.487   Durbin-Watson:                   1.779
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.031
Skew:                          -0.551   Prob(JB):                      0.00402
Kurtosis:                       2.670   Cond. No.                         7.26
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To illustrate, the comparison on level 4 is the mean of the dependent variable at the previous three levels taken from the mean at level 4
[22]:
grouped = hsb2.groupby("race")
grouped.mean()["write"].loc[4] - grouped.mean()["write"].loc[:3].mean()
[22]:
np.float64(3.169061302681982)
[23]:
grouped.mean()["write"]
[23]:
race
1    46.458333
2    58.000000
3    48.200000
4    54.055172
Name: write, dtype: float64
As you can see, these are only equal up to a constant. Other versions of the Helmert contrast give the actual difference in means. Regardless, the hypothesis tests are the same.
[24]:
k = 4
c4 = 1.0 / k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[: k - 1].mean())
k = 3
c3 = 1.0 / k * (grouped.mean()["write"].loc[k] - grouped.mean()["write"].loc[: k - 1].mean())
print(c4, c3)
0.7922653256704955 -1.3430555555555561
Orthogonal Polynomial Coding¶
The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. The categorical variable here is assumed to be represented by an underlying, equally spaced, numeric variable. Therefore, this type of coding is used only for ordered categorical variables with equal spacing. In general, the polynomial contrast produces polynomials of order k-1. Since race is not an ordered factor variable, let's use read as an example. First we need to create an ordered categorical variable out of read.
[25]:
hsb2["readcat"] = np.asarray(pd.cut(hsb2.read, bins=4))
hsb2["readcat"] = hsb2["readcat"].astype(object)
hsb2.groupby("readcat").mean()["write"]
[25]:
readcat
(27.952, 40.0]    42.772727
(40.0, 52.0]      49.978495
(52.0, 64.0]      56.563636
(64.0, 76.0]      61.833333
Name: write, dtype: float64
[26]:
from patsy.contrasts import Poly
levels = hsb2.readcat.unique()
contrast = Poly().code_without_intercept(levels)
print(contrast.matrix)
[[-0.67082039  0.5        -0.2236068 ]
 [-0.2236068  -0.5         0.67082039]
 [ 0.2236068  -0.5        -0.67082039]
 [ 0.67082039  0.5         0.2236068 ]]
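A quick check (a small sketch) shows why this coding is called orthogonal: the contrast columns are orthonormal, so their cross-product matrix is the identity:

```python
X = contrast.matrix
print(np.round(X.T @ X, 10))  # 3x3 identity, up to floating point
```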
[27]:
mod = ols("write ~ C(readcat, Poly)", data=hsb2)
res = mod.fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  write   R-squared:                       0.346
Model:                            OLS   Adj. R-squared:                  0.336
Method:                 Least Squares   F-statistic:                     34.51
Date:                Thu, 03 Oct 2024   Prob (F-statistic):           5.95e-18
Time:                        15:50:53   Log-Likelihood:                -690.69
No. Observations:                 200   AIC:                             1389.
Df Residuals:                     196   BIC:                             1403.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                     52.7870      0.634     83.268      0.000      51.537      54.037
C(readcat, Poly).Linear       14.2587      1.484      9.607      0.000      11.332      17.186
C(readcat, Poly).Quadratic    -0.9680      1.268     -0.764      0.446      -3.468       1.532
C(readcat, Poly).Cubic        -0.1554      1.006     -0.154      0.877      -2.140       1.829
==============================================================================
Omnibus:                        4.467   Durbin-Watson:                   1.768
Prob(Omnibus):                  0.107   Jarque-Bera (JB):                4.289
Skew:                          -0.307   Prob(JB):                        0.117
Kurtosis:                       2.628   Cond. No.                         3.01
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
As you can see, readcat has a significant linear effect on the dependent variable write, but the quadratic and cubic effects are not significant.