Why the standard deviation (variance) divides by n-1

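As shown below, the standard deviation is the square root of the sum of squared deviations from the mean divided by n-1. The quantity inside the square root is the so-called unbiased variance. Many statistics blogs explain why the unbiased variance divides by n-1. Partly as a study exercise for myself, I will try to explain it as clearly as I can. Think of it as a piece of assigned reading.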

[math] \displaystyle Standard \space deviation=\sqrt{( \frac{1}{n-1}\sum_{i}^{n} {{(x_i - \bar x)}^2} )} [/math]
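As a quick sanity check, the formula above can be sketched in a few lines of Python and compared with the standard library. The data values and the helper name `sample_std` are made up for illustration; `statistics.stdev` divides by n-1 and `statistics.pstdev` divides by n.

```python
import math
import statistics


def sample_std(xs):
    """Standard deviation with the n-1 divisor, written out by hand."""
    n = len(xs)
    mean = sum(xs) / n
    squared_dev_sum = sum((x - mean) ** 2 for x in xs)
    return math.sqrt(squared_dev_sum / (n - 1))


# Made-up data, only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

manual = sample_std(data)
library = statistics.stdev(data)        # n-1 divisor
population = statistics.pstdev(data)    # n divisor

print(f"manual (n-1): {manual:.4f}")
print(f"stdev  (n-1): {library:.4f}")
print(f"pstdev (n)  : {population:.4f}")
```

The n-1 version always comes out slightly larger than the n version for the same data, which is the gap the rest of this article explains.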

Deviation from the sample mean versus deviation from the true mean

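From here on, the discussion is in terms of the variance, the quantity inside the square root. First, instead of the unbiased variance, consider the sample variance that divides by n. As shown below, it is computed from the deviations from the sample mean [math] \bar x [/math]. Call it [math] V^2 [/math].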

[math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - \bar x)}^2} = V^2 [/math]

If the true mean [math] \mu [/math] is known, the variance is computed from the deviations from [math] \mu [/math], as below. This is the original definition. Call it [math] S^2 [/math].

[math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - \mu )}^2} = S^2 [/math]

Where do the two differ? What gap appears when the sample mean stands in for the true mean?

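The figure below sketches the true mean and the sample mean against the spread of the data. Because the sample mean is the mean of the data actually obtained, it tends to sit closer to the center of that data than the true mean does. As a result, the sample variance comes out smaller than the true variance.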

f:id:OceanOne:20200715013620j:plain:w400

In fact, consider the value of [math]X[/math] that minimizes the following expression:

[math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - X)}^2} [/math]

It is minimized where its derivative with respect to [math]X[/math] is zero, so

[math] \displaystyle -\frac{2}{n}\sum_{i}^{n} {(x_i - X)} = 0 [/math]
[math] \displaystyle \sum_{i}^{n} {(x_i - X)} = 0 [/math]
[math] \displaystyle \sum_{i}^{n} {x_i} - nX = 0 [/math]
[math] \displaystyle nX = \sum_{i}^{n} {x_i} [/math]
[math] \displaystyle X = \frac{1}{n}\sum_{i}^{n} {x_i} = \bar x [/math]

In other words, the mean squared deviation is smallest when it is taken around the sample mean, so the following holds.

[math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - \mu)}^2} \ge \frac{1}{n}\sum_{i}^{n} {{(x_i - \bar x)}^2} [/math]
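The minimization argument can also be checked numerically. The sketch below scans candidate centers on a grid and picks the one with the smallest mean squared deviation. The data, the grid range, and the helper name `mean_sq_dev` are assumptions chosen for illustration.

```python
def mean_sq_dev(xs, center):
    """Mean squared deviation of xs around an arbitrary center X."""
    return sum((x - center) ** 2 for x in xs) / len(xs)


# Made-up data, only for illustration.
data = [1.2, -0.5, 0.3, 2.1, -1.0]
x_bar = sum(data) / len(data)

# Candidate centers from -3.00 to 3.00 in steps of 0.01.
grid = [k / 100 for k in range(-300, 301)]
best = min(grid, key=lambda c: mean_sq_dev(data, c))

print(f"sample mean    = {x_bar:.2f}")
print(f"grid minimizer = {best:.2f}")
```

Any other center on the grid, including whatever the true mean happens to be, gives a value at least as large as the one at the sample mean.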

As a concrete check, take the following data and the expression

f:id:OceanOne:20200716013024j:plain:w100 [math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - X)}^2} [/math]

Evaluating the expression above with [math] X [/math] on the horizontal axis gives the result below.

f:id:OceanOne:20200716013920j:plain:w300

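The mean of this data is 0.267. A quadratic [math] aX^2 + bX + c [/math] has its extremum at [math] \displaystyle -\frac{b}{2a} [/math], and with [math] a = 9.0 [/math] and [math] b = -4.8 [/math] from the quadratic fit in the graph this gives [math] \displaystyle -\frac{-4.8}{2 \times 9.0}=0.267 [/math], so the minimum is indeed at the sample mean.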
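The same vertex calculation can be done without a curve fit, because expanding [math] \sum_{i}^{n} {(x_i - X)}^2 [/math] gives a quadratic in [math] X [/math] with [math] a = n [/math] and [math] b = -2\sum_{i}^{n} x_i [/math]. The sketch below uses hypothetical data, not the data from the image above, and computes the coefficients directly.

```python
# Hypothetical data; the article's actual data is only available as an image.
data = [0.9, -0.4, 0.1, 1.5, -0.8, 0.3]
n = len(data)

# sum((x_i - X)^2) = a*X^2 + b*X + c
a = n
b = -2 * sum(data)
c = sum(x * x for x in data)

vertex = -b / (2 * a)
x_bar = sum(data) / n

print(f"a = {a}, b = {b:.3f}, c = {c:.3f}")
print(f"vertex -b/(2a) = {vertex:.4f}")
print(f"sample mean    = {x_bar:.4f}")
```

Since [math] b [/math] carries the minus sign, [math] -b/(2a) [/math] reduces to [math] \sum x_i / n [/math], which is the sample mean.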

Working through the equations

We rewrite the sum of squared deviations from the sample mean, shown below.

[math] \displaystyle \sum_{i}^{n} {{(x_i - \bar x)}^2} [/math]

[math] \displaystyle = \sum_{i}^{n} {{[ (x_i - \mu) + (\mu- \bar x) ]}^2} [/math]

[math] \displaystyle = \sum_{i}^{n} {{(x_i - \mu)}^2} + 2\sum_{i}^{n} {(x_i - \mu)(\mu- \bar x)} + \sum_{i}^{n} {{( \bar x- \mu)}^2} [/math]

[math] \displaystyle = \sum_{i}^{n} {{(x_i - \mu)}^2} - 2(\bar x-\mu)\sum_{i}^{n} {(x_i - \mu)} + n{( \bar x- \mu)}^2 [/math]

[math] \displaystyle = \sum_{i}^{n} {{(x_i - \mu)}^2} - 2(\bar x-\mu) (\sum_{i}^{n}{x_i} - n\mu) + n{( \bar x- \mu)}^2 [/math]

[math] \displaystyle = \sum_{i}^{n} {{(x_i - \mu)}^2} - 2(\bar x-\mu) (n \bar x - n\mu) + n{( \bar x- \mu)}^2 [/math]

[math] \displaystyle = \sum_{i}^{n} {{(x_i - \mu)}^2} - 2n{(\bar x-\mu)}^2 + n{( \bar x- \mu)}^2 [/math]

[math] \displaystyle = \sum_{i}^{n} {{(x_i - \mu)}^2} - n{(\bar x-\mu)}^2 [/math]

That is,

[math] \displaystyle \sum_{i}^{n} {{(x_i - \bar x)}^2} = \sum_{i}^{n} {{(x_i - \mu)}^2} - n{(\bar x-\mu)}^2 [/math]

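Moving the second term on the right-hand side to the left, swapping the two sides, and dividing by [math] n [/math] gives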

[math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - \mu)}^2} = \frac{1}{n}\sum_{i}^{n} {{(x_i - \bar x)}^2} + {(\bar x-\mu)}^2 [/math]

So the two quantities from the previous section,

[math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - \bar x)}^2} [/math]

and

[math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - \mu )}^2} [/math]

differ by

[math] \displaystyle {(\bar x-\mu)}^2 [/math]

with the former smaller by exactly this amount.

In terms of [math] S^2 [/math] and [math] V^2 [/math],

[math] \displaystyle S^2 = V^2 + {(\bar x-\mu)}^2 [/math]
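This identity holds exactly for any data set and any choice of [math] \mu [/math], so it is easy to confirm numerically. In the sketch below the data and the value of `mu` are assumptions for illustration only.

```python
# Made-up data and an assumed "true" mean, only for illustration.
data = [1.2, -0.5, 0.3, 2.1, -1.0, 0.8]
mu = 0.5

n = len(data)
x_bar = sum(data) / n

s2 = sum((x - mu) ** 2 for x in data) / n      # deviations from the true mean
v2 = sum((x - x_bar) ** 2 for x in data) / n   # deviations from the sample mean
gap = (x_bar - mu) ** 2

print(f"S^2                  = {s2:.6f}")
print(f"V^2 + (x_bar - mu)^2 = {v2 + gap:.6f}")
```

Both printed values agree up to floating-point rounding, and [math] V^2 [/math] alone is never larger than [math] S^2 [/math].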

What the difference [math] \displaystyle {(\bar x-\mu)}^2 [/math] means

[math] \displaystyle {(\bar x-\mu)}^2 [/math]

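The expression above is the squared difference between the sample mean and the true mean. Its expected value is the variance of the sample mean, that is, how much the sample mean itself scatters.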

[math] \displaystyle {(\bar x-\mu)}^2 [/math]
[math] \displaystyle = \frac{1}{n^2} {(n \bar x- n \mu)}^2 [/math]
[math] \displaystyle = \frac{1}{n^2} {(n \cdot \frac{1}{n} \sum_{i}^{n} {x_i} - \sum_{i}^{n} {\mu})}^2 [/math]
[math] \displaystyle = \frac{1}{n^2} {( \sum_{i}^{n} {x_i} - \sum_{i}^{n} {\mu})}^2 [/math]
[math] \displaystyle = \frac{1}{n^2} {( \sum_{i}^{n} {( x_i - \mu )} )}^2 [/math]
[math] \displaystyle = \frac{1}{n^2} {( \sum_{i}^{n} {( x_i - \mu )^2} + \sum_{i \neq j} {( x_i - \mu )( x_j - \mu )} )} [/math]
[math] \displaystyle = \frac{1}{n^2} {( \sum_{i}^{n} {( x_i - \mu )^2} + \sum_{i \neq j} {( x_i x_j - x_i \mu - x_j \mu + {\mu}^2 )} )} [/math]

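At this point we have to bring in expected values. Assuming the observations are independent, [math] x_i x_j [/math] with [math] i \neq j [/math] has expected value [math] \mu \mu [/math], and [math] x_i \mu [/math] and [math] x_j \mu [/math] each have expected value [math] \mu \mu [/math]. The second term on the right-hand side of the last line above has [math] n^2 - n [/math] such cross terms, so its expected value is zero: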

[math] \displaystyle (n^2 - n) {( \mu \mu - \mu \mu - \mu \mu + {\mu}^2 )} = 0 [/math]

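Only the first term remains, and since [math] \displaystyle \frac{1}{n}\sum_{i}^{n} {{(x_i - \mu )}^2} = S^2 [/math], the first term equals [math] \displaystyle \frac{S^2}{n} [/math].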
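A small simulation can illustrate that [math] {(\bar x-\mu)}^2 [/math] averages out to the true variance divided by [math] n [/math]. This is only a sketch under stated assumptions: independent draws from a normal distribution with mean 0 and standard deviation 1, a fixed seed, and a hypothetical helper name `mean_sq_error_of_mean`.

```python
import random


def mean_sq_error_of_mean(n, trials, mu=0.0, sigma=1.0, seed=0):
    """Average of (x_bar - mu)^2 over many independent samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (x_bar - mu) ** 2
    return total / trials


n = 5
estimate = mean_sq_error_of_mean(n, trials=20000)

print(f"simulated average of (x_bar - mu)^2: {estimate:.4f}")
print(f"sigma^2 / n                        : {1.0 / n:.4f}")
```

The simulated average should land close to 0.2 for these settings, with some Monte Carlo noise.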

In the end, the relation between [math] S^2 [/math] and [math] V^2 [/math], understood as holding in expectation, is

[math] \displaystyle S^2 = V^2 + \frac{S^2}{n} [/math]
[math] \displaystyle S^2 - \frac{S^2}{n} = V^2 [/math]
[math] \displaystyle \frac{ (n-1) S^2}{n} = V^2 [/math]
[math] \displaystyle S^2 = \frac{n}{n-1} {V^2} [/math]
[math] \displaystyle S^2 = \frac{n}{n-1} \frac{1}{n}\sum_{i}^{n} {{(x_i - \bar x)}^2} [/math]
[math] \displaystyle S^2 = \frac{1}{n-1} \sum_{i}^{n} {{(x_i - \bar x)}^2} [/math]

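This is called the unbiased variance: it corrects the bias that comes from measuring deviations from the sample mean instead of the true mean.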
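The effect of the correction can be seen in a simulation. The sketch below assumes independent draws from a normal distribution with true variance 1 and a sample size of 5. It averages both estimators over many samples, using `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n-1).

```python
import random
import statistics

rng = random.Random(0)      # fixed seed so the run is repeatable
n, trials = 5, 20000
mu, sigma = 0.0, 1.0        # assumed true mean and standard deviation

biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    biased_total += statistics.pvariance(sample)    # divides by n
    unbiased_total += statistics.variance(sample)   # divides by n - 1

mean_biased = biased_total / trials
mean_unbiased = unbiased_total / trials

print(f"average with divisor n   : {mean_biased:.4f}")
print(f"average with divisor n-1 : {mean_unbiased:.4f}")
print(f"true variance            : {sigma ** 2:.4f}")
```

With these settings the n-divisor average should sit near (n-1)/n = 0.8 of the true variance, while the n-1 average should sit near 1.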