Calculate the variance of a sample from the variances of its subsamples

Mar 12, 2016 00:00 · 128 words · 1 minute read statistics variance notes

Calculating the mean of a sample from the means of its subsamples is pretty straightforward¹.
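For reference, written in the notation used in the rest of this note (\(g\) subsamples with sizes \(k_j\), means \(\bar{X}_j\), and \(n=\sum k_j\) values in total), the overall mean is the size-weighted average of the subsample means:

\[ \bar{X} = \frac{1}{n}\sum_{j=1}^{g} k_j\bar{X}_j \]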

Calculating the variance of a sample from the variances of its subsamples, however, didn't seem as straightforward to me. It turns out there is a formula for that. Assuming we have \(g\) subsamples, where subsample \(j\) (\(j=1,...,g\)) has \(k_j\) elements, for a total of \(n=\sum k_j\) values, then:

\[ Var(X_1,...,X_n) = \frac{1}{n-1}(\sum_{j=1}^{g} (k_j-1)Var_j + \sum_{j=1}^{g}k_j(\bar{X}_j-\bar{X})^2) \]

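Here \(Var\) is the sample variance (the \(k_j-1\) and \(n-1\) factors imply the unbiased estimator), \(Var_j\) and \(\bar{X}_{j}\) are the variance and the mean of subsample \(j\), and \(\bar{X}\) is the mean of the whole sample.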
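As a quick sanity check, here is a minimal Python sketch of the formula above. The function name `combine_variance`, the `(size, mean, variance)` tuple layout, and the example numbers are my own choices for illustration, not part of the original formula. It assumes every subsample variance was computed with the \(k_j-1\) denominator, which is what `statistics.variance` does.

```python
import math
from statistics import mean, variance


def combine_variance(summaries):
    """Combine (size, mean, variance) summaries of subsamples into the
    variance of the whole sample, assuming unbiased (k - 1) variances."""
    n = sum(k for k, _, _ in summaries)
    # Overall mean is the size-weighted average of the subsample means.
    overall_mean = sum(k * m for k, m, _ in summaries) / n
    # First sum: spread inside each subsample.
    within = sum((k - 1) * v for k, _, v in summaries)
    # Second sum: spread of the subsample means around the overall mean.
    between = sum(k * (m - overall_mean) ** 2 for k, m, _ in summaries)
    return (within + between) / (n - 1)


# Hypothetical data: three subsamples of different sizes.
# statistics.variance needs at least two values per subsample.
groups = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0, 7.0, 9.0, 11.0]]
summaries = [(len(g), mean(g), variance(g)) for g in groups]

combined = combine_variance(summaries)
direct = variance([x for g in groups for x in g])
print(math.isclose(combined, direct))
```

Only the three summary numbers per subsample are needed, so the raw values never have to be held in one place. If some variance had been recorded for a subsample of size one, its weight \(k_j-1\) would be zero, so it would drop out of the first sum.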

This Wikipedia article describes various algorithms for calculating variance in a number of scenarios (e.g. online, in parallel, etc.), whereas this paper presents some computational considerations when updating mean and variance estimates.

¹ Provided, of course, that we have the sizes of all subsamples.