How to efficiently calculate a running standard deviation

I have an array of lists of numbers, e.g.:

[0] (0.01, 0.01, 0.02, 0.04, 0.03)
[1] (0.00, 0.02, 0.02, 0.03, 0.02)
[2] (0.01, 0.02, 0.02, 0.03, 0.02)
...
[n] (0.01, 0.00, 0.01, 0.05, 0.03)

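I would like to efficiently calculate the mean and standard deviation at each index of a list, across all array elements.

To do the mean, I have been looping through the array and summing the value at a given index of a list. At the end, I divide each value in my "averages list" by n (I am working with a population, not a sample from the population).

To do the standard deviation, I loop through again, now that I have the mean calculated.

I would like to avoid going through the array twice, once for the mean and then once for the standard deviation (after I have a mean).

Is there an efficient method for calculating both values, only going through the array once? Any code in an interpreted language (e.g., Perl or Python) or pseudocode is fine.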


Statistics::Descriptive is a very decent Perl module for these types of calculations:

#!/usr/bin/perl


use strict; use warnings;


use Statistics::Descriptive qw( :all );


my $data = [
[ 0.01, 0.01, 0.02, 0.04, 0.03 ],
[ 0.00, 0.02, 0.02, 0.03, 0.02 ],
[ 0.01, 0.02, 0.02, 0.03, 0.02 ],
[ 0.01, 0.00, 0.01, 0.05, 0.03 ],
];


my $stat = Statistics::Descriptive::Full->new;
# You also have the option of using sparse data structures


for my $ref ( @$data ) {
    $stat->add_data( @$ref );
    printf "Running mean: %f\n", $stat->mean;
    printf "Running stdev: %f\n", $stat->standard_deviation;
}
__END__

Output:

Running mean: 0.022000
Running stdev: 0.013038
Running mean: 0.020000
Running stdev: 0.011547
Running mean: 0.020000
Running stdev: 0.010000
Running mean: 0.020000
Running stdev: 0.012566

Unless your array is zillions of elements long, don't worry about looping through it twice. The code is simple and easily tested.

My preference would be to use the NumPy array maths extension to convert your array of arrays into a NumPy 2D array and get the standard deviation directly:

>>> x = [ [ 1, 2, 4, 3, 4, 5 ], [ 3, 4, 5, 6, 7, 8 ] ] * 10
>>> import numpy
>>> a = numpy.array(x)
>>> a.std(axis=0)
array([ 1. ,  1. ,  0.5,  1.5,  1.5,  1.5])
>>> a.mean(axis=0)
array([ 2. ,  3. ,  4.5,  4.5,  5.5,  6.5])

If that's not an option and you need a pure Python solution, keep reading...

If your array is

x = [
[ 1, 2, 4, 3, 4, 5 ],
[ 3, 4, 5, 6, 7, 8 ],
....
]

Then the standard deviation is:

from math import sqrt

d = len(x[0])
n = len(x)
sum_x = [ sum(v[i] for v in x) for i in range(d) ]
sum_x2 = [ sum(v[i]**2 for v in x) for i in range(d) ]
# population standard deviation per index: sqrt(E[x^2] - E[x]^2)
std_dev = [ sqrt(sx2/n - (sx/n)**2) for sx, sx2 in zip(sum_x, sum_x2) ]

If you are determined to loop through your array only once, the running sums can be combined.

sum_x  = [ 0 ] * d
sum_x2 = [ 0 ] * d
for v in x:
    for i, t in enumerate(v):
        sum_x[i] += t
        sum_x2[i] += t**2

This isn't nearly as elegant as the list comprehension solution above.

The basic answer is to accumulate the sum of both x (call it 'sum_x') and x² (call it 'sum_x2') as you go. The value of the standard deviation is then:

stdev = sqrt((sum_x2 / n) - (mean * mean))

where

mean = sum_x / n

This is the population standard deviation, because the divisor is 'n'. For the sample standard deviation, use sqrt((sum_x2 - n * mean * mean) / (n - 1)) so that the divisor is 'n - 1'.

You may need to worry about the numerical stability of taking the difference between two large numbers if you are dealing with large samples. Go to the external references in other answers (Wikipedia, etc) for more information.
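To make the stability concern concrete, here is a small sketch using made-up values with a large offset and a small spread. The data and variable names are illustrative only; the comparison is between the sum-of-squares formula above and a plain two-pass computation.

```python
# Illustrative data: a small spread sitting on top of a large offset.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
n = len(data)

# One-pass sums followed by the sum-of-squares formula (population variance).
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
mean = sum_x / n
naive_var = sum_x2 / n - mean * mean

# Two-pass reference: subtract the mean before squaring.
two_pass_var = sum((x - mean) ** 2 for x in data) / n

print("naive   :", naive_var)     # can be far off, or even negative
print("two-pass:", two_pass_var)  # the exact population variance is 22.5
```

The squares are around 1e18, where adjacent doubles are more than 100 apart, so the subtraction in the one-pass formula cannot recover a variance of 22.5.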

Perhaps not what you were asking, but ... If you use a NumPy array, it will do the work for you, efficiently:

from numpy import array


nums = array(((0.01, 0.01, 0.02, 0.04, 0.03),
(0.00, 0.02, 0.02, 0.03, 0.02),
(0.01, 0.02, 0.02, 0.03, 0.02),
(0.01, 0.00, 0.01, 0.05, 0.03)))


# axis=1 aggregates within each row; use axis=0 for one value per index across rows
print(nums.std(axis=1))
# [ 0.0116619   0.00979796  0.00632456  0.01788854]


print(nums.mean(axis=1))
# [ 0.022  0.018  0.02   0.02 ]

By the way, there's some interesting discussion in this blog post and comments on one-pass methods for computing means and variances:

Computing sample mean and variance online in one pass

Have a look at PDL (pronounced "piddle!").

This is the Perl Data Language which is designed for high precision mathematics and scientific computing.

Here is an example using your figures....

use strict;
use warnings;
use feature 'say';  # needed for say() below
use PDL;


my $figs = pdl [
[0.01, 0.01, 0.02, 0.04, 0.03],
[0.00, 0.02, 0.02, 0.03, 0.02],
[0.01, 0.02, 0.02, 0.03, 0.02],
[0.01, 0.00, 0.01, 0.05, 0.03],
];


my ( $mean, $prms, $median, $min, $max, $adev, $rms ) = statsover( $figs );


say "Mean scores:     ", $mean;
say "Std dev? (adev): ", $adev;
say "Std dev? (prms): ", $prms;
say "Std dev? (rms):  ", $rms;

Which produces:

Mean scores:     [0.022 0.018 0.02 0.02]
Std dev? (adev): [0.0104 0.0072 0.004 0.016]
Std dev? (prms): [0.013038405 0.010954451 0.0070710678 0.02]
Std dev? (rms):  [0.011661904 0.009797959 0.0063245553 0.017888544]

Have a look at PDL::Primitive for more information on the statsover function. This seems to suggest that ADEV is the "standard deviation".

However, it may be PRMS (which Sinan's Statistics::Descriptive example shows) or RMS (which ars's NumPy example shows). I guess one of these three must be right ;-)

For more PDL information, have a look at:

The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:

It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other as they lead to what is known as "catastrophic cancellation" in the floating point literature.

You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).
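As a rough illustration of Welford's update applied to the question's layout (one accumulator per list index), here is a minimal sketch. The function name `running_column_stats` and its return shape are my own choices for this example, not something from the references; it reports both the N and N-1 divisors discussed above.

```python
import math


def running_column_stats(rows):
    """One pass over rows; returns per-index (means, population std, sample std)."""
    n = 0
    means = []
    m2 = []  # per-index sum of squared deviations from the running mean
    for row in rows:
        if n == 0:
            means = [0.0] * len(row)
            m2 = [0.0] * len(row)
        n += 1
        for i, x in enumerate(row):
            delta = x - means[i]
            means[i] += delta / n
            m2[i] += delta * (x - means[i])  # uses the updated mean
    pop_std = [math.sqrt(s / n) for s in m2]
    # The sample estimate needs at least two rows.
    sample_std = [math.sqrt(s / (n - 1)) for s in m2] if n > 1 else [0.0] * len(m2)
    return means, pop_std, sample_std


rows = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]
means, pop_std, sample_std = running_column_stats(rows)
print(means)
print(pop_std)
print(sample_std)
```

Only the running mean and the running sum of squared deviations are stored per index, so the data is read once and nothing large is ever subtracted from something equally large.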

I wrote two blog entries on the topic which go into more details, including how to delete previous values online:
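For readers who want a feel for deleting a value online, the Welford update can be run in reverse. This is a sketch under my own assumptions, not code taken from those blog entries; the class name `RemovableStats` is hypothetical, and repeated removals may accumulate rounding error.

```python
class RemovableStats:
    """Hypothetical helper: a Welford accumulator that can also forget a value."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        """Undo an earlier push(x); x must be a value that was actually pushed."""
        if self.n <= 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        mean_with_x = self.mean
        self.n -= 1
        self.mean = mean_with_x - (x - mean_with_x) / self.n
        self.m2 -= (x - mean_with_x) * (x - self.mean)

    def variance(self):
        """Population variance (divisor n)."""
        return self.m2 / self.n if self.n else 0.0


stats = RemovableStats()
for value in [1.0, 2.0, 3.0, 4.0, 5.5]:
    stats.push(value)
stats.remove(2.0)
print(stats.n, stats.mean, stats.variance())
```

After the removal the accumulator should describe the remaining values 1.0, 3.0, 4.0 and 5.5.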

You can also take a look at my Java implementation; the javadoc, source, and unit tests are all online:

Here's a "one-liner", spread over multiple lines, in functional programming style:

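from functools import reduce


def variance(data, opt=0):
    # opt=0 divides by n - 1 (sample variance); opt=1 divides by n (population variance)
    return (lambda acc: acc[0] / (opt + acc[1] - 1))(
        reduce(
            lambda acc, x: (
                acc[0] + (x - acc[2]) ** 2 * acc[1] / (acc[1] + 1),  # m2
                acc[1] + 1,                                           # count
                acc[2] + (x - acc[2]) / (acc[1] + 1),                 # running mean
            ),
            data,
            (0, 0, 0)))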

Here is a literal pure Python translation of the Welford's algorithm implementation from John D. Cook's excellent "Accurately computing running variance" article:

File running_stats.py

import math


class RunningStats:

    def __init__(self):
        self.n = 0
        self.old_m = 0
        self.new_m = 0
        self.old_s = 0
        self.new_s = 0

    def clear(self):
        self.n = 0

    def push(self, x):
        self.n += 1

        if self.n == 1:
            self.old_m = self.new_m = x
            self.old_s = 0
        else:
            self.new_m = self.old_m + (x - self.old_m) / self.n
            self.new_s = self.old_s + (x - self.old_m) * (x - self.new_m)

            self.old_m = self.new_m
            self.old_s = self.new_s

    def mean(self):
        return self.new_m if self.n else 0.0

    def variance(self):
        return self.new_s / (self.n - 1) if self.n > 1 else 0.0

    def standard_deviation(self):
        return math.sqrt(self.variance())

Usage:

rs = RunningStats()
rs.push(17.0)
rs.push(19.0)
rs.push(24.0)


mean = rs.mean()
variance = rs.variance()
stdev = rs.standard_deviation()


print(f'Mean: {mean}, Variance: {variance}, Std. Dev.: {stdev}')

The Python runstats Module is for just this sort of thing. Install runstats from PyPI:

pip install runstats

Runstats summaries can produce the mean, variance, standard deviation, skewness, and kurtosis in a single pass of data. We can use this to create your "running" version.

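from runstats import Statistics

# data is the list of rows from the question
data = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]

# one Statistics accumulator per index
stats = [Statistics() for num in range(len(data[0]))]

for row in data:
    for index, val in enumerate(row):
        stats[index].push(val)

for index, stat in enumerate(stats):
    print('Index', index, 'mean:', stat.mean())
    print('Index', index, 'standard deviation:', stat.stddev())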

Statistics summaries are based on the Knuth and Welford method for computing standard deviation in one pass as described in the Art of Computer Programming, Vol 2, p. 232, 3rd edition. The benefit of this is numerically stable and accurate results.

Disclaimer: I am the author of the Python runstats module.

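n = int(input("Enter no. of terms: "))

# read the terms
L = []
for i in range(n):
    x = float(input("Enter term: "))
    L.append(x)

# first pass: mean
total = 0
for i in range(n):
    total = total + L[i]
avg = total / n

# second pass: sum of squared deviations from the mean
sumdev = 0
for j in range(n):
    sumdev = sumdev + (L[j] - avg) ** 2

# population standard deviation (divisor n)
dev = (sumdev / n) ** 0.5
print("Standard deviation is", dev)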

As the following answer describes: Does Pandas, SciPy, or NumPy provide a cumulative standard deviation function?

The Python Pandas module contains a method to calculate the running or cumulative standard deviation. For that, you'll have to convert your data into a Pandas dataframe (or a series if it is one-dimensional), but there are functions for that.
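A minimal sketch of that approach, assuming the question's rows are loaded into a DataFrame with one column per index. It uses `DataFrame.expanding()`; the variable names are illustrative. Note that pandas defaults to `ddof=1` (sample standard deviation), so `ddof=0` is passed here to match the population figure the question asks for.

```python
import numpy as np
import pandas as pd

rows = [
    (0.01, 0.01, 0.02, 0.04, 0.03),
    (0.00, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.02, 0.02, 0.03, 0.02),
    (0.01, 0.00, 0.01, 0.05, 0.03),
]
df = pd.DataFrame(rows)  # one column per list index

# Row k of each result covers rows 0..k of the input.
running_mean = df.expanding().mean()
running_std = df.expanding().std(ddof=0)

print(running_mean.iloc[-1].to_numpy())
print(running_std.iloc[-1].to_numpy())
```

The last row of each result equals the statistic over the whole data set, and the earlier rows give the partial results.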

I like to express the update this way:

def running_update(x, N, mu, var):
    '''
    @arg x: the current data sample
    @arg N : the number of previous samples
    @arg mu: the mean of the previous samples
    @arg var : the variance over the previous samples
    @retval (N+1, mu', var') -- updated mean, variance and count
    '''
    N = N + 1
    rho = 1.0/N
    d = x - mu
    mu += rho*d
    var += rho*((1-rho)*d**2 - var)
    return (N, mu, var)

so that a one-pass function would look like this:

def one_pass(data):
    N = 0
    mu = 0.0
    var = 0.0
    for x in data:
        N = N + 1
        rho = 1.0/N
        d = x - mu
        mu += rho*d
        var += rho*((1-rho)*d**2 - var)
        # could yield here if you want partial results
    return (N, mu, var)

Note that this calculates the variance of the samples themselves (1/N normalization), not the unbiased estimate of the population variance (which uses a 1/(N-1) normalization factor). Unlike the other answers, the variable var that tracks the running variance does not grow in proportion to the number of samples. At all times it is just the variance of the set of samples seen so far (there is no final "dividing by n" to get the variance).
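For anyone wondering where the variance update comes from, here is one way to derive it. The notation is introduced for this sketch: \sigma^2 stands for var and \mu for mu before the update, primes mark the updated values, and x_1, ..., x_N are the previous samples. The cross term vanishes because the previous samples sum to N\mu.

```latex
\begin{aligned}
N' &= N + 1, \qquad \rho = \frac{1}{N'}, \qquad d = x - \mu, \qquad \mu' = \mu + \rho d \\
N'\,\sigma'^2 &= \sum_{i=1}^{N} (x_i - \mu')^2 + (x - \mu')^2 \\
              &= \bigl(N\sigma^2 + N\rho^2 d^2\bigr) + (1-\rho)^2 d^2 \\
\sigma'^2 &= (1-\rho)\,\sigma^2 + \rho(1-\rho)\,d^2\,\bigl[\rho + (1-\rho)\bigr] \\
          &= \sigma^2 + \rho\bigl((1-\rho)\,d^2 - \sigma^2\bigr)
\end{aligned}
```

The third line uses x_i - \mu' = (x_i - \mu) - \rho d and x - \mu' = (1-\rho) d; the fourth divides by N' using N/N' = 1 - \rho.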

In a class it would look like this:

class RunningMeanVar(object):
    def __init__(self):
        self.N = 0
        self.mu = 0.0
        self.var = 0.0

    def push(self, x):
        self.N = self.N + 1
        rho = 1.0/self.N
        d = x - self.mu
        self.mu += rho*d
        self.var += rho*((1-rho)*d**2 - self.var)
    # reset, accessors etc. can be setup as you see fit

This also works for weighted samples:

def running_update(w, x, N, mu, var):
    '''
    @arg w: the weight of the current sample
    @arg x: the current data sample
    @arg mu: the mean of the previous N sample
    @arg var : the variance over the previous N samples
    @arg N : the number of previous samples
    @retval (N+w, mu', var') -- updated mean, variance and count
    '''
    N = N + w
    rho = w/N
    d = x - mu
    mu += rho*d
    var += rho*((1-rho)*d**2 - var)
    return (N, mu, var)
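One quick way to sanity-check the weighted form is to compare it with the unweighted statistics of the same data written out with repeats. The sketch below restates the update under the name `weighted_update` so that it runs on its own; the (weight, value) pairs are made up for illustration.

```python
def weighted_update(w, x, N, mu, var):
    """Same weighted update, restated so this snippet is self-contained."""
    N = N + w
    rho = w / N
    d = x - mu
    mu += rho * d
    var += rho * ((1 - rho) * d ** 2 - var)
    return N, mu, var


# Illustrative (weight, value) pairs.
weighted = [(2, 1.0), (1, 4.0), (3, 2.5)]

N, mu, var = 0, 0.0, 0.0
for w, x in weighted:
    N, mu, var = weighted_update(w, x, N, mu, var)

# Reference: expand integer weights into repeated values, then use two passes.
expanded = [x for w, x in weighted for _ in range(w)]
ref_mean = sum(expanded) / len(expanded)
ref_var = sum((x - ref_mean) ** 2 for x in expanded) / len(expanded)

print(N, mu, var)
print(len(expanded), ref_mean, ref_var)
```

With integer weights, a sample of weight w behaves exactly like w copies of that sample, which is what the comparison relies on.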

Here is a practical example of how you could implement a running standard deviation with Python and NumPy:

import numpy as np

a = np.arange(1, 10)
s = 0
s2 = 0
for i in range(0, len(a)):
    s += a[i]
    s2 += a[i] ** 2
    n = (i + 1)
    m = s / n
    std = np.sqrt((s2 / n) - (m * m))
    print(std, np.std(a[:i + 1]))

This will print out the calculated standard deviation and a check standard deviation calculated with NumPy:

0.0 0.0
0.5 0.5
0.8164965809277263 0.816496580927726
1.118033988749895 1.118033988749895
1.4142135623730951 1.4142135623730951
1.707825127659933 1.707825127659933
2.0 2.0
2.29128784747792 2.29128784747792
2.5819888974716116 2.581988897471611

I am just using the formula described in this thread:

stdev = sqrt((sum_x2 / n) - (mean * mean))

Responding to Charlie Parker's 2021 question:

I'd like an answer that I can just copy paste to my code in numpy. My input is a matrix of size [N, 1] where N is the number of data points and I already have computed the running mean and I assuming we have computed the running std/variance, how to update we the new batch of data.

Here we have two implementations of a function that takes the original mean, original variance and original size and the new sample and returns the total mean and total variance of the combined original and new sample (to get the standard deviation, just take variance's square root by using **(1/2)). The first uses NumPy, and the second one uses Welford. You may choose the one that best applies to your case.

def mean_and_variance_update_numpy(previous_mean, previous_var, previous_size, sample_to_append):
    if type(sample_to_append) is np.matrix:
        sample_to_append = sample_to_append.A1
    else:
        sample_to_append = sample_to_append.flatten()
    sample_to_append_mean = np.mean(sample_to_append)
    sample_to_append_size = len(sample_to_append)
    total_size = previous_size+sample_to_append_size
    total_mean = (previous_mean*previous_size+sample_to_append_mean*sample_to_append_size)/total_size
    total_var = (((previous_var+(total_mean-previous_mean)**2)*previous_size)+((np.var(sample_to_append)+(sample_to_append_mean-total_mean)**2)*sample_to_append_size))/total_size
    return (total_mean, total_var)


def mean_and_variance_update_welford(previous_mean, previous_var, previous_size, sample_to_append):
    if type(sample_to_append) is np.matrix:
        sample_to_append = sample_to_append.A1
    else:
        sample_to_append = sample_to_append.flatten()
    pos = previous_size
    mean = previous_mean
    # recover the running sum of squared deviations from the population variance
    v = previous_var*previous_size
    for value in sample_to_append:
        pos += 1
        mean_next = mean + (value - mean) / pos
        v = v + (value - mean)*(value - mean_next)
        mean = mean_next
    return (mean, v/pos)
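Written out, the NumPy variant appears to rely on the usual identity for pooling two groups whose sizes, means and population variances are known. The subscripts are notation introduced here: a is the previous data, b is the appended sample.

```latex
\begin{aligned}
n &= n_a + n_b \\
\mu &= \frac{n_a\,\mu_a + n_b\,\mu_b}{n} \\
\sigma^2 &= \frac{n_a\bigl(\sigma_a^2 + (\mu_a - \mu)^2\bigr) + n_b\bigl(\sigma_b^2 + (\mu_b - \mu)^2\bigr)}{n}
\end{aligned}
```

Each group contributes its own variance plus the squared shift of its mean from the combined mean, weighted by its size. Both variances use the 1/n convention, matching the default of np.var.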

Let's check if it works:

import numpy as np


# mean_and_variance_update_numpy and mean_and_variance_update_welford
# are defined as shown above.


# Making the samples and results deterministic
np.random.seed(0)


# Our initial sample has 100 samples, we want to append 10
n0, n1 = 100, 10


# Using np.matrix only, because it was in the question. 'np.array' is more common
s0 = np.matrix(1e3+np.random.random_sample(n0)*1e-3).T
s1 = np.matrix(1e3+np.random.random_sample(n1)*1e-3).T


# Precalculating our mean and var for initial sample:
s0mean, s0var = np.mean(s0), np.var(s0)


# Calculating mean and variance for s0+s1 using our NumPy updater
mean_and_variance_update_numpy(s0mean, s0var, len(s0), s1)
# (1000.0004826329636, 8.24577589696613e-08)


# Calculating mean and variance for s0+s1 using our Welford updater
mean_and_variance_update_welford(s0mean, s0var, len(s0), s1)
# (1000.0004826329634, 8.245775896913623e-08)


# Similar results, now checking with NumPy's calculation over the concatenation of s0 and s1
s0s1 = np.concatenate([s0,s1])
(np.mean(s0s1), np.var(s0s1))
# (1000.0004826329638, 8.245775896917313e-08)

Here are the three results side by side:

# np(s0s1)        (1000.0004826329638, 8.245775896917313e-08)
# np(s0)updnp(s1) (1000.0004826329636, 8.245775896966130e-08)
# np(s0)updwf(s1) (1000.0004826329634, 8.245775896913623e-08)

It is possible to see that the results are very similar.

Figured I could jump on the old bandwagon. This should work with RGB values.

Adapted from https://math.stackexchange.com/a/2148949

import numpy as np




class IterativeNormStats():

    def __init__(self):
        """uint64 max is 18446744073709551615
        256**2 = 65536

        so we can store 18446744073709551615 / 65536 = 281,474,976,710,656
        images before running into overflow issues. I think we'll be ok
        """
        self.n = 0
        self.rgb_sum = np.zeros(3, dtype=np.uint64)
        self.rgb_sq_sum = np.zeros(3, dtype=np.uint64)

    def update(self, img_arr):
        rgbs = np.reshape(img_arr, (-1, 3)).astype(np.uint64)
        self.n += rgbs.shape[0]
        self.rgb_sum += np.sum(rgbs, axis=0)
        self.rgb_sq_sum += np.sum(np.square(rgbs), axis=0)

    def mean(self):
        return self.rgb_sum / self.n

    def std(self):
        return np.sqrt((self.rgb_sq_sum / self.n) - np.square(self.rgb_sum / self.n))


def test_IterativeNormStats():
    img_a = np.ones((10, 10, 3), dtype=np.uint8) * (1, 2, 3)
    img_b = np.ones((10, 10, 3), dtype=np.uint8) * (2, 4, 6)
    img_c = np.ones((10, 10, 3), dtype=np.uint8) * (3, 6, 9)
    ins = IterativeNormStats()
    for i in range(1000):
        for img in [img_a, img_b, img_c]:
            ins.update(img)

    x = np.vstack([
        np.reshape(img_a, (-1, 3)),
        np.reshape(img_b, (-1, 3)),
        np.reshape(img_c, (-1, 3)),
    ]*1000)
    expected_mean = np.mean(x, axis=0)
    expected_std = np.std(x, axis=0)

    print(expected_mean)
    print(ins.mean())
    print(expected_std)
    print(ins.std())
    assert np.allclose(ins.mean(), expected_mean)


if __name__ == "__main__":
    test_IterativeNormStats()

I came across the welford package, which is pretty simple to use:

pip install welford

Then

import numpy as np
from welford import Welford


# Initialize Welford object
w = Welford()


# Input data samples sequentially
w.add(np.array([0, 100]))
w.add(np.array([1, 110]))
w.add(np.array([2, 120]))


# output
print(w.mean)  # mean --> [  1. 110.]
print(w.var_s)  # sample variance --> [1, 100]
print(w.var_p)  # population variance --> [ 0.6666 66.66]


# You can add other samples after calculating variances.
w.add(np.array([3, 130]))
w.add(np.array([4, 140]))


# output with added samples
print(w.mean)  # mean --> [  2. 120.]
print(w.var_s)  # sample variance --> [  2.5 250. ]
print(w.var_p)  # population variance --> [  2. 200.]

Notes:

  • Unlike most other answers, you can feed a Welford object a NumPy array directly
    • You can even add multiple with Welford.add_all(...)
    • You can merge independent computations with w1.merge(w2)
  • You should choose var_p or var_s depending on which one you want to use (Population and Sample variance)
  • As said, those are variances so you should use np.sqrt to get the associated standard deviation