标准差

LINQ 是否为聚合 SQL 函数 STDDEV()(标准差)建模?

如果没有,计算它的最简单/最佳实践方法是什么?

例如:

  SELECT test_id, AVERAGE(result) avg, STDDEV(result) std
FROM tests
GROUP BY test_id
37555 次浏览

You can make your own extension calculating it

public static class Extensions
{
public static double StdDev(this IEnumerable<double> values)
{
double ret = 0;
int count = values.Count();
if (count  > 1)
{
//Compute the Average
double avg = values.Average();


//Perform the Sum of (value-avg)^2
double sum = values.Sum(d => (d - avg) * (d - avg));


//Put it all together
ret = Math.Sqrt(sum / count);
}
return ret;
}
}

If you have a sample of the population rather than the whole population, then you should use ret = Math.Sqrt(sum / (count - 1));.

Transformed into extension from Adding Standard Deviation to LINQ by Chris Bennett.

Dynami's answer works but makes multiple passes through the data to get a result. This is a single pass method that calculates the sample standard deviation:

public static double StdDev(this IEnumerable<double> values)
{
// ref: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/
double mean = 0.0;
double sum = 0.0;
double stdDev = 0.0;
int n = 0;
foreach (double val in values)
{
n++;
double delta = val - mean;
mean += delta / n;
sum += delta * (val - mean);
}
if (1 < n)
stdDev = Math.Sqrt(sum / (n - 1));


return stdDev;
}

This is the sample standard deviation since it divides by n - 1. For the normal standard deviation you need to divide by n instead.

This uses Welford's method which has higher numerical accuracy compared to the Average(x^2)-Average(x)^2 method.

This converts David Clarke's answer into an extension that follows the same form as the other aggregate LINQ functions like Average.

Usage would be: var stdev = data.StdDev(o => o.number)

public static class Extensions
{
public static double StdDev<T>(this IEnumerable<T> list, Func<T, double> values)
{
// ref: https://stackoverflow.com/questions/2253874/linq-equivalent-for-standard-deviation
// ref: http://warrenseen.com/blog/2006/03/13/how-to-calculate-standard-deviation/
var mean = 0.0;
var sum = 0.0;
var stdDev = 0.0;
var n = 0;
foreach (var value in list.Select(values))
{
n++;
var delta = value - mean;
mean += delta / n;
sum += delta * (value - mean);
}
if (1 < n)
stdDev = Math.Sqrt(sum / (n - 1));


return stdDev;


}
}
var stddev = Math.Sqrt(data.Average(z=>z*z)-Math.Pow(data.Average(),2));
public static double StdDev(this IEnumerable<int> values, bool as_sample = false)
{
var count = values.Count();
if (count > 0) // check for divide by zero
// Get the mean.
double mean = values.Sum() / count;


// Get the sum of the squares of the differences
// between the values and the mean.
var squares_query =
from int value in values
select (value - mean) * (value - mean);
double sum_of_squares = squares_query.Sum();
return Math.Sqrt(sum_of_squares / (count - (as_sample ? 1 : 0)))
}

Straight to the point (and C# > 6.0), Dynamis answer becomes this:

    public static double StdDev(this IEnumerable<double> values)
{
var count = values?.Count() ?? 0;
if (count <= 1) return 0;


var avg = values.Average();
var sum = values.Sum(d => Math.Pow(d - avg, 2));


return Math.Sqrt(sum / count);
}

Edit 2020-08-27:

I took @David Clarke comments to make some performance tests and this are the results:

    public static (double stdDev, double avg) StdDevFast(this List<double> values)
{
var count = values?.Count ?? 0;
if (count <= 1) return (0, 0);


var avg = GetAverage(values);
var sum = GetSumOfSquareDiff(values, avg);


return (Math.Sqrt(sum / count), avg);
}


private static double GetAverage(List<double> values)
{
double sum = 0.0;
for (int i = 0; i < values.Count; i++)
sum += values[i];
        

return sum / values.Count;
}
private static double GetSumOfSquareDiff(List<double> values, double avg)
{
double sum = 0.0;
for (int i = 0; i < values.Count; i++)
{
var diff = values[i] - avg;
sum += diff * diff;
}
return sum;
}

I tested this with a list of one million random doubles
the original implementation had an runtime of ~48ms
the performance optimized implementation 2-3ms
so this is an significant improvement.

Some interesting details:
getting rid of Math.Pow brings a boost of 33ms!
List instead of IEnumerable 6ms
manually Average calculation 4ms
For-loops instead of ForEach-loops 2ms
Array instead of List brings just an improvement of ~2% so i skipped this
using single instead of double brings nothing

Further lowering the code and using goto (yes GOTO... haven't used this since the 90s assembler...) instead of for-loops does not pay, Thank goodness!

I have tested also parallel calculation, this makes sense on list > 200.000 items It seems that Hardware and Software needs to initialize a lot and this is for small lists contra-productive.

All tests were executed two times in a row to get rid of the warmup-time.

Simple 4 lines, I used a List of doubles but one could use IEnumerable<int> values

public static double GetStandardDeviation(List<double> values)
{
double avg = values.Average();
double sum = values.Sum(v => (v - avg) * (v - avg));
double denominator = values.Count - 1;
return denominator > 0.0 ? Math.Sqrt(sum / denominator) : -1;
}

In general case we want to compute StdDev in one pass: what if values is file or RDBMS cursor which can be changed between computing average and sum? We are going to have inconsistent result. The code below uses just one pass:

// Population StdDev
public static double StdDev(this IEnumerable<double> values) {
if (null == values)
throw new ArgumentNullException(nameof(values));


double N = 0;
double Sx = 0.0;
double Sxx = 0.0;


foreach (double x in values) {
N += 1;
Sx += x;
Sxx += x * x;
}


return N == 0
? double.NaN // or throw exception
: Math.Sqrt((Sxx - Sx * Sx / N) / N);
}

The very same idea for sample StdDev:

// Sample StdDev
public static double StdDev(this IEnumerable<double> values) {
if (null == values)
throw new ArgumentNullException(nameof(values));


double N = 0;
double Sx = 0.0;
double Sxx = 0.0;


foreach (double x in values) {
N += 1;
Sx += x;
Sxx += x * x;
}


return N <= 1
? double.NaN // or throw exception
: Math.Sqrt((Sxx - Sx * Sx / N) / (N - 1));
}