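Is there a built-in function for rejecting outliers from a list?

Best answer

This method is almost identical to yours, just more numpy-idiomatic (and it works on numpy arrays only):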

import numpy as np

def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]

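An important point when dealing with outliers is to use estimators that are as robust as possible. The mean of a distribution is biased by outliers, whereas the median is affected much less.

Building on Eumiro's answer: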

def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    return data[s<m]

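Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scale the distances by their (again) median value so that m is on a reasonable relative scale.

Note that for the data[s<m] syntax to work, data must be a numpy array.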
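
To make the difference between the two pairs of estimators concrete, here is a minimal sketch on made-up numbers (the array below is illustrative, not taken from the question). It computes the mean and standard deviation next to the median and median absolute deviation, then applies the same scaled-distance test as the function above:

```python
import numpy as np

# Illustrative data: seven similar values and one extreme value.
data = np.array([9.0, 10.0, 10.0, 11.0, 10.0, 9.0, 11.0, 1000.0])

# Non-robust location and scale: both are dragged by the extreme value.
mean, std = np.mean(data), np.std(data)

# Robust location and scale: median and median absolute deviation (MAD).
median = np.median(data)
mad = np.median(np.abs(data - median))

print(f"mean={mean:.2f} std={std:.2f}")
print(f"median={median:.2f} MAD={mad:.2f}")

# Distances from the median, scaled by the MAD, then thresholded at m = 2.
s = np.abs(data - median) / mad
kept = data[s < 2.0]
print(kept)
```

The median and MAD stay at the scale of the bulk of the data, so the threshold m keeps its meaning even when a single value is far away.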

Building on Benjamin's answer, using pandas.Series and replacing MAD with IQR:

def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
    iqr = qhigh - qlow
    return sr[ (sr - median).abs() <= iqr]

For example, if you set iq_range=0.6, the quantiles used for the range become 0.20 <--> 0.80, so more outliers will be included.
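
The arithmetic behind iq_range can be checked with a small sketch. It uses plain numpy instead of pandas, and quantile_bounds and the random sample are names and data introduced here for illustration only:

```python
import numpy as np

def quantile_bounds(iq_range):
    # Same arithmetic as in reject_outliers: symmetric quantiles around the median.
    pcnt = (1 - iq_range) / 2
    return pcnt, 1 - pcnt

rng = np.random.default_rng(0)
values = rng.normal(size=1000)  # illustrative data

widths = {}
for iq_range in (0.5, 0.6):
    lo, hi = quantile_bounds(iq_range)
    qlow, qhigh = np.quantile(values, [lo, hi])
    widths[iq_range] = qhigh - qlow
    print(f"iq_range={iq_range}: quantiles {lo:.2f}..{hi:.2f}, width {widths[iq_range]:.3f}")
```

A larger iq_range gives a wider quantile interval, so the acceptance band around the median grows and fewer points are rejected.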

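An alternative is to make a robust estimate of the standard deviation (assuming Gaussian statistics). Looking at online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ (http://vassarstats.net/tabs.html?#z)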
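
The two constants can also be reproduced without an online calculator. This is a small sketch using the standard library's statistics.NormalDist; the variable names are introduced here for illustration:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Distance from the median to the 90th and 95th percentiles, in units of sigma.
z90 = std_normal.inv_cdf(0.90)
z95 = std_normal.inv_cdf(0.95)

print(f"90th percentile: {z90:.4f} sigma")
print(f"95th percentile: {z95:.4f} sigma")
```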

As a simple example:

import numpy as np


# Create some random numbers
x = np.random.normal(5, 2, 1000)


# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))


# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500


# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))


# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)


rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)

The output I get is:

Mean=  4.99760520022
Median=  4.95395274981
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649

Mean=  9.64760520022
Median=  4.95667658782
Max/Min= 2205.43861943   -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694

Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462

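This is close to the expected value of 2.

If we want to remove points more than 5 standard deviations above or below the median (with 1000 points we would expect about 1 value beyond 3 standard deviations):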

y = x[abs(x - p50) < rSig*5]


# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))

Which gives:

Mean=  4.99755359935
Median=  4.95213030447
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.97692712883

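I do not know which approach is more efficient or more robust.

Benjamin Bannier's answer passes everything through when the median of the distances from the median is 0, so I found this modified version more helpful for cases like the example below.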

def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]

Example:

data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))

Gives:

[[10, 10, 10, 17, 10, 10]]  # 17 is not filtered
[10, 10, 10, 10, 10]  # 17 is filtered (its distance, 7, is greater than m)
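
The doubled brackets in the first output line are consistent with how numpy treats a scalar boolean index. The sketch below reuses data_points from the example; s_scalar and passed are names introduced here for illustration:

```python
import numpy as np

data_points = np.array([10, 10, 10, 17, 10, 10])

d = np.abs(data_points - np.median(data_points))
mdev = np.median(d)  # 0.0, because more than half of the points equal the median

# The first version falls back to the scalar 0. when mdev is 0,
# so the comparison with m is a single True rather than an array.
s_scalar = 0.0
passed = data_points[s_scalar < 2.0]

print(mdev)
print(passed.shape)  # indexing with a scalar True adds a leading axis of length 1
```

Dividing by 1 instead, as reject_outliers_2 does, keeps s an array, so the comparison stays element-wise.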

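I wanted to do something similar, except set the value to NaN rather than remove it from the data, because removing it changes the length, which can mess up plotting (for example, if you remove outliers from only one column of a table, but that column must stay the same length as the others so that you can plot them against each other).

To do so I used numpy's masking functions: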

def reject_outliers(data, m=2):
    stdev = np.std(data)
    mean = np.mean(data)
    maskMin = mean - stdev * m
    maskMax = mean + stdev * m
    mask = np.ma.masked_outside(data, maskMin, maskMax)
    print('Masking values outside of {} and {}'.format(maskMin, maskMax))
    return mask
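
The function above returns a masked array. If a plain array with literal NaN values is preferred, one possible variant is sketched below; it uses np.where rather than numpy's masking module, and outliers_to_nan and the sample column are hypothetical names and data introduced for illustration:

```python
import numpy as np

def outliers_to_nan(data, m=2):
    # Cast to float so that NaN can be stored in the result.
    data = np.asarray(data, dtype=float)
    mean, stdev = np.mean(data), np.std(data)
    # Keep values within m standard deviations of the mean, NaN elsewhere.
    return np.where(np.abs(data - mean) <= m * stdev, data, np.nan)

column = np.array([2, 4, 5, 1, 6, 5, 40, 3])
cleaned = outliers_to_nan(column)
print(cleaned)
```

The result has the same length as the input, so it can still be plotted against the other columns.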

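In this answer I would like to provide two methods: a solution based on the "z score" and a solution based on the "IQR".

The code in this answer works on both one-dimensional and multi-dimensional numpy arrays.

Let's import some modules first.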

import collections.abc
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr

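z score based method

This method tests whether a value lies more than three standard deviations from the mean. Under this rule it returns True if the value is an outlier and False otherwise.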

def sd_outlier(x, axis = None, bar = 3, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_z = stat.zscore(x, axis = axis)

    if side == 'gt':
        return d_z > bar
    elif side == 'lt':
        return d_z < -bar
    elif side == 'both':
        return np.abs(d_z) > bar

IQR based method

This method tests whether a value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is similar to SPSS's plot method.

def q1(x, axis = None):
    return np.percentile(x, 25, axis = axis)


def q3(x, axis = None):
    return np.percentile(x, 75, axis = axis)


def iqr_outlier(x, axis = None, bar = 1.5, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_iqr = iqr(x, axis = axis)
    d_q1 = q1(x, axis = axis)
    d_q3 = q3(x, axis = axis)
    iqr_distance = np.multiply(d_iqr, bar)

    # Shape that lets the statistics broadcast against x.
    stat_shape = list(x.shape)

    if axis is None:
        # Statistics over the flattened array are scalars.
        stat_shape = [1] * x.ndim
    elif isinstance(axis, collections.abc.Iterable):
        for single_axis in axis:
            stat_shape[single_axis] = 1
    else:
        stat_shape[axis] = 1

    if side in ['gt', 'both']:
        upper_range = d_q3 + iqr_distance
        upper_outlier = np.greater(x - np.reshape(upper_range, stat_shape), 0)
    if side in ['lt', 'both']:
        lower_range = d_q1 - iqr_distance
        lower_outlier = np.less(x - np.reshape(lower_range, stat_shape), 0)

    if side == 'gt':
        return upper_outlier
    if side == 'lt':
        return lower_outlier
    if side == 'both':
        return np.logical_or(upper_outlier, lower_outlier)

Finally, if you want to filter out the outliers, use a numpy selector.
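
As a sketch of what that selection can look like, the snippet below builds a boolean outlier mask and keeps the complement. To stay self-contained it restates the 1.5 * IQR rule with np.percentile instead of calling the functions above; the sample array and variable names are illustrative:

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0])

# Boolean mask, True where a value is outside the 1.5 * IQR fences.
q1_val, q3_val = np.percentile(x, [25, 75])
iqr_val = q3_val - q1_val
is_outlier = (x < q1_val - 1.5 * iqr_val) | (x > q3_val + 1.5 * iqr_val)

# Boolean indexing with the inverted mask keeps the non-outliers.
kept = x[~is_outlier]
print(kept)
```

A mask returned by sd_outlier or iqr_outlier can be used in the same way.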

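Have a nice day.

Consider that the methods above that rely on the standard deviation fail when it becomes very large because of huge outliers.

(Similarly, the mean breaks down and the median should be computed instead, although the mean is "more prone to such an error as the stdDv".)

You could apply your algorithm iteratively, or filter using the interquartile range (here "factor" corresponds to an n*sigma range, but only when your data follow a Gaussian distribution):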

import numpy as np


def sortoutOutliers(dataIn, factor):
    quant3, quant1 = np.percentile(dataIn, [75, 25])
    iqr = quant3 - quant1
    # For Gaussian data the IQR spans about 1.349 sigma.
    iqrSigma = iqr / 1.34896
    medData = np.median(dataIn)
    dataOut = [x for x in dataIn if medData - factor * iqrSigma < x < medData + factor * iqrSigma]
    return dataOut
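
The constant 1.34896 appears to be the width of the interquartile range of a normal distribution in units of sigma. The sketch below recomputes it with the standard library and checks, on simulated data with three injected outliers, that IQR divided by this constant still recovers the true sigma. The sample, seed, and variable names are assumptions made for illustration:

```python
from statistics import NormalDist

import numpy as np

# For a standard normal the quartiles sit at +/- inv_cdf(0.75),
# so the IQR equals 2 * inv_cdf(0.75) standard deviations.
iqr_per_sigma = 2 * NormalDist().inv_cdf(0.75)
print(f"IQR / sigma = {iqr_per_sigma:.5f}")

# Simulated Gaussian data with sigma = 2 and three huge outliers.
rng = np.random.default_rng(1)
sample = rng.normal(5, 2, 10_000)
sample[:3] += 1000

quant3, quant1 = np.percentile(sample, [75, 25])
iqr_sigma = (quant3 - quant1) / iqr_per_sigma

print(f"np.std: {np.std(sample):.2f}, IQR-based sigma: {iqr_sigma:.2f}")
```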

If you want the index positions of the outliers, idx_list returns them.

def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    data_range = np.arange(len(data))
    idx_list = data_range[s>=m]
    return data[s<m], idx_list


data_points = np.array([8, 10, 35, 17, 73, 77])
print(reject_outliers(data_points))


after rejection: [ 8 10 35 17], index positions of outliers: [4 5]

For a set of images (each image has 3 dimensions), where I wanted to reject outliers at each pixel, I used:

mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)

Then it is possible to compute the mean:

masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))

(I use it for background subtraction)
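
Here is a small worked instance of the per-pixel rejection above. The stack of five tiny 2x2 single-channel frames is hypothetical data, chosen so that one pixel in one frame is corrupted:

```python
import numpy as np

# Hypothetical stack: 5 frames of 2x2 pixels, all with value 10.
imgs = np.full((5, 2, 2), 10.0)
imgs[3, 0, 0] = 200.0  # one corrupted pixel in frame 3

mean = np.mean(imgs, axis=0)
std = np.std(imgs, axis=0)

# True where a frame's pixel is close enough to the per-pixel mean.
mask = np.greater(0.5 * std + 1, np.abs(imgs - mean))
masked = np.multiply(imgs, mask)

# Average only over the frames that passed the test at each pixel.
masked_mean = np.divide(np.sum(masked, axis=0), np.sum(mask, axis=0))
print(masked_mean)
```

Dividing by the per-pixel count of accepted frames, rather than by the number of frames, is what keeps the rejected values from pulling the average toward zero.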

Here I find the outliers in x and replace them with the median of a window of points (win) around them (the median deviation is taken from Benjamin Bannier's answer).

import numpy as np
import matplotlib.pyplot as plt

def outlier_smoother(x, m=3, win=3, plots=False):
    ''' finds outliers in x, points > m*mdev(x) [mdev:median deviation]
    and replaces them with the median of win points around them '''
    x_corr = np.copy(x)
    d = np.abs(x - np.median(x))
    mdev = np.median(d)
    idxs_outliers = np.nonzero(d > m*mdev)[0]
    for i in idxs_outliers:
        if i-win < 0:
            x_corr[i] = np.median(np.append(x[0:i], x[i+1:i+win+1]))
        elif i+win+1 > len(x):
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:len(x)]))
        else:
            x_corr[i] = np.median(np.append(x[i-win:i], x[i+1:i+win+1]))
    if plots:
        plt.figure('outlier_smoother', clear=True)
        plt.plot(x, label='orig.', lw=5)
        plt.plot(idxs_outliers, x[idxs_outliers], 'ro', label='outliers')
        plt.plot(x_corr, '-o', label='corrected')
        plt.legend()

    return x_corr

Trim outliers in a numpy array along an axis and replace them with the min or max value along that axis, whichever is closer. The threshold is a z-score:

def np_z_trim(x, threshold=10, axis=0):
    """ Replace outliers in numpy ndarray along axis with min or max values
    within the threshold along this axis, whichever is closer."""
    mean = np.mean(x, axis=axis, keepdims=True)
    std = np.std(x, axis=axis, keepdims=True)
    masked = np.where(np.abs(x - mean) < threshold * std, x, np.nan)
    min_val = np.nanmin(masked, axis=axis, keepdims=True)
    max_val = np.nanmax(masked, axis=axis, keepdims=True)
    repl = np.where(np.abs(x - max_val) < np.abs(x - min_val), max_val, min_val)
    return np.where(np.isnan(masked), repl, masked)
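
A short usage sketch may help show what the trimming does. The function is restated so the snippet runs on its own (with min_val and max_val as local names), and the 4x2 input and the threshold of 1.5 are illustrative choices; with only four values per column, z-scores cannot get much larger than that:

```python
import numpy as np

def np_z_trim(x, threshold=10, axis=0):
    """Clamp z-score outliers along axis to the nearest in-threshold min or max."""
    mean = np.mean(x, axis=axis, keepdims=True)
    std = np.std(x, axis=axis, keepdims=True)
    masked = np.where(np.abs(x - mean) < threshold * std, x, np.nan)
    min_val = np.nanmin(masked, axis=axis, keepdims=True)
    max_val = np.nanmax(masked, axis=axis, keepdims=True)
    repl = np.where(np.abs(x - max_val) < np.abs(x - min_val), max_val, min_val)
    return np.where(np.isnan(masked), repl, masked)

# Column 0 contains one extreme value; column 1 has none.
x = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 2.0],
              [100.0, 3.0]])

trimmed = np_z_trim(x, threshold=1.5, axis=0)
print(trimmed)
```

The extreme value is replaced by the largest remaining value in its column, and the array keeps its shape.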

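There are already many answers, but I am adding a new one that may be useful to the author or to other users.

You could use the Hampel filter, but you need to work with a Series.

The Hampel filter returns the indices of the outliers; you can then drop them from the Series and convert it back to a list.

To use the Hampel filter, you can install the package with pip:

pip install hampel

Usage: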

# Imports
from hampel import hampel
import pandas as pd


list_d = [2, 4, 5, 1, 6, 5, 40]


# List to Series
time_series = pd.Series(list_d)


# Outlier detection with Hampel filter
# Returns the Outlier indices
outlier_indices = hampel(ts = time_series, window_size = 3)


# Drop Outliers indices from Series
filtered_d = time_series.drop(outlier_indices)


filtered_d.values.tolist()


print(f'filtered_d: {filtered_d.values.tolist()}')

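The output will be:

filtered_d: [2, 4, 5, 1, 6, 5]

Here ts is a pandas Series object and window_size is the half-width of the window; the total window size is computed as 2 * window_size + 1.

For this Series I set window_size to 3.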
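
For readers who want to see the idea without installing the package, below is a rough numpy sketch of a Hampel-style test. It is not the hampel package's implementation: the function name, the n_sigmas parameter, and the choice to truncate the window at the edges of the data are all assumptions made for this illustration.

```python
import numpy as np

def hampel_outlier_indices(values, half_window=3, n_sigmas=3.0):
    """Return indices whose value is far from the local median (illustrative sketch)."""
    values = np.asarray(values, dtype=float)
    k = 1.4826  # scales the MAD to the standard deviation for Gaussian data
    indices = []
    for i in range(len(values)):
        # Window of up to 2 * half_window + 1 points, truncated at the edges.
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        med = np.median(window)
        mad = np.median(np.abs(window - med))
        if np.abs(values[i] - med) > n_sigmas * k * mad:
            indices.append(i)
    return indices

list_d = [2, 4, 5, 1, 6, 5, 40]
outliers = hampel_outlier_indices(list_d, half_window=3)
cleaned = [v for i, v in enumerate(list_d) if i not in outliers]
print(outliers, cleaned)
```

On the list from the example this sketch flags only the last value, which agrees with the output shown above.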

The nice thing about working with a Series is being able to generate plots:

# Imports
import matplotlib.pyplot as plt


plt.style.use('seaborn-darkgrid')


# Plot Original Series
time_series.plot(style = 'k-')
plt.title('Original Series')
plt.show()
    

# Plot Cleaned Series
filtered_d.plot(style = 'k-')
plt.title('Cleaned Series (Without detected Outliers)')
plt.show()

The output is two plots: the original series and the cleaned series without the detected outliers.


My solution drops the top and bottom percentiles, keeping values that are equal to the boundaries:

def remove_percentile_outliers(data, percent_to_drop=0.001):
    low, high = data.quantile([percent_to_drop / 2, 1-percent_to_drop / 2])
    return data[(data >= low )&(data <= high)]

My solution sets each outlier equal to the previous value.

import numpy as np

test_data = [2,4,5,1,6,5,40, 3]
def reject_outliers(data, m=2):
    mean = np.mean(data)
    std = np.std(data)
    for i in range(len(data)):
        if np.abs(data[i] - mean) > m*std:
            # Overwrites the outlier in place; for i == 0 this wraps around to data[-1].
            data[i] = data[i-1]
    return data
reject_outliers(test_data)

Output:

[2, 4, 5, 1, 6, 5, 5, 3]