How to qcut with non-unique bin edges?

My question is the same as this previous one:

Binning with zero values in pandas

However, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest are between 1 and 100, how would I categorize all the 0 values into fractile 1, and then the rest of the non-zero values into fractile labels 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to NaN, qcut the remaining non-NaN data into 9 fractiles (1 to 9), then add 1 to each label (now 2 to 10) and manually label all the 0 values as fractile 1? Even this is tricky, because in addition to the 600 values, my dataset has another couple hundred values that may already be NaN before I convert the 0s to NaN.
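For reference, the approach described above can be sketched roughly as follows. This is only an illustration with made-up sample data (300 zeros plus 1 to 100 repeated three times); the ranking step is an assumption on my part to avoid qcut failing on duplicate edges among tied non-zero values:

```python
import numpy as np
import pandas as pd

# 600 sample values: 300 zeros plus 1..100 repeated three times
ser = pd.Series(np.concatenate([np.zeros(300),
                                np.tile(np.arange(1, 101, dtype=float), 3)]))

nonzero = ser[ser > 0]                       # zeros (and NaNs) stay out of the cut
# 9 fractiles over the nonzero values, labelled 1..9, then shifted to 2..10;
# ranking first sidesteps any duplicate bin edges among tied nonzero values
labels = pd.qcut(nonzero.rank(method='first'), 9, labels=False) + 2

result = pd.Series(np.nan, index=ser.index)  # values that were already NaN stay NaN
result[ser == 0] = 1                         # all zeros become fractile 1
result.loc[nonzero.index] = labels           # nonzero values get fractiles 2..10
```

With this sample data, all zeros end up in fractile 1 and the remaining values are spread over fractiles 2 to 10.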

Update 1/26/14:

I came up with the following interim solution. The problem with this code, though, is that if the high-frequency value is not at the edges of the distribution, it inserts an extra bin into the middle of the existing set of bins and throws everything off by one (or more).

import bisect

import pandas as pd


def fractile_cut(ser, num_fractiles):
    num_valid = ser.dropna().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()
    high_freq = []
    i = 0
    # peel off any value too frequent to fit in one fractile
    while vcounts.iloc[i] > num_valid / float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts.iloc[i]
        i += 1
    curr_ser = ser.copy()
    curr_ser = curr_ser[~curr_ser.isin(high_freq)]
    qcut = pd.qcut(curr_ser, remain_fractiles, retbins=True)
    qcut_bins = qcut[1]
    all_bins = list(qcut_bins)
    # splice the high-frequency values back in as extra bin edges
    for val in high_freq:
        bisect.insort(all_bins, val)
    cut = pd.cut(ser, bins=all_bins)
    ser_fractiles = pd.Series(cut.cat.codes + 1, index=ser.index)
    return ser_fractiles

I've had a lot of problems with qcut as well, so I used the Series.rank function combined with creating my own bins using those results. My code is on Github:

https://gist.github.com/ashishsingal1/e1828ffd1a449513b8f8
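A minimal sketch of that idea (my own illustration, not the gist's code): rank the values with method='first' so every record gets a unique rank, then take an equal-width cut over the ranks, which yields equal-count bins over the original values:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.concatenate([np.zeros(300),
                                np.random.randint(1, 100, size=300).astype(float)]))

ranks = ser.rank(method='first')                # unique rank per record, ties broken in order
# an equal-width cut over the ranks gives equal-count bins over the values
deciles = pd.cut(ranks, 10, labels=False) + 1   # decile labels 1..10
```

Because the ranks are just 1 through 600, each of the 10 bins ends up with exactly 60 records, zeros included.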

You ask about binning with non-unique bin edges, for which I have a fairly simple answer. In the case of your example, your intent and the behavior of qcut diverge in the pandas.tools.tile.qcut function, where the bins are defined:

bins = algos.quantile(x, quantiles)

Because your data is 50% 0s, this causes bins to be returned with multiple bin edges at the value 0 for any value of quantiles greater than 2. I see two possible resolutions. In the first, the fractile space is divided evenly, binning all 0s, but not only 0s, in the first bin. In the second, the fractile space is divided evenly for values greater than 0, binning all 0s and only 0s in the first bin.

import numpy as np
import pandas as pd
import pandas.core.algorithms as algos
from pandas import Series

In both cases, I'll create some random sample data fitting your description of 50% zeroes and the remaining values between 1 and 100:

zs = np.zeros(300)
rs = np.random.randint(1, 100, size=300)
arr = np.concatenate((zs, rs))
ser = Series(arr)

Solution 1: bin 1 contains both 0s and low values

bins = algos.quantile(np.unique(ser), np.linspace(0, 1, 11))
result = pd.tools.tile._bins_to_cuts(ser, bins, include_lowest=True)

The result is

In[61]: result.value_counts()
Out[61]:
[0, 9.3]        323
(27.9, 38.2]     37
(9.3, 18.6]      37
(88.7, 99]       35
(57.8, 68.1]     32
(68.1, 78.4]     31
(78.4, 88.7]     30
(38.2, 48.5]     27
(48.5, 57.8]     26
(18.6, 27.9]     22
dtype: int64

Solution 2: bin 1 contains only 0s

mx = np.ma.masked_equal(arr, 0, copy=True)
bins = algos.quantile(arr[~mx.mask], np.linspace(0, 1, 11))
bins = np.insert(bins, 0, 0)
bins[1] = bins[1]-(bins[1]/2)
result = pd.tools.tile._bins_to_cuts(arr, bins, include_lowest=True)

The result is:

In[133]: result.value_counts()
Out[133]:
[0, 0.5]        300
(0.5, 11]        32
(11, 18.8]       28
(18.8, 29.7]     30
(29.7, 39]       35
(39, 50]         26
(50, 59]         31
(59, 71]         31
(71, 79.2]       27
(79.2, 90.2]     30
(90.2, 99]       30
dtype: int64

There is work that could be done to Solution 2 to make it a little prettier I think, but you can see that the masked array is a useful tool to approach your goals.

Another way to do this is to introduce a minimal amount of noise, which will artificially create unique bin edges. Here's an example:

a = pd.Series(list(range(100)) + [0] * 20)


def jitter(a_series, noise_reduction=1000000):
    # noise much smaller than the data's spread, centered on zero
    return (np.random.random(len(a_series)) * a_series.std() / noise_reduction) \
        - (a_series.std() / (2 * noise_reduction))


# and now this works by adding a little noise
a_deciles = pd.qcut(a + jitter(a), 10, labels=False)

we can recreate the original error using something like this:

a_deciles = pd.qcut(a, 10, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 173, in qcut
precision=precision, include_lowest=True)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 192, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([  0.        ,   0.        ,   0.        ,   3.8       ,
11.73333333,  19.66666667,  27.6       ,  35.53333333,
43.46666667,  51.4       ,  59.33333333,  67.26666667,
75.2       ,  83.13333333,  91.06666667,  99.        ])

The problem is that pandas.qcut chooses the bins/quantiles so that each one has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of quantile).

The solutions are:

1 - Use pandas >= 0.20.0, which has a fix for this. An option duplicates='raise'|'drop' was added to control whether to raise on duplicated edges or to drop them, the latter resulting in fewer bins than specified, some larger (with more elements) than others.

2 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile.

3 - Rank your data with DataFrame.rank(method='first'). Ranking assigns a unique value (the rank) to each element in the dataframe while keeping the order of the elements (except for identical values, which are ranked in the order they appear in the array; see method='first').

Example:

pd.qcut(df, nbins) <-- this generates "ValueError: Bin edges must be unique"

Then use this instead:

pd.qcut(df.rank(method='first'), nbins)

4 - Specify a custom quantiles range, e.g. [0, .50, .75, 1.] to get unequal number of items per quantile

5 - Use pandas.cut that chooses the bins to be evenly spaced according to the values themselves, while pandas.qcut chooses the bins so that you have the same number of records in each bin
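For example, options 1 and 3 look like this on the series that produced the traceback above (a sketch; option 1 requires pandas >= 0.20.0):

```python
import pandas as pd

a = pd.Series(list(range(100)) + [0] * 20)

# Option 1: drop the duplicate edges -- fewer, unevenly sized bins
d1 = pd.qcut(a, 10, labels=False, duplicates='drop')

# Option 3: rank first -- exactly 10 equal-sized bins
d3 = pd.qcut(a.rank(method='first'), 10, labels=False)
```

With duplicates='drop' the repeated edges at 0 are collapsed, so you get fewer than 10 bins and the first one holds all the zeros plus some low values; the rank-based version gives exactly 10 bins of 12 records each.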

If you want to enforce equal size bins, even in the presence of duplicate values, you can use the following, 2 step process:

  1. Rank your values, using method='first' to have python assign a unique rank to all your records. If there is a duplicate value (i.e. a tie in the rank), this method will choose the first record it comes to and rank in that order.

df['rank'] = df['value'].rank(method='first')

  2. Use qcut on the rank to determine equal sized quantiles. The example below creates deciles (bins=10).

df['decile'] = pd.qcut(df['rank'].values, 10).codes
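Put together, the two steps might look like this on made-up sample data (`df` and the column names here are placeholders; labels=False gives the same integer codes as .codes above):

```python
import pandas as pd

df = pd.DataFrame({'value': [0] * 10 + list(range(1, 11))})  # heavy ties at 0

# Step 1: a unique rank per record, ties broken in order of appearance
df['rank'] = df['value'].rank(method='first')

# Step 2: equal-sized deciles over the ranks
df['decile'] = pd.qcut(df['rank'], 10, labels=False) + 1
```

With 20 records, every decile gets exactly 2 of them; the tied zeros are spread across deciles 1 through 5 rather than collapsing into one bin.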

I had this problem as well, so I wrote a small function that bins only the non-zero values and then inserts the labels where the original was not 0.

import numpy as np
import pandas as pd


def qcut2(x, n=10):
    x = np.array(x)
    x_index_not0 = [i for i in range(len(x)) if x[i] > 0]
    # cut the nonzero values into n-1 fractiles labelled 2..n
    x_cut_not0 = pd.qcut(x[x > 0], n - 1, labels=False) + 1
    y = np.zeros(len(x))
    y[x_index_not0] = x_cut_not0
    return y