Faster alternative to nested loops?

I need to create a list of combinations of numbers. The numbers are quite small, so I can use byte rather than int. However, it requires many nested loops to get every possible combination. I'm wondering whether there is a more efficient way to do what I'm after. Code so far is:

var data = new List<byte[]>();
for (byte a = 0; a < 2; a++)
for (byte b = 0; b < 3; b++)
for (byte c = 0; c < 4; c++)
for (byte d = 0; d < 3; d++)
for (byte e = 0; e < 4; e++)
for (byte f = 0; f < 3; f++)
for (byte g = 0; g < 3; g++)
for (byte h = 0; h < 4; h++)
for (byte i = 0; i < 2; i++)
for (byte j = 0; j < 4; j++)
for (byte k = 0; k < 4; k++)
for (byte l = 0; l < 3; l++)
for (byte m = 0; m < 4; m++)
{
data.Add(new [] {a, b, c, d, e, f, g, h, i, j, k, l, m});
}

I was thinking of using something like a BitArray, but I'm not sure how I could incorporate it.

Any recommendations would be greatly appreciated. Alternatively, perhaps this is already the fastest way of doing what I want?

Edit A couple of quick points (and apologies I didn't put these in the original post):

  • The numbers and their order (2, 3, 4, 3, 4, 3, 3, etc.) are very important, so a solution such as Generating Permutations using LINQ won't help, because the maximums in each "column" are different
  • I'm not a mathematician, so I apologise if I'm not using technical terms like "permutations" and "combinations" correctly :)
  • I do need to populate all of these combinations at once; I can't just grab one or another based on an index
  • byte is faster than int, I can assure you. Memory-wise, 67m+ byte arrays are also far better than int arrays
  • My ultimate goal here is to look for a faster alternative to nested loops
  • I considered using parallel programming, but due to the iterative nature of what I'm trying to achieve, I couldn't find a way to do it successfully (even with ConcurrentBag); I'm happy to be proven wrong though :)
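
The loop bounds above multiply out to the total number of rows, which is easy to sanity-check. A quick back-of-the-envelope sketch (in Python for illustration only; the question's code is C#):

```python
from math import prod

# Loop bounds from the question's nested loops, in order
radices = [2, 3, 4, 3, 4, 3, 3, 4, 2, 4, 4, 3, 4]
print(prod(radices))  # 3981312 rows in the list
print(4 ** 13)        # 67108864 -- the "13 loops counting to 4" benchmark variant
```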

Conclusion

Caramiriel has provided a good micro-optimisation which shaves some time off the loops, so I've marked that answer as correct. Eric also mentioned that it is faster to pre-allocate the List. But, at this stage it seems that the nested loops are in fact the fastest possible way of doing this (depressing, I know!).

If you want to have a go at benchmarking as I attempted with a Stopwatch, use 13 loops counting to 4 in each loop; that gives you over 67 million rows in the list. On my machine (i5-3320M 2.6 GHz) the optimised version takes around 2.2 seconds.


On my machine, this generates the combinations in 222 ms vs 760 ms (the 13 for loops):

private static byte[,] GenerateCombinations(byte[] maxNumberPerLevel)
{
    var levels = maxNumberPerLevel.Length;

    var periodsPerLevel = new int[levels];
    var totalItems = 1;
    for (var i = 0; i < levels; i++)
    {
        periodsPerLevel[i] = totalItems;
        totalItems *= maxNumberPerLevel[i];
    }

    var results = new byte[totalItems, levels];

    Parallel.For(0, levels, level =>
    {
        var periodPerLevel = periodsPerLevel[level];
        var maxPerLevel = maxNumberPerLevel[level];
        for (var i = 0; i < totalItems; i++)
            results[i, level] = (byte)(i / periodPerLevel % maxPerLevel);
    });

    return results;
}
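
The core of this answer is the index arithmetic: row i, column c is (i / period[c]) % radix[c], where period[c] is the product of the radices of all earlier columns. A small sketch of that formula (in Python for brevity; note that with the periods built this way, column 0 varies fastest, so the row order is the reverse of the nested loops even though the same set of rows is produced):

```python
def combinations(radices):
    periods, total = [], 1
    for r in radices:
        periods.append(total)   # product of all earlier radices
        total *= r
    # row i, column c: (i // periods[c]) % radices[c]
    return [[(i // periods[c]) % radices[c] for c in range(len(radices))]
            for i in range(total)]

rows = combinations([2, 3, 4])
print(len(rows))  # 24
print(rows[:3])   # [[0, 0, 0], [1, 0, 0], [0, 1, 0]]
```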

What about using Parallel.For() to run it? (Struct optimization kudos to @Caramiriel). I slightly modified the values (a is 5 instead of 2) so I'm more confident in the results.

var data = new ConcurrentStack<List<Bytes>>();
var sw = new Stopwatch();

sw.Start();

Parallel.For(0, 5, () => new List<Bytes>(3*4*3*4*3*3*4*2*4*4*3*4),
    (a, loop, localList) => {
        var bytes = new Bytes();
        bytes.A = (byte) a;
        for (byte b = 0; b < 3; b++) {
            bytes.B = b;
            for (byte c = 0; c < 4; c++) {
                bytes.C = c;
                for (byte d = 0; d < 3; d++) {
                    bytes.D = d;
                    for (byte e = 0; e < 4; e++) {
                        bytes.E = e;
                        for (byte f = 0; f < 3; f++) {
                            bytes.F = f;
                            for (byte g = 0; g < 3; g++) {
                                bytes.G = g;
                                for (byte h = 0; h < 4; h++) {
                                    bytes.H = h;
                                    for (byte i = 0; i < 2; i++) {
                                        bytes.I = i;
                                        for (byte j = 0; j < 4; j++) {
                                            bytes.J = j;
                                            for (byte k = 0; k < 4; k++) {
                                                bytes.K = k;
                                                for (byte l = 0; l < 3; l++) {
                                                    bytes.L = l;
                                                    for (byte m = 0; m < 4; m++) {
                                                        bytes.M = m;
                                                        localList.Add(bytes);
                                                    }
                                                }
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        return localList;
    }, x => {
        data.Push(x);
    });

var joinedData = _join(data);

_join() is a private method, defined as:

private static IList<Bytes> _join(IEnumerable<IList<Bytes>> data) {
    var value = new List<Bytes>();
    foreach (var d in data) {
        value.AddRange(d);
    }
    return value;
}

On my system, this version runs approximately 6 times faster (1.718 seconds versus 0.266 seconds).

var numbers = new[] { 2, 3, 4, 3, 4, 3, 3, 4, 2, 4, 4, 3, 4 };
var result = (numbers.Select(i => Enumerable.Range(0, i))).CartesianProduct();

Using the extension method at http://ericlippert.com/2010/06/28/computing-a-cartesian-product-with-linq/

public static IEnumerable<IEnumerable<T>> CartesianProduct<T>(this IEnumerable<IEnumerable<T>> sequences)
{
    // base case:
    IEnumerable<IEnumerable<T>> result = new[] { Enumerable.Empty<T>() };
    foreach (var sequence in sequences)
    {
        // don't close over the loop variable (fixed in C# 5 BTW)
        var s = sequence;
        // recursive case: use SelectMany to build
        // the new product out of the old one
        result =
            from seq in result
            from item in s
            select seq.Concat(new[] { item });
    }
    return result;
}
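
For comparison, Python's itertools.product is the standard-library equivalent of this LINQ Cartesian product (shown purely to illustrate the semantics; it won't help the C# timings). Like the LINQ query, it is lazy and varies the last column fastest:

```python
from itertools import product

radices = [2, 3, 4, 3, 4, 3, 3, 4, 2, 4, 4, 3, 4]
combos = product(*(range(r) for r in radices))  # lazy, like the LINQ query
print(next(combos))  # (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
print(next(combos))  # (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
```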

List stores its values in an internal array of fixed length. When you call List.Add, it checks whether there is enough space. When it cannot add the new element, it creates a new, larger array, copies all the previous elements over, and then adds the new one. This takes quite a few cycles.

Since you already know the number of elements, you can create the list with the correct capacity; that alone should be a lot faster.

Also, I'm not sure how you access the values, but you could create this thing once and save the image in code (loading it from disk will probably be slower than what you're doing now). How many times do you read/write to this thing?

As a reminder: you probably do not need this kind of code while developing your own solution. It can and should only be used in very specific situations. Readability is often more important than speed.

You can use the properties of a struct and allocate the structure in advance. I cut off some levels in the sample below, but I'm sure you'll be able to figure out the specifics. Runs about 5-6 times faster than the original (release mode).

The block:

struct ByteBlock
{
    public byte A;
    public byte B;
    public byte C;
    public byte D;
    public byte E;
}

The loop:

var data = new ByteBlock[2*3*4*3*4];
var counter = 0;

var bytes = new ByteBlock();

for (byte a = 0; a < 2; a++)
{
    bytes.A = a;
    for (byte b = 0; b < 3; b++)
    {
        bytes.B = b;
        for (byte c = 0; c < 4; c++)
        {
            bytes.C = c;
            for (byte d = 0; d < 3; d++)
            {
                bytes.D = d;
                for (byte e = 0; e < 4; e++)
                {
                    bytes.E = e;
                    data[counter++] = bytes;
                }
            }
        }
    }
}

It's faster because it doesn't allocate a new array every time a row is added. Also, because the original builds each row from scratch, it has to read every value (a, b, c, d, e) on every iteration; here each value is written only once, at the level of its own loop, and the whole struct is copied into the pre-allocated array (data locality).

Also read the comments for side-effects.

Edited the answer to use a T[] instead of a List<T>.

Method 1

One way to make it faster is to specify the capacity if you plan to keep using List<byte[]>, like this.

var data = new List<byte[]>(2 * 3 * 4 * 3 * 4 * 3 * 3 * 4 * 2 * 4 * 4 * 3 * 4);

Method 2

Furthermore, you could use System.Array directly to gain faster access. I recommend this approach since your question insists that every element be physically populated in memory, upfront.

var data = new byte[2 * 3 * 4 * 3 * 4 * 3 * 3 * 4 * 2 * 4 * 4 * 3 * 4][];
int counter = 0;

for (byte a = 0; a < 2; a++)
for (byte b = 0; b < 3; b++)
for (byte c = 0; c < 4; c++)
for (byte d = 0; d < 3; d++)
for (byte e = 0; e < 4; e++)
for (byte f = 0; f < 3; f++)
for (byte g = 0; g < 3; g++)
for (byte h = 0; h < 4; h++)
for (byte i = 0; i < 2; i++)
for (byte j = 0; j < 4; j++)
for (byte k = 0; k < 4; k++)
for (byte l = 0; l < 3; l++)
for (byte m = 0; m < 4; m++)
    data[counter++] = new[] { a, b, c, d, e, f, g, h, i, j, k, l, m };

This takes 596ms to complete on my computer, which is about 10.4% faster than the code in question (which takes 658ms).

Method 3

Alternatively, you can use the following technique for low-cost initialization, suited to access in a sparse fashion. It is especially favourable when only some elements may be needed and determining them all upfront seems unnecessary. Techniques like this may also become the only viable option when the elements grow larger and memory runs short.

In this implementation, every element is determined lazily, on the fly, upon access. Naturally, this comes at the cost of additional CPU time incurred on each access.

class HypotheticalBytes
{
    private readonly int _c1, _c2, _c3, _c4, _c5, _c6, _c7, _c8, _c9, _c10, _c11, _c12;
    private readonly int _t0, _t1, _t2, _t3, _t4, _t5, _t6, _t7, _t8, _t9, _t10, _t11;

    public int Count
    {
        get { return _t0; }
    }

    public HypotheticalBytes(
        int c0, int c1, int c2, int c3, int c4, int c5, int c6, int c7, int c8, int c9, int c10, int c11, int c12)
    {
        _c1 = c1;
        _c2 = c2;
        _c3 = c3;
        _c4 = c4;
        _c5 = c5;
        _c6 = c6;
        _c7 = c7;
        _c8 = c8;
        _c9 = c9;
        _c10 = c10;
        _c11 = c11;
        _c12 = c12;
        _t11 = _c12 * c11;
        _t10 = _t11 * c10;
        _t9 = _t10 * c9;
        _t8 = _t9 * c8;
        _t7 = _t8 * c7;
        _t6 = _t7 * c6;
        _t5 = _t6 * c5;
        _t4 = _t5 * c4;
        _t3 = _t4 * c3;
        _t2 = _t3 * c2;
        _t1 = _t2 * c1;
        _t0 = _t1 * c0;
    }

    public byte[] this[int index]
    {
        get
        {
            return new[]
            {
                (byte)(index / _t1),
                (byte)((index / _t2) % _c1),
                (byte)((index / _t3) % _c2),
                (byte)((index / _t4) % _c3),
                (byte)((index / _t5) % _c4),
                (byte)((index / _t6) % _c5),
                (byte)((index / _t7) % _c6),
                (byte)((index / _t8) % _c7),
                (byte)((index / _t9) % _c8),
                (byte)((index / _t10) % _c9),
                (byte)((index / _t11) % _c10),
                (byte)((index / _c12) % _c11),
                (byte)(index % _c12)
            };
        }
    }
}

This takes 897ms to complete on my computer (also creating & adding to an Array as in Method 2), which is about 36.3% slower than the code in question (which takes 658ms).
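
The indexer above is a mixed-radix decode: it strips digits off the index, least-significant (last) column first. The same decode can be written compactly (a Python sketch for illustration; like the C# class, the last column varies fastest, matching the nested-loop order):

```python
def decode(index, radices):
    # Peel off the least-significant digit first, then reverse
    digits = []
    for r in reversed(radices):
        digits.append(index % r)
        index //= r
    return digits[::-1]

print(decode(0, [2, 3, 4]))   # [0, 0, 0]
print(decode(1, [2, 3, 4]))   # [0, 0, 1]
print(decode(23, [2, 3, 4]))  # [1, 2, 3]  (the last of 2*3*4 = 24 rows)
```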

Do you need the result to be an array of arrays? With the current setup the length of the inner arrays is fixed, so they could be replaced with structs. This would allow the entire thing to be reserved as one contiguous block of memory and provides easier access to the elements (though I'm not sure how you use this thing later on).

The approach below is much faster (41ms vs 1071ms for the original on my box):

struct element {
    public byte a;
    public byte b;
    public byte c;
    public byte d;
    public byte e;
    public byte f;
    public byte g;
    public byte h;
    public byte i;
    public byte j;
    public byte k;
    public byte l;
    public byte m;
}

element[] WithStruct() {
    var t = new element[3981312];
    int z = 0;
    for (byte a = 0; a < 2; a++)
    for (byte b = 0; b < 3; b++)
    for (byte c = 0; c < 4; c++)
    for (byte d = 0; d < 3; d++)
    for (byte e = 0; e < 4; e++)
    for (byte f = 0; f < 3; f++)
    for (byte g = 0; g < 3; g++)
    for (byte h = 0; h < 4; h++)
    for (byte i = 0; i < 2; i++)
    for (byte j = 0; j < 4; j++)
    for (byte k = 0; k < 4; k++)
    for (byte l = 0; l < 3; l++)
    for (byte m = 0; m < 4; m++)
    {
        t[z].a = a;
        t[z].b = b;
        t[z].c = c;
        t[z].d = d;
        t[z].e = e;
        t[z].f = f;
        t[z].g = g;
        t[z].h = h;
        t[z].i = i;
        t[z].j = j;
        t[z].k = k;
        t[z].l = l;
        t[z].m = m;
        z++;
    }
    return t;
}

What you are doing is counting (with variable radix, but still counting).

Since you are using C#, I assume you don't want to play with useful memory layout and data structures that let you really optimize your code.

So here I'm posting something different, which may not suit your case, but is worth noting: in case you actually access the list in a sparse fashion, here is a class that lets you compute the i-th element in linear time (rather than exponential time, as with the other answers)

class Counter
{
    public int[] Radices;

    public int[] this[int n]
    {
        get
        {
            int[] v = new int[Radices.Length];
            int i = Radices.Length - 1;

            while (n != 0 && i >= 0)
            {
                // Hope C# has an IL opcode for div-and-remainder like x86 does
                v[i] = n % Radices[i];
                n /= Radices[i--];
            }
            return v;
        }
    }
}

You can use this class this way

Counter c = new Counter();
c.Radices = new int[] { 2,3,4,3,4,3,3,4,2,4,4,3,4};

now c[i] is the same as the i-th element of your list; if you name your list l, c[i] equals l[i].

As you can see, you can easily avoid all those loops :) even when you precompute the whole list, since you can simply implement a carry-ripple counter.

Counters are a very well-studied subject; I strongly advise searching the literature if you feel like it.
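
The carry-ripple idea in one small sketch (Python for illustration; this version puts the least-significant digit at index 0, which is an ordering choice, while the class above uses the opposite convention):

```python
def next_state(digits, radices):
    """Carry-ripple increment of a mixed-radix counter (digit 0 is least significant)."""
    digits = digits[:]
    for i, r in enumerate(radices):
        digits[i] += 1
        if digits[i] < r:
            return digits
        digits[i] = 0  # overflow: reset this digit and carry into the next one
    return None        # wrapped all the way around: enumeration is complete

state, count = [0, 0, 0], 0
while state is not None:
    count += 1
    state = next_state(state, [2, 3, 4])
print(count)  # 24 -- every combination visited exactly once
```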

Some of your numbers fit exactly into an integer number of bits, so you can "pack" them together with the number from the level above:

for (byte lm = 0; lm < 12; lm++)
{
    ...
    t[z].l = (byte)((lm & 12) >> 2);
    t[z].m = (byte)(lm & 3);
    ...
}

Of course, this makes the code less readable, but you save one loop. This can be done whenever one of the numbers is a power of two, which is the case seven times here.
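
A quick check of the packing trick (a Python sketch; the masks rely on m's radix being 4, a power of two, so the fused counter splits cleanly with shifts and masks):

```python
# Fuse the two innermost loops (l in 0..2, m in 0..3) into one counter lm in 0..11
pairs = []
for lm in range(12):
    l = (lm & 12) >> 2  # same as lm >> 2 for lm < 16
    m = lm & 3
    pairs.append((l, m))

print(pairs == [(l, m) for l in range(3) for m in range(4)])  # True
```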

All your numbers are compile time constant.

What about unrolling all the loops into a list (using your program to write code):

data.Add(new [] {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0});
data.Add(new [] {1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0});
etc.

That should at least take away the overhead of the for loops (if there is any).

I'm not too familiar with C#, but there seems to be some means of serializing objects. What if you just generated that List once and serialized it in some form? I'm not sure whether deserialization is quicker than creating the List and adding the elements, though.

Here's a different way that needs only two loops. The idea is to increase the first element, and if that number overflows, to increase the next one.

Instead of displaying the data, you can use currentValues.Clone and add the cloned version to your list. For me this ran faster than your version.

byte[] maxValues = { 2, 3, 4 };
byte[] currentValues = { 0, 0, 0 };

do {
    Console.WriteLine("{0}, {1}, {2}", currentValues[0], currentValues[1], currentValues[2]);

    currentValues[0] += 1;

    for (int i = 0; i <= maxValues.Length - 2; i++) {
        if (currentValues[i] < maxValues[i]) {
            break;
        }

        currentValues[i] = 0;
        currentValues[i + 1] += 1;
    }

    // Stop the whole thing if the last number is over
    // } while (currentValues[currentValues.Length - 1] < maxValues[maxValues.Length - 1]);
} while (currentValues.Last() < maxValues.Last()); // Last() requires using System.Linq;
  • Hope this code works, I converted it from vb

Here is another solution. Outside of VS it runs in about 437.5 ms, which is about 26% faster than the original code (593.7 ms on my computer):

static List<byte[]> Combinations(byte[] maxs)
{
    int length = maxs.Length;
    int count = 1; // 3981312
    Array.ForEach(maxs, m => count *= m);
    byte[][] data = new byte[count][];
    byte[] counters = new byte[length];

    for (int r = 0; r < count; r++)
    {
        byte[] row = new byte[length];
        for (int c = 0; c < length; c++)
            row[c] = counters[c];
        data[r] = row;

        for (int i = length - 1; i >= 0; i--)
        {
            counters[i]++;
            if (counters[i] == maxs[i])
                counters[i] = 0;
            else
                break;
        }
    }

    return data.ToList();
}