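How to loop through IEnumerable in batches

I am developing a C# program that has an "IEnumerable users" which stores the ids of 4 million users. I need to loop through the IEnumerable and extract a batch of 1000 ids each time, so that another method can perform some operations on them.

How do I extract 1000 ids at a time, starting from the beginning of the IEnumerable, do some other work, then fetch the next batch of 1000, and so on?

Is this possible?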

The easiest way to do this is probably just to use the GroupBy method in LINQ:

    var batches = myEnumerable
        .Select((x, i) => new { x, i })
        .GroupBy(p => p.i / 1000, p => p.x);
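As a small worked example of the GroupBy approach, the sketch below wraps it in a helper and materializes the batches as lists. The class name `GroupByBatching` and the method name `ToBatches` are hypothetical, chosen only for illustration. It uses the `GroupBy(keySelector, elementSelector)` overload, so each group contains the original items rather than the index/item pairs.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GroupByBatching
{
    // Hypothetical helper: groups items by (position / batchSize).
    public static List<List<T>> ToBatches<T>(IEnumerable<T> source, int batchSize)
    {
        return source
            .Select((item, index) => new { item, index })   // pair each item with its position
            .GroupBy(p => p.index / batchSize, p => p.item)  // integer division gives the batch number
            .Select(g => g.ToList())                         // copy each group into its own list
            .ToList();
    }
}
```

For instance, `GroupByBatching.ToBatches(Enumerable.Range(1, 7), 3)` produces three batches: 1, 2, 3, then 4, 5, 6, then 7. Note that GroupBy reads the entire source before it hands out the first group, so with 4 million ids every id (plus its wrapper object) is held in memory at once.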

But for a more sophisticated solution, see this blog post on how to create your own extension method to do this. Duplicated here for posterity:

    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
    {
        List<T> nextbatch = new List<T>(batchSize);
        foreach (T item in collection)
        {
            nextbatch.Add(item);
            if (nextbatch.Count == batchSize)
            {
                yield return nextbatch;
                nextbatch = new List<T>();
                // or nextbatch.Clear(); but see Servy's comment below
            }
        }

        if (nextbatch.Count > 0)
            yield return nextbatch;
    }
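The code comment above mentions `nextbatch.Clear()` as an alternative but points to a comment that is not reproduced here. One likely reason to prefer a fresh list (an assumption, since the referenced comment is unavailable) is that a reused list is emptied as soon as the caller asks for the next batch. The sketch below contrasts the two variants; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class BatchPitfall
{
    // Variant that reuses a single list and clears it after each yield.
    public static IEnumerable<List<T>> BatchReusingList<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch.Clear(); // empties the very list the caller was just handed
            }
        }
        if (batch.Count > 0)
            yield return batch;
    }

    // Variant that allocates a new list per batch, as the answer above does.
    public static IEnumerable<List<T>> BatchWithFreshList<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize); // the previous list stays untouched
            }
        }
        if (batch.Count > 0)
            yield return batch;
    }
}
```

With `Enumerable.Range(1, 5)` and a batch size of 2, calling `.ToList()` on the reusing variant gives three references to the same list, which ends up holding only the value 5. The fresh-list variant gives 1, 2 then 3, 4 then 5. Reuse is only safe when the caller finishes with each batch before requesting the next one and never keeps a reference to it.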

You can achieve that using the Take and Skip extension methods of Enumerable. For more information on their usage, check out LINQ 101.

You can use MoreLINQ's Batch operator (available from NuGet):

    foreach (IEnumerable<User> batch in users.Batch(1000))
    {
        // use batch
    }

If adding the library is not an option, you can reuse its implementation:

    public static IEnumerable<IEnumerable<T>> Batch<T>(
        this IEnumerable<T> source, int size)
    {
        T[] bucket = null;
        var count = 0;

        foreach (var item in source)
        {
            if (bucket == null)
                bucket = new T[size];

            bucket[count++] = item;

            if (count != size)
                continue;

            yield return bucket.Select(x => x);

            bucket = null;
            count = 0;
        }

        // Return the last bucket with all remaining elements
        if (bucket != null && count > 0)
        {
            Array.Resize(ref bucket, count);
            yield return bucket.Select(x => x);
        }
    }

BTW, for performance you can simply return the bucket without calling Select(x => x). Select is optimized for arrays, but the selector delegate would still be invoked on each item. So in your case it's better to use

    yield return bucket;

Sounds like you need to use Skip and Take methods of your object. Example:

    users.Skip(1000).Take(1000)

This would skip the first 1000 and take the next 1000. You'd just need to increase the amount skipped with each call.

You could pass an integer variable as the argument to Skip to adjust how much is skipped, and wrap the call in a method:

    public IEnumerable<user> GetBatch(int pageNumber)
    {
        return users.Skip(pageNumber * 1000).Take(1000);
    }
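To show how the page number would be advanced until the source runs out, here is a minimal sketch. The class name `SkipTakePaging` and the method name `Pages` are hypothetical and not part of the answer above; the loop stops at the first empty page.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SkipTakePaging
{
    // Hypothetical helper: yields one page after another until Skip/Take returns nothing.
    public static IEnumerable<List<T>> Pages<T>(IEnumerable<T> source, int pageSize)
    {
        for (int page = 0; ; page++)
        {
            // Each call starts a new enumeration of the source and skips past earlier pages.
            List<T> batch = source.Skip(page * pageSize).Take(pageSize).ToList();
            if (batch.Count == 0)
                yield break; // no more items
            yield return batch;
        }
    }
}
```

A design caveat: for a general IEnumerable, Skip has to step over every skipped element, so each page re-reads the source from its beginning and the total work grows with the square of the number of pages (4 million ids in pages of 1000 means 4000 passes). If the sequence is backed by a query or another lazily produced source, it may also be re-executed for every page. A single-pass Batch method avoids both problems.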

Try using this:

    public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
        this IEnumerable<TSource> source,
        int batchSize)
    {
        var batch = new List<TSource>();
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<TSource>();
            }
        }

        if (batch.Any()) yield return batch;
    }

And to use the above function:

    foreach (var list in Users.Batch(1000))
    {
        // process the current batch of up to 1000 items
    }

Something like this would work:

    List<MyClass> batch = new List<MyClass>();
    foreach (MyClass item in items)
    {
        batch.Add(item);

        if (batch.Count == 1000)
        {
            // Perform operation on batch
            batch.Clear();
        }
    }

    // Process last batch
    if (batch.Any())
    {
        // Perform operation on batch
    }

And you could generalize this into a generic method, like this:

    static void PerformBatchedOperation<T>(IEnumerable<T> items,
                                           Action<IEnumerable<T>> operation,
                                           int batchSize)
    {
        List<T> batch = new List<T>();
        foreach (T item in items)
        {
            batch.Add(item);

            if (batch.Count == batchSize)
            {
                operation(batch);
                batch.Clear();
            }
        }

        // Process last batch
        if (batch.Any())
        {
            operation(batch);
        }
    }
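A possible way to call the generic method is sketched below. The method body is reproduced from the answer so the example compiles on its own; the class name `BatchedOperationDemo` and the caller `RecordBatchSizes` are hypothetical and exist only to show the delegate being invoked once per batch, including the final partial one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchedOperationDemo
{
    // Reproduced from the answer above so that this sketch is self-contained.
    public static void PerformBatchedOperation<T>(IEnumerable<T> items,
                                                  Action<IEnumerable<T>> operation,
                                                  int batchSize)
    {
        List<T> batch = new List<T>();
        foreach (T item in items)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                operation(batch);
                batch.Clear();
            }
        }
        if (batch.Any())
        {
            operation(batch); // final, possibly smaller, batch
        }
    }

    // Hypothetical caller: records how many ids each invocation received.
    public static List<int> RecordBatchSizes(IEnumerable<int> ids, int batchSize)
    {
        var sizes = new List<int>();
        PerformBatchedOperation(ids, batch => sizes.Add(batch.Count()), batchSize);
        return sizes;
    }
}
```

Calling `BatchedOperationDemo.RecordBatchSizes(Enumerable.Range(1, 2500), 1000)` returns the sizes 1000, 1000 and 500. Because the same list instance is passed on every call, the delegate should copy the batch (for example with `batch.ToList()`) if it needs the items after it returns.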

How about

    int batchsize = 5;
    List<string> collection = new List<string> { "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12" };
    for (int x = 0; x < Math.Ceiling((decimal)collection.Count / batchsize); x++)
    {
        // t holds the current batch of up to batchsize items
        var t = collection.Skip(x * batchsize).Take(batchsize);
    }

In a streaming context, the enumerator might block in the middle of a batch simply because the next value has not been produced (yielded) yet. In that case it is useful to have a timeout, so that a partially filled batch is still emitted after a given time. I used this, for example, for tailing a cursor in MongoDB. It's a little complicated, because the enumeration has to be done on another thread.

    public static IEnumerable<List<T>> TimedBatch<T>(this IEnumerable<T> collection, double timeoutMilliseconds, long maxItems)
    {
        object _lock = new object();
        List<T> batch = new List<T>();
        AutoResetEvent yieldEventTriggered = new AutoResetEvent(false);
        AutoResetEvent yieldEventFinished = new AutoResetEvent(false);
        bool yieldEventTriggering = false;

        var task = Task.Run(delegate
        {
            foreach (T item in collection)
            {
                lock (_lock)
                {
                    batch.Add(item);

                    if (batch.Count == maxItems)
                    {
                        yieldEventTriggering = true;
                        yieldEventTriggered.Set();
                    }
                }

                if (yieldEventTriggering)
                {
                    yieldEventFinished.WaitOne(); //wait for the yield to finish, and batch to be cleaned
                    yieldEventTriggering = false;
                }
            }
        });

        while (!task.IsCompleted)
        {
            //Wait for the event to be triggered, or the timeout to finish
            yieldEventTriggered.WaitOne(TimeSpan.FromMilliseconds(timeoutMilliseconds));
            lock (_lock)
            {
                if (batch.Count > 0) //yield return only if the batch accumulated something
                {
                    yield return batch;
                    batch.Clear();
                    yieldEventFinished.Set();
                }
            }
        }
        task.Wait();
    }