Similarly, for the shuffle method, the same quantity appears, and according to the documentation:
buffer_size: A tf.int64 scalar tf.Tensor, representing the number of elements from this dataset from which the new dataset will sample.
What is the relation between these parameters?
Suppose I create a Dataset object as follows:
tr_data = TFRecordDataset(trainfilenames)
tr_data = tr_data.map(providefortraining, output_buffer_size=10 * trainbatchsize, num_parallel_calls=5)
tr_data = tr_data.shuffle(buffer_size=100 * trainbatchsize)
tr_data = tr_data.prefetch(buffer_size=10 * trainbatchsize)
tr_data = tr_data.batch(trainbatchsize)
What role is being played by the buffer parameters in the above snippet?
Importance of buffer_size in shuffle()
I wanted to follow up on the previous answer from @mrry to stress the importance of buffer_size in tf.data.Dataset.shuffle().
Having a low buffer_size will not just give you inferior shuffling in some cases: it can mess up your whole training.
Suppose, for instance, that you are training a cat classifier on images, and your data is organized in the following way (with 10000 images in each category):
train/
    cat/
        filename_00001.jpg
        filename_00002.jpg
        ...
    not_cat/
        filename_10001.jpg
        filename_10002.jpg
        ...
A standard way to input data with tf.data can be to have a list of filenames and a list of corresponding labels, and use tf.data.Dataset.from_tensor_slices() to create the dataset:
import tensorflow as tf
filenames = ["filename_00001.jpg", "filename_00002.jpg", ...,
             "filename_10001.jpg", "filename_10002.jpg", ...]
labels = [1, 1, ..., 0, 0, ...]  # 1 for cat, 0 for not_cat
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=1000)  # 1000 should be enough right?
dataset = dataset.map(...)  # transform to images, preprocess, repeat, batch...
The big issue with the code above is that the dataset will actually not be shuffled in the right way. For about the first half of an epoch, we will only see cat images, and for the second half only non-cat images. This will hurt training a lot.
At the beginning of training, the dataset will take the first 1000 filenames and put them in its buffer, then pick one at random among them. Since the first 1000 images are all cat images, we will only pick cat images at the beginning.
The fix here is to make sure that buffer_size is at least 20000 (the size of the dataset), or to shuffle filenames and labels in advance (with the same indices, obviously).
Since storing all the filenames and labels in memory is not an issue, we can actually use buffer_size = len(filenames) to make sure that everything will be shuffled together. Make sure to call tf.data.Dataset.shuffle() before applying the heavy transformations (like reading the images, processing them, batching...).
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenames))
dataset = dataset.map(...)  # transform to images, preprocess, repeat, batch...
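To see the effect of buffer_size on class-sorted data without running TensorFlow, here is a minimal pure-Python sketch of a shuffle buffer. It is a simplified model written for illustration, not TensorFlow's actual implementation, and shuffle_with_buffer and cat_fraction are hypothetical helper names. It compares a buffer of 1000 with a buffer of len(labels), and also tries the other fix of shuffling filenames and labels together in advance.

```python
import random


def shuffle_with_buffer(items, buffer_size, seed=0):
    """Yield items in the order a simple fixed-size shuffle buffer emits them.

    Simplified model for illustration, not TensorFlow's implementation.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Fill the buffer with the first `buffer_size` items.
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered item, then put the next input item in its slot.
    for item in source:
        index = rng.randrange(len(buffer))
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit the remaining buffered items in random order.
    rng.shuffle(buffer)
    yield from buffer


def cat_fraction(labels, start, stop):
    """Fraction of cat labels (1) among labels[start:stop]."""
    return sum(labels[start:stop]) / (stop - start)


# Class-sorted data as in the example: 10000 cats (1), then 10000 non-cats (0).
filenames = [f"filename_{i:05d}.jpg" for i in range(1, 20001)]
labels = [1] * 10000 + [0] * 10000

small = list(shuffle_with_buffer(labels, buffer_size=1000))
full = list(shuffle_with_buffer(labels, buffer_size=len(labels)))

# Alternative fix: shuffle filenames and labels together (same permutation)
# before building the dataset, then a small buffer is no longer a problem.
pairs = list(zip(filenames, labels))
random.Random(0).shuffle(pairs)
shuffled_filenames, shuffled_labels = (list(column) for column in zip(*pairs))
pre_shuffled = list(shuffle_with_buffer(shuffled_labels, buffer_size=1000))

print("buffer_size=1000:        ", cat_fraction(small, 0, 5000))
print("buffer_size=len(labels): ", cat_fraction(full, 0, 5000))
print("pre-shuffled, 1000:      ", cat_fraction(pre_shuffled, 0, 5000))
```

With the small buffer, the first number is exactly 1.0: in this model the first 5000 outputs can only come from the first 6000 inputs, which are all cats. The other two numbers should come out close to 0.5.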
The takeaway is to always double check what the shuffling will do. A good way to catch these errors might be to plot the label distribution of batches over time (make sure that batches contain about the same distribution as the training set, half cat and half non-cat in our example).
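One way to run that check, sketched in plain Python: compute the fraction of cat labels in each consecutive batch and look at how it evolves. The helper name batch_label_fractions is hypothetical, and the sketch assumes the labels have already been collected, in order, from the input pipeline into a list. Plotting the resulting values is left out.

```python
def batch_label_fractions(labels, batch_size, positive_label=1):
    """Return the fraction of `positive_label` in each consecutive full batch."""
    fractions = []
    # Step through the labels one full batch at a time (a trailing partial
    # batch is ignored).
    for start in range(0, len(labels) - batch_size + 1, batch_size):
        batch = labels[start:start + batch_size]
        fractions.append(batch.count(positive_label) / batch_size)
    return fractions


# Toy label streams: one still sorted by class, one well mixed.
badly_shuffled = [1] * 8 + [0] * 8
well_mixed = [1, 0] * 8

print(batch_label_fractions(badly_shuffled, batch_size=4))  # [1.0, 1.0, 0.0, 0.0]
print(batch_label_fractions(well_mixed, batch_size=4))      # [0.5, 0.5, 0.5, 0.5]
```

If the fractions start near 1.0 and end near 0.0, as in the first case, the data is still effectively sorted by class and the shuffle buffer is too small.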