To randomly shuffle the examples, you can use the `tf.train.shuffle_batch` function instead of `tf.train.batch`, as follows:

```
import tensorflow as tf

# Dequeue shuffled batches of 100 serialized examples at a time.
parsed_batch = tf.train.shuffle_batch([serialized_example],
                                      batch_size=100, capacity=1000,
                                      min_after_dequeue=200)
```
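
The variable name notwithstanding, `parsed_batch` still holds serialized protocol buffer strings; the usual next step is to decode them with `tf.parse_example`, which parses a whole batch at once. A brief sketch - the feature keys and dtypes here are hypothetical and must match whatever was written into the records:

```
# Hypothetical feature specification; adjust the keys and dtypes to
# match how the examples were serialized.
features = tf.parse_example(parsed_batch, features={
    'image_raw': tf.FixedLenFeature([], tf.string),
    'label': tf.FixedLenFeature([], tf.int64),
})
images, labels = features['image_raw'], features['label']
```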

`tf.train.shuffle_batch` (as well as `tf.train.batch`) creates a `tf.Queue` and keeps adding `serialized_examples` to it. `capacity` bounds how many elements can be stored in the queue at one time. A bigger capacity leads to higher memory usage, but also to less latency caused by threads waiting for the queue to fill up.
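
To see where `serialized_example` comes from and how the queue actually gets filled, here is a minimal end-to-end sketch, assuming a hypothetical TFRecord file `data.tfrecords`; the background threads that feed the queue are launched by `tf.train.start_queue_runners`:

```
import tensorflow as tf

# Hypothetical input file; the reader pulls records from a filename queue.
filename_queue = tf.train.string_input_producer(["data.tfrecords"])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

parsed_batch = tf.train.shuffle_batch([serialized_example],
                                      batch_size=100, capacity=1000,
                                      min_after_dequeue=200)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    # Start the threads that fill the filename queue and the shuffle queue.
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    batch_of_strings = sess.run(parsed_batch)  # 100 serialized examples
    coord.request_stop()
    coord.join(threads)
```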

`min_after_dequeue` is the minimum number of elements left in the queue after a dequeue. The `shuffle_batch` queue does not shuffle elements perfectly uniformly - it is designed with huge datasets that do not fit in memory in mind. Instead, it keeps between `min_after_dequeue` and `capacity` elements in memory and randomly chooses a batch from them; afterwards, it enqueues more elements to keep their number between `min_after_dequeue` and `capacity`. Thus, the bigger the value of `min_after_dequeue`, the more random the elements are - the batch of `batch_size` elements is guaranteed to be drawn from at least `min_after_dequeue` consecutive elements - but the bigger `capacity` has to be, and the longer it takes to fill the queue initially.
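
To make this trade-off concrete, the following is a single-threaded Python sketch of the buffered shuffling described above - an approximation for illustration, not the actual multi-threaded implementation inside `shuffle_batch`:

```
import random

def buffered_shuffle(stream, capacity, min_after_dequeue):
    # Keep a bounded buffer: fill it up to `capacity`, then yield random
    # elements while more than `min_after_dequeue` of them remain.
    buf, it, exhausted = [], iter(stream), False
    while buf or not exhausted:
        while not exhausted and len(buf) < capacity:
            try:
                buf.append(next(it))
            except StopIteration:
                exhausted = True
        while buf and (exhausted or len(buf) > min_after_dequeue):
            yield buf.pop(random.randrange(len(buf)))

# The shuffle is only local: with capacity=8, the first element yielded
# must come from the first 8 elements of the stream.
print(list(buffered_shuffle(range(20), capacity=8, min_after_dequeue=4)))
```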