tensorflow Reading the data Random shuffling the examples


To randomly shuffle the examples, you can use the tf.train.shuffle_batch function instead of tf.train.batch, as follows:

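parsed_batch = tf.train.shuffle_batch([serialized_example],
    batch_size=100, capacity=1000,
    # min_after_dequeue is required and must be smaller than capacity;
    min_after_dequeue=200)  # 200 is an illustrative value (assumption)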

tf.train.shuffle_batch (as well as tf.train.batch) creates a queue (a tf.RandomShuffleQueue for shuffle_batch, a tf.FIFOQueue for batch) and keeps adding serialized_example tensors to it.

capacity is the maximum number of elements the queue can hold at one time. A larger capacity uses more memory, but reduces the latency caused by the consumer waiting for the queue to be filled.

min_after_dequeue is the minimum number of elements that must remain in the queue after a dequeue. The shuffle_batch queue does not shuffle elements perfectly uniformly - it is designed for datasets too large to fit in memory. Instead, it keeps between min_after_dequeue and capacity elements in memory and randomly chooses a batch from them. It then enqueues more elements to keep the count between min_after_dequeue and capacity. Thus, the larger min_after_dequeue is, the more random the batches are, because each batch of batch_size elements is guaranteed to be drawn from at least min_after_dequeue consecutive elements. The cost is that capacity has to be larger as well, and the queue takes longer to fill initially.
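
The buffering behaviour described above can be imitated in plain Python, which makes the roles of capacity and min_after_dequeue easier to see. This is only a minimal sketch, not TensorFlow's implementation. The function name simulated_shuffle_batch and its seed parameter are hypothetical. It also assumes a single thread that always refills the buffer to capacity, requires capacity >= min_after_dequeue + batch_size, and drops a leftover partial batch.

```python
import random


def simulated_shuffle_batch(stream, batch_size, capacity, min_after_dequeue, seed=None):
    """Yield shuffled batches from `stream` using a bounded in-memory buffer.

    Hypothetical helper for illustration only; it is not a TensorFlow API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    # Assumption of this sketch: the buffer must be able to hold one batch
    # on top of the elements that have to stay behind after a dequeue.
    if capacity < min_after_dequeue + batch_size:
        raise ValueError("capacity must be >= min_after_dequeue + batch_size")

    rng = random.Random(seed)
    source = iter(stream)
    buffer = []
    exhausted = False

    while True:
        # Producer: top the buffer up to `capacity` (or until input runs out).
        while not exhausted and len(buffer) < capacity:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True

        # Consumer: while input remains, a dequeue must leave at least
        # `min_after_dequeue` elements behind; once the input is exhausted
        # that floor is lifted so the buffer can drain.
        floor = 0 if exhausted else min_after_dequeue
        if len(buffer) - batch_size < floor:
            return  # only reachable after exhaustion: short remainder is dropped

        # Pick `batch_size` elements uniformly at random from the buffer.
        batch = []
        for _ in range(batch_size):
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]
            batch.append(buffer.pop())
        yield batch


# Usage: 1000 ordered examples, shuffled through a 100-element buffer.
batches = list(simulated_shuffle_batch(
    range(1000), batch_size=10, capacity=100, min_after_dequeue=50, seed=0))
```

Because only the first 100 + 10*k examples have been read when batch k (counting from 0) is drawn, an example can never appear much earlier than its position in the input. This is the sense in which the shuffle is local rather than uniform over the whole dataset.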