To randomly shuffle the examples, you can use the tf.train.shuffle_batch function instead of tf.train.batch, as follows:

parsed_batch = tf.train.shuffle_batch(
    [serialized_example],
    batch_size=100,
    capacity=1000,
    min_after_dequeue=200)
tf.train.shuffle_batch (as well as tf.train.batch) creates a queue (a tf.RandomShuffleQueue in the shuffled case) and keeps adding serialized examples to it.
capacity limits how many elements can be stored in the queue at one time. A bigger capacity leads to higher memory usage, but reduces the latency caused by threads waiting for the queue to fill up.
min_after_dequeue is the minimum number of elements that must remain in the queue after elements are taken from it. The shuffle_batch queue does not shuffle elements perfectly uniformly - it is designed with huge datasets that do not fit in memory in mind. Instead, it buffers up to capacity elements in memory and randomly chooses a batch from among them. Afterwards it enqueues more elements, to keep the number of buffered elements between min_after_dequeue and capacity. Thus, the bigger the value of min_after_dequeue, the more random the elements are - each choice of batch_size elements is guaranteed to be drawn from at least min_after_dequeue consecutive elements - but the bigger capacity has to be, and the longer it takes to fill the queue initially.
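The buffering mechanism described above can be sketched in plain Python. This is a simplified, hypothetical model of the partial shuffle - the real work happens inside TensorFlow's queue runners - but it shows why randomness is bounded by the buffer size:

```python
import random

def shuffle_batches(stream, batch_size, capacity, min_after_dequeue, seed=0):
    """Simplified model of tf.train.shuffle_batch's partial shuffling:
    keep up to `capacity` elements buffered in memory and draw each
    batch uniformly at random from that buffer only."""
    # Ensures at least min_after_dequeue elements remain buffered
    # after each batch is taken (while the input is not exhausted).
    assert capacity >= min_after_dequeue + batch_size
    rng = random.Random(seed)
    it = iter(stream)
    buf = []
    exhausted = False
    while True:
        # Refill the buffer up to `capacity` elements.
        while not exhausted and len(buf) < capacity:
            try:
                buf.append(next(it))
            except StopIteration:
                exhausted = True
        if len(buf) < batch_size:
            break  # not enough elements left for a full batch
        # Each batch is sampled only from the buffered elements, so the
        # shuffle is only as random as the buffer is large.
        yield [buf.pop(rng.randrange(len(buf))) for _ in range(batch_size)]
```

Running this over range(10) with batch_size=2, capacity=5, min_after_dequeue=3 yields five batches whose elements together cover 0..9 exactly once, but each element can only move a bounded distance from its original position - the same trade-off that min_after_dequeue controls in tf.train.shuffle_batch.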