The most common mode of using TensorFlow involves first building a dataflow graph of TensorFlow operators (like tf.constant() and tf.matmul()), then running steps by calling the tf.Session.run() method in a loop (e.g. a training loop).
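For instance, a minimal sketch of this pattern (the tensor names and values here are illustrative, not part of any particular program) might look like:

```python
import tensorflow as tf

# Build the dataflow graph once, up front.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.constant([[0.5], [0.5]])
y = tf.matmul(x, w)

# Then run steps in a loop against the fixed graph.
with tf.Session() as sess:
    for _ in range(10):
        print(sess.run(y))
```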
A common source of memory leaks is a training loop that contains calls that add nodes to the graph; because these calls run in every iteration, the graph grows without bound. Such calls may be obvious (e.g. a call to a TensorFlow operator like tf.square()), implicit (e.g. a call to a TensorFlow library function that creates operators, like tf.train.Saver()), or subtle (e.g. a call to an overloaded operator on a tf.Tensor and a NumPy array, which implicitly calls tf.convert_to_tensor() and adds a new tf.constant() to the graph).
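As a concrete illustration of the subtle case, here is a sketch (assuming TF 1.x graph mode; the placeholder and array names are made up) in which an overloaded operator silently grows the graph on every iteration:

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[4])
weights = np.ones([4], dtype=np.float32)  # a NumPy array, not a tf.Tensor

with tf.Session() as sess:
    for _ in range(1000):
        # `x * weights` calls tf.convert_to_tensor() on `weights`, adding a new
        # tf.constant() and a new multiplication node on every iteration.
        scaled = x * weights
        sess.run(scaled, feed_dict={x: np.zeros([4], dtype=np.float32)})
```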
The tf.Graph.finalize() method can help to catch leaks like this: it marks a graph as read-only, and raises an exception if anything is added to the graph. For example:
```python
loss = ...
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    sess.graph.finalize()  # Graph is read-only after this statement.
    for _ in range(1000000):
        sess.run(train_op)
        loss_sq = tf.square(loss)  # Exception will be thrown here.
        sess.run(loss_sq)
```
Similarly, in the following case the overloaded * operator attempts to add new nodes to the graph:
```python
loss = ...
# ...
with tf.Session() as sess:
    # ...
    sess.graph.finalize()  # Graph is read-only after this statement.
    # ...
    dbl_loss = loss * 2.0  # Exception will be thrown here.
```
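One common remedy (a sketch, not the only possible fix) is to construct any extra nodes once, before finalizing the graph, so the loop only runs existing operations:

```python
loss = ...
dbl_loss = loss * 2.0  # Built once, before the graph is finalized.

with tf.Session() as sess:
    # ...
    sess.graph.finalize()  # Graph is read-only after this statement.
    for _ in range(1000000):
        sess.run(dbl_loss)  # Running an existing node does not grow the graph.
```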