How do I interpret the TensorFlow output for building and executing computational graphs on GPGPUs?

Given the following command that executes an arbitrary tensorflow script using the python API.

python3 tensorflow_test.py > out

The first part stream_executor seems like its loading dependencies.

I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally

What is a NUMA node?

I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I assume this is when it finds the available GPU

I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:01:00.0
Total memory: 11.25GiB
Free memory: 11.15GiB

Some gpu initialization? what is DMA?

I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:01:00.0)

Why does it throw an error E?

E tensorflow/stream_executor/cuda/cuda_driver.cc:932] failed to allocate 11.15G (11976531968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

Great answer to what the pool_allocator does: https://stackoverflow.com/a/35166985/4233809

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 3160 get requests, put_count=2958 evicted_count=1000 eviction_rate=0.338066 and unsatisfied allocation rate=0.412025
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1743 get requests, put_count=1970 evicted_count=1000 eviction_rate=0.507614 and unsatisfied allocation rate=0.456684
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 256 to 281
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 1986 get requests, put_count=2519 evicted_count=1000 eviction_rate=0.396983 and unsatisfied allocation rate=0.264854
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 655 to 720
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 28728 get requests, put_count=28680 evicted_count=1000 eviction_rate=0.0348675 and unsatisfied allocation rate=0.0418407
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 1694 to 1863

About NUMA — https://software.intel.com/en-us/articles/optimizing-applications-for-numa

Roughly speaking, if you have dual-socket CPU, they will each have their own memory and have to access the other processor’s memory through a slower QPI link. So each CPU+memory is a NUMA node.

Potentially you could treat two different NUMA nodes as two different devices and structure your network to optimize for different within-node/between-node bandwidth

However, I don’t think there’s enough wiring in TF right now to do this right now. The detection doesn’t work either — I just tried on a machine with 2 NUMA nodes, and it still printed the same message and initialized to 1 NUMA node.

DMA = Direct Memory Access. You could potentially copy things from one GPU to another GPU without utilizing CPU (ie, through NVlink). NVLink integration isn’t there yet.

As far as the error, TensorFlow tries to allocate close to GPU max memory so it sounds like some of your GPU memory is already been allocated to something else and the allocation failed.

You can do something like below to avoid allocating so much memory

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction=0.3 # don't hog all vRAM
config.operation_timeout_in_ms=15000   # terminate on long hangs
sess = tf.InteractiveSession("", config=config)

  • successfully opened CUDA library xxx locally means that the library was loaded, but it does not meant that it will be used.
  • successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero means that your kernel does not have NUMA support. You can read about NUMA here and here.
  • Found device 0 with properties: you have 1 GPU which you can use. It lists the properties of this GPU.
  • DMA is direct memory access. More information on Wikipedia.
  • failed to allocate 11.15G the error clearly explains why this happened, but it is hard to tell why do you need so much memory without looking at the code.
  • pool allocator messages are explained in this answer