Does TensorFlow use all of the hardware on the GPU?


The NVidia GP100 has 30 TPC circuits and 240 "texture units". Do the TPCs and texture units get used by TensorFlow, or are these bits of silicon simply dead weight for machine learning?

I am looking at GPU-Z and Windows 10's built-in GPU performance monitor during a running neural-net training session, and I see that various hardware functions are underutilized. TensorFlow uses CUDA, and CUDA, I presume, has access to all hardware components. If I know where the gap is (between TensorFlow and the underlying CUDA) and whether it is material (how much silicon is wasted), I can, for example, remediate it by forking TensorFlow, modifying it, and submitting a pull request.

For example, the answer below discusses texture objects, which are accessible from CUDA. NVidia notes that these can be used to speed up latency-sensitive, short-running kernels. If I google "TextureObject tensorflow" I don't get any hits, so I can tentatively assume, barring evidence to the contrary, that TensorFlow is not taking advantage of texture objects.
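For concreteness, here is roughly what the texture-object path looks like from CUDA. This is just a minimal sketch I put together to show the API in question, not anything TensorFlow is known to do; the kernel and buffer names are made up:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Read a plain device buffer through a texture object (CUDA's texture-object API).
    __global__ void scaleViaTexture(float* out, cudaTextureObject_t tex, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = s * tex1Dfetch<float>(tex, i);  // this fetch goes through the texture path
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        // Describe the existing linear buffer as a texture resource.
        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypeLinear;
        resDesc.res.linear.devPtr = d_in;
        resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
        resDesc.res.linear.sizeInBytes = n * sizeof(float);

        cudaTextureDesc texDesc = {};
        texDesc.readMode = cudaReadModeElementType;  // return raw floats, no conversion

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

        scaleViaTexture<<<(n + 255) / 256, 256>>>(d_out, tex, n, 2.0f);
        cudaDeviceSynchronize();

        cudaDestroyTextureObject(tex);
        cudaFree(d_in);
        cudaFree(d_out);
        printf("done\n");
        return 0;
    }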

NVidia markets GPGPUs for neural-net training. So far it seems they have adopted a dual-use strategy for their circuits, which means leaving in circuits that are not used for machine learning. This raises the question of whether a pure TensorFlow circuit would be more efficient. Google is now promoting TPUs for this reason. The jury is out on whether TPUs are actually cheaper for TensorFlow than NVidia GPUs, and NVidia is challenging Google's price/performance claims.

 


None of those things are separate pieces of hardware that can be individually addressed in CUDA. Read this passage on page 10 of your document:

Each GPC inside GP100 has ten SMs. Each SM has 64 CUDA Cores and four texture units. With 60 SMs, GP100 has a total of 3840 single precision CUDA Cores and 240 texture units. Each memory controller is attached to 512 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory controllers. The full GPU includes a total of 4096 KB of L2 cache.

And if we read just above that:

GP100 was built to be the highest performing parallel computing processor in the world to address the needs of the GPU accelerated computing markets serviced by our Tesla P100 accelerator platform. Like previous Tesla-class GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. A full GP100 consists of six GPCs, 60 Pascal SMs, 30 TPCs (each including two SMs), and eight 512-bit memory controllers (4096 bits total).

and take a look at the diagram, we see the following:

[Diagram: GP100 full-chip block diagram, showing GPCs that contain TPCs, each TPC containing two SMs]

So not only are the GPCs and SMs not separate pieces of hardware, but even the TPCs are just another way to organize the hardware architecture and come up with a fancy marketing name. You can clearly see in the diagram that a TPC doesn't add anything new; it just looks like a container for the SMs. It's [1 GPC]:[5 TPCs]:[10 SMs].
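CUDA itself reflects this: the only one of these units the runtime API ever reports is the SM; there is no handle for a TPC or a GPC. A quick sketch (the per-SM figures are hard-coded from the GP100 whitepaper, and note that a shipping Tesla P100 enables 56 of the 60 SMs, so expect 3584 cores there rather than 3840):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // CUDA only reports SMs; GPCs and TPCs never appear in the API.
        printf("SMs (multiProcessorCount): %d\n", prop.multiProcessorCount);

        // GP100: 64 FP32 CUDA cores and 4 texture units per SM (from the whitepaper).
        printf("approx FP32 CUDA cores: %d\n", prop.multiProcessorCount * 64);
        printf("approx texture units:   %d\n", prop.multiProcessorCount * 4);
        return 0;
    }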

The memory controllers are something all hardware is going to have in order to interface with RAM; it just happens that more memory controllers enable higher bandwidth. See this diagram:

[Diagram: "High Bandwidth Memory" (HBM2) compared with GDDR5]

where "High bandwidth memory" refers to HBM2 a type of video memory like GDDR5, in other words, video RAM. This isn't something you would directly address in software with CUDA any more than you would do so with X86 desktop machines.

So in reality we only have SMs here, not TPCs and GPCs. To answer your question: since TensorFlow takes advantage of CUDA, presumably it is going to use all the available hardware it can.

EDIT: The poster edited their question into an entirely different question, and it contains new misconceptions, so here is the answer to that:

Texture Processing Clusters (TPCs) and texture units are not the same thing. TPCs appear to be merely a grouping of Streaming Multiprocessors (SMs) with a bit of marketing magic thrown in.

"Texture unit" is not a concrete term, and the features differ from GPU to GPU, but basically you can think of one as ready access to texture memory, which exploits spatial locality (as opposed to the L1/L2 caches, which exploit temporal locality), combined with some fixed-function hardware. The fixed functionality may include interpolation on fetch (often at least linear interpolation), different coordinate and addressing modes, mipmapping control, and anisotropic texture filtering. See the texture section of the CUDA 9.0 Programming Guide to get an idea of texture unit functionality and what you can control with CUDA; a short sketch of those controls follows below. On the diagram we can see the texture units at the bottom.

[Diagram: GP100 SM block diagram, with the texture units (Tex) along the bottom]

Clearly these are completely different from the TPCs shown in the first picture I posted, which at least according to the diagram have no extra functionality associated with them and are merely a container for two SMs.
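Here is the sketch referred to above: the fixed-function knobs (filtering, addressing mode, normalized coordinates) are just fields you set in a cudaTextureDesc when the data lives in a cudaArray. The array size and contents here are placeholders:

    #include <cuda_runtime.h>

    // A texture object over a cudaArray, exercising the fixed-function
    // filtering and addressing controls discussed above.
    cudaTextureObject_t makeFilteredTexture(cudaArray_t array)
    {
        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypeArray;
        resDesc.res.array.array = array;

        cudaTextureDesc texDesc = {};
        texDesc.addressMode[0] = cudaAddressModeClamp;   // coordinate / wrap behavior
        texDesc.addressMode[1] = cudaAddressModeClamp;
        texDesc.filterMode = cudaFilterModeLinear;       // hardware bilinear interpolation
        texDesc.readMode = cudaReadModeElementType;
        texDesc.normalizedCoords = 1;                    // address with [0,1) coordinates
        // Mipmapping and anisotropy have their own fields (mipmapFilterMode,
        // maxAnisotropy), which only apply to mipmapped arrays.

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
        return tex;  // a kernel would then sample with tex2D<float>(tex, u, v)
    }

    int main()
    {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        cudaArray_t array;
        cudaMallocArray(&array, &desc, 256, 256);  // placeholder 256x256 float array

        cudaTextureObject_t tex = makeFilteredTexture(array);

        cudaDestroyTextureObject(tex);
        cudaFreeArray(array);
        return 0;
    }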

Now, despite the fact that you can address texture functionality within CUDA, you often don't need to. The texture units' fixed-function features are not all that useful for neural nets; however, the spatially coherent texture memory is often used automatically by CUDA as an optimization even if you don't explicitly try to access it. In this way, TensorFlow still would not be "wasting" silicon.
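One concrete form of that automatic use, as I understand it: on Kepler and later (compute capability 3.5+) the texture cache doubles as the read-only data cache, and the compiler can route eligible global loads through it when pointers are marked const __restrict__, or you can force that path with __ldg(). No texture object is involved at all:

    #include <cuda_runtime.h>

    // The const __restrict__ qualifiers tell the compiler the input is read-only
    // and non-aliased, so it may issue these loads through the read-only
    // (texture) data cache; __ldg() forces that path explicitly (CC 3.5+).
    __global__ void saxpy(int n, float a,
                          const float* __restrict__ x,
                          float* __restrict__ y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * __ldg(&x[i]) + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }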
