
Nvidia & DirectX 11


Better late than never is the adage. We say it had better be better. NVIDIA’s DX11 part gets dissected.

Round one of the DX11 battle is over: AMD has won, and NVIDIA goes home without even a runner-up trophy. Writing off AMD as “all washed up” was foolhardy, and this, coupled with NVIDIA’s 40nm problems, spelt disaster for the company during DX11’s infancy. AMD has sold many DX11 parts, thereby restoring its coffers in the absence of competition.

NVIDIA claims its 40nm woes “are history”, and our monolithic protagonist is in production as you read this. The GTX 480 and GTX 470, both based on the GF100, have been announced. But Fermi’s adversary has a running start, and DX11 has reached its proving point, with people already tapping their feet impatiently. There is a lot riding on this one, despite the apparent air of nonchalance NVIDIA manages to exude.

The GF100 exposed

Fermi is huge: at 3 billion transistors, it has more than twice the GT200’s (GTX 280) count of 1.4 billion. With 512 stream processors (SPs) and a 384-bit GDDR5 interface, Fermi is most impressive on paper, much like our cricket team! But is it more than two GT200s on a single die?

The GF100 architecture in detail

Fermi is scalable. Thanks to the decoupling of functional units on the GPU, lower-end chips will have the same logic as top-end ones. The 512 SPs are divided among 16 Streaming Multiprocessors (SMs), which works out to 32 SPs per SM. Furthermore, each cluster of four SMs makes up one Graphics Processing Cluster (GPC), for four GPCs in all.
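The hierarchy above is easy to sanity-check with a few lines of Python. This is only a sketch of the arithmetic: the `ChipLayout` class and the cut-down `smaller` configuration are our own illustrations, not anything NVIDIA has announced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipLayout:
    gpcs: int          # Graphics Processing Clusters on the die
    sms_per_gpc: int   # Streaming Multiprocessors in each GPC
    sps_per_sm: int    # stream processors in each SM

    @property
    def sms(self) -> int:
        return self.gpcs * self.sms_per_gpc

    @property
    def sps(self) -> int:
        return self.sms * self.sps_per_sm


# Full GF100 as described: 16 SMs in clusters of four, 512 SPs in total.
gf100 = ChipLayout(gpcs=4, sms_per_gpc=4, sps_per_sm=32)

# Hypothetical cut-down part: the same building blocks, just fewer of them.
smaller = ChipLayout(gpcs=2, sms_per_gpc=4, sps_per_sm=32)

print(gf100.sms, gf100.sps)      # 16 512
print(smaller.sms, smaller.sps)  # 8 256
```

The point of the modular design is visible here: a smaller chip is described by changing one count, not by redesigning the blocks.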

The Streaming Multiprocessor

The Host Interface is the interconnect between the CPU and the GPU, while the GigaThread Engine, which is basically the scheduler, is responsible for fetching and copying data from system RAM to video RAM via the six memory controllers. The GigaThread Engine also handles redistribution of any workload. Each memory controller is 64 bits wide, which is how NVIDIA arrives at the figure of a “384-bit” wide interface.

Blocks of threaded instructions are created within the GigaThread Engine and sent off to the SMs. Each SM then breaks a block down and feeds it, 32 threads at a time, to its SPs, with each SP getting one thread. Remember that each SP can execute pixel, vertex, geometry and compute instructions.

The large block of common L2 cache (768 KB) handles load, store and texture operations. The 48 ROP units are divided into six clusters of eight units each, and each cluster gets its very own 64-bit memory controller. As you can see, bus bandwidth and the ROP units are very important to NVIDIA, which is why these sit very close to the L2 cache and memory controllers. We’re also told that scaling on these units is linked, meaning that changing the clocks of one will affect the others.

On the GT200 and previous-generation GPUs, SMs and Texture Units were grouped together and collectively referred to as Texture Processing Clusters (TPCs), whereas on the GF100 each SM has its own four texture units (shaded in dark blue), which is more of the same scalable concept.

The SP has a new trick: it can perform a multiply followed by an add, previously a two-step operation, in one step with a new instruction called FMA, or fused multiply-add. The result is rounded only once, at the end, so no precision is lost to an intermediate rounding, and processing resources are saved as well. Win-win. There are also four SFUs, or Special Function Units, within each SM that handle transcendental operations such as sine, cosine, square root and reciprocal.
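To see why rounding once matters for the fused multiply-add, here is a small Python sketch. It does not use any GPU instruction. It emulates the fused behaviour by computing the exact result with rational arithmetic and rounding a single time, and it works in double precision for convenience. The function names and input values are ours, chosen so that the difference shows up.

```python
from fractions import Fraction


def multiply_add_two_roundings(a: float, b: float, c: float) -> float:
    # The product is rounded to a double first, then the sum is rounded again.
    return a * b + c


def multiply_add_fused(a: float, b: float, c: float) -> float:
    # Emulated FMA: compute a*b + c exactly with rationals, then round once.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the separate multiply and add lose the answer entirely.
print(multiply_add_two_roundings(a, b, c))  # 0.0
print(multiply_add_fused(a, b, c))          # -2**-60, about -8.67e-19
```

With two roundings the tiny term is thrown away before the addition ever sees it. With a single rounding at the end it survives.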

The improved cache on the GF100 is another point NVIDIA has elaborated on, and the most apparent benefit comes from the much larger 768 KB unified L2 cache. A unified cache is better utilized and wastes fewer resources than separate caches, and is therefore more efficient. Incidentally, all processing on the GF100 is IEEE 754-2008 compliant, which simplifies developers’ lives as well as keeps them honest. Additionally, double precision floating point performance has gone up by 400 per cent compared to the GT200.

The emphasis on geometry: but why?

DX11 is a lot more rigid than DX10 when it comes to rendering techniques and features. Both GPU makers therefore have more of a level playing field than ever, since neither can add extra features to its GPU without the risk of them going unused and wasting precious die space. DX11 makes Tessellation a mandatory feature, and NVIDIA says it has fully exploited this, whereas rival AMD, it claims, has only scratched the surface with its RV870.

Thanks to its geometry prowess, the performance drop with Tessellation enabled should be minimal on the GF100. This means complex and lifelike characters, objects, environments and scenery with more fine detail than the competition, using the same game assets. Environments should look a whole lot better, interaction with them will be more of an experience, and deformation of objects will become more realistic. Tessellation needs oodles of spare geometry power, since the triangle density of a given frame can scale up to astronomic proportions, putting a sudden, enormous load on the geometry subsystem. It is this feature that has dictated the reworking of the Raster Engine and the creation of the PolyMorph Engine.

The emphasis on geometry: how?

The PolyMorph Engine is an improved version of the Geometry Controller on the GT200. Unlike the GT200 and its predecessors, each SM gets its own PolyMorph Engine. Previously, each TPC was made up of three SMs and had a single Geometry Controller. So from sitting inside a pipeline, the PolyMorph Engine becomes modular, and all 16 of them can intercommunicate and work cohesively, doing more work and doing it more efficiently. The PolyMorph Engine has five stages as shown below. The result of each stage is passed to the connected SM, where it is processed, and the SM’s output is passed to the next stage in the PolyMorph Engine.

The Raster Engine comprises hardware that is basically responsible for creating screen pixels from the geometry data received from the SPs and the PolyMorph Engine. This data is then passed back to the SPs for pixel shading. The GF100 has four Raster Engines, one for each GPC. The Raster Engine is unchanged from the GT200 as far as we know.
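For readers who want a feel for what “creating screen pixels from geometry” involves, here is a minimal software sketch in Python. It uses the textbook edge-function test and is in no way a description of NVIDIA’s hardware. The function names, the 8x8 grid and the triangle are our own, and real rasterizers add fill rules and many optimizations that are left out here.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: positive when (px, py) lies to the left of edge a->b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(tri, width, height):
    """Return the set of (x, y) pixels whose centers fall inside a CCW triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    covered = set()
    for y in range(height):
        for x in range(width):
            cx, cy = x + 0.5, y + 0.5  # sample at the pixel center
            inside = (edge(x0, y0, x1, y1, cx, cy) >= 0 and
                      edge(x1, y1, x2, y2, cx, cy) >= 0 and
                      edge(x2, y2, x0, y0, cx, cy) >= 0)
            if inside:
                covered.add((x, y))
    return covered


triangle = ((0.0, 0.0), (8.0, 0.0), (0.0, 8.0))  # counter-clockwise vertices
pixels = rasterize(triangle, 8, 8)

# Draw the covered pixels, top row first.
for y in reversed(range(8)):
    print("".join("#" if (x, y) in pixels else "." for x in range(8)))
print(len(pixels))  # 36
```

Every covered pixel would then go back to the SPs for shading, which is why dense tessellated geometry puts pressure on this stage too.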

These changes give the GF100 a massive boost in geometry crunching: 8x the power of the GT200. Geometry performance had stagnated until now, from neglect more than anything; the GT200 was only 3x faster than the NV30 (FX 5800) in this regard.

Tessellation + Displacement Mapping = visual realism

Tessellation along with displacement mapping adds additional detail to the mesh

Neither of these techniques is new, and both are used extensively in the film industry for animation. When they are used together, developers have precise control over the level of geometric detail, which is very important. Animation is done on a compact description of the model, and this is scaled up to whatever quality level is appropriate. This also brings savings in memory usage as well as bandwidth consumption. In the example below, courtesy Kenneth Scott, id Software, we see a general outline of the character on the left; note the very limited geometry, due to which the object looks rough. Using Tessellation, we get the smooth, facet-free result in the center, but it is no more detailed than the object on the left. The object on the right is the result of applying a displacement map to the smoothly tessellated object in the middle, and it looks much more realistic.
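The two steps can be sketched in a few lines of Python. This is a toy, not the DX11 tessellator: it uses plain midpoint subdivision on a single flat triangle, and the `height` function stands in for a displacement map texture. All names and values here are our own assumptions.

```python
import math


def midpoint(a, b):
    return tuple((x + y) / 2.0 for x, y in zip(a, b))


def tessellate(triangles, levels):
    """Split every triangle into four by joining edge midpoints, `levels` times."""
    for _ in range(levels):
        finer = []
        for a, b, c in triangles:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            finer += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
        triangles = finer
    return triangles


def height(x, y):
    """Hypothetical displacement map: a procedural bump pattern, not a real texture."""
    return 0.1 * math.sin(4 * math.pi * x) * math.sin(4 * math.pi * y)


def displace(triangles):
    """Push each vertex along the surface normal, which is +z for this flat patch."""
    return [tuple((x, y, z + height(x, y)) for x, y, z in tri) for tri in triangles]


coarse = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))]
smooth = tessellate(coarse, 3)   # more triangles, but still perfectly flat
detailed = displace(smooth)      # same triangle count, now with surface detail

print(len(coarse), len(smooth), len(detailed))  # 1 64 64
```

Each subdivision level multiplies the triangle count by four, which is where the sudden geometry load comes from. Only the single coarse triangle and the height function need to be stored.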

Tessellation used on a larger scale

Better AA

On the GF100, enabling 8x MSAA (Multi Sample Anti-Aliasing) no longer incurs the performance hit it used to be associated with. CSAA (Coverage Sample Anti-Aliasing), which is the frugal man’s AA technique, has also been tweaked and now supports up to 32 samples. Transparency Multi-Sample Anti-Aliasing incurs less of a performance hit on the GF100 too, and thanks to the extra samples the overall effect is a little closer to realistic.
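Why more samples give smoother edges can be shown with a small Python sketch. It only illustrates the coverage idea behind multi-sampling. The sloped edge, the grid-shaped sample pattern and the 16-sample count are assumptions of ours and do not reflect NVIDIA’s actual sample layouts.

```python
def inside(px, py):
    """Hypothetical geometry: everything below the line y = 0.5*x + 0.3 is covered."""
    return py < 0.5 * px + 0.3


def sample_offsets(n):
    """Assumed sample pattern: an n x n grid inside the pixel (not a real GPU layout)."""
    step = 1.0 / n
    return [((i + 0.5) * step, (j + 0.5) * step) for j in range(n) for i in range(n)]


def coverage(pixel_x, pixel_y, offsets):
    """Fraction of a pixel's sample points that land inside the geometry."""
    hits = sum(inside(pixel_x + ox, pixel_y + oy) for ox, oy in offsets)
    return hits / len(offsets)


one_sample = sample_offsets(1)    # no AA: a single sample at the pixel center
many_samples = sample_offsets(4)  # 16 samples per pixel for this sketch

# Walk along one row of pixels that the edge passes through.
for x in range(4):
    print(x, coverage(x, 1, one_sample), coverage(x, 1, many_samples))
```

With one sample each pixel is either fully on or fully off, which is the staircase you see on screen. With 16 samples the row fades through 0.0, 0.0625, 0.5 and 0.9375, so the edge blends into the background.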

More gamer-goodies

NVIDIA has showcased 3D Vision Surround, which involves their 3D Vision and up to three monitors for an immersive experience. 3D Vision with Batman: Arkham Asylum was really fun and immersive – the action really seemed to jump out at you. Three monitors with such a 3D effect would be even more amazing. AMD gets kudos for marketing the multiple monitor concept first. Their technology is called Eyefinity, where 2, 3, and up to 6 monitors can be used for a truly larger than life (literally) gaming experience. But NVIDIA has added the 3D angle to this, going a very positive step further.

ECC: For Tesla, with love?

Another much talked about feature is ECC, or Error Correction Code. While the RV870 (Radeon HD 5870) can detect errors on the memory bus, it cannot correct them. The GF100 sees ECC support on both the L1 and L2 caches as well as the register file. Obviously this feature is included to keep the GF100’s chances as a GPGPU alive.

ECC is mandatory for simulations and large-scale floating point calculations, meaning nobody would touch the GF100 for Tesla-like functions unless ECC was a part of the package. ECC does nothing for games, and one pitfall is that the memory system could behave weirdly when overclocked. We’re told NVIDIA might disable this feature on the top-end parts.
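The difference between detecting an error and correcting it is easy to show in Python. The sketch below compares a single parity bit with the classic Hamming(7,4) code. Both are textbook schemes chosen for brevity; we are not claiming that either GPU uses them, and real memory ECC works on much wider words.

```python
def parity_encode(bits):
    """Detect-only scheme: append one parity bit to the data bits."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True when the word still has even parity, that is, no error was detected."""
    return sum(word) % 2 == 0


def hamming_encode(d):
    """Hamming(7,4): 4 data bits -> 7-bit word laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming_correct(word):
    """Locate and flip a single corrupted bit, then return the 4 data bits."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]   # checks positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]   # checks positions 2, 3, 6, 7
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]   # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # 0 means no error, else the 1-based position
    if syndrome:
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]]


data = [1, 0, 1, 1]

damaged_parity = parity_encode(data)
damaged_parity[1] ^= 1               # flip one bit in transit
print(parity_ok(damaged_parity))     # False: the error is seen but cannot be located

damaged_hamming = hamming_encode(data)
damaged_hamming[4] ^= 1              # flip one bit in transit
print(hamming_correct(damaged_hamming) == data)  # True: the error is repaired
```

Parity can only say that something went wrong, so the data has to be fetched again. The Hamming code pinpoints the flipped bit and repairs it, at the cost of three extra bits for every four bits of data in this toy version.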

So?

We’re postponing judgment till we get a sample card. However, the GF100 seems like a well thought-out product. The G80 was path-breaking and a tough act to follow, but it seems NVIDIA has quite a few innovations on Fermi’s die, enough to call it a new architecture and not a refresh with more processing power. AMD has a firm grip on the market, and we’ve seen historically how difficult it is to unseat an established product.


AMD has proved that power is as likely to be misused in the hands of an underdog as in those of the top dog, and the ironclad prices of its HD 5xxx series cards testify to this. It’s called skimming the cream in marketing terms, and everybody’s guilty of it at some point, which underlines how important competition is. We’re hoping to see a good fight.

Jargon
  1. Out Of Order Execution (OoO): In the quest for greater parallelism, a processor executes instructions as soon as their inputs are ready rather than strictly in program order, so independent work is not held up by a stalled instruction. This is called Out of Order Execution.
  2. ROP: acronym for Raster Operations Pipeline, this is one of the final stages of the rendering process that involves taking pixel and texture information and processing it into a final pixel.
  3. Tessellation: The process of subdividing a coarse mesh into many smaller triangles on the GPU to add geometric detail. Because only the compact base mesh has to be stored and animated, this also reduces the memory footprint.
  4. Displacement Mapping: A displacement map is a texture that denotes height information in a scene. In a 3D model, the displacement map is used to alter the relative position of vertices.