Apr 14, 2004 14:29
Here are some clips:
Pixel Shader 3.0 (PS3.0) allows shader programs of over 65,000 instructions and includes dynamic flow control (branching). This revision also requires that compliant hardware offer four Multiple Render Targets (MRTs, which allow a shader to write to more than one location in memory at a time), full 32-bit floating-point precision, shader antialiasing, and a total of ten texture coordinate inputs per pixel.
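For anyone who hasn't run into MRTs or per-pixel branching before, here is a rough CPU-side toy model of the idea. It is not real shader code (PS3.0 shaders are written in HLSL or shader assembly). The function names, the four outputs, and the depth threshold are all hypothetical. One "shader" call per pixel picks a branch from that pixel's own data and writes a value to each of four render targets in the same pass.

```python
from typing import Callable, List, Optional, Tuple

# Conceptual CPU model only; every name here is hypothetical.
Color = Tuple[float, float, float, float]
NUM_RENDER_TARGETS = 4  # PS3.0 hardware must support 4 simultaneous MRTs


def toy_pixel_shader(x: int, y: int, depth: float) -> Tuple[Color, ...]:
    """Return one output value per render target for a single pixel."""
    # Dynamic flow control: the branch is chosen per pixel from runtime data.
    if depth > 0.5:
        albedo = (1.0, 0.0, 0.0, 1.0)  # far pixels: red
    else:
        albedo = (0.0, 0.0, 1.0, 1.0)  # near pixels: blue
    normal = (0.0, 0.0, 1.0, 0.0)
    position = (float(x), float(y), depth, 1.0)
    depth_out = (depth, depth, depth, 1.0)
    return albedo, normal, position, depth_out


def render(width: int, height: int,
           depth_at: Callable[[int, int], float]) -> List[List[List[Optional[Color]]]]:
    """Run the shader once per pixel and scatter its outputs to all targets."""
    targets: List[List[List[Optional[Color]]]] = [
        [[None] * width for _ in range(height)] for _ in range(NUM_RENDER_TARGETS)
    ]
    for y in range(height):
        for x in range(width):
            outputs = toy_pixel_shader(x, y, depth_at(x, y))
            # One shader invocation writes to every render target in the same pass.
            for target, value in zip(targets, outputs):
                target[y][x] = value
    return targets


targets = render(2, 1, lambda x, y: 0.25 + 0.5 * x)
```

Without MRTs, the same four outputs would need four separate rendering passes over the scene.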
As we can see, somewhere around 16 chips fit horizontally on the wafer, while they can squeeze in about 18 chips vertically. We know that NVIDIA uses a 130nm IBM process on 300mm wafers. We also know that the P4 EE is in the neighborhood of 250mm² in size. Doing the math puts the NV40 GPU somewhere between 270mm² and 305mm². A closer estimate is difficult because we don't know how much space separates the chips on the wafer (which also makes it hard to estimate waste per wafer).
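Here is one rough way to see where a bound like that comes from. This is not necessarily how the review did its math. It assumes the counted row and column of dies each span the full 300mm diameter, and the inter-die gap widths are hypothetical. With no gap you get an upper bound, and any gap pulls the estimate down.

```python
import math

WAFER_DIAMETER_MM = 300.0  # from the text: 300mm wafers


def die_area_mm2(dies_across: int, dies_down: int, spacing_mm: float = 0.0,
                 diameter_mm: float = WAFER_DIAMETER_MM) -> float:
    """Estimate die area from die counts along the wafer's diameter.

    Assumes the counted row and column each span the full diameter and that
    every die pitch includes `spacing_mm` of unknown inter-die gap.
    """
    width = diameter_mm / dies_across - spacing_mm
    height = diameter_mm / dies_down - spacing_mm
    return width * height


upper_bound = die_area_mm2(16, 18)  # no gap between dies
for gap in (0.0, 0.5, 1.0):         # hypothetical gap widths in mm
    print(f"gap {gap} mm -> {die_area_mm2(16, 18, gap):.0f} mm^2")
```

The zero-gap case works out to 18.75mm × 16.67mm ≈ 312mm². A gap of half a millimeter to a millimeter lands inside the 270 to 305mm² range, which is why the unknown spacing dominates the uncertainty.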
There are a couple of factors at work here. First, obviously, the card needs a good amount of power. Second, power supplies generally partition the power they deliver. If you look on the side of a power supply, you'll see a list of voltage rails and amperages. The wattage rating on a power supply usually indicates (for marketing purposes) the maximum wattage it could supply if the maximum allowed current were drawn on every rail at once. It is not possible to draw all 350 watts of a 350-watt power supply across one connection (or even one rail). NVIDIA indicated that their card needs a stable 12-volt rail, but that power supplies generally devote a large portion of their 12-volt amperage to the motherboard (since the motherboard draws the most power in the system on all rails).
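The rail arithmetic is easier to see with numbers. This minimal sketch uses made-up label values chosen to add up to a "350 W" unit, not figures from any real power supply. Each rail can deliver only its own volts × amps, so no single rail comes close to the headline number. Whatever the motherboard already pulls from the 12 V rail comes out of what is left for a graphics card.

```python
import math
from dataclasses import dataclass


@dataclass
class Rail:
    volts: float
    max_amps: float

    @property
    def max_watts(self) -> float:
        # P = V * I at the rail's rated maximum current
        return self.volts * self.max_amps


# Hypothetical label values for a "350 W" supply; not taken from a real unit.
rails = {
    "+3.3V": Rail(3.3, 20.0),
    "+5V": Rail(5.0, 28.0),
    "+12V": Rail(12.0, 12.0),
}

# The marketing rating: every rail at its maximum current simultaneously.
label_watts = sum(r.max_watts for r in rails.values())


def remaining_watts(rail: Rail, amps_in_use: float) -> float:
    """Power still available on one rail after other loads draw their share."""
    return rail.volts * max(rail.max_amps - amps_in_use, 0.0)
```

In this made-up example the 12 V rail tops out at 144 W out of the 350 W total. If the motherboard were drawing a hypothetical 8 A from it, `remaining_watts(rails["+12V"], 8.0)` leaves just 48 W for everything else on that rail.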
There are 50% more vertex shader units, bringing the total to six, and NV40 has four times as many pixel pipelines (16 units). The chip was already large, so it's not surprising that NVIDIA only doubled the number of texture units, from 8 to 16, making this architecture 16x1 (whereas NV3x was 4x2). For multitexture situations, the architecture can handle 8x2 rendering by using all 16 pixel shader units. In effect, pixel shader throughput for multitextured situations is doubled, while single-textured pixel throughput is quadrupled. Of course, this doesn't mean performance is always doubled or quadrupled, just that those are the upper bounds on the theoretical maximum pixels per clock.
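The "doubled vs. quadrupled" figures fall out of a simple bound. Peak pixels per clock is limited either by the number of pipes or by the total texture units divided by textures per pixel. This is a deliberate simplification that ignores bandwidth, shader length, and everything else. It also just counts texture units and doesn't model how NV40 actually gangs its pipes for 8x2 rendering.

```python
def max_pixels_per_clock(pipes: int, tmus_per_pipe: int, textures_per_pixel: int) -> float:
    """Simplified upper bound: limited by pipe count or by total texture units."""
    total_tmus = pipes * tmus_per_pipe
    return min(pipes, total_tmus / textures_per_pixel)


nv3x = {t: max_pixels_per_clock(4, 2, t) for t in (1, 2)}   # 4x2
nv40 = {t: max_pixels_per_clock(16, 1, t) for t in (1, 2)}  # 16x1
speedup = {t: nv40[t] / nv3x[t] for t in (1, 2)}
```

With one texture, NV40 is pipe-limited at 16 pixels per clock against NV3x's 4, a 4x ratio. With two textures, its 16 texture units cap it at 8 pixels per clock against NV3x's 4, a 2x ratio. Those match the quadrupled and doubled figures above.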
As if all this weren't enough, all the pixel pipes are dual-issue (as with the vertex shader units) and co-issue capable. DirectX 9 co-issue is the ability to execute two operations on different components of the same pixel at the same time. This means that (under the right conditions) both math units in a pixel pipe can be active at once, and each unit can run two instructions on different components of a pixel. This gives a maximum of four instructions per clock per pixel pipe. Of course, how often this gets used remains to be seen.
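Here is a toy illustration of the co-issue idea, plus the peak-rate arithmetic. The function names are hypothetical, and the .rgb/.a split is just one illustrative pairing (the text doesn't say which component splits NV40 supports). A single co-issued slot applies one operation to the color channels and a different one to alpha. Dual issue (two math units) times co-issue (two ops per unit) gives the per-pipe peak.

```python
import operator
from typing import Callable, Tuple

Vec4 = Tuple[float, float, float, float]


def coissue(rgb_op: Callable[[float, float], float],
            alpha_op: Callable[[float, float], float],
            a: Vec4, b: Vec4) -> Vec4:
    """Model one co-issued slot: one op on .rgb, a different op on .a."""
    r, g, bl = (rgb_op(x, y) for x, y in zip(a[:3], b[:3]))
    return (r, g, bl, alpha_op(a[3], b[3]))


def peak_instructions_per_clock(pipes: int, math_units_per_pipe: int = 2,
                                ops_per_unit: int = 2) -> int:
    """Upper bound: dual issue (2 units) times co-issue (2 ops per unit)."""
    return pipes * math_units_per_pipe * ops_per_unit


# Multiply the color channels while adding the alpha channels in one slot.
result = coissue(operator.mul, operator.add, (1.0, 2.0, 3.0, 4.0), (2.0, 2.0, 2.0, 1.0))
```

Scaled to 16 pipes, the same arithmetic gives a theoretical ceiling of 64 instructions per clock. As the text says, real shaders rarely line up perfectly enough to hit it.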
The end of the pipeline consists of the ROP pixel pipeline. These units take care of antialiasing, z and color compression, and the final drawing of a pixel. There are 16 of them, and each can either compute one color+z pixel or perform two z/stencil operations per clock. This means the chip can perform 32 z or stencil operations (think shadowing) or draw 16 pixels per clock cycle. Thus NVIDIA has dubbed this architecture a 16x1 / 32x0 architecture. On a side note, they have retroactively dubbed the NV3x a 4x2 / 8x0 architecture.
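The "16x1 / 32x0" naming can be reproduced with a couple of lines of arithmetic. This sketch reads the label as (color pixels per clock) x (texture units per pipe) / (z-only ops per clock) x 0 textures, which is an interpretation of the text rather than an official NVIDIA definition. The helper names are hypothetical. For NV3x it simply plugs in the stated 8 z-only ops per clock instead of guessing at that chip's ROP layout.

```python
def z_stencil_ops_per_clock(rop_units: int, z_ops_per_unit: int = 2) -> int:
    """Each ROP does either one color+z pixel or two z/stencil ops per clock."""
    return rop_units * z_ops_per_unit


def architecture_label(color_pixels: int, tmus_per_pipe: int, z_only_ops: int) -> str:
    """Format throughput figures in the 'AxB / Cx0' style used in the text."""
    return f"{color_pixels}x{tmus_per_pipe} / {z_only_ops}x0"


nv40_label = architecture_label(16, 1, z_stencil_ops_per_clock(16))
nv3x_label = architecture_label(4, 2, 8)  # 8 z-only ops/clock, as stated in the text
```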
Cool, huh? Hello? *sigh* Girls. . .