
NVIDIA RTX 40 GPUs will be almost 100% faster than RTX 30

New data has hit the playing field, and it is very interesting data, because we get to talk about the architectures themselves and the performance in sight: the supposed block diagram of the SM and GPC for the RTX 40 has leaked, so we can get a better idea of where NVIDIA is headed.

The first thing we need to be clear on is that Ada Lovelace is a different architecture from Hopper. NVIDIA will therefore segment its ranges, taking the best of both to address each market more competitively and go a step beyond what we saw with Ampere, and in a way to be more competitive against AMD, which is following similar steps. With that said, let's move on to the changes we'll see in Ada Lovelace's gaming architecture.

RTX 40 vs RTX 30, is there such a big leap in performance?

(Figure: NVIDIA RTX 40 Ada Lovelace SM diagram)

No, not really, at least if the data is accurate (as always, take what follows with a grain of salt). The hype is dying down a bit and we are moving from that supposed +2.2x to more realistic figures, although it is also true that we are still missing crucial data, as we will see below. Either way, keep your feet on the ground and let's get into it.

So as not to confuse ourselves: what we see above is the TPC diagram of the new AD102, which will serve as the base structure for all RTX 40 chips, with the SM as the fixed building block. The main thing to understand is that the diagram shows the two SMs side by side rather than one above the other. This has no relevance to how they operate; it is simply a different way of representing them, and it can be confusing because it looks like we are talking about an MCM architecture, which it won't be, at least not for a single chip (a TSMC SoIC stacking system would be another matter).

This is important to clarify because NVIDIA's own diagrams usually draw a TPC by stacking its SMs vertically; we stress that this is only a schematic representation, not an architectural change. With that understood, let's go through the changes. We know that NVIDIA's GPCs consist of TPCs, and these in turn of SMs, as far as the hierarchy goes; then, within each SM, we have the different execution units.

(Figure: NVIDIA Ada Lovelace vs Ampere, AD102 vs GA102 comparison)

Knowing this and comparing the new AD102 with the current GA102, we have 12 GPCs compared to the 7 of the Ampere architecture. Although that is roughly 70% more, each GPC keeps the 6 TPCs that the RTX 30 has, as well as the 2 SMs inside each of them. In other words, the GPC hierarchy is maintained, but their number is increased.
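Those hierarchy figures are easy to sanity-check. A minimal sketch in Python (the GPC/TPC/SM counts are the leaked values, not confirmed specifications):

```python
# Leaked GPU hierarchy: each GPC holds 6 TPCs, each TPC holds 2 SMs.
def total_sms(gpcs, tpcs_per_gpc=6, sms_per_tpc=2):
    """Total streaming multiprocessors for a given GPC count."""
    return gpcs * tpcs_per_gpc * sms_per_tpc

ga102_sms = total_sms(gpcs=7)    # Ampere GA102: 84 SMs
ad102_sms = total_sms(gpcs=12)   # Ada Lovelace AD102 (leaked): 144 SMs
print(ga102_sms, ad102_sms)      # -> 84 144
```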

Now to the SM as the minimum unit. Each of them has what NVIDIA calls sub-cores, and the four of Ampere are kept, but they no longer house the same units, and this is where the changes begin. NVIDIA moves from three engines per sub-core to four: what used to be one group for FP32 and INT32, one independent group for FP32, and the Tensor Cores is now, in this AD102, two independent FP32 groups and one INT32 group, plus fourth-generation Tensor Cores.

Why do this? Because NVIDIA wants to go from 64 dedicated FP32 units per SM to 128 and to reinforce the INT32 side with 64 dedicated units, for a total of 192 units per SM. That is, at the SM level there are two FP32 engines of 64 units each and an INT32 engine, also of 64 units. So what actually changes? The peak FP32 count does not increase, since Ampere could already issue 128 FP32 operations per SM, but by separating the INT32 units into their own engine the count is no longer 128 shared units as on Ampere; on Ada Lovelace it is now 128 + 64 (FP32 + INT32). The goal is to add muscle to rendering and possibly let the INT32 units dedicate their resources to computation for the RT Cores or Tensor Cores, depending on the complexity of the scene and its needs.
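To see why the separate INT32 engine matters, here is a toy issue model of our own (the lane counts follow the leak; the model itself is our simplification for illustration, not NVIDIA's documentation):

```python
def effective_fp32_lanes(int32_issued, shared_path=True):
    """Effective FP32 lanes per SM per cycle under a mixed workload.

    shared_path=True  ~ Ampere: 64 dedicated FP32 lanes plus 64 lanes
                        shared between FP32 and INT32.
    shared_path=False ~ leaked Ada layout: 128 FP32 lanes plus a
                        separate 64-lane INT32 engine.
    """
    if shared_path:
        # Each INT32 op occupies one of the 64 shared lanes,
        # displacing an FP32 op on that lane.
        return 128 - min(64, int32_issued)
    return 128  # INT32 work no longer steals FP32 slots

# A frame issuing 32 INT32 ops per SM per cycle:
print(effective_fp32_lanes(32, shared_path=True))   # -> 96
print(effective_fp32_lanes(32, shared_path=False))  # -> 128
```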

NVIDIA surely intends to move toward purer ray tracing, without its rendering and the work of the BVH algorithms sitting in the graphics pipeline, something that should become clear in the architecture overview; for now this is just speculation on our part.


Therefore, each SM has four sub-cores, and each of them houses its FP32 and INT32 engines apart from the Tensor Cores. Summing the four sub-cores that form an SM gives 128 FP32 units and 64 INT32 units, or 192 units per SM in total. Doing some simple math: 192 units per SM, times the 2 SMs in each TPC, times the 6 TPCs per GPC, times the 12 GPCs, gives 27,648 units, of which 18,432 are FP32, the familiar number that NVIDIA treats as shaders.
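Redoing that arithmetic with the leaked figures (all values are from the leak, nothing official):

```python
# Full AD102 configuration as leaked.
gpcs, tpcs_per_gpc, sms_per_tpc = 12, 6, 2
fp32_per_sm, int32_per_sm = 128, 64

sms = gpcs * tpcs_per_gpc * sms_per_tpc   # 144 SMs on the full die
fp32_total = sms * fp32_per_sm            # 18,432 FP32 shaders
int32_total = sms * int32_per_sm          # 9,216 INT32 units
print(sms, fp32_total, int32_total)       # -> 144 18432 9216
```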

With that understood, we come to the caches and their hierarchy, where there are significant changes. We go from an L1D of 128 KB per SM with shared memory to a much more complex system where each SM now has a 192 KB shared L1D and an L1I instruction cache about which we know nothing. But there are more changes: what used to be a single L0 + warp scheduler + dispatch unit is now three independent units per sub-core, with the same size and register file width.

These are three independent engines issuing 32 threads per clock cycle (the latter does not change), but it seems this move is related not only to the INT32/FP32 split we have seen, but also to the new L1I being able to balance load distribution more optimally; to take advantage of that, those engines have to be split up.

If you thought the changes were over... well, no. What were previously four load/store units now become a single block with the same function, respecting of course the SFUs, which we don't know whether they have grown in size. Finally, the texture units remain intact (as far as we know), while the RT Cores will become more complex with the jump to their third generation, about which we also have no information, though they surely involve important modifications considering all of the above.

What has also leaked is a huge increase in the L2 cache, which reaches a total of 96 MB on the AD102. At the same time, we couldn't overlook the ROPs given the jump in performance ahead, and NVIDIA has been smart in giving this new architecture twice as many units per GPC, 32 to be precise, which would give a total of 384 ROPs for the RTX 4090 against 112 for the RTX 3090; the leap is considerable.
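The ROP figures check out the same way (32 per GPC is the leaked value; 112 is the GA102's known count, at 16 per GPC):

```python
ad102_rops = 32 * 12   # 32 ROPs per GPC x 12 GPCs = 384 (leaked)
ga102_rops = 16 * 7    # 16 ROPs per GPC x 7 GPCs = 112 (GA102)
print(ad102_rops, ga102_rops, round(ad102_rops / ga102_rops, 2))
# -> 384 112 3.43
```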

AD102 vs GA102 vs TU102 vs GA100 vs GH100

After what has been explained, a comparative table:

(Table: AD102 RTX 40 vs GA102 RTX 30 vs Hopper GH100)

The table makes everything said easier to understand, and it adds the GA100 and GH100 as well, so it is very easy to see at a glance what Ada Lovelace and its AD102 will mean for the RTX 4090 versus the rest.

What can we expect in terms of actual performance? Well, without knowing the base and boost frequencies, or how fast and efficient the 4N node is compared to N5 (both from TSMC), we can't jump to conclusions. We can only go by the 1,780 MHz boost and 1,500 MHz base clocks of the GH100 that NVIDIA ships in its SXM5 version for servers, which uses the same lithographic process but a different architecture, despite sharing shader designs (including FP64 and FP16 as such).

We are talking about FP32 compute power of around 90 TFLOPS, more than double that of the current GA102. But as we well know, TFLOPS is not a good unit of measurement, and perhaps from that +2.2x we have been talking about we will end up at simply +2x. That is still a brutal jump, as is the one in power consumption, although at the same time the chip is comparatively more efficient, which is curious to say the least.
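Where might that 90 TFLOPS figure come from? A back-of-the-envelope sketch, assuming the usual convention of one FMA (two FP32 operations) per lane per cycle; the clock values here are our assumptions, not leaked specs:

```python
def fp32_tflops(shaders, clock_ghz):
    # One FMA per lane per cycle counts as two floating-point ops,
    # so peak TFLOPS = 2 * lanes * clock (GHz) / 1000.
    return 2 * shaders * clock_ghz / 1000.0

# At the GH100's 1.78 GHz boost, 18,432 shaders would give:
print(round(fp32_tflops(18432, 1.78), 1))   # -> 65.6
# Hitting ~90 TFLOPS instead implies a boost clock of roughly:
print(round(90 * 1000 / (2 * 18432), 2))    # -> 2.44 (GHz)
```

In other words, the rumored 90 TFLOPS only works out if the gaming chip clocks well above the server-oriented GH100, which fits with what we know about gaming versus datacenter silicon.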

