
I've got a Linux computer with a Ryzen 7 1800X CPU. According to WikiChip it has an L2 DTLB of 1536 entries, so I assumed the associativity would be divisible by 3. I wrote a little program that checks the associativity reported by CPUID, and interestingly it gives me an associativity of 8. Why is that? This would give 192 sets, so no easy modulo-power-of-2 indexing. So how is that index calculated efficiently?

That's my program:

#include <iostream>
#if defined(_MSC_VER)
    #include <intrin.h>
#elif defined(__GNUC__)
    #include <cpuid.h>
#endif

using namespace std;

unsigned cpuid( unsigned (&cpuidRegs)[4], unsigned code, unsigned ex );

int main()
{
    // AMD's 4-bit associativity codes for leaf 8000_0006h: the zeros are
    // disabled/reserved codes (9h means "see leaf 8000_001Dh"), (unsigned)-1 is
    // fully associative. Note there is no code for 12 ways.
    static unsigned const SHORT_WAYS[0x10] = { 0, 1, 2, 0, 4, 0, 8, 0, 16, 0, 32, 48, 64, 96, 128, (unsigned)-1 };
    unsigned regs[4];
    // Leaf 8000_0006h: EBX[27:16] = L2 DTLB entry count for 4K pages,
    // EBX[31:28] = associativity code (decoded via SHORT_WAYS above).
    cpuid( regs, 0x80000006u, 0 );
    unsigned n = regs[1] >> 16 & 0xFFF, ways = SHORT_WAYS[regs[1] >> 28];
    cout << "L2 D-TLB: " << n << " / " << ways << " ways" << endl;
}

inline unsigned cpuid( unsigned (&cpuidRegs)[4], unsigned code, unsigned ex )
{
#if defined(_MSC_VER)
    __cpuidex( (int *)cpuidRegs, code, ex );
#elif defined(__GNUC__)
    __cpuid_count(code, ex, cpuidRegs[0], cpuidRegs[1], cpuidRegs[2], cpuidRegs[3]);
#endif
    return cpuidRegs[0];
}

1 Answer


AMD's optimization manual from 2017 says Zen 1's L2 dTLB is 12-way associative with 1536 entries, at the top of page 26, in section 2.7.2 L2 Translation Lookaside Buffers. (1536 entries / 12 ways = 128 entries per way, a power of 2.) That document is nominally about the Epyc 7001 series, but those are the same Zen 1 cores as your Ryzen.

The L2 iTLB is 8-way associative (512 entries, for 4K or 2M pages, with a 1G page entry "smashed" into a 2M entry).

But assuming you're checking the right level, 8000_0006h, it seems there's no encoding for 12-way associativity in the field. It's unfortunately just a 4-bit code selecting from a table of possible values, not an integer bitfield.

Since there's (AFAIK) no way to encode a 12-way L2 dTLB, perhaps AMD just chose to encode the highest value <= the real value, so any code that uses it as a tuning parameter re: avoiding aliasing won't have way more conflict misses than expected.
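
As a toy illustration of that guess (this is my conjecture about a possible rounding rule, not anything AMD documents), rounding the real 12 ways down to the nearest associativity the leaf 8000_0006h table can express gives 8:

#include <iostream>

int main()
{
    // Associativities expressible by the 4-bit code, taken from the SHORT_WAYS
    // table in the question (disabled/reserved and fully-associative codes omitted).
    static unsigned const ENCODABLE[] = { 1, 2, 4, 8, 16, 32, 48, 64, 96, 128 };
    unsigned realWays = 12;        // Zen 1 L2 dTLB per the optimization manual
    unsigned reported = 1;
    for( unsigned w : ENCODABLE )
        if( w <= realWays )
            reported = w;          // keep the largest encodable value <= realWays
    std::cout << "reported ways: " << reported << std::endl;   // prints 8
}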

The 1001b encoding that means "see level 8000_001Dh instead" is (probably) not usable, because that level is only for normal caches, not TLBs.

But actually it's more interesting than that. Hadi Brais commented on this answer that it's not just a "simple" 12-way associative TLB, but also not fully separate. Instead, it's broken down into 8-way for 4K entries, 2-way for 2M/4M, and 2-way for coalesced 32K groups of 4K pages. Or on server CPUs, the breakdown is 6/3/3, and the CPUID dump reports 6-way for 4k and 3-way for 2M.

I found this write-up that gives an overview of the idea behind "skewed" TLBs. Apparently it does have separate ways for separate sizes, but with a hash function for indexing instead of just a couple low bits, reducing conflict misses vs. a simple index scheme for 2-way associative sub-sets.

Hadi writes:

Both the manual and cpuid info provide the correct L2 DTLB associativity and number of entries. Starting with Zen, the L2 DTLB is a skewed unified cache. This means that for a page with a particular address and size (which is unknown at the time of lookup), it can be mapped to some subset of ways of the total 12 ways according to a mapping function. For desktop/mobile models such as the Ryzen 7 1800X, any 4KB page can be mapped to 8 ways out of the 12 ways, any 2MB/4MB page can be mapped to 2 other ways, any coalesced 32KB page can be mapped to 2 other ways. That's a total of 12 ways.

For server models, the mapping is 6/3/3, respectively. The way cpuid reports TLB info is clear for previous uarches that use split TLBs. AMD wanted to use the same format for the new unified skewed design in Zen, but, as you can see, it doesn't really fit well. Anyway, effectively, it's indeed a 12-way cache with 1536 entries. You just have to know that it's skewed to interpret the cpuid info correctly. PDEs are also cached in the L2 DTLB, but these work differently.
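
Here's a rough sketch of what such a skewed, partitioned lookup could look like in software terms. The 8/2/2 way split follows Hadi's description for desktop parts, but the hash functions, layout and replacement choice are invented for illustration; the real mapping functions don't seem to be published anywhere.

#include <cstdint>
#include <cstddef>
#include <iostream>

// Toy model of a skewed, partitioned 12-way TLB: 12 ways x 128 entries = 1536.
// Each page size probes its own subset of ways, and each way uses its own
// index hash (the "skew") instead of one shared set index.
struct TlbEntry { uint64_t vpn; uint64_t pfn; bool valid; };

constexpr size_t WAYS = 12, ENTRIES_PER_WAY = 128;
static TlbEntry tlb[WAYS][ENTRIES_PER_WAY];

// Made-up per-way hash into the 128 entries of one way.
static size_t hashIndex( uint64_t vpn, unsigned way )
{
    uint64_t h = vpn ^ (vpn >> 7) ^ (uint64_t(way) * 0x9E3779B97F4A7C15ull);
    return h & (ENTRIES_PER_WAY - 1);
}

// Desktop Zen 1 split per Hadi: ways 0..7 for 4K pages, 8..9 for 2M/4M,
// 10..11 for coalesced 32K groups of 4K pages.
static bool lookup4K( uint64_t vpn, uint64_t &pfn )
{
    for( unsigned way = 0; way < 8; ++way )
    {
        TlbEntry &e = tlb[way][hashIndex( vpn, way )];
        if( e.valid && e.vpn == vpn )
        {
            pfn = e.pfn;
            return true;
        }
    }
    return false;   // miss: fall back to the page walker
}

static void insert4K( uint64_t vpn, uint64_t pfn )
{
    unsigned way = unsigned( vpn % 8 );                   // toy replacement choice
    tlb[way][hashIndex( vpn, way )] = { vpn, pfn, true };
}

int main()
{
    insert4K( 0x12345, 0xABCDE );
    uint64_t pfn;
    std::cout << (lookup4K( 0x12345, pfn ) ? "hit" : "miss") << std::endl;
}

The point is that each way gets its own power-of-2 index, so no single non-power-of-2 set index is ever computed, and the 8/2/2 (or 6/3/3) partition determines which ways a given page size is allowed to live in.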


AMD may have published an erratum or other documentation about the CPUID encoding for L2 dTLB associativity on Zen.


BTW, Wikichip's Zen page unfortunately doesn't list the associativities of each level of TLB. But https://www.7-cpu.com/cpu/Zen.html does list the same associativities as AMD's PDF manual.


This would give 192 sets, so no easy modulo-power-of-2 indexing.

Indeed, that would require some trickery, if it's doable efficiently at all.
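
For instance (purely a sketch of mine, not how any real CPU is documented to do it), a 192-set index could be built from (vpn mod 3) concatenated with six low VPN bits, and vpn mod 3 doesn't need a divider: 4, 16 and 256 are all congruent to 1 mod 3, so folding base-4 digits together with small adders preserves the residue.

#include <cstdint>
#include <cassert>
#include <iostream>

// vpn mod 3 without division: fold 2-bit digits, then nibbles, then bytes
// (all of which preserve the value mod 3), and reduce the small final sum.
static unsigned mod3( uint64_t x )
{
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    x = (x & 0x0F0F0F0F0F0F0F0Full) + ((x >> 4) & 0x0F0F0F0F0F0F0F0Full);
    x = (x * 0x0101010101010101ull) >> 56;   // sum of the bytes
    while( x > 3 )
        x = (x & 3) + (x >> 2);
    return x == 3 ? 0 : unsigned( x );
}

int main()
{
    for( uint64_t vpn = 0; vpn < 1000000; ++vpn )
    {
        // Not the same mapping as vpn % 192, but it still sends every VPN
        // to exactly one of 192 sets, which is all an index function needs.
        unsigned set = mod3( vpn ) * 64 + unsigned( vpn & 63 );
        assert( set < 192 && mod3( vpn ) == vpn % 3 );
    }
    std::cout << "ok" << std::endl;
}

That said, Zen apparently sidesteps the problem anyway: with 12 ways of 128 entries each, every per-way index is a power-of-2 affair.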

For example, @Hadi suggested in comments on How does the indexing of the Ice Lake's 48KiB L1 data cache work? that a split design could have been possible, e.g. a 32k and a 16k cache. (But actually Intel did increase associativity to 12-way, keeping the number of sets the same and a power of 2: 48 KiB / 12 ways / 64 B lines = 64 sets, so the index bits still come from within the page offset, avoiding aliasing problems while maintaining VIPT performance.)

That's actually a very similar Q&A, but with wrong associativity coming from a manual instead of CPUID. CPUs do sometimes have bugs where CPUID reports wrong info about cache/TLB parameters; programs that want to use CPUID info should have tables of fixups per CPU model/stepping so you have a place to correct errata that don't get fixed by microcode update.

(Although in this case it may not really be fixable due to encoding limitations, except by defining some of the unused encodings.)
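
A sketch of what such a fixup table could look like; the family/model range and the single Zen 1 entry are only illustrative, not a vetted errata list.

#include <cstdint>
#include <iostream>

// Hypothetical override table consulted after decoding leaf 8000_0006h:
// if the running CPU matches an entry, trust the table over CPUID.
struct TlbFixup
{
    uint32_t family;                 // CPUID family (17h covers Zen 1)
    uint32_t modelLo, modelHi;       // model range the fixup applies to (illustrative)
    unsigned l2DtlbTotalWays;        // corrected total associativity
};

static const TlbFixup FIXUPS[] =
{
    // Zen 1 desktop: CPUID reports 8-way, the optimization manual says 12-way (skewed).
    { 0x17, 0x00, 0x0F, 12 },
};

static unsigned fixupL2DtlbWays( uint32_t family, uint32_t model, unsigned reportedWays )
{
    for( const TlbFixup &f : FIXUPS )
        if( family == f.family && model >= f.modelLo && model <= f.modelHi )
            return f.l2DtlbTotalWays;
    return reportedWays;             // no entry: trust what CPUID reported
}

int main()
{
    // e.g. family 17h, model 01h (Ryzen 7 1800X), with CPUID-reported 8 ways
    std::cout << fixupL2DtlbWays( 0x17, 0x01, 8 ) << std::endl;   // prints 12
}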

  • The associativity (number of ways) given in CPUID 0x80000006u is encoded in a 4-bit value. The encoding is given in amd.com/system/files/TechDocs/25481.pdf - but there isn't an encoding for 12 ways. Commented Nov 3, 2021 at 6:19
  • Yea sure. instlatx64 has a treasure trove of CPUID dumps. Take, for example, the EPYC 7551P server processor, whose dump can be found here. Leaf 80000006 tells you that the L2 DTLB has 6 ways for 4KB pages and 3 ways for 2MB/4MB pages. Note that CPUID doesn't report the number of ways for coalesced pages (32KB in Zen/Zen+), but it's easy to deduce from the other information. OP's processor dump is here.
    – Hadi Brais
    Commented Nov 4, 2021 at 1:49
  • Oh, I don't think you'll be able to find this piece of information in any public written source. At least I couldn't find any. But I'm 99% sure it's skewed. You can quote me on this, no problem, until someone finds an official source.
    – Hadi Brais
    Commented Nov 4, 2021 at 2:03
