Yeah, I agree that a Q sized LDR would be inappropriate here, but I only mentioned that as a side note and to further demonstrate the oddity.
The instructions GCC is generating are 32-bit, S-sized LDRs, and the size isn't the issue - it's the addressing. If you look at the example asm code above, an offset is included with the base address as the program attempts to read each of the subsequent 32-bit registers in turn (in the example, it actually begins at the last address and decrements to the first). The problem is that each read, even though it has a valid offset, always returns the first 32 bits of the 128-bit address range!
In other words, if the base address (loaded into X1) is 0x3f202010 and you do an LDR S1, [X1], that will work, but an LDR with a 4-byte offset will STILL return the first 32 bits. I.e., LDR S1, [X1, #4] will not load the value at 0x3f202014; it will STILL load the value at 0x3f202010! Taken from the opposite direction, if the base address is 0x3f20201c and you do an LDR S1, [X1], you will still get the data from 0x3f202010! This is the weirdness.
Again, this only seems to happen in device-type memory, as offsets work fine in regular memory or off the stack (as do D- or Q-sized loads, but that's merely incidental).
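To make the symptom concrete, here is a small host-side model of the addresses involved. The names `expected_address` and `observed_address` are made up for illustration, and the masking rule is only one reading of the behaviour described above (every S-sized load landing on the start of the 16-byte block). It is not a statement about how the hardware actually decodes addresses.

```cpp
#include <cstdint>

// Address of SDRSP0, the first of the four 32-bit response registers.
constexpr std::uint32_t kSdrsp0 = 0x3f202010u;

// Address that a 32-bit load with a byte offset is supposed to access.
constexpr std::uint32_t expected_address(std::uint32_t base, std::int32_t offset) {
    return base + static_cast<std::uint32_t>(offset);
}

// Hypothetical model of the reported symptom: the load behaves as if the
// low four address bits were dropped, so it always hits the start of the
// 128-bit (16-byte) range.
constexpr std::uint32_t observed_address(std::uint32_t base, std::int32_t offset) {
    return expected_address(base, offset) & ~std::uint32_t{0xF};
}

// The two cases from the description above.
static_assert(expected_address(kSdrsp0, 4) == 0x3f202014u, "offset load should reach SDRSP1");
static_assert(observed_address(kSdrsp0, 4) == kSdrsp0, "but the data comes from SDRSP0");
static_assert(observed_address(0x3f20201cu, 0) == kSdrsp0, "same when the base is SDRSP3");
```

If the model holds, all four S-sized reads would collapse onto 0x3f202010, which matches the values seen in S1 and S2.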
I don't think this is a high level issue, or even a compiler issue - it's something specific to device memory type and SIMD LDR/LDUR instructions in ARMv8-A.
I guess I could just ask whether anyone else is successfully doing baremetal with -O2 (or -O3), or just with the -ftree-slp-vectorize optimization enabled, and happens to see any LDRs to SIMD registers from device memory regions?
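One possible way to sidestep the question while it is open is to take the register choice away from the compiler. The sketch below is untested on the SDHOST hardware, and `mmio_read32` and `read_response` are hypothetical helpers, not part of the original code. It assumes that a load into a general-purpose W register from device memory behaves correctly, which is what the non-vectorized build suggests. On AArch64 it uses inline asm to force that form; elsewhere it falls back to a plain volatile access so the code can be exercised on a host machine.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: read one 32-bit device register through a
// general-purpose register.
static inline std::uint32_t mmio_read32(std::uintptr_t addr) {
#if defined(__aarch64__)
    std::uint32_t value;
    // Explicit W-register load, so the compiler cannot choose an S/D/Q register.
    asm volatile("ldr %w0, [%1]" : "=r"(value) : "r"(addr) : "memory");
    return value;
#else
    // Host fallback: ordinary volatile 32-bit read.
    return *reinterpret_cast<volatile std::uint32_t*>(addr);
#endif
}

// Read four consecutive 32-bit registers starting at `base`
// (for SDHOST this would be SDRSP0..SDRSP3, in that order).
static inline void read_response(std::uintptr_t base, std::uint32_t (&out)[4]) {
    assert(base % 4 == 0 && "registers are 32-bit aligned");
    for (unsigned i = 0; i < 4; ++i) {
        out[i] = mmio_read32(base + 4u * i);
    }
}
```

On the target this would be called as `read_response(0x3f202010, rsp)`, after which the shifts and masks operate on ordinary integers rather than on values the vectorizer loaded itself.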
My specific example is retrieving the SD card CID data from the RPI 3B SDHOST device after a valid CMD2, from the SDRSP0, SDRSP1, SDRSP2 and SDRSP3 registers. The card's Product Name is 7x ASCII characters whose contents are split across the SDRSP3 and SDRSP2 registers, so I have several shift-right and bit-mask operations in C++ to pull out the values. GCC is loading both registers into separate SIMD vectors and then performing those bitwise operations on the SIMD registers.
My SDCard_CID constructor and the resulting assembly (via objdump) are listed below.
In the assembly you can see gcc trying to vectorize the SDRSP3 (0x3f20201c) and SDRSP2 (0x3f202018) reads into S2 and S1 respectively. From there, it performs various shift-rights on the SIMD registers before storing them. Problem is, S2 and S1 always get the value of SDRSP0 (0x3f202010) as described above, so the result is always bad!
My SDCard_CID constructor:
Code:
SDCardCID(unsigned int b127_96, unsigned int b95_64, unsigned int b63_32, unsigned int b31_0)
    : manuf_id(b127_96 >> 24),
      rev_major(b63_32 >> 28),
      rev_minor(b63_32 >> 24 & 0xF),
      serial_no((b63_32 << 8) | (b31_0 >> 24)),
      manuf_month(b31_0 >> 8 & 0xF),
      manuf_year(b31_0 >> 12 & 0xFF)
{
    oem_id[0] = b127_96 >> 16 & 0xFF;
    oem_id[1] = b127_96 >> 8 & 0xFF;
    prod_name[0] = b127_96 & 0xFF;
    prod_name[1] = b95_64 >> 24 & 0xFF;
    prod_name[2] = b95_64 >> 16 & 0xFF;
    prod_name[3] = b95_64 >> 8 & 0xFF;
    prod_name[4] = b95_64 & 0xFF;
}
The resulting assembly (via objdump):
Code:
mov x1, #0x201c
movk x1, #0x3f20, lsl #16
mov x0, x1
ldr s2, [x0], #-4
ldur s1, [x1, #-4]
ldur w2, [x0, #-4]
ldur w1, [x0, #-8]
adrp x3, c5000 <irq_handlers+0x118>
add x3, x3, #0x520
add x0, x3, #0x10
ushr v0.2s, v2.2s, #24
ushr v7.2s, v2.2s, #16
ushr v6.2s, v2.2s, #8
ushr v5.2s, v1.2s, #24
ushr v4.2s, v1.2s, #16
ushr v3.2s, v1.2s, #8
mov v0.b[1], v7.b[0]
mov v0.b[2], v6.b[0]
mov v0.b[3], v2.b[0]
mov v0.b[4], v5.b[0]
mov v0.b[5], v4.b[0]
mov v0.b[6], v3.b[0]
mov v0.b[7], v1.b[0]
str d0, [x3, #16]
lsr w3, w2, #28
strb w3, [x0, #8]
ubfx x3, x2, #24, #4
strb w3, [x0, #9]
extr w2, w2, w1, #24
str w2, [x0, #12]
ubfx x2, x1, #8, #4
strb w2, [x0, #16]
lsr w1, w1, #12
strb w1, [x0, #17]
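For reference, the byte extraction the SIMD sequence is meant to reproduce can be checked on a host machine. This is only a sketch: `CidText` and `decode_cid_text` are hypothetical names, the function copies the OEM ID and product-name lines from the constructor above, and the two register values in the checks are made up rather than read from a real card.

```cpp
#include <cstdint>

// Hypothetical holder for the text fields the constructor fills in.
struct CidText {
    char oem_id[2];
    char prod_name[5];
};

// Same shifts and masks as the constructor, applied to the top two
// response words (bits 127..96 and 95..64 of the CID).
constexpr CidText decode_cid_text(std::uint32_t b127_96, std::uint32_t b95_64) {
    CidText t{};
    t.oem_id[0]    = static_cast<char>(b127_96 >> 16 & 0xFF);
    t.oem_id[1]    = static_cast<char>(b127_96 >> 8 & 0xFF);
    t.prod_name[0] = static_cast<char>(b127_96 & 0xFF);
    t.prod_name[1] = static_cast<char>(b95_64 >> 24 & 0xFF);
    t.prod_name[2] = static_cast<char>(b95_64 >> 16 & 0xFF);
    t.prod_name[3] = static_cast<char>(b95_64 >> 8 & 0xFF);
    t.prod_name[4] = static_cast<char>(b95_64 & 0xFF);
    return t;
}

// Made-up sample words: 0x03 'S' 'D' 'S' and 'U' '1' '6' 'G'.
static_assert(decode_cid_text(0x03534453u, 0x55313647u).oem_id[0] == 'S', "OEM ID, first char");
static_assert(decode_cid_text(0x03534453u, 0x55313647u).prod_name[0] == 'S', "name starts in the first word");
static_assert(decode_cid_text(0x03534453u, 0x55313647u).prod_name[4] == 'G', "name ends in the second word");
```

Feeding the same word in for both arguments, which is effectively what happens when S2 and S1 both receive SDRSP0, shows how the decoded name goes wrong even though the shifts themselves are fine.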
Cheers!
Statistics: Posted by willdieh — Thu May 16, 2024 10:34 pm