I've seen this code back in 2019 when initially trying to support HEVC on the Pi4. That's some black magic. We have used the 3D hardware for acceleration jobs in the past on the Pi3.
We have a yadif-style deinterlace, and parts of the HEVC decoder, written in QPU assembly.
These read and write memory through a DMA-type mechanism and work directly on the YUV buffers (no texture unit involved).
I managed to write NEON 8x8 transpose code, and it transposes a 1920x1080 byte buffer in 6ms, so it's close but should be sufficient. I also managed to mmap the decoded planes from the hardware H264 decoder, transpose them into another DMA buffer and import that into DRM. The video is transposed on the screen (yay!), but for some reason the transpose now takes 35ms, so it's almost 6x slower. My guess is that this is due to the video decoder placing those buffers into uncached memory (like in this post?). The only way out of that might be to force the H264 decoder to allocate buffers on the ARM side somehow? EDIT: Just discovered the "dmabuf_alloc" "cma" decoder setting. Hm
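For anyone following along, here is roughly what a tiled transpose of a byte plane looks like in plain C. To be clear, this is not the NEON code, just a scalar sketch of the same 8x8 tiling idea. The names `transpose_tile` and `transpose_plane` and the stride parameters are made up for illustration, and it assumes one byte per sample with separate source and destination buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE 8 /* tile edge; matches the 8x8 block size mentioned above */

/* Transpose one tile of up to TILE x TILE bytes (hypothetical helper).
 * The source is read row by row, the destination is written column-wise. */
static void transpose_tile(const uint8_t *src, size_t src_stride,
                           uint8_t *dst, size_t dst_stride,
                           size_t rows, size_t cols)
{
    for (size_t y = 0; y < rows; y++)
        for (size_t x = 0; x < cols; x++)
            dst[x * dst_stride + y] = src[y * src_stride + x];
}

/* Transpose a width x height byte plane into a height x width plane.
 * Strides are in bytes; edge tiles smaller than TILE are handled too. */
void transpose_plane(const uint8_t *src, size_t src_stride,
                     uint8_t *dst, size_t dst_stride,
                     size_t width, size_t height)
{
    assert(src != dst); /* out-of-place only */
    assert(src_stride >= width && dst_stride >= height);

    for (size_t ty = 0; ty < height; ty += TILE) {
        size_t rows = (height - ty < TILE) ? height - ty : TILE;
        for (size_t tx = 0; tx < width; tx += TILE) {
            size_t cols = (width - tx < TILE) ? width - tx : TILE;
            transpose_tile(src + ty * src_stride + tx, src_stride,
                           dst + tx * dst_stride + ty, dst_stride,
                           rows, cols);
        }
    }
}
```

Working tile by tile keeps both the reads and the writes within a handful of rows at a time, which is the reason for blocking in the first place. Every source byte is still read exactly once, so if the source mapping really is uncached, the read side is where I would expect the time to go.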
Statistics: Posted by dividuum — Wed Feb 19, 2025 12:07 pm