How do the shuffle/permute intrinsics work for 256 bit pd?
I'm trying to wrap my head around how the _mm256_shuffle_pd and _mm256_permute_pd intrinsics work.
I can't seem to predict what the results of one of these operations would be.
First, for _mm_shuffle_ps all is good. The results I get are the ones I expect. For example:
float b[4] = { 1.12, 2.22, 3.33, 4.44 };
__m128 a = _mm_load_ps(&b[0]);
a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 1, 2));
_mm_store_ps(&b[0], a);
// 3.33 2.22 1.12 4.44
So everything is right here. Now I wanted to try this with __m256d, which is what I'm currently using in my code.
From what I've found, the _mm256_shuffle_ps/pd intrinsics work differently.
My understanding is that the control mask is applied twice: first to the low 128-bit lane and then to the high 128-bit lane.
The low two pairs of control bits choose from the first vector (and land in the first & second and the fifth & sixth elements of the result vector), while the high bit pairs choose from the second vector.
For example:
float b[8] = { 1.12, 2.22, 3.33, 4.44, 5.55, 6.66, 7.77, 8.88 };
__m256 a = _mm256_load_ps(&b[0]);
a = _mm256_shuffle_ps(a, a, 0b00000111);
_mm256_store_ps(&b[0], a);
// 4.44 2.22 1.12 1.12 8.88 6.66 5.55 5.55
Here the result I expect (and actually get) is 4.44, 2.22, 1.12, 1.12, 8.88, 6.66, 5.55, 5.55.
This should work as follows (sorry, I'm bad at drawing): the low pairs of control bits pick from the first vector, and the same is done for the second vector (in this case a again) using the highest two pairs (so 00 00), filling the remaining slots.
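To make the per-lane behaviour concrete, here is a small scalar model of one 128-bit lane (my own sketch written from the documented shufps selection; the function name is made up):
// One 128-bit lane of _mm256_shuffle_ps(src1, src2, imm8); the same selection
// is repeated for the high lane using elements 4..7.
static void shuffle_ps_lane_model(const float src1[4], const float src2[4],
                                  unsigned imm8, float dst[4])
{
    dst[0] = src1[(imm8 >> 0) & 3];
    dst[1] = src1[(imm8 >> 2) & 3];
    dst[2] = src2[(imm8 >> 4) & 3];
    dst[3] = src2[(imm8 >> 6) & 3];
}
// imm8 = 0b00000111 gives selectors 3, 1, 0, 0, so the low lane becomes
// 4.44 2.22 1.12 1.12 and the high lane 8.88 6.66 5.55 5.55, as above.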
I thought that _mm256_shuffle_pd would work the same way. So if I wanted the first double, I would have to move the 00 slot and the 01 slot to construct it correctly.
For example:
double b[4] = { 1.12, 2.22, 3.33, 4.44 };
__m256d a = _mm256_load_pd(&b[0]);
a = _mm256_shuffle_pd(a, a, 0b01000100);
_mm256_store_pd(&b[0], a);
// 1.12 1.12 4.44 3.33
I would have expected this to output 1.12, 1.12, 3.33, 3.33.
In my head, I'm taking 00 01 (1.12) and 00 01 (3.33) from the first vector, and the same from the second, it being the same vector and all.
I've tried many combinations of the control mask, and I just can't wrap my head around how it is used, nor was I able to find an explanation I could understand.
So my question is: how does _mm256_shuffle_pd work? And how would I get the same result as _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1)) with four doubles and a shuffle (if that's possible at all)?
1 Answer
shufps needs all 8 bits of its immediate just for 4 elements with 4 possible sources each. So it has no room to grow for 256-bit, and the only option was to replicate the same shuffle in both lanes.
But 128-bit shufpd only has 2 elements with 2 sources each, thus 2 x 1 bit. So the AVX version uses 4 bits total, 2 for each lane. (It's not lane-crossing, so it's not as powerful as 128-bit shufps.)
http://felixcloutier.com/x86/SHUFPD.html has full docs with a diagram and detailed pseudocode. Intel's intrinsics guide for _mm256_shuffle_pd has the same pseudocode.
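Rendered as plain C, that pseudocode amounts to roughly this (my own sketch of the documented behaviour; the function name is made up):
// Scalar model of _mm256_shuffle_pd(src1, src2, imm8): one selector bit per
// output element, and each lane only sees its own half of the sources.
// Only the low 4 bits of imm8 are used for the 256-bit version.
static void shuffle_pd_256_model(const double src1[4], const double src2[4],
                                 unsigned imm8, double dst[4])
{
    dst[0] = src1[(imm8 >> 0) & 1];        // low lane, from src1
    dst[1] = src2[(imm8 >> 1) & 1];        // low lane, from src2
    dst[2] = src1[2 + ((imm8 >> 2) & 1)];  // high lane, from src1
    dst[3] = src2[2 + ((imm8 >> 3) & 1)];  // high lane, from src2
}
// With both sources = {1.12, 2.22, 3.33, 4.44} and imm8 = 0b0100 (the upper
// bits of 0b01000100 are ignored), this gives {1.12, 1.12, 4.44, 3.33},
// matching the output seen in the question.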
AVX2 http://felixcloutier.com/x86/VPERMPD.html (_mm256_permute4x64_pd; note that _mm256_permute_pd is a different intrinsic, the in-lane vpermilpd) is lane-crossing, and uses its immediate exactly the way 128-bit shufps does: four 2-bit selectors.
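So, assuming AVX2 is available, the double version of the _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1)) asked about in the question is a single vpermpd (a sketch; the function name is mine):
#include <immintrin.h>

// AVX2: lane-crossing 4x64-bit permute. _MM_SHUFFLE works unchanged because
// vpermpd's immediate is four 2-bit selectors, just like 128-bit shufps.
__m256d rotate_yzxw_avx2(__m256d a)                            // a = {a0, a1, a2, a3}
{
    return _mm256_permute4x64_pd(a, _MM_SHUFFLE(3, 0, 2, 1));  // {a1, a2, a0, a3}
}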
The only lane-crossing 2-source shuffle is vperm2f128 (_mm256_permute2f128_pd), until AVX512F introduces the finer-granularity vpermt2pd and vpermt2ps (and the equivalent integer shuffles).
AVX1 doesn't have any lane-crossing shuffles with granularity smaller than 128-bit, not even 1-source versions. If you need one, you have to build it out of vinsertf128 or vperm2f128 + in-lane shuffles.
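For example, one possible AVX1-only sequence (my own sketch, not necessarily optimal) for the _MM_SHUFFLE(3, 0, 2, 1)-style rotation from the question is a vperm2f128 followed by two in-lane shuffles:
#include <immintrin.h>

// AVX1: emulate _mm256_permute4x64_pd(a, _MM_SHUFFLE(3, 0, 2, 1)).
__m256d rotate_yzxw_avx1(__m256d a)                        // a    = {a0, a1, a2, a3}
{
    __m256d swap = _mm256_permute2f128_pd(a, a, 0x01);     // swap = {a2, a3, a0, a1}
    __m256d t    = _mm256_shuffle_pd(a, swap, 0b0101);     // t    = {a1, a2, a3, a0}
    return _mm256_permute_pd(t, 0b0110);                   //        {a1, a2, a0, a3}
}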
Thus, keeping 3D vectors in SIMD vectors is even worse with AVX than it is for float with 128-bit vectors. http://fastcpp.blogspot.com/2011/04/vector-cross-product-using-sse-code.html might be faster than scalar, but it's much worse than you can do if you design your data layout for SIMD.
Use separate arrays of contiguous x, y, and z so you can do 4x cross products in parallel with no shuffling, and take advantage of FMA instructions. Use SIMD to do multiple vectors in parallel, not to speed up single vectors.
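As a rough illustration of that layout (my own sketch: the array and function names are made up, and it assumes FMA3 is available, e.g. compiling with -mfma):
#include <immintrin.h>

// 4 cross products at once from separate x/y/z arrays: no shuffles needed.
// r = a x b:  rx = ay*bz - az*by,  ry = az*bx - ax*bz,  rz = ax*by - ay*bx
void cross4(const double *ax, const double *ay, const double *az,
            const double *bx, const double *by, const double *bz,
            double *rx, double *ry, double *rz)
{
    __m256d Ax = _mm256_loadu_pd(ax), Ay = _mm256_loadu_pd(ay), Az = _mm256_loadu_pd(az);
    __m256d Bx = _mm256_loadu_pd(bx), By = _mm256_loadu_pd(by), Bz = _mm256_loadu_pd(bz);

    __m256d Rx = _mm256_fmsub_pd(Ay, Bz, _mm256_mul_pd(Az, By));
    __m256d Ry = _mm256_fmsub_pd(Az, Bx, _mm256_mul_pd(Ax, Bz));
    __m256d Rz = _mm256_fmsub_pd(Ax, By, _mm256_mul_pd(Ay, Bx));

    _mm256_storeu_pd(rx, Rx);
    _mm256_storeu_pd(ry, Ry);
    _mm256_storeu_pd(rz, Rz);
}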
See links in https://stackoverflow.com/tags/sse/info, especially https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ which explains the data-layout issue quite well, and which level of a loop to vectorize with SIMD.
I've seen there is a permute4x64 that would be able to do what I need, but my processor doesn't support AVX2. My solution would be to use this to reverse the second parameter. Is there a faster, more correct solution that you would suggest?
– Animamorta
Aug 8 at 15:58
@Animamorta: Right. Whenever possible, don't keep single 3D vectors in SIMD vectors in the first place. See my update.
– Peter Cordes
Aug 8 at 20:42
Thanks a lot for the help and the great resources.
– Animamorta
Aug 8 at 21:39
This was really clarifying. Thanks a lot for showing me some great resources too. So there shouldn't be any possibility of replicating [this code](fastcpp.blogspot.com/2011/04/…) with __m256d using a single shuffle or permute, since they won't be able to move the 3rd element into the second slot, am I right?
– Animamorta
Aug 8 at 15:58