Principle of interleave shuffle with SSE
Target:
Given an ordered list of inputs:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
produce its interleaved shuffle:
1 9 17 2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24
Process:
The target above can be achieved with _mm_shuffle_ps (an SSE intrinsic). The code is as follows:
    #include <cstddef>      // size_t
    #include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_shuffle_ps, _mm_storeu_ps

    // Same encoding as _MM_SHUFFLE: result = { a[f0], a[f1], b[f2], b[f3] }
    #define MAKE_MASK(f3,f2,f1,f0) (((f3)<<6)|((f2)<<4)|((f1)<<2)|(f0))

    float fArray[24] = {.0};
    for(size_t i = 0; i < 24; i++)
        fArray[i] = static_cast<float>(i + 1);

    // Load the 24 floats into six 4-lane registers.
    __m128 a0 = _mm_loadu_ps(fArray);
    __m128 a1 = _mm_loadu_ps(fArray+4);
    __m128 a2 = _mm_loadu_ps(fArray+8);
    __m128 a3 = _mm_loadu_ps(fArray+12);
    __m128 a4 = _mm_loadu_ps(fArray+16);
    __m128 a5 = _mm_loadu_ps(fArray+20);

    // Pass 1: MAKE_MASK(2,0,2,0) picks the even lanes of each pair,
    // MAKE_MASK(3,1,3,1) the odd lanes. Evens go to b0..b2, odds to b3..b5.
    __m128 b0 = _mm_shuffle_ps(a0,a1,MAKE_MASK(2,0,2,0));
    __m128 b3 = _mm_shuffle_ps(a0,a1,MAKE_MASK(3,1,3,1));
    __m128 b1 = _mm_shuffle_ps(a2,a3,MAKE_MASK(2,0,2,0));
    __m128 b4 = _mm_shuffle_ps(a2,a3,MAKE_MASK(3,1,3,1));
    __m128 b2 = _mm_shuffle_ps(a4,a5,MAKE_MASK(2,0,2,0));
    __m128 b5 = _mm_shuffle_ps(a4,a5,MAKE_MASK(3,1,3,1));

    // Pass 2: the same even/odd split applied to b0..b5.
    __m128 c0 = _mm_shuffle_ps(b0,b1,MAKE_MASK(2,0,2,0));
    __m128 c3 = _mm_shuffle_ps(b0,b1,MAKE_MASK(3,1,3,1));
    __m128 c1 = _mm_shuffle_ps(b2,b3,MAKE_MASK(2,0,2,0));
    __m128 c4 = _mm_shuffle_ps(b2,b3,MAKE_MASK(3,1,3,1));
    __m128 c2 = _mm_shuffle_ps(b4,b5,MAKE_MASK(2,0,2,0));
    __m128 c5 = _mm_shuffle_ps(b4,b5,MAKE_MASK(3,1,3,1));

    // Pass 3: the same even/odd split applied to c0..c5.
    __m128 d0 = _mm_shuffle_ps(c0,c1,MAKE_MASK(2,0,2,0));
    __m128 d3 = _mm_shuffle_ps(c0,c1,MAKE_MASK(3,1,3,1));
    __m128 d1 = _mm_shuffle_ps(c2,c3,MAKE_MASK(2,0,2,0));
    __m128 d4 = _mm_shuffle_ps(c2,c3,MAKE_MASK(3,1,3,1));
    __m128 d2 = _mm_shuffle_ps(c4,c5,MAKE_MASK(2,0,2,0));
    __m128 d5 = _mm_shuffle_ps(c4,c5,MAKE_MASK(3,1,3,1));

    // Store the result back in order d0..d5.
    _mm_storeu_ps(fArray,d0);
    _mm_storeu_ps(fArray+4,d1);
    _mm_storeu_ps(fArray+8,d2);
    _mm_storeu_ps(fArray+12,d3);
    _mm_storeu_ps(fArray+16,d4);
    _mm_storeu_ps(fArray+20,d5);
Questions
To summarize: packing 24 floats into 6 __m128 registers and then shuffling them three times achieves my goal. I also found that packing 16 floats into 4 __m128 registers and then shuffling twice gives a similar result. Is there a general principle for shuffling a float array of size 4n (n = 1, 2, 3, 4, ...)?
Besides, can anyone help clarify the algorithm above or point me to relevant material?
c++ sse shuffle simd avx
It looks like what you are doing is just a 3x8 transpose?
– Paul R
Jan 4 at 8:42
@PaulR It seems to be, but how does a transpose relate to shuffling?
– Finley
Jan 4 at 9:17
Well, in the integer domain a transpose is usually implemented using unpack instructions rather than shuffles. I haven’t had time to give this any serious thought yet, but you might be able to use a similar approach for transposing floats.
– Paul R
Jan 4 at 9:21
Note that in the 3x8 float case you only need 5 shuffles, see this question and answer. For the 4x4 case there exists a handy macro. See also here for other cases.
– wim
Jan 4 at 22:51
The 5 shuffles refer to the AVX case. With SSE you need twice as many shuffles, which is still better than 18 shuffles.
– wim
Jan 4 at 23:56
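On the "general principle" part of the question, one possible reading is that every pass in the posted code applies the same permutation to the whole sequence: even-indexed elements move to the front half and odd-indexed elements to the back half. The scalar sketch below models that; `deinterleave_pass` and `apply_passes` are hypothetical names, and the model assumes the array length is even. Under this model, k passes read the input with stride 2^k, so 3 passes match 24 floats viewed as 3 rows of 8, and 2 passes match 16 floats viewed as 4 rows of 4.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one shuffle pass: even-indexed elements first,
// then odd-indexed elements. Assumes in.size() is even.
std::vector<float> deinterleave_pass(const std::vector<float>& in) {
    const std::size_t half = in.size() / 2;
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < half; ++i) {
        out[i]        = in[2 * i];      // even index goes to the front half
        out[half + i] = in[2 * i + 1];  // odd index goes to the back half
    }
    return out;
}

// Apply the same pass repeatedly, as the b/c/d stages in the question do.
std::vector<float> apply_passes(std::vector<float> v, int passes) {
    for (int p = 0; p < passes; ++p)
        v = deinterleave_pass(v);
    return v;
}
```

If this reading is right, the approach generalizes to arrays that can be viewed as R rows of 2^k columns, using k passes. The SSE version pairs registers, so it additionally needs an even number of `__m128` values, that is, a length that is a multiple of 8.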
edited Jan 4 at 7:05
asked Jan 4 at 6:28
Finley