Principle of interleave shuffle with SSE






Target:



Given an ordered list of inputs:



1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24


Produce its interleaved shuffle:



1 9 17 2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24


Diagram: [image not available]

Process:



The target above can be achieved with _mm_shuffle_ps (an SSE intrinsic). The code follows:



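// Requires <xmmintrin.h> (SSE intrinsics) and <cstddef> (size_t);
// the statements below belong inside a function body.

// Same bit layout as _MM_SHUFFLE: _mm_shuffle_ps(a,b,mask) yields { a[f0], a[f1], b[f2], b[f3] }.
#define MAKE_MASK(f3,f2,f1,f0) (((f3)<<6)|((f2)<<4)|((f1)<<2)|(f0))
float fArray[24] = {.0};
for(size_t i =0;i<24;i++)
    fArray[i] = static_cast<float>(i+1);   // 1, 2, ..., 24

// Load the 24 floats into six registers of four lanes each.
__m128 a0 = _mm_loadu_ps(fArray);      //  1  2  3  4
__m128 a1 = _mm_loadu_ps(fArray+4);    //  5  6  7  8
__m128 a2 = _mm_loadu_ps(fArray+8);    //  9 10 11 12
__m128 a3 = _mm_loadu_ps(fArray+12);   // 13 14 15 16
__m128 a4 = _mm_loadu_ps(fArray+16);   // 17 18 19 20
__m128 a5 = _mm_loadu_ps(fArray+20);   // 21 22 23 24

// Round 1: from each register pair, gather lanes 0 and 2 (mask 2,0,2,0) or lanes 1 and 3 (mask 3,1,3,1).
__m128 b0 = _mm_shuffle_ps(a0,a1,MAKE_MASK(2,0,2,0));   //  1  3  5  7
__m128 b3 = _mm_shuffle_ps(a0,a1,MAKE_MASK(3,1,3,1));   //  2  4  6  8
__m128 b1 = _mm_shuffle_ps(a2,a3,MAKE_MASK(2,0,2,0));   //  9 11 13 15
__m128 b4 = _mm_shuffle_ps(a2,a3,MAKE_MASK(3,1,3,1));   // 10 12 14 16
__m128 b2 = _mm_shuffle_ps(a4,a5,MAKE_MASK(2,0,2,0));   // 17 19 21 23
__m128 b5 = _mm_shuffle_ps(a4,a5,MAKE_MASK(3,1,3,1));   // 18 20 22 24

// Round 2: the same pair of masks applied to b0..b5.
__m128 c0 = _mm_shuffle_ps(b0,b1,MAKE_MASK(2,0,2,0));   //  1  5  9 13
__m128 c3 = _mm_shuffle_ps(b0,b1,MAKE_MASK(3,1,3,1));   //  3  7 11 15
__m128 c1 = _mm_shuffle_ps(b2,b3,MAKE_MASK(2,0,2,0));   // 17 21  2  6
__m128 c4 = _mm_shuffle_ps(b2,b3,MAKE_MASK(3,1,3,1));   // 19 23  4  8
__m128 c2 = _mm_shuffle_ps(b4,b5,MAKE_MASK(2,0,2,0));   // 10 14 18 22
__m128 c5 = _mm_shuffle_ps(b4,b5,MAKE_MASK(3,1,3,1));   // 12 16 20 24

// Round 3: the same pair of masks applied to c0..c5.
__m128 d0 = _mm_shuffle_ps(c0,c1,MAKE_MASK(2,0,2,0));   //  1  9 17  2
__m128 d3 = _mm_shuffle_ps(c0,c1,MAKE_MASK(3,1,3,1));   //  5 13 21  6
__m128 d1 = _mm_shuffle_ps(c2,c3,MAKE_MASK(2,0,2,0));   // 10 18  3 11
__m128 d4 = _mm_shuffle_ps(c2,c3,MAKE_MASK(3,1,3,1));   // 14 22  7 15
__m128 d2 = _mm_shuffle_ps(c4,c5,MAKE_MASK(2,0,2,0));   // 19  4 12 20
__m128 d5 = _mm_shuffle_ps(c4,c5,MAKE_MASK(3,1,3,1));   // 23  8 16 24

// Store d0..d5 back in order: 1 9 17 2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24
_mm_storeu_ps(fArray,d0);
_mm_storeu_ps(fArray+4,d1);
_mm_storeu_ps(fArray+8,d2);
_mm_storeu_ps(fArray+12,d3);
_mm_storeu_ps(fArray+16,d4);
_mm_storeu_ps(fArray+20,d5);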
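
To see what one pair of masks does in isolation, here is a minimal sketch on just two registers. The helper name `split_even_odd` is made up for illustration and is not part of the code above; `_MM_SHUFFLE` from `<xmmintrin.h>` is used in place of `MAKE_MASK` because it produces the same bit layout.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_shuffle_ps, _mm_storeu_ps, _MM_SHUFFLE

// Hypothetical helper (illustration only): splits the eight floats at `in`
// into the even-indexed elements (evens[0..3]) and the odd-indexed ones (odds[0..3]).
void split_even_odd(const float* in, float* evens, float* odds) {
    assert(in && evens && odds);
    __m128 lo = _mm_loadu_ps(in);      // in[0] in[1] in[2] in[3]
    __m128 hi = _mm_loadu_ps(in + 4);  // in[4] in[5] in[6] in[7]
    // Mask (2,0,2,0): lanes 0 and 2 of lo, then lanes 0 and 2 of hi.
    _mm_storeu_ps(evens, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)));
    // Mask (3,1,3,1): lanes 1 and 3 of lo, then lanes 1 and 3 of hi.
    _mm_storeu_ps(odds, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)));
}
```

Called on the floats 1 to 8, this should give 1 3 5 7 and 2 4 6 8, which is what b0 and b3 hold in the first round.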


Questions



To summarize, packing 24 floats into six __m128 registers and then applying three rounds of shuffles achieves my goal. I also found that packing 16 floats into four __m128 registers and applying two rounds of shuffles gives a similar result. Is there a general principle for shuffling a float array of size 4n (n = 1, 2, 3, 4, ...)?



Also, can anyone help clarify the algorithm above, or point me to relevant material?
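
One possible way to read the pattern (an interpretation, not something the post itself states): taken over the whole array, each round moves the even-indexed elements to the first half and the odd-indexed elements to the second half, so that out[j] = in[2j mod (N-1)] with the last element fixed. After k rounds this becomes out[j] = in[2^k j mod (N-1)], which is the index map of transposing a row-major matrix with 2^k columns. The scalar sketch below checks this for the two sizes mentioned; all function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One round, modeled on a plain array: evens to the first half, odds to the second.
std::vector<float> deinterleave(const std::vector<float>& in) {
    assert(in.size() % 2 == 0);
    const std::size_t half = in.size() / 2;
    std::vector<float> out(in.size());
    for (std::size_t j = 0; j < half; ++j) {
        out[j]        = in[2 * j];
        out[half + j] = in[2 * j + 1];
    }
    return out;
}

// Apply `rounds` de-interleave passes in sequence.
std::vector<float> apply_rounds(std::vector<float> v, int rounds) {
    for (int k = 0; k < rounds; ++k) v = deinterleave(v);
    return v;
}

// Reference: transpose a row-major rows x cols matrix.
std::vector<float> transpose(const std::vector<float>& in,
                             std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<float> out(in.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
    return out;
}

// Helper: the sequence 1, 2, ..., n.
std::vector<float> iota1(std::size_t n) {
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<float>(i + 1);
    return v;
}
```

With these helpers, `apply_rounds(iota1(24), 3)` should equal `transpose(iota1(24), 3, 8)`, and `apply_rounds(iota1(16), 2)` should equal `transpose(iota1(16), 4, 4)`. Under this reading the scheme needs 2^k to divide the array length and an even number of registers in every round, so it would not cover every size 4n.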










  • It looks like what you are doing is just a 3x8 transpose?

    – Paul R
    Jan 4 at 8:42











  • @PaulR It seems to be, but how does a transpose relate to the shuffles?

    – Finley
    Jan 4 at 9:17













  • Well in the integer domain a transpose is usually implemented using unpack instructions rather than shuffles. I haven’t had time to give this any serious thought yet, but you might be able to use a similar approach for transposing floats.

    – Paul R
    Jan 4 at 9:21






  • Note that in the 3x8 float case you only need 5 shuffles, see this question and answer. For the 4x4 case there exists a handy macro. See also here for other cases.

    – wim
    Jan 4 at 22:51








  • The 5 shuffles refer to the AVX case. With SSE you need twice as many shuffles, which is still better than 18 shuffles.

    – wim
    Jan 4 at 23:56
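
The links in the comments above are not preserved here, so the following is only a sketch of what they may be pointing at. For the 4x4 case it assumes the macro meant is `_MM_TRANSPOSE4_PS` from `<xmmintrin.h>`. For the 3x8 case it uses a well-known three-row interleave that needs twelve `_mm_shuffle_ps` calls instead of eighteen; this is not necessarily the sequence from the linked answer. The names `transpose4x4`, `interleave3` and `transpose3x8` are made up for illustration.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _MM_SHUFFLE, _MM_TRANSPOSE4_PS

// 4x4 case (16 floats), assuming the macro meant is _MM_TRANSPOSE4_PS.
void transpose4x4(const float* in, float* out) {
    assert(in && out);
    __m128 r0 = _mm_loadu_ps(in);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // transposes the four rows in place
    _mm_storeu_ps(out,      r0);
    _mm_storeu_ps(out + 4,  r1);
    _mm_storeu_ps(out + 8,  r2);
    _mm_storeu_ps(out + 12, r3);
}

// Interleaves four lanes of three rows x, y, z into
// x0 y0 z0 x1 | y1 z1 x2 y2 | z2 x3 y3 z3, using six shuffles.
static void interleave3(__m128 x, __m128 y, __m128 z, float* out) {
    __m128 xy = _mm_shuffle_ps(x, y, _MM_SHUFFLE(2, 0, 2, 0));  // x0 x2 y0 y2
    __m128 yz = _mm_shuffle_ps(y, z, _MM_SHUFFLE(3, 1, 3, 1));  // y1 y3 z1 z3
    __m128 zx = _mm_shuffle_ps(z, x, _MM_SHUFFLE(3, 1, 2, 0));  // z0 z2 x1 x3
    _mm_storeu_ps(out,     _mm_shuffle_ps(xy, zx, _MM_SHUFFLE(2, 0, 2, 0)));  // x0 y0 z0 x1
    _mm_storeu_ps(out + 4, _mm_shuffle_ps(yz, xy, _MM_SHUFFLE(3, 1, 2, 0)));  // y1 z1 x2 y2
    _mm_storeu_ps(out + 8, _mm_shuffle_ps(zx, yz, _MM_SHUFFLE(3, 1, 3, 1)));  // z2 x3 y3 z3
}

// 3x8 case (24 floats), twelve shuffles in total. `in` and `out` must not overlap.
void transpose3x8(const float* in, float* out) {
    assert(in && out && in != out);
    // Columns 0..3 of rows 0, 1, 2 fill out[0..11]; columns 4..7 fill out[12..23].
    interleave3(_mm_loadu_ps(in),     _mm_loadu_ps(in + 8),  _mm_loadu_ps(in + 16), out);
    interleave3(_mm_loadu_ps(in + 4), _mm_loadu_ps(in + 12), _mm_loadu_ps(in + 20), out + 12);
}
```

For the input 1 to 24, `transpose3x8` should write 1 9 17 2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24, the same order as the three-round version in the question.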


















c++ sse shuffle simd avx






asked Jan 4 at 6:28 by Finley, edited Jan 4 at 7:05












