Principle of interleave shuffle with SSE


Target:

Given an ordered list of inputs:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Produce its interleaved shuffle:

1 9 17 2 10 18 3 11 19 4 12 20 5 13 21 6 14 22 7 15 23 8 16 24
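As a reference point, the target ordering can be written as a plain scalar loop: view the 24 inputs as 3 rows of 8 and read them column by column. This is only a sketch for checking results. The function name interleave_reference and its rows/cols parameters are illustrative and not part of the SSE code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference (hypothetical helper): treat `in` as a rows x cols
// row-major matrix and return it read column by column, i.e. the
// flattened transpose.
std::vector<float> interleave_reference(const std::vector<float>& in,
                                        std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    std::vector<float> out(in.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
    return out;
}
```

Calling interleave_reference with the values 1 to 24, rows = 3 and cols = 8 yields the sequence listed above.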

Diagram: (image not available)

Process:

The target above can be achieved with the SSE intrinsic _mm_shuffle_ps. The code follows:

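    #include <cstddef>
    #include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_shuffle_ps, _mm_storeu_ps

    // Selector for _mm_shuffle_ps(a, b, mask): the result is
    // { a[f0], a[f1], b[f2], b[f3] }. Same layout as the standard _MM_SHUFFLE.
    #define MAKE_MASK(f3,f2,f1,f0) (((f3)<<6)|((f2)<<4)|((f1)<<2)|(f0))

    float fArray[24] = {.0};
    for(size_t i =0;i<24;i++)
        fArray[i] = static_cast<float>(i+1);

    // Load the 24 floats into six registers: a0 = 1 2 3 4, ..., a5 = 21 22 23 24.
    __m128 a0 = _mm_loadu_ps(fArray);
    __m128 a1 = _mm_loadu_ps(fArray+4);
    __m128 a2 = _mm_loadu_ps(fArray+8);
    __m128 a3 = _mm_loadu_ps(fArray+12);
    __m128 a4 = _mm_loadu_ps(fArray+16);
    __m128 a5 = _mm_loadu_ps(fArray+20);

    // Round 1: (2,0,2,0) keeps lanes 0 and 2 of each operand, (3,1,3,1) keeps lanes 1 and 3.
    __m128 b0 = _mm_shuffle_ps(a0,a1,MAKE_MASK(2,0,2,0));  //  1  3  5  7
    __m128 b3 = _mm_shuffle_ps(a0,a1,MAKE_MASK(3,1,3,1));  //  2  4  6  8
    __m128 b1 = _mm_shuffle_ps(a2,a3,MAKE_MASK(2,0,2,0));  //  9 11 13 15
    __m128 b4 = _mm_shuffle_ps(a2,a3,MAKE_MASK(3,1,3,1));  // 10 12 14 16
    __m128 b2 = _mm_shuffle_ps(a4,a5,MAKE_MASK(2,0,2,0));  // 17 19 21 23
    __m128 b5 = _mm_shuffle_ps(a4,a5,MAKE_MASK(3,1,3,1));  // 18 20 22 24

    // Round 2: the same pattern applied to b0..b5.
    __m128 c0 = _mm_shuffle_ps(b0,b1,MAKE_MASK(2,0,2,0));  //  1  5  9 13
    __m128 c3 = _mm_shuffle_ps(b0,b1,MAKE_MASK(3,1,3,1));  //  3  7 11 15
    __m128 c1 = _mm_shuffle_ps(b2,b3,MAKE_MASK(2,0,2,0));  // 17 21  2  6
    __m128 c4 = _mm_shuffle_ps(b2,b3,MAKE_MASK(3,1,3,1));  // 19 23  4  8
    __m128 c2 = _mm_shuffle_ps(b4,b5,MAKE_MASK(2,0,2,0));  // 10 14 18 22
    __m128 c5 = _mm_shuffle_ps(b4,b5,MAKE_MASK(3,1,3,1));  // 12 16 20 24

    // Round 3: the same pattern applied to c0..c5.
    __m128 d0 = _mm_shuffle_ps(c0,c1,MAKE_MASK(2,0,2,0));  //  1  9 17  2
    __m128 d3 = _mm_shuffle_ps(c0,c1,MAKE_MASK(3,1,3,1));  //  5 13 21  6
    __m128 d1 = _mm_shuffle_ps(c2,c3,MAKE_MASK(2,0,2,0));  // 10 18  3 11
    __m128 d4 = _mm_shuffle_ps(c2,c3,MAKE_MASK(3,1,3,1));  // 14 22  7 15
    __m128 d2 = _mm_shuffle_ps(c4,c5,MAKE_MASK(2,0,2,0));  // 19  4 12 20
    __m128 d5 = _mm_shuffle_ps(c4,c5,MAKE_MASK(3,1,3,1));  // 23  8 16 24

    // Store d0..d5 back in order.
    _mm_storeu_ps(fArray,d0);
    _mm_storeu_ps(fArray+4,d1);
    _mm_storeu_ps(fArray+8,d2);
    _mm_storeu_ps(fArray+12,d3);
    _mm_storeu_ps(fArray+16,d4);
    _mm_storeu_ps(fArray+20,d5);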
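One way to read the code above, sketched on plain arrays: if the six registers of each stage are viewed as one 24-element sequence, every round appears to move the even-indexed elements to the front half and the odd-indexed elements to the back half. The helper names deinterleave_round and deinterleave_rounds are made up for this illustration and do not appear in the SSE code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Scalar model of one round (hypothetical helper): even-indexed elements
// go to the front half, odd-indexed elements to the back half. This is
// what MAKE_MASK(2,0,2,0) and MAKE_MASK(3,1,3,1) do per register pair.
std::vector<float> deinterleave_round(const std::vector<float>& in) {
    assert(in.size() % 2 == 0);
    const std::size_t half = in.size() / 2;
    std::vector<float> out(in.size());
    for (std::size_t j = 0; j < half; ++j) {
        out[j]        = in[2 * j];      // evens
        out[half + j] = in[2 * j + 1];  // odds
    }
    return out;
}

// Apply the round `rounds` times in sequence.
std::vector<float> deinterleave_rounds(std::vector<float> v, int rounds) {
    for (int i = 0; i < rounds; ++i) v = deinterleave_round(v);
    return v;
}
```

Applying three rounds to the values 1 to 24 gives 1 9 17 2 10 18 ..., the same order in which the SSE code stores d0 to d5.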

Questions

To summarize, packing 24 floats into six __m128 registers and shuffling them three times achieves my goal. I also found that packing 16 floats into four __m128 registers and shuffling them twice gives a similar result. Is there a general principle for shuffling a float array of size 4n (n = 1, 2, 3, 4, ...) in this way?

Also, can anyone help clarify the algorithm above, or point me to relevant material?
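A possible way to frame the general case, offered as a sketch rather than an established result: on N elements (N even), one even/odd round makes output position j read input position 2j mod (N-1), with the last element staying in place. After k rounds that becomes 2^k * j mod (N-1), which is the index map of transposing an (N / 2^k) x 2^k row-major matrix whenever 2^k divides N. The function names below are hypothetical and only check this arithmetic. The register version additionally needs an even number of __m128 registers so they can be paired, so a 4n-float array with odd n does not fit the pattern directly.

```cpp
#include <cassert>
#include <cstddef>

// Input position read by output position j after `rounds` even/odd rounds
// on n elements (n even, n >= 2): j -> 2^rounds * j mod (n - 1).
// The last element never moves.
std::size_t source_index(std::size_t j, std::size_t n, int rounds) {
    assert(n >= 2 && n % 2 == 0 && j < n);
    if (j == n - 1) return j;
    for (int i = 0; i < rounds; ++i) j = (2 * j) % (n - 1);
    return j;
}

// Input position read by output position j when a rows x cols row-major
// matrix is transposed and flattened.
std::size_t transpose_index(std::size_t j, std::size_t rows, std::size_t cols) {
    return (j % rows) * cols + j / rows;
}

// True when `rounds` rounds on n elements reproduce the transpose of an
// (n / 2^rounds) x 2^rounds matrix at every position.
bool rounds_equal_transpose(std::size_t n, int rounds) {
    const std::size_t cols = std::size_t{1} << rounds;
    if (n < 2 || n % 2 != 0 || n % cols != 0) return false;
    const std::size_t rows = n / cols;
    for (std::size_t j = 0; j < n; ++j)
        if (source_index(j, n, rounds) != transpose_index(j, rows, cols))
            return false;
    return true;
}
```

Under this reading, 24 floats with three rounds correspond to a 3x8 transpose, and 16 floats with two rounds correspond to a 4x4 transpose.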

asked Jan 4 at 6:28, edited Jan 4 at 7:05

Finley

It looks like what you are doing is just a 3x8 transpose?

– Paul R
Jan 4 at 8:42

@PaulR It seems to be, but how does a transpose relate to the shuffles?

– Finley
Jan 4 at 9:17

Well, in the integer domain a transpose is usually implemented using unpack instructions rather than shuffles. I haven’t had time to give this any serious thought yet, but you might be able to use a similar approach for transposing floats.

– Paul R
Jan 4 at 9:21


Note that in the 3x8 float case you only need 5 shuffles, see this question and answer. For the 4x4 case there exists a handy macro. See also here for other cases.

– wim
Jan 4 at 22:51

The 5 shuffles refer to the AVX case. With SSE you need twice as many shuffles, which is still better than 18 shuffles.

– wim
Jan 4 at 23:56
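To make the comments above concrete for the 4x4 (16-float) case, here is a minimal sketch of a transpose built from unpack and move instructions, next to the _MM_TRANSPOSE4_PS macro from <xmmintrin.h>. That this macro is the "handy macro" mentioned above is an assumption, since the original link is not preserved. The function names transpose4x4_unpack and transpose4x4_macro are illustrative only.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: unpack/move intrinsics and _MM_TRANSPOSE4_PS

// Transpose a 4x4 row-major float matrix using unpack and move instructions.
void transpose4x4_unpack(const float* in, float* out) {
    assert(in && out);
    __m128 r0 = _mm_loadu_ps(in);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);

    __m128 t0 = _mm_unpacklo_ps(r0, r1);  // r0[0] r1[0] r0[1] r1[1]
    __m128 t1 = _mm_unpacklo_ps(r2, r3);  // r2[0] r3[0] r2[1] r3[1]
    __m128 t2 = _mm_unpackhi_ps(r0, r1);  // r0[2] r1[2] r0[3] r1[3]
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  // r2[2] r3[2] r2[3] r3[3]

    _mm_storeu_ps(out,      _mm_movelh_ps(t0, t1));  // column 0
    _mm_storeu_ps(out + 4,  _mm_movehl_ps(t1, t0));  // column 1
    _mm_storeu_ps(out + 8,  _mm_movelh_ps(t2, t3));  // column 2
    _mm_storeu_ps(out + 12, _mm_movehl_ps(t3, t2));  // column 3
}

// Same transpose using the macro shipped with <xmmintrin.h>.
void transpose4x4_macro(const float* in, float* out) {
    assert(in && out);
    __m128 r0 = _mm_loadu_ps(in);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // transposes the four rows in place
    _mm_storeu_ps(out,      r0);
    _mm_storeu_ps(out + 4,  r1);
    _mm_storeu_ps(out + 8,  r2);
    _mm_storeu_ps(out + 12, r3);
}
```

For the input 1 to 16, both functions produce 1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16 using 8 shuffle-type instructions in the hand-written version, compared with the two rounds of 4 shuffles each in the 16-float approach described in the question.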


c++ sse shuffle simd avx
