Does wave / subgroup need synchronization for shared variables?

I am wondering whether, within a single wave / subgroup (warp?), we need to call memoryBarrierShared and barrier to synchronize shared variables. On NVIDIA I think it is not necessary, but I do not know about other IHVs.

EDIT : ballot

Since I am talking about wave / subgroup, I am talking about the ARB_shader_ballot extension.

Let's say we have the following code (1):

shared uint s_data[128];

uint tid = gl_GlobalInvocationID.x;

// initialization of some s_data

memoryBarrierShared();
barrier();

if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i >>= 1)
        s_data[tid] += s_data[tid + i];
}

In my opinion, this code is not correct. The correct version, according to the spec, would be (2):

if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i >>= 1) {
        s_data[tid] += s_data[tid + i];
        memoryBarrierShared();
        barrier();
    }
}

However, since invocations run in parallel within a wave / subgroup, the barrier function seems to be useless: this version (3) should be correct as well, and faster than the second:

if(tid < gl_SubGroupSizeARB) {
    for(uint i = gl_SubGroupSizeARB; i > 0; i >>= 1) {
        s_data[tid] += s_data[tid + i];
        memoryBarrierShared();
    }
}

However, since we do not need the barrier function, I wonder whether (1) is correct, although that seems unlikely to me, and if not, whether (3) is correct (which would mean that my understanding is correct).

EDIT : int to uint, and change = to +=

edited Jan 3 at 15:51

asked Jan 3 at 9:01

Antoine Morrier

2,102721

"According to me, this code is not correct." Well, what exactly is it supposed to do? I don't understand what your code is intended to accomplish. I have no idea what s_data is, what values it has, or what it is intended to eventually store. And since all versions of your code exhibit UB, it's not clear what is supposed to be happening here.

– Nicol Bolas
Jan 3 at 15:36

The idea of my code is to perform a reduction. (I wanted to write += instead of =.) s_data is only "values". What UB does my code have?

– Antoine Morrier
Jan 3 at 15:55

In every case, you have invocations reading from memory that some other invocation will write to with no barriers between them to provide ordering/visibility. Even in your case 2, an invocation where tid == 1 will write to a variable that the tid == 0 invocation reads from. That's undefined behavior, whether shader_ballot exists or not.

– Nicol Bolas
Jan 3 at 16:00

@AntoineMorrier ARB_shader_ballot must define a group size, but that is not its purpose. shader_ballot makes no guarantees about the underlying architecture beyond the fact that ballotARB works if the vendor has implemented the extension. Unrolling the last warp works because all other warps are free to do other work within a Streaming Multiprocessor (NV specific), but it also relies on undefined behavior EVEN ON NVIDIA GPUS to carry out adding values simultaneously accumulated from each warp. (cont.)

– opa
Jan 3 at 16:34

@AntoineMorrier 1) Yes, the code is UB on NVIDIA GPUs. 2) Yes, I was talking about the code in the article. There should either be a __syncwarp extension provided for GLSL, or it should be built into other primitives provided by extensions; for example, ballotARB may internally just be the __ballot_sync CUDA function on NVIDIA GPUs, which performs the ballot and syncs the warp, ensuring a safe result.

– opa
Jan 3 at 17:57


opengl glsl vulkan


1 Answer

The execution model shared by OpenGL and Vulkan with regard to compute shaders does not really recognize the concept of a "wave". It has the concept of a work group, but that is not the same thing. A work group can be much bigger than a GPU "wave", and for small work groups, multiple work groups could be executing on the same GPU "wave".

As such, these specifications make no statements about the behavior of any of their functions with regard to a "wave" (with the exception of the shader ballot functions). So if you want synchronization that the standard guarantees will work on all conforming implementations, you must call both functions as dictated by the standard.
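As a minimal sketch of what "calling both functions" can look like for the reduction in the question, the shader below sums one work group's values in shared memory. The buffer names (inData, outData), the bindings and the local size of 128 are assumptions made for illustration, and the dispatch is assumed to cover the input exactly. Unlike the question's code, it indexes shared memory with gl_LocalInvocationID, and every invocation reaches barrier() on every iteration, so the barrier stays in uniform control flow.

```glsl
#version 430

layout(local_size_x = 128) in;

// Hypothetical buffers, not taken from the question.
layout(std430, binding = 0) readonly buffer Input  { uint inData[];  };
layout(std430, binding = 1) writeonly buffer Output { uint outData[]; };

shared uint s_data[128];

void main() {
    uint lid = gl_LocalInvocationID.x;

    // Each invocation loads one value into shared memory.
    s_data[lid] = inData[gl_GlobalInvocationID.x];
    memoryBarrierShared();
    barrier();

    // Tree reduction: at each step, invocations [0, stride) read from
    // [stride, 2 * stride) and write to [0, stride), so no invocation
    // reads a slot that another one writes during the same step.
    for (uint stride = gl_WorkGroupSize.x / 2u; stride > 0u; stride >>= 1u) {
        if (lid < stride) {
            s_data[lid] += s_data[lid + stride];
        }
        // Executed by all invocations, not only those with lid < stride.
        memoryBarrierShared();
        barrier();
    }

    // One invocation writes the work group's partial sum.
    if (lid == 0u) {
        outData[gl_WorkGroupID.x] = s_data[0];
    }
}
```

This relies only on the work-group guarantees and assumes nothing about how invocations are packed into waves.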

Even with ARB_shader_ballot, the execution model of shaders is not modified. The extension only allows communication between the invocations of a subgroup, and only via the explicit mechanisms that it provides.
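To illustrate what "explicit mechanisms" means here, the following sketch sums one value per invocation across a subgroup using only ballotARB, readInvocationARB and readFirstInvocationARB, with no shared memory involved. The buffer names and bindings are hypothetical. It is deliberately the simple form: readInvocationARB requires the same index in all active invocations, so the loop walks the subgroup with a uniform index, and it uses ballotARB(true) to skip inactive invocations, whose values would be undefined.

```glsl
#version 450
#extension GL_ARB_shader_ballot : require
#extension GL_ARB_gpu_shader_int64 : require

layout(local_size_x = 128) in;

// Hypothetical buffers, not taken from the question.
layout(std430, binding = 0) readonly buffer Input { uint inData[]; };
layout(std430, binding = 1) buffer Output { uint total; };

void main() {
    uint value = inData[gl_GlobalInvocationID.x];

    // Bit i is set if invocation i of this subgroup is active.
    uvec2 activeBits = unpackUint2x32(ballotARB(true));

    uint subgroupSum = 0u;
    for (uint i = 0u; i < gl_SubGroupSizeARB; ++i) {
        uint word = (i < 32u) ? activeBits.x : activeBits.y;
        bool isActive = ((word >> (i & 31u)) & 1u) != 0u;
        if (isActive) {
            // i is the same in every active invocation, as required.
            subgroupSum += readInvocationARB(value, i);
        }
    }

    // Let the first active invocation of each subgroup publish the result.
    if (readFirstInvocationARB(gl_SubGroupInvocationARB) == gl_SubGroupInvocationARB) {
        atomicAdd(total, subgroupSum);
    }
}
```

The data moves between invocations only through the extension's built-ins, so nothing here depends on shared-memory writes becoming visible inside a wave without a barrier.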

The execution model and memory model of shader invocations is that they are unordered with respect to each other, unless you explicitly order them with barriers.

edited Jan 3 at 16:02

answered Jan 3 at 14:21

Nicol Bolas

290k34481657

I am talking about the shader ballot functions :). I am talking about it because I want to optimize my code by using this extension :)

– Antoine Morrier
Jan 3 at 14:25

@AntoineMorrier: No, you aren't. You mentioned shared variables, barrier and memoryBarrierShared. Nowhere did you bring up shader ballot stuff. So you should fix your question to ask about what you wanted to know about, preferably with some source code.

– Nicol Bolas
Jan 3 at 14:26

I edited the question and added some source code :)

– Antoine Morrier
Jan 3 at 14:52
