How to improve performance of pushing data to a mutex locked queue
I have a queue of "jobs" (function pointers and data) pushed onto it from a main thread, which then notifies worker threads to pop the data off and run it.
The functions are pretty basic and look like this:
    #include <condition_variable>
    #include <list>
    #include <mutex>

    class JobQueue {
    public:
        // usually called by the main thread, but other threads can use this too
        void push(Job job) {
            {
                std::lock_guard<std::mutex> lock(mutex); // this takes 40% of the thread's time (when NOT syncing)
                ready = true;
                queue.emplace_back(job);
            }
            cv.notify_one(); // this takes another 40% of the thread's time
        }

        // only called by worker threads
        Job pop() {
            std::unique_lock<std::mutex> lock(mutex);
            cv.wait(lock, [&]{ return ready; });
            Job job = queue.front();
            queue.pop_front();
            return job;
        }

    private:
        std::list<Job> queue;
        std::mutex mutex;
        std::condition_variable cv;
        bool ready = false;
    };
But I have a major problem: push() is really slow. The worker threads outpace the main thread, even though in my test adding jobs is all the main thread does. (The worker threads perform 20 4x4 matrix rotations that feed into each other and get printed at the end, so they're not optimized away.) This seems to get worse as the number of worker threads grows. If each Job is bigger, say 100 matrix operations, the penalty goes away and more threads means better throughput, but the jobs I would give it in practice are much smaller than that.
The hottest calls are the mutex lock and notify_one(), which take up about 40% of the time each; everything else seems negligible. Also, the mutex is rarely contended: it is nearly always available when push() locks it.
I'm not sure what to do here. Is there an obvious or not-so-obvious optimization that would help, or have I made a mistake? Any insight would be greatly appreciated.
(Here are some metrics I took, in case they help. They don't count the time it takes to create threads, and the pattern is the same even for billions of jobs.)
Time to calculate 2,000,000 matrix rotations (20 rotations x 100,000 jobs):

    threads  0: 149 ms  << no-pool baseline
    threads  1: 151 ms  << single-threaded w/pool
    threads  2:  89 ms
    threads  3: 120 ms
    threads  4: 216 ms
    threads  8: 269 ms
    threads 12: 311 ms  << hardware hint
    threads 16: 329 ms
    threads 24: 332 ms
    threads 96: 336 ms


(all worker threads have the same pattern, green is execution, red is waiting on synchronization)
c++ multithreading optimization
asked Dec 31 '18 at 20:37
Anne Quinn
2
Batch up the jobs. Instead of adding one job at a time, add a whole bunch at a time. Use per-worker job queues, and have the main thread, that generates each job, add it to a per-worker job queue. There are many other variations possible, it all depends on the individual circumstances.
– Sam Varshavchik
Dec 31 '18 at 20:40
1
Condition variables are expensive when there is contention. Don't wait if ready is true when you grab the lock in pop().
– Chad
Dec 31 '18 at 20:42
@Chad - Oh, I thought it checked the predicate before trying to wait. I tried adding another check around it but there wasn't any improvement unfortunately.
– Anne Quinn
Dec 31 '18 at 20:49
@SamVarshavchik - I'll try to add a per-worker job queue. I initially avoided it because it makes joining harder, but it might be worth it in this case. Batching too
– Anne Quinn
Dec 31 '18 at 20:52
2
1) Is there any reason why you have to scale up to 96 threads? Why not use a thread pool with the same number of threads as you have cores available? 2) How many milliseconds would you expect a job to take? If the jobs are pretty short, it would be better to use lock-free queues than heavyweight mutex/cv synchronization.
– Humphrey Winnebago
Dec 31 '18 at 21:41
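A rough sketch of the direction the comments above point to (per-worker queues, lock-free instead of mutex/cv). It assumes each worker gets its own ring buffer that only the main thread pushes to and only that worker pops from; `SpscRing`, `try_push`, and `try_pop` are names made up for illustration, not anything from the question. The ring holds at most `Capacity - 1` items, and a real pool would still need a policy for what a worker does when its ring is empty (spin, yield, or sleep).

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical single-producer/single-consumer ring buffer.
// Correct only if exactly one thread calls try_push and exactly
// one (other) thread calls try_pop.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    // Producer side: returns false when the ring is full.
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) {
            return false;  // full: one slot is always left unused
        }
        slots_[head] = value;
        // Publish the slot to the consumer.
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: returns false when the ring is empty.
    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = slots_[tail];
        // Hand the slot back to the producer.
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

With one such ring per worker, the main thread could hand out jobs round-robin, so no push ever takes a lock or calls notify_one(). Whether that beats the mutex version for jobs this small would have to be measured.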
2 Answers
TL;DR: Do more work in each task. (Perhaps take more than one task off the queue each time, but there are many other possibilities.)
Your tasks are (computationally) too small. A 4x4 matrix multiplication is just a few multiplies and adds, ~60-70 operations. Doing 20 of them together isn't much more expensive: ~1500 (pipelined) arithmetic operations. The cost of the thread switch, including waking a thread waiting on the cv and then the actual context switch, is likely higher than this, possibly much higher.
The synchronization itself (manipulating the mutex and the cv) is also expensive, especially under contention, and especially on a multi-core system, where the hardware's native synchronization operations cost far more than arithmetic because of cache-coherency enforcement between the cores.
This is why you observe that the problem lessens when each task does 100 of these matrix operations instead of 20. With only 20 matrix multiplications per task, the workers were going back to the well for more work too often, causing contention. Giving them 100 slows them down enough that contention is reduced.
(In a comment you indicate that there is only one supplier, which pretty much eliminates the supplier side as a source of contention on the queue. But even there, the more tasks that can be enqueued together while holding the lock the better, up to the point where holding it blocks workers from taking tasks.)
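The batching idea above can be sketched roughly as follows. This is a minimal illustration, not the asker's class: `BatchQueue`, `push_batch`, and `pop_batch` are hypothetical names, `T` stands in for the question's `Job`, and shutdown handling is left out. The point is that the lock is taken and the cv is signalled once per batch rather than once per job.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <iterator>
#include <mutex>
#include <vector>

// Hypothetical batched queue: one lock acquisition and one
// notification per batch instead of one per job.
template <typename T>
class BatchQueue {
public:
    // Producer: enqueue a whole batch under a single lock.
    void push_batch(std::vector<T> batch) {
        if (batch.empty()) return;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.insert(queue_.end(),
                          std::make_move_iterator(batch.begin()),
                          std::make_move_iterator(batch.end()));
        }
        // Several workers may each take a share of the batch.
        cv_.notify_all();
    }

    // Consumer: block until work exists, then take up to max_jobs
    // in one go. The predicate checks the queue itself, so a
    // worker never proceeds on an empty queue.
    std::vector<T> pop_batch(std::size_t max_jobs) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        const std::size_t n = std::min(max_jobs, queue_.size());
        const auto last = queue_.begin() + static_cast<std::ptrdiff_t>(n);
        std::vector<T> out(std::make_move_iterator(queue_.begin()),
                           std::make_move_iterator(last));
        queue_.erase(queue_.begin(), last);
        return out;
    }

private:
    std::deque<T> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
};
```

A worker would then loop over the returned vector and run each job before going back to the queue. The right batch size depends on the job cost and would need measuring; too large a batch can leave other workers idle.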
edited Dec 31 '18 at 22:58
answered Dec 31 '18 at 22:10
davidbak
I suggest using an event handler.
The events are of two types:
- New job arrives
- Worker completes job
The main thread maintains a job queue that is accessed only by the main thread (so no mutex locking).
When a job arrives, it is placed on the job queue.
When a worker completes a job, the next job is popped and passed to that worker.
You will also need a queue of free workers, used at startup and whenever no jobs are available.
You will also need an event handler. These are tricky, so it is best to use a well-tested library rather than rolling your own. I use boost::asio.
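The bookkeeping described above might look something like this. It is only a sketch of the two-queue logic, run entirely on the main thread: `Dispatcher`, `on_job_arrived`, and `on_worker_done` are hypothetical names, jobs and workers are plain ints, and the actual hand-off to a worker thread (the part the event library would handle) is replaced by recording the assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical dispatcher state, touched only by the main thread,
// so none of it needs a mutex.
class Dispatcher {
public:
    explicit Dispatcher(int worker_count) {
        // At startup every worker is free.
        for (int w = 0; w < worker_count; ++w) free_workers_.push_back(w);
    }

    // Event: a new job arrives.
    void on_job_arrived(int job) {
        if (!free_workers_.empty()) {
            const int worker = free_workers_.front();
            free_workers_.pop_front();
            assign(worker, job);
        } else {
            pending_jobs_.push_back(job);  // all workers busy: queue it
        }
    }

    // Event: a worker reports that it finished its job.
    void on_worker_done(int worker) {
        if (!pending_jobs_.empty()) {
            const int job = pending_jobs_.front();
            pending_jobs_.pop_front();
            assign(worker, job);
        } else {
            free_workers_.push_back(worker);  // nothing to do: park it
        }
    }

    std::size_t pending() const { return pending_jobs_.size(); }
    std::size_t free_workers() const { return free_workers_.size(); }
    const std::vector<std::pair<int, int>>& assignments() const {
        return assignments_;
    }

private:
    // A real version would pass the job to the worker's thread here;
    // this sketch only records (worker, job).
    void assign(int worker, int job) { assignments_.emplace_back(worker, job); }

    std::deque<int> pending_jobs_;
    std::deque<int> free_workers_;
    std::vector<std::pair<int, int>> assignments_;
};
```

Note that the "worker completes job" event still has to travel from the worker thread back to the main thread, so some synchronization remains; it just moves out of the job queue and into the event mechanism.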
answered Dec 31 '18 at 21:09
ravenspoint