How to improve performance of pushing data to a mutex locked queue

I have a queue of "jobs" (function pointers and data) pushed onto it from a main thread, which then notifies worker threads to pop the data off and run it.

The functions are pretty basic and look like this:

#include <condition_variable>
#include <list>
#include <mutex>

class JobQueue {
public:
    // usually called by the main thread, but other threads can use this too
    void push(Job job) {
        {
            std::lock_guard<std::mutex> lock(mutex);   // this takes 40% of the thread's time (when NOT sync'ing)
            ready = true;
            queue.emplace_back(job);
        }
        cv.notify_one();   // this also takes another 40% of the thread's time
    }

    // only called by worker threads
    Job pop() {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [&]{ return ready; });
        Job job = queue.front();   // oldest job first
        queue.pop_front();
        ready = !queue.empty();    // stay "ready" only while jobs remain
        return job;
    }

private:
    std::list<Job>            queue;
    std::mutex                mutex;
    std::condition_variable   cv;
    bool                      ready = false;
};

But I have a major problem: push() is really slow. The worker threads outpace the main thread, which in my test does nothing but add jobs. (Each job performs 20 4x4 matrix rotations that feed into each other and get printed at the end, so they're not optimized away.) The problem also seems to get worse as the number of worker threads increases. If each Job is bigger, say 100 matrix operations, the penalty goes away and more threads == better, but the Jobs I would give it in practice are much smaller than that.

The hottest calls are the mutex lock and notify_one(), which take up about 40% of the time each; everything else seems negligible. Also, the mutex lock is rarely waiting; it is nearly always available.

I'm not sure what I should do here. Is there an obvious or not-so-obvious optimization I can make that will help, or have I perhaps made a mistake? Any insight would be greatly appreciated.

(Here are some metrics I took, in case they help. They don't count the time it takes to create threads, and the pattern is the same even for billions of jobs.)

Time to calc 2000000 matrix rotations
(20 rotations x 100000 jobs)

threads   0:       149 ms  << no-pool baseline
threads   1:       151 ms  << single threaded w/pool
threads   2:        89 ms
threads   3:       120 ms
threads   4:       216 ms
threads   8:       269 ms
threads  12:       311 ms  << hardware hint
threads  16:       329 ms
threads  24:       332 ms
threads  96:       336 ms

[two images, not reproduced here]
(all worker threads have the same pattern: green is execution, red is waiting on synchronization)

asked Dec 31 '18 at 20:37 by Anne Quinn

2

Batch up the jobs. Instead of adding one job at a time, add a whole bunch at a time. Use per-worker job queues, and have the main thread, that generates each job, add it to a per-worker job queue. There are many other variations possible, it all depends on the individual circumstances.

– Sam Varshavchik
Dec 31 '18 at 20:40

1

Condition variables are expensive when there is contention. Don't wait if ready is true when you grab the lock in pop().

– Chad
Dec 31 '18 at 20:42

@Chad - Oh, I thought it checked the predicate before trying to wait. I tried adding another check around it but there wasn't any improvement unfortunately.

– Anne Quinn
Dec 31 '18 at 20:49
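
For reference, the predicate overload of std::condition_variable::wait is specified to behave like `while (!pred()) wait(lock);`, so it does test the predicate before blocking, which would explain why the extra check made no difference. A small sketch that checks this; the function name is made up for illustration:

```cpp
#include <condition_variable>
#include <mutex>

// Returns true if wait(lock, pred) came back after evaluating the predicate
// exactly once. Nobody ever notifies here, so returning at all shows that the
// predicate was tested before the thread would have blocked.
bool wait_returns_immediately_when_ready() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = true;          // already set before we start waiting
    int predicate_calls = 0;

    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { ++predicate_calls; return ready; });
    return predicate_calls == 1;
}
```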

@SamVarshavchik - I'll try to add a per-worker job queue. I initially avoided it because it makes joining harder, but it might be worth it in this case. Batching too

– Anne Quinn
Dec 31 '18 at 20:52

2

1) Is there any reason why you have to scale up to 96 threads? Why not use a threadpool with same number of threads as you have cores available? 2) How many milliseconds would you expect a job to take? If the jobs are pretty short, it would be better to go into lockfree queues than to use heavy-weight mutex/cv synchronization.

– Humphrey Winnebago
Dec 31 '18 at 21:41
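
The per-worker queue idea from the comments could look roughly like the sketch below. It is only an illustration under assumptions: `Job` is taken to be a `std::function<void()>` (the question does not show its definition), jobs are dealt out round-robin, and `WorkerQueue`, `close()` and `run_per_worker_demo` are hypothetical names. Each queue is shared by the producer and exactly one worker, so the workers never compete with each other for a mutex.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Job = std::function<void()>;   // assumption: the question does not show Job

// One instance per worker thread.
class WorkerQueue {
public:
    void push(Job job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(job));
        }
        cv_.notify_one();
    }
    // Tells the worker that no more jobs will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_one();
    }
    // Blocks until a job is available; returns false once closed and empty.
    bool pop(Job& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !jobs_.empty() || closed_; });
        if (jobs_.empty()) return false;
        out = std::move(jobs_.front());
        jobs_.pop_front();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<Job> jobs_;
    bool closed_ = false;
};

// Runs `total` trivial jobs on `workers` threads; returns how many jobs ran.
std::size_t run_per_worker_demo(std::size_t total, std::size_t workers) {
    if (workers == 0) return 0;
    std::vector<std::unique_ptr<WorkerQueue>> queues;
    for (std::size_t w = 0; w < workers; ++w)
        queues.push_back(std::make_unique<WorkerQueue>());

    std::atomic<std::size_t> done{0};
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&queues, w] {
            Job job;
            while (queues[w]->pop(job)) job();   // drain until closed
        });
    }
    for (std::size_t i = 0; i < total; ++i)      // round-robin distribution
        queues[i % workers]->push([&done] { ++done; });

    for (auto& q : queues) q->close();           // joining: close, then join
    for (std::thread& t : pool) t.join();
    return done.load();
}
```

Closing every queue before joining is one way to handle the "joining is harder" concern: each worker drains its own queue and then exits. Round-robin can leave some workers idle if job sizes vary a lot, so this is a starting point rather than a tuned design.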

c++ multithreading optimization


2 Answers

TL;DR: Do more work in each task. (Perhaps take more than one current task off the queue each time, but there are many other possibilities.)

Your tasks are (computationally) too small. A 4x4 matrix multiplication is just a few multiplies and adds: a naive implementation needs 64 multiplies and 48 adds, or about 64 fused multiply-adds. Twenty of them done together isn't much more expensive: roughly 1,300 fused multiply-adds, or about 2,200 separate (pipelined) arithmetic operations. The cost of the thread switch, including waking a thread waiting on the cv and then the actual context switch, is likely higher than this, possibly much higher.

Also, the synchronization itself (manipulating the mutex and the cv) is very expensive, especially under contention and especially on a multi-core system, where the hardware's native synchronization operations cost much more than arithmetic (because of cache-coherency enforcement between the cores).

This is why you observe that the problem lessens when each task does 100 of these matrix operations instead of 20: with only 20 MMs to do, the workers were going back to the well for more work too often, causing contention. Giving them 100 to do slows them down enough that contention is reduced.

(In a comment you indicate that there is only one supplier, which pretty much eliminates the supplier side as a source of contention on the queue. But even there, the more tasks that can be enqueued together while holding the queue's mutex, the better, up to the limit where it blocks workers from taking tasks.)
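
One way to "do more work per task" without changing the jobs themselves is to move them in batches. The following is a minimal sketch under assumptions, not a drop-in replacement for the question's class: `Job` is assumed to be a `std::function<void()>`, and `BatchedJobQueue`, `push_batch`, `pop_batch`, `close` and `run_batched_demo` are hypothetical names. The point is that one lock acquisition and one notification are amortized over many jobs.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Job = std::function<void()>;   // assumption: the question does not show Job

class BatchedJobQueue {
public:
    // Enqueue a whole batch under a single lock acquisition.
    void push_batch(std::vector<Job> batch) {
        if (batch.empty()) return;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            for (Job& job : batch) queue_.push_back(std::move(job));
        }
        cv_.notify_all();   // several workers may be able to take a share
    }
    // Signal that no more jobs will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Take up to max_jobs (must be > 0) at once.
    // An empty result means the queue is closed and drained.
    std::vector<Job> pop_batch(std::size_t max_jobs) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !queue_.empty() || closed_; });
        std::vector<Job> out;
        while (!queue_.empty() && out.size() < max_jobs) {
            out.push_back(std::move(queue_.front()));
            queue_.pop_front();
        }
        return out;
    }
private:
    std::deque<Job> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool closed_ = false;
};

// Runs `total` trivial jobs on `workers` threads, moving them in batches of
// `batch_size` (must be > 0); returns how many jobs actually ran.
std::size_t run_batched_demo(std::size_t total, std::size_t workers,
                             std::size_t batch_size) {
    BatchedJobQueue queue;
    std::atomic<std::size_t> done{0};
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&queue, batch_size] {
            for (;;) {
                std::vector<Job> jobs = queue.pop_batch(batch_size);
                if (jobs.empty()) break;          // closed and drained
                for (Job& job : jobs) job();      // run outside the lock
            }
        });
    }
    std::vector<Job> batch;
    for (std::size_t i = 0; i < total; ++i) {
        batch.push_back([&done] { ++done; });
        if (batch.size() == batch_size) {
            queue.push_batch(std::move(batch));
            batch.clear();                        // reset the moved-from vector
        }
    }
    queue.push_batch(std::move(batch));           // flush the remainder
    queue.close();
    for (std::thread& t : pool) t.join();
    return done.load();
}
```

The batch size is a tuning knob: too small and the locking cost dominates again, too large and one worker may hold jobs that idle workers could have run. Measuring with the real job sizes is the only reliable way to pick it.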

answered Dec 31 '18 at 22:10 by davidbak (edited Dec 31 '18 at 22:58)

I suggest using an event handler.

The events are of two types:

1) New job arrives
2) Worker completes job

The main thread maintains a job queue that is accessed only by the main thread (so no mutex locking is needed on it).

When a job arrives, it is placed on the job queue. When a worker completes a job, the next job is popped and passed to that worker.

You will also need a free-worker queue, for startup and for when no jobs are available.

You will also need an event handler. These are tricky, so it is best to use a well-tested library rather than rolling your own. I use boost::asio.
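
To make the structure concrete, here is a rough standard-library-only sketch of the dispatcher described above. It does not use boost::asio; the `Channel` class is a hypothetical, mutex-based stand-in for the event library, and `Event`, `run_dispatcher_demo` and the `Job` alias are assumptions for illustration. It shows the shape of the design (a job queue and a free-worker queue touched by one thread only), not a performance claim.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Job = std::function<void()>;   // assumption: the question does not show Job

// Hypothetical blocking channel, standing in for a real event library.
template <typename T>
class Channel {
public:
    void send(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(value));
        }
        cv_.notify_one();
    }
    T receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop_front();
        return value;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<T> items_;
};

struct Event {
    enum class Kind { JobArrived, WorkerFree, NoMoreJobs } kind;
    Job job;              // used by JobArrived
    std::size_t worker;   // used by WorkerFree
};

// Runs `total` trivial jobs on `workers` threads; returns how many jobs ran.
std::size_t run_dispatcher_demo(std::size_t total, std::size_t workers) {
    Channel<Event> events;
    std::vector<std::unique_ptr<Channel<Job>>> mailboxes;   // one per worker
    for (std::size_t w = 0; w < workers; ++w)
        mailboxes.push_back(std::make_unique<Channel<Job>>());

    std::atomic<std::size_t> done{0};
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&events, &mailboxes, w] {
            for (;;) {
                events.send(Event{Event::Kind::WorkerFree, nullptr, w});
                Job job = mailboxes[w]->receive();
                if (!job) break;   // an empty Job means "shut down"
                job();
            }
        });
    }

    // Producer side; in this demo it is simply the calling thread.
    for (std::size_t i = 0; i < total; ++i)
        events.send(Event{Event::Kind::JobArrived, [&done] { ++done; }, 0});
    events.send(Event{Event::Kind::NoMoreJobs, nullptr, 0});

    // Event loop: pending and free_workers are touched by this thread only.
    std::deque<Job> pending;
    std::deque<std::size_t> free_workers;
    bool closing = false;
    std::size_t stopped = 0;
    while (stopped < workers) {
        Event e = events.receive();
        if (e.kind == Event::Kind::JobArrived) pending.push_back(std::move(e.job));
        else if (e.kind == Event::Kind::WorkerFree) free_workers.push_back(e.worker);
        else closing = true;

        while (!free_workers.empty() && (!pending.empty() || closing)) {
            std::size_t w = free_workers.front();
            free_workers.pop_front();
            if (!pending.empty()) {
                mailboxes[w]->send(std::move(pending.front()));
                pending.pop_front();
            } else {
                mailboxes[w]->send(Job{});   // nothing left: tell the worker to stop
                ++stopped;
            }
        }
    }
    for (std::thread& t : pool) t.join();
    return done.load();
}
```

Note that the stand-in channel still locks internally, as an event library typically would, so each hand-off is not free. What the design removes is contention between workers on one shared job queue.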

answered Dec 31 '18 at 21:09 by ravenspoint
