Best way to smooth/filter a stream of data
I have a stream of integer data (data coming in over time) that looks something like this.
[46, 46, 46, 47, 47, 47, 47, 47, 47, 46, 47, 47, 47, 47, 47, 47, 46, 47, 47, 47, 47, 47, 47, 47, 46, 47, 46, 46, 46, 46, 46, 46, 46, 100, 100, 100, 70, 100, 100]
Basically, it's a stream of integers where the most recent integer is displayed on a screen. Sometimes noise (a "bad" integer) arrives for a short time, causing the displayed value to flicker. I'd like to avoid this flickering by not displaying the noise.
I want to filter out numbers that occur only ~1-2 times in a row, since they're generally noise. When a new integer comes in, it could be noise or not, depending on whether the next 1-2 integers are the same.
A little latency is OK. I'd like to avoid averaging the data: if the value jumps from 40 to 100, it should not ramp up to 100 (i.e. 40, 60, 80, 100) but show 100 as soon as it is deemed non-noise.
What's the best technique for smoothing/filtering this type of data?
Tags: filter, stream, smoothing
thanks, I added a little more information. While the data is a stream, it's ok for some latency.
– user1218464, Dec 31 '18 at 23:47
are you referring to java.util.stream?
– Aomine, Dec 31 '18 at 23:51
no, just a general stream. Meaning, the data is coming in over time.
– user1218464, Dec 31 '18 at 23:52
Mind showing us an input-output example?
– Cardinal System, Dec 31 '18 at 23:52
I would compare to the last value and include it if it is the same.
– Peter Lawrey, Dec 31 '18 at 23:57
asked Dec 31 '18 at 23:32 by user1218464, edited Dec 31 '18 at 23:57
1 Answer
First of all, I am not a signal processing expert, but I have some basic experience (let's call it that). I have also seen your other (possibly related) questions about tuners and musical notes, so I'll assume this is related to that problem.
Short answer:
You can use local short-buffer filtering with the median or mode. In the simplest form, given what you've described, you can "safely" eliminate values that fall outside a pre-defined noise threshold and occur only once in the current buffer.
Long answer:
In statistics, besides the standard average (mean) and median, there is also the mode: the value that occurs most often in the data.
For the entire 39-value array you gave as an example in your question:
main_array_values_by_occurrence = [46 => 13 times, 47 => 20 times, 70 => 1 time, 100 => 5 times]
average = 54.05 => 54 rounded
median = 47
mode = 47
The median and mode can help a lot in determining which value should be taken as the representative one. In essence, the mode is similar to ordering values by occurrence and then by ascending value, but it is not always clear which value to pick in multi-modal distributions.
You can see that the average is not great in cases where you have occasional, particularly large, spikes.
The median sounds attractive as the basis for the comparison algorithm, but should you use the median or the mode? It really depends on the type of data you are working with and its distribution. For example, if we are detecting the pitch of an instrument (musical notes) and our detector has decent accuracy, the mode sounds like a good choice, because the median might be misleading: if our tuner detected the note A4 60% of the time and other notes 40% of the time, then "on average" it detected A4, because that tone was dominant in the data stream.
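To make the three statistics concrete, here is a minimal Java sketch (my own illustration, not code from the original answer) that computes the mean, median and mode of a small buffer; ties in the mode are resolved toward the smallest value, matching the Excel-like behaviour mentioned further down.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class BufferStats {
    static double mean(int[] buf) {
        long sum = 0;
        for (int v : buf) sum += v;
        return (double) sum / buf.length;
    }

    static int median(int[] buf) {
        int[] sorted = buf.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];   // middle element (upper one for even lengths)
    }

    // Mode: most frequent value; ties resolve toward the smallest value
    // because TreeMap iterates keys in ascending order and we use strict '>'.
    static int mode(int[] buf) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int v : buf) counts.merge(v, 1, Integer::sum);
        int best = buf[0], bestCount = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

For the 39-value example array this gives mean ≈ 54.05, median 47 and mode 47, matching the numbers above.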
First, capture the stream in groups of, say, 5 consecutive values (an arbitrary buffer size or "scan" interval: a smaller buffer means less effectiveness, a larger buffer means less resolution and longer delays) and store them in an array:
array_1 = [46 (first in) 46 46 47 47 (last in)]
You can also group them (count by occurrence) if you expect the data to be mostly constant. This gives you a starting point for analysing the captured group.
array_1_occur = [46 => 3, 47 => 2]
Then you can use a simple comparison to determine the relative difference of each value from the group's reference (for example its first value, or the previous group's mode); you can store those values and their +/- differences in a separate array.
array_1_diff = [0 0 0 +1 +1]
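Here is that diff computation as a small Java sketch (my naming; using the group's first value as the reference is an assumption for illustration):

```java
import java.util.Arrays;

final class GroupDiffs {
    // Differences of each sample from a reference value, e.g. reference 46
    // and buffer [46, 46, 46, 47, 47] give [0, 0, 0, +1, +1].
    static int[] diffsFromReference(int[] buf, int reference) {
        int[] d = new int[buf.length];
        for (int i = 0; i < buf.length; i++) {
            d[i] = buf[i] - reference;
        }
        return d;
    }

    public static void main(String[] args) {
        int[] group = {46, 46, 46, 47, 47};
        System.out.println(Arrays.toString(diffsFromReference(group, group[0])));
        // -> [0, 0, 0, 1, 1]
    }
}
```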
The simplest filtering is easy at this stage: all values that exceed the threshold and have a low occurrence (fewer than 2, for example) can be eliminated instantly. If you wish to make it more robust, you have to consider the next buffer batch as well, and eliminate values only if they are truly single instances in the local group (including neighbours).
You need to define a threshold for your "noise"; call it the "noise threshold" or tolerance for simplicity. Say a value that occurs only once jumps more than +10 from the previous buffer's median or mode: it is then "marked" as noise and analysed further.
array_2 = [46 46 90 47 47]
array_2_min = 46
array_2_max = 90
array_2_avg = (46+46+90+47+47) / 5 = 55.2
array_2_local_avg_ref = round(55.2) = 55  <-- useless in this case, because the median (47) is very different from the average, so we discard the average from the analysis and mark the value 90 as the prime suspect
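Here is one way that marking step could look in code (a sketch; the method name, the threshold of 10, and using the previous buffer's median as the reference are my assumptions):

```java
import java.util.ArrayList;
import java.util.List;

final class NoiseMarker {
    // Assumed tolerance: how far a value may sit from the reference
    // before it is treated as a potential spike. Tune for your data.
    static final int NOISE_THRESHOLD = 10;

    // Values that occur exactly once in the buffer AND sit further than
    // NOISE_THRESHOLD from the previous buffer's median are suspects.
    static List<Integer> suspects(int[] buf, int previousMedian) {
        List<Integer> marked = new ArrayList<>();
        for (int v : buf) {
            int occurrences = 0;
            for (int w : buf) {
                if (w == v) occurrences++;
            }
            if (occurrences == 1 && Math.abs(v - previousMedian) > NOISE_THRESHOLD) {
                marked.add(v);  // e.g. 90 in [46, 46, 90, 47, 47] vs. median 47
            }
        }
        return marked;
    }
}
```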
In the case of the array_1 example:
array_1_avg = 46 (46.4 rounded)
array_1_med = 46 (order the values from lowest to highest and take the middle one as the median)
array_1_mod = 46 (I'm not 100% sure about bi-modal or multi-modal cases; Excel, at least, returns 46, possibly the lowest integer with the highest occurrence in the ordered list. You can customize that: in multi-modal or no-mode distributions, fall back to the median.)
Then you have to decide whether to drop it or keep it.
How? Well, if in an array of 5 values all values are roughly the same (within the tolerance) except that single data point, it is obviously a noise candidate; this can be determined by comparison against the median or mode. The worst-case scenario is when the group contains 5 different values; in that case, the value closest to the previous captured group's value is taken as the reference point. But what if the next array of 5 contains values in roughly the same range as the noise candidate? For this reason, we need to keep the candidate until the next 5 values come in for comparison and analysis. If the candidate is still far off the "norm", we can confirm that it is noise or a singular random event (insignificant). If, on the other hand, we find at least one value near it (remember the threshold or step), then we keep it. A sketch of this keep-or-drop logic follows below.
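Putting the keep-or-drop rule into a runnable shape, here is one possible sketch (my own simplification of the idea, not the answerer's implementation): a new value is held as a candidate and only displayed once enough consecutive samples confirm it, which gives exactly the flicker-free behaviour the question asks for.

```java
// Streaming flicker filter: the displayed value changes only after a new
// value has been seen CONFIRMATIONS times in a row. Single or double
// outliers (e.g. the lone 70 among the 100s) never reach the display.
final class StableValueFilter {
    // Assumed: "the next 1-2 integers are the same" => 3 identical samples total.
    private static final int CONFIRMATIONS = 3;

    private int displayed;       // last confirmed (shown) value
    private int candidate;       // possible new value, not yet confirmed
    private int candidateCount;  // consecutive samples supporting the candidate
    private boolean hasValue;

    /** Feed one sample; returns the value that should currently be shown. */
    int onSample(int sample) {
        if (!hasValue) {                    // very first sample: accept as-is
            displayed = sample;
            hasValue = true;
        } else if (sample == displayed) {   // agrees with current value
            candidateCount = 0;             // any pending candidate was noise
        } else if (candidateCount > 0 && sample == candidate) {
            candidateCount++;
            if (candidateCount >= CONFIRMATIONS) {
                displayed = candidate;      // confirmed jump, e.g. 46 -> 100
                candidateCount = 0;
            }
        } else {
            candidate = sample;             // new suspect: start counting
            candidateCount = 1;
        }
        return displayed;
    }
}
```

Running the question's sample stream through this filter, the lone 46s inside the 47-runs and the single 70 are suppressed, while the jump to 100 is shown after three consecutive 100s, with no intermediate averaged values.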
You will have a delay in the output stream of at least N × buffer_length data points (samples): N = 2 for the basic case (past + present), N = 3 with look-ahead (past + present + future), or even more.
Some simple observations: if the step is just +/- 1, then any value that jumps from the "norm" will be considered noise. The problem is: what if the values start fluctuating a lot, say every consecutive sample moves by +/- 1 (an oscillation event)? In that case you will struggle to determine the "norm", because you are working with integers only. This can be partly solved by keeping a running average of, say, the last 15-50-100 samples and falling back to the last stable value (no change).
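For that slowly-adapting "norm", one option (my suggestion; the answer only says to keep an average of the last 15-50-100 samples) is an exponential moving average that is consulted for the noise decision but never shown on screen, so the no-averaging requirement on the displayed value still holds:

```java
// Slowly-adapting reference ("norm") used only to judge what counts as
// a jump; it is never displayed directly.
final class NormTracker {
    // Assumed smoothing factor; effective memory is roughly 1/ALPHA samples,
    // so 0.02 corresponds to approximately the last 50 samples.
    private static final double ALPHA = 0.02;

    private double norm;
    private boolean seeded;

    void update(int sample) {
        if (!seeded) {
            norm = sample;      // seed with the first sample
            seeded = true;
        } else {
            norm += ALPHA * (sample - norm);
        }
    }

    int currentNorm() {
        return (int) Math.round(norm);
    }
}
```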
"Ideally" you should keep 3* buffers: previous current and future one for analysis. Previous one is required for reference for the current and "future" one is required to improve transients (jumps to new values in the semi-steady stream).
Actually, you don't need 3 (or N) buffers. One buffer is just fine, but you can internally splice it to sections (say, buffer of 15 samples internally divided in 3 sections), which will greatly simplify processing (this part might not be obvious, so I've mentioned it).
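As an illustration of that single-buffer layout (a sketch; class and method names are mine), one physical buffer can expose past/present/future views:

```java
import java.util.Arrays;

// One physical buffer treated as three logical 5-sample sections:
// [past | present | future]. Each new sample shifts everything left by one.
final class SectionedBuffer {
    private static final int SECTION = 5;   // assumed section size
    private final int[] buf = new int[3 * SECTION];
    private int filled;

    void push(int sample) {
        if (filled < buf.length) {
            buf[filled++] = sample;          // still filling up initially
        } else {
            System.arraycopy(buf, 1, buf, 0, buf.length - 1);
            buf[buf.length - 1] = sample;    // drop oldest, append newest
        }
    }

    int[] past()    { return Arrays.copyOfRange(buf, 0, SECTION); }
    int[] present() { return Arrays.copyOfRange(buf, SECTION, 2 * SECTION); }
    int[] future()  { return Arrays.copyOfRange(buf, 2 * SECTION, 3 * SECTION); }
}
```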
In any case, you need to analyse your samples, figure out their distribution, then build the algorithm, experiment with your actual data stream, and tweak it. Inevitably you will lose some frequency or spatial resolution (depending on the type of signal), determined by the length of the local groups (buffers).
Of course, using conditions like checking the distance between the current sample and the median/mode and removing noisy samples is a bit of a cheat, since it's not a "real" noise-reduction algorithm like a convolution filter. It's a shortcut that fits this specific case.
You will never eliminate the fluctuation completely unless you use a very large buffer together with a learning or slowly-adapting local-group algorithm that minimizes the fluctuation (noise) while preserving transients as much as possible, which you explicitly asked for. With "live" signals/streams where delay is critical, you have to keep buffering minimal, which rules out "ideal" filtering.
answered Jan 2 at 1:37 by dev101, edited Jan 3 at 3:24
Thanks for the great response. It'll take me a little time to digest, but this is definitely helpful.
– user1218464, Jan 3 at 19:28