Calculating confidence interval for ratio of sums
I have a problem where I need to calculate a confidence interval for a ratio of two sums. The code below calculates this statistic using a bootstrapping method, but I would much prefer a closed-form formula, since this runs inside an R Shiny app and would react much faster.
Below is code that computes a bootstrapped confidence interval for sum(Sepal.Length)/sum(Sepal.Width) by Species using the iris dataset in R.
Any ideas on a closed-form solution to this problem? I feel like I am missing something obvious.
Set up data
set.seed(1996+01+02)
library(data.table)
iris <- as.data.table(iris)
Set up a bootstrap function
bootstrap_func <- function(x) {
# Resample data with replacement
iris_boot <- iris[, .SD[sample(1:.N, size=.N, replace=TRUE)], keyby=.(Species)]
# Summarize
iris_boot_result <- iris_boot[, .(Sepal.Length=sum(Sepal.Length), Sepal.Width=sum(Sepal.Width)), keyby=.(Species)]
iris_boot_result[, Ratio := Sepal.Length/Sepal.Width]
iris_boot_result[, Sample := x]
return(iris_boot_result)
}
Replicate bootstrap
rep_samples <- rbindlist(lapply(1:1000, bootstrap_func), fill=TRUE)
Get results
rep_results <- rep_samples[, .(
.N,
Mean=mean(Ratio),
Lower=quantile(Ratio,0.025),
Upper=quantile(Ratio, 0.975),
Median=quantile(Ratio, 0.50),
StdDev=sd(Ratio)
), keyby=.(Species)]
print(rep_results)
# Species N Mean Lower Upper Median StdDev
#1: setosa 1000 1.460326 1.430836 1.489853 1.459942 0.01531591
#2: versicolor 1000 2.143124 2.089608 2.205423 2.140150 0.03010066
#3: virginica 1000 2.216022 2.155089 2.279134 2.215410 0.03312109
Tags: r, confidence-interval
Could you characterize "this problem" a little more specifically? What kinds of data do you anticipate applying your solution to: how many observations, what ranges of values, what amounts of correlation, what kinds of bivariate distribution, and so on? Are you asking for a closed-form formula for the bootstrap CI in particular or just a formula for any reasonable CI procedure? Would approximations be acceptable? If so, how accurate must they be?
– whuber♦, Dec 28 '18 at 16:46
Most of my datasets will be between 100K and 1M rows. Most of the values will be positive and less than 30,000. The values will definitely be correlated (probably between 0.5 and 0.8). A closed-form formula would be ideal since it would run a lot faster, and I figured it probably exists. However, most estimations would suffice. I know I could probably sample down with replacement and it would be faster, but I was hoping for something better.
– Mike.Gahan, Dec 28 '18 at 16:57
The distribution of the ratio of variables is typically kind of messy and might be tricky -- if the denominator variable has density > 0 at variable = 0, then the density of the ratio will have a Cauchy-like component which might cause the variance to not exist, for example. But the cdf would still be well defined, I think, except maybe in special cases (although I haven't considered this carefully). Anyway, in general this problem can be approached via a change of variables; a web search for "change of variables probability distribution" should find some resources.
– Robert Dodier, Dec 28 '18 at 18:10
Also, long ago I worked out the distribution of the ratio of two correlated normal variables (a good fit for this problem, I think) and came up with a relatively simple formula, but later found (not surprisingly) that it had been published before. If I recall correctly it was the subject of a paper by George Marsaglia in 1967. If you can't find it, I will try to see if I have the reference somewhere.
– Robert Dodier, Dec 28 '18 at 18:12
I see that a web search for "george marsaglia ratio of correlated normal variables article" finds the original paper, an updated version, and other resources.
– Robert Dodier, Dec 28 '18 at 20:48
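As a sketch of what such a closed-form interval might look like (not taken from this thread), the following computes a delta-method (Taylor-linearization) interval for the per-group ratio of sums, relying on the per-group sums being approximately normal, the same large-sample idea behind the correlated-normal-ratio discussion above. The names iris_dt and delta_results are illustrative.

# A minimal sketch of a closed-form (delta-method) interval for the per-group
# ratio sum(Sepal.Length)/sum(Sepal.Width); assumes the per-group sums are
# approximately normal, as in the correlated-normal-ratio discussion above.
library(data.table)

iris_dt <- as.data.table(datasets::iris)   # illustrative name

delta_results <- iris_dt[, {
  ratio <- sum(Sepal.Length) / sum(Sepal.Width)
  # Linearization residuals: d_i = y_i - ratio * x_i
  d  <- Sepal.Length - ratio * Sepal.Width
  se <- sqrt(var(d) / .N) / mean(Sepal.Width)   # SE of the ratio estimator
  .(Ratio = ratio,
    Lower = ratio - qnorm(0.975) * se,
    Upper = ratio + qnorm(0.975) * se,
    StdErr = se)
}, keyby = .(Species)]

print(delta_results)

On the iris groups this kind of approximation should land close to the bootstrap intervals above; whether the normal approximation is adequate for a given dataset is a separate question.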
1 Answer
Try this one:
library(rsample)
library(purrr)
library(dplyr)
library(tidyr)   # for unnest()

iris %>%
  bootstraps(1000) %>%
  mutate(
    # For each bootstrap split, compute the per-Species ratio of sums
    Ratio = map(splits, function(x) {
      analysis(x) %>%
        group_by(Species) %>%
        summarise(Ratio = sum(Sepal.Length) / sum(Sepal.Width))
    })
  ) %>%
  unnest(Ratio) %>%
  select(-id) %>%
  group_by(Species) %>%
  summarise(Mean   = mean(Ratio),
            Lower  = quantile(Ratio, 0.025),
            Upper  = quantile(Ratio, 0.975),
            Median = quantile(Ratio, 0.50),
            StdDev = sd(Ratio))
# A tibble: 3 x 6
Species Mean Lower Upper Median StdDev
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 1.46 1.43 1.49 1.46 0.0151
2 versicolor 2.14 2.08 2.20 2.14 0.0307
3 virginica 2.22 2.15 2.29 2.22 0.0333
This solution is about 2.6 times faster in my tests.
– jyjek, answered Dec 28 '18 at 16:10
Thanks @jyjek, but this is actually quite a bit slower on larger datasets (with over 100,000 rows it is 5-6x slower): iris <- iris[rep(1:.N, 1000)]
– Mike.Gahan, Dec 28 '18 at 16:53
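For context, a rough sketch of how that larger-data timing might be reproduced, assuming the 1000-fold replication from the comment (about 150,000 rows) and reusing bootstrap_func from the question; the rsample pipeline above can be timed the same way for comparison.

# Hypothetical reproduction of the larger-data timing from the comment above.
library(data.table)

iris <- as.data.table(datasets::iris)[rep(1:.N, 1000)]   # enlarge to ~150,000 rows
# bootstrap_func (defined in the question) reads the global `iris`, so timing
# a reduced number of replicates gives a feel for the scaling:
system.time(rbindlist(lapply(1:100, bootstrap_func)))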