Calculating confidence interval for ratio of sums

I have a problem where I need to calculate a confidence interval for a ratio of two sums. The code below calculates this statistic using a bootstrapping method, but I would much prefer a closed-form formula, as I have this in an R Shiny app and it would react much faster.

Below is code that calculates a bootstrapped confidence interval for sum(Sepal.Length)/sum(Sepal.Width) by Species, using the iris dataset in R.

Any ideas on a closed-form solution for this problem? I feel like I am missing something obvious.



Set up data



set.seed(1996+01+02)
library(data.table)
iris <- as.data.table(iris)


Set up a bootstrap function



bootstrap_func <- function(x) {

  # Resample rows with replacement, within each Species
  iris_boot <- iris[, .SD[sample(1:.N, size = .N, replace = TRUE)], keyby = .(Species)]

  # Summarize: per-species sums and their ratio
  iris_boot_result <- iris_boot[, .(Sepal.Length = sum(Sepal.Length),
                                    Sepal.Width  = sum(Sepal.Width)), keyby = .(Species)]
  iris_boot_result[, Ratio := Sepal.Length/Sepal.Width]
  iris_boot_result[, Sample := x]

  return(iris_boot_result)
}


Replicate bootstrap



rep_samples <- rbindlist(lapply(1:1000, bootstrap_func), fill=TRUE)


Get results



rep_results <- rep_samples[, .(
  .N,
  Mean   = mean(Ratio),
  Lower  = quantile(Ratio, 0.025),
  Upper  = quantile(Ratio, 0.975),
  Median = quantile(Ratio, 0.50),
  StdDev = sd(Ratio)
), keyby = .(Species)]
print(rep_results)

#       Species    N     Mean    Lower    Upper   Median     StdDev
# 1:     setosa 1000 1.460326 1.430836 1.489853 1.459942 0.01531591
# 2: versicolor 1000 2.143124 2.089608 2.205423 2.140150 0.03010066
# 3:  virginica 1000 2.216022 2.155089 2.279134 2.215410 0.03312109
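
A closed-form candidate worth comparing against is the Taylor-linearization (delta method) interval for a ratio estimator, the same approximation used in survey sampling. Below is a minimal data.table sketch, assuming independent observations within each group and groups large enough for the normal approximation to hold; it is an approximation, not guaranteed to match the bootstrap exactly:

# Delta-method (Taylor linearization) CI for sum(x)/sum(y) by group.
# With R = sum(x)/sum(y) and linearized residuals d_i = x_i - R*y_i,
# Var(R) is approximately N * var(d) / sum(y)^2 under iid sampling.
z <- qnorm(0.975)
closed_form <- iris[, {
  R  <- sum(Sepal.Length) / sum(Sepal.Width)
  d  <- Sepal.Length - R * Sepal.Width         # linearized residuals
  se <- sqrt(.N) * sd(d) / sum(Sepal.Width)    # approximate SE of the ratio
  list(Ratio = R, Lower = R - z*se, Upper = R + z*se, StdErr = se)
}, keyby = .(Species)]
print(closed_form)

On iris this lands within rounding distance of the bootstrap intervals above, and since it is a single grouped pass over the data it scales easily to millions of rows.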

r confidence-interval

asked Dec 28 '18 at 14:43 by Mike.Gahan · edited Dec 28 '18 at 22:02 by kjetil b halvorsen · migrated from stackoverflow.com Dec 28 '18 at 16:16

  • Could you characterize "this problem" a little more specifically? What kinds of data do you anticipate applying your solution to: how many observations, what ranges of values, what amounts of correlation, what kinds of bivariate distribution, and so on? Are you asking for a closed-form formula for the bootstrap CI in particular, or just a formula for any reasonable CI procedure? Would approximations be acceptable? If so, how accurate must they be?
    – whuber, Dec 28 '18 at 16:46

  • Most of my datasets will be between 100K and 1M rows. Most of the values will be positive and less than 30,000. The values will definitely be correlated (probably between 0.5 and 0.8). A closed-form formula would be ideal since it would run a lot faster, and I figured one probably exists. However, a good approximation would suffice. I know I could sample down with replacement and it would be faster, but I was hoping for something better.
    – Mike.Gahan, Dec 28 '18 at 16:57

  • The distribution of a ratio of variables is typically messy and can be tricky: if the denominator variable has density > 0 at zero, then the density of the ratio has a Cauchy-like component, which can cause the variance not to exist, for example. But the CDF would still be well defined, I think, except maybe in special cases (although I haven't considered this carefully). In general this problem can be approached via a change of variables; a web search for "change of variables probability distribution" should find some resources.
    – Robert Dodier, Dec 28 '18 at 18:10
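
As a quick illustration of that point: the ratio of two independent standard normals is standard Cauchy, so the sample standard deviation never settles down as the sample grows, while the quantiles (and hence percentile intervals) stay well behaved. A minimal sketch:

# Ratio of two independent standard normals is standard Cauchy:
# no finite mean or variance.
set.seed(42)
r <- rnorm(1e6) / rnorm(1e6)

# The sample SD keeps jumping around as n grows...
sapply(c(1e3, 1e4, 1e5, 1e6), function(n) sd(r[1:n]))

# ...but the quantiles converge, so percentile-based CIs remain usable.
quantile(r, c(0.025, 0.5, 0.975))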

  • Also, long ago I worked out the distribution of the ratio of two correlated normal variables (a good fit for this problem, I think) and came up with a relatively simple formula, but later found (not surprisingly) that it had been published before. If I recall correctly, it was the subject of a paper by George Marsaglia in 1967. If you can't find it, I will try to see if I have the reference somewhere.
    – Robert Dodier, Dec 28 '18 at 18:12

  • I see that a web search for "george marsaglia ratio of correlated normal variables article" finds the original paper, an updated version, and other resources.
    – Robert Dodier, Dec 28 '18 at 20:48

1 Answer

Try this one:



library(rsample)
library(purrr)
library(dplyr)

iris %>%
  bootstraps(1000) %>%   # note: resamples whole rows; strata = "Species" would mimic the per-species bootstrap
  mutate(
    Ratio = map(splits, function(x) {
      analysis(x) %>%
        group_by(Species) %>%
        summarise(Ratio = sum(Sepal.Length)/sum(Sepal.Width))
    })
  ) %>%
  unnest(Ratio) %>%
  select(-id) %>%
  group_by(Species) %>%
  summarise(Mean   = mean(Ratio),
            Lower  = quantile(Ratio, 0.025),
            Upper  = quantile(Ratio, 0.975),
            Median = quantile(Ratio, 0.50),
            StdDev = sd(Ratio))
# A tibble: 3 x 6
#   Species     Mean Lower Upper Median StdDev
#   <fct>      <dbl> <dbl> <dbl>  <dbl>  <dbl>
# 1 setosa      1.46  1.43  1.49   1.46 0.0151
# 2 versicolor  2.14  2.08  2.20   2.14 0.0307
# 3 virginica   2.22  2.15  2.29   2.22 0.0333


This solution is 2.6 times faster in my tests.
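
If you want to verify the timing claim on your own machine, a minimal wall-clock comparison (assuming bootstrap_func and the data.table version of iris from the question are already defined in the session):

# Time the data.table bootstrap from the question...
system.time(rbindlist(lapply(1:1000, bootstrap_func), fill = TRUE))

# ...against the resampling core of the rsample pipeline above.
system.time(
  iris %>%
    bootstraps(1000) %>%
    mutate(Ratio = map(splits, function(x) {
      analysis(x) %>%
        group_by(Species) %>%
        summarise(Ratio = sum(Sepal.Length)/sum(Sepal.Width))
    })) %>%
    unnest(Ratio)
)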

answered Dec 28 '18 at 16:10 by jyjek

  • Thanks @jyjek, but this is actually quite a bit slower on larger datasets (over 100,000 rows, e.g. iris <- iris[rep(1:.N, 1000)]); there it is 5-6x slower.
    – Mike.Gahan, Dec 28 '18 at 16:53