Calculating confidence interval for ratio of sums
I have a problem where I need to calculate a confidence interval for a ratio of two sums. The code below calculates this statistic using a bootstrapping method, but I would much prefer a closed-form formula, since this runs inside an R Shiny app and would react much faster.
Below is code that computes a bootstrapped confidence interval for sum(Sepal.Length)/sum(Sepal.Width) by Species using the iris dataset in R.
Any ideas on a closed-form solution to this problem? I feel like I am missing something obvious.
Set up data
set.seed(1996+01+02)
library(data.table)
iris <- as.data.table(iris)
Set up a bootstrap function
bootstrap_func <- function(x) {
# Resample data with replacement
iris_boot <- iris[, .SD[sample(1:.N, size=.N, replace=TRUE)], keyby=.(Species)]
# Summarize
iris_boot_result <- iris_boot[, .(Sepal.Length=sum(Sepal.Length), Sepal.Width=sum(Sepal.Width)), keyby=.(Species)]
iris_boot_result[, Ratio := Sepal.Length/Sepal.Width]
iris_boot_result[, Sample := x]
return(iris_boot_result)
}
Replicate bootstrap
rep_samples <- rbindlist(lapply(1:1000, bootstrap_func), fill=TRUE)
Get results
rep_results <- rep_samples[, .(
.N,
Mean=mean(Ratio),
Lower=quantile(Ratio,0.025),
Upper=quantile(Ratio, 0.975),
Median=quantile(Ratio, 0.50),
StdDev=sd(Ratio)
), keyby=.(Species)]
print(rep_results)
# Species N Mean Lower Upper Median StdDev
#1: setosa 1000 1.460326 1.430836 1.489853 1.459942 0.01531591
#2: versicolor 1000 2.143124 2.089608 2.205423 2.140150 0.03010066
#3: virginica 1000 2.216022 2.155089 2.279134 2.215410 0.03312109
Tags: r, confidence-interval
Could you characterize "this problem" a little more specifically? What kinds of data do you anticipate applying your solution to: how many observations, what ranges of values, what amounts of correlation, what kinds of bivariate distribution, and so on? Are you asking for a closed-form formula for the bootstrap CI in particular or just a formula for any reasonable CI procedure? Would approximations be acceptable? If so, how accurate must they be?
– whuber♦, Dec 28 '18 at 16:46
Most of my datasets will be between 100K and 1M rows. Most of the values will be positive and less than 30,000. The values will definitely be correlated (probably between 0.5 and 0.8). A closed-form formula would be ideal since it would run a lot faster, and I figured it probably exists. However, most estimations would suffice. I know I could probably sample down with replacement and it would be faster, but I was hoping for something better.
– Mike.Gahan, Dec 28 '18 at 16:57
The distribution of the ratio of variables is typically kind of messy and might be tricky -- if the denominator variable has density > 0 at variable = 0, then the density of the ratio will have a Cauchy-like component which might cause the variance to not exist, for example. But the cdf would still be well defined, I think, except maybe in special cases (although I haven't considered this carefully). Anyway, in general this problem can be approached via a change of variables; a web search for "change of variables probability distribution" should find some resources.
– Robert Dodier, Dec 28 '18 at 18:10
Also, long ago I worked out the distribution of the ratio of two correlated normal variables (a good fit for this problem, I think) and came up with a relatively simple formula, but later found (not surprisingly) that it had been published before. If I recall correctly it was the subject of a paper by George Marsaglia in 1967. If you can't find it, I will try to see if I have the reference somewhere.
– Robert Dodier, Dec 28 '18 at 18:12
I see that a web search for "george marsaglia ratio of correlated normal variables article" finds the original paper, an updated version, and other resources.
– Robert Dodier, Dec 28 '18 at 20:48
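As a sketch of what such a closed-form interval might look like (not taken from this thread), the following computes a delta-method (Taylor-linearization) interval for the per-group ratio of sums, relying on the per-group sums being approximately normal, the same large-sample idea behind the correlated-normal-ratio discussion above. The names iris_dt and delta_results are illustrative.

# A minimal sketch of a closed-form (delta-method) interval for the per-group
# ratio sum(Sepal.Length)/sum(Sepal.Width); assumes the per-group sums are
# approximately normal, as in the correlated-normal-ratio discussion above.
library(data.table)

iris_dt <- as.data.table(datasets::iris)   # illustrative name

delta_results <- iris_dt[, {
  ratio <- sum(Sepal.Length) / sum(Sepal.Width)
  # Linearization residuals: d_i = y_i - ratio * x_i
  d  <- Sepal.Length - ratio * Sepal.Width
  se <- sqrt(var(d) / .N) / mean(Sepal.Width)   # SE of the ratio estimator
  .(Ratio = ratio,
    Lower = ratio - qnorm(0.975) * se,
    Upper = ratio + qnorm(0.975) * se,
    StdErr = se)
}, keyby = .(Species)]

print(delta_results)

On the iris groups this kind of approximation should land close to the bootstrap intervals above; whether the normal approximation is adequate for a given dataset is a separate question.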
1 Answer
Try this one:
library(rsample)
library(purrr)
library(dplyr)
library(tidyr)   # for unnest()

iris %>%
  bootstraps(1000) %>%
  mutate(
    # For each bootstrap split, compute the per-Species ratio of sums
    Ratio = map(splits, function(x) {
      analysis(x) %>%
        group_by(Species) %>%
        summarise(Ratio = sum(Sepal.Length) / sum(Sepal.Width))
    })
  ) %>%
  unnest(Ratio) %>%
  select(-id) %>%
  group_by(Species) %>%
  summarise(Mean   = mean(Ratio),
            Lower  = quantile(Ratio, 0.025),
            Upper  = quantile(Ratio, 0.975),
            Median = quantile(Ratio, 0.50),
            StdDev = sd(Ratio))
# A tibble: 3 x 6
Species Mean Lower Upper Median StdDev
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 1.46 1.43 1.49 1.46 0.0151
2 versicolor 2.14 2.08 2.20 2.14 0.0307
3 virginica 2.22 2.15 2.29 2.22 0.0333
This solution is about 2.6 times faster in my tests.
– jyjek, answered Dec 28 '18 at 16:10
Thanks @jyjek, but this is actually quite a bit slower on larger datasets (with over 100,000 rows it is 5-6x slower): iris <- iris[rep(1:.N, 1000)]
– Mike.Gahan, Dec 28 '18 at 16:53
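For context, a rough sketch of how that larger-data timing might be reproduced, assuming the 1000-fold replication from the comment (about 150,000 rows) and reusing bootstrap_func from the question; the rsample pipeline above can be timed the same way for comparison.

# Hypothetical reproduction of the larger-data timing from the comment above.
library(data.table)

iris <- as.data.table(datasets::iris)[rep(1:.N, 1000)]   # enlarge to ~150,000 rows
# bootstrap_func (defined in the question) reads the global `iris`, so timing
# a reduced number of replicates gives a feel for the scaling:
system.time(rbindlist(lapply(1:100, bootstrap_func)))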