Aggregate / summarize multiple variables per group (e.g. sum, mean)












121














From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?



Below are some sample data:



library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)


I would like to simultaneously aggregate the x1 and x2 variables from the df2 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?



### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)


Any suggestions would be greatly appreciated.










share|improve this question




















  • 1




    @Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
    – Keiku
    Nov 18 '17 at 10:10
















121














From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?



Below are some sample data:



library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)


I would like to simultaneously aggregate the x1 and x2 variables from the df2 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?



### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)


Any suggestions would be greatly appreciated.










share|improve this question




















  • 1




    @Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
    – Keiku
    Nov 18 '17 at 10:10














121












121








121


55





From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?



Below are some sample data:



library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)


I would like to simultaneously aggregate the x1 and x2 variables from the df2 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?



### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)


Any suggestions would be greatly appreciated.










share|improve this question















From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?



Below are some sample data:



library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)


I would like to simultaneously aggregate the x1 and x2 variables from the df2 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?



### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)


Any suggestions would be greatly appreciated.







r dataframe data.table aggregate r-faq






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Apr 8 '18 at 20:24









Henrik

40.8k992107




40.8k992107










asked Mar 15 '12 at 15:44









MikeTP

2,561103552




2,561103552








  • 1




    @Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
    – Keiku
    Nov 18 '17 at 10:10














  • 1




    @Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
    – Keiku
    Nov 18 '17 at 10:10








1




1




@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10




@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10












6 Answers
6






active

oldest

votes


















41














Where is this year() function from?



You could also use the reshape2 package for this task:



require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522





share|improve this answer



















  • 6




    The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
    – Jaap
    May 13 '16 at 6:17





















173














Yes, in your formula, you can cbind the numeric variables to be aggregated:



aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589




See ?aggregate, the formula argument and the examples.






share|improve this answer



















  • 3




    Is it possible for the cbind to use dynamic variables?
    – pdb
    Nov 13 '15 at 5:29






  • 12




    It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
    – pdb
    Nov 13 '15 at 6:19






  • 1




    what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
    – Clock Slave
    Mar 16 '16 at 11:22






  • 7




    @ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
    – A5C1D2H2I1M1N2O1R2T1
    Mar 21 '16 at 3:53






  • 4




    What if I don't want two variables but two functions?. For example mean and sd.
    – skan
    Apr 14 '16 at 19:15



















44














Using the data.table package, which is fast (useful for larger datasets)



https://github.com/Rdatatable/data.table/wiki



library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe


Using the plyr package



require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))


Using summarize() from the Hmisc package
(column headings are messy in my example though)



# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))





share|improve this answer























  • why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
    – Bulat
    Oct 13 '18 at 12:00



















38














With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:



library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)

# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)


The result of the latter two options:



    year month        x1         x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows


Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.





As mentioned in my comment above, you can also use the recast function from the reshape2-package:



library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))


which will give you the same result.






share|improve this answer































    1














    Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:



    aggregate(
    x = df1[c("x1", "x2")],
    by = df1[c("year", "month")],
    FUN = sum, na.rm = TRUE
    )


    More generic use of aggregate's data.frame method:



    Since we are providing a





    • data.frame as x and

    • a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

    • also with custom-made aggregation functions


    For example like so:



    colsToAggregate <- c("x1")
    aggregateBy <- c("year", "month")
    dummyaggfun <- function(v, na.rm = TRUE) {
    c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
    }

    aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)





    share|improve this answer























    • Nice & necessary addition!
      – Jaap
      Dec 28 '18 at 12:45



















    0














    Late to the party, but recently found another way to get the summary statistics.



    library(psych)
    describe(data)



    Will output:
    mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.






    share|improve this answer




















      protected by Jaap Jul 27 '16 at 18:39



      Thank you for your interest in this question.
      Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



      Would you like to answer one of these unanswered questions instead?














      6 Answers
      6






      active

      oldest

      votes








      6 Answers
      6






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      41














      Where is this year() function from?



      You could also use the reshape2 package for this task:



      require(reshape2)
      df_melt <- melt(df1, id = c("date", "year", "month"))
      dcast(df_melt, year + month ~ variable, sum)
      # year month x1 x2
      1 2000 1 -80.83405 -224.9540159
      2 2000 2 -223.76331 -288.2418017
      3 2000 3 -188.83930 -481.5601913
      4 2000 4 -197.47797 -473.7137420
      5 2000 5 -259.07928 -372.4563522





      share|improve this answer



















      • 6




        The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
        – Jaap
        May 13 '16 at 6:17


















      41














      Where is this year() function from?



      You could also use the reshape2 package for this task:



      require(reshape2)
      df_melt <- melt(df1, id = c("date", "year", "month"))
      dcast(df_melt, year + month ~ variable, sum)
      # year month x1 x2
      1 2000 1 -80.83405 -224.9540159
      2 2000 2 -223.76331 -288.2418017
      3 2000 3 -188.83930 -481.5601913
      4 2000 4 -197.47797 -473.7137420
      5 2000 5 -259.07928 -372.4563522





      share|improve this answer



















      • 6




        The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
        – Jaap
        May 13 '16 at 6:17
















      41












      41








      41






      Where is this year() function from?



      You could also use the reshape2 package for this task:



      require(reshape2)
      df_melt <- melt(df1, id = c("date", "year", "month"))
      dcast(df_melt, year + month ~ variable, sum)
      # year month x1 x2
      1 2000 1 -80.83405 -224.9540159
      2 2000 2 -223.76331 -288.2418017
      3 2000 3 -188.83930 -481.5601913
      4 2000 4 -197.47797 -473.7137420
      5 2000 5 -259.07928 -372.4563522





      share|improve this answer














      Where is this year() function from?



      You could also use the reshape2 package for this task:



      require(reshape2)
      df_melt <- melt(df1, id = c("date", "year", "month"))
      dcast(df_melt, year + month ~ variable, sum)
      # year month x1 x2
      1 2000 1 -80.83405 -224.9540159
      2 2000 2 -223.76331 -288.2418017
      3 2000 3 -188.83930 -481.5601913
      4 2000 4 -197.47797 -473.7137420
      5 2000 5 -259.07928 -372.4563522






      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Jun 16 '18 at 9:56









      Striezel

      2,36171326




      2,36171326










      answered Mar 15 '12 at 15:56









      EDi

      10.7k23649




      10.7k23649








      • 6




        The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
        – Jaap
        May 13 '16 at 6:17
















      • 6




        The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
        – Jaap
        May 13 '16 at 6:17










      6




      6




      The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
      – Jaap
      May 13 '16 at 6:17






      The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
      – Jaap
      May 13 '16 at 6:17















      173














      Yes, in your formula, you can cbind the numeric variables to be aggregated:



      aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
      year month x1 x2
      1 2000 1 7.862002 -7.469298
      2 2001 1 276.758209 474.384252
      3 2000 2 13.122369 -128.122613
      ...
      23 2000 12 63.436507 449.794454
      24 2001 12 999.472226 922.726589




      See ?aggregate, the formula argument and the examples.






      share|improve this answer



















      • 3




        Is it possible for the cbind to use dynamic variables?
        – pdb
        Nov 13 '15 at 5:29






      • 12




        It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
        – pdb
        Nov 13 '15 at 6:19






      • 1




        what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
        – Clock Slave
        Mar 16 '16 at 11:22






      • 7




        @ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
        – A5C1D2H2I1M1N2O1R2T1
        Mar 21 '16 at 3:53






      • 4




        What if I don't want two variables but two functions?. For example mean and sd.
        – skan
        Apr 14 '16 at 19:15
















      173














      Yes, in your formula, you can cbind the numeric variables to be aggregated:



      aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
      year month x1 x2
      1 2000 1 7.862002 -7.469298
      2 2001 1 276.758209 474.384252
      3 2000 2 13.122369 -128.122613
      ...
      23 2000 12 63.436507 449.794454
      24 2001 12 999.472226 922.726589




      See ?aggregate, the formula argument and the examples.






      share|improve this answer



















      • 3




        Is it possible for the cbind to use dynamic variables?
        – pdb
        Nov 13 '15 at 5:29






      • 12




        It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
        – pdb
        Nov 13 '15 at 6:19






      • 1




        what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
        – Clock Slave
        Mar 16 '16 at 11:22






      • 7




        @ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
        – A5C1D2H2I1M1N2O1R2T1
        Mar 21 '16 at 3:53






      • 4




        What if I don't want two variables but two functions?. For example mean and sd.
        – skan
        Apr 14 '16 at 19:15














      173












      173








      173






      Yes, in your formula, you can cbind the numeric variables to be aggregated:



      aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
      year month x1 x2
      1 2000 1 7.862002 -7.469298
      2 2001 1 276.758209 474.384252
      3 2000 2 13.122369 -128.122613
      ...
      23 2000 12 63.436507 449.794454
      24 2001 12 999.472226 922.726589




      See ?aggregate, the formula argument and the examples.






      share|improve this answer














      Yes, in your formula, you can cbind the numeric variables to be aggregated:



      aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
      year month x1 x2
      1 2000 1 7.862002 -7.469298
      2 2001 1 276.758209 474.384252
      3 2000 2 13.122369 -128.122613
      ...
      23 2000 12 63.436507 449.794454
      24 2001 12 999.472226 922.726589




      See ?aggregate, the formula argument and the examples.







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Oct 22 '17 at 10:22









      Henrik

      40.8k992107




      40.8k992107










      answered Mar 15 '12 at 15:50









      Andrie

      135k26347436




      135k26347436








      • 3




        Is it possible for the cbind to use dynamic variables?
        – pdb
        Nov 13 '15 at 5:29






      • 12




        It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
        – pdb
        Nov 13 '15 at 6:19






      • 1




        what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
        – Clock Slave
        Mar 16 '16 at 11:22






      • 7




        @ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
        – A5C1D2H2I1M1N2O1R2T1
        Mar 21 '16 at 3:53






      • 4




        What if I don't want two variables but two functions?. For example mean and sd.
        – skan
        Apr 14 '16 at 19:15














      • 3




        Is it possible for the cbind to use dynamic variables?
        – pdb
        Nov 13 '15 at 5:29






      • 12




        It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
        – pdb
        Nov 13 '15 at 6:19






      • 1




        what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
        – Clock Slave
        Mar 16 '16 at 11:22






      • 7




        @ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
        – A5C1D2H2I1M1N2O1R2T1
        Mar 21 '16 at 3:53






      • 4




        What if I don't want two variables but two functions?. For example mean and sd.
        – skan
        Apr 14 '16 at 19:15








      3




      3




      Is it possible for the cbind to use dynamic variables?
      – pdb
      Nov 13 '15 at 5:29




      Is it possible for the cbind to use dynamic variables?
      – pdb
      Nov 13 '15 at 5:29




      12




      12




      It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
      – pdb
      Nov 13 '15 at 6:19




      It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
      – pdb
      Nov 13 '15 at 6:19




      1




      1




      what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
      – Clock Slave
      Mar 16 '16 at 11:22




      what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
      – Clock Slave
      Mar 16 '16 at 11:22




      7




      7




      @ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
      – A5C1D2H2I1M1N2O1R2T1
      Mar 21 '16 at 3:53




      @ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
      – A5C1D2H2I1M1N2O1R2T1
      Mar 21 '16 at 3:53




      4




      4




      What if I don't want two variables but two functions?. For example mean and sd.
      – skan
      Apr 14 '16 at 19:15




      What if I don't want two variables but two functions?. For example mean and sd.
      – skan
      Apr 14 '16 at 19:15











      44














      Using the data.table package, which is fast (useful for larger datasets)



      https://github.com/Rdatatable/data.table/wiki



      library(data.table)
      df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
      setDF(df2) # convert back to dataframe


      Using the plyr package



      require(plyr)
      df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))


      Using summarize() from the Hmisc package
      (column headings are messy in my example though)



      # need to detach plyr because plyr and Hmisc both have a summarize()
      detach(package:plyr)
      require(Hmisc)
      df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))





      share|improve this answer























      • why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
        – Bulat
        Oct 13 '18 at 12:00
















      44














      Using the data.table package, which is fast (useful for larger datasets)



      https://github.com/Rdatatable/data.table/wiki



      library(data.table)
      df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
      setDF(df2) # convert back to dataframe


      Using the plyr package



      require(plyr)
      df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))


      Using summarize() from the Hmisc package
      (column headings are messy in my example though)



      # need to detach plyr because plyr and Hmisc both have a summarize()
      detach(package:plyr)
      require(Hmisc)
      df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))





      share|improve this answer























      • why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
        – Bulat
        Oct 13 '18 at 12:00














      44












      44








      44






      Using the data.table package, which is fast (useful for larger datasets)



      https://github.com/Rdatatable/data.table/wiki



      library(data.table)
      df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
      setDF(df2) # convert back to dataframe


      Using the plyr package



      require(plyr)
      df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))


      Using summarize() from the Hmisc package
      (column headings are messy in my example though)



      # need to detach plyr because plyr and Hmisc both have a summarize()
      detach(package:plyr)
      require(Hmisc)
      df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))





      share|improve this answer














      Using the data.table package, which is fast (useful for larger datasets)



      https://github.com/Rdatatable/data.table/wiki



      library(data.table)
      df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
      setDF(df2) # convert back to dataframe


      Using the plyr package



      require(plyr)
      df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))


      Using summarize() from the Hmisc package
      (column headings are messy in my example though)



      # need to detach plyr because plyr and Hmisc both have a summarize()
      detach(package:plyr)
      require(Hmisc)
      df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))






      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Sep 26 '15 at 9:25









      Jaap

      55k20117130




      55k20117130










      answered Mar 15 '12 at 23:00









      numbercruncher

      74658




      74658












      • why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
        – Bulat
        Oct 13 '18 at 12:00


















      • why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
        – Bulat
        Oct 13 '18 at 12:00
















      why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
      – Bulat
      Oct 13 '18 at 12:00




      why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
      – Bulat
      Oct 13 '18 at 12:00











      38














      With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:



      library(dplyr)
      # summarising all non-grouping variables
      df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)

      # summarising a specific set of non-grouping variables
      df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
      df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

      # summarising a specific set of non-grouping variables based on condition (class)
      df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)


      The result of the latter two options:



          year month        x1         x2
      <dbl> <dbl> <dbl> <dbl>
      1 2000 1 -73.58134 -92.78595
      2 2000 2 -57.81334 -152.36983
      3 2000 3 122.68758 153.55243
      4 2000 4 450.24980 285.56374
      5 2000 5 678.37867 384.42888
      6 2000 6 792.68696 530.28694
      7 2000 7 908.58795 452.31222
      8 2000 8 710.69928 719.35225
      9 2000 9 725.06079 914.93687
      10 2000 10 770.60304 863.39337
      # ... with 14 more rows


      Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.





      As mentioned in my comment above, you can also use the recast function from the reshape2-package:



      library(reshape2)
      recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))


      which will give you the same result.






      share|improve this answer




























        38














        With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:



        library(dplyr)
        # summarising all non-grouping variables
        df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)

        # summarising a specific set of non-grouping variables
        df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
        df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

        # summarising a specific set of non-grouping variables based on condition (class)
        df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)


        The result of the latter two options:



            year month        x1         x2
        <dbl> <dbl> <dbl> <dbl>
        1 2000 1 -73.58134 -92.78595
        2 2000 2 -57.81334 -152.36983
        3 2000 3 122.68758 153.55243
        4 2000 4 450.24980 285.56374
        5 2000 5 678.37867 384.42888
        6 2000 6 792.68696 530.28694
        7 2000 7 908.58795 452.31222
        8 2000 8 710.69928 719.35225
        9 2000 9 725.06079 914.93687
        10 2000 10 770.60304 863.39337
        # ... with 14 more rows


        Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.





        As mentioned in my comment above, you can also use the recast function from the reshape2-package:



        library(reshape2)
        recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))


        which will give you the same result.






        share|improve this answer


























          38












          38








          38






          With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:



          library(dplyr)
          # summarising all non-grouping variables
          df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)

          # summarising a specific set of non-grouping variables
          df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
          df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

          # summarising a specific set of non-grouping variables based on condition (class)
          df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)


          The result of the latter two options:



              year month        x1         x2
          <dbl> <dbl> <dbl> <dbl>
          1 2000 1 -73.58134 -92.78595
          2 2000 2 -57.81334 -152.36983
          3 2000 3 122.68758 153.55243
          4 2000 4 450.24980 285.56374
          5 2000 5 678.37867 384.42888
          6 2000 6 792.68696 530.28694
          7 2000 7 908.58795 452.31222
          8 2000 8 710.69928 719.35225
          9 2000 9 725.06079 914.93687
          10 2000 10 770.60304 863.39337
          # ... with 14 more rows


          Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.





          As mentioned in my comment above, you can also use the recast function from the reshape2-package:



          library(reshape2)
          recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))


          which will give you the same result.






          share|improve this answer














          With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:



          library(dplyr)
          # summarising all non-grouping variables
          df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)

          # summarising a specific set of non-grouping variables
          df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
          df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

          # summarising a specific set of non-grouping variables based on condition (class)
          df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)


          The result of the latter two options:



              year month        x1         x2
          <dbl> <dbl> <dbl> <dbl>
          1 2000 1 -73.58134 -92.78595
          2 2000 2 -57.81334 -152.36983
          3 2000 3 122.68758 153.55243
          4 2000 4 450.24980 285.56374
          5 2000 5 678.37867 384.42888
          6 2000 6 792.68696 530.28694
          7 2000 7 908.58795 452.31222
          8 2000 8 710.69928 719.35225
          9 2000 9 725.06079 914.93687
          10 2000 10 770.60304 863.39337
          # ... with 14 more rows


          Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.





          As mentioned in my comment above, you can also use the recast function from the reshape2-package:



          library(reshape2)
          recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))


          which will give you the same result.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Feb 7 '18 at 15:16

























          answered Oct 16 '15 at 10:19









          Jaap

          55k20117130




          55k20117130























              1














              Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:



              aggregate(
              x = df1[c("x1", "x2")],
              by = df1[c("year", "month")],
              FUN = sum, na.rm = TRUE
              )


              More generic use of aggregate's data.frame method:



              Since we are providing a





              • data.frame as x and

              • a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

              • also with custom-made aggregation functions


              For example like so:



              colsToAggregate <- c("x1")
              aggregateBy <- c("year", "month")
              dummyaggfun <- function(v, na.rm = TRUE) {
              c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
              }

              aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)





              share|improve this answer























              • Nice & necessary addition!
                – Jaap
                Dec 28 '18 at 12:45
















              1














              Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:



              aggregate(
              x = df1[c("x1", "x2")],
              by = df1[c("year", "month")],
              FUN = sum, na.rm = TRUE
              )


              More generic use of aggregate's data.frame method:



              Since we are providing a





              • data.frame as x and

              • a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

              • also with custom-made aggregation functions


              For example like so:



              colsToAggregate <- c("x1")
              aggregateBy <- c("year", "month")
              dummyaggfun <- function(v, na.rm = TRUE) {
              c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
              }

              aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)





              share|improve this answer























              • Nice & necessary addition!
                – Jaap
                Dec 28 '18 at 12:45














              1












              1








              1






              Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:



              aggregate(
              x = df1[c("x1", "x2")],
              by = df1[c("year", "month")],
              FUN = sum, na.rm = TRUE
              )


              More generic use of aggregate's data.frame method:



              Since we are providing a





              • data.frame as x and

              • a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

              • also with custom-made aggregation functions


              For example like so:



              colsToAggregate <- c("x1")
              aggregateBy <- c("year", "month")
              dummyaggfun <- function(v, na.rm = TRUE) {
              c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
              }

              aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)





              share|improve this answer














              Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:



              aggregate(
              x = df1[c("x1", "x2")],
              by = df1[c("year", "month")],
              FUN = sum, na.rm = TRUE
              )


              More generic use of aggregate's data.frame method:



              Since we are providing a





              • data.frame as x and

              • a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

              • also with custom-made aggregation functions


              For example like so:



              colsToAggregate <- c("x1")
              aggregateBy <- c("year", "month")
              dummyaggfun <- function(v, na.rm = TRUE) {
              c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
              }

              aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)






              share|improve this answer














              share|improve this answer



              share|improve this answer








              edited Dec 28 '18 at 12:48

























              answered Dec 27 '18 at 15:18









              Jozef

              571110




              571110












              • Nice & necessary addition!
                – Jaap
                Dec 28 '18 at 12:45


















              • Nice & necessary addition!
                – Jaap
                Dec 28 '18 at 12:45
















              Nice & necessary addition!
              – Jaap
              Dec 28 '18 at 12:45




              Nice & necessary addition!
              – Jaap
              Dec 28 '18 at 12:45











              0














              Late to the party, but recently found another way to get the summary statistics.



              library(psych)
              describe(data)



              Will output:
              mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.






              share|improve this answer


























                0














                Late to the party, but recently found another way to get the summary statistics.



                library(psych)
                describe(data)



                Will output:
                mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.






                share|improve this answer
























                  0












                  0








                  0






                  Late to the party, but recently found another way to get the summary statistics.



                  library(psych)
                  describe(data)



                  Will output:
                  mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.






                  share|improve this answer












                  Late to the party, but recently found another way to get the summary statistics.



                  library(psych)
                  describe(data)



                  Will output:
                  mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Aug 15 '18 at 16:22









                  britt

                  427




                  427

















                      protected by Jaap Jul 27 '16 at 18:39



                      Thank you for your interest in this question.
                      Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



                      Would you like to answer one of these unanswered questions instead?



                      Popular posts from this blog

                      Monofisismo

                      Angular Downloading a file using contenturl with Basic Authentication

                      Olmecas