Aggregate / summarize multiple variables per group (e.g. sum, mean)

121

From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?

Below are some sample data:

library(lubridate)

days = 365*2

date = seq(as.Date("2000-01-01"), length = days, by = "day")

year = year(date)

month = month(date)

x1 = cumsum(rnorm(days, 0.05)) 

x2 = cumsum(rnorm(days, 0.05))

df1 = data.frame(date, year, month, x1, x2)

I would like to simultaneously aggregate the x1 and x2 variables from the df2 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?

### aggregate variables by year month

df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)

head(df2)

Any suggestions would be greatly appreciated.

edited Apr 8 '18 at 20:24

Henrik

40.8k992107

asked Mar 15 '12 at 15:44

MikeTP

2,561103552

1

@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10

add a comment |

121

From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?

Below are some sample data:

library(lubridate)

days = 365*2

date = seq(as.Date("2000-01-01"), length = days, by = "day")

year = year(date)

month = month(date)

x1 = cumsum(rnorm(days, 0.05)) 

x2 = cumsum(rnorm(days, 0.05))

df1 = data.frame(date, year, month, x1, x2)

### aggregate variables by year month

df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)

head(df2)

Any suggestions would be greatly appreciated.

edited Apr 8 '18 at 20:24

Henrik

40.8k992107

asked Mar 15 '12 at 15:44

MikeTP

2,561103552

1

@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10

add a comment |

121

From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?

Below are some sample data:

library(lubridate)

days = 365*2

date = seq(as.Date("2000-01-01"), length = days, by = "day")

year = year(date)

month = month(date)

x1 = cumsum(rnorm(days, 0.05)) 

x2 = cumsum(rnorm(days, 0.05))

df1 = data.frame(date, year, month, x1, x2)

### aggregate variables by year month

df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)

head(df2)

Any suggestions would be greatly appreciated.

edited Apr 8 '18 at 20:24

Henrik

40.8k992107

asked Mar 15 '12 at 15:44

MikeTP

2,561103552

From a data frame, is there a easy way to aggregate (sum, mean, max et c) multiple variables simultaneously?

Below are some sample data:

library(lubridate)

days = 365*2

date = seq(as.Date("2000-01-01"), length = days, by = "day")

year = year(date)

month = month(date)

x1 = cumsum(rnorm(days, 0.05)) 

x2 = cumsum(rnorm(days, 0.05))

df1 = data.frame(date, year, month, x1, x2)

### aggregate variables by year month

df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)

head(df2)

Any suggestions would be greatly appreciated.

r dataframe data.table aggregate r-faq

edited Apr 8 '18 at 20:24

Henrik

40.8k992107

asked Mar 15 '12 at 15:44

MikeTP

2,561103552

edited Apr 8 '18 at 20:24

Henrik

40.8k992107

asked Mar 15 '12 at 15:44

MikeTP

2,561103552

edited Apr 8 '18 at 20:24

Henrik

40.8k992107

edited Apr 8 '18 at 20:24

Henrik

40.8k992107

edited Apr 8 '18 at 20:24

Henrik

40.8k992107

asked Mar 15 '12 at 15:44

MikeTP

2,561103552

asked Mar 15 '12 at 15:44

MikeTP

2,561103552

asked Mar 15 '12 at 15:44

MikeTP

2,561103552

1

@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10

add a comment |

1

@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10

@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10

add a comment |

6 Answers
6

active

oldest

votes

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)

df_melt <- melt(df1, id = c("date", "year", "month"))

dcast(df_melt, year + month ~ variable, sum)

#  year month         x1           x2

1  2000     1  -80.83405 -224.9540159

2  2000     2 -223.76331 -288.2418017

3  2000     3 -188.83930 -481.5601913

4  2000     4 -197.47797 -473.7137420

5  2000     5 -259.07928 -372.4563522

edited Jun 16 '18 at 9:56

Striezel

2,36171326

answered Mar 15 '12 at 15:56

EDi

10.7k23649

6

The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17

add a comment |

173

Yes, in your formula, you can cbind the numeric variables to be aggregated:

aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)

   year month         x1          x2

1  2000     1   7.862002   -7.469298

2  2001     1 276.758209  474.384252

3  2000     2  13.122369 -128.122613

...

23 2000    12  63.436507  449.794454

24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.

edited Oct 22 '17 at 10:22

Henrik

40.8k992107

answered Mar 15 '12 at 15:50

Andrie

135k26347436

3

Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29

12

It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19

1

what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22

7

@ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53

4

What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15

|
show 3 more comments

Using the data.table package, which is fast (useful for larger datasets)

https://github.com/Rdatatable/data.table/wiki

library(data.table)

df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]

setDF(df2) # convert back to dataframe

Using the plyr package

require(plyr)

df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package
(column headings are messy in my example though)

# need to detach plyr because plyr and Hmisc both have a summarize()

detach(package:plyr)

require(Hmisc)

df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))

edited Sep 26 '15 at 9:25

Jaap

55k20117130

answered Mar 15 '12 at 23:00

numbercruncher

74658

why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
– Bulat
Oct 13 '18 at 12:00

add a comment |

With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:

library(dplyr)

# summarising all non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)



# summarising a specific set of non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)



# summarising a specific set of non-grouping variables based on condition (class)

df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)

The result of the latter two options:

    year month        x1         x2

   <dbl> <dbl>     <dbl>      <dbl>

1   2000     1 -73.58134  -92.78595

2   2000     2 -57.81334 -152.36983

3   2000     3 122.68758  153.55243

4   2000     4 450.24980  285.56374

5   2000     5 678.37867  384.42888

6   2000     6 792.68696  530.28694

7   2000     7 908.58795  452.31222

8   2000     8 710.69928  719.35225

9   2000     9 725.06079  914.93687

10  2000    10 770.60304  863.39337

# ... with 14 more rows

Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.

As mentioned in my comment above, you can also use the recast function from the reshape2-package:

library(reshape2)

recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

which will give you the same result.

edited Feb 7 '18 at 15:16

answered Oct 16 '15 at 10:19

Jaap

55k20117130

add a comment |

Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:

aggregate(

  x = df1[c("x1", "x2")],

  by = df1[c("year", "month")],

  FUN = sum, na.rm = TRUE

)

More generic use of aggregate's data.frame method:

Since we are providing a

data.frame as x and

a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

also with custom-made aggregation functions

For example like so:

colsToAggregate <- c("x1")

aggregateBy <- c("year", "month")

dummyaggfun <- function(v, na.rm = TRUE) {

  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))

}



aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)

edited Dec 28 '18 at 12:48

answered Dec 27 '18 at 15:18

Jozef

571110

Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45

add a comment |

Late to the party, but recently found another way to get the summary statistics.

library(psych) describe(data)

Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

answered Aug 15 '18 at 16:22

britt

427

add a comment |

protected by Jaap Jul 27 '16 at 18:39

Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).

Would you like to answer one of these unanswered questions instead?

6 Answers
6

active

oldest

votes

6 Answers
6

active

oldest

votes

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)

df_melt <- melt(df1, id = c("date", "year", "month"))

dcast(df_melt, year + month ~ variable, sum)

#  year month         x1           x2

1  2000     1  -80.83405 -224.9540159

2  2000     2 -223.76331 -288.2418017

3  2000     3 -188.83930 -481.5601913

4  2000     4 -197.47797 -473.7137420

5  2000     5 -259.07928 -372.4563522

edited Jun 16 '18 at 9:56

Striezel

2,36171326

answered Mar 15 '12 at 15:56

EDi

10.7k23649

6

The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17

add a comment |

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)

df_melt <- melt(df1, id = c("date", "year", "month"))

dcast(df_melt, year + month ~ variable, sum)

#  year month         x1           x2

1  2000     1  -80.83405 -224.9540159

2  2000     2 -223.76331 -288.2418017

3  2000     3 -188.83930 -481.5601913

4  2000     4 -197.47797 -473.7137420

5  2000     5 -259.07928 -372.4563522

edited Jun 16 '18 at 9:56

Striezel

2,36171326

answered Mar 15 '12 at 15:56

EDi

10.7k23649

6

The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17

add a comment |

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)

df_melt <- melt(df1, id = c("date", "year", "month"))

dcast(df_melt, year + month ~ variable, sum)

#  year month         x1           x2

1  2000     1  -80.83405 -224.9540159

2  2000     2 -223.76331 -288.2418017

3  2000     3 -188.83930 -481.5601913

4  2000     4 -197.47797 -473.7137420

5  2000     5 -259.07928 -372.4563522

edited Jun 16 '18 at 9:56

Striezel

2,36171326

answered Mar 15 '12 at 15:56

EDi

10.7k23649

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)

df_melt <- melt(df1, id = c("date", "year", "month"))

dcast(df_melt, year + month ~ variable, sum)

#  year month         x1           x2

1  2000     1  -80.83405 -224.9540159

2  2000     2 -223.76331 -288.2418017

3  2000     3 -188.83930 -481.5601913

4  2000     4 -197.47797 -473.7137420

5  2000     5 -259.07928 -372.4563522

edited Jun 16 '18 at 9:56

Striezel

2,36171326

answered Mar 15 '12 at 15:56

EDi

10.7k23649

edited Jun 16 '18 at 9:56

Striezel

2,36171326

edited Jun 16 '18 at 9:56

Striezel

2,36171326

edited Jun 16 '18 at 9:56

Striezel

2,36171326

answered Mar 15 '12 at 15:56

EDi

10.7k23649

answered Mar 15 '12 at 15:56

EDi

10.7k23649

answered Mar 15 '12 at 15:56

EDi

10.7k23649

6

The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17

add a comment |

6

The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17

The recast function (also from reshape2) integrates the melt and dcast function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17

add a comment |

173

Yes, in your formula, you can cbind the numeric variables to be aggregated:

aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)

   year month         x1          x2

1  2000     1   7.862002   -7.469298

2  2001     1 276.758209  474.384252

3  2000     2  13.122369 -128.122613

...

23 2000    12  63.436507  449.794454

24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.

edited Oct 22 '17 at 10:22

Henrik

40.8k992107

answered Mar 15 '12 at 15:50

Andrie

135k26347436

3

Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29

12

It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19

1

what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22

7

@ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53

4

What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15

|
show 3 more comments

173

Yes, in your formula, you can cbind the numeric variables to be aggregated:

aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)

   year month         x1          x2

1  2000     1   7.862002   -7.469298

2  2001     1 276.758209  474.384252

3  2000     2  13.122369 -128.122613

...

23 2000    12  63.436507  449.794454

24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.

edited Oct 22 '17 at 10:22

Henrik

40.8k992107

answered Mar 15 '12 at 15:50

Andrie

135k26347436

3

Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29

12

It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19

1

what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22

7

@ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53

4

What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15

|
show 3 more comments

173

Yes, in your formula, you can cbind the numeric variables to be aggregated:

aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)

   year month         x1          x2

1  2000     1   7.862002   -7.469298

2  2001     1 276.758209  474.384252

3  2000     2  13.122369 -128.122613

...

23 2000    12  63.436507  449.794454

24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.

edited Oct 22 '17 at 10:22

Henrik

40.8k992107

answered Mar 15 '12 at 15:50

Andrie

135k26347436

Yes, in your formula, you can cbind the numeric variables to be aggregated:

aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)

   year month         x1          x2

1  2000     1   7.862002   -7.469298

2  2001     1 276.758209  474.384252

3  2000     2  13.122369 -128.122613

...

23 2000    12  63.436507  449.794454

24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.

edited Oct 22 '17 at 10:22

Henrik

40.8k992107

answered Mar 15 '12 at 15:50

Andrie

135k26347436

edited Oct 22 '17 at 10:22

Henrik

40.8k992107

edited Oct 22 '17 at 10:22

Henrik

40.8k992107

edited Oct 22 '17 at 10:22

Henrik

40.8k992107

answered Mar 15 '12 at 15:50

Andrie

135k26347436

answered Mar 15 '12 at 15:50

Andrie

135k26347436

answered Mar 15 '12 at 15:50

Andrie

135k26347436

3

Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29

12

It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19

1

what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22

7

@ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53

4

What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15

|
show 3 more comments

3

Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29

12

It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19

1

what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22

7

@ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53

4

What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15

Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29

It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19

what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22

@ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53

What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15

|
show 3 more comments

Using the data.table package, which is fast (useful for larger datasets)

https://github.com/Rdatatable/data.table/wiki

library(data.table)

df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]

setDF(df2) # convert back to dataframe

Using the plyr package

require(plyr)

df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package
(column headings are messy in my example though)

# need to detach plyr because plyr and Hmisc both have a summarize()

detach(package:plyr)

require(Hmisc)

df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))

edited Sep 26 '15 at 9:25

Jaap

55k20117130

answered Mar 15 '12 at 23:00

numbercruncher

74658

why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
– Bulat
Oct 13 '18 at 12:00

add a comment |

Using the data.table package, which is fast (useful for larger datasets)

https://github.com/Rdatatable/data.table/wiki

library(data.table)

df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]

setDF(df2) # convert back to dataframe

Using the plyr package

require(plyr)

df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package
(column headings are messy in my example though)

# need to detach plyr because plyr and Hmisc both have a summarize()

detach(package:plyr)

require(Hmisc)

df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))

edited Sep 26 '15 at 9:25

Jaap

55k20117130

answered Mar 15 '12 at 23:00

numbercruncher

74658

why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
– Bulat
Oct 13 '18 at 12:00

add a comment |

Using the data.table package, which is fast (useful for larger datasets)

https://github.com/Rdatatable/data.table/wiki

library(data.table)

df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]

setDF(df2) # convert back to dataframe

Using the plyr package

require(plyr)

df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package
(column headings are messy in my example though)

# need to detach plyr because plyr and Hmisc both have a summarize()

detach(package:plyr)

require(Hmisc)

df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))

edited Sep 26 '15 at 9:25

Jaap

55k20117130

answered Mar 15 '12 at 23:00

numbercruncher

74658

Using the data.table package, which is fast (useful for larger datasets)

https://github.com/Rdatatable/data.table/wiki

library(data.table)

df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]

setDF(df2) # convert back to dataframe

Using the plyr package

require(plyr)

df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package
(column headings are messy in my example though)

# need to detach plyr because plyr and Hmisc both have a summarize()

detach(package:plyr)

require(Hmisc)

df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))

edited Sep 26 '15 at 9:25

Jaap

55k20117130

answered Mar 15 '12 at 23:00

numbercruncher

74658

edited Sep 26 '15 at 9:25

Jaap

55k20117130

edited Sep 26 '15 at 9:25

Jaap

55k20117130

edited Sep 26 '15 at 9:25

Jaap

55k20117130

answered Mar 15 '12 at 23:00

numbercruncher

74658

answered Mar 15 '12 at 23:00

numbercruncher

74658

answered Mar 15 '12 at 23:00

numbercruncher

74658

why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
– Bulat
Oct 13 '18 at 12:00

add a comment |

why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
– Bulat
Oct 13 '18 at 12:00

why not do this for data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month) ?
– Bulat
Oct 13 '18 at 12:00

add a comment |

With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:

library(dplyr)

# summarising all non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)



# summarising a specific set of non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)



# summarising a specific set of non-grouping variables based on condition (class)

df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)

The result of the latter two options:

    year month        x1         x2

   <dbl> <dbl>     <dbl>      <dbl>

1   2000     1 -73.58134  -92.78595

2   2000     2 -57.81334 -152.36983

3   2000     3 122.68758  153.55243

4   2000     4 450.24980  285.56374

5   2000     5 678.37867  384.42888

6   2000     6 792.68696  530.28694

7   2000     7 908.58795  452.31222

8   2000     8 710.69928  719.35225

9   2000     9 725.06079  914.93687

10  2000    10 770.60304  863.39337

# ... with 14 more rows

Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.

As mentioned in my comment above, you can also use the recast function from the reshape2-package:

library(reshape2)

recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

which will give you the same result.

edited Feb 7 '18 at 15:16

answered Oct 16 '15 at 10:19

Jaap

55k20117130

add a comment |

With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:

library(dplyr)

# summarising all non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)



# summarising a specific set of non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)



# summarising a specific set of non-grouping variables based on condition (class)

df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)

The result of the latter two options:

    year month        x1         x2

   <dbl> <dbl>     <dbl>      <dbl>

1   2000     1 -73.58134  -92.78595

2   2000     2 -57.81334 -152.36983

3   2000     3 122.68758  153.55243

4   2000     4 450.24980  285.56374

5   2000     5 678.37867  384.42888

6   2000     6 792.68696  530.28694

7   2000     7 908.58795  452.31222

8   2000     8 710.69928  719.35225

9   2000     9 725.06079  914.93687

10  2000    10 770.60304  863.39337

# ... with 14 more rows

Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.

As mentioned in my comment above, you can also use the recast function from the reshape2-package:

library(reshape2)

recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

which will give you the same result.

edited Feb 7 '18 at 15:16

answered Oct 16 '15 at 10:19

Jaap

55k20117130

add a comment |

With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:

library(dplyr)

# summarising all non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)



# summarising a specific set of non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)



# summarising a specific set of non-grouping variables based on condition (class)

df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)

The result of the latter two options:

    year month        x1         x2

   <dbl> <dbl>     <dbl>      <dbl>

1   2000     1 -73.58134  -92.78595

2   2000     2 -57.81334 -152.36983

3   2000     3 122.68758  153.55243

4   2000     4 450.24980  285.56374

5   2000     5 678.37867  384.42888

6   2000     6 792.68696  530.28694

7   2000     7 908.58795  452.31222

8   2000     8 710.69928  719.35225

9   2000     9 725.06079  914.93687

10  2000    10 770.60304  863.39337

# ... with 14 more rows

Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.

As mentioned in my comment above, you can also use the recast function from the reshape2-package:

library(reshape2)

recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

which will give you the same result.

edited Feb 7 '18 at 15:16

answered Oct 16 '15 at 10:19

Jaap

55k20117130

With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:

library(dplyr)

# summarising all non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)



# summarising a specific set of non-grouping variables

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)

df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)



# summarising a specific set of non-grouping variables based on condition (class)

df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)

The result of the latter two options:

    year month        x1         x2

   <dbl> <dbl>     <dbl>      <dbl>

1   2000     1 -73.58134  -92.78595

2   2000     2 -57.81334 -152.36983

3   2000     3 122.68758  153.55243

4   2000     4 450.24980  285.56374

5   2000     5 678.37867  384.42888

6   2000     6 792.68696  530.28694

7   2000     7 908.58795  452.31222

8   2000     8 710.69928  719.35225

9   2000     9 725.06079  914.93687

10  2000    10 770.60304  863.39337

# ... with 14 more rows

Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.

As mentioned in my comment above, you can also use the recast function from the reshape2-package:

library(reshape2)

recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

which will give you the same result.

edited Feb 7 '18 at 15:16

answered Oct 16 '15 at 10:19

Jaap

55k20117130

edited Feb 7 '18 at 15:16

answered Oct 16 '15 at 10:19

Jaap

55k20117130

answered Oct 16 '15 at 10:19

Jaap

55k20117130

answered Oct 16 '15 at 10:19

Jaap

55k20117130

add a comment |

Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:

aggregate(

  x = df1[c("x1", "x2")],

  by = df1[c("year", "month")],

  FUN = sum, na.rm = TRUE

)

More generic use of aggregate's data.frame method:

Since we are providing a

data.frame as x and

a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

also with custom-made aggregation functions

For example like so:

colsToAggregate <- c("x1")

aggregateBy <- c("year", "month")

dummyaggfun <- function(v, na.rm = TRUE) {

  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))

}



aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)

edited Dec 28 '18 at 12:48

answered Dec 27 '18 at 15:18

Jozef

571110

Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45

add a comment |

Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:

aggregate(

  x = df1[c("x1", "x2")],

  by = df1[c("year", "month")],

  FUN = sum, na.rm = TRUE

)

More generic use of aggregate's data.frame method:

Since we are providing a

data.frame as x and

a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

also with custom-made aggregation functions

For example like so:

colsToAggregate <- c("x1")

aggregateBy <- c("year", "month")

dummyaggfun <- function(v, na.rm = TRUE) {

  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))

}



aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)

edited Dec 28 '18 at 12:48

answered Dec 27 '18 at 15:18

Jozef

571110

Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45

add a comment |

Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:

aggregate(

  x = df1[c("x1", "x2")],

  by = df1[c("year", "month")],

  FUN = sum, na.rm = TRUE

)

More generic use of aggregate's data.frame method:

Since we are providing a

data.frame as x and

a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

also with custom-made aggregation functions

For example like so:

colsToAggregate <- c("x1")

aggregateBy <- c("year", "month")

dummyaggfun <- function(v, na.rm = TRUE) {

  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))

}



aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)

edited Dec 28 '18 at 12:48

answered Dec 27 '18 at 15:18

Jozef

571110

Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:

aggregate(

  x = df1[c("x1", "x2")],

  by = df1[c("year", "month")],

  FUN = sum, na.rm = TRUE

)

More generic use of aggregate's data.frame method:

Since we are providing a

data.frame as x and

a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple

also with custom-made aggregation functions

For example like so:

colsToAggregate <- c("x1")

aggregateBy <- c("year", "month")

dummyaggfun <- function(v, na.rm = TRUE) {

  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))

}



aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)

edited Dec 28 '18 at 12:48

answered Dec 27 '18 at 15:18

Jozef

571110

edited Dec 28 '18 at 12:48

answered Dec 27 '18 at 15:18

Jozef

571110

answered Dec 27 '18 at 15:18

Jozef

571110

answered Dec 27 '18 at 15:18

Jozef

571110

Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45

add a comment |

Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45

Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45

add a comment |

Late to the party, but recently found another way to get the summary statistics.

library(psych) describe(data)

Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

answered Aug 15 '18 at 16:22

britt

427

add a comment |

Late to the party, but recently found another way to get the summary statistics.

library(psych) describe(data)

Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

answered Aug 15 '18 at 16:22

britt

427

add a comment |

Late to the party, but recently found another way to get the summary statistics.

library(psych) describe(data)

Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

answered Aug 15 '18 at 16:22

britt

427

Late to the party, but recently found another way to get the summary statistics.

library(psych) describe(data)

Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

answered Aug 15 '18 at 16:22

britt

427

answered Aug 15 '18 at 16:22

britt

427

answered Aug 15 '18 at 16:22

britt

427

answered Aug 15 '18 at 16:22

britt

427

add a comment |

protected by Jaap Jul 27 '16 at 18:39

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtjtk