Aggregate / summarize multiple variables per group (e.g. sum, mean)
From a data frame, is there a easy way to aggregate (sum
, mean
, max
et c) multiple variables simultaneously?
Below are some sample data:
library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)
I would like to simultaneously aggregate the x1
and x2
variables from the df2
data frame by year and month. The following code aggregates the x1
variable, but is it also possible to simultaneously aggregate the x2
variable?
### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
Any suggestions would be greatly appreciated.
r dataframe data.table aggregate r-faq
add a comment |
From a data frame, is there a easy way to aggregate (sum
, mean
, max
et c) multiple variables simultaneously?
Below are some sample data:
library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)
I would like to simultaneously aggregate the x1
and x2
variables from the df2
data frame by year and month. The following code aggregates the x1
variable, but is it also possible to simultaneously aggregate the x2
variable?
### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
Any suggestions would be greatly appreciated.
r dataframe data.table aggregate r-faq
1
@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10
add a comment |
From a data frame, is there a easy way to aggregate (sum
, mean
, max
et c) multiple variables simultaneously?
Below are some sample data:
library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)
I would like to simultaneously aggregate the x1
and x2
variables from the df2
data frame by year and month. The following code aggregates the x1
variable, but is it also possible to simultaneously aggregate the x2
variable?
### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
Any suggestions would be greatly appreciated.
r dataframe data.table aggregate r-faq
From a data frame, is there a easy way to aggregate (sum
, mean
, max
et c) multiple variables simultaneously?
Below are some sample data:
library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)
I would like to simultaneously aggregate the x1
and x2
variables from the df2
data frame by year and month. The following code aggregates the x1
variable, but is it also possible to simultaneously aggregate the x2
variable?
### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
Any suggestions would be greatly appreciated.
r dataframe data.table aggregate r-faq
r dataframe data.table aggregate r-faq
edited Apr 8 '18 at 20:24
Henrik
40.8k992107
40.8k992107
asked Mar 15 '12 at 15:44
MikeTP
2,561103552
2,561103552
1
@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10
add a comment |
1
@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10
1
1
@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10
@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10
add a comment |
6 Answers
6
active
oldest
votes
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
6
Therecast
function (also fromreshape2
) integrates themelt
anddcast
function in one go for tasks like this:recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17
add a comment |
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
3
Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29
12
It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19
1
what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22
7
@ClockSlave, then you need to just use.
on the LHS.aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
. In this example,sum
for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53
4
What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15
|
show 3 more comments
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package
(column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
why not do this for data.table option:dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)
?
– Bulat
Oct 13 '18 at 12:00
add a comment |
With the dplyr
package, you can use summarise_all
, summarise_at
or summarise_if
functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:
library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)
# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)
# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
The result of the latter two options:
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows
Note: summarise_each
is deprecated in favor of summarise_all
, summarise_at
and summarise_if
.
As mentioned in my comment above, you can also use the recast
function from the reshape2
-package:
library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
which will give you the same result.
add a comment |
Interestingly, base R aggregate
's data.frame
method is not showcased here, above the formula interface is used, so for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
More generic use of aggregate's data.frame method:
Since we are providing a
data.frame
asx
and- a
list
(data.frame
is also alist
) asby
, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple - also with custom-made aggregation functions
For example like so:
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45
add a comment |
Late to the party, but recently found another way to get the summary statistics.
library(psych)
describe(data)
Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.
add a comment |
protected by Jaap Jul 27 '16 at 18:39
Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).
Would you like to answer one of these unanswered questions instead?
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
6
Therecast
function (also fromreshape2
) integrates themelt
anddcast
function in one go for tasks like this:recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17
add a comment |
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
6
Therecast
function (also fromreshape2
) integrates themelt
anddcast
function in one go for tasks like this:recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17
add a comment |
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
edited Jun 16 '18 at 9:56
Striezel
2,36171326
2,36171326
answered Mar 15 '12 at 15:56
EDi
10.7k23649
10.7k23649
6
Therecast
function (also fromreshape2
) integrates themelt
anddcast
function in one go for tasks like this:recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17
add a comment |
6
Therecast
function (also fromreshape2
) integrates themelt
anddcast
function in one go for tasks like this:recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17
6
6
The
recast
function (also from reshape2
) integrates the melt
and dcast
function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17
The
recast
function (also from reshape2
) integrates the melt
and dcast
function in one go for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
– Jaap
May 13 '16 at 6:17
add a comment |
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
3
Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29
12
It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19
1
what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22
7
@ClockSlave, then you need to just use.
on the LHS.aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
. In this example,sum
for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53
4
What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15
|
show 3 more comments
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
3
Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29
12
It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19
1
what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22
7
@ClockSlave, then you need to just use.
on the LHS.aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
. In this example,sum
for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53
4
What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15
|
show 3 more comments
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
Yes, in your formula
, you can cbind
the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate
, the formula
argument and the examples.
edited Oct 22 '17 at 10:22
Henrik
40.8k992107
40.8k992107
answered Mar 15 '12 at 15:50
Andrie
135k26347436
135k26347436
3
Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29
12
It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19
1
what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22
7
@ClockSlave, then you need to just use.
on the LHS.aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
. In this example,sum
for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53
4
What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15
|
show 3 more comments
3
Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29
12
It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19
1
what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22
7
@ClockSlave, then you need to just use.
on the LHS.aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
. In this example,sum
for "date" doesn't make sense though....
– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53
4
What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15
3
3
Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29
Is it possible for the cbind to use dynamic variables?
– pdb
Nov 13 '15 at 5:29
12
12
It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19
It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
– pdb
Nov 13 '15 at 6:19
1
1
what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22
what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month)
– Clock Slave
Mar 16 '16 at 11:22
7
7
@ClockSlave, then you need to just use
.
on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
. In this example, sum
for "date" doesn't make sense though....– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53
@ClockSlave, then you need to just use
.
on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE)
. In this example, sum
for "date" doesn't make sense though....– A5C1D2H2I1M1N2O1R2T1
Mar 21 '16 at 3:53
4
4
What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15
What if I don't want two variables but two functions?. For example mean and sd.
– skan
Apr 14 '16 at 19:15
|
show 3 more comments
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package
(column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
why not do this for data.table option:dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)
?
– Bulat
Oct 13 '18 at 12:00
add a comment |
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package
(column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
why not do this for data.table option:dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)
?
– Bulat
Oct 13 '18 at 12:00
add a comment |
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package
(column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
Using the data.table
package, which is fast (useful for larger datasets)
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe
Using the plyr package
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package
(column headings are messy in my example though)
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
edited Sep 26 '15 at 9:25
Jaap
55k20117130
55k20117130
answered Mar 15 '12 at 23:00
numbercruncher
74658
74658
why not do this for data.table option:dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)
?
– Bulat
Oct 13 '18 at 12:00
add a comment |
why not do this for data.table option:dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)
?
– Bulat
Oct 13 '18 at 12:00
why not do this for data.table option:
dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)
?– Bulat
Oct 13 '18 at 12:00
why not do this for data.table option:
dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)
?– Bulat
Oct 13 '18 at 12:00
add a comment |
With the dplyr
package, you can use summarise_all
, summarise_at
or summarise_if
functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:
library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)
# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)
# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
The result of the latter two options:
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows
Note: summarise_each
is deprecated in favor of summarise_all
, summarise_at
and summarise_if
.
As mentioned in my comment above, you can also use the recast
function from the reshape2
-package:
library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
which will give you the same result.
add a comment |
With the dplyr
package, you can use summarise_all
, summarise_at
or summarise_if
functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:
library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)
# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)
# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
The result of the latter two options:
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows
Note: summarise_each
is deprecated in favor of summarise_all
, summarise_at
and summarise_if
.
As mentioned in my comment above, you can also use the recast
function from the reshape2
-package:
library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
which will give you the same result.
add a comment |
With the dplyr
package, you can use summarise_all
, summarise_at
or summarise_if
functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:
library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)
# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)
# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
The result of the latter two options:
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows
Note: summarise_each
is deprecated in favor of summarise_all
, summarise_at
and summarise_if
.
As mentioned in my comment above, you can also use the recast
function from the reshape2
-package:
library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
which will give you the same result.
With the dplyr
package, you can use summarise_all
, summarise_at
or summarise_if
functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:
library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)
# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)
# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)
The result of the latter two options:
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows
Note: summarise_each
is deprecated in favor of summarise_all
, summarise_at
and summarise_if
.
As mentioned in my comment above, you can also use the recast
function from the reshape2
-package:
library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
which will give you the same result.
edited Feb 7 '18 at 15:16
answered Oct 16 '15 at 10:19
Jaap
55k20117130
55k20117130
add a comment |
add a comment |
Interestingly, base R aggregate
's data.frame
method is not showcased here, above the formula interface is used, so for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
More generic use of aggregate's data.frame method:
Since we are providing a
data.frame
asx
and- a
list
(data.frame
is also alist
) asby
, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple - also with custom-made aggregation functions
For example like so:
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45
add a comment |
Interestingly, base R aggregate
's data.frame
method is not showcased here, above the formula interface is used, so for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
More generic use of aggregate's data.frame method:
Since we are providing a
data.frame
asx
and- a
list
(data.frame
is also alist
) asby
, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple - also with custom-made aggregation functions
For example like so:
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45
add a comment |
Interestingly, base R aggregate
's data.frame
method is not showcased here, above the formula interface is used, so for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
More generic use of aggregate's data.frame method:
Since we are providing a
data.frame
asx
and- a
list
(data.frame
is also alist
) asby
, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple - also with custom-made aggregation functions
For example like so:
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
Interestingly, base R aggregate
's data.frame
method is not showcased here, above the formula interface is used, so for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
More generic use of aggregate's data.frame method:
Since we are providing a
data.frame
asx
and- a
list
(data.frame
is also alist
) asby
, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple - also with custom-made aggregation functions
For example like so:
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
edited Dec 28 '18 at 12:48
answered Dec 27 '18 at 15:18
Jozef
571110
571110
Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45
add a comment |
Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45
Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45
Nice & necessary addition!
– Jaap
Dec 28 '18 at 12:45
add a comment |
Late to the party, but recently found another way to get the summary statistics.
library(psych)
describe(data)
Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.
add a comment |
Late to the party, but recently found another way to get the summary statistics.
library(psych)
describe(data)
Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.
add a comment |
Late to the party, but recently found another way to get the summary statistics.
library(psych)
describe(data)
Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.
Late to the party, but recently found another way to get the summary statistics.
library(psych)
describe(data)
Will output:
mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.
answered Aug 15 '18 at 16:22
britt
427
427
add a comment |
add a comment |
protected by Jaap Jul 27 '16 at 18:39
Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).
Would you like to answer one of these unanswered questions instead?
1
@Jaap This question is not a duplicate question with the following link. It is because it is a question of what to do with "dplyr". Please cancel the duplicate mark. r - Summarizing multiple columns with dplyr? - Stack Overflow stackoverflow.com/questions/21644848/…
– Keiku
Nov 18 '17 at 10:10