For each row return the column name of the largest value
I have a roster of employees, and I need to know at what department they are in most often. It is trivial to tabulate employee ID against department name, but it is trickier to return the department name, rather than the number of roster counts, from the frequency table. A simple example below (column names = departments, row names = employee ids).
DF <- matrix(sample(1:9,9),ncol=3,nrow=3)
DF <- as.data.frame.matrix(DF)
> DF
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
Now how do I get
> DF2
RE
1 V3
2 V1
3 V2
r
add a comment |
I have a roster of employees, and I need to know at what department they are in most often. It is trivial to tabulate employee ID against department name, but it is trickier to return the department name, rather than the number of roster counts, from the frequency table. A simple example below (column names = departments, row names = employee ids).
DF <- matrix(sample(1:9,9),ncol=3,nrow=3)
DF <- as.data.frame.matrix(DF)
> DF
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
Now how do I get
> DF2
RE
1 V3
2 V1
3 V2
r
how big is your actual data?
– Arun
Jul 18 '13 at 23:51
1
@Arun > dim(test) [1] 26746 18
– dmvianna
Jul 18 '13 at 23:57
4
An interesting generalization would be the largest n values' column names per row
– Hack-R
Sep 7 '16 at 0:07
add a comment |
I have a roster of employees, and I need to know at what department they are in most often. It is trivial to tabulate employee ID against department name, but it is trickier to return the department name, rather than the number of roster counts, from the frequency table. A simple example below (column names = departments, row names = employee ids).
DF <- matrix(sample(1:9,9),ncol=3,nrow=3)
DF <- as.data.frame.matrix(DF)
> DF
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
Now how do I get
> DF2
RE
1 V3
2 V1
3 V2
r
I have a roster of employees, and I need to know at what department they are in most often. It is trivial to tabulate employee ID against department name, but it is trickier to return the department name, rather than the number of roster counts, from the frequency table. A simple example below (column names = departments, row names = employee ids).
DF <- matrix(sample(1:9,9),ncol=3,nrow=3)
DF <- as.data.frame.matrix(DF)
> DF
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
Now how do I get
> DF2
RE
1 V3
2 V1
3 V2
r
r
edited Jul 14 '17 at 13:17
zx8754
29.8k76399
29.8k76399
asked Jul 18 '13 at 23:45
dmviannadmvianna
6,704125789
6,704125789
how big is your actual data?
– Arun
Jul 18 '13 at 23:51
1
@Arun > dim(test) [1] 26746 18
– dmvianna
Jul 18 '13 at 23:57
4
An interesting generalization would be the largest n values' column names per row
– Hack-R
Sep 7 '16 at 0:07
add a comment |
how big is your actual data?
– Arun
Jul 18 '13 at 23:51
1
@Arun > dim(test) [1] 26746 18
– dmvianna
Jul 18 '13 at 23:57
4
An interesting generalization would be the largest n values' column names per row
– Hack-R
Sep 7 '16 at 0:07
how big is your actual data?
– Arun
Jul 18 '13 at 23:51
how big is your actual data?
– Arun
Jul 18 '13 at 23:51
1
1
@Arun > dim(test) [1] 26746 18
– dmvianna
Jul 18 '13 at 23:57
@Arun > dim(test) [1] 26746 18
– dmvianna
Jul 18 '13 at 23:57
4
4
An interesting generalization would be the largest n values' column names per row
– Hack-R
Sep 7 '16 at 0:07
An interesting generalization would be the largest n values' column names per row
– Hack-R
Sep 7 '16 at 0:07
add a comment |
6 Answers
6
active
oldest
votes
One option using your data (for future reference, use set.seed() to make examples using sample reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
A faster solution than using apply might be max.col:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method can be any of "random" "first" or "last"
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2
If I have two equal columns I usually just pick the first. These are border cases which do not upset my statistical analysis.
– dmvianna
Jul 18 '13 at 23:59
@dmvianna - usingwhich.maxwill be fine then.
– thelatemail
Jul 19 '13 at 0:03
I'm assuming the order is preserved, so I can create a new column with this vector that will align correctly to the employees IDs. Is that correct?
– dmvianna
Jul 19 '13 at 0:05
applyconverts thedata.frametomatrixinternally. You may not see a performance difference on these dimensions though.
– Arun
Jul 19 '13 at 0:07
1
@PankajKaundal - assuming distinct values, how about thiscolnames(DF)[max.col(replace(DF, cbind(seq_len(nrow(DF)), max.col(DF,ties.method="first")), -Inf), "first")]
– thelatemail
Apr 19 '16 at 22:14
|
show 2 more comments
If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It's much easier if you'd rather want the last maximum. Nevertheless, it's not that complicated and it's fast!
Here I've generated data of your dimensions (26746 * 18).
Data
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table answer:
require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
Benchmarking:
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
Edit: if any of the max ids is okay, then:
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
I actually dont' care if it is the first or last maximum. I'm going for simplicity first, but I'm sure a data.table solution will come handy in the future, thanks!
– dmvianna
Jul 19 '13 at 0:48
add a comment |
Based on the above suggestions, the following data.table solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time(
DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#> user system elapsed
#> 0.15 0.06 0.21
DT
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#> 1: 7 4 1 2 3 7 6 6 6 1 V1
#> 2: 4 6 9 10 6 2 7 7 1 3 V4
#> 3: 3 4 9 8 9 9 8 8 6 7 V3
#> 4: 4 8 8 9 7 5 9 2 7 1 V4
#> 5: 4 3 9 10 2 7 9 6 6 9 V4
#> ---
#> 999996: 4 6 10 5 4 7 3 8 2 8 V3
#> 999997: 8 7 6 6 3 10 2 3 10 1 V6
#> 999998: 2 3 2 7 4 7 5 2 7 3 V4
#> 999999: 8 10 3 2 3 4 5 1 1 4 V2
#> 1000000: 10 4 2 6 6 2 8 4 7 4 V1
And also comes with the advantage that can always specify what columns .SD should consider by mentioning them in .SDcols:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD:
DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
I had a similar requirement but want to get the column name having the minimum value for each row.....we don't seem to have min.col in R.....would you know what would be the equivalent solution?
– user1412
Feb 26 '17 at 18:38
Hi @user1412. Thanks for your interesting question. I don't have any idea right now other than using thewhich.minin something that would look like:DT[, MIN := colnames(.SD)[apply(.SD,1,which.min)]]orDT[, MIN2 := colnames(.SD)[which.min(.SD)], by = 1:nrow(DT)]on the dummy data above. This doesn't consider ties and returns only the first minimum. Maybe consider asking a separate question. I would be curious as well what other answers you would get.
– Valentin
Feb 28 '17 at 10:09
1
A trick to get minimum column is sending the negative of the data.frame into max.col, like:colnames(.SD)[max.col(-.SD, ties.method="first")].
– lwshang
Mar 9 '18 at 14:08
add a comment |
One solution could be to reshape the date from wide to long putting all the departments in one column and counts in another, group by the employer id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.
library(tidyverse)
# sample data frame with a tie
df <- data_frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))
# If you aren't worried about ties:
df %>%
rownames_to_column('id') %>% # creates an ID number
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(cnt))
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
# If you're worried about keeping ties:
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
arrange(id)
# A tibble: 4 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
4 3 V3 5.
# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
mutate(dept_rank = rank(-cnt, ties.method = "first")) %>% # or 'last'
filter(dept_rank == 1) %>%
select(-dept_rank)
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 2 V1 8.
2 3 V2 5.
3 1 V3 9.
# if you wanted to keep the original wide data frame
df %>%
rownames_to_column('id') %>%
left_join(
df %>%
rownames_to_column('id') %>%
gather(max_dept, max_cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(max_cnt)),
by = 'id'
)
# A tibble: 3 x 6
id V1 V2 V3 max_dept max_cnt
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2. 7. 9. V3 9.
2 2 8. 3. 6. V1 8.
3 3 1. 5. 5. V2 5.
add a comment |
A simple for loop can also be handy:
> df<-data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
> df
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
> df2<-data.frame()
> for (i in 1:nrow(df)){
+ df2[i,1]<-colnames(df[which.max(df[i,])])
+ }
> df2
V1
1 V3
2 V1
3 V2
add a comment |
A dplyr solution:
Idea:
- add rowids as a column
- reshape to long format
- filter for max in each group
Code:
DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
filter(rank(-value) == 1)
Result:
# A tibble: 3 x 3
# Groups: rowname [3]
rowname column value
<chr> <chr> <dbl>
1 2 V1 8
2 3 V2 5
3 1 V3 9
This approach can be easily extended to get the top n columns.
Example for n=2:
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
mutate(rk = rank(-value)) %>%
filter(rk <= 2) %>%
arrange(rowname, rk)
Result:
# A tibble: 6 x 4
# Groups: rowname [3]
rowname column value rk
<chr> <chr> <dbl> <dbl>
1 1 V3 9 1
2 1 V2 7 2
3 2 V1 8 1
4 2 V3 6 2
5 3 V2 5 1
6 3 V3 4 2
Could you comment on the difference between this approach and sbha's answer above? They look about the same to me.
– Gregor
Nov 6 '18 at 14:46
add a comment |
protected by zx8754 Jul 14 '17 at 13:16
Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).
Would you like to answer one of these unanswered questions instead?
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
One option using your data (for future reference, use set.seed() to make examples using sample reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
A faster solution than using apply might be max.col:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method can be any of "random" "first" or "last"
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2
If I have two equal columns I usually just pick the first. These are border cases which do not upset my statistical analysis.
– dmvianna
Jul 18 '13 at 23:59
@dmvianna - usingwhich.maxwill be fine then.
– thelatemail
Jul 19 '13 at 0:03
I'm assuming the order is preserved, so I can create a new column with this vector that will align correctly to the employees IDs. Is that correct?
– dmvianna
Jul 19 '13 at 0:05
applyconverts thedata.frametomatrixinternally. You may not see a performance difference on these dimensions though.
– Arun
Jul 19 '13 at 0:07
1
@PankajKaundal - assuming distinct values, how about thiscolnames(DF)[max.col(replace(DF, cbind(seq_len(nrow(DF)), max.col(DF,ties.method="first")), -Inf), "first")]
– thelatemail
Apr 19 '16 at 22:14
|
show 2 more comments
One option using your data (for future reference, use set.seed() to make examples using sample reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
A faster solution than using apply might be max.col:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method can be any of "random" "first" or "last"
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2
If I have two equal columns I usually just pick the first. These are border cases which do not upset my statistical analysis.
– dmvianna
Jul 18 '13 at 23:59
@dmvianna - usingwhich.maxwill be fine then.
– thelatemail
Jul 19 '13 at 0:03
I'm assuming the order is preserved, so I can create a new column with this vector that will align correctly to the employees IDs. Is that correct?
– dmvianna
Jul 19 '13 at 0:05
applyconverts thedata.frametomatrixinternally. You may not see a performance difference on these dimensions though.
– Arun
Jul 19 '13 at 0:07
1
@PankajKaundal - assuming distinct values, how about thiscolnames(DF)[max.col(replace(DF, cbind(seq_len(nrow(DF)), max.col(DF,ties.method="first")), -Inf), "first")]
– thelatemail
Apr 19 '16 at 22:14
|
show 2 more comments
One option using your data (for future reference, use set.seed() to make examples using sample reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
A faster solution than using apply might be max.col:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method can be any of "random" "first" or "last"
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2
One option using your data (for future reference, use set.seed() to make examples using sample reproducible):
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
colnames(DF)[apply(DF,1,which.max)]
[1] "V3" "V1" "V2"
A faster solution than using apply might be max.col:
colnames(DF)[max.col(DF,ties.method="first")]
#[1] "V3" "V1" "V2"
...where ties.method can be any of "random" "first" or "last"
This of course causes issues if you happen to have two columns which are equal to the maximum. I'm not sure what you want to do in that instance as you will have more than one result for some rows. E.g.:
DF <- data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(7,6,4))
apply(DF,1,function(x) which(x==max(x)))
[[1]]
V2 V3
2 3
[[2]]
V1
1
[[3]]
V2
2
edited Sep 30 '14 at 22:49
answered Jul 18 '13 at 23:49
thelatemailthelatemail
67.4k882150
67.4k882150
If I have two equal columns I usually just pick the first. These are border cases which do not upset my statistical analysis.
– dmvianna
Jul 18 '13 at 23:59
@dmvianna - usingwhich.maxwill be fine then.
– thelatemail
Jul 19 '13 at 0:03
I'm assuming the order is preserved, so I can create a new column with this vector that will align correctly to the employees IDs. Is that correct?
– dmvianna
Jul 19 '13 at 0:05
applyconverts thedata.frametomatrixinternally. You may not see a performance difference on these dimensions though.
– Arun
Jul 19 '13 at 0:07
1
@PankajKaundal - assuming distinct values, how about thiscolnames(DF)[max.col(replace(DF, cbind(seq_len(nrow(DF)), max.col(DF,ties.method="first")), -Inf), "first")]
– thelatemail
Apr 19 '16 at 22:14
|
show 2 more comments
If I have two equal columns I usually just pick the first. These are border cases which do not upset my statistical analysis.
– dmvianna
Jul 18 '13 at 23:59
@dmvianna - usingwhich.maxwill be fine then.
– thelatemail
Jul 19 '13 at 0:03
I'm assuming the order is preserved, so I can create a new column with this vector that will align correctly to the employees IDs. Is that correct?
– dmvianna
Jul 19 '13 at 0:05
applyconverts thedata.frametomatrixinternally. You may not see a performance difference on these dimensions though.
– Arun
Jul 19 '13 at 0:07
1
@PankajKaundal - assuming distinct values, how about thiscolnames(DF)[max.col(replace(DF, cbind(seq_len(nrow(DF)), max.col(DF,ties.method="first")), -Inf), "first")]
– thelatemail
Apr 19 '16 at 22:14
If I have two equal columns I usually just pick the first. These are border cases which do not upset my statistical analysis.
– dmvianna
Jul 18 '13 at 23:59
If I have two equal columns I usually just pick the first. These are border cases which do not upset my statistical analysis.
– dmvianna
Jul 18 '13 at 23:59
@dmvianna - using
which.max will be fine then.– thelatemail
Jul 19 '13 at 0:03
@dmvianna - using
which.max will be fine then.– thelatemail
Jul 19 '13 at 0:03
I'm assuming the order is preserved, so I can create a new column with this vector that will align correctly to the employees IDs. Is that correct?
– dmvianna
Jul 19 '13 at 0:05
I'm assuming the order is preserved, so I can create a new column with this vector that will align correctly to the employees IDs. Is that correct?
– dmvianna
Jul 19 '13 at 0:05
apply converts the data.frame to matrix internally. You may not see a performance difference on these dimensions though.– Arun
Jul 19 '13 at 0:07
apply converts the data.frame to matrix internally. You may not see a performance difference on these dimensions though.– Arun
Jul 19 '13 at 0:07
1
1
@PankajKaundal - assuming distinct values, how about this
colnames(DF)[max.col(replace(DF, cbind(seq_len(nrow(DF)), max.col(DF,ties.method="first")), -Inf), "first")]– thelatemail
Apr 19 '16 at 22:14
@PankajKaundal - assuming distinct values, how about this
colnames(DF)[max.col(replace(DF, cbind(seq_len(nrow(DF)), max.col(DF,ties.method="first")), -Inf), "first")]– thelatemail
Apr 19 '16 at 22:14
|
show 2 more comments
If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It's much easier if you'd rather want the last maximum. Nevertheless, it's not that complicated and it's fast!
Here I've generated data of your dimensions (26746 * 18).
Data
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table answer:
require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
Benchmarking:
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
Edit: if any of the max ids is okay, then:
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
I actually dont' care if it is the first or last maximum. I'm going for simplicity first, but I'm sure a data.table solution will come handy in the future, thanks!
– dmvianna
Jul 19 '13 at 0:48
add a comment |
If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It's much easier if you'd rather want the last maximum. Nevertheless, it's not that complicated and it's fast!
Here I've generated data of your dimensions (26746 * 18).
Data
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table answer:
require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
Benchmarking:
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
Edit: if any of the max ids is okay, then:
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
I actually dont' care if it is the first or last maximum. I'm going for simplicity first, but I'm sure a data.table solution will come handy in the future, thanks!
– dmvianna
Jul 19 '13 at 0:48
add a comment |
If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It's much easier if you'd rather want the last maximum. Nevertheless, it's not that complicated and it's fast!
Here I've generated data of your dimensions (26746 * 18).
Data
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table answer:
require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
Benchmarking:
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
Edit: if any of the max ids is okay, then:
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
If you're interested in a data.table solution, here's one. It's a bit tricky since you prefer to get the id for the first maximum. It's much easier if you'd rather want the last maximum. Nevertheless, it's not that complicated and it's fast!
Here I've generated data of your dimensions (26746 * 18).
Data
set.seed(45)
DF <- data.frame(matrix(sample(10, 26746*18, TRUE), ncol=18))
data.table answer:
require(data.table)
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
Benchmarking:
# data.table solution
system.time({
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid), DT[J(unique(colid)), value, mult="last"]), rowid, mult="first"]
})
# user system elapsed
# 0.174 0.029 0.227
# apply solution from @thelatemail
system.time(t2 <- colnames(DF)[apply(DF,1,which.max)])
# user system elapsed
# 2.322 0.036 2.602
identical(t1, t2)
# [1] TRUE
It's about 11 times faster on data of these dimensions, and data.table scales pretty well too.
Edit: if any of the max ids is okay, then:
DT <- data.table(value=unlist(DF, use.names=FALSE),
colid = 1:nrow(DF), rowid = rep(names(DF), each=nrow(DF)))
setkey(DT, colid, value)
t1 <- DT[J(unique(colid)), rowid, mult="last"]
edited Jul 19 '13 at 1:13
answered Jul 19 '13 at 0:30
ArunArun
93k11217316
93k11217316
I actually dont' care if it is the first or last maximum. I'm going for simplicity first, but I'm sure a data.table solution will come handy in the future, thanks!
– dmvianna
Jul 19 '13 at 0:48
add a comment |
I actually dont' care if it is the first or last maximum. I'm going for simplicity first, but I'm sure a data.table solution will come handy in the future, thanks!
– dmvianna
Jul 19 '13 at 0:48
I actually dont' care if it is the first or last maximum. I'm going for simplicity first, but I'm sure a data.table solution will come handy in the future, thanks!
– dmvianna
Jul 19 '13 at 0:48
I actually dont' care if it is the first or last maximum. I'm going for simplicity first, but I'm sure a data.table solution will come handy in the future, thanks!
– dmvianna
Jul 19 '13 at 0:48
add a comment |
Based on the above suggestions, the following data.table solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time(
DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#> user system elapsed
#> 0.15 0.06 0.21
DT
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#> 1: 7 4 1 2 3 7 6 6 6 1 V1
#> 2: 4 6 9 10 6 2 7 7 1 3 V4
#> 3: 3 4 9 8 9 9 8 8 6 7 V3
#> 4: 4 8 8 9 7 5 9 2 7 1 V4
#> 5: 4 3 9 10 2 7 9 6 6 9 V4
#> ---
#> 999996: 4 6 10 5 4 7 3 8 2 8 V3
#> 999997: 8 7 6 6 3 10 2 3 10 1 V6
#> 999998: 2 3 2 7 4 7 5 2 7 3 V4
#> 999999: 8 10 3 2 3 4 5 1 1 4 V2
#> 1000000: 10 4 2 6 6 2 8 4 7 4 V1
And also comes with the advantage that can always specify what columns .SD should consider by mentioning them in .SDcols:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD:
DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
I had a similar requirement but want to get the column name having the minimum value for each row.....we don't seem to have min.col in R.....would you know what would be the equivalent solution?
– user1412
Feb 26 '17 at 18:38
Hi @user1412. Thanks for your interesting question. I don't have any idea right now other than using thewhich.minin something that would look like:DT[, MIN := colnames(.SD)[apply(.SD,1,which.min)]]orDT[, MIN2 := colnames(.SD)[which.min(.SD)], by = 1:nrow(DT)]on the dummy data above. This doesn't consider ties and returns only the first minimum. Maybe consider asking a separate question. I would be curious as well what other answers you would get.
– Valentin
Feb 28 '17 at 10:09
1
A trick to get minimum column is sending the negative of the data.frame into max.col, like:colnames(.SD)[max.col(-.SD, ties.method="first")].
– lwshang
Mar 9 '18 at 14:08
add a comment |
Based on the above suggestions, the following data.table solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time(
DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#> user system elapsed
#> 0.15 0.06 0.21
DT
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#> 1: 7 4 1 2 3 7 6 6 6 1 V1
#> 2: 4 6 9 10 6 2 7 7 1 3 V4
#> 3: 3 4 9 8 9 9 8 8 6 7 V3
#> 4: 4 8 8 9 7 5 9 2 7 1 V4
#> 5: 4 3 9 10 2 7 9 6 6 9 V4
#> ---
#> 999996: 4 6 10 5 4 7 3 8 2 8 V3
#> 999997: 8 7 6 6 3 10 2 3 10 1 V6
#> 999998: 2 3 2 7 4 7 5 2 7 3 V4
#> 999999: 8 10 3 2 3 4 5 1 1 4 V2
#> 1000000: 10 4 2 6 6 2 8 4 7 4 V1
And also comes with the advantage that can always specify what columns .SD should consider by mentioning them in .SDcols:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD:
DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
I had a similar requirement but want to get the column name having the minimum value for each row.....we don't seem to have min.col in R.....would you know what would be the equivalent solution?
– user1412
Feb 26 '17 at 18:38
Hi @user1412. Thanks for your interesting question. I don't have any idea right now other than using thewhich.minin something that would look like:DT[, MIN := colnames(.SD)[apply(.SD,1,which.min)]]orDT[, MIN2 := colnames(.SD)[which.min(.SD)], by = 1:nrow(DT)]on the dummy data above. This doesn't consider ties and returns only the first minimum. Maybe consider asking a separate question. I would be curious as well what other answers you would get.
– Valentin
Feb 28 '17 at 10:09
1
A trick to get minimum column is sending the negative of the data.frame into max.col, like:colnames(.SD)[max.col(-.SD, ties.method="first")].
– lwshang
Mar 9 '18 at 14:08
add a comment |
Based on the above suggestions, the following data.table solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time(
DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#> user system elapsed
#> 0.15 0.06 0.21
DT
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#> 1: 7 4 1 2 3 7 6 6 6 1 V1
#> 2: 4 6 9 10 6 2 7 7 1 3 V4
#> 3: 3 4 9 8 9 9 8 8 6 7 V3
#> 4: 4 8 8 9 7 5 9 2 7 1 V4
#> 5: 4 3 9 10 2 7 9 6 6 9 V4
#> ---
#> 999996: 4 6 10 5 4 7 3 8 2 8 V3
#> 999997: 8 7 6 6 3 10 2 3 10 1 V6
#> 999998: 2 3 2 7 4 7 5 2 7 3 V4
#> 999999: 8 10 3 2 3 4 5 1 1 4 V2
#> 1000000: 10 4 2 6 6 2 8 4 7 4 V1
And also comes with the advantage that can always specify what columns .SD should consider by mentioning them in .SDcols:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD:
DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
Based on the above suggestions, the following data.table solution worked very fast for me:
library(data.table)
set.seed(45)
DT <- data.table(matrix(sample(10, 10^7, TRUE), ncol=10))
system.time(
DT[, col_max := colnames(.SD)[max.col(.SD, ties.method = "first")]]
)
#> user system elapsed
#> 0.15 0.06 0.21
DT
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 col_max
#> 1: 7 4 1 2 3 7 6 6 6 1 V1
#> 2: 4 6 9 10 6 2 7 7 1 3 V4
#> 3: 3 4 9 8 9 9 8 8 6 7 V3
#> 4: 4 8 8 9 7 5 9 2 7 1 V4
#> 5: 4 3 9 10 2 7 9 6 6 9 V4
#> ---
#> 999996: 4 6 10 5 4 7 3 8 2 8 V3
#> 999997: 8 7 6 6 3 10 2 3 10 1 V6
#> 999998: 2 3 2 7 4 7 5 2 7 3 V4
#> 999999: 8 10 3 2 3 4 5 1 1 4 V2
#> 1000000: 10 4 2 6 6 2 8 4 7 4 V1
And also comes with the advantage that can always specify what columns .SD should consider by mentioning them in .SDcols:
DT[, MAX2 := colnames(.SD)[max.col(.SD, ties.method="first")], .SDcols = c("V9", "V10")]
In case we need the column name of the smallest value, as suggested by @lwshang, one just needs to use -.SD:
DT[, col_min := colnames(.SD)[max.col(-.SD, ties.method = "first")]]
edited Dec 31 '18 at 11:58
answered Oct 7 '16 at 18:04
ValentinValentin
1,9651230
1,9651230
I had a similar requirement but want to get the column name having the minimum value for each row.....we don't seem to have min.col in R.....would you know what would be the equivalent solution?
– user1412
Feb 26 '17 at 18:38
Hi @user1412. Thanks for your interesting question. I don't have any idea right now other than using thewhich.minin something that would look like:DT[, MIN := colnames(.SD)[apply(.SD,1,which.min)]]orDT[, MIN2 := colnames(.SD)[which.min(.SD)], by = 1:nrow(DT)]on the dummy data above. This doesn't consider ties and returns only the first minimum. Maybe consider asking a separate question. I would be curious as well what other answers you would get.
– Valentin
Feb 28 '17 at 10:09
1
A trick to get minimum column is sending the negative of the data.frame into max.col, like:colnames(.SD)[max.col(-.SD, ties.method="first")].
– lwshang
Mar 9 '18 at 14:08
add a comment |
I had a similar requirement but want to get the column name having the minimum value for each row.....we don't seem to have min.col in R.....would you know what would be the equivalent solution?
– user1412
Feb 26 '17 at 18:38
Hi @user1412. Thanks for your interesting question. I don't have any idea right now other than using thewhich.minin something that would look like:DT[, MIN := colnames(.SD)[apply(.SD,1,which.min)]]orDT[, MIN2 := colnames(.SD)[which.min(.SD)], by = 1:nrow(DT)]on the dummy data above. This doesn't consider ties and returns only the first minimum. Maybe consider asking a separate question. I would be curious as well what other answers you would get.
– Valentin
Feb 28 '17 at 10:09
1
A trick to get minimum column is sending the negative of the data.frame into max.col, like:colnames(.SD)[max.col(-.SD, ties.method="first")].
– lwshang
Mar 9 '18 at 14:08
I had a similar requirement but want to get the column name having the minimum value for each row.....we don't seem to have min.col in R.....would you know what would be the equivalent solution?
– user1412
Feb 26 '17 at 18:38
I had a similar requirement but want to get the column name having the minimum value for each row.....we don't seem to have min.col in R.....would you know what would be the equivalent solution?
– user1412
Feb 26 '17 at 18:38
Hi @user1412. Thanks for your interesting question. I don't have any idea right now other than using the
which.min in something that would look like: DT[, MIN := colnames(.SD)[apply(.SD,1,which.min)]] or DT[, MIN2 := colnames(.SD)[which.min(.SD)], by = 1:nrow(DT)] on the dummy data above. This doesn't consider ties and returns only the first minimum. Maybe consider asking a separate question. I would be curious as well what other answers you would get.– Valentin
Feb 28 '17 at 10:09
Hi @user1412. Thanks for your interesting question. I don't have any idea right now other than using the
which.min in something that would look like: DT[, MIN := colnames(.SD)[apply(.SD,1,which.min)]] or DT[, MIN2 := colnames(.SD)[which.min(.SD)], by = 1:nrow(DT)] on the dummy data above. This doesn't consider ties and returns only the first minimum. Maybe consider asking a separate question. I would be curious as well what other answers you would get.– Valentin
Feb 28 '17 at 10:09
1
1
A trick to get minimum column is sending the negative of the data.frame into max.col, like:
colnames(.SD)[max.col(-.SD, ties.method="first")].– lwshang
Mar 9 '18 at 14:08
A trick to get minimum column is sending the negative of the data.frame into max.col, like:
colnames(.SD)[max.col(-.SD, ties.method="first")].– lwshang
Mar 9 '18 at 14:08
add a comment |
One solution could be to reshape the date from wide to long putting all the departments in one column and counts in another, group by the employer id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.
library(tidyverse)
# sample data frame with a tie
df <- data_frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))
# If you aren't worried about ties:
df %>%
rownames_to_column('id') %>% # creates an ID number
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(cnt))
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
# If you're worried about keeping ties:
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
arrange(id)
# A tibble: 4 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
4 3 V3 5.
# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
mutate(dept_rank = rank(-cnt, ties.method = "first")) %>% # or 'last'
filter(dept_rank == 1) %>%
select(-dept_rank)
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 2 V1 8.
2 3 V2 5.
3 1 V3 9.
# if you wanted to keep the original wide data frame
df %>%
rownames_to_column('id') %>%
left_join(
df %>%
rownames_to_column('id') %>%
gather(max_dept, max_cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(max_cnt)),
by = 'id'
)
# A tibble: 3 x 6
id V1 V2 V3 max_dept max_cnt
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2. 7. 9. V3 9.
2 2 8. 3. 6. V1 8.
3 3 1. 5. 5. V2 5.
add a comment |
One solution could be to reshape the date from wide to long putting all the departments in one column and counts in another, group by the employer id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.
library(tidyverse)
# sample data frame with a tie
df <- data_frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))
# If you aren't worried about ties:
df %>%
rownames_to_column('id') %>% # creates an ID number
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(cnt))
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
# If you're worried about keeping ties:
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
arrange(id)
# A tibble: 4 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
4 3 V3 5.
# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
mutate(dept_rank = rank(-cnt, ties.method = "first")) %>% # or 'last'
filter(dept_rank == 1) %>%
select(-dept_rank)
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 2 V1 8.
2 3 V2 5.
3 1 V3 9.
# if you wanted to keep the original wide data frame
df %>%
rownames_to_column('id') %>%
left_join(
df %>%
rownames_to_column('id') %>%
gather(max_dept, max_cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(max_cnt)),
by = 'id'
)
# A tibble: 3 x 6
id V1 V2 V3 max_dept max_cnt
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2. 7. 9. V3 9.
2 2 8. 3. 6. V1 8.
3 3 1. 5. 5. V2 5.
add a comment |
One solution could be to reshape the date from wide to long putting all the departments in one column and counts in another, group by the employer id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.
library(tidyverse)
# sample data frame with a tie
df <- data_frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))
# If you aren't worried about ties:
df %>%
rownames_to_column('id') %>% # creates an ID number
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(cnt))
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
# If you're worried about keeping ties:
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
arrange(id)
# A tibble: 4 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
4 3 V3 5.
# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
mutate(dept_rank = rank(-cnt, ties.method = "first")) %>% # or 'last'
filter(dept_rank == 1) %>%
select(-dept_rank)
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 2 V1 8.
2 3 V2 5.
3 1 V3 9.
# if you wanted to keep the original wide data frame
df %>%
rownames_to_column('id') %>%
left_join(
df %>%
rownames_to_column('id') %>%
gather(max_dept, max_cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(max_cnt)),
by = 'id'
)
# A tibble: 3 x 6
id V1 V2 V3 max_dept max_cnt
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2. 7. 9. V3 9.
2 2 8. 3. 6. V1 8.
3 3 1. 5. 5. V2 5.
One solution could be to reshape the date from wide to long putting all the departments in one column and counts in another, group by the employer id (in this case, the row number), and then filter to the department(s) with the max value. There are a couple of options for handling ties with this approach too.
library(tidyverse)
# sample data frame with a tie
df <- data_frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,5))
# If you aren't worried about ties:
df %>%
rownames_to_column('id') %>% # creates an ID number
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(cnt))
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
# If you're worried about keeping ties:
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
filter(cnt == max(cnt)) %>% # top_n(cnt, n = 1) also works
arrange(id)
# A tibble: 4 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 1 V3 9.
2 2 V1 8.
3 3 V2 5.
4 3 V3 5.
# If you're worried about ties, but only want a certain department, you could use rank() and choose 'first' or 'last'
df %>%
rownames_to_column('id') %>%
gather(dept, cnt, V1:V3) %>%
group_by(id) %>%
mutate(dept_rank = rank(-cnt, ties.method = "first")) %>% # or 'last'
filter(dept_rank == 1) %>%
select(-dept_rank)
# A tibble: 3 x 3
# Groups: id [3]
id dept cnt
<chr> <chr> <dbl>
1 2 V1 8.
2 3 V2 5.
3 1 V3 9.
# if you wanted to keep the original wide data frame
df %>%
rownames_to_column('id') %>%
left_join(
df %>%
rownames_to_column('id') %>%
gather(max_dept, max_cnt, V1:V3) %>%
group_by(id) %>%
slice(which.max(max_cnt)),
by = 'id'
)
# A tibble: 3 x 6
id V1 V2 V3 max_dept max_cnt
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 1 2. 7. 9. V3 9.
2 2 8. 3. 6. V1 8.
3 3 1. 5. 5. V2 5.
edited Mar 31 '18 at 12:21
answered Mar 31 '18 at 11:23
sbhasbha
2,27222124
2,27222124
add a comment |
add a comment |
A simple for loop can also be handy:
> df<-data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
> df
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
> df2<-data.frame()
> for (i in 1:nrow(df)){
+ df2[i,1]<-colnames(df[which.max(df[i,])])
+ }
> df2
V1
1 V3
2 V1
3 V2
add a comment |
A simple for loop can also be handy:
> df<-data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
> df
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
> df2<-data.frame()
> for (i in 1:nrow(df)){
+ df2[i,1]<-colnames(df[which.max(df[i,])])
+ }
> df2
V1
1 V3
2 V1
3 V2
add a comment |
A simple for loop can also be handy:
> df<-data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
> df
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
> df2<-data.frame()
> for (i in 1:nrow(df)){
+ df2[i,1]<-colnames(df[which.max(df[i,])])
+ }
> df2
V1
1 V3
2 V1
3 V2
A simple for loop can also be handy:
> df<-data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
> df
V1 V2 V3
1 2 7 9
2 8 3 6
3 1 5 4
> df2<-data.frame()
> for (i in 1:nrow(df)){
+ df2[i,1]<-colnames(df[which.max(df[i,])])
+ }
> df2
V1
1 V3
2 V1
3 V2
edited Jul 3 '18 at 5:11
answered Jul 3 '18 at 5:05
rarrar
6331515
6331515
add a comment |
add a comment |
A dplyr solution:
Idea:
- add rowids as a column
- reshape to long format
- filter for max in each group
Code:
DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
filter(rank(-value) == 1)
Result:
# A tibble: 3 x 3
# Groups: rowname [3]
rowname column value
<chr> <chr> <dbl>
1 2 V1 8
2 3 V2 5
3 1 V3 9
This approach can be easily extended to get the top n columns.
Example for n=2:
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
mutate(rk = rank(-value)) %>%
filter(rk <= 2) %>%
arrange(rowname, rk)
Result:
# A tibble: 6 x 4
# Groups: rowname [3]
rowname column value rk
<chr> <chr> <dbl> <dbl>
1 1 V3 9 1
2 1 V2 7 2
3 2 V1 8 1
4 2 V3 6 2
5 3 V2 5 1
6 3 V3 4 2
Could you comment on the difference between this approach and sbha's answer above? They look about the same to me.
– Gregor
Nov 6 '18 at 14:46
add a comment |
A dplyr solution:
Idea:
- add rowids as a column
- reshape to long format
- filter for max in each group
Code:
DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
filter(rank(-value) == 1)
Result:
# A tibble: 3 x 3
# Groups: rowname [3]
rowname column value
<chr> <chr> <dbl>
1 2 V1 8
2 3 V2 5
3 1 V3 9
This approach can be easily extended to get the top n columns.
Example for n=2:
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
mutate(rk = rank(-value)) %>%
filter(rk <= 2) %>%
arrange(rowname, rk)
Result:
# A tibble: 6 x 4
# Groups: rowname [3]
rowname column value rk
<chr> <chr> <dbl> <dbl>
1 1 V3 9 1
2 1 V2 7 2
3 2 V1 8 1
4 2 V3 6 2
5 3 V2 5 1
6 3 V3 4 2
Could you comment on the difference between this approach and sbha's answer above? They look about the same to me.
– Gregor
Nov 6 '18 at 14:46
add a comment |
A dplyr solution:
Idea:
- add rowids as a column
- reshape to long format
- filter for max in each group
Code:
DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
filter(rank(-value) == 1)
Result:
# A tibble: 3 x 3
# Groups: rowname [3]
rowname column value
<chr> <chr> <dbl>
1 2 V1 8
2 3 V2 5
3 1 V3 9
This approach can be easily extended to get the top n columns.
Example for n=2:
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
mutate(rk = rank(-value)) %>%
filter(rk <= 2) %>%
arrange(rowname, rk)
Result:
# A tibble: 6 x 4
# Groups: rowname [3]
rowname column value rk
<chr> <chr> <dbl> <dbl>
1 1 V3 9 1
2 1 V2 7 2
3 2 V1 8 1
4 2 V3 6 2
5 3 V2 5 1
6 3 V3 4 2
A dplyr solution:
Idea:
- add rowids as a column
- reshape to long format
- filter for max in each group
Code:
DF = data.frame(V1=c(2,8,1),V2=c(7,3,5),V3=c(9,6,4))
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
filter(rank(-value) == 1)
Result:
# A tibble: 3 x 3
# Groups: rowname [3]
rowname column value
<chr> <chr> <dbl>
1 2 V1 8
2 3 V2 5
3 1 V3 9
This approach can be easily extended to get the top n columns.
Example for n=2:
DF %>%
rownames_to_column() %>%
gather(column, value, -rowname) %>%
group_by(rowname) %>%
mutate(rk = rank(-value)) %>%
filter(rk <= 2) %>%
arrange(rowname, rk)
Result:
# A tibble: 6 x 4
# Groups: rowname [3]
rowname column value rk
<chr> <chr> <dbl> <dbl>
1 1 V3 9 1
2 1 V2 7 2
3 2 V1 8 1
4 2 V3 6 2
5 3 V2 5 1
6 3 V3 4 2
answered Nov 6 '18 at 14:31
Gregor SturmGregor Sturm
753618
753618
Could you comment on the difference between this approach and sbha's answer above? They look about the same to me.
– Gregor
Nov 6 '18 at 14:46
add a comment |
Could you comment on the difference between this approach and sbha's answer above? They look about the same to me.
– Gregor
Nov 6 '18 at 14:46
Could you comment on the difference between this approach and sbha's answer above? They look about the same to me.
– Gregor
Nov 6 '18 at 14:46
Could you comment on the difference between this approach and sbha's answer above? They look about the same to me.
– Gregor
Nov 6 '18 at 14:46
add a comment |
protected by zx8754 Jul 14 '17 at 13:16
Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).
Would you like to answer one of these unanswered questions instead?
how big is your actual data?
– Arun
Jul 18 '13 at 23:51
1
@Arun > dim(test) [1] 26746 18
– dmvianna
Jul 18 '13 at 23:57
4
An interesting generalization would be the largest n values' column names per row
– Hack-R
Sep 7 '16 at 0:07