How to efficiently return all the column names across 1m records when certain conditions are met
Updated with dummy data and dummy code - apologies, I assumed my question was simple and that you could advise on the best way without a reproducible example.
dummy <- data.frame(prodA = c(0,0,0,1,1,0,0,1),
                    prodB = c(0,0,1,1,0,1,1,0),
                    prodC = c(1,1,1,0,0,0,0,1))
dummy[,4:6] <- dummy[,1:3]
for (j in 1:nrow(dummy)) {
  for (i in 4:6) {
    dummy[j,i] <- ifelse(dummy[j,i] == 1, colnames(dummy[i]), "")
  }
}
dummy2 <- dummy[,4:6]
dummy$NewProds <- apply(dummy2, 1, paste, collapse = "")
dummy$NewProds <- gsub(".1", "//", dummy$NewProds)
My second attempt is as follows:
prods<-dummy[,1:3]
prods[,4:6]<-dummy[,1:3]
for (i in 4:6){
prods[,i]<-colnames(prods[i-3])
}
prods[,7:9]<-prods[,4:6]
#works, but I will need multiple ifs for this to work, suggesting this
#won't be very efficient
prods[,10]<-ifelse(prods[,1]==1,prods[,4],"")
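One way to avoid writing one ifelse per column, sketched here on the dummy data above: loop over the columns rather than the rows, so that each pass is a single vectorized ifelse over the whole column. The names prod_cols and labels are illustrative only and do not appear in the original code.

```r
# Dummy data from the question: three 0/1 product columns, eight rows
dummy <- data.frame(prodA = c(0,0,0,1,1,0,0,1),
                    prodB = c(0,0,1,1,0,1,1,0),
                    prodC = c(1,1,1,0,0,0,0,1))

prod_cols <- names(dummy)   # hypothetical name: the columns holding the flags
labels <- dummy             # hypothetical name: copy to overwrite with labels

# One vectorized ifelse() per column instead of one call per cell
for (col in prod_cols) {
  labels[[col]] <- ifelse(dummy[[col]] == 1, col, "")
}

labels$prodA
# "" "" "" "prodA" "prodA" "" "" "prodA"
```

With 24 product columns this is 24 vectorized calls, whereas the row-by-row loop makes one data frame assignment per cell, which is the likely reason it slows down so badly on many rows.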
Original Post Follows:
I am playing with the Santander Product Recommendation dataset from Kaggle. I have identified which products have been purchased from one month to another. This means I have 23 columns of 1's (when a new product is added) and 0's (when not).
I created the following code to return the column name when a product has been purchased. It works great on a sample of 6 lines, but it runs forever when I try this on the 48k customers who changed, let alone the million in the dataset.
Is there another way to do this?
df2[,99:122] <- df2[,72:95]
for (j in 1:nrow(df2)) {
  for (i in 99:122) {
    df2[j,i] <- ifelse(df2[j,i] == 1, colnames(df2[i]), "")
  }
}
df22 <- df2[,99:122]
df2$NewProds <- apply(df22, 1, paste, collapse = "")
df2$NewProds <- gsub("change.1", "//", df2$NewProds)
I figured the problem was that I am visiting every cell, so I started on another approach: keep several copies of the data and, column by column, take the column name wherever the value is 1. However, I couldn't get this to work, and I think I would run into the same issue.
#copy a bunch of 1's and 0's
prods<-df2[,72:95]
#repeat and overwrite with colnames
prods[,25:48]<-df2[,72:95]
for (i in 25:48){
prods[,i]<-colnames(prods[i-24])
}
prods[,49:72]<-prods[,25:48]
#attempt to only populate colnames if it was originally a 1 - doesn't work
prod[,49]<-ifelse(prod[,1]==1,prod[,25],"")
I haven't provided any data but I hope you can see what I am trying to do and can advise on efficient ways of doing this.
Thanks in advance,
J
r loops
So you actually note that you haven't provided any data, but why would you not just include some and make it a reproducible example? If you're not going to take the time to write a good question, why would we take the time to write a good answer?
– Conor Neilson
Dec 29 '18 at 18:28
Can you post sample data? Please edit the question with the output of dput(df2). Or, if it is too big, with the output of dput(df2[1:20, 72:95]).
– Rui Barradas
Dec 29 '18 at 18:36
I don't understand the output you want. The column names of the columns with at least one 1?
– Rui Barradas
Dec 29 '18 at 18:54
I apologise. I thought my question was simple and that this would not need dummy data. I have provided it now and the working example. The point here is that this works, but for the mass of data it takes far too long. I am looking for someone who can give me a more effective way of doing this. Thank you in advance.
– James Oliver
Dec 29 '18 at 21:20
asked Dec 29 '18 at 18:00 by James Oliver, edited Dec 29 '18 at 21:27
2 Answers
Using apply as @AndersEllernBilgrau illustrated is one obvious way to do it, but it will be slow for data sets with many rows.
dummy[["NewProds"]] <- do.call(
  paste,
  c(mapply(ifelse,
           dummy,
           names(dummy),
           MoreArgs = list(no = ""),
           SIMPLIFY = FALSE),
    sep = "//"))
is a bit harder to follow, but it will be much faster:
library(microbenchmark)
n <- 10000
dummy <- data.frame(prodA = rep(c(0,0,0,1,1,0,0,1), n),
                    prodB = rep(c(0,0,1,1,0,1,1,0), n),
                    prodC = rep(c(1,1,1,0,0,0,0,1), n))
microbenchmark(
  do.call = do.call(
    paste,
    c(mapply(ifelse,
             dummy,
             names(dummy),
             MoreArgs = list(no = ""),
             SIMPLIFY = FALSE),
      sep = "//")),
  apply = apply(
    dummy == 1,
    1,
    function(x) paste0(names(which(x)), collapse = "//")
  ))
## Unit: milliseconds
##     expr       min        lq      mean   median       uq      max neval cld
##  do.call  63.92695  65.44777  72.07261  67.8667  73.3850 184.5151   100  a
##    apply 296.81323 364.31947 404.71894 397.0927 443.7223 683.3892   100   b
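To make the one-liner easier to follow, here is a sketch that splits it into steps on the 8-row dummy data from the question. The intermediate names (labels, raw_joined, clean) and the final gsub clean-up are additions for illustration and not part of the answer above; the clean-up assumes that no column name itself contains //.

```r
# Dummy data from the question
dummy <- data.frame(prodA = c(0,0,0,1,1,0,0,1),
                    prodB = c(0,0,1,1,0,1,1,0),
                    prodC = c(1,1,1,0,0,0,0,1))

# Step 1: one vectorized ifelse() per column. mapply() pairs each column
# with its own name, so every 1 becomes the column name and every 0 "".
labels <- mapply(ifelse, dummy, names(dummy),
                 MoreArgs = list(no = ""), SIMPLIFY = FALSE)

# Step 2: paste the per-column vectors together element-wise, i.e. row-wise.
raw_joined <- do.call(paste, c(labels, sep = "//"))
raw_joined[1]
# "////prodC"   (empty slots still contribute a separator)

# Step 3 (optional, illustrative): collapse repeated separators,
# then trim a leading or trailing one.
clean <- gsub("(//)+", "//", raw_joined)
clean <- gsub("^//|//$", "", clean)
clean
# "prodC" "prodC" "prodB//prodC" "prodA//prodB" "prodA" "prodB" "prodB" "prodA//prodC"
```

All three steps work on whole columns at once, which is why this route avoids the per-row function calls that apply makes.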
Wow! I cannot believe how quick that was. Thank you. I need to get closer to these functions.
– James Oliver
Jan 5 at 13:17
Without data, I have a hard time understanding precisely what you want to do.
A couple of things are (almost) certain however:
- You probably do not need for loops.
- You should use R's vectorized functions; the dataset is not that big.
Using some toy data, does the following do what you want?
d <- 23
n <- 46e3
# Simulate some toy data
df <- data.frame(matrix(rbinom(d*n, 1, 0.1), n, d),
                 row.names = paste0("row", 1:n))
head(df)
#      X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23
# row1  0  0  0  0  0  0  0  1  0   0   0   0   0   0   0   1   0   0   0   0   0   0   0
# row2  1  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
# row3  0  0  0  0  0  0  0  1  0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
# row4  0  0  0  1  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0   1   0   0   0
# row5  0  0  0  0  0  0  1  0  0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
# row6  0  0  0  1  0  0  0  0  0   0   0   0   0   0   0   0   0   1   0   0   1   0   0
# For each row, paste together the names of the columns that equal 1
res <- apply(df == 1, 1, function(x) paste0(names(which(x)), collapse = "-"))
head(res)
#         row1         row2         row3         row4         row5         row6
#     "X8-X16"         "X1"     "X8-X20" "X4-X11-X20"     "X7-X15" "X4-X18-X21"
That is, res is a character vector of length n in which each element holds the names of the columns equal to 1 in that row, pasted together (with separator -). This is at least what your code appears to be doing conceptually.
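As a quick check of that description, the same apply call can be run on the small dummy data from the question, here with // as the separator to match the asker's code. This is only a sketch; the variable name res_small is made up for the example.

```r
# Dummy data from the question
dummy <- data.frame(prodA = c(0,0,0,1,1,0,0,1),
                    prodB = c(0,0,1,1,0,1,1,0),
                    prodC = c(1,1,1,0,0,0,0,1))

# dummy == 1 is a logical matrix; for each row, which() keeps the TRUE
# entries and names() returns their column names.
res_small <- apply(dummy == 1, 1,
                   function(x) paste0(names(which(x)), collapse = "//"))

unname(res_small)
# "prodC" "prodC" "prodB//prodC" "prodA//prodB" "prodA" "prodB" "prodB" "prodA//prodC"

# A row with no 1 at all yields an empty string, because pasting a
# zero-length vector with collapse gives "".
paste0(character(0), collapse = "//")
# ""
```

Unlike the element-wise paste route, this produces no stray separators, at the cost of one R function call per row.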
The OP wants colnames.
– Rui Barradas
Dec 29 '18 at 18:41
@RuiBarradas Arh, doh
– Anders Ellern Bilgrau
Dec 29 '18 at 18:42
Thank you for trying. I have updated the question with dummy data and my amended code so that it works with the dummy code.
– James Oliver
Dec 29 '18 at 21:24
answered Dec 30 '18 at 1:21 by Ista
answered Dec 29 '18 at 18:40 by Anders Ellern Bilgrau, edited Dec 29 '18 at 18:56