How to randomly select and bind data columns based on their median values in R?

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}

I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T)

)

I want to combine one of the columns from the first dataset (wiki_1) with n columns from the second dataset (wiki_2). But this selection should be based on how close the median values of the columns in wiki_2 are to those in wiki_1 e.g. by order of magnitude.

In this example, for n = 2, Y should be matched with C and D because of how close their median values are.

median(wiki_1$Y) # 7

median(wiki_2$C) # 6

median(wiki_2$D) # 4.5

I'm not sure how to implement the difference in median values criterion to get the desired result.

Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.

This is what I'm working with so far:

df <- zoo(cbind(subset(wiki_1,select="Y"), 

                   subset(wiki_2,select=c("C","D"))),time)

asked Jan 4 at 17:43

Manassa Mauler

584415

so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

– JdeMello
Jan 4 at 18:00

add a comment |

I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T)

)

In this example, for n = 2, Y should be matched with C and D because of how close their median values are.

median(wiki_1$Y) # 7

median(wiki_2$C) # 6

median(wiki_2$D) # 4.5

I'm not sure how to implement the difference in median values criterion to get the desired result.

Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.

This is what I'm working with so far:

df <- zoo(cbind(subset(wiki_1,select="Y"), 

                   subset(wiki_2,select=c("C","D"))),time)

asked Jan 4 at 17:43

Manassa Mauler

584415

so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

– JdeMello
Jan 4 at 18:00

add a comment |

I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T)

)

In this example, for n = 2, Y should be matched with C and D because of how close their median values are.

median(wiki_1$Y) # 7

median(wiki_2$C) # 6

median(wiki_2$D) # 4.5

I'm not sure how to implement the difference in median values criterion to get the desired result.

Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.

This is what I'm working with so far:

df <- zoo(cbind(subset(wiki_1,select="Y"), 

                   subset(wiki_2,select=c("C","D"))),time)

asked Jan 4 at 17:43

Manassa Mauler

584415

I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T)

)

In this example, for n = 2, Y should be matched with C and D because of how close their median values are.

median(wiki_1$Y) # 7

median(wiki_2$C) # 6

median(wiki_2$D) # 4.5

I'm not sure how to implement the difference in median values criterion to get the desired result.

Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.

This is what I'm working with so far:

df <- zoo(cbind(subset(wiki_1,select="Y"), 

                   subset(wiki_2,select=c("C","D"))),time)

r dataframe selection data-manipulation

asked Jan 4 at 17:43

Manassa Mauler

584415

asked Jan 4 at 17:43

Manassa Mauler

584415

asked Jan 4 at 17:43

Manassa Mauler

584415

asked Jan 4 at 17:43

Manassa Mauler

584415

asked Jan 4 at 17:43

Manassa Mauler

584415

so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

– JdeMello
Jan 4 at 18:00

add a comment |

so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

– JdeMello
Jan 4 at 18:00

so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

– JdeMello
Jan 4 at 18:00

add a comment |

3 Answers
3

active

oldest

votes

I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T),

  E = sample(1:20,10,replace = T)

)





selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {

  #' Select Columns By Median

  #' @description Select any number of columns from a test data.frame whose median value is

  #' close to the median value of a specified column from a reference data.frame. "Close to"

  #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.

  #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns

  #' from the test data.frame

  #' @param df1 reference data.frame

  #' @param df2 test data.frame

  #' @param ref_v column from reference data.frame to test against

  #' @param n_v number of columns from df2 to select

  #' @param cutoff_v value to use to determine if test columns' medians are close enough

  #' @return data.frame with 1 column from df1 and matching columns from df2



  ## Get median of ref

  med_v <- median(df1[,ref_v], na.rm = T)



  ## Get other medians

  otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))



  ## Get differences

  medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))



  ## Get whoever is within range (and order them)

  inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])

  inRangeCols_v <- names(inRange_v)



  ## Select random sample, if needed

  if (length(inRangeCols_v) > n_v){

    whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)

  } else {

    whichRandom_v <- 1:length(inRangeCols_v)

  }

  finalCols_v <- inRangeCols_v[whichRandom_v]



  ## Final output

  out_df <- cbind(df1[,ref_v], df2[,finalCols_v])

  colnames(out_df) <- c(ref_v, finalCols_v)



  ## Return

  return(out_df)

} # selectColsByMedian



### 3 matching columns, select 2

match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)



### 2 matching columns, select 2

match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)

edited Jan 4 at 18:35

answered Jan 4 at 18:08

Qwfqwf

34318

add a comment |

Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).

set.seed(123)



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,100,replace = T),

  B = sample(90:100,100,replace = T),

  C = sample(1:10,100,replace = T),

  D = sample(1:10,100,replace = T)

)



combineMedianComp <- function(data1, data2, col, n){

  if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")



  medRef <- median(data1[[col]], na.rm = T, ) # median of desired column



  medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')



  cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n



  d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'



  d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1



  # merge data

  res <- do.call(cbind, list(data1[col], d2)) 



  return(res)

}





combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)

answered Jan 4 at 18:26

JdeMello

816419

add a comment |

You can do:

time = as.Date('2009-01-01') + 0:9



close_median <- function(df1, df2, to_match = NULL){



    # get median

    m <- median(df1[[to_match]])



    # get difference of median from other data 

    mat_cols <-  apply(df2, 2, function(x) abs(m - median(x)))



    # get top 2 matched column

    cols <- sort(names(sort(v)[1:2]))



    return(cbind(df1[to_match], df2[cols], row.names=time))



}



close_median(wiki_1, wiki_2, 'Y')



            Y  C  D

2009-01-01  8  9 10

2009-01-02  7  8  1

2009-01-03  1  7  7

2009-01-04 10  3 10

2009-01-05  2  1  1

2009-01-06  3 10  3

2009-01-07  6  2  3

2009-01-08  5  8 10

2009-01-09  3  8  5

2009-01-10 10  8  3

answered Jan 4 at 18:27

YOLO

5,5731525

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54043723%2fhow-to-randomly-select-and-bind-data-columns-based-on-their-median-values-in-r%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T),

  E = sample(1:20,10,replace = T)

)





selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {

  #' Select Columns By Median

  #' @description Select any number of columns from a test data.frame whose median value is

  #' close to the median value of a specified column from a reference data.frame. "Close to"

  #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.

  #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns

  #' from the test data.frame

  #' @param df1 reference data.frame

  #' @param df2 test data.frame

  #' @param ref_v column from reference data.frame to test against

  #' @param n_v number of columns from df2 to select

  #' @param cutoff_v value to use to determine if test columns' medians are close enough

  #' @return data.frame with 1 column from df1 and matching columns from df2



  ## Get median of ref

  med_v <- median(df1[,ref_v], na.rm = T)



  ## Get other medians

  otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))



  ## Get differences

  medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))



  ## Get whoever is within range (and order them)

  inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])

  inRangeCols_v <- names(inRange_v)



  ## Select random sample, if needed

  if (length(inRangeCols_v) > n_v){

    whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)

  } else {

    whichRandom_v <- 1:length(inRangeCols_v)

  }

  finalCols_v <- inRangeCols_v[whichRandom_v]



  ## Final output

  out_df <- cbind(df1[,ref_v], df2[,finalCols_v])

  colnames(out_df) <- c(ref_v, finalCols_v)



  ## Return

  return(out_df)

} # selectColsByMedian



### 3 matching columns, select 2

match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)



### 2 matching columns, select 2

match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)

edited Jan 4 at 18:35

answered Jan 4 at 18:08

Qwfqwf

34318

add a comment |

I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T),

  E = sample(1:20,10,replace = T)

)





selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {

  #' Select Columns By Median

  #' @description Select any number of columns from a test data.frame whose median value is

  #' close to the median value of a specified column from a reference data.frame. "Close to"

  #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.

  #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns

  #' from the test data.frame

  #' @param df1 reference data.frame

  #' @param df2 test data.frame

  #' @param ref_v column from reference data.frame to test against

  #' @param n_v number of columns from df2 to select

  #' @param cutoff_v value to use to determine if test columns' medians are close enough

  #' @return data.frame with 1 column from df1 and matching columns from df2



  ## Get median of ref

  med_v <- median(df1[,ref_v], na.rm = T)



  ## Get other medians

  otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))



  ## Get differences

  medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))



  ## Get whoever is within range (and order them)

  inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])

  inRangeCols_v <- names(inRange_v)



  ## Select random sample, if needed

  if (length(inRangeCols_v) > n_v){

    whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)

  } else {

    whichRandom_v <- 1:length(inRangeCols_v)

  }

  finalCols_v <- inRangeCols_v[whichRandom_v]



  ## Final output

  out_df <- cbind(df1[,ref_v], df2[,finalCols_v])

  colnames(out_df) <- c(ref_v, finalCols_v)



  ## Return

  return(out_df)

} # selectColsByMedian



### 3 matching columns, select 2

match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)



### 2 matching columns, select 2

match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)

edited Jan 4 at 18:35

answered Jan 4 at 18:08

Qwfqwf

34318

add a comment |

I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T),

  E = sample(1:20,10,replace = T)

)





selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {

  #' Select Columns By Median

  #' @description Select any number of columns from a test data.frame whose median value is

  #' close to the median value of a specified column from a reference data.frame. "Close to"

  #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.

  #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns

  #' from the test data.frame

  #' @param df1 reference data.frame

  #' @param df2 test data.frame

  #' @param ref_v column from reference data.frame to test against

  #' @param n_v number of columns from df2 to select

  #' @param cutoff_v value to use to determine if test columns' medians are close enough

  #' @return data.frame with 1 column from df1 and matching columns from df2



  ## Get median of ref

  med_v <- median(df1[,ref_v], na.rm = T)



  ## Get other medians

  otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))



  ## Get differences

  medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))



  ## Get whoever is within range (and order them)

  inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])

  inRangeCols_v <- names(inRange_v)



  ## Select random sample, if needed

  if (length(inRangeCols_v) > n_v){

    whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)

  } else {

    whichRandom_v <- 1:length(inRangeCols_v)

  }

  finalCols_v <- inRangeCols_v[whichRandom_v]



  ## Final output

  out_df <- cbind(df1[,ref_v], df2[,finalCols_v])

  colnames(out_df) <- c(ref_v, finalCols_v)



  ## Return

  return(out_df)

} # selectColsByMedian



### 3 matching columns, select 2

match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)



### 2 matching columns, select 2

match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)

edited Jan 4 at 18:35

answered Jan 4 at 18:08

Qwfqwf

34318

I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.

set.seed(123)

library(tidyr)



time = as.Date('2009-01-01') + 0:9



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,10,replace = T),

  B = sample(90:100,10,replace = T),

  C = sample(1:10,10,replace = T),

  D = sample(1:10,10,replace = T),

  E = sample(1:20,10,replace = T)

)





selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {

  #' Select Columns By Median

  #' @description Select any number of columns from a test data.frame whose median value is

  #' close to the median value of a specified column from a reference data.frame. "Close to"

  #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.

  #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns

  #' from the test data.frame

  #' @param df1 reference data.frame

  #' @param df2 test data.frame

  #' @param ref_v column from reference data.frame to test against

  #' @param n_v number of columns from df2 to select

  #' @param cutoff_v value to use to determine if test columns' medians are close enough

  #' @return data.frame with 1 column from df1 and matching columns from df2



  ## Get median of ref

  med_v <- median(df1[,ref_v], na.rm = T)



  ## Get other medians

  otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))



  ## Get differences

  medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))



  ## Get whoever is within range (and order them)

  inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])

  inRangeCols_v <- names(inRange_v)



  ## Select random sample, if needed

  if (length(inRangeCols_v) > n_v){

    whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)

  } else {

    whichRandom_v <- 1:length(inRangeCols_v)

  }

  finalCols_v <- inRangeCols_v[whichRandom_v]



  ## Final output

  out_df <- cbind(df1[,ref_v], df2[,finalCols_v])

  colnames(out_df) <- c(ref_v, finalCols_v)



  ## Return

  return(out_df)

} # selectColsByMedian



### 3 matching columns, select 2

match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)



### 2 matching columns, select 2

match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)

edited Jan 4 at 18:35

answered Jan 4 at 18:08

Qwfqwf

34318

edited Jan 4 at 18:35

answered Jan 4 at 18:08

Qwfqwf

34318

answered Jan 4 at 18:08

Qwfqwf

34318

answered Jan 4 at 18:08

Qwfqwf

34318

add a comment |

Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).

set.seed(123)



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,100,replace = T),

  B = sample(90:100,100,replace = T),

  C = sample(1:10,100,replace = T),

  D = sample(1:10,100,replace = T)

)



combineMedianComp <- function(data1, data2, col, n){

  if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")



  medRef <- median(data1[[col]], na.rm = T, ) # median of desired column



  medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')



  cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n



  d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'



  d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1



  # merge data

  res <- do.call(cbind, list(data1[col], d2)) 



  return(res)

}





combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)

answered Jan 4 at 18:26

JdeMello

816419

add a comment |

Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).

set.seed(123)



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,100,replace = T),

  B = sample(90:100,100,replace = T),

  C = sample(1:10,100,replace = T),

  D = sample(1:10,100,replace = T)

)



combineMedianComp <- function(data1, data2, col, n){

  if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")



  medRef <- median(data1[[col]], na.rm = T, ) # median of desired column



  medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')



  cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n



  d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'



  d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1



  # merge data

  res <- do.call(cbind, list(data1[col], d2)) 



  return(res)

}





combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)

answered Jan 4 at 18:26

JdeMello

816419

add a comment |

Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).

set.seed(123)



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,100,replace = T),

  B = sample(90:100,100,replace = T),

  C = sample(1:10,100,replace = T),

  D = sample(1:10,100,replace = T)

)



combineMedianComp <- function(data1, data2, col, n){

  if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")



  medRef <- median(data1[[col]], na.rm = T, ) # median of desired column



  medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')



  cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n



  d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'



  d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1



  # merge data

  res <- do.call(cbind, list(data1[col], d2)) 



  return(res)

}





combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)

answered Jan 4 at 18:26

JdeMello

816419

Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).

set.seed(123)



wiki_1 <- data.frame(

  W = sample(1:1000,10,replace = T),

  X = sample(1:100,10,replace = T),

  Y = sample(1:10,10,replace = T),

  Z = sample(1:10,10, replace = T)

)



wiki_2 <- data.frame(

  A = sample(500:1000,100,replace = T),

  B = sample(90:100,100,replace = T),

  C = sample(1:10,100,replace = T),

  D = sample(1:10,100,replace = T)

)



combineMedianComp <- function(data1, data2, col, n){

  if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")



  medRef <- median(data1[[col]], na.rm = T, ) # median of desired column



  medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')



  cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n



  d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'



  d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1



  # merge data

  res <- do.call(cbind, list(data1[col], d2)) 



  return(res)

}





combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)

answered Jan 4 at 18:26

JdeMello

816419

answered Jan 4 at 18:26

JdeMello

816419

answered Jan 4 at 18:26

JdeMello

816419

answered Jan 4 at 18:26

JdeMello

816419

add a comment |

You can do:

time = as.Date('2009-01-01') + 0:9



close_median <- function(df1, df2, to_match = NULL){



    # get median

    m <- median(df1[[to_match]])



    # get difference of median from other data 

    mat_cols <-  apply(df2, 2, function(x) abs(m - median(x)))



    # get top 2 matched column

    cols <- sort(names(sort(v)[1:2]))



    return(cbind(df1[to_match], df2[cols], row.names=time))



}



close_median(wiki_1, wiki_2, 'Y')



            Y  C  D

2009-01-01  8  9 10

2009-01-02  7  8  1

2009-01-03  1  7  7

2009-01-04 10  3 10

2009-01-05  2  1  1

2009-01-06  3 10  3

2009-01-07  6  2  3

2009-01-08  5  8 10

2009-01-09  3  8  5

2009-01-10 10  8  3

answered Jan 4 at 18:27

YOLO

5,5731525

add a comment |

You can do:

time = as.Date('2009-01-01') + 0:9



close_median <- function(df1, df2, to_match = NULL){



    # get median

    m <- median(df1[[to_match]])



    # get difference of median from other data 

    mat_cols <-  apply(df2, 2, function(x) abs(m - median(x)))



    # get top 2 matched column

    cols <- sort(names(sort(v)[1:2]))



    return(cbind(df1[to_match], df2[cols], row.names=time))



}



close_median(wiki_1, wiki_2, 'Y')



            Y  C  D

2009-01-01  8  9 10

2009-01-02  7  8  1

2009-01-03  1  7  7

2009-01-04 10  3 10

2009-01-05  2  1  1

2009-01-06  3 10  3

2009-01-07  6  2  3

2009-01-08  5  8 10

2009-01-09  3  8  5

2009-01-10 10  8  3

answered Jan 4 at 18:27

YOLO

5,5731525

add a comment |

You can do:

time = as.Date('2009-01-01') + 0:9



close_median <- function(df1, df2, to_match = NULL){



    # get median

    m <- median(df1[[to_match]])



    # get difference of median from other data 

    mat_cols <-  apply(df2, 2, function(x) abs(m - median(x)))



    # get top 2 matched column

    cols <- sort(names(sort(v)[1:2]))



    return(cbind(df1[to_match], df2[cols], row.names=time))



}



close_median(wiki_1, wiki_2, 'Y')



            Y  C  D

2009-01-01  8  9 10

2009-01-02  7  8  1

2009-01-03  1  7  7

2009-01-04 10  3 10

2009-01-05  2  1  1

2009-01-06  3 10  3

2009-01-07  6  2  3

2009-01-08  5  8 10

2009-01-09  3  8  5

2009-01-10 10  8  3

answered Jan 4 at 18:27

YOLO

5,5731525

You can do:

time = as.Date('2009-01-01') + 0:9



close_median <- function(df1, df2, to_match = NULL){



    # get median

    m <- median(df1[[to_match]])



    # get difference of median from other data 

    mat_cols <-  apply(df2, 2, function(x) abs(m - median(x)))



    # get top 2 matched column

    cols <- sort(names(sort(v)[1:2]))



    return(cbind(df1[to_match], df2[cols], row.names=time))



}



close_median(wiki_1, wiki_2, 'Y')



            Y  C  D

2009-01-01  8  9 10

2009-01-02  7  8  1

2009-01-03  1  7  7

2009-01-04 10  3 10

2009-01-05  2  1  1

2009-01-06  3 10  3

2009-01-07  6  2  3

2009-01-08  5  8 10

2009-01-09  3  8  5

2009-01-10 10  8  3

answered Jan 4 at 18:27

YOLO

5,5731525

answered Jan 4 at 18:27

YOLO

5,5731525

answered Jan 4 at 18:27

YOLO

5,5731525

answered Jan 4 at 18:27

YOLO

5,5731525

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtjtk