How to randomly select and bind data columns based on their median values in R?





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty{ height:90px;width:728px;box-sizing:border-box;
}







1















I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.



set.seed(123)
library(tidyr)

time = as.Date('2009-01-01') + 0:9

wiki_1 <- data.frame(
W = sample(1:1000,10,replace = T),
X = sample(1:100,10,replace = T),
Y = sample(1:10,10,replace = T),
Z = sample(1:10,10, replace = T)
)

wiki_2 <- data.frame(
A = sample(500:1000,10,replace = T),
B = sample(90:100,10,replace = T),
C = sample(1:10,10,replace = T),
D = sample(1:10,10,replace = T)
)


I want to combine one of the columns from the first dataset (wiki_1) with n columns from the second dataset (wiki_2). But this selection should be based on how close the median values of the columns in wiki_2 are to those in wiki_1 e.g. by order of magnitude.



In this example, for n = 2, Y should be matched with C and D because of how close their median values are.



median(wiki_1$Y) # 7
median(wiki_2$C) # 6
median(wiki_2$D) # 4.5


I'm not sure how to implement the difference in median values criterion to get the desired result.



Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.



This is what I'm working with so far:



df <- zoo(cbind(subset(wiki_1,select="Y"), 
subset(wiki_2,select=c("C","D"))),time)









share|improve this question























  • so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

    – JdeMello
    Jan 4 at 18:00


















1















I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.



set.seed(123)
library(tidyr)

time = as.Date('2009-01-01') + 0:9

wiki_1 <- data.frame(
W = sample(1:1000,10,replace = T),
X = sample(1:100,10,replace = T),
Y = sample(1:10,10,replace = T),
Z = sample(1:10,10, replace = T)
)

wiki_2 <- data.frame(
A = sample(500:1000,10,replace = T),
B = sample(90:100,10,replace = T),
C = sample(1:10,10,replace = T),
D = sample(1:10,10,replace = T)
)


I want to combine one of the columns from the first dataset (wiki_1) with n columns from the second dataset (wiki_2). But this selection should be based on how close the median values of the columns in wiki_2 are to those in wiki_1 e.g. by order of magnitude.



In this example, for n = 2, Y should be matched with C and D because of how close their median values are.



median(wiki_1$Y) # 7
median(wiki_2$C) # 6
median(wiki_2$D) # 4.5


I'm not sure how to implement the difference in median values criterion to get the desired result.



Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.



This is what I'm working with so far:



df <- zoo(cbind(subset(wiki_1,select="Y"), 
subset(wiki_2,select=c("C","D"))),time)









share|improve this question























  • so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

    – JdeMello
    Jan 4 at 18:00














1












1








1








I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.



set.seed(123)
library(tidyr)

time = as.Date('2009-01-01') + 0:9

wiki_1 <- data.frame(
W = sample(1:1000,10,replace = T),
X = sample(1:100,10,replace = T),
Y = sample(1:10,10,replace = T),
Z = sample(1:10,10, replace = T)
)

wiki_2 <- data.frame(
A = sample(500:1000,10,replace = T),
B = sample(90:100,10,replace = T),
C = sample(1:10,10,replace = T),
D = sample(1:10,10,replace = T)
)


I want to combine one of the columns from the first dataset (wiki_1) with n columns from the second dataset (wiki_2). But this selection should be based on how close the median values of the columns in wiki_2 are to those in wiki_1 e.g. by order of magnitude.



In this example, for n = 2, Y should be matched with C and D because of how close their median values are.



median(wiki_1$Y) # 7
median(wiki_2$C) # 6
median(wiki_2$D) # 4.5


I'm not sure how to implement the difference in median values criterion to get the desired result.



Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.



This is what I'm working with so far:



df <- zoo(cbind(subset(wiki_1,select="Y"), 
subset(wiki_2,select=c("C","D"))),time)









share|improve this question














I have two dataframes in wide format. Each of the columns is a time series of page hits for various wikipedia articles.



set.seed(123)
library(tidyr)

time = as.Date('2009-01-01') + 0:9

wiki_1 <- data.frame(
W = sample(1:1000,10,replace = T),
X = sample(1:100,10,replace = T),
Y = sample(1:10,10,replace = T),
Z = sample(1:10,10, replace = T)
)

wiki_2 <- data.frame(
A = sample(500:1000,10,replace = T),
B = sample(90:100,10,replace = T),
C = sample(1:10,10,replace = T),
D = sample(1:10,10,replace = T)
)


I want to combine one of the columns from the first dataset (wiki_1) with n columns from the second dataset (wiki_2). But this selection should be based on how close the median values of the columns in wiki_2 are to those in wiki_1 e.g. by order of magnitude.



In this example, for n = 2, Y should be matched with C and D because of how close their median values are.



median(wiki_1$Y) # 7
median(wiki_2$C) # 6
median(wiki_2$D) # 4.5


I'm not sure how to implement the difference in median values criterion to get the desired result.



Additionally, it would be useful to be able to randomly sample from the columns in wiki_2 that satisfy the criterion as my real dataset has many more columns.



This is what I'm working with so far:



df <- zoo(cbind(subset(wiki_1,select="Y"), 
subset(wiki_2,select=c("C","D"))),time)






r dataframe selection data-manipulation






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Jan 4 at 17:43









Manassa MaulerManassa Mauler

584415




584415













  • so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

    – JdeMello
    Jan 4 at 18:00



















  • so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

    – JdeMello
    Jan 4 at 18:00

















so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

– JdeMello
Jan 4 at 18:00





so you want to check the median for a column in wiki_1 and compare how close it is to the median in each column in wiki_2 and based on a ranking criteria (top 2 closest in your example) you want to combine them. Additionally you want to subset the columns in wiki_2 that satisfy the criteria so as to match the number of rows in the columns specified in wiki_1?

– JdeMello
Jan 4 at 18:00












3 Answers
3






active

oldest

votes


















1














I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.



set.seed(123)
library(tidyr)

time = as.Date('2009-01-01') + 0:9

wiki_1 <- data.frame(
W = sample(1:1000,10,replace = T),
X = sample(1:100,10,replace = T),
Y = sample(1:10,10,replace = T),
Z = sample(1:10,10, replace = T)
)

wiki_2 <- data.frame(
A = sample(500:1000,10,replace = T),
B = sample(90:100,10,replace = T),
C = sample(1:10,10,replace = T),
D = sample(1:10,10,replace = T),
E = sample(1:20,10,replace = T)
)


selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {
#' Select Columns By Median
#' @description Select any number of columns from a test data.frame whose median value is
#' close to the median value of a specified column from a reference data.frame. "Close to"
#' is determined as the absolute value of the difference in medians being less thant he specified cutoff.
#' Outputs a new data.frame containing the reference data.frame's test column and all matching columns
#' from the test data.frame
#' @param df1 reference data.frame
#' @param df2 test data.frame
#' @param ref_v column from reference data.frame to test against
#' @param n_v number of columns from df2 to select
#' @param cutoff_v value to use to determine if test columns' medians are close enough
#' @return data.frame with 1 column from df1 and matching columns from df2

## Get median of ref
med_v <- median(df1[,ref_v], na.rm = T)

## Get other medians
otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))

## Get differences
medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))

## Get whoever is within range (and order them)
inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])
inRangeCols_v <- names(inRange_v)

## Select random sample, if needed
if (length(inRangeCols_v) > n_v){
whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)
} else {
whichRandom_v <- 1:length(inRangeCols_v)
}
finalCols_v <- inRangeCols_v[whichRandom_v]

## Final output
out_df <- cbind(df1[,ref_v], df2[,finalCols_v])
colnames(out_df) <- c(ref_v, finalCols_v)

## Return
return(out_df)
} # selectColsByMedian

### 3 matching columns, select 2
match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)
match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

### 2 matching columns, select 2
match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)





share|improve this answer

































    1














    Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).



    set.seed(123)

    wiki_1 <- data.frame(
    W = sample(1:1000,10,replace = T),
    X = sample(1:100,10,replace = T),
    Y = sample(1:10,10,replace = T),
    Z = sample(1:10,10, replace = T)
    )

    wiki_2 <- data.frame(
    A = sample(500:1000,100,replace = T),
    B = sample(90:100,100,replace = T),
    C = sample(1:10,100,replace = T),
    D = sample(1:10,100,replace = T)
    )

    combineMedianComp <- function(data1, data2, col, n){
    if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")

    medRef <- median(data1[[col]], na.rm = T, ) # median of desired column

    medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')

    cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n

    d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'

    d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1

    # merge data
    res <- do.call(cbind, list(data1[col], d2))

    return(res)
    }


    combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)





    share|improve this answer































      1














      You can do:



      time = as.Date('2009-01-01') + 0:9

      close_median <- function(df1, df2, to_match = NULL){

      # get median
      m <- median(df1[[to_match]])

      # get difference of median from other data
      mat_cols <- apply(df2, 2, function(x) abs(m - median(x)))

      # get top 2 matched column
      cols <- sort(names(sort(v)[1:2]))

      return(cbind(df1[to_match], df2[cols], row.names=time))

      }

      close_median(wiki_1, wiki_2, 'Y')

      Y C D
      2009-01-01 8 9 10
      2009-01-02 7 8 1
      2009-01-03 1 7 7
      2009-01-04 10 3 10
      2009-01-05 2 1 1
      2009-01-06 3 10 3
      2009-01-07 6 2 3
      2009-01-08 5 8 10
      2009-01-09 3 8 5
      2009-01-10 10 8 3





      share|improve this answer
























        Your Answer






        StackExchange.ifUsing("editor", function () {
        StackExchange.using("externalEditor", function () {
        StackExchange.using("snippets", function () {
        StackExchange.snippets.init();
        });
        });
        }, "code-snippets");

        StackExchange.ready(function() {
        var channelOptions = {
        tags: "".split(" "),
        id: "1"
        };
        initTagRenderer("".split(" "), "".split(" "), channelOptions);

        StackExchange.using("externalEditor", function() {
        // Have to fire editor after snippets, if snippets enabled
        if (StackExchange.settings.snippets.snippetsEnabled) {
        StackExchange.using("snippets", function() {
        createEditor();
        });
        }
        else {
        createEditor();
        }
        });

        function createEditor() {
        StackExchange.prepareEditor({
        heartbeatType: 'answer',
        autoActivateHeartbeat: false,
        convertImagesToLinks: true,
        noModals: true,
        showLowRepImageUploadWarning: true,
        reputationToPostImages: 10,
        bindNavPrevention: true,
        postfix: "",
        imageUploader: {
        brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
        contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
        allowUrls: true
        },
        onDemand: true,
        discardSelector: ".discard-answer"
        ,immediatelyShowMarkdownHelp:true
        });


        }
        });














        draft saved

        draft discarded


















        StackExchange.ready(
        function () {
        StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54043723%2fhow-to-randomly-select-and-bind-data-columns-based-on-their-median-values-in-r%23new-answer', 'question_page');
        }
        );

        Post as a guest















        Required, but never shown

























        3 Answers
        3






        active

        oldest

        votes








        3 Answers
        3






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        1














        I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.



        set.seed(123)
        library(tidyr)

        time = as.Date('2009-01-01') + 0:9

        wiki_1 <- data.frame(
        W = sample(1:1000,10,replace = T),
        X = sample(1:100,10,replace = T),
        Y = sample(1:10,10,replace = T),
        Z = sample(1:10,10, replace = T)
        )

        wiki_2 <- data.frame(
        A = sample(500:1000,10,replace = T),
        B = sample(90:100,10,replace = T),
        C = sample(1:10,10,replace = T),
        D = sample(1:10,10,replace = T),
        E = sample(1:20,10,replace = T)
        )


        selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {
        #' Select Columns By Median
        #' @description Select any number of columns from a test data.frame whose median value is
        #' close to the median value of a specified column from a reference data.frame. "Close to"
        #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.
        #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns
        #' from the test data.frame
        #' @param df1 reference data.frame
        #' @param df2 test data.frame
        #' @param ref_v column from reference data.frame to test against
        #' @param n_v number of columns from df2 to select
        #' @param cutoff_v value to use to determine if test columns' medians are close enough
        #' @return data.frame with 1 column from df1 and matching columns from df2

        ## Get median of ref
        med_v <- median(df1[,ref_v], na.rm = T)

        ## Get other medians
        otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))

        ## Get differences
        medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))

        ## Get whoever is within range (and order them)
        inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])
        inRangeCols_v <- names(inRange_v)

        ## Select random sample, if needed
        if (length(inRangeCols_v) > n_v){
        whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)
        } else {
        whichRandom_v <- 1:length(inRangeCols_v)
        }
        finalCols_v <- inRangeCols_v[whichRandom_v]

        ## Final output
        out_df <- cbind(df1[,ref_v], df2[,finalCols_v])
        colnames(out_df) <- c(ref_v, finalCols_v)

        ## Return
        return(out_df)
        } # selectColsByMedian

        ### 3 matching columns, select 2
        match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)
        match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

        ### 2 matching columns, select 2
        match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)





        share|improve this answer






























          1














          I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.



          set.seed(123)
          library(tidyr)

          time = as.Date('2009-01-01') + 0:9

          wiki_1 <- data.frame(
          W = sample(1:1000,10,replace = T),
          X = sample(1:100,10,replace = T),
          Y = sample(1:10,10,replace = T),
          Z = sample(1:10,10, replace = T)
          )

          wiki_2 <- data.frame(
          A = sample(500:1000,10,replace = T),
          B = sample(90:100,10,replace = T),
          C = sample(1:10,10,replace = T),
          D = sample(1:10,10,replace = T),
          E = sample(1:20,10,replace = T)
          )


          selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {
          #' Select Columns By Median
          #' @description Select any number of columns from a test data.frame whose median value is
          #' close to the median value of a specified column from a reference data.frame. "Close to"
          #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.
          #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns
          #' from the test data.frame
          #' @param df1 reference data.frame
          #' @param df2 test data.frame
          #' @param ref_v column from reference data.frame to test against
          #' @param n_v number of columns from df2 to select
          #' @param cutoff_v value to use to determine if test columns' medians are close enough
          #' @return data.frame with 1 column from df1 and matching columns from df2

          ## Get median of ref
          med_v <- median(df1[,ref_v], na.rm = T)

          ## Get other medians
          otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))

          ## Get differences
          medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))

          ## Get whoever is within range (and order them)
          inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])
          inRangeCols_v <- names(inRange_v)

          ## Select random sample, if needed
          if (length(inRangeCols_v) > n_v){
          whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)
          } else {
          whichRandom_v <- 1:length(inRangeCols_v)
          }
          finalCols_v <- inRangeCols_v[whichRandom_v]

          ## Final output
          out_df <- cbind(df1[,ref_v], df2[,finalCols_v])
          colnames(out_df) <- c(ref_v, finalCols_v)

          ## Return
          return(out_df)
          } # selectColsByMedian

          ### 3 matching columns, select 2
          match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)
          match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

          ### 2 matching columns, select 2
          match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)





          share|improve this answer




























            1












            1








            1







            I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.



            set.seed(123)
            library(tidyr)

            time = as.Date('2009-01-01') + 0:9

            wiki_1 <- data.frame(
            W = sample(1:1000,10,replace = T),
            X = sample(1:100,10,replace = T),
            Y = sample(1:10,10,replace = T),
            Z = sample(1:10,10, replace = T)
            )

            wiki_2 <- data.frame(
            A = sample(500:1000,10,replace = T),
            B = sample(90:100,10,replace = T),
            C = sample(1:10,10,replace = T),
            D = sample(1:10,10,replace = T),
            E = sample(1:20,10,replace = T)
            )


            selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {
            #' Select Columns By Median
            #' @description Select any number of columns from a test data.frame whose median value is
            #' close to the median value of a specified column from a reference data.frame. "Close to"
            #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.
            #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns
            #' from the test data.frame
            #' @param df1 reference data.frame
            #' @param df2 test data.frame
            #' @param ref_v column from reference data.frame to test against
            #' @param n_v number of columns from df2 to select
            #' @param cutoff_v value to use to determine if test columns' medians are close enough
            #' @return data.frame with 1 column from df1 and matching columns from df2

            ## Get median of ref
            med_v <- median(df1[,ref_v], na.rm = T)

            ## Get other medians
            otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))

            ## Get differences
            medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))

            ## Get whoever is within range (and order them)
            inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])
            inRangeCols_v <- names(inRange_v)

            ## Select random sample, if needed
            if (length(inRangeCols_v) > n_v){
            whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)
            } else {
            whichRandom_v <- 1:length(inRangeCols_v)
            }
            finalCols_v <- inRangeCols_v[whichRandom_v]

            ## Final output
            out_df <- cbind(df1[,ref_v], df2[,finalCols_v])
            colnames(out_df) <- c(ref_v, finalCols_v)

            ## Return
            return(out_df)
            } # selectColsByMedian

            ### 3 matching columns, select 2
            match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)
            match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

            ### 2 matching columns, select 2
            match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)





            share|improve this answer















            I think this is what you're after. I added a column to wiki_2 in order to allow more than 2 matches to show the random selection of matching columns.



            set.seed(123)
            library(tidyr)

            time = as.Date('2009-01-01') + 0:9

            wiki_1 <- data.frame(
            W = sample(1:1000,10,replace = T),
            X = sample(1:100,10,replace = T),
            Y = sample(1:10,10,replace = T),
            Z = sample(1:10,10, replace = T)
            )

            wiki_2 <- data.frame(
            A = sample(500:1000,10,replace = T),
            B = sample(90:100,10,replace = T),
            C = sample(1:10,10,replace = T),
            D = sample(1:10,10,replace = T),
            E = sample(1:20,10,replace = T)
            )


            selectColsByMedian <- function(df1, df2, ref_v, n_v, cutoff_v) {
            #' Select Columns By Median
            #' @description Select any number of columns from a test data.frame whose median value is
            #' close to the median value of a specified column from a reference data.frame. "Close to"
            #' is determined as the absolute value of the difference in medians being less thant he specified cutoff.
            #' Outputs a new data.frame containing the reference data.frame's test column and all matching columns
            #' from the test data.frame
            #' @param df1 reference data.frame
            #' @param df2 test data.frame
            #' @param ref_v column from reference data.frame to test against
            #' @param n_v number of columns from df2 to select
            #' @param cutoff_v value to use to determine if test columns' medians are close enough
            #' @return data.frame with 1 column from df1 and matching columns from df2

            ## Get median of ref
            med_v <- median(df1[,ref_v], na.rm = T)

            ## Get other medians
            otherMed_v <- apply(wiki_2, 2, function(x) median(x, na.rm = T))

            ## Get differences
            medDiff_v <- sapply(otherMed_v, function(x) abs(med_v - x))

            ## Get whoever is within range (and order them)
            inRange_v <- sort(medDiff_v[medDiff_v < cutoff_v])
            inRangeCols_v <- names(inRange_v)

            ## Select random sample, if needed
            if (length(inRangeCols_v) > n_v){
            whichRandom_v <- sample(1:length(inRangeCols_v), size = n_v, replace = F)
            } else {
            whichRandom_v <- 1:length(inRangeCols_v)
            }
            finalCols_v <- inRangeCols_v[whichRandom_v]

            ## Final output
            out_df <- cbind(df1[,ref_v], df2[,finalCols_v])
            colnames(out_df) <- c(ref_v, finalCols_v)

            ## Return
            return(out_df)
            } # selectColsByMedian

            ### 3 matching columns, select 2
            match3pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)
            match3pick2_df2 <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 12)

            ### 2 matching columns, select 2
            match2pick2_df <- selectColsByMedian(df1 = wiki_1, df2 = wiki_2, ref_v = "Y", n_v = 2, cutoff_v = 10)






            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Jan 4 at 18:35

























            answered Jan 4 at 18:08









            QwfqwfQwfqwf

            34318




            34318

























                1














                Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).



                set.seed(123)

                wiki_1 <- data.frame(
                W = sample(1:1000,10,replace = T),
                X = sample(1:100,10,replace = T),
                Y = sample(1:10,10,replace = T),
                Z = sample(1:10,10, replace = T)
                )

                wiki_2 <- data.frame(
                A = sample(500:1000,100,replace = T),
                B = sample(90:100,100,replace = T),
                C = sample(1:10,100,replace = T),
                D = sample(1:10,100,replace = T)
                )

                combineMedianComp <- function(data1, data2, col, n){
                if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")

                medRef <- median(data1[[col]], na.rm = T, ) # median of desired column

                medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')

                cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n

                d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'

                d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1

                # merge data
                res <- do.call(cbind, list(data1[col], d2))

                return(res)
                }


                combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)





                share|improve this answer




























                  1














                  Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).



                  set.seed(123)

                  wiki_1 <- data.frame(
                  W = sample(1:1000,10,replace = T),
                  X = sample(1:100,10,replace = T),
                  Y = sample(1:10,10,replace = T),
                  Z = sample(1:10,10, replace = T)
                  )

                  wiki_2 <- data.frame(
                  A = sample(500:1000,100,replace = T),
                  B = sample(90:100,100,replace = T),
                  C = sample(1:10,100,replace = T),
                  D = sample(1:10,100,replace = T)
                  )

                  combineMedianComp <- function(data1, data2, col, n){
                  if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")

                  medRef <- median(data1[[col]], na.rm = T, ) # median of desired column

                  medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')

                  cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n

                  d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'

                  d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1

                  # merge data
                  res <- do.call(cbind, list(data1[col], d2))

                  return(res)
                  }


                  combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)





                  share|improve this answer


























                    1












                    1








                    1







                    Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).



                    set.seed(123)

                    wiki_1 <- data.frame(
                    W = sample(1:1000,10,replace = T),
                    X = sample(1:100,10,replace = T),
                    Y = sample(1:10,10,replace = T),
                    Z = sample(1:10,10, replace = T)
                    )

                    wiki_2 <- data.frame(
                    A = sample(500:1000,100,replace = T),
                    B = sample(90:100,100,replace = T),
                    C = sample(1:10,100,replace = T),
                    D = sample(1:10,100,replace = T)
                    )

                    combineMedianComp <- function(data1, data2, col, n){
                    if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")

                    medRef <- median(data1[[col]], na.rm = T, ) # median of desired column

                    medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')

                    cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n

                    d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'

                    d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1

                    # merge data
                    res <- do.call(cbind, list(data1[col], d2))

                    return(res)
                    }


                    combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)





                    share|improve this answer













                    Here is my solution, I've added more columns to wiki_2 to allow for subsetting (but it works if ncols(wiki_1) == ncols(wiki_2).



                    set.seed(123)

                    wiki_1 <- data.frame(
                    W = sample(1:1000,10,replace = T),
                    X = sample(1:100,10,replace = T),
                    Y = sample(1:10,10,replace = T),
                    Z = sample(1:10,10, replace = T)
                    )

                    wiki_2 <- data.frame(
                    A = sample(500:1000,100,replace = T),
                    B = sample(90:100,100,replace = T),
                    C = sample(1:10,100,replace = T),
                    D = sample(1:10,100,replace = T)
                    )

                    combineMedianComp <- function(data1, data2, col, n){
                    if(nrow(data1) > nrow(data2)) stop("Rows in 'data2' need to be greater or equal to rows in 'data1'")

                    medRef <- median(data1[[col]], na.rm = T, ) # median of desired column

                    medComp <- sapply(data2, function(x){abs(medRef - median(x, na.rm = T))}) # vector with medians for each columns in data2 ('wiki_2')

                    cols <- names(sort(medComp)[seq_len(n)]) # sort this vector in ascending order, select top n

                    d2 <- data2[, c(cols)] # select columns in data2 that have medians closest to 'medRef'

                    d2 <- d2[sample(seq_len(nrow(d2)), size = nrow(data1), replace = F), ] # subset column as to match those in data1

                    # merge data
                    res <- do.call(cbind, list(data1[col], d2))

                    return(res)
                    }


                    combineMedianComp(data1 = wiki_1, data2 = wiki_2, col = "Y", n = 2)






                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Jan 4 at 18:26









                    JdeMelloJdeMello

                    816419




                    816419























                        1














                        You can do:



                        time = as.Date('2009-01-01') + 0:9

                        close_median <- function(df1, df2, to_match = NULL){

                        # get median
                        m <- median(df1[[to_match]])

                        # get difference of median from other data
                        mat_cols <- apply(df2, 2, function(x) abs(m - median(x)))

                        # get top 2 matched column
                        cols <- sort(names(sort(v)[1:2]))

                        return(cbind(df1[to_match], df2[cols], row.names=time))

                        }

                        close_median(wiki_1, wiki_2, 'Y')

                        Y C D
                        2009-01-01 8 9 10
                        2009-01-02 7 8 1
                        2009-01-03 1 7 7
                        2009-01-04 10 3 10
                        2009-01-05 2 1 1
                        2009-01-06 3 10 3
                        2009-01-07 6 2 3
                        2009-01-08 5 8 10
                        2009-01-09 3 8 5
                        2009-01-10 10 8 3





                        share|improve this answer




























                          1














                          You can do:



                          time = as.Date('2009-01-01') + 0:9

                          close_median <- function(df1, df2, to_match = NULL){

                          # get median
                          m <- median(df1[[to_match]])

                          # get difference of median from other data
                          mat_cols <- apply(df2, 2, function(x) abs(m - median(x)))

                          # get top 2 matched column
                          cols <- sort(names(sort(v)[1:2]))

                          return(cbind(df1[to_match], df2[cols], row.names=time))

                          }

                          close_median(wiki_1, wiki_2, 'Y')

                          Y C D
                          2009-01-01 8 9 10
                          2009-01-02 7 8 1
                          2009-01-03 1 7 7
                          2009-01-04 10 3 10
                          2009-01-05 2 1 1
                          2009-01-06 3 10 3
                          2009-01-07 6 2 3
                          2009-01-08 5 8 10
                          2009-01-09 3 8 5
                          2009-01-10 10 8 3





                          share|improve this answer


























                            1












                            1








                            1







                            You can do:



                            time = as.Date('2009-01-01') + 0:9

                            close_median <- function(df1, df2, to_match = NULL){

                            # get median
                            m <- median(df1[[to_match]])

                            # get difference of median from other data
                            mat_cols <- apply(df2, 2, function(x) abs(m - median(x)))

                            # get top 2 matched column
                            cols <- sort(names(sort(v)[1:2]))

                            return(cbind(df1[to_match], df2[cols], row.names=time))

                            }

                            close_median(wiki_1, wiki_2, 'Y')

                            Y C D
                            2009-01-01 8 9 10
                            2009-01-02 7 8 1
                            2009-01-03 1 7 7
                            2009-01-04 10 3 10
                            2009-01-05 2 1 1
                            2009-01-06 3 10 3
                            2009-01-07 6 2 3
                            2009-01-08 5 8 10
                            2009-01-09 3 8 5
                            2009-01-10 10 8 3





                            share|improve this answer













                            You can do:



                            time = as.Date('2009-01-01') + 0:9

                            close_median <- function(df1, df2, to_match = NULL){

                            # get median
                            m <- median(df1[[to_match]])

                            # get difference of median from other data
                            mat_cols <- apply(df2, 2, function(x) abs(m - median(x)))

                            # get top 2 matched column
                            cols <- sort(names(sort(v)[1:2]))

                            return(cbind(df1[to_match], df2[cols], row.names=time))

                            }

                            close_median(wiki_1, wiki_2, 'Y')

                            Y C D
                            2009-01-01 8 9 10
                            2009-01-02 7 8 1
                            2009-01-03 1 7 7
                            2009-01-04 10 3 10
                            2009-01-05 2 1 1
                            2009-01-06 3 10 3
                            2009-01-07 6 2 3
                            2009-01-08 5 8 10
                            2009-01-09 3 8 5
                            2009-01-10 10 8 3






                            share|improve this answer












                            share|improve this answer



                            share|improve this answer










                            answered Jan 4 at 18:27









                            YOLOYOLO

                            5,5731525




                            5,5731525






























                                draft saved

                                draft discarded




















































                                Thanks for contributing an answer to Stack Overflow!


                                • Please be sure to answer the question. Provide details and share your research!

                                But avoid



                                • Asking for help, clarification, or responding to other answers.

                                • Making statements based on opinion; back them up with references or personal experience.


                                To learn more, see our tips on writing great answers.




                                draft saved


                                draft discarded














                                StackExchange.ready(
                                function () {
                                StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54043723%2fhow-to-randomly-select-and-bind-data-columns-based-on-their-median-values-in-r%23new-answer', 'question_page');
                                }
                                );

                                Post as a guest















                                Required, but never shown





















































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown

































                                Required, but never shown














                                Required, but never shown












                                Required, but never shown







                                Required, but never shown







                                Popular posts from this blog

                                Mossoró

                                Error while reading .h5 file using the rhdf5 package in R

                                Pushsharp Apns notification error: 'InvalidToken'