Handling NAs when mapping several dataframe columns into their percentile values in R












3















> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))

> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927


I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).



Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.



> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE


Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).



I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!










share|improve this question























  • You could probably take advantage of something like is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…

    – Chase
    Jan 2 at 1:26






  • 1





    So what do you want to do with NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?

    – Ronak Shah
    Jan 2 at 2:13













  • @RonakShah preferably they have an NaN percentile value, so ignored from calculation

    – Canovice
    Jan 3 at 15:42
















3















> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))

> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927


I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).



Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.



> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE


Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).



I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!










share|improve this question























  • You could probably take advantage of something like is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…

    – Chase
    Jan 2 at 1:26






  • 1





    So what do you want to do with NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?

    – Ronak Shah
    Jan 2 at 2:13













  • @RonakShah preferably they have an NaN percentile value, so ignored from calculation

    – Canovice
    Jan 3 at 15:42














3












3








3








> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))

> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927


I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).



Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.



> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE


Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).



I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!










share|improve this question














> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))

> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927


I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).



Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.



> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE


Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).



I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!







r dplyr data-manipulation






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Jan 2 at 0:35









CanoviceCanovice

1,5071839




1,5071839













  • You could probably take advantage of something like is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…

    – Chase
    Jan 2 at 1:26






  • 1





    So what do you want to do with NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?

    – Ronak Shah
    Jan 2 at 2:13













  • @RonakShah preferably they have an NaN percentile value, so ignored from calculation

    – Canovice
    Jan 3 at 15:42



















  • You could probably take advantage of something like is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…

    – Chase
    Jan 2 at 1:26






  • 1





    So what do you want to do with NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?

    – Ronak Shah
    Jan 2 at 2:13













  • @RonakShah preferably they have an NaN percentile value, so ignored from calculation

    – Canovice
    Jan 3 at 15:42

















You could probably take advantage of something like is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…

– Chase
Jan 2 at 1:26





You could probably take advantage of something like is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…

– Chase
Jan 2 at 1:26




1




1





So what do you want to do with NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?

– Ronak Shah
Jan 2 at 2:13







So what do you want to do with NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?

– Ronak Shah
Jan 2 at 2:13















@RonakShah preferably they have an NaN percentile value, so ignored from calculation

– Canovice
Jan 3 at 15:42





@RonakShah preferably they have an NaN percentile value, so ignored from calculation

– Canovice
Jan 3 at 15:42












2 Answers
2






active

oldest

votes


















1














There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.



Here is the code of the percentile function.



library(dplyr)
library(fmsb)

percentile
# function (dat)
# {
# pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
# pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
# pt3 <- rownames(pt2)
# pt4 <- as.integer(strsplit(pt3, "%"))
# datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
# return(datp)
# }
# <bytecode: 0x0000000016c498b0>
# <environment: namespace:fmsb>


As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.



In addition, when providing a vector with zero, the function returns error as follows.



percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
# 'breaks' are not unique


My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.



percentile_narm_zero <- function(dat, small = 0.0000000000001){

# Create a data frame with the numeric values and index
dat2 <- data.frame(index = 1:length(dat), dat = dat)
# Remove NA
dat3 <- dat2[ !is.na(dat2$dat), ]
# Add a small number to 0
dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)

# This part is the same as the percentile function
pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
pt3 <- rownames(pt2)
pt4 <- as.integer(strsplit(pt3, "%"))
datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]

# Merge datp back to dat2
dat3$datp <- datp
dat4 <- merge(dat2, dat3, by = "index", all = TRUE)

return(dat4$datp)
}


Now we can apply this function to all columns in zed using mutate_all.



zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
# A tibble: 20 x 6
# col1 col2 col3 col1_pctile col2_pctile col3_pctile
# <dbl> <dbl> <dbl> <int> <int> <int>
# 1 0 0.122 0.228 11 42 83
# 2 0.236 0.180 0.265 89 77 95
# 3 0.438 0.306 0.190 100 100 42
# 4 0 0 0.564 11 17 100
# 5 NaN NaN NaN NA NA NA
# 6 0.199 0.239 0.117 84 89 18
# 7 0.0755 0.0626 0.136 34 30 24
# 8 0.102 0.162 0.137 45 65 30
# 9 0.0459 0.0613 0.227 23 24 77
# 10 0.256 0.220 0.216 95 83 71
# 11 0.0881 0.136 0.241 39 48 89
# 12 0.117 NaN 0.108 56 NA 12
# 13 0.128 0 0.0681 73 17 0
# 14 0 0 0.195 11 17 48
# 15 0.118 0.0934 NaN 62 36 NA
# 16 0.115 0.140 0.208 50 53 65
# 17 0.0584 0.142 0.200 28 59 53
# 18 0.123 0 0.140 67 17 36
# 19 0.197 0.177 0.202 78 71 59
# 20 0.0370 0.254 0.0927 17 95 6





share|improve this answer



















  • 1





    wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.

    – Canovice
    Jan 3 at 15:44



















0














First define a function to calculate percentile group as:



percentile_group <- function(x)
{
y <- as.numeric(x) %>% discard(is.na)
qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
#return(qn)
return(grp)
}


Now use the function in a mutate statement as



 mutate_if(zen, is.numeric, funs(pctile = percentile_group))


The output is:



# A tibble: 20 x 6
col1 col2 col3 col1_pctile col2_pctile col3_pctile
<dbl> <dbl> <dbl> <int> <int> <int>
1 0 0.122 0.228 1 4 9
2 0.236 0.180 0.265 8 7 10
3 0.438 0.306 0.190 9 9 5
4 0 0 0.564 1 1 10
5 NaN NaN NaN NA NA NA
6 0.199 0.239 0.117 8 8 2
7 0.0755 0.0626 0.136 3 2 3
8 0.102 0.162 0.137 4 6 3
9 0.0459 0.0613 0.227 2 2 8
10 0.256 0.220 0.216 9 8 8
11 0.0881 0.136 0.241 3 4 9
12 0.117 NaN 0.108 5 NA 2
13 0.128 0 0.0681 7 1 1
14 0 0 0.195 1 1 5
15 0.118 0.0934 NaN 6 3 NA
16 0.115 0.140 0.208 4 5 7
17 0.0584 0.142 0.200 2 5 6
18 0.123 0 0.140 6 1 4
19 0.197 0.177 0.202 7 7 6
20 0.0370 0.254 0.0927 1 9 1





share|improve this answer























    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54000053%2fhandling-nas-when-mapping-several-dataframe-columns-into-their-percentile-values%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.



    Here is the code of the percentile function.



    library(dplyr)
    library(fmsb)

    percentile
    # function (dat)
    # {
    # pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
    # pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
    # pt3 <- rownames(pt2)
    # pt4 <- as.integer(strsplit(pt3, "%"))
    # datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
    # return(datp)
    # }
    # <bytecode: 0x0000000016c498b0>
    # <environment: namespace:fmsb>


    As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.



    In addition, when providing a vector with zero, the function returns error as follows.



    percentile(0:5)
    # Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
    # 'breaks' are not unique


    My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.



    percentile_narm_zero <- function(dat, small = 0.0000000000001){

    # Create a data frame with the numeric values and index
    dat2 <- data.frame(index = 1:length(dat), dat = dat)
    # Remove NA
    dat3 <- dat2[ !is.na(dat2$dat), ]
    # Add a small number to 0
    dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)

    # This part is the same as the percentile function
    pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
    pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
    pt3 <- rownames(pt2)
    pt4 <- as.integer(strsplit(pt3, "%"))
    datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]

    # Merge datp back to dat2
    dat3$datp <- datp
    dat4 <- merge(dat2, dat3, by = "index", all = TRUE)

    return(dat4$datp)
    }


    Now we can apply this function to all columns in zed using mutate_all.



    zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
    # A tibble: 20 x 6
    # col1 col2 col3 col1_pctile col2_pctile col3_pctile
    # <dbl> <dbl> <dbl> <int> <int> <int>
    # 1 0 0.122 0.228 11 42 83
    # 2 0.236 0.180 0.265 89 77 95
    # 3 0.438 0.306 0.190 100 100 42
    # 4 0 0 0.564 11 17 100
    # 5 NaN NaN NaN NA NA NA
    # 6 0.199 0.239 0.117 84 89 18
    # 7 0.0755 0.0626 0.136 34 30 24
    # 8 0.102 0.162 0.137 45 65 30
    # 9 0.0459 0.0613 0.227 23 24 77
    # 10 0.256 0.220 0.216 95 83 71
    # 11 0.0881 0.136 0.241 39 48 89
    # 12 0.117 NaN 0.108 56 NA 12
    # 13 0.128 0 0.0681 73 17 0
    # 14 0 0 0.195 11 17 48
    # 15 0.118 0.0934 NaN 62 36 NA
    # 16 0.115 0.140 0.208 50 53 65
    # 17 0.0584 0.142 0.200 28 59 53
    # 18 0.123 0 0.140 67 17 36
    # 19 0.197 0.177 0.202 78 71 59
    # 20 0.0370 0.254 0.0927 17 95 6





    share|improve this answer



















    • 1





      wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.

      – Canovice
      Jan 3 at 15:44
















    1














    There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.



    Here is the code of the percentile function.



    library(dplyr)
    library(fmsb)

    percentile
    # function (dat)
    # {
    # pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
    # pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
    # pt3 <- rownames(pt2)
    # pt4 <- as.integer(strsplit(pt3, "%"))
    # datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
    # return(datp)
    # }
    # <bytecode: 0x0000000016c498b0>
    # <environment: namespace:fmsb>


    As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.



    In addition, when providing a vector with zero, the function returns error as follows.



    percentile(0:5)
    # Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
    # 'breaks' are not unique


    My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.



    percentile_narm_zero <- function(dat, small = 0.0000000000001){

    # Create a data frame with the numeric values and index
    dat2 <- data.frame(index = 1:length(dat), dat = dat)
    # Remove NA
    dat3 <- dat2[ !is.na(dat2$dat), ]
    # Add a small number to 0
    dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)

    # This part is the same as the percentile function
    pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
    pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
    pt3 <- rownames(pt2)
    pt4 <- as.integer(strsplit(pt3, "%"))
    datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]

    # Merge datp back to dat2
    dat3$datp <- datp
    dat4 <- merge(dat2, dat3, by = "index", all = TRUE)

    return(dat4$datp)
    }


    Now we can apply this function to all columns in zed using mutate_all.



    zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
    # A tibble: 20 x 6
    # col1 col2 col3 col1_pctile col2_pctile col3_pctile
    # <dbl> <dbl> <dbl> <int> <int> <int>
    # 1 0 0.122 0.228 11 42 83
    # 2 0.236 0.180 0.265 89 77 95
    # 3 0.438 0.306 0.190 100 100 42
    # 4 0 0 0.564 11 17 100
    # 5 NaN NaN NaN NA NA NA
    # 6 0.199 0.239 0.117 84 89 18
    # 7 0.0755 0.0626 0.136 34 30 24
    # 8 0.102 0.162 0.137 45 65 30
    # 9 0.0459 0.0613 0.227 23 24 77
    # 10 0.256 0.220 0.216 95 83 71
    # 11 0.0881 0.136 0.241 39 48 89
    # 12 0.117 NaN 0.108 56 NA 12
    # 13 0.128 0 0.0681 73 17 0
    # 14 0 0 0.195 11 17 48
    # 15 0.118 0.0934 NaN 62 36 NA
    # 16 0.115 0.140 0.208 50 53 65
    # 17 0.0584 0.142 0.200 28 59 53
    # 18 0.123 0 0.140 67 17 36
    # 19 0.197 0.177 0.202 78 71 59
    # 20 0.0370 0.254 0.0927 17 95 6





    share|improve this answer



















    • 1





      wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.

      – Canovice
      Jan 3 at 15:44














    1












    1








    1







    There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.



    Here is the code of the percentile function.



    library(dplyr)
    library(fmsb)

    percentile
    # function (dat)
    # {
    # pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
    # pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
    # pt3 <- rownames(pt2)
    # pt4 <- as.integer(strsplit(pt3, "%"))
    # datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
    # return(datp)
    # }
    # <bytecode: 0x0000000016c498b0>
    # <environment: namespace:fmsb>


    As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.



    In addition, when providing a vector with zero, the function returns error as follows.



    percentile(0:5)
    # Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
    # 'breaks' are not unique


    My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.



    percentile_narm_zero <- function(dat, small = 0.0000000000001){

    # Create a data frame with the numeric values and index
    dat2 <- data.frame(index = 1:length(dat), dat = dat)
    # Remove NA
    dat3 <- dat2[ !is.na(dat2$dat), ]
    # Add a small number to 0
    dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)

    # This part is the same as the percentile function
    pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
    pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
    pt3 <- rownames(pt2)
    pt4 <- as.integer(strsplit(pt3, "%"))
    datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]

    # Merge datp back to dat2
    dat3$datp <- datp
    dat4 <- merge(dat2, dat3, by = "index", all = TRUE)

    return(dat4$datp)
    }


    Now we can apply this function to all columns in zed using mutate_all.



    zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
    # A tibble: 20 x 6
    # col1 col2 col3 col1_pctile col2_pctile col3_pctile
    # <dbl> <dbl> <dbl> <int> <int> <int>
    # 1 0 0.122 0.228 11 42 83
    # 2 0.236 0.180 0.265 89 77 95
    # 3 0.438 0.306 0.190 100 100 42
    # 4 0 0 0.564 11 17 100
    # 5 NaN NaN NaN NA NA NA
    # 6 0.199 0.239 0.117 84 89 18
    # 7 0.0755 0.0626 0.136 34 30 24
    # 8 0.102 0.162 0.137 45 65 30
    # 9 0.0459 0.0613 0.227 23 24 77
    # 10 0.256 0.220 0.216 95 83 71
    # 11 0.0881 0.136 0.241 39 48 89
    # 12 0.117 NaN 0.108 56 NA 12
    # 13 0.128 0 0.0681 73 17 0
    # 14 0 0 0.195 11 17 48
    # 15 0.118 0.0934 NaN 62 36 NA
    # 16 0.115 0.140 0.208 50 53 65
    # 17 0.0584 0.142 0.200 28 59 53
    # 18 0.123 0 0.140 67 17 36
    # 19 0.197 0.177 0.202 78 71 59
    # 20 0.0370 0.254 0.0927 17 95 6





    share|improve this answer













    There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.



    Here is the code of the percentile function.



    library(dplyr)
    library(fmsb)

    percentile
    # function (dat)
    # {
    # pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
    # pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
    # pt3 <- rownames(pt2)
    # pt4 <- as.integer(strsplit(pt3, "%"))
    # datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
    # return(datp)
    # }
    # <bytecode: 0x0000000016c498b0>
    # <environment: namespace:fmsb>


    As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.



    In addition, when providing a vector with zero, the function returns error as follows.



    percentile(0:5)
    # Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
    # 'breaks' are not unique


    My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.



    percentile_narm_zero <- function(dat, small = 0.0000000000001){

    # Create a data frame with the numeric values and index
    dat2 <- data.frame(index = 1:length(dat), dat = dat)
    # Remove NA
    dat3 <- dat2[ !is.na(dat2$dat), ]
    # Add a small number to 0
    dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)

    # This part is the same as the percentile function
    pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
    pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
    pt3 <- rownames(pt2)
    pt4 <- as.integer(strsplit(pt3, "%"))
    datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]

    # Merge datp back to dat2
    dat3$datp <- datp
    dat4 <- merge(dat2, dat3, by = "index", all = TRUE)

    return(dat4$datp)
    }


    Now we can apply this function to all columns in zed using mutate_all.



    zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
    # A tibble: 20 x 6
    # col1 col2 col3 col1_pctile col2_pctile col3_pctile
    # <dbl> <dbl> <dbl> <int> <int> <int>
    # 1 0 0.122 0.228 11 42 83
    # 2 0.236 0.180 0.265 89 77 95
    # 3 0.438 0.306 0.190 100 100 42
    # 4 0 0 0.564 11 17 100
    # 5 NaN NaN NaN NA NA NA
    # 6 0.199 0.239 0.117 84 89 18
    # 7 0.0755 0.0626 0.136 34 30 24
    # 8 0.102 0.162 0.137 45 65 30
    # 9 0.0459 0.0613 0.227 23 24 77
    # 10 0.256 0.220 0.216 95 83 71
    # 11 0.0881 0.136 0.241 39 48 89
    # 12 0.117 NaN 0.108 56 NA 12
    # 13 0.128 0 0.0681 73 17 0
    # 14 0 0 0.195 11 17 48
    # 15 0.118 0.0934 NaN 62 36 NA
    # 16 0.115 0.140 0.208 50 53 65
    # 17 0.0584 0.142 0.200 28 59 53
    # 18 0.123 0 0.140 67 17 36
    # 19 0.197 0.177 0.202 78 71 59
    # 20 0.0370 0.254 0.0927 17 95 6






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Jan 2 at 2:18









    wwwwww

    27.8k112241




    27.8k112241








    • 1





      wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.

      – Canovice
      Jan 3 at 15:44














    • 1





      wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.

      – Canovice
      Jan 3 at 15:44








    1




    1





    wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.

    – Canovice
    Jan 3 at 15:44





    wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.

    – Canovice
    Jan 3 at 15:44













    0














    First define a function to calculate percentile group as:



    percentile_group <- function(x)
    {
    y <- as.numeric(x) %>% discard(is.na)
    qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
    grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
    #return(qn)
    return(grp)
    }


    Now use the function in a mutate statement as



     mutate_if(zen, is.numeric, funs(pctile = percentile_group))


    The output is:



    # A tibble: 20 x 6
    col1 col2 col3 col1_pctile col2_pctile col3_pctile
    <dbl> <dbl> <dbl> <int> <int> <int>
    1 0 0.122 0.228 1 4 9
    2 0.236 0.180 0.265 8 7 10
    3 0.438 0.306 0.190 9 9 5
    4 0 0 0.564 1 1 10
    5 NaN NaN NaN NA NA NA
    6 0.199 0.239 0.117 8 8 2
    7 0.0755 0.0626 0.136 3 2 3
    8 0.102 0.162 0.137 4 6 3
    9 0.0459 0.0613 0.227 2 2 8
    10 0.256 0.220 0.216 9 8 8
    11 0.0881 0.136 0.241 3 4 9
    12 0.117 NaN 0.108 5 NA 2
    13 0.128 0 0.0681 7 1 1
    14 0 0 0.195 1 1 5
    15 0.118 0.0934 NaN 6 3 NA
    16 0.115 0.140 0.208 4 5 7
    17 0.0584 0.142 0.200 2 5 6
    18 0.123 0 0.140 6 1 4
    19 0.197 0.177 0.202 7 7 6
    20 0.0370 0.254 0.0927 1 9 1





    share|improve this answer




























      0














      First define a function to calculate percentile group as:



      percentile_group <- function(x)
      {
      y <- as.numeric(x) %>% discard(is.na)
      qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
      grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
      #return(qn)
      return(grp)
      }


      Now use the function in a mutate statement as



       mutate_if(zen, is.numeric, funs(pctile = percentile_group))


      The output is:



      # A tibble: 20 x 6
      col1 col2 col3 col1_pctile col2_pctile col3_pctile
      <dbl> <dbl> <dbl> <int> <int> <int>
      1 0 0.122 0.228 1 4 9
      2 0.236 0.180 0.265 8 7 10
      3 0.438 0.306 0.190 9 9 5
      4 0 0 0.564 1 1 10
      5 NaN NaN NaN NA NA NA
      6 0.199 0.239 0.117 8 8 2
      7 0.0755 0.0626 0.136 3 2 3
      8 0.102 0.162 0.137 4 6 3
      9 0.0459 0.0613 0.227 2 2 8
      10 0.256 0.220 0.216 9 8 8
      11 0.0881 0.136 0.241 3 4 9
      12 0.117 NaN 0.108 5 NA 2
      13 0.128 0 0.0681 7 1 1
      14 0 0 0.195 1 1 5
      15 0.118 0.0934 NaN 6 3 NA
      16 0.115 0.140 0.208 4 5 7
      17 0.0584 0.142 0.200 2 5 6
      18 0.123 0 0.140 6 1 4
      19 0.197 0.177 0.202 7 7 6
      20 0.0370 0.254 0.0927 1 9 1





      share|improve this answer


























        0












        0








        0







        First define a function to calculate percentile group as:



        percentile_group <- function(x)
        {
        y <- as.numeric(x) %>% discard(is.na)
        qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
        grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
        #return(qn)
        return(grp)
        }


        Now use the function in a mutate statement as



         mutate_if(zen, is.numeric, funs(pctile = percentile_group))


        The output is:



        # A tibble: 20 x 6
        col1 col2 col3 col1_pctile col2_pctile col3_pctile
        <dbl> <dbl> <dbl> <int> <int> <int>
        1 0 0.122 0.228 1 4 9
        2 0.236 0.180 0.265 8 7 10
        3 0.438 0.306 0.190 9 9 5
        4 0 0 0.564 1 1 10
        5 NaN NaN NaN NA NA NA
        6 0.199 0.239 0.117 8 8 2
        7 0.0755 0.0626 0.136 3 2 3
        8 0.102 0.162 0.137 4 6 3
        9 0.0459 0.0613 0.227 2 2 8
        10 0.256 0.220 0.216 9 8 8
        11 0.0881 0.136 0.241 3 4 9
        12 0.117 NaN 0.108 5 NA 2
        13 0.128 0 0.0681 7 1 1
        14 0 0 0.195 1 1 5
        15 0.118 0.0934 NaN 6 3 NA
        16 0.115 0.140 0.208 4 5 7
        17 0.0584 0.142 0.200 2 5 6
        18 0.123 0 0.140 6 1 4
        19 0.197 0.177 0.202 7 7 6
        20 0.0370 0.254 0.0927 1 9 1





        share|improve this answer













        First define a function to calculate percentile group as:



        percentile_group <- function(x)
        {
        y <- as.numeric(x) %>% discard(is.na)
        qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
        grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
        #return(qn)
        return(grp)
        }


        Now use the function in a mutate statement as



         mutate_if(zen, is.numeric, funs(pctile = percentile_group))


        The output is:



        # A tibble: 20 x 6
        col1 col2 col3 col1_pctile col2_pctile col3_pctile
        <dbl> <dbl> <dbl> <int> <int> <int>
        1 0 0.122 0.228 1 4 9
        2 0.236 0.180 0.265 8 7 10
        3 0.438 0.306 0.190 9 9 5
        4 0 0 0.564 1 1 10
        5 NaN NaN NaN NA NA NA
        6 0.199 0.239 0.117 8 8 2
        7 0.0755 0.0626 0.136 3 2 3
        8 0.102 0.162 0.137 4 6 3
        9 0.0459 0.0613 0.227 2 2 8
        10 0.256 0.220 0.216 9 8 8
        11 0.0881 0.136 0.241 3 4 9
        12 0.117 NaN 0.108 5 NA 2
        13 0.128 0 0.0681 7 1 1
        14 0 0 0.195 1 1 5
        15 0.118 0.0934 NaN 6 3 NA
        16 0.115 0.140 0.208 4 5 7
        17 0.0584 0.142 0.200 2 5 6
        18 0.123 0 0.140 6 1 4
        19 0.197 0.177 0.202 7 7 6
        20 0.0370 0.254 0.0927 1 9 1






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Jan 2 at 2:19









        discipulusdiscipulus

        1,43121939




        1,43121939






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54000053%2fhandling-nas-when-mapping-several-dataframe-columns-into-their-percentile-values%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Mossoró

            Error while reading .h5 file using the rhdf5 package in R

            Pushsharp Apns notification error: 'InvalidToken'