Bdtjtk

Question

> dput(zed)

    structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742, 

    0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209, 

    0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946, 

    0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367, 

    0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195, 

    0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567, 

    0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022, 

    0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087, 

    0.141872719902716, 0, 0.176720075400566, 0.253924284395199), 

        col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226, 

        0.564015792442188, NaN, 0.116857208286359, 0.136034761917893, 

        0.137370134394451, 0.227357158778513, 0.215714919326088, 

        0.240671647524362, 0.107512520868114, 0.0681162324911809, 

        0.195274360476469, NaN, 0.208033156719459, 0.199848016844409, 

        0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA, 

    -20L), class = c("tbl_df", "tbl", "data.frame"))



> zed

# A tibble: 20 x 3

       col1     col2     col3

      <dbl>    <dbl>    <dbl>

 1   0        0.122    0.228 

 2   0.236    0.180    0.265 

 3   0.438    0.306    0.190 

 4   0        0        0.564 

 5 NaN      NaN      NaN     

 6   0.199    0.239    0.117 

 7   0.0755   0.0626   0.136 

 8   0.102    0.162    0.137 

 9   0.0459   0.0613   0.227 

10   0.256    0.220    0.216 

11   0.0881   0.136    0.241 

12   0.117  NaN        0.108 

13   0.128    0        0.0681

14   0        0        0.195 

15   0.118    0.0934 NaN     

16   0.115    0.140    0.208 

17   0.0584   0.142    0.200 

18   0.123    0        0.140 

19   0.197    0.177    0.202 

20   0.0370   0.254    0.0927

I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).

Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.

> fmsb::percentile(zed$col1)

Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) : 

  missing values and NaN's not allowed if 'na.rm' is FALSE

Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).

I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!

You could probably take advantage of something like is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/… — Jan 2 at 1:26
So what do you want to do with NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is? — Jan 2 at 2:13
@RonakShah preferably they have an NaN percentile value, so ignored from calculation — Jan 3 at 15:42

wwwwww 27.8k112241 · Accepted Answer · 2019-01-02 02:18:24Z

There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.

Here is the code of the percentile function.

library(dplyr)

library(fmsb)



percentile

# function (dat) 

# {

#   pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)

#   pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)

#   pt3 <- rownames(pt2)

#   pt4 <- as.integer(strsplit(pt3, "%"))

#   datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]

#   return(datp)

# }

# <bytecode: 0x0000000016c498b0>

#   <environment: namespace:fmsb>

As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.

In addition, when providing a vector with zero, the function returns error as follows.

percentile(0:5)

# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) : 

#  'breaks' are not unique

My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.

percentile_narm_zero <- function(dat, small = 0.0000000000001){



  # Create a data frame with the numeric values and index

  dat2 <- data.frame(index = 1:length(dat), dat = dat)

  # Remove NA

  dat3 <- dat2[ !is.na(dat2$dat), ]

  # Add a small number to 0

  dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)



  # This part is the same as the percentile function

  pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)

  pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)

  pt3 <- rownames(pt2)

  pt4 <- as.integer(strsplit(pt3, "%"))

  datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]



  # Merge datp back to dat2

  dat3$datp <- datp

  dat4 <- merge(dat2, dat3, by = "index", all = TRUE)



  return(dat4$datp)

}

Now we can apply this function to all columns in zed using mutate_all.

zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))

# A tibble: 20 x 6

#       col1     col2     col3 col1_pctile col2_pctile col3_pctile

#      <dbl>    <dbl>    <dbl>       <int>       <int>       <int>

#  1   0        0.122    0.228           11          42          83

#  2   0.236    0.180    0.265           89          77          95

#  3   0.438    0.306    0.190          100         100          42

#  4   0        0        0.564           11          17         100

#  5 NaN      NaN      NaN               NA          NA          NA

#  6   0.199    0.239    0.117           84          89          18

#  7   0.0755   0.0626   0.136           34          30          24

#  8   0.102    0.162    0.137           45          65          30

#  9   0.0459   0.0613   0.227           23          24          77

# 10   0.256    0.220    0.216           95          83          71

# 11   0.0881   0.136    0.241           39          48          89

# 12   0.117  NaN        0.108           56          NA          12

# 13   0.128    0        0.0681          73          17           0

# 14   0        0        0.195           11          17          48

# 15   0.118    0.0934 NaN               62          36          NA

# 16   0.115    0.140    0.208           50          53          65

# 17   0.0584   0.142    0.200           28          59          53

# 18   0.123    0        0.140           67          17          36

# 19   0.197    0.177    0.202           78          71          59

# 20   0.0370   0.254    0.0927          17          95           6

wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks. — Jan 3 at 15:44

discipulusdiscipulus 1,43121939 · Accepted Answer · 2019-01-02 02:19:52Z

First define a function to calculate percentile group as:

percentile_group <- function(x)

{

  y <- as.numeric(x) %>% discard(is.na)

  qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()

  grp <- cut(x, breaks=qn, include.lowest=T, labels=F)

  #return(qn)

  return(grp)

}

Now use the function in a mutate statement as

 mutate_if(zen, is.numeric, funs(pctile = percentile_group))

The output is:

# A tibble: 20 x 6

col1     col2     col3 col1_pctile col2_pctile col3_pctile

<dbl>    <dbl>    <dbl>       <int>       <int>       <int>

  1   0        0.122    0.228            1           4           9

2   0.236    0.180    0.265            8           7          10

3   0.438    0.306    0.190            9           9           5

4   0        0        0.564            1           1          10

5 NaN      NaN      NaN               NA          NA          NA

6   0.199    0.239    0.117            8           8           2

7   0.0755   0.0626   0.136            3           2           3

8   0.102    0.162    0.137            4           6           3

9   0.0459   0.0613   0.227            2           2           8

10   0.256    0.220    0.216            9           8           8

11   0.0881   0.136    0.241            3           4           9

12   0.117  NaN        0.108            5          NA           2

13   0.128    0        0.0681           7           1           1

14   0        0        0.195            1           1           5

15   0.118    0.0934 NaN                6           3          NA

16   0.115    0.140    0.208            4           5           7

17   0.0584   0.142    0.200            2           5           6

18   0.123    0        0.140            6           1           4

19   0.197    0.177    0.202            7           7           6

20   0.0370   0.254    0.0927           1           9           1

wwwwww 27.8k112241 · Accepted Answer · 2019-01-02 02:18:24Z

There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.

Here is the code of the percentile function.

library(dplyr)

library(fmsb)



percentile

# function (dat) 

# {

#   pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)

#   pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)

#   pt3 <- rownames(pt2)

#   pt4 <- as.integer(strsplit(pt3, "%"))

#   datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]

#   return(datp)

# }

# <bytecode: 0x0000000016c498b0>

#   <environment: namespace:fmsb>

As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.

In addition, when providing a vector with zero, the function returns error as follows.

percentile(0:5)

# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) : 

#  'breaks' are not unique

My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.

percentile_narm_zero <- function(dat, small = 0.0000000000001){



  # Create a data frame with the numeric values and index

  dat2 <- data.frame(index = 1:length(dat), dat = dat)

  # Remove NA

  dat3 <- dat2[ !is.na(dat2$dat), ]

  # Add a small number to 0

  dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)



  # This part is the same as the percentile function

  pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)

  pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)

  pt3 <- rownames(pt2)

  pt4 <- as.integer(strsplit(pt3, "%"))

  datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]



  # Merge datp back to dat2

  dat3$datp <- datp

  dat4 <- merge(dat2, dat3, by = "index", all = TRUE)



  return(dat4$datp)

}

Now we can apply this function to all columns in zed using mutate_all.

zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))

# A tibble: 20 x 6

#       col1     col2     col3 col1_pctile col2_pctile col3_pctile

#      <dbl>    <dbl>    <dbl>       <int>       <int>       <int>

#  1   0        0.122    0.228           11          42          83

#  2   0.236    0.180    0.265           89          77          95

#  3   0.438    0.306    0.190          100         100          42

#  4   0        0        0.564           11          17         100

#  5 NaN      NaN      NaN               NA          NA          NA

#  6   0.199    0.239    0.117           84          89          18

#  7   0.0755   0.0626   0.136           34          30          24

#  8   0.102    0.162    0.137           45          65          30

#  9   0.0459   0.0613   0.227           23          24          77

# 10   0.256    0.220    0.216           95          83          71

# 11   0.0881   0.136    0.241           39          48          89

# 12   0.117  NaN        0.108           56          NA          12

# 13   0.128    0        0.0681          73          17           0

# 14   0        0        0.195           11          17          48

# 15   0.118    0.0934 NaN               62          36          NA

# 16   0.115    0.140    0.208           50          53          65

# 17   0.0584   0.142    0.200           28          59          53

# 18   0.123    0        0.140           67          17          36

# 19   0.197    0.177    0.202           78          71          59

# 20   0.0370   0.254    0.0927          17          95           6

wonderful. i did realize that percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks. — Jan 3 at 15:44

discipulusdiscipulus 1,43121939 · Accepted Answer · 2019-01-02 02:19:52Z

First define a function to calculate percentile group as:

percentile_group <- function(x)

{

  y <- as.numeric(x) %>% discard(is.na)

  qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()

  grp <- cut(x, breaks=qn, include.lowest=T, labels=F)

  #return(qn)

  return(grp)

}

Now use the function in a mutate statement as

 mutate_if(zen, is.numeric, funs(pctile = percentile_group))

The output is:

# A tibble: 20 x 6

col1     col2     col3 col1_pctile col2_pctile col3_pctile

<dbl>    <dbl>    <dbl>       <int>       <int>       <int>

  1   0        0.122    0.228            1           4           9

2   0.236    0.180    0.265            8           7          10

3   0.438    0.306    0.190            9           9           5

4   0        0        0.564            1           1          10

5 NaN      NaN      NaN               NA          NA          NA

6   0.199    0.239    0.117            8           8           2

7   0.0755   0.0626   0.136            3           2           3

8   0.102    0.162    0.137            4           6           3

9   0.0459   0.0613   0.227            2           2           8

10   0.256    0.220    0.216            9           8           8

11   0.0881   0.136    0.241            3           4           9

12   0.117  NaN        0.108            5          NA           2

13   0.128    0        0.0681           7           1           1

14   0        0        0.195            1           1           5

15   0.118    0.0934 NaN                6           3          NA

16   0.115    0.140    0.208            4           5           7

17   0.0584   0.142    0.200            2           5           6

18   0.123    0        0.140            6           1           4

19   0.197    0.177    0.202            7           7           6

20   0.0370   0.254    0.0927           1           9           1

搜尋此網誌

Bdtjtk

Handling NAs when mapping several dataframe columns into their percentile values in R

2 Answers
2

Your Answer

Post as a guest

2 Answers
2

2 Answers
2

Post as a guest

Popular posts from this blog

Mossoró

Can't read property showImagePicker of undefined in react native iOS

Pushsharp Apns notification error: 'InvalidToken'

Handling NAs when mapping several dataframe columns into their percentile values in R

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

2 Answers 2

2 Answers 2

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Mossoró

Can't read property showImagePicker of undefined in react native iOS

Pushsharp Apns notification error: 'InvalidToken'

2 Answers
2

2 Answers
2

2 Answers
2