Handling NAs when mapping several dataframe columns into their percentile values in R
> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927
I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).
Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.
> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).
I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!
r dplyr data-manipulation
add a comment |
> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927
I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).
Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.
> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).
I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!
r dplyr data-manipulation
You could probably take advantage of something likeis.finite()to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…
– Chase
Jan 2 at 1:26
1
So what do you want to do withNaNvalues ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?
– Ronak Shah
Jan 2 at 2:13
@RonakShah preferably they have an NaN percentile value, so ignored from calculation
– Canovice
Jan 3 at 15:42
add a comment |
> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927
I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).
Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.
> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).
I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!
r dplyr data-manipulation
> dput(zed)
structure(list(col1 = c(0, 0.236258076229343, 0.43840483531742,
0, NaN, 0.198838380845137, 0.0754815882584196, 0.10176020461209,
0.045933014354067, 0.256237616143739, 0.0880658828009711, 0.117285153415946,
0.127902400629673, 0, 0.117682083253069, 0.114542851298834, 0.0584035686594367,
0.123456790123457, 0.196817420435511, 0.0369541251378046), col2 = c(0.121951219512195,
0.17979731938542, 0.305944055944056, 0, NaN, 0.239463601532567,
0.0625521267723103, 0.161729656111679, 0.0612745098039216, 0.22002200220022,
0.135608048993876, NaN, 0, 0, 0.0934420659191301, 0.140091696383087,
0.141872719902716, 0, 0.176720075400566, 0.253924284395199),
col3 = c(0.227540305157712, 0.264931804641559, 0.190018713264226,
0.564015792442188, NaN, 0.116857208286359, 0.136034761917893,
0.137370134394451, 0.227357158778513, 0.215714919326088,
0.240671647524362, 0.107512520868114, 0.0681162324911809,
0.195274360476469, NaN, 0.208033156719459, 0.199848016844409,
0.140383517621937, 0.202430694674985, 0.0927417625979096)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
> zed
# A tibble: 20 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 0 0.122 0.228
2 0.236 0.180 0.265
3 0.438 0.306 0.190
4 0 0 0.564
5 NaN NaN NaN
6 0.199 0.239 0.117
7 0.0755 0.0626 0.136
8 0.102 0.162 0.137
9 0.0459 0.0613 0.227
10 0.256 0.220 0.216
11 0.0881 0.136 0.241
12 0.117 NaN 0.108
13 0.128 0 0.0681
14 0 0 0.195
15 0.118 0.0934 NaN
16 0.115 0.140 0.208
17 0.0584 0.142 0.200
18 0.123 0 0.140
19 0.197 0.177 0.202
20 0.0370 0.254 0.0927
I have the following dataframe, which has multiple columns (col1, col2, col3) for which I need to convert into percentiles (rounded to the nearest integer, so one of 1:100). My preference - and what I assume is easiest - is to add 3 additional columns col1pctile, col2pctile, col3pctile that maps each respective column to their percentile value (within that column).
Using the fmsb::percentile() function on a single column returns an error due to the presence of NAs.
> fmsb::percentile(zed$col1)
Error in quantile.default(dat, probs = seq(0, 1, by = 0.01), type = 7) :
missing values and NaN's not allowed if 'na.rm' is FALSE
Although the example dataframe above only has 20 rows, my actual dataframe is many more rows than just 20, and having percentile values actually makes sense for my use-case (whereas percentiles wouldn't make sense for only 20 rows).
I will edit this post shortly with my current attempts, which aren't working as I'd hope. Any help with this would be greatly appreciated!
r dplyr data-manipulation
r dplyr data-manipulation
asked Jan 2 at 0:35
CanoviceCanovice
1,5071839
1,5071839
You could probably take advantage of something likeis.finite()to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…
– Chase
Jan 2 at 1:26
1
So what do you want to do withNaNvalues ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?
– Ronak Shah
Jan 2 at 2:13
@RonakShah preferably they have an NaN percentile value, so ignored from calculation
– Canovice
Jan 3 at 15:42
add a comment |
You could probably take advantage of something likeis.finite()to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…
– Chase
Jan 2 at 1:26
1
So what do you want to do withNaNvalues ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?
– Ronak Shah
Jan 2 at 2:13
@RonakShah preferably they have an NaN percentile value, so ignored from calculation
– Canovice
Jan 3 at 15:42
You could probably take advantage of something like
is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…– Chase
Jan 2 at 1:26
You could probably take advantage of something like
is.finite() to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…– Chase
Jan 2 at 1:26
1
1
So what do you want to do with
NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?– Ronak Shah
Jan 2 at 2:13
So what do you want to do with
NaN values ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?– Ronak Shah
Jan 2 at 2:13
@RonakShah preferably they have an NaN percentile value, so ignored from calculation
– Canovice
Jan 3 at 15:42
@RonakShah preferably they have an NaN percentile value, so ignored from calculation
– Canovice
Jan 3 at 15:42
add a comment |
2 Answers
2
active
oldest
votes
There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.
Here is the code of the percentile function.
library(dplyr)
library(fmsb)
percentile
# function (dat)
# {
# pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
# pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
# pt3 <- rownames(pt2)
# pt4 <- as.integer(strsplit(pt3, "%"))
# datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
# return(datp)
# }
# <bytecode: 0x0000000016c498b0>
# <environment: namespace:fmsb>
As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.
In addition, when providing a vector with zero, the function returns error as follows.
percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
# 'breaks' are not unique
My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.
percentile_narm_zero <- function(dat, small = 0.0000000000001){
# Create a data frame with the numeric values and index
dat2 <- data.frame(index = 1:length(dat), dat = dat)
# Remove NA
dat3 <- dat2[ !is.na(dat2$dat), ]
# Add a small number to 0
dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)
# This part is the same as the percentile function
pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
pt3 <- rownames(pt2)
pt4 <- as.integer(strsplit(pt3, "%"))
datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]
# Merge datp back to dat2
dat3$datp <- datp
dat4 <- merge(dat2, dat3, by = "index", all = TRUE)
return(dat4$datp)
}
Now we can apply this function to all columns in zed using mutate_all.
zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
# A tibble: 20 x 6
# col1 col2 col3 col1_pctile col2_pctile col3_pctile
# <dbl> <dbl> <dbl> <int> <int> <int>
# 1 0 0.122 0.228 11 42 83
# 2 0.236 0.180 0.265 89 77 95
# 3 0.438 0.306 0.190 100 100 42
# 4 0 0 0.564 11 17 100
# 5 NaN NaN NaN NA NA NA
# 6 0.199 0.239 0.117 84 89 18
# 7 0.0755 0.0626 0.136 34 30 24
# 8 0.102 0.162 0.137 45 65 30
# 9 0.0459 0.0613 0.227 23 24 77
# 10 0.256 0.220 0.216 95 83 71
# 11 0.0881 0.136 0.241 39 48 89
# 12 0.117 NaN 0.108 56 NA 12
# 13 0.128 0 0.0681 73 17 0
# 14 0 0 0.195 11 17 48
# 15 0.118 0.0934 NaN 62 36 NA
# 16 0.115 0.140 0.208 50 53 65
# 17 0.0584 0.142 0.200 28 59 53
# 18 0.123 0 0.140 67 17 36
# 19 0.197 0.177 0.202 78 71 59
# 20 0.0370 0.254 0.0927 17 95 6
1
wonderful. i did realize thatpercentile()did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.
– Canovice
Jan 3 at 15:44
add a comment |
First define a function to calculate percentile group as:
percentile_group <- function(x)
{
y <- as.numeric(x) %>% discard(is.na)
qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
#return(qn)
return(grp)
}
Now use the function in a mutate statement as
mutate_if(zen, is.numeric, funs(pctile = percentile_group))
The output is:
# A tibble: 20 x 6
col1 col2 col3 col1_pctile col2_pctile col3_pctile
<dbl> <dbl> <dbl> <int> <int> <int>
1 0 0.122 0.228 1 4 9
2 0.236 0.180 0.265 8 7 10
3 0.438 0.306 0.190 9 9 5
4 0 0 0.564 1 1 10
5 NaN NaN NaN NA NA NA
6 0.199 0.239 0.117 8 8 2
7 0.0755 0.0626 0.136 3 2 3
8 0.102 0.162 0.137 4 6 3
9 0.0459 0.0613 0.227 2 2 8
10 0.256 0.220 0.216 9 8 8
11 0.0881 0.136 0.241 3 4 9
12 0.117 NaN 0.108 5 NA 2
13 0.128 0 0.0681 7 1 1
14 0 0 0.195 1 1 5
15 0.118 0.0934 NaN 6 3 NA
16 0.115 0.140 0.208 4 5 7
17 0.0584 0.142 0.200 2 5 6
18 0.123 0 0.140 6 1 4
19 0.197 0.177 0.202 7 7 6
20 0.0370 0.254 0.0927 1 9 1
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54000053%2fhandling-nas-when-mapping-several-dataframe-columns-into-their-percentile-values%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.
Here is the code of the percentile function.
library(dplyr)
library(fmsb)
percentile
# function (dat)
# {
# pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
# pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
# pt3 <- rownames(pt2)
# pt4 <- as.integer(strsplit(pt3, "%"))
# datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
# return(datp)
# }
# <bytecode: 0x0000000016c498b0>
# <environment: namespace:fmsb>
As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.
In addition, when providing a vector with zero, the function returns error as follows.
percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
# 'breaks' are not unique
My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.
percentile_narm_zero <- function(dat, small = 0.0000000000001){
# Create a data frame with the numeric values and index
dat2 <- data.frame(index = 1:length(dat), dat = dat)
# Remove NA
dat3 <- dat2[ !is.na(dat2$dat), ]
# Add a small number to 0
dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)
# This part is the same as the percentile function
pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
pt3 <- rownames(pt2)
pt4 <- as.integer(strsplit(pt3, "%"))
datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]
# Merge datp back to dat2
dat3$datp <- datp
dat4 <- merge(dat2, dat3, by = "index", all = TRUE)
return(dat4$datp)
}
Now we can apply this function to all columns in zed using mutate_all.
zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
# A tibble: 20 x 6
# col1 col2 col3 col1_pctile col2_pctile col3_pctile
# <dbl> <dbl> <dbl> <int> <int> <int>
# 1 0 0.122 0.228 11 42 83
# 2 0.236 0.180 0.265 89 77 95
# 3 0.438 0.306 0.190 100 100 42
# 4 0 0 0.564 11 17 100
# 5 NaN NaN NaN NA NA NA
# 6 0.199 0.239 0.117 84 89 18
# 7 0.0755 0.0626 0.136 34 30 24
# 8 0.102 0.162 0.137 45 65 30
# 9 0.0459 0.0613 0.227 23 24 77
# 10 0.256 0.220 0.216 95 83 71
# 11 0.0881 0.136 0.241 39 48 89
# 12 0.117 NaN 0.108 56 NA 12
# 13 0.128 0 0.0681 73 17 0
# 14 0 0 0.195 11 17 48
# 15 0.118 0.0934 NaN 62 36 NA
# 16 0.115 0.140 0.208 50 53 65
# 17 0.0584 0.142 0.200 28 59 53
# 18 0.123 0 0.140 67 17 36
# 19 0.197 0.177 0.202 78 71 59
# 20 0.0370 0.254 0.0927 17 95 6
1
wonderful. i did realize thatpercentile()did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.
– Canovice
Jan 3 at 15:44
add a comment |
There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.
Here is the code of the percentile function.
library(dplyr)
library(fmsb)
percentile
# function (dat)
# {
# pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
# pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
# pt3 <- rownames(pt2)
# pt4 <- as.integer(strsplit(pt3, "%"))
# datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
# return(datp)
# }
# <bytecode: 0x0000000016c498b0>
# <environment: namespace:fmsb>
As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.
In addition, when providing a vector with zero, the function returns error as follows.
percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
# 'breaks' are not unique
My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.
percentile_narm_zero <- function(dat, small = 0.0000000000001){
# Create a data frame with the numeric values and index
dat2 <- data.frame(index = 1:length(dat), dat = dat)
# Remove NA
dat3 <- dat2[ !is.na(dat2$dat), ]
# Add a small number to 0
dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)
# This part is the same as the percentile function
pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
pt3 <- rownames(pt2)
pt4 <- as.integer(strsplit(pt3, "%"))
datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]
# Merge datp back to dat2
dat3$datp <- datp
dat4 <- merge(dat2, dat3, by = "index", all = TRUE)
return(dat4$datp)
}
Now we can apply this function to all columns in zed using mutate_all.
zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
# A tibble: 20 x 6
# col1 col2 col3 col1_pctile col2_pctile col3_pctile
# <dbl> <dbl> <dbl> <int> <int> <int>
# 1 0 0.122 0.228 11 42 83
# 2 0.236 0.180 0.265 89 77 95
# 3 0.438 0.306 0.190 100 100 42
# 4 0 0 0.564 11 17 100
# 5 NaN NaN NaN NA NA NA
# 6 0.199 0.239 0.117 84 89 18
# 7 0.0755 0.0626 0.136 34 30 24
# 8 0.102 0.162 0.137 45 65 30
# 9 0.0459 0.0613 0.227 23 24 77
# 10 0.256 0.220 0.216 95 83 71
# 11 0.0881 0.136 0.241 39 48 89
# 12 0.117 NaN 0.108 56 NA 12
# 13 0.128 0 0.0681 73 17 0
# 14 0 0 0.195 11 17 48
# 15 0.118 0.0934 NaN 62 36 NA
# 16 0.115 0.140 0.208 50 53 65
# 17 0.0584 0.142 0.200 28 59 53
# 18 0.123 0 0.140 67 17 36
# 19 0.197 0.177 0.202 78 71 59
# 20 0.0370 0.254 0.0927 17 95 6
1
wonderful. i did realize thatpercentile()did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.
– Canovice
Jan 3 at 15:44
add a comment |
There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.
Here is the code of the percentile function.
library(dplyr)
library(fmsb)
percentile
# function (dat)
# {
# pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
# pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
# pt3 <- rownames(pt2)
# pt4 <- as.integer(strsplit(pt3, "%"))
# datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
# return(datp)
# }
# <bytecode: 0x0000000016c498b0>
# <environment: namespace:fmsb>
As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.
In addition, when providing a vector with zero, the function returns error as follows.
percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
# 'breaks' are not unique
My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.
percentile_narm_zero <- function(dat, small = 0.0000000000001){
# Create a data frame with the numeric values and index
dat2 <- data.frame(index = 1:length(dat), dat = dat)
# Remove NA
dat3 <- dat2[ !is.na(dat2$dat), ]
# Add a small number to 0
dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)
# This part is the same as the percentile function
pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
pt3 <- rownames(pt2)
pt4 <- as.integer(strsplit(pt3, "%"))
datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]
# Merge datp back to dat2
dat3$datp <- datp
dat4 <- merge(dat2, dat3, by = "index", all = TRUE)
return(dat4$datp)
}
Now we can apply this function to all columns in zed using mutate_all.
zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
# A tibble: 20 x 6
# col1 col2 col3 col1_pctile col2_pctile col3_pctile
# <dbl> <dbl> <dbl> <int> <int> <int>
# 1 0 0.122 0.228 11 42 83
# 2 0.236 0.180 0.265 89 77 95
# 3 0.438 0.306 0.190 100 100 42
# 4 0 0 0.564 11 17 100
# 5 NaN NaN NaN NA NA NA
# 6 0.199 0.239 0.117 84 89 18
# 7 0.0755 0.0626 0.136 34 30 24
# 8 0.102 0.162 0.137 45 65 30
# 9 0.0459 0.0613 0.227 23 24 77
# 10 0.256 0.220 0.216 95 83 71
# 11 0.0881 0.136 0.241 39 48 89
# 12 0.117 NaN 0.108 56 NA 12
# 13 0.128 0 0.0681 73 17 0
# 14 0 0 0.195 11 17 48
# 15 0.118 0.0934 NaN 62 36 NA
# 16 0.115 0.140 0.208 50 53 65
# 17 0.0584 0.142 0.200 28 59 53
# 18 0.123 0 0.140 67 17 36
# 19 0.197 0.177 0.202 78 71 59
# 20 0.0370 0.254 0.0927 17 95 6
There are two challenges when using the percentile function from the fmsb. First, it cannot handle missing values. Second, it cannot handle zero.
Here is the code of the percentile function.
library(dplyr)
library(fmsb)
percentile
# function (dat)
# {
# pt1 <- quantile(dat, probs = seq(0, 1, by = 0.01), type = 7)
# pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
# pt3 <- rownames(pt2)
# pt4 <- as.integer(strsplit(pt3, "%"))
# datp <- pt4[as.integer(cut(dat, c(0, pt2$pt1), labels = 1:length(pt3)))]
# return(datp)
# }
# <bytecode: 0x0000000016c498b0>
# <environment: namespace:fmsb>
As you can see, there are no ways to specify the na.rm argument to the quantile function. However, simply set na.rm = TRUE to quantile function will not work because we would like the function to return NA when the input numbers are NA.
In addition, when providing a vector with zero, the function returns error as follows.
percentile(0:5)
# Error in cut.default(dat, c(0, pt2$pt1), labels = 1:length(pt3)) :
# 'breaks' are not unique
My suggestion is to re-write the function to be able to return NA for NA input values, and add a small numbers for zero. Here is my modification for the function. I called it percentile_narm_zero.
percentile_narm_zero <- function(dat, small = 0.0000000000001){
# Create a data frame with the numeric values and index
dat2 <- data.frame(index = 1:length(dat), dat = dat)
# Remove NA
dat3 <- dat2[ !is.na(dat2$dat), ]
# Add a small number to 0
dat3$dat <- ifelse(dat3$dat == 0, dat3$dat + small, dat3$dat)
# This part is the same as the percentile function
pt1 <- quantile(dat3$dat, probs = seq(0, 1, by = 0.01), type = 7)
pt2 <- unique(as.data.frame(pt1), fromLast = TRUE)
pt3 <- rownames(pt2)
pt4 <- as.integer(strsplit(pt3, "%"))
datp <- pt4[as.integer(cut(dat3$dat, c(0, pt2$pt1)), labels = 1:length(pt3))]
# Merge datp back to dat2
dat3$datp <- datp
dat4 <- merge(dat2, dat3, by = "index", all = TRUE)
return(dat4$datp)
}
Now we can apply this function to all columns in zed using mutate_all.
zed2 <- zed %>% mutate_all(funs(pctile = percentile_narm_zero(.)))
# A tibble: 20 x 6
# col1 col2 col3 col1_pctile col2_pctile col3_pctile
# <dbl> <dbl> <dbl> <int> <int> <int>
# 1 0 0.122 0.228 11 42 83
# 2 0.236 0.180 0.265 89 77 95
# 3 0.438 0.306 0.190 100 100 42
# 4 0 0 0.564 11 17 100
# 5 NaN NaN NaN NA NA NA
# 6 0.199 0.239 0.117 84 89 18
# 7 0.0755 0.0626 0.136 34 30 24
# 8 0.102 0.162 0.137 45 65 30
# 9 0.0459 0.0613 0.227 23 24 77
# 10 0.256 0.220 0.216 95 83 71
# 11 0.0881 0.136 0.241 39 48 89
# 12 0.117 NaN 0.108 56 NA 12
# 13 0.128 0 0.0681 73 17 0
# 14 0 0 0.195 11 17 48
# 15 0.118 0.0934 NaN 62 36 NA
# 16 0.115 0.140 0.208 50 53 65
# 17 0.0584 0.142 0.200 28 59 53
# 18 0.123 0 0.140 67 17 36
# 19 0.197 0.177 0.202 78 71 59
# 20 0.0370 0.254 0.0927 17 95 6
answered Jan 2 at 2:18
wwwwww
27.8k112241
27.8k112241
1
wonderful. i did realize thatpercentile()did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.
– Canovice
Jan 3 at 15:44
add a comment |
1
wonderful. i did realize thatpercentile()did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.
– Canovice
Jan 3 at 15:44
1
1
wonderful. i did realize that
percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.– Canovice
Jan 3 at 15:44
wonderful. i did realize that
percentile() did not handle zeros and so i was already (in another example) adding 0.1 to all values in the column. This is a great answer, thanks.– Canovice
Jan 3 at 15:44
add a comment |
First define a function to calculate percentile group as:
percentile_group <- function(x)
{
y <- as.numeric(x) %>% discard(is.na)
qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
#return(qn)
return(grp)
}
Now use the function in a mutate statement as
mutate_if(zen, is.numeric, funs(pctile = percentile_group))
The output is:
# A tibble: 20 x 6
col1 col2 col3 col1_pctile col2_pctile col3_pctile
<dbl> <dbl> <dbl> <int> <int> <int>
1 0 0.122 0.228 1 4 9
2 0.236 0.180 0.265 8 7 10
3 0.438 0.306 0.190 9 9 5
4 0 0 0.564 1 1 10
5 NaN NaN NaN NA NA NA
6 0.199 0.239 0.117 8 8 2
7 0.0755 0.0626 0.136 3 2 3
8 0.102 0.162 0.137 4 6 3
9 0.0459 0.0613 0.227 2 2 8
10 0.256 0.220 0.216 9 8 8
11 0.0881 0.136 0.241 3 4 9
12 0.117 NaN 0.108 5 NA 2
13 0.128 0 0.0681 7 1 1
14 0 0 0.195 1 1 5
15 0.118 0.0934 NaN 6 3 NA
16 0.115 0.140 0.208 4 5 7
17 0.0584 0.142 0.200 2 5 6
18 0.123 0 0.140 6 1 4
19 0.197 0.177 0.202 7 7 6
20 0.0370 0.254 0.0927 1 9 1
add a comment |
First define a function to calculate percentile group as:
percentile_group <- function(x)
{
y <- as.numeric(x) %>% discard(is.na)
qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
#return(qn)
return(grp)
}
Now use the function in a mutate statement as
mutate_if(zen, is.numeric, funs(pctile = percentile_group))
The output is:
# A tibble: 20 x 6
col1 col2 col3 col1_pctile col2_pctile col3_pctile
<dbl> <dbl> <dbl> <int> <int> <int>
1 0 0.122 0.228 1 4 9
2 0.236 0.180 0.265 8 7 10
3 0.438 0.306 0.190 9 9 5
4 0 0 0.564 1 1 10
5 NaN NaN NaN NA NA NA
6 0.199 0.239 0.117 8 8 2
7 0.0755 0.0626 0.136 3 2 3
8 0.102 0.162 0.137 4 6 3
9 0.0459 0.0613 0.227 2 2 8
10 0.256 0.220 0.216 9 8 8
11 0.0881 0.136 0.241 3 4 9
12 0.117 NaN 0.108 5 NA 2
13 0.128 0 0.0681 7 1 1
14 0 0 0.195 1 1 5
15 0.118 0.0934 NaN 6 3 NA
16 0.115 0.140 0.208 4 5 7
17 0.0584 0.142 0.200 2 5 6
18 0.123 0 0.140 6 1 4
19 0.197 0.177 0.202 7 7 6
20 0.0370 0.254 0.0927 1 9 1
add a comment |
First define a function to calculate percentile group as:
percentile_group <- function(x)
{
y <- as.numeric(x) %>% discard(is.na)
qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
#return(qn)
return(grp)
}
Now use the function in a mutate statement as
mutate_if(zen, is.numeric, funs(pctile = percentile_group))
The output is:
# A tibble: 20 x 6
col1 col2 col3 col1_pctile col2_pctile col3_pctile
<dbl> <dbl> <dbl> <int> <int> <int>
1 0 0.122 0.228 1 4 9
2 0.236 0.180 0.265 8 7 10
3 0.438 0.306 0.190 9 9 5
4 0 0 0.564 1 1 10
5 NaN NaN NaN NA NA NA
6 0.199 0.239 0.117 8 8 2
7 0.0755 0.0626 0.136 3 2 3
8 0.102 0.162 0.137 4 6 3
9 0.0459 0.0613 0.227 2 2 8
10 0.256 0.220 0.216 9 8 8
11 0.0881 0.136 0.241 3 4 9
12 0.117 NaN 0.108 5 NA 2
13 0.128 0 0.0681 7 1 1
14 0 0 0.195 1 1 5
15 0.118 0.0934 NaN 6 3 NA
16 0.115 0.140 0.208 4 5 7
17 0.0584 0.142 0.200 2 5 6
18 0.123 0 0.140 6 1 4
19 0.197 0.177 0.202 7 7 6
20 0.0370 0.254 0.0927 1 9 1
First define a function to calculate percentile group as:
percentile_group <- function(x)
{
y <- as.numeric(x) %>% discard(is.na)
qn <- quantile(y, probs = seq(0, 1, by= 0.1), na.rm = TRUE) %>% unique()
grp <- cut(x, breaks=qn, include.lowest=T, labels=F)
#return(qn)
return(grp)
}
Now use the function in a mutate statement as
mutate_if(zen, is.numeric, funs(pctile = percentile_group))
The output is:
# A tibble: 20 x 6
col1 col2 col3 col1_pctile col2_pctile col3_pctile
<dbl> <dbl> <dbl> <int> <int> <int>
1 0 0.122 0.228 1 4 9
2 0.236 0.180 0.265 8 7 10
3 0.438 0.306 0.190 9 9 5
4 0 0 0.564 1 1 10
5 NaN NaN NaN NA NA NA
6 0.199 0.239 0.117 8 8 2
7 0.0755 0.0626 0.136 3 2 3
8 0.102 0.162 0.137 4 6 3
9 0.0459 0.0613 0.227 2 2 8
10 0.256 0.220 0.216 9 8 8
11 0.0881 0.136 0.241 3 4 9
12 0.117 NaN 0.108 5 NA 2
13 0.128 0 0.0681 7 1 1
14 0 0 0.195 1 1 5
15 0.118 0.0934 NaN 6 3 NA
16 0.115 0.140 0.208 4 5 7
17 0.0584 0.142 0.200 2 5 6
18 0.123 0 0.140 6 1 4
19 0.197 0.177 0.202 7 7 6
20 0.0370 0.254 0.0927 1 9 1
answered Jan 2 at 2:19
discipulusdiscipulus
1,43121939
1,43121939
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f54000053%2fhandling-nas-when-mapping-several-dataframe-columns-into-their-percentile-values%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You could probably take advantage of something like
is.finite()to subset only the finite values in each column, OR follow some of the advice here: stackoverflow.com/questions/21219447/…– Chase
Jan 2 at 1:26
1
So what do you want to do with
NaNvalues ? Ignore them from calculation? and return 0 for their corresponding percentile value or keep them as it is?– Ronak Shah
Jan 2 at 2:13
@RonakShah preferably they have an NaN percentile value, so ignored from calculation
– Canovice
Jan 3 at 15:42