Preprocessing: text analysis on many columns from a dataframe
Using the following lines, I can preprocess the text in a specific column of my dataframe:
library(stringr)  # provides str_trim()
# text to lower case
df$name <- tolower(df$name)
# replace all special characters with a space
df$name <- gsub("[[:punct:]]", " ", df$name)
# trim and collapse long runs of spaces
df$name <- gsub("\\s+", " ", str_trim(df$name))
I would like to apply these preprocessing rules to all columns (except id) of a dataframe like this:
df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"), E = c("text","stg","1.2"), F = c("press","remove","22"))
r function dataframe
asked Dec 30 '17 at 9:21 by PitterJe
edited Dec 31 '17 at 15:34 by Florian
you should supply a data sample if you wish to receive answers
– Seymour
Dec 30 '17 at 9:40
@Seymour as you can see I provide sample data.
– PitterJe
Dec 30 '17 at 10:50
2 Answers
If you want to do something multiple times, it is often useful to define a function.
For example, you could do the following:
library(stringr)
df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"),
                 E = c("text","stg","1.2"), F = c("press","remove","22"))
# create a function so we can apply this multiple times easily.
process <- function(my_vector)
{
  # text to lower case
  my_vector <- tolower(my_vector)
  # replace all special characters with a space
  my_vector <- gsub("[[:punct:]]", " ", my_vector)
  # trim and collapse long runs of spaces
  my_vector <- gsub("\\s+", " ", str_trim(my_vector))
  # return result
  return(my_vector)
}
# for all columns except 'id', apply our function.
for(x in setdiff(colnames(df), "id"))
{
  df[[x]] <- process(df[[x]])
}
answered Dec 30 '17 at 9:42 by Florian
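The column-wise step in the answer above can also be written without an explicit loop. The following is a minimal base R sketch, not part of the original answer: it reuses the same three cleaning steps, swaps str_trim() for base trimws() so that no package is needed, and sets stringsAsFactors = FALSE so the columns are plain character vectors. The names clean_text and text_cols are made up for illustration.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps text as character
df <- data.frame(id = c("A", "B", "C"),
                 D = c("mytext 11", "mytext +", "!!"),
                 E = c("text", "stg", "1.2"),
                 F = c("press", "remove", "22"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: lower-case, replace punctuation with spaces,
# then trim the ends and collapse runs of whitespace (base R only)
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  gsub("\\s+", " ", trimws(x))
}

# Every column except 'id'
text_cols <- setdiff(names(df), "id")

# lapply() returns a list of cleaned columns, assigned back in one step
df[text_cols] <- lapply(df[text_cols], clean_text)
```

Assigning the list returned by lapply() to df[text_cols] updates all selected columns at once and leaves id untouched.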
Why anybody would prefer the base R syntax above (for(x in setdiff ...) to mutate_at() is beyond me, especially as the OP is already in the Hadley-verse and using stringr...
– Stuart Allen
Dec 31 '17 at 5:22
You can use dplyr::mutate_at() to mutate multiple columns; in this case, all columns except id:
library(dplyr)
library(stringr)
df %>%
  mutate_at(.vars = vars(-id),
            .funs = processText)
where processText is a function containing your desired code:
processText <- function(str) {
  str %>%
    str_to_lower() %>%
    # [\\s+] matches a single whitespace character or a literal "+"
    str_replace_all(pattern = "[[[:punct:]]]|[\\s+]", replacement = " ") %>%
    str_trim()
}
The output is as follows:
  id         D    E      F
1  A mytext 11 text  press
2  B    mytext  stg remove
3  C            1 2     22
answered Dec 30 '17 at 9:47 by Stuart Allen
edited Dec 30 '17 at 11:50
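In dplyr 1.0.0 and later, mutate_at() is superseded by across(). The following is a minimal sketch of the same idea in that style, assuming recent versions of dplyr and stringr are installed. It is an alternative, not the answer author's code: clean_text is a hypothetical stand-in for processText, and df_clean is an illustrative name. Because stringr uses ICU regular expressions, [[:punct:]] there may not cover symbols such as +, so the sketch lists + explicitly and also collapses repeated whitespace, as the question asks.

```r
library(dplyr)
library(stringr)

# Sample data from the question, kept as character columns
df <- data.frame(id = c("A", "B", "C"),
                 D = c("mytext 11", "mytext +", "!!"),
                 E = c("text", "stg", "1.2"),
                 F = c("press", "remove", "22"),
                 stringsAsFactors = FALSE)

# Hypothetical helper mirroring the three steps from the question
clean_text <- function(x) {
  x %>%
    str_to_lower() %>%
    # replace punctuation, plus an explicitly listed "+", with a space
    str_replace_all("[[:punct:]]|\\+", " ") %>%
    # collapse runs of whitespace into a single space
    str_replace_all("\\s+", " ") %>%
    str_trim()
}

# across(-id, ...) selects every column except 'id'
df_clean <- df %>%
  mutate(across(-id, clean_text))
```

The result is stored in df_clean rather than overwriting df, so the original data stays available for comparison.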