Preprocessing: text analysis on many columns from a dataframe

With the following lines I can preprocess the text in a single column of my dataframe:

library(stringr)  # provides str_trim()

# text to lower case
df$name <- tolower(df$name)

# replace all special characters with a space
df$name <- gsub("[[:punct:]]", " ", df$name)

# trim, then collapse long runs of whitespace
df$name <- gsub("\\s+", " ", str_trim(df$name))

I would like to apply these preprocessing rules to all columns (except id) of a dataframe like this:

df  <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"), E = c("text","stg","1.2"), F = c("press","remove","22"))

edited Dec 31 '17 at 15:34 by Florian

asked Dec 30 '17 at 9:21 by PitterJe

you should supply a data sample if you wish to receive answers

– Seymour
Dec 30 '17 at 9:40

@Seymour as you can see I provide sample data.

– PitterJe
Dec 30 '17 at 10:50

r function dataframe


2 Answers

If you want to do something multiple times, it is often useful to define a function.

For example, you could do the following:

library(stringr)

df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"),
                 E = c("text","stg","1.2"), F = c("press","remove","22"))

# create a function so we can apply this multiple times easily.
process <- function(my_vector)
{
  # text to lower case
  my_vector <- tolower(my_vector)
  # remove all special characters
  my_vector <- gsub("[[:punct:]]", " ", my_vector)
  # remove long spaces
  my_vector <- gsub("\\s+", " ", str_trim(my_vector))
  # return result
  return(my_vector)
}

# for all columns except 'id', apply our function.
for (x in setdiff(colnames(df), "id"))
{
  df[[x]] <- process(df[[x]])
}
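The loop above can also be written without an explicit for by assigning the result of lapply() back to the selected columns. The following is a minimal base R sketch, not part of the original answer; the name clean_text and the use of trimws() in place of stringr's str_trim() are assumptions made so that the example has no package dependencies.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps the columns as character
df <- data.frame(id = c("A", "B", "C"),
                 D = c("mytext 11", "mytext +", "!!"),
                 E = c("text", "stg", "1.2"),
                 F = c("press", "remove", "22"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: the same three steps as process(), using base R only
clean_text <- function(x) {
  x <- tolower(x)                    # text to lower case
  x <- gsub("[[:punct:]]", " ", x)   # punctuation becomes a space
  gsub("\\s+", " ", trimws(x))       # trim, then collapse runs of whitespace
}

# Apply the helper to every column except 'id'
cols <- setdiff(colnames(df), "id")
df[cols] <- lapply(df[cols], clean_text)
```

On the sample data this leaves id untouched, turns "mytext +" into "mytext", "1.2" into "1 2", and "!!" into an empty string.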

answered Dec 30 '17 at 9:42 by Florian

Why anybody would prefer the base R syntax above (for(x in setdiff ...) to mutate_at() is beyond me, especially as the OP is already in the Hadley-verse and using stringr ...

– Stuart Allen
Dec 31 '17 at 5:22


You can use dplyr::mutate_at() to mutate multiple columns; in this case, all columns except for id:

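library(dplyr)
library(stringr)

# mydf is the data frame from the question (named df there)
mydf %>%
  mutate_at(.vars = vars(-id),
            .funs = processText)

Where processText is a function containing your desired code (define it before running the pipeline above):

processText <- function(str) {
  str %>%
    str_to_lower() %>%
    # replace punctuation, whitespace and literal "+" characters with a space
    str_replace_all(pattern = "[[[:punct:]]]|[\\s+]", replacement = " ", .) %>%
    str_trim()
}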

The output is as follows:

  id         D    E      F
1  A mytext 11 text  press
2  B    mytext  stg remove
3  C            1 2     22
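In dplyr 1.0.0 and later, mutate_at() is superseded by across() used inside mutate(). A sketch of the same idea in that style follows; it assumes dplyr >= 1.0.0 is installed, and clean_text is a hypothetical stand-in for processText written with base R functions only, so it is not the answer's exact code.

```r
library(dplyr)

# Sample data from the question; stringsAsFactors = FALSE keeps the columns as character
df <- data.frame(id = c("A", "B", "C"),
                 D = c("mytext 11", "mytext +", "!!"),
                 E = c("text", "stg", "1.2"),
                 F = c("press", "remove", "22"),
                 stringsAsFactors = FALSE)

# Hypothetical helper (assumption): lower-case, replace punctuation, tidy whitespace
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  gsub("\\s+", " ", trimws(x))
}

# across(-id, ...) selects every column except 'id'
result <- df %>%
  mutate(across(-id, clean_text))
```

Here result holds the cleaned copy while df itself is unchanged, which can be convenient when the raw text is still needed later.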

answered Dec 30 '17 at 9:47 by Stuart Allen, edited Dec 30 '17 at 11:50
