Preprocessing: text analysis on many columns from a dataframe












The following lines preprocess the text in a single column of my dataframe:

library(stringr)
# convert text to lower case
df$name <- tolower(df$name)
# replace all punctuation characters with a space
df$name <- gsub("[[:punct:]]", " ", df$name)
# trim the ends, then collapse runs of whitespace into a single space
df$name <- gsub("\\s+", " ", str_trim(df$name))

I would like to apply these preprocessing rules to all columns (except id) of a dataframe like this:

df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"), E = c("text","stg","1.2"), F = c("press","remove","22"))









  • you should supply a data sample if you wish to receive answers

    – Seymour
    Dec 30 '17 at 9:40











  • @Seymour as you can see I provide sample data.

    – PitterJe
    Dec 30 '17 at 10:50
















r function dataframe






edited Dec 31 '17 at 15:34 by Florian
asked Dec 30 '17 at 9:21 by PitterJe













2 Answers
If you need to do something multiple times, it is often useful to define a function.

For example, you could do the following:

library(stringr)

df <- data.frame(id = c("A","B","C"), D = c("mytext 11","mytext +", "!!"),
                 E = c("text","stg","1.2"), F = c("press","remove","22"))

# create a function so we can apply the preprocessing to any vector
process <- function(my_vector)
{
  # convert text to lower case
  my_vector <- tolower(my_vector)
  # replace all punctuation characters with a space
  my_vector <- gsub("[[:punct:]]", " ", my_vector)
  # trim the ends, then collapse runs of whitespace into a single space
  my_vector <- gsub("\\s+", " ", str_trim(my_vector))
  # return result
  return(my_vector)
}

# for all columns except 'id', apply our function
for (x in setdiff(colnames(df), "id"))
{
  df[[x]] <- process(df[[x]])
}
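The same idea can probably be written without an explicit loop or any extra package. The sketch below is only an illustration, not part of the original answer: clean_text is a hypothetical helper name, trimws() stands in for str_trim(), and stringsAsFactors = FALSE is an assumption made so that the columns are plain character vectors.

```r
# Base R sketch: same cleaning steps as above, no extra packages required.
# clean_text is a hypothetical name chosen for this illustration.
clean_text <- function(x) {
  x <- tolower(as.character(x))      # as.character() guards against factor columns
  x <- gsub("[[:punct:]]", " ", x)   # replace each punctuation character with a space
  x <- gsub("\\s+", " ", trimws(x))  # trim the ends, collapse repeated whitespace
  x
}

df <- data.frame(id = c("A", "B", "C"),
                 D = c("mytext 11", "mytext +", "!!"),
                 E = c("text", "stg", "1.2"),
                 F = c("press", "remove", "22"),
                 stringsAsFactors = FALSE)

# every column except 'id'
text_cols <- setdiff(names(df), "id")

# lapply() returns one cleaned vector per column; assign them back in place
df[text_cols] <- lapply(df[text_cols], clean_text)
```

Assigning the list returned by lapply() to df[text_cols] replaces only the selected columns, so id is left untouched.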





  • Why anybody would prefer the base R syntax above (for(x in setdiff ...) to mutate_at() is beyond me, especially as the OP is already in the Hadley-verse and using stringr ...

    – Stuart Allen
    Dec 31 '17 at 5:22



















2














You can use dplyr::mutate_at() to transform multiple columns at once; in this case, all columns except id (mydf is the dataframe from the question):

library(dplyr)
library(stringr)

mydf %>%
  mutate_at(.vars = vars(-id),
            .funs = processText)

where processText is a function containing your desired code:

processText <- function(str) {
  str %>%
    str_to_lower() %>%
    # replace each punctuation character, whitespace character or "+" with a space
    str_replace_all(pattern = "[[:punct:]]|[\\s+]", replacement = " ") %>%
    str_trim()
}

The output is as follows:

  id         D    E      F
1  A mytext 11 text  press
2  B    mytext  stg remove
3  C            1 2     22
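In dplyr 1.0.0 and later, the scoped verbs such as mutate_at() are superseded by across(), so on a current installation the call might instead look like the sketch below. This is an illustration under assumptions, not the answer's own code: process_text and cleaned are hypothetical names, the cleaning steps are written in base R so that only dplyr is needed, and stringsAsFactors = FALSE is assumed so the columns are character vectors.

```r
library(dplyr)

# Hypothetical helper for this sketch: the cleaning steps from the question,
# written in base R.
process_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)   # replace each punctuation character with a space
  gsub("\\s+", " ", trimws(x))       # trim the ends, collapse repeated whitespace
}

mydf <- data.frame(id = c("A", "B", "C"),
                   D = c("mytext 11", "mytext +", "!!"),
                   E = c("text", "stg", "1.2"),
                   F = c("press", "remove", "22"),
                   stringsAsFactors = FALSE)

# across(-id, ...) applies process_text to every column except 'id'
cleaned <- mydf %>%
  mutate(across(-id, process_text))
```

The selection -id plays the same role as vars(-id) in mutate_at(), and mutate() returns a new dataframe rather than modifying mydf in place.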





Answer using a for loop over the columns: answered Dec 30 '17 at 9:42 by Florian













Answer using mutate_at(): edited Dec 30 '17 at 11:50, answered Dec 30 '17 at 9:47 by Stuart Allen





























