How do I remove words numerics pyspark












0














I'm trying to remove only words that are numerical from my words array, but the function I created is not working correctly. When I try to view the information from my dataframe the following error message appears



First I converted my string and word tokens





from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(
inputCol="description",
outputCol="words_withnumber",
pattern="\W"
)

data = regexTokenizer.transform(data)


I created the function to remove only the numbers



from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType

def is_digit(value):
if value:
return value.isdigit()
else:
return False

is_digit_udf = udf(is_digit, BooleanType())


Call function



data = data.withColumn(
'words_withoutnumber',
when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)


Error:




org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):




Sample Dataframe



+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja", "40567","detalhe","couro"] |
| 22033|["multipane","6768686868686867868888","220v","branco"] |
+-----------+-----------------------------------------------------------+


expected result



+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja","detalhe","couro"] |
| 22033|["multipane","220v","branco"] |
+-----------+-----------------------------------------------------------+









share|improve this question




















  • 1




    You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
    – Psidom
    Dec 27 '18 at 22:01










  • You need to iterate over the array and filter out the desired words. What version of spark are you using?
    – pault
    Dec 27 '18 at 22:05










  • @Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
    – user3661384
    Dec 28 '18 at 12:22










  • @pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
    – user3661384
    Dec 28 '18 at 12:44






  • 1




    from pyspark.sql.types import ArrayType
    – pault
    Dec 28 '18 at 13:00


















0














I'm trying to remove only words that are numerical from my words array, but the function I created is not working correctly. When I try to view the information from my dataframe the following error message appears



First I converted my string and word tokens





from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(
inputCol="description",
outputCol="words_withnumber",
pattern="\W"
)

data = regexTokenizer.transform(data)


I created the function to remove only the numbers



from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType

def is_digit(value):
if value:
return value.isdigit()
else:
return False

is_digit_udf = udf(is_digit, BooleanType())


Call function



data = data.withColumn(
'words_withoutnumber',
when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)


Error:




org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):




Sample Dataframe



+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja", "40567","detalhe","couro"] |
| 22033|["multipane","6768686868686867868888","220v","branco"] |
+-----------+-----------------------------------------------------------+


expected result



+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja","detalhe","couro"] |
| 22033|["multipane","220v","branco"] |
+-----------+-----------------------------------------------------------+









share|improve this question




















  • 1




    You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
    – Psidom
    Dec 27 '18 at 22:01










  • You need to iterate over the array and filter out the desired words. What version of spark are you using?
    – pault
    Dec 27 '18 at 22:05










  • @Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
    – user3661384
    Dec 28 '18 at 12:22










  • @pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
    – user3661384
    Dec 28 '18 at 12:44






  • 1




    from pyspark.sql.types import ArrayType
    – pault
    Dec 28 '18 at 13:00
















0












0








0







I'm trying to remove only words that are numerical from my words array, but the function I created is not working correctly. When I try to view the information from my dataframe the following error message appears



First I converted my string and word tokens





from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(
inputCol="description",
outputCol="words_withnumber",
pattern="\W"
)

data = regexTokenizer.transform(data)


I created the function to remove only the numbers



from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType

def is_digit(value):
if value:
return value.isdigit()
else:
return False

is_digit_udf = udf(is_digit, BooleanType())


Call function



data = data.withColumn(
'words_withoutnumber',
when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)


Error:




org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):




Sample Dataframe



+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja", "40567","detalhe","couro"] |
| 22033|["multipane","6768686868686867868888","220v","branco"] |
+-----------+-----------------------------------------------------------+


expected result



+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja","detalhe","couro"] |
| 22033|["multipane","220v","branco"] |
+-----------+-----------------------------------------------------------+









share|improve this question















I'm trying to remove only words that are numerical from my words array, but the function I created is not working correctly. When I try to view the information from my dataframe the following error message appears



First I converted my string and word tokens





from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(
inputCol="description",
outputCol="words_withnumber",
pattern="\W"
)

data = regexTokenizer.transform(data)


I created the function to remove only the numbers



from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType

def is_digit(value):
if value:
return value.isdigit()
else:
return False

is_digit_udf = udf(is_digit, BooleanType())


Call function



data = data.withColumn(
'words_withoutnumber',
when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)


Error:




org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):




Sample Dataframe



+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja", "40567","detalhe","couro"] |
| 22033|["multipane","6768686868686867868888","220v","branco"] |
+-----------+-----------------------------------------------------------+


expected result



+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja","detalhe","couro"] |
| 22033|["multipane","220v","branco"] |
+-----------+-----------------------------------------------------------+






pyspark apache-spark-sql apache-spark-ml






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 28 '18 at 17:49









pault

14.2k31947




14.2k31947










asked Dec 27 '18 at 21:49









user3661384

11818




11818








  • 1




    You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
    – Psidom
    Dec 27 '18 at 22:01










  • You need to iterate over the array and filter out the desired words. What version of spark are you using?
    – pault
    Dec 27 '18 at 22:05










  • @Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
    – user3661384
    Dec 28 '18 at 12:22










  • @pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
    – user3661384
    Dec 28 '18 at 12:44






  • 1




    from pyspark.sql.types import ArrayType
    – pault
    Dec 28 '18 at 13:00
















  • 1




    You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
    – Psidom
    Dec 27 '18 at 22:01










  • You need to iterate over the array and filter out the desired words. What version of spark are you using?
    – pault
    Dec 27 '18 at 22:05










  • @Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
    – user3661384
    Dec 28 '18 at 12:22










  • @pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
    – user3661384
    Dec 28 '18 at 12:44






  • 1




    from pyspark.sql.types import ArrayType
    – pault
    Dec 28 '18 at 13:00










1




1




You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
– Psidom
Dec 27 '18 at 22:01




You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
– Psidom
Dec 27 '18 at 22:01












You need to iterate over the array and filter out the desired words. What version of spark are you using?
– pault
Dec 27 '18 at 22:05




You need to iterate over the array and filter out the desired words. What version of spark are you using?
– pault
Dec 27 '18 at 22:05












@Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
– user3661384
Dec 28 '18 at 12:22




@Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
– user3661384
Dec 28 '18 at 12:22












@pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
– user3661384
Dec 28 '18 at 12:44




@pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
– user3661384
Dec 28 '18 at 12:44




1




1




from pyspark.sql.types import ArrayType
– pault
Dec 28 '18 at 13:00






from pyspark.sql.types import ArrayType
– pault
Dec 28 '18 at 13:00














2 Answers
2






active

oldest

votes


















0














As a help @pault the solution was this.



from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType

def is_digit(value):
if value:
return value.isdigit()
else:
return False

is_digit_udf = udf(is_digit, BooleanType()


Call function



from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.types import StructType

filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))

data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))





share|improve this answer





























    0














    If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.



    Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.



    scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
    df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

    scala> df.show(false)
    +----------+-------------------------------------------------+
    |categoryid|description |
    +----------+-------------------------------------------------+
    |33004 |[short, sarja, 40567, detalhe, couro] |
    |22033 |[multipane, 6768686868686867868888, 220v, branco]|
    |33033 |[0123, x220, 220v, 889] |
    +----------+-------------------------------------------------+


    scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)
    +----------+-------------------------------------------------+------------------------------+
    |categoryid|description |newc |
    +----------+-------------------------------------------------+------------------------------+
    |33004 |[short, sarja, 40567, detalhe, couro] |[short, sarja, detalhe, couro]|
    |22033 |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco] |
    |33033 |[0123, x220, 220v, 889] |[x220, 220v] |
    +----------+-------------------------------------------------+------------------------------+


    scala>


    Spark 2.4 answer



    Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results



    scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
    df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

    scala> df.createOrReplaceTempView("tab")

    scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
    +----------+------------------------------+
    |categoryid|fw |
    +----------+------------------------------+
    |33004 |[short, sarja, detalhe, couro]|
    |22033 |[multipane, 220v, branco] |
    |33033 |[x220, 220v] |
    +----------+------------------------------+


    scala>





    share|improve this answer























      Your Answer






      StackExchange.ifUsing("editor", function () {
      StackExchange.using("externalEditor", function () {
      StackExchange.using("snippets", function () {
      StackExchange.snippets.init();
      });
      });
      }, "code-snippets");

      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "1"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: true,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: 10,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53951215%2fhow-do-i-remove-words-numerics-pyspark%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      2 Answers
      2






      active

      oldest

      votes








      2 Answers
      2






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      0














      As a help @pault the solution was this.



      from pyspark.sql.functions import when,udf
      from pyspark.sql.types import BooleanType

      def is_digit(value):
      if value:
      return value.isdigit()
      else:
      return False

      is_digit_udf = udf(is_digit, BooleanType()


      Call function



      from pyspark.sql.types import ArrayType, StringType
      from pyspark.sql.types import StructType

      filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))

      data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))





      share|improve this answer


























        0














        As a help @pault the solution was this.



        from pyspark.sql.functions import when,udf
        from pyspark.sql.types import BooleanType

        def is_digit(value):
        if value:
        return value.isdigit()
        else:
        return False

        is_digit_udf = udf(is_digit, BooleanType()


        Call function



        from pyspark.sql.types import ArrayType, StringType
        from pyspark.sql.types import StructType

        filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))

        data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))





        share|improve this answer
























          0












          0








          0






          As a help @pault the solution was this.



          from pyspark.sql.functions import when,udf
          from pyspark.sql.types import BooleanType

          def is_digit(value):
          if value:
          return value.isdigit()
          else:
          return False

          is_digit_udf = udf(is_digit, BooleanType()


          Call function



          from pyspark.sql.types import ArrayType, StringType
          from pyspark.sql.types import StructType

          filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))

          data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))





          share|improve this answer












          As a help @pault the solution was this.



          from pyspark.sql.functions import when,udf
          from pyspark.sql.types import BooleanType

          def is_digit(value):
          if value:
          return value.isdigit()
          else:
          return False

          is_digit_udf = udf(is_digit, BooleanType()


          Call function



          from pyspark.sql.types import ArrayType, StringType
          from pyspark.sql.types import StructType

          filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))

          data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Dec 28 '18 at 16:31









          user3661384

          11818




          11818

























              0














              If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.



              Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.



              scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
              df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

              scala> df.show(false)
              +----------+-------------------------------------------------+
              |categoryid|description |
              +----------+-------------------------------------------------+
              |33004 |[short, sarja, 40567, detalhe, couro] |
              |22033 |[multipane, 6768686868686867868888, 220v, branco]|
              |33033 |[0123, x220, 220v, 889] |
              +----------+-------------------------------------------------+


              scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)
              +----------+-------------------------------------------------+------------------------------+
              |categoryid|description |newc |
              +----------+-------------------------------------------------+------------------------------+
              |33004 |[short, sarja, 40567, detalhe, couro] |[short, sarja, detalhe, couro]|
              |22033 |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco] |
              |33033 |[0123, x220, 220v, 889] |[x220, 220v] |
              +----------+-------------------------------------------------+------------------------------+


              scala>


              Spark 2.4 answer



              Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results



              scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
              df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

              scala> df.createOrReplaceTempView("tab")

              scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
              +----------+------------------------------+
              |categoryid|fw |
              +----------+------------------------------+
              |33004 |[short, sarja, detalhe, couro]|
              |22033 |[multipane, 220v, branco] |
              |33033 |[x220, 220v] |
              +----------+------------------------------+


              scala>





              share|improve this answer




























                0














                If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.



                Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.



                scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
                df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

                scala> df.show(false)
                +----------+-------------------------------------------------+
                |categoryid|description |
                +----------+-------------------------------------------------+
                |33004 |[short, sarja, 40567, detalhe, couro] |
                |22033 |[multipane, 6768686868686867868888, 220v, branco]|
                |33033 |[0123, x220, 220v, 889] |
                +----------+-------------------------------------------------+


                scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)
                +----------+-------------------------------------------------+------------------------------+
                |categoryid|description |newc |
                +----------+-------------------------------------------------+------------------------------+
                |33004 |[short, sarja, 40567, detalhe, couro] |[short, sarja, detalhe, couro]|
                |22033 |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco] |
                |33033 |[0123, x220, 220v, 889] |[x220, 220v] |
                +----------+-------------------------------------------------+------------------------------+


                scala>


                Spark 2.4 answer



                Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results



                scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
                df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

                scala> df.createOrReplaceTempView("tab")

                scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
                +----------+------------------------------+
                |categoryid|fw |
                +----------+------------------------------+
                |33004 |[short, sarja, detalhe, couro]|
                |22033 |[multipane, 220v, branco] |
                |33033 |[x220, 220v] |
                +----------+------------------------------+


                scala>





                share|improve this answer


























                  0












                  0








                  0






                  If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.



                  Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.



                  scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
                  df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

                  scala> df.show(false)
                  +----------+-------------------------------------------------+
                  |categoryid|description |
                  +----------+-------------------------------------------------+
                  |33004 |[short, sarja, 40567, detalhe, couro] |
                  |22033 |[multipane, 6768686868686867868888, 220v, branco]|
                  |33033 |[0123, x220, 220v, 889] |
                  +----------+-------------------------------------------------+


                  scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)
                  +----------+-------------------------------------------------+------------------------------+
                  |categoryid|description |newc |
                  +----------+-------------------------------------------------+------------------------------+
                  |33004 |[short, sarja, 40567, detalhe, couro] |[short, sarja, detalhe, couro]|
                  |22033 |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco] |
                  |33033 |[0123, x220, 220v, 889] |[x220, 220v] |
                  +----------+-------------------------------------------------+------------------------------+


                  scala>


                  Spark 2.4 answer



                  Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results



                  scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
                  df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

                  scala> df.createOrReplaceTempView("tab")

                  scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
                  +----------+------------------------------+
                  |categoryid|fw |
                  +----------+------------------------------+
                  |33004 |[short, sarja, detalhe, couro]|
                  |22033 |[multipane, 220v, branco] |
                  |33033 |[x220, 220v] |
                  +----------+------------------------------+


                  scala>





                  share|improve this answer














                  If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.



                  Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.



                  scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
                  df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

                  scala> df.show(false)
                  +----------+-------------------------------------------------+
                  |categoryid|description |
                  +----------+-------------------------------------------------+
                  |33004 |[short, sarja, 40567, detalhe, couro] |
                  |22033 |[multipane, 6768686868686867868888, 220v, branco]|
                  |33033 |[0123, x220, 220v, 889] |
                  +----------+-------------------------------------------------+


                  scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)
                  +----------+-------------------------------------------------+------------------------------+
                  |categoryid|description |newc |
                  +----------+-------------------------------------------------+------------------------------+
                  |33004 |[short, sarja, 40567, detalhe, couro] |[short, sarja, detalhe, couro]|
                  |22033 |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco] |
                  |33033 |[0123, x220, 220v, 889] |[x220, 220v] |
                  +----------+-------------------------------------------------+------------------------------+


                  scala>


                  Spark 2.4 answer



                  Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results



                  scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
                  df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

                  scala> df.createOrReplaceTempView("tab")

                  scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
                  +----------+------------------------------+
                  |categoryid|fw |
                  +----------+------------------------------+
                  |33004 |[short, sarja, detalhe, couro]|
                  |22033 |[multipane, 220v, branco] |
                  |33033 |[x220, 220v] |
                  +----------+------------------------------+


                  scala>






                  share|improve this answer














                  share|improve this answer



                  share|improve this answer








                  edited Jan 1 at 0:08

























                  answered Dec 31 '18 at 20:14









                  stack0114106

                  2,2211417




                  2,2211417






























                      draft saved

                      draft discarded




















































                      Thanks for contributing an answer to Stack Overflow!


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.





                      Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                      Please pay close attention to the following guidance:


                      • Please be sure to answer the question. Provide details and share your research!

                      But avoid



                      • Asking for help, clarification, or responding to other answers.

                      • Making statements based on opinion; back them up with references or personal experience.


                      To learn more, see our tips on writing great answers.




                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function () {
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53951215%2fhow-do-i-remove-words-numerics-pyspark%23new-answer', 'question_page');
                      }
                      );

                      Post as a guest















                      Required, but never shown





















































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown

































                      Required, but never shown














                      Required, but never shown












                      Required, but never shown







                      Required, but never shown







                      Popular posts from this blog

                      Mossoró

                      Error while reading .h5 file using the rhdf5 package in R

                      Pushsharp Apns notification error: 'InvalidToken'