How do I remove words numerics pyspark

I'm trying to remove only words that are numerical from my words array, but the function I created is not working correctly. When I try to view the information from my dataframe the following error message appears

First I converted my string and word tokens

from pyspark.ml.feature import RegexTokenizer

regexTokenizer = RegexTokenizer(

    inputCol="description",

    outputCol="words_withnumber",

    pattern="\W"

)



data = regexTokenizer.transform(data)

I created the function to remove only the numbers

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType())

Call function

data = data.withColumn(

    'words_withoutnumber', 

    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])

)

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Sample Dataframe

+-----------+-----------------------------------------------------------+

|categoryid |description                                                |

+-----------+-----------------------------------------------------------+

|      33004|["short","sarja", "40567","detalhe","couro"]               | 

|      22033|["multipane","6768686868686867868888","220v","branco"]     | 

+-----------+-----------------------------------------------------------+

expected result

+-----------+-----------------------------------------------------------+

|categoryid |description                                                |

+-----------+-----------------------------------------------------------+

|      33004|["short","sarja","detalhe","couro"]                        | 

|      22033|["multipane","220v","branco"]                              |

+-----------+-----------------------------------------------------------+

edited Dec 28 '18 at 17:49

pault

14.2k31947

asked Dec 27 '18 at 21:49

user3661384

11818

1

You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
– Psidom
Dec 27 '18 at 22:01

You need to iterate over the array and filter out the desired words. What version of spark are you using?
– pault
Dec 27 '18 at 22:05

@Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
– user3661384
Dec 28 '18 at 12:22

@pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
– user3661384
Dec 28 '18 at 12:44

1

from pyspark.sql.types import ArrayType
– pault
Dec 28 '18 at 13:00

|
show 1 more comment

First I converted my string and word tokens

from pyspark.ml.feature import RegexTokenizer

regexTokenizer = RegexTokenizer(

    inputCol="description",

    outputCol="words_withnumber",

    pattern="\W"

)



data = regexTokenizer.transform(data)

I created the function to remove only the numbers

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType())

Call function

data = data.withColumn(

    'words_withoutnumber', 

    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])

)

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Sample Dataframe

+-----------+-----------------------------------------------------------+

|categoryid |description                                                |

+-----------+-----------------------------------------------------------+

|      33004|["short","sarja", "40567","detalhe","couro"]               | 

|      22033|["multipane","6768686868686867868888","220v","branco"]     | 

+-----------+-----------------------------------------------------------+

expected result

+-----------+-----------------------------------------------------------+

|categoryid |description                                                |

+-----------+-----------------------------------------------------------+

|      33004|["short","sarja","detalhe","couro"]                        | 

|      22033|["multipane","220v","branco"]                              |

+-----------+-----------------------------------------------------------+

edited Dec 28 '18 at 17:49

pault

14.2k31947

asked Dec 27 '18 at 21:49

user3661384

11818

1

You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
– Psidom
Dec 27 '18 at 22:01

You need to iterate over the array and filter out the desired words. What version of spark are you using?
– pault
Dec 27 '18 at 22:05

@Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
– user3661384
Dec 28 '18 at 12:22

@pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
– user3661384
Dec 28 '18 at 12:44

1

from pyspark.sql.types import ArrayType
– pault
Dec 28 '18 at 13:00

|
show 1 more comment

First I converted my string and word tokens

from pyspark.ml.feature import RegexTokenizer

regexTokenizer = RegexTokenizer(

    inputCol="description",

    outputCol="words_withnumber",

    pattern="\W"

)



data = regexTokenizer.transform(data)

I created the function to remove only the numbers

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType())

Call function

data = data.withColumn(

    'words_withoutnumber', 

    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])

)

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Sample Dataframe

+-----------+-----------------------------------------------------------+

|categoryid |description                                                |

+-----------+-----------------------------------------------------------+

|      33004|["short","sarja", "40567","detalhe","couro"]               | 

|      22033|["multipane","6768686868686867868888","220v","branco"]     | 

+-----------+-----------------------------------------------------------+

expected result

+-----------+-----------------------------------------------------------+

|categoryid |description                                                |

+-----------+-----------------------------------------------------------+

|      33004|["short","sarja","detalhe","couro"]                        | 

|      22033|["multipane","220v","branco"]                              |

+-----------+-----------------------------------------------------------+

edited Dec 28 '18 at 17:49

pault

14.2k31947

asked Dec 27 '18 at 21:49

user3661384

11818

First I converted my string and word tokens

from pyspark.ml.feature import RegexTokenizer

regexTokenizer = RegexTokenizer(

    inputCol="description",

    outputCol="words_withnumber",

    pattern="\W"

)



data = regexTokenizer.transform(data)

I created the function to remove only the numbers

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType())

Call function

data = data.withColumn(

    'words_withoutnumber', 

    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])

)

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Sample Dataframe

+-----------+-----------------------------------------------------------+

|categoryid |description                                                |

+-----------+-----------------------------------------------------------+

|      33004|["short","sarja", "40567","detalhe","couro"]               | 

|      22033|["multipane","6768686868686867868888","220v","branco"]     | 

+-----------+-----------------------------------------------------------+

expected result

+-----------+-----------------------------------------------------------+

|categoryid |description                                                |

+-----------+-----------------------------------------------------------+

|      33004|["short","sarja","detalhe","couro"]                        | 

|      22033|["multipane","220v","branco"]                              |

+-----------+-----------------------------------------------------------+

pyspark apache-spark-sql apache-spark-ml

edited Dec 28 '18 at 17:49

pault

14.2k31947

asked Dec 27 '18 at 21:49

user3661384

11818

edited Dec 28 '18 at 17:49

pault

14.2k31947

asked Dec 27 '18 at 21:49

user3661384

11818

edited Dec 28 '18 at 17:49

pault

14.2k31947

edited Dec 28 '18 at 17:49

pault

14.2k31947

edited Dec 28 '18 at 17:49

pault

14.2k31947

asked Dec 27 '18 at 21:49

user3661384

11818

asked Dec 27 '18 at 21:49

user3661384

11818

asked Dec 27 '18 at 21:49

user3661384

11818

1

You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
– Psidom
Dec 27 '18 at 22:01

You need to iterate over the array and filter out the desired words. What version of spark are you using?
– pault
Dec 27 '18 at 22:05

@Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
– user3661384
Dec 28 '18 at 12:22

@pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
– user3661384
Dec 28 '18 at 12:44

1

from pyspark.sql.types import ArrayType
– pault
Dec 28 '18 at 13:00

|
show 1 more comment

1

You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
– Psidom
Dec 27 '18 at 22:01

You need to iterate over the array and filter out the desired words. What version of spark are you using?
– pault
Dec 27 '18 at 22:05

@Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
– user3661384
Dec 28 '18 at 12:22

@pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
– user3661384
Dec 28 '18 at 12:44

1

from pyspark.sql.types import ArrayType
– pault
Dec 28 '18 at 13:00

You udf expects a string, but you are passing an array to it. Also in your sample data frame should description be words_withnumber?
– Psidom
Dec 27 '18 at 22:01

You need to iterate over the array and filter out the desired words. What version of spark are you using?
– pault
Dec 27 '18 at 22:05

@Psidom, I tried to loop through array but I got the following error message "name 'ArrayType' is not defined
– user3661384
Dec 28 '18 at 12:22

@pault version spark 2.3.1. I see link, i try this filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())) , i receive error "name 'ArrayType' is not defined" – user3661384 25 mins ago
– user3661384
Dec 28 '18 at 12:44

from pyspark.sql.types import ArrayType
– pault
Dec 28 '18 at 13:00

|
show 1 more comment

2 Answers
2

active

oldest

votes

As a help @pault the solution was this.

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType()

Call function

from pyspark.sql.types import ArrayType, StringType

from pyspark.sql.types import StructType



filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))



data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))

answered Dec 28 '18 at 16:31

user3661384

11818

add a comment |

If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.

Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.show(false)

+----------+-------------------------------------------------+

|categoryid|description                                      |

+----------+-------------------------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |

|22033     |[multipane, 6768686868686867868888, 220v, branco]|

|33033     |[0123, x220, 220v, 889]                          |

+----------+-------------------------------------------------+





scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)

+----------+-------------------------------------------------+------------------------------+

|categoryid|description                                      |newc                          |

+----------+-------------------------------------------------+------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |[short, sarja, detalhe, couro]|

|22033     |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco]     |

|33033     |[0123, x220, 220v, 889]                          |[x220, 220v]                  |

+----------+-------------------------------------------------+------------------------------+





scala>

Spark 2.4 answer

Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.createOrReplaceTempView("tab")



scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)

+----------+------------------------------+

|categoryid|fw                            |

+----------+------------------------------+

|33004     |[short, sarja, detalhe, couro]|

|22033     |[multipane, 220v, branco]     |

|33033     |[x220, 220v]                  |

+----------+------------------------------+





scala>

edited Jan 1 at 0:08

answered Dec 31 '18 at 20:14

stack0114106

2,2211417

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53951215%2fhow-do-i-remove-words-numerics-pyspark%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

As a help @pault the solution was this.

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType()

Call function

from pyspark.sql.types import ArrayType, StringType

from pyspark.sql.types import StructType



filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))



data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))

answered Dec 28 '18 at 16:31

user3661384

11818

add a comment |

As a help @pault the solution was this.

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType()

Call function

from pyspark.sql.types import ArrayType, StringType

from pyspark.sql.types import StructType



filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))



data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))

answered Dec 28 '18 at 16:31

user3661384

11818

add a comment |

As a help @pault the solution was this.

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType()

Call function

from pyspark.sql.types import ArrayType, StringType

from pyspark.sql.types import StructType



filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))



data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))

answered Dec 28 '18 at 16:31

user3661384

11818

As a help @pault the solution was this.

from pyspark.sql.functions import when,udf

from pyspark.sql.types import BooleanType



def is_digit(value):

    if value:

        return value.isdigit()

    else:

        return False



is_digit_udf = udf(is_digit, BooleanType()

Call function

from pyspark.sql.types import ArrayType, StringType

from pyspark.sql.types import StructType



filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))



data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))

answered Dec 28 '18 at 16:31

user3661384

11818

answered Dec 28 '18 at 16:31

user3661384

11818

answered Dec 28 '18 at 16:31

user3661384

11818

answered Dec 28 '18 at 16:31

user3661384

11818

add a comment |

If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.

Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.show(false)

+----------+-------------------------------------------------+

|categoryid|description                                      |

+----------+-------------------------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |

|22033     |[multipane, 6768686868686867868888, 220v, branco]|

|33033     |[0123, x220, 220v, 889]                          |

+----------+-------------------------------------------------+





scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)

+----------+-------------------------------------------------+------------------------------+

|categoryid|description                                      |newc                          |

+----------+-------------------------------------------------+------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |[short, sarja, detalhe, couro]|

|22033     |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco]     |

|33033     |[0123, x220, 220v, 889]                          |[x220, 220v]                  |

+----------+-------------------------------------------------+------------------------------+





scala>

Spark 2.4 answer

Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.createOrReplaceTempView("tab")



scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)

+----------+------------------------------+

|categoryid|fw                            |

+----------+------------------------------+

|33004     |[short, sarja, detalhe, couro]|

|22033     |[multipane, 220v, branco]     |

|33033     |[x220, 220v]                  |

+----------+------------------------------+





scala>

edited Jan 1 at 0:08

answered Dec 31 '18 at 20:14

stack0114106

2,2211417

add a comment |

If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.

Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.show(false)

+----------+-------------------------------------------------+

|categoryid|description                                      |

+----------+-------------------------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |

|22033     |[multipane, 6768686868686867868888, 220v, branco]|

|33033     |[0123, x220, 220v, 889]                          |

+----------+-------------------------------------------------+





scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)

+----------+-------------------------------------------------+------------------------------+

|categoryid|description                                      |newc                          |

+----------+-------------------------------------------------+------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |[short, sarja, detalhe, couro]|

|22033     |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco]     |

|33033     |[0123, x220, 220v, 889]                          |[x220, 220v]                  |

+----------+-------------------------------------------------+------------------------------+





scala>

Spark 2.4 answer

Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.createOrReplaceTempView("tab")



scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)

+----------+------------------------------+

|categoryid|fw                            |

+----------+------------------------------+

|33004     |[short, sarja, detalhe, couro]|

|22033     |[multipane, 220v, branco]     |

|33033     |[x220, 220v]                  |

+----------+------------------------------+





scala>

edited Jan 1 at 0:08

answered Dec 31 '18 at 20:14

stack0114106

2,2211417

add a comment |

If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.

Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.show(false)

+----------+-------------------------------------------------+

|categoryid|description                                      |

+----------+-------------------------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |

|22033     |[multipane, 6768686868686867868888, 220v, branco]|

|33033     |[0123, x220, 220v, 889]                          |

+----------+-------------------------------------------------+





scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)

+----------+-------------------------------------------------+------------------------------+

|categoryid|description                                      |newc                          |

+----------+-------------------------------------------------+------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |[short, sarja, detalhe, couro]|

|22033     |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco]     |

|33033     |[0123, x220, 220v, 889]                          |[x220, 220v]                  |

+----------+-------------------------------------------------+------------------------------+





scala>

Spark 2.4 answer

Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.createOrReplaceTempView("tab")



scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)

+----------+------------------------------+

|categoryid|fw                            |

+----------+------------------------------+

|33004     |[short, sarja, detalhe, couro]|

|22033     |[multipane, 220v, branco]     |

|33033     |[x220, 220v]                  |

+----------+------------------------------+





scala>

edited Jan 1 at 0:08

answered Dec 31 '18 at 20:14

stack0114106

2,2211417

If you want to avoid udf() for performance reasons and if comma won't appear in your "description" column, then below scala solution would work. The df.withColumn() should be similar in pyspark.

Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.show(false)

+----------+-------------------------------------------------+

|categoryid|description                                      |

+----------+-------------------------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |

|22033     |[multipane, 6768686868686867868888, 220v, branco]|

|33033     |[0123, x220, 220v, 889]                          |

+----------+-------------------------------------------------+





scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""bd+b""",""),"""^,|,$""",""),",,",","),",")).show(false)

+----------+-------------------------------------------------+------------------------------+

|categoryid|description                                      |newc                          |

+----------+-------------------------------------------------+------------------------------+

|33004     |[short, sarja, 40567, detalhe, couro]            |[short, sarja, detalhe, couro]|

|22033     |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco]     |

|33033     |[0123, x220, 220v, 889]                          |[x220, 220v]                  |

+----------+-------------------------------------------------+------------------------------+





scala>

Spark 2.4 answer

Using spark-sql in version 2.4 onwards, you can use the filter() higher order function and get the results

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")

df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]



scala> df.createOrReplaceTempView("tab")



scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)

+----------+------------------------------+

|categoryid|fw                            |

+----------+------------------------------+

|33004     |[short, sarja, detalhe, couro]|

|22033     |[multipane, 220v, branco]     |

|33033     |[x220, 220v]                  |

+----------+------------------------------+





scala>

edited Jan 1 at 0:08

answered Dec 31 '18 at 20:14

stack0114106

2,2211417

edited Jan 1 at 0:08

answered Dec 31 '18 at 20:14

stack0114106

2,2211417

answered Dec 31 '18 at 20:14

stack0114106

2,2211417

answered Dec 31 '18 at 20:14

stack0114106

2,2211417

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Bdtjtk