How to map String to Seq&lt;String&gt; in Spark in Java
I want to use my own tokenizer to tokenize text stored as Dataset&lt;String&gt; and get Dataset&lt;Seq&lt;String&gt;&gt; (so I can pass it to CountVectorizer).
Expected input (/tmp/fulltext.txt):
t1 t2 t3
t4 t5
Expected output:
[t1, t2, t3]
[t4, t5]
The tokenizer I wrote is below (what it does now is basically the same as the Tokenizer shipped with Spark, but I'll need to rewrite it to support tokenization of Chinese text, so I cannot use the official Tokenizer):
public class Utils {
    public static Seq<String> segment(String text) {
        String[] array = text.split(" ");
        List<String> tokens = new ArrayList<>();
        for (String term : array) {
            tokens.add(term.toLowerCase());
        }
        return JavaConverters
                .asScalaIteratorConverter(tokens.iterator())
                .asScala()
                .toSeq();
    }
}
The Spark application I'm trying to make is
public class TokenizeTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Tokenize Test")
                .getOrCreate();

        Dataset<String> rawText = spark
                .read()
                .textFile("/tmp/fulltext.txt")
                .cache();

        Encoder<Seq> listEncoder = Encoders.bean(Seq.class);
        // Compilation error
        Dataset<Seq<String>> newText = rawText
                .map((MapFunction<String, Seq<String>>) s -> Utils.segment(s), listEncoder);
        newText.show();

        spark.stop();
    }
}
I'm a beginner with Spark; the above code is just what I thought would work (after reading the official guide). But it turns out TokenizeTest doesn't compile at all. Do you think there is a way to fix it?
apache-spark apache-spark-mllib
asked Dec 28 '18 at 12:54 by NGY
Seq is not a bean. You will need a bean class with the Encoder. – Nikhil, Dec 28 '18 at 13:47
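To illustrate Nikhil's point, here is a minimal sketch of the bean-wrapper approach he suggests (the Tokens class name is hypothetical, and bean-encoder support for java.util.List fields depends on your Spark version): Encoders.bean needs a JavaBean over supported field types such as java.util.List&lt;String&gt;, not a Scala Seq.
import java.io.Serializable;
import java.util.List;

// Hypothetical bean wrapper: bean encoders understand java.util.List,
// while scala.collection.Seq is neither a bean nor a supported field type.
public class Tokens implements Serializable {
    private List<String> tokens;

    public List<String> getTokens() { return tokens; }
    public void setTokens(List<String> tokens) { this.tokens = tokens; }
}
With that, Encoders.bean(Tokens.class) is valid where Encoders.bean(Seq.class) is not.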
1 Answer
Using Scala collections like this won't work. For one, Seq is not bean-compatible; for another, it is generic. If you just want to split, use arrays, with segment defined as:
public class Utils {
    public static String[] segment(String text) {
        return text.split(" ");
    }
}
and TokenizeTest defined as:
public class TokenizeTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Tokenize Test")
                .getOrCreate();

        Dataset<String> rawText = spark
                .read()
                .textFile("/path/to/file")
                .cache();

        // Spark ships a native encoder for String[], so no bean class is needed.
        Encoder<String[]> listEncoder = spark.implicits().newStringArrayEncoder();
        Dataset<String[]> newText = rawText
                .map((MapFunction<String, String[]>) s -> Utils.segment(s), listEncoder);
        newText.show();

        spark.stop();
    }
}
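Since the stated goal was to feed CountVectorizer, here is a minimal sketch of that last step, assuming "value" is the default column name the array encoder gives newText:
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;

// Fit a vocabulary over the token arrays and transform them into count vectors.
CountVectorizerModel cvModel = new CountVectorizer()
        .setInputCol("value")   // assumed default column name of Dataset<String[]>
        .setOutputCol("features")
        .fit(newText);
cvModel.transform(newText).show();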
In practice, though, you might consider either org.apache.spark.sql.functions.split or org.apache.spark.ml.feature.Tokenizer instead of reinventing the wheel.
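For reference, a minimal sketch of the functions.split route ("value" is the column name spark.read().textFile() produces):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

// One-liner alternative: split each line on spaces into an array column.
Dataset<Row> tokens = rawText.select(split(col("value"), " ").as("tokens"));
tokens.show();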
answered Dec 28 '18 at 13:44 by user10843263
It works, thank you! It's just that IntelliJ IDEA can't recognize the implicits() method, which made me think the code wouldn't compile. – NGY, Dec 28 '18 at 15:26