How to map String to Seq<String> in Spark in Java

I want to use my own tokenizer to tokenize text stored as Dataset<String>, and get Dataset<Seq<String>> (so I can pass it to CountVectorizer).



Expected input (/tmp/fulltext.txt):



t1 t2 t3
t4 t5


Expected output:



[t1, t2, t3]
[t4, t5]


The tokenizer I wrote is below. For now it does basically the same thing as the Tokenizer shipped with Spark, but I'll need to rewrite it to support tokenization of Chinese text, so I cannot use the official Tokenizer:



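import java.util.ArrayList;
import java.util.List;

import scala.collection.JavaConverters;
import scala.collection.Seq;

public class Utils {

    // Splits on single spaces, lowercases each term, and wraps the result as a Scala Seq.
    public static Seq<String> segment(String text) {
        String[] array = text.split(" ");
        List<String> tokens = new ArrayList<>();
        for (String term : array) {
            tokens.add(term.toLowerCase());
        }
        return JavaConverters
                .asScalaIteratorConverter(tokens.iterator())
                .asScala()
                .toSeq();
    }

}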


The Spark application I'm trying to make is



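import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import scala.collection.Seq;

public class TokenizeTest {

    public static void main(String[] args) {

        SparkSession spark = SparkSession
                .builder()
                .appName("Tokenize Test")
                .getOrCreate();

        Dataset<String> rawText = spark
                .read()
                .textFile("/tmp/fulltext.txt")
                .cache();

        Encoder<Seq> listEncoder = Encoders.bean(Seq.class);

        // Compilation error
        Dataset<Seq<String>> newText = rawText
                .map((MapFunction<String, Seq<String>>) s -> Utils.segment(s), listEncoder);

        newText.show();
        spark.stop();
    }
}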


I'm a beginner with Spark; the code above is just what I thought would work after reading the official guide. But it turns out that TokenizeTest doesn't compile at all. Is there a way to fix it?










  • Seq is not a bean. You will need a bean class with the Encoder.

    – Nikhil
    Dec 28 '18 at 13:47
















apache-spark apache-spark-mllib






asked Dec 28 '18 at 12:54









NGY

1 Answer
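Using Scala collections like this won't work. For one thing, Seq is not bean-compatible; for another, it is generic, so Encoders.bean(Seq.class) can at best give you a raw Encoder<Seq>, not an Encoder<Seq<String>>.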



If you just want to split, use arrays, with segment defined as:





public class Utils {

    // Returns the tokens as a plain Java array, which Spark can encode directly.
    public static String[] segment(String text) {
        return text.split(" ");
    }

}


and TokenizeTest defined as:



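import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.SparkSession;

public class TokenizeTest {

    public static void main(String[] args) {

        SparkSession spark = SparkSession
                .builder()
                .appName("Tokenize Test")
                .getOrCreate();

        Dataset<String> rawText = spark
                .read()
                .textFile("/path/to/file")
                .cache();

        // Encoder for String[] taken from the session's implicits
        Encoder<String[]> listEncoder = spark.implicits().newStringArrayEncoder();

        Dataset<String[]> newText = rawText
                .map((MapFunction<String, String[]>) s -> Utils.segment(s), listEncoder);

        newText.show();
        spark.stop();
    }
}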


In practice though, you might consider either org.apache.spark.sql.functions.split or org.apache.spark.ml.feature.Tokenizer instead of reinventing the wheel.
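
Note that the array-based segment above no longer lowercases the tokens, which the tokenizer in the question did. If that matters, a minimal sketch in plain Java (no Spark required) might look like the following. The class name SegmentSketch, the whitespace regex, and the handling of blank input are my own assumptions for illustration, not part of the original code:

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper (the name is illustrative): splits on runs of
// whitespace and lowercases each token, returning a plain String[].
final class SegmentSketch {

    private SegmentSketch() {}

    static String[] segment(String text) {
        String trimmed = text.trim();
        // Assumption: a blank line yields no tokens rather than one empty token.
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .map(term -> term.toLowerCase(Locale.ROOT))
                .toArray(String[]::new);
    }
}
```

Since it still returns String[], it should fit the same map call and string-array encoder used in TokenizeTest.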






  • It works, thank you! The only issue is that IntelliJ IDEA can't recognize the implicits() method, which made me think the code wouldn't compile.

    – NGY
    Dec 28 '18 at 15:26











answered Dec 28 '18 at 13:44







user10843263



















