How to map String to Seq&lt;String&gt; in Spark in Java
I want to use my own tokenizer to tokenize text stored as Dataset&lt;String&gt; and get Dataset&lt;Seq&lt;String&gt;&gt; (so I can pass it to CountVectorizer).
Expected input (/tmp/fulltext.txt):
t1 t2 t3
t4 t5
Expected output:
[t1, t2, t3]
[t4, t5]
The tokenizer I wrote is below (what it does now is basically the same as the Tokenizer shipped with Spark, but I'll need to rewrite it to support tokenization of Chinese text, so I cannot use the official Tokenizer):
public class Utils {
    public static Seq<String> segment(String text) {
        String[] array = text.split(" ");
        List<String> tokens = new ArrayList<>();
        for (String term : array) {
            tokens.add(term.toLowerCase());
        }
        return JavaConverters
                .asScalaIteratorConverter(tokens.iterator())
                .asScala()
                .toSeq();
    }
}
The Spark application I'm trying to make is
public class TokenizeTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Tokenize Test")
                .getOrCreate();

        Dataset<String> rawText = spark
                .read()
                .textFile("/tmp/fulltext.txt")
                .cache();

        Encoder<Seq> listEncoder = Encoders.bean(Seq.class);
        // Compilation error
        Dataset<Seq<String>> newText = rawText
                .map((MapFunction<String, Seq<String>>) s -> Utils.segment(s), listEncoder);
        newText.show();

        spark.stop();
    }
}
I'm a beginner with Spark; the above code is just what I thought would work (after reading the official guide). But it turns out TokenizeTest doesn't compile at all. Do you think there is a way to fix it?
apache-spark apache-spark-mllib
asked Dec 28 '18 at 12:54 by NGY
Seq is not a bean. You will need a bean class with the Encoder. – Nikhil, Dec 28 '18 at 13:47
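To illustrate Nikhil's point, here is a minimal sketch of the bean-wrapper approach he suggests (the Tokens class name is hypothetical, and bean-encoder support for java.util.List fields depends on your Spark version): Encoders.bean needs a JavaBean over supported field types such as java.util.List&lt;String&gt;, not a Scala Seq.
import java.io.Serializable;
import java.util.List;

// Hypothetical bean wrapper: bean encoders understand java.util.List,
// while scala.collection.Seq is neither a bean nor a supported field type.
public class Tokens implements Serializable {
    private List<String> tokens;

    public List<String> getTokens() { return tokens; }
    public void setTokens(List<String> tokens) { this.tokens = tokens; }
}
With that, Encoders.bean(Tokens.class) is valid where Encoders.bean(Seq.class) is not.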
1 Answer
Using Scala collections like this won't work. For one, Seq is not bean-compatible; for another, it is generic. If you just want to split, use arrays, with segment defined as:
public class Utils {
    public static String[] segment(String text) {
        return text.split(" ");
    }
}
and TokenizeTest defined as:
public class TokenizeTest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Tokenize Test")
                .getOrCreate();

        Dataset<String> rawText = spark
                .read()
                .textFile("/path/to/file")
                .cache();

        // Spark ships a native encoder for String[], so no bean class is needed.
        Encoder<String[]> listEncoder = spark.implicits().newStringArrayEncoder();
        Dataset<String[]> newText = rawText
                .map((MapFunction<String, String[]>) s -> Utils.segment(s), listEncoder);
        newText.show();

        spark.stop();
    }
}
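Since the stated goal was to feed CountVectorizer, here is a minimal sketch of that last step, assuming "value" is the default column name the array encoder gives newText:
import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;

// Fit a vocabulary over the token arrays and transform them into count vectors.
CountVectorizerModel cvModel = new CountVectorizer()
        .setInputCol("value")   // assumed default column name of Dataset<String[]>
        .setOutputCol("features")
        .fit(newText);
cvModel.transform(newText).show();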
In practice, though, you might consider either org.apache.spark.sql.functions.split or org.apache.spark.ml.feature.Tokenizer instead of reinventing the wheel.
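For reference, a minimal sketch of the functions.split route ("value" is the column name spark.read().textFile() produces):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

// One-liner alternative: split each line on spaces into an array column.
Dataset<Row> tokens = rawText.select(split(col("value"), " ").as("tokens"));
tokens.show();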
answered Dec 28 '18 at 13:44 by user10843263
It works, thank you! It's just that IntelliJ IDEA can't recognize the implicits() method, which made me think the code wouldn't compile. – NGY, Dec 28 '18 at 15:26