Scala: How to take any generic sequence as input to this method
Scala noob here. Still trying to learn the syntax.
I am trying to reduce the code I have to write to convert my test data into DataFrames. Here is what I have right now:
def makeDf[T](seq: Seq[(Int, Int)], colNames: String*): Dataset[Row] = {
  val context = session.sqlContext
  import context.implicits._
  seq.toDF(colNames: _*)
}
The problem is that the above method only takes a sequence of the shape Seq[(Int, Int)] as input. How do I make it take any sequence as input? I can change the input's shape to Seq[AnyRef], but then the code fails to recognize the toDF call as a valid symbol.
I am not able to figure out how to make this work. Any ideas? Thanks!
scala apache-spark dataframe apache-spark-sql
asked Jan 1 at 9:10 by Niyaz
As far as I know, Spark doesn't support AnyRef in udf()s.
– stack0114106
Jan 1 at 10:42
As far as I can see, you declared the generic type T but never used it; toDF is called on the seq, so if you make the parameter of type Seq[T] it should work fine.
– Raman Mishra
Jan 1 at 11:24
2 Answers
Short answer:
import scala.reflect.runtime.universe.TypeTag
def makeDf[T <: Product: TypeTag](seq: Seq[T], colNames: String*): DataFrame = ...
Explanation:
When you call seq.toDF you are actually using an implicit conversion defined in SQLImplicits:
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(s))
}
which in turn requires the generation of an Encoder. The problem is that encoders are only defined for certain types, specifically subtypes of Product (i.e. tuples, case classes, etc.). You also need the TypeTag implicit so that Scala can work around type erasure (at runtime every sequence is just a Seq, regardless of its generic type parameter; the TypeTag carries that missing type information).
As a side note, you do not need to extract the SQLContext from the session; you can simply use:
import sparkSession.implicits._
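Putting it together, a minimal sketch of the full method might look like this (assuming an existing SparkSession called session, as in the question):
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.DataFrame

def makeDf[T <: Product: TypeTag](seq: Seq[T], colNames: String*): DataFrame = {
  // the session's implicits provide both localSeqToDatasetHolder and the Product encoders
  import session.implicits._
  seq.toDF(colNames: _*)
}

// works for tuples of any arity as well as for case classes
val tupleDf = makeDf(Seq((1, "a"), (2, "b")), "id", "label")
case class Point(x: Int, y: Int)
val pointDf = makeDf(Seq(Point(1, 2), Point(3, 4)), "x", "y")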
answered Jan 1 at 11:34 by Assaf Mendelson
Super cool. Thank you!
– Niyaz
Jan 1 at 15:41
As @AssafMendelson already explained, the real reason you cannot create a Dataset of Any is that Spark needs an Encoder to transform objects from their JVM representation to its internal representation, and Spark cannot guarantee the generation of such an Encoder for the Any type.
Assaf's answer is correct and will work. However, IMHO, it is too restrictive, as it only works for Products (tuples and case classes); even if that covers most use cases, a few are still excluded.
Since what you really need is an Encoder, you may leave that responsibility to the client, which in most situations will only need to import spark.implicits._ to get the encoders in scope.
Thus, this is what I believe to be the most general solution:
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
// Implicit SparkSession to make the call to further methods more transparent.
implicit val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
def makeDf[T: Encoder](seq: Seq[T], colNames: String*)
          (implicit spark: SparkSession): DataFrame =
  spark.createDataset(seq).toDF(colNames: _*)

def makeDS[T: Encoder](seq: Seq[T])
          (implicit spark: SparkSession): Dataset[T] =
  spark.createDataset(seq)
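For example, assuming the implicit spark session defined above is in scope, client code might use these helpers like this (hypothetical test data):
import spark.implicits._  // brings the Encoder instances for common types into scope

case class User(name: String, age: Int)  // hypothetical test type

val pairsDf = makeDf(Seq(("a", 1), ("b", 2)), "letter", "count")  // DataFrame with renamed columns
val usersDs = makeDS(Seq(User("alice", 30), User("bob", 25)))     // Dataset[User]
val intsDf  = makeDf(Seq(1, 2, 3), "value")                       // works for non-Product element types too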
Note: this is basically re-inventing functions that Spark already provides.
answered Jan 1 at 15:46 by Luis Miguel Mejía Suárez