[SPARK-11905] [SQL] Support Persist/Cache and Unpersist in Dataset APIs #9889

Closed
wants to merge 25 commits

Conversation

gatorsmile
Member

Persist and Unpersist exist in both the RDD and DataFrame APIs. I think they are still very critical in the Dataset API. I am not sure if my understanding is correct; if so, could you help me check whether the implementation is acceptable?

Please provide your opinions. @marmbrus @rxin @cloud-fan

Thank you very much!

@gatorsmile gatorsmile changed the title [SPARK-11905] Support Persist/Cache and Unpersist in Dataset APIs [SPARK-11905] [SQL] Support Persist/Cache and Unpersist in Dataset APIs Nov 22, 2015
@SparkQA

SparkQA commented Nov 22, 2015

Test build #46485 has finished for PR 9889 at commit c135e1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

I'm worried the existing caching mechanisms might not work on dataset operations. Do we have a good notion of equality for encoders and lambda functions? Can you add some test coverage for this?
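To illustrate the concern (a minimal sketch, not code from this patch): Scala function values use reference equality, so two syntactically identical lambdas are distinct objects, and any logical plan node that embeds one will not compare equal to a node embedding the other.

```scala
// Two identical lambdas compile to distinct closure instances, and
// Function1 does not override equals, so only reference equality applies.
val f1 = (i: Int) => i + 1
val f2 = (i: Int) => i + 1

println(f1 == f2)  // false: different instances
println(f1 == f1)  // true: same instance
```

Because the cache manager looks up cached data by plan equality, a plan holding `f1` would not match an otherwise identical plan holding `f2`, even though both compute the same result.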

@gatorsmile
Member Author

I see, I'll give it a try. Thanks!

@gatorsmile
Member Author

@marmbrus Do these newly added test cases resolve your concerns?

@SparkQA

SparkQA commented Nov 23, 2015

Test build #46510 has finished for PR 9889 at commit 2517777.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 25, 2015

cc @marmbrus

I will let you merge this one.


/**
* @since 1.6.0
*/
Contributor


The comment style here is off and we should actually have a description. Could we just move the functions/docs from DataFrame to Queryable?

Member Author


So far, we are unable to move the functions to Queryable because the return types are different. I just added the descriptions in both DataFrame and Dataset. Hopefully, that resolves your concern. Thanks!
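The return-type clash can be sketched as follows (hypothetical, simplified signatures, not the actual classes): `persist()` on a DataFrame should return a DataFrame, while on a `Dataset[T]` it should return `Dataset[T]`, so a single concrete implementation on Queryable cannot express both without something like `this.type`.

```scala
// Hypothetical sketch of the signature problem.
trait Queryable {
  // A shared concrete method would have to pick a single return type,
  // losing the caller's specific type. One workaround is this.type:
  def persist(): this.type
}

class DataFrame extends Queryable {
  def persist(): this.type = { /* register with cache manager */ this }
}

class Dataset[T] extends Queryable {
  def persist(): this.type = { /* register with cache manager */ this }
}
```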

Contributor


@marmbrus moving functions into Queryable actually breaks both scaladoc and javadoc.

Contributor


@rxin I think that's only because we explicitly exclude execution from scaladoc. Maybe we should move Queryable, or stop excluding that class? I don't want to duplicate a ton of docs.

@marmbrus
Contributor

It would be great to also have tests that ensure that things like .as[Class] do not break caching.

@@ -27,6 +28,7 @@ private[sql] trait Queryable {
def schema: StructType
def queryExecution: QueryExecution
def sqlContext: SQLContext
private[sql] def logicalPlan: LogicalPlan
Contributor


Can we just get this from the queryExecution in the cache manager? Or at least define it explicitly here. I don't want DataFrames and Datasets to fall out of sync with regard to what the canonical plan phase is.

@@ -17,6 +17,8 @@

package org.apache.spark.sql

import org.apache.spark.storage.StorageLevel
Contributor


Order imports.

@marmbrus
Contributor

I wouldn't block merging an initial version of this feature on this, but it would also be nice if we could support the following (this might be hard though):

val f = (i: Int) => i + 1

val ds = Seq(1,2,3).toDS()
val mapped = ds.map(f)
mapped.cache()

val mapped2 = ds.map(f)
assertCached(mapped2)
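For this example to pass, the cache manager's lookup would need the two mapped plans to compare equal. Since both capture the same closure instance `f`, reference equality on the function field could in principle suffice; the hard part is fresh-but-identical lambdas. A hypothetical, simplified model of the matching (with invented case-class names, not Spark's actual plan nodes):

```scala
// Hypothetical, simplified model of how a cache manager matches plans.
sealed trait Plan
case class LocalRelation(data: Seq[Int]) extends Plan
case class MapElements(func: Int => Int, child: Plan) extends Plan

val f = (i: Int) => i + 1
val child = LocalRelation(Seq(1, 2, 3))

// Both plans capture the *same* closure instance, so case-class
// equality (reference equality for the function field) holds:
MapElements(f, child) == MapElements(f, child)              // true

// With a fresh-but-identical lambda, equality fails, which is why
// the general case is hard:
MapElements((i: Int) => i + 1, child) == MapElements(f, child)  // false
```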

@gatorsmile
Member Author

Now I understand your concern. Thank you for the example! I added your example to the newly created test suite CacheSuite. I saw the failure and thus used ignore to disable the case. I will keep investigating the issue after the merge.

I am running the test cases on my local machine and will upload the new changes tomorrow morning. Thank you for your help!

@rxin
Contributor

rxin commented Nov 25, 2015

@gatorsmile just FYI, if you have time, the Python test work is probably much more important than the more complicated caching cases.

@gatorsmile
Member Author

@rxin Sure, I will do the Python testing first. Thanks!

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Nov 25, 2015

Test build #46689 has finished for PR 9889 at commit 92ede39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@marmbrus Not sure if the latest code changes resolve all your concerns. Please let me know if you have any suggestion. Thank you!

Have a good Thanksgiving Day!

ds2.persist()
assertCached(ds2)

val joined = ds1.joinWith(ds2, $"a.value" === $"b.value")
Contributor


assertCached(joined) here.

@gatorsmile
Member Author

Thank you! @marmbrus

Will do the changes soon.

@SparkQA

SparkQA commented Dec 1, 2015

Test build #46925 has finished for PR 9889 at commit b8d287a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@marmbrus Please check the latest changes. Feel free to let me know if we need more changes. Thank you!

@marmbrus
Contributor

marmbrus commented Dec 1, 2015

Thanks, merging to master and 1.6.

asfgit pushed a commit that referenced this pull request Dec 1, 2015

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #9889 from gatorsmile/persistDS.

(cherry picked from commit 0a7bca2)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 0a7bca2 Dec 1, 2015