[SPARK-11905] [SQL] Support Persist/Cache and Unpersist in Dataset APIs #9889

Closed
wants to merge 25 commits

Conversation

gatorsmile
Member

Persist and Unpersist exist in both the RDD and DataFrame APIs. I think they are still very critical in the Dataset API. I am not sure if my understanding is correct; if so, could you help me check whether the implementation is acceptable?

Please provide your opinions. @marmbrus @rxin @cloud-fan

Thank you very much!

@gatorsmile gatorsmile changed the title [SPARK-11905] Support Persist/Cache and Unpersist in Dataset APIs [SPARK-11905] [SQL] Support Persist/Cache and Unpersist in Dataset APIs Nov 22, 2015
@SparkQA

SparkQA commented Nov 22, 2015

Test build #46485 has finished for PR 9889 at commit c135e1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

I'm worried the existing caching mechanisms might not work on dataset operations. Do we have a good notion of equality for encoders and lambda functions? Can you add some test coverage for this?
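To illustrate the concern (a minimal sketch, not code from this patch): Scala function values use reference equality, so two syntactically identical lambdas are distinct objects, and any logical plan node that embeds one will not compare equal to a node embedding the other.

```scala
// Two identical lambdas compile to distinct closure instances, and
// Function1 does not override equals, so only reference equality applies.
val f1 = (i: Int) => i + 1
val f2 = (i: Int) => i + 1

println(f1 == f2)  // false: different instances
println(f1 == f1)  // true: same instance
```

Because the cache manager looks up cached data by plan equality, a plan holding `f1` would not match an otherwise identical plan holding `f2`, even though both compute the same result.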

@gatorsmile
Member Author

I see, I'll give it a try. Thanks!

@gatorsmile
Member Author

@marmbrus Do these newly added test cases resolve your concerns?

@SparkQA

SparkQA commented Nov 23, 2015

Test build #46510 has finished for PR 9889 at commit 2517777.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 25, 2015

cc @marmbrus

I will let you merge this one.


/**
* @since 1.6.0
*/
Contributor


The comment style here is off and we should actually have a description. Could we just move the functions/docs from DataFrame to Queryable?

Member Author


So far, we are unable to move the functions to Queryable because the return types are different. I just added the descriptions in both DataFrame and Dataset. Hopefully, that resolves your concern. Thanks!
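The return-type clash can be sketched as follows (hypothetical, simplified signatures, not the actual classes): `persist()` on a DataFrame should return a DataFrame, while on a `Dataset[T]` it should return `Dataset[T]`, so a single concrete implementation on Queryable cannot express both without something like `this.type`.

```scala
// Hypothetical sketch of the signature problem.
trait Queryable {
  // A shared concrete method would have to pick a single return type,
  // losing the caller's specific type. One workaround is this.type:
  def persist(): this.type
}

class DataFrame extends Queryable {
  def persist(): this.type = { /* register with cache manager */ this }
}

class Dataset[T] extends Queryable {
  def persist(): this.type = { /* register with cache manager */ this }
}
```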

Contributor


@marmbrus moving functions into Queryable actually breaks both scaladoc and javadoc.

Contributor


@rxin I think that's only because we explicitly exclude execution from scaladoc. Maybe we should move Queryable, or stop excluding that class? I don't want to duplicate a ton of docs.

@marmbrus
Contributor

It would be great to also have tests that ensure that things like .as[Class] do not break caching.

@@ -27,6 +28,7 @@ private[sql] trait Queryable {
def schema: StructType
def queryExecution: QueryExecution
def sqlContext: SQLContext
private[sql] def logicalPlan: LogicalPlan
Contributor


Can we just get this from the queryExecution in the cache manager? Or at least define it explicitly here. I don't want DataFrames and Datasets to fall out of sync with regard to what the canonical plan phase is.

@@ -17,6 +17,8 @@

package org.apache.spark.sql

import org.apache.spark.storage.StorageLevel
Contributor


Order imports.

@marmbrus
Contributor

I wouldn't block merging an initial version of this feature on this, but it would also be nice if we could support the following (this might be hard though):

val f = (i: Int) => i + 1

val ds = Seq(1,2,3).toDS()
val mapped = ds.map(f)
mapped.cache()

val mapped2 = ds.map(f)
assertCached(mapped2)
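For this example to pass, the cache manager's lookup would need the two mapped plans to compare equal. Since both capture the same closure instance `f`, reference equality on the function field could in principle suffice; the hard part is fresh-but-identical lambdas. A hypothetical, simplified model of the matching (with invented case-class names, not Spark's actual plan nodes):

```scala
// Hypothetical, simplified model of how a cache manager matches plans.
sealed trait Plan
case class LocalRelation(data: Seq[Int]) extends Plan
case class MapElements(func: Int => Int, child: Plan) extends Plan

val f = (i: Int) => i + 1
val child = LocalRelation(Seq(1, 2, 3))

// Both plans capture the *same* closure instance, so case-class
// equality (reference equality for the function field) holds:
MapElements(f, child) == MapElements(f, child)              // true

// With a fresh-but-identical lambda, equality fails, which is why
// the general case is hard:
MapElements((i: Int) => i + 1, child) == MapElements(f, child)  // false
```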

@gatorsmile
Member Author

Now I understand your concern. Thank you for the example! I added your example to the newly created test suite CacheSuite. I saw the failure and thus used ignore to disable the case. I will keep investigating the issue after the merge.

I am running the test cases on my local machine and will upload the new changes tomorrow morning. Thank you for your help!

@rxin
Contributor

rxin commented Nov 25, 2015

@gatorsmile just FYI, if you have time, the Python test work is probably much more important than the more complicated caching cases.

@gatorsmile
Member Author

@rxin Sure, I will do the Python testing first. Thanks!

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Nov 25, 2015

Test build #46689 has finished for PR 9889 at commit 92ede39.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@marmbrus Not sure if the latest code changes resolve all your concerns. Please let me know if you have any suggestion. Thank you!

Have a good Thanksgiving Day!

ds2.persist()
assertCached(ds2)

val joined = ds1.joinWith(ds2, $"a.value" === $"b.value")
Contributor


assertCached(joined) here.

@gatorsmile
Member Author

Thank you! @marmbrus

Will do the changes soon.

@SparkQA

SparkQA commented Dec 1, 2015

Test build #46925 has finished for PR 9889 at commit b8d287a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@marmbrus Please check the latest changes. Feel free to let me know if we need more changes. Thank you!

@marmbrus
Contributor

marmbrus commented Dec 1, 2015

Thanks, merging to master and 1.6.

asfgit pushed a commit that referenced this pull request Dec 1, 2015

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #9889 from gatorsmile/persistDS.

(cherry picked from commit 0a7bca2)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 0a7bca2 Dec 1, 2015