
Commit 3662b9f

gatorsmile authored and marmbrus committed
[SPARK-11876][SQL] Support printSchema in DataSet API
DataSet APIs look great! However, I am lost when doing multiple-level joins. For example:

```scala
val ds1 = Seq(("a", 1), ("b", 2)).toDS().as("a")
val ds2 = Seq(("a", 1), ("b", 2)).toDS().as("b")
val ds3 = Seq(("a", 1), ("b", 2)).toDS().as("c")
ds1.joinWith(ds2, $"a._2" === $"b._2").as("ab").joinWith(ds3, $"ab._1._2" === $"c._2").printSchema()
```

The printed schema is like:

```
root
 |-- _1: struct (nullable = true)
 |    |-- _1: struct (nullable = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: integer (nullable = true)
 |    |-- _2: struct (nullable = true)
 |    |    |-- _1: string (nullable = true)
 |    |    |-- _2: integer (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: integer (nullable = true)
```

Personally, I think we need the printSchema function. Sometimes I do not know how to specify a column, especially when the data types are mixed. For example, to write the following select against the result of the multi-level join above, I have to know the schema:

```scala
newDS.select(expr("_1._2._2 + 1").as[Int]).collect()
```

marmbrus rxin cloud-fan Do you have the same feeling?

Author: gatorsmile <gatorsmile@gmail.com>

Closes apache#9855 from gatorsmile/printSchemaDataSet.

(cherry picked from commit bef361c)
Signed-off-by: Michael Armbrust <michael@databricks.com>
1 parent 92d3563 commit 3662b9f

File tree

2 files changed: +9 −9 lines changed

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

Lines changed: 0 additions & 9 deletions

```diff
@@ -299,15 +299,6 @@ class DataFrame private[sql](
    */
   def columns: Array[String] = schema.fields.map(_.name)
 
-  /**
-   * Prints the schema to the console in a nice tree format.
-   * @group basic
-   * @since 1.3.0
-   */
-  // scalastyle:off println
-  def printSchema(): Unit = println(schema.treeString)
-  // scalastyle:on println
-
   /**
    * Returns true if the `collect` and `take` methods can be run locally
    * (without any Spark executors).
```

sql/core/src/main/scala/org/apache/spark/sql/execution/Queryable.scala

Lines changed: 9 additions & 0 deletions

```diff
@@ -37,6 +37,15 @@ private[sql] trait Queryable {
     }
   }
 
+  /**
+   * Prints the schema to the console in a nice tree format.
+   * @group basic
+   * @since 1.3.0
+   */
+  // scalastyle:off println
+  def printSchema(): Unit = println(schema.treeString)
+  // scalastyle:on println
+
   /**
    * Prints the plans (logical and physical) to the console for debugging purposes.
    * @since 1.3.0
```
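The diff above moves `printSchema` out of `DataFrame` and into the shared `Queryable` trait, so both `DataFrame` and `Dataset` inherit a single implementation. A minimal, self-contained sketch of that pattern (all class and method names here are hypothetical stand-ins, not the real Spark types):

```scala
// Sketch of hoisting a concrete method into a shared trait.
// `treeString` stands in for Spark's `schema.treeString`.
trait Queryable {
  def treeString: String
  // One shared implementation, inherited by every Queryable.
  def printSchema(): Unit = println(treeString)
}

// Hypothetical stand-in for DataFrame.
final class ToyDataFrame extends Queryable {
  def treeString: String = "root\n |-- _1: string (nullable = true)"
}

// Hypothetical stand-in for Dataset.
final class ToyDataset extends Queryable {
  def treeString: String = "root\n |-- value: integer (nullable = true)"
}

object Demo {
  def main(args: Array[String]): Unit = {
    new ToyDataFrame().printSchema() // both types share printSchema
    new ToyDataset().printSchema()
  }
}
```

Each concrete type only supplies its own schema string; the printing logic lives in one place, which is exactly why the commit's net change is +9/−9 lines.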
