- A sequential MISON parser as a proof of concept. MISON is a query-driven JSON parser that parses only the columns a query needs, improving performance (see the illustration below). The implementation is based on the research paper *Mison: A Fast JSON Parser for Data Analytics* (Li et al., VLDB 2017).
- Extracts JSON data based on specified columns
- Contains a file (SparkContext.scala) that integrates MISON into Spark (version 2.1.1)
- Converts JSON data to a Spark DataFrame (enabling comparison against Spark's built-in JSON extraction)
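A hypothetical illustration of the idea (the record and query columns below are made up):

```
Record: {"id": 1, "user": {"name": "alice"}, "text": "hello", "lang": "en"}
Query:  id, user.name
Result: id = 1, user.name = "alice"   (all other fields are skipped)
```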
- Go to Spark's website and download the Spark 2.1.1 source code (released May 02, 2017), as this is the version this project was developed against.
- After extracting spark-2.1.1.tgz, install Java, Python, and Scala so that Spark's shell can run.
- In the Spark folder (spark-2.1.1), run the following in the terminal. (More detailed build instructions are available on Spark's website.)
$ build/sbt clean package
- Note: the paths below are relative to the Spark folder (spark-2.1.1).
- Move fileHandler.scala, Bitmaps.scala, Bits.scala, Parser.scala, and SparkContext.scala into spark-2.1.1/core/src/main/scala/org/apache/spark
- fileHandler.scala: loads the text files (containing JSON) into a String
- Bitmaps.scala: converts the String into structural bitmaps (see the sketch after this list)
- Bits.scala: represents bits and supports bit-level operations
- Parser.scala: walks the bitmaps and extracts the queried fields needed to build a DataFrame
- SparkContext.scala: calls the loading, bitmap-conversion, and parsing functions, then builds the DataFrame
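The bitmaps are the core of the Mison technique: for each structural character (such as `:` or `"`), build a bitmap with one bit per input byte, then jump between set bits using bit tricks instead of scanning the string character by character. A minimal sketch of that idea, not the actual code in Bitmaps.scala/Bits.scala:

```scala
// Sketch of the structural-bitmap idea from the Mison paper.
object BitmapSketch {
  // Returns an array of 64-bit words; bit i is set iff json(i) == ch.
  def characterBitmap(json: String, ch: Char): Array[Long] = {
    val words = new Array[Long]((json.length + 63) / 64)
    var i = 0
    while (i < json.length) {
      if (json.charAt(i) == ch)
        words(i / 64) |= 1L << (i % 64)
      i += 1
    }
    words
  }

  def main(args: Array[String]): Unit = {
    val record = """{"id": 1, "name": "alice"}"""
    val colons = characterBitmap(record, ':')
    // Iterate over set bits; the parser jumps straight to each colon
    // instead of re-scanning the record.
    for (w <- colons.indices) {
      var bits = colons(w)
      while (bits != 0) {
        val pos = w * 64 + java.lang.Long.numberOfTrailingZeros(bits)
        println(s"colon at offset $pos")
        bits &= bits - 1 // clear the lowest set bit (a Mison bit trick)
      }
    }
  }
}
```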
- Create a folder in spark-2.1.1/bin and place MData.txt, MQuery.txt, and Mtweet.txt in it. As testCorrectness.scala and MData.txt assume, we named the folder test. (Hypothetical example contents follow this list.)
- MData.txt: contains the paths of the data files (so multiple files can be read)
- MQuery.txt: the list of columns to extract
- Mtweet.txt: the JSON data as a text file (fileHandler only supports text files)
- tweet_10.json: the JSON data in a .json file for Spark, since Spark reads JSON files directly
- testCorrectness.scala: compares MISON's DataFrame with Spark's DataFrame
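For illustration only, the contents might look like the following; the exact format is an assumption, so check the actual files in the repository:

```
MData.txt   (one data-file path per line):
    test/Mtweet.txt

MQuery.txt  (one column name per line):
    id
    text
```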
- In the terminal, run the following from the spark-2.1.1 directory. (More detailed instructions are available on Spark's website.)
$ export SPARK_PREPEND_CLASSES=true
$ build/sbt compile
- In the directory spark-2.1.1/bin, start the Spark shell:
$ ./spark-shell
- When the Spark shell prompt appears, type the following to run the test (a sketch of such a comparison follows).
:load test/testCorrectness.scala
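For reference, a minimal sketch of this kind of DataFrame comparison, assuming `misonDF` is the DataFrame produced by the MISON integration (the variable name is hypothetical, and the actual testCorrectness.scala may differ):

```scala
// Inside spark-shell, where `spark` (a SparkSession) is predefined.
// `misonDF` stands in for the DataFrame built by the MISON integration.
val sparkDF = spark.read.json("test/tweet_10.json").select("id", "text")

// The two DataFrames agree if neither side has rows the other lacks.
val onlyInSpark = sparkDF.except(misonDF).count()
val onlyInMison = misonDF.except(sparkDF).count()
println(s"mismatched rows: ${onlyInSpark + onlyInMison}")
```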
- Spark handles records that are missing a queried column (it still creates the record and fills the missing columns with null), while this parser does not (it creates no record for them).
- Spark uses its own JSON parsing system, while this parser rolls its own, which has some rough edges (e.g., it does not handle Unicode characters, so a string-manipulation function fixString was implemented in SparkContext; a sketch of this kind of fix-up follows). Reusing Spark's built-in classes/functions might also improve performance.
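As an illustration of the kind of string fix-up involved (a guess at the issue, not the actual fixString implementation), a minimal sketch that decodes \uXXXX escape sequences left in a raw string:

```scala
// Sketch only: decode \uXXXX escapes in a raw JSON string.
// The real fixString in SparkContext.scala may work differently.
def decodeUnicodeEscapes(s: String): String = {
  val out = new StringBuilder(s.length)
  var i = 0
  while (i < s.length) {
    if (s.charAt(i) == '\\' && i + 5 < s.length && s.charAt(i + 1) == 'u') {
      // Four hex digits follow "\u"; convert them to the character.
      out.append(Integer.parseInt(s.substring(i + 2, i + 6), 16).toChar)
      i += 6
    } else {
      out.append(s.charAt(i))
      i += 1
    }
  }
  out.toString
}
```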
- The parser's performance could be improved substantially with parallel computing and the speculative parsing described in the Mison paper (see the sketch below).
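For example, since each record's bitmaps are independent, they could be built in parallel. A minimal sketch using Scala's parallel collections and the characterBitmap helper from the bitmap sketch above (both are illustrations, not code in this repository):

```scala
// Each record's bitmaps are independent, so build them in parallel.
val records = Array("""{"id": 1}""", """{"id": 2}""", """{"id": 3}""")
val colonBitmaps = records.par.map(r => BitmapSketch.characterBitmap(r, ':')).toArray
```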