{"paragraphs":[{"text":"%md\n\nFrom JSON to Parquet using Spark SQL\n---\nby Joel Wilsson\n\nSee [this blog post](https://wjoel.com/posts/from-json-to-parquet-using-spark-sql.html) for more information. To run this Zeppelin notebook you need to have interpreters for Markdown (\"%md\") and Spark (\"%spark\" and \"%sql\") installed. If they are not already installed, run this:\n\n```shell\ncd location/of/your/zeppelin-installation\n./bin/install-interpreter.sh -n md,spark\n```\n\nFirst we need to load the data into a DataFrame. Follow the instructions at the [end of the post](https://wjoel.com/posts/from-json-to-parquet-using-spark-sql.html#getting-the-data). The 200 MB file with taxi data in HDFS sequence file format was used when creating this example. We'll assume that the files have been placed in `/tmp/taxi-seq/` and convert our data to a DataFrame, which we'll register as a Spark SQL table.","dateUpdated":"2016-12-07T21:28:46+0100","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1481140186531_905628776","id":"20161207-204946_118368665","result":{"code":"SUCCESS","type":"HTML","msg":"
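The loading step described above could look roughly like the following `%spark` paragraph. This is only a sketch under stated assumptions, not necessarily the code used in the blog post: it assumes that every value in the sequence files is one JSON document stored as Hadoop `Text`, that the keys are also `Text` and can be ignored, and that the `sc` and `sqlContext` variables provided by Zeppelin's Spark interpreter are available.

```scala
import org.apache.hadoop.io.Text

// Assumption: keys and values are both Hadoop Text, and each value holds
// one JSON document. Adjust the classes if your files were written differently.
val jsonLines = sc
  .sequenceFile("/tmp/taxi-seq/", classOf[Text], classOf[Text])
  // Hadoop reuses Writable instances, so copy each value to a String right away.
  .map { case (_, value) => value.toString }

// Let Spark SQL infer the schema from the JSON documents.
val trips = sqlContext.read.json(jsonLines)

// Register the DataFrame so that %sql paragraphs can refer to it as "trips".
trips.registerTempTable("trips")
```

`registerTempTable` is deprecated on Spark 2.x in favor of `createOrReplaceTempView`, but it is still available there, so the sketch should behave the same on either version.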
We can now query the `trips` table using Spark SQL. First we'll count the number of rows, and then we'll calculate the average trip distance by number of passengers. For more complicated queries we may need to switch the visualization to a bar chart and adjust the groupings, because Zeppelin sometimes groups things in surprising ways, at least as of Zeppelin 0.6.2. If a query fails with an odd exception (for example `java.lang.NoSuchMethodException: org.apache.spark.io.LZ4CompressionCodec.<init>(org.apache.spark.SparkConf)`), running it again a few times usually gets it through.
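The two queries mentioned above might be written as follows. This is a hedged sketch: the column names `passenger_count` and `trip_distance` are assumptions based on the usual naming in taxi trip data, so check them against `trips.printSchema()` before running. It also assumes the `trips` table has been registered and that Zeppelin's `sqlContext` is available.

```scala
// Count the number of rows in the registered table.
val rowCount = sqlContext.sql("SELECT COUNT(*) AS row_count FROM trips")
rowCount.show()

// Average trip distance for each passenger count.
// Assumption: the columns are named passenger_count and trip_distance.
val avgDistance = sqlContext.sql("""
  SELECT passenger_count, AVG(trip_distance) AS avg_trip_distance
  FROM trips
  GROUP BY passenger_count
  ORDER BY passenger_count
""")
avgDistance.show()
```

The same SQL statements can also be placed directly in a `%sql` paragraph, which is the form that gives access to Zeppelin's built-in table and chart visualizations.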