{"paragraphs":[{"text":"%md\n\nFrom JSON to Parquet using Spark SQL\n---\nby Joel Wilsson\n\nSee [this blog post](https://wjoel.com/posts/from-json-to-parquet-using-spark-sql.html) for more information. To run this Zeppelin notebook you need to have interpreters for Markdown (\"%md\") and Spark (\"%spark\" and \"%sql\") installed. If they are not already installed, run this:\n\n```shell\ncd location/of/your/zeppelin-installation\n./bin/install-interpreter.sh -n md,spark\n```\n\nFirst we need to load the data into a DataFrame. Follow the instructions at the [end of the post](https://wjoel.com/posts/from-json-to-parquet-using-spark-sql.html#getting-the-data). This example was created using the 200 MB file of taxi data in Hadoop SequenceFile format. We'll assume that the files have been placed in `/tmp/taxi-seq/` and convert our data to a DataFrame, which we'll register as a Spark SQL temporary view.","dateUpdated":"2016-12-07T21:28:46+0100","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1481140186531_905628776","id":"20161207-204946_118368665","result":{"code":"SUCCESS","type":"HTML","msg":"<h2>From JSON to Parquet using Spark SQL</h2>\n<p>by Joel Wilsson</p>\n<p>See <a href=\"https://wjoel.com/posts/from-json-to-parquet-using-spark-sql.html\">this blog post</a> for more information. To run this Zeppelin notebook you need to have interpreters for Markdown (&ldquo;%md&rdquo;) and Spark (&ldquo;%spark&rdquo; and &ldquo;%sql&rdquo;) installed. If they are not already installed, run this:</p>\n<pre><code class=\"shell\">cd location/of/your/zeppelin-installation\n./bin/install-interpreter.sh -n md,spark\n</code></pre>\n<p>First we need to load the data into a DataFrame. 
Follow the instructions at the <a href=\"https://wjoel.com/posts/from-json-to-parquet-using-spark-sql.html#getting-the-data\">end of the post</a>. This example was created using the 200 MB file of taxi data in Hadoop SequenceFile format. We'll assume that the files have been placed in <code>/tmp/taxi-seq/</code> and convert our data to a DataFrame, which we'll register as a Spark SQL temporary view.</p>\n"},"dateCreated":"2016-12-07T08:49:46+0100","dateStarted":"2016-12-07T21:28:43+0100","dateFinished":"2016-12-07T21:28:43+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:261","focus":true},{"text":"%spark\nimport org.apache.hadoop.io._\nimport org.apache.spark.storage.StorageLevel\n\nval paths = \"file:///tmp/taxi-seq/*\"\nval seqs = sc.sequenceFile[LongWritable, BytesWritable](paths)\n    .map((record: (LongWritable, BytesWritable)) => new String(record._2.copyBytes(), \"utf-8\"))\n\nval df = sqlContext.read.json(seqs)\n// Uncomment the following line to cache the DataFrame, as discussed in the post.\n//df.persist(StorageLevel.MEMORY_ONLY)\ndf.createOrReplaceTempView(\"trips\")","dateUpdated":"2016-12-07T21:27:21+0100","config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/scala","tableHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1481141418152_-167702472","id":"20161207-211018_1023271063","dateCreated":"2016-12-07T21:10:18+0100","status":"FINISHED","progressUpdateIntervalMs":500,"$$hashKey":"object:263","dateFinished":"2016-12-07T21:18:43+0100","dateStarted":"2016-12-07T21:18:35+0100","result":{"code":"SUCCESS","type":"TEXT","msg":"\nimport org.apache.hadoop.io._\n\nimport org.apache.spark.storage.StorageLevel\n\npaths: String = file:///tmp/taxi-seq/*\n\nseqs: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at map at <console>:42\n\ndf: org.apache.spark.sql.DataFrame = [dropoff_datetime: string, 
dropoff_latitude: string ... 12 more fields]\n"},"focus":true},{"config":{"colWidth":12,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[],"values":[],"groups":[],"scatter":{}},"enabled":true,"editorMode":"ace/mode/markdown","editorHide":true},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1481141947925_-1653824364","id":"20161207-211907_1978042741","dateCreated":"2016-12-07T21:19:07+0100","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:659","text":"%md\nWe can now query the `trips` table using Spark SQL. First we'll count the number of rows, and then we'll calculate the average trip distance by number of passengers. We may need to change the visualization to a bar chart and adjust the groupings for more complicated queries. Zeppelin sometimes chooses to group things in surprising ways, at least as of Zeppelin 0.6.2, and if you get odd exceptions (for example `java.lang.NoSuchMethodException: org.apache.spark.io.LZ4CompressionCodec.<init>(org.apache.spark.SparkConf)`) when running queries, you can just try running them a few more times.","dateUpdated":"2016-12-07T21:27:16+0100","dateFinished":"2016-12-07T21:26:57+0100","dateStarted":"2016-12-07T21:26:57+0100","result":{"code":"SUCCESS","type":"HTML","msg":"<p>We can now query the <code>trips</code> table using Spark SQL. First we'll count the number of rows, and then we'll calculate the average trip distance by number of passengers. We may need to change the visualization to a bar chart and adjust the groupings for more complicated queries. 
Zeppelin sometimes chooses to group things in surprising ways, at least as of Zeppelin 0.6.2, and if you get odd exceptions (for example <code>java.lang.NoSuchMethodException: org.apache.spark.io.LZ4CompressionCodec.&lt;init&gt;(org.apache.spark.SparkConf)</code>) when running queries, you can just try running them a few more times.</p>\n"}},{"text":"%sql\nSELECT COUNT(*) FROM trips","dateUpdated":"2016-12-07T21:27:00+0100","config":{"colWidth":6,"graph":{"mode":"table","height":300,"optionOpen":false,"keys":[{"name":"count(1)","index":0,"aggr":"sum"}],"values":[],"groups":[],"scatter":{"xAxis":{"name":"count(1)","index":0,"aggr":"sum"}}},"enabled":true,"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1481142057931_188138831","id":"20161207-212057_1560767678","dateCreated":"2016-12-07T21:20:57+0100","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:1026","dateFinished":"2016-12-07T21:27:03+0100","dateStarted":"2016-12-07T21:27:00+0100","result":{"code":"SUCCESS","type":"TABLE","msg":"count(1)\n690001\n","comment":"","msgTable":[[{"value":"690001"}]],"columnNames":[{"name":"count(1)","index":0,"aggr":"sum"}],"rows":[["690001"]]}},{"config":{"colWidth":6,"graph":{"mode":"multiBarChart","height":300,"optionOpen":false,"keys":[{"name":"passenger_count","index":0,"aggr":"sum"}],"values":[{"name":"avg(CAST(trip_distance AS DOUBLE))","index":1,"aggr":"sum"}],"groups":[],"scatter":{"xAxis":{"name":"passenger_count","index":0,"aggr":"sum"},"yAxis":{"name":"avg(CAST(trip_distance AS 
DOUBLE))","index":1,"aggr":"sum"}}},"enabled":true,"editorMode":"ace/mode/sql"},"settings":{"params":{},"forms":{}},"jobName":"paragraph_1481141792012_1040962389","id":"20161207-211632_589251178","dateCreated":"2016-12-07T21:16:32+0100","status":"FINISHED","progressUpdateIntervalMs":500,"focus":true,"$$hashKey":"object:560","dateUpdated":"2016-12-07T21:24:23+0100","dateFinished":"2016-12-07T21:21:09+0100","dateStarted":"2016-12-07T21:21:05+0100","result":{"code":"SUCCESS","type":"TABLE","msg":"passenger_count\tavg(CAST(trip_distance AS DOUBLE))\n3\t2.8067702472851033\n1\t2.8390448303241222\n2\t2.857759469541923\n6\t2.874046942405592\n5\t2.8996366080326346\n4\t2.905490663182972\n0\t5.265000000000001\n","comment":"","msgTable":[[{"key":"avg(CAST(trip_distance AS DOUBLE))","value":"3"},{"key":"avg(CAST(trip_distance AS DOUBLE))","value":"2.8067702472851033"}],[{"value":"1"},{"value":"2.8390448303241222"}],[{"value":"2"},{"value":"2.857759469541923"}],[{"value":"6"},{"value":"2.874046942405592"}],[{"value":"5"},{"value":"2.8996366080326346"}],[{"value":"4"},{"value":"2.905490663182972"}],[{"value":"0"},{"value":"5.265000000000001"}]],"columnNames":[{"name":"passenger_count","index":0,"aggr":"sum"},{"name":"avg(CAST(trip_distance AS DOUBLE))","index":1,"aggr":"sum"}],"rows":[["3","2.8067702472851033"],["1","2.8390448303241222"],["2","2.857759469541923"],["6","2.874046942405592"],["5","2.8996366080326346"],["4","2.905490663182972"],["0","5.265000000000001"]]},"text":"%sql\nSELECT passenger_count, avg(trip_distance)\nFROM trips\nWHERE passenger_count < 50\nGROUP BY passenger_count\nORDER BY 2"}],"name":"From JSON to Parquet using Spark SQL","id":"2C5C7QU2P","angularObjects":{"2C5XT2XSF:shared_process":[],"2BZ86DUMH:shared_process":[],"2C1SVR21W:shared_process":[],"2C1FR1GF7:shared_process":[]},"config":{"looknfeel":"default"},"info":{}}