Env:
Spark 1.3.1Symptom:
Spark fails to parse a json object with multiple lines.This issue can happen when either creating a DataFrame using:
val people = sqlContext.jsonFile(path)or creating a temp table using Spark SQL:
CREATE TEMPORARY TABLE jsonTable2 USING org.apache.spark.sql.json OPTIONS ( path "/xxx/test2.json" );Sample error message is:
java.lang.RuntimeException: Failed to parse record "array" : [ {. Please make sure that each line of the file (or each string in the RDD) is a valid JSON object or an array of JSON objects.
Root Cause:
As mentioned in Spark Documentation:Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.Solution:
Convert the json object from multiple lines to a single line.For example, convert below json object:
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, { "count" : "site2", "sitename" : "sitename2" } ] }to below:
{ "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }After that, Spark can successfully parse it. For example:
spark-sql> CREATE TEMPORARY TABLE jsonTable > USING org.apache.spark.sql.json > OPTIONS ( > path "/xxx/test_spark.json" > ) > ; Time taken: 3.738 seconds spark-sql> select * from jsonTable ; [{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}] Time taken: 1.4 seconds, Fetched 1 row(s) spark-sql> select array[0].count,array[0].sitename from jsonTable; site1 sitename1 Time taken: 0.184 seconds, Fetched 1 row(s)
You can also put multiple single-line json objects into one file.
For example:
# cat test_spark_multiple.json { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] } { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] } { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] } { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] } { "array" : [ { "count" : "site1", "sitename" : "sitename1" }, {"count" : "site2", "sitename" : "sitename2" } ] }The query result would be:
spark-sql> CREATE TEMPORARY TABLE jsonTable3 > USING org.apache.spark.sql.json > OPTIONS ( > path "/xxx/test_spark_multiple.json" > ); Time taken: 0.234 seconds spark-sql> select * from jsonTable3 ; [{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}] [{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}] [{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}] [{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}] [{"count":"site1","sitename":"sitename1"},{"count":"site2","sitename":"sitename2"}] Time taken: 0.153 seconds, Fetched 5 row(s)
can you please tell me how we can convert the multi line json object in to single line json object
ReplyDeleteThis is just a workaround, if I receive a Json, I cannot stay everytime to convert it into a single line json...
ReplyDelete