URL: https://www.javacodegeeks.com/2016/08/apache-spark-packages-xml-json.html

Apache Spark Packages, from XML to JSON


The Apache Spark community has put a lot of effort into extending Spark. Recently, we wanted to transform an XML dataset into something that was easier to query. We were mainly interested in doing data exploration on top of the billions of transactions that we get every day. XML is a well-known format, but it can be complicated to work with. In Apache Hive, for instance, we could define the schema of our XML and then query it using SQL.

However, it was hard for us to keep up with changes to the XML structure, so we discarded that option. We were using Spark Streaming to bring these transactions to our cluster, and we were thinking of doing the required transformations within Spark. However, the same problem remained: we would have to change our Spark application every time the XML structure changed.

There must be another way!

There is an Apache Spark package from the community that solves these problems. In this blog post, I’ll walk you through how to use it to read any XML file into a DataFrame.

Let’s load the Spark shell and see an example:

./spark-shell --packages com.databricks:spark-xml_2.10:0.3.3

Here, we just added the XML package to our Spark environment. The same dependency can, of course, be declared when writing a Spark application and packaging it into a jar file.
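
For a packaged application, the same coordinates can go into the build definition. This is a minimal sketch that assumes an sbt build targeting Scala 2.10; the post does not say which build tool was actually used.

```scala
// build.sbt fragment (assumed build tool: sbt)
// Same group, artifact and version as passed to spark-shell above.
libraryDependencies += "com.databricks" % "spark-xml_2.10" % "0.3.3"
```

The package still has to be on the classpath at runtime, for example by bundling it into an assembly jar or by passing the same --packages argument to spark-submit.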

Using the package, we can read any XML file into a DataFrame. When loading the DataFrame, we could specify the schema of our data, but maintaining a schema was our main concern in the first place, so we will let Spark infer it. Schema inference is a very powerful feature: because we no longer need to know the schema in advance, it can change at any time.
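
For contrast, the explicit-schema route that we are avoiding would look roughly like the sketch below. It assumes a spark-shell session where sqlContext is predefined. The four field names are copied from the inferred schema printed later in this post, and picking only these four is purely illustrative.

```scala
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructField, StructType}

// Hypothetical, hand-written schema covering only four attributes.
// Every change in the XML structure would mean editing this definition.
val orderSchema = StructType(Seq(
  StructField("@StoreNumber", LongType, nullable = true),
  StructField("@OrderNumber", LongType, nullable = true),
  StructField("@BusinessDate", StringType, nullable = true),
  StructField("@TaxAmount", DoubleType, nullable = true)
))

// Same reader as in the inferred case, plus an explicit .schema(...) call.
val typedDf = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "OrderSale")
  .schema(orderSchema)
  .load("~/transactions_xml_folder/")
```

The DataFrame then exposes exactly the listed columns, which is predictable but has to be maintained by hand.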

Let’s see how we load our XML files into a DataFrame:

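// Read every XML file in the folder; each <OrderSale> element becomes one row.
// No schema is supplied, so the package infers it from the data.
// Note: '~' may not be expanded by Spark; use an absolute path if the load fails.
val df = sqlContext
 .read
 .format("com.databricks.spark.xml")
 .option("rowTag", "OrderSale")
 .load("~/transactions_xml_folder/")

// Print the inferred schema.
df.printSchema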

Printing the DataFrame schema gives us an idea of what the inference system has done.

root
 |-- @ApplicationVersion: string (nullable = true)
 |-- @BusinessDate: string (nullable = true)
 |-- @Change: double (nullable = true)
 |-- @EmployeeId: long (nullable = true)
 |-- @EmployeeName: string (nullable = true)
 |-- @EmployeeUserId: long (nullable = true)
 |-- @MealLocation: long (nullable = true)
 |-- @MessageId: string (nullable = true)
 |-- @OrderNumber: long (nullable = true)
 |-- @OrderSourceTypeId: long (nullable = true)
 |-- @PosId: long (nullable = true)
 |-- @RestaurantType: long (nullable = true)
 |-- @SatelliteNumber: long (nullable = true)
 |-- @SpmHostOrderCode: string (nullable = true)
 |-- @StoreNumber: long (nullable = true)
 |-- @TaxAmount: double (nullable = true)
 |-- @TaxExempt: boolean (nullable = true)
 |-- @TaxInclusiveAmount: double (nullable = true)
 |-- @TerminalNumber: long (nullable = true)
 |-- @TimeZoneName: string (nullable = true)
 |-- @TransactionDate: string (nullable = true)
 |-- @TransactionId: long (nullable = true)
 |-- @UTCOffSetMinutes: long (nullable = true)
 |-- @Version: double (nullable = true)
 |-- Items: struct (nullable = true)
 | |-- MenuItem: struct (nullable = true)
 | | |-- #VALUE: string (nullable = true)
 | | |-- @AdjustedPrice: double (nullable = true)
 | | |-- @CategoryDescription: string (nullable = true)
 | | |-- @DepartmentDescription: string (nullable = true)
 | | |-- @Description: string (nullable = true)
 | | |-- @DiscountAmount: double (nullable = true)
 | | |-- @Id: long (nullable = true)
 | | |-- @PLU: long (nullable = true)
 | | |-- @PointsRedeemed: long (nullable = true)
 | | |-- @Price: double (nullable = true)
 | | |-- @PriceLessIncTax: double (nullable = true)
 | | |-- @PriceOverride: boolean (nullable = true)
 | | |-- @ProductivityUnitQuantity: double (nullable = true)
 | | |-- @Quantity: long (nullable = true)
 | | |-- @TaxAmount: double (nullable = true)
 | | |-- @TaxInclusiveAmount: double (nullable = true)
 |-- OrderTaxes: struct (nullable = true)
 | |-- TaxByImposition: struct (nullable = true)
 | | |-- #VALUE: string (nullable = true)
 | | |-- @Amount: double (nullable = true)
 | | |-- @ImpositionId: long (nullable = true)
 | | |-- @ImpositionName: string (nullable = true)
 |-- Payments: struct (nullable = true)
 | |-- Payment: struct (nullable = true)
 | | |-- #VALUE: string (nullable = true)
 | | |-- @AccountIDLast4: string (nullable = true)
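
Given the inferred schema above, attributes (prefixed with @) and nested elements can be selected by their dotted path. The following is a small sketch that assumes the same spark-shell session, where df already exists. The output column names are made up for the example.

```scala
// Assumes the spark-shell session above, where `df` already exists.
// The aliases (storeNumber, orderNumber, ...) are hypothetical names.
val items = df.select(
  df("@StoreNumber").as("storeNumber"),
  df("@OrderNumber").as("orderNumber"),
  // Nested struct fields are addressed with a dotted path.
  df("Items.MenuItem.@Description").as("itemDescription"),
  df("Items.MenuItem.@Price").as("itemPrice"))

items.printSchema()
items.show(5)
```

Aliasing the columns drops the @ prefix, which tends to make later SQL queries easier to write.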

At this point, we could use any SQL tool to query our XML using Spark SQL. Please read this post (Apache Spark as a Distributed SQL Engine) to learn more about Spark SQL. Going a step further, we could write the DataFrame out as JSON and use tools that can read data in JSON format. Having JSON datasets is especially useful if you have something like Apache Drill.
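
Both steps can be sketched in a few lines. The snippet assumes the same spark-shell session (sqlContext and df already defined). The table name, the query and the output folder are illustrative choices, not taken from the original setup.

```scala
// Assumes the spark-shell session above, where `sqlContext` and `df` already exist.
// The table name "orders" and the output folder are hypothetical.
df.registerTempTable("orders")

// Attribute columns start with '@', so they are quoted with backticks in SQL.
val taxByStore = sqlContext.sql(
  """SELECT `@StoreNumber` AS storeNumber, SUM(`@TaxAmount`) AS totalTax
    |FROM orders
    |GROUP BY `@StoreNumber`""".stripMargin)
taxByStore.show()

// Write the whole DataFrame as JSON, one JSON object per line.
val jsonFolder = "/tmp/transactions_json_folder"
df.write.mode("overwrite").json(jsonFolder)
```

The output folder holds part files with one JSON document per line, which is the kind of dataset that JSON-aware tools such as Apache Drill can be pointed at.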

If you have any questions about using this Apache Spark package to read XML files into a DataFrame, please ask them in the comments section below.

Reference: Apache Spark Packages, from XML to JSON, from our JCG partner Chase Hooley at the MapR blog.

Chase Hooley
August 25th, 2016 (Last Updated: August 24th, 2016)