Wednesday, 14 November 2012

A Practical Solution for XML-to-Relational Mapping

A challenge in the field of integration is persisting to a relational database the contents of an XML document [1]. At first glance, this might seem like a no-brainer. Create a database table where each column maps to an element defined in the XML document's XSD. Then write an INSERT SQL statement that iterates through each element of the XML document. 

But what if an XML element has attributes? What about elements containing other elements? How will we deal with elements that may occur multiple times within an element? What if element order is important? And if the XML schema is recursive? Yikes! As other people have pointed out, XML-to-relational mapping is not a trivial problem and the solution mentioned above will only work for the simplest of XSDs.

In addition to a complex XML schema, we have another issue to consider: the size of the XSD. We could be tempted to write the mappings (i.e., the SQL) for an XSD that consists of a small no. of element definitions. On the other hand, writing the mappings for a 50 KB XSD is suicidal.

What we need is a library that hides from us the intricacies of persisting XML documents. In the Java space, attempts have been made at producing an XML-to-relational mapper. For instance, Hibernate 3 has a little known experimental feature were EntityMode.DOM4J is used to persist and retrieve XML, much in the same way you would do with POJOs. Unfortunately, this feature was considered to have too many "holes" and removed from Hibernate 4.

An alternative to Hibernate 3's EntityMode.DOM4J feature is Shrex: an open source prototype mapper originating from a paper's thesis. I think Shrex has potential. With a few code changes to the prototype, I generated a database schema from a non-trivial XSD and "shredded" an XML document into relations. Having said that, I don't think Shrex is a practical solution. First off, Shrex's code base is un-maintained. To make things worse, it's devoid of community support. Therefore, taking the plunge with Shrex means that you'll be the one maintaining it. Secondly, Shrex was developed in an academic environment and, to my knowledge, never used in the wild. You'll probably have to make significant changes to the code base so that Shrex can meet your mapping requirements.

Till now, I've presented software that do the XML-relational conversion in one step. Yet, the lack of support makes these mappers impractical for a project that must meet a tight deadline. The Java EE  platform provides well supported standards for: 
  • Binding XML documents to objects called JAXB, and 
  • Persisting objects into a relational store called JPA

Leveraging JAXB and JPA, we could use objects as a go-between for persisting XML documents [2]. However, JAXB and JPA require Java classes for holding the XML data. Moreover, these classes must be annotated with mapping instructions. Writing the classes and annotations is undoable for a large or complex XSD. But generating them is doable and we can do it using an awesome XJC extension developed by Aleksei Valikov: Hyperjaxb3

Given an XSD, Hyperjaxb3 generates POJO classes annotated with JAXB and JPA. Let me demonstrate how to use Hyperjaxb3 by walking you through a Maven project I've developed and made available on GitHub.

The following is the XSD of the XML documents I  want to persist:

If you're careful, you'll note that the complex type envelopeType has an element and attribute with identical names: "from". Running Hyperjaxb3 on this schema will generate an EnvelopeType class containing two getters and setters with the same method names; illegal under the rules of Java. To prevent a name collision, I wrote a JAXB binding file:

In the binding file, aside from the name customisation, I instruct Hyperjaxb3 to generate classes:
  1. that follow the JPA spec, and
  2. have the package name org.opensourcesoftwareandme.
In the project's POM, I configure Maven to execute the Hyperjaxb3 plugin during the build:

The plugin, by default, scans "src/main/resources" for any XSDs and generates classes from them.

In addition to the plugin, I declare the dependencies for the project:

The dependencies to note are:
  1. hyperjaxb3-ejb-runtime: provides utility methods for (de)serialization, retrieving the persisted ID, etc... 
  2. hibernate-entitymanager: the JPA provider which in this case is Hibernate
  3. derby: the database used for testing
To test the mapping, I wrote a JUnit test case:

In a few words, the test case:
  1. Creates a map and populates it with configuration settings that tell Hibernate the database to use for persisting the XML document and how it should be set up. 

  2. Initialises Hibernate with the configuration settings

  3. Uses Hyperjaxb3's utility classes, JAXBContextUtils  and JAXBElementUtils, to deserialize the XML test document mails.xml:

  4. Tells Hibernate to persist the object

  5. Retrieves the persisted object and asserts that the data content is the same as in the XML doc.
You can run Hyperjaxb3 by entering "mvn test" from your console, which will run the test as well. The generated classes are found in the target folder.

I'm very interested in seeing other alternatives for persisting XML documents to a relational database. I'd also love to know if persisting XML documents with Hyperjaxb3 would work for your project, and if not, why.

1: The wisdom of employing a relational database instead of a document store is debatable.
2: Note that this isn't a general solution for persisting XML documents.

No comments:

Post a Comment