How to write an XSD schema

The following guide illustrates how to write an XSD (XML Schema Definition) schema file for xmlfy.

Planning the XSD schema

It is good practice to write an accurate and well-formed XSD schema because it may be useful to programs other than xmlfy in the future.

You can also build a library of schemata for the variety of data that you may want to xmlfy and store them in a common directory.

Always start with a simple XSD schema and build up its complexity gradually. Shoe-horning foreign raw data into XML format using only an XSD schema file can be quite a feat, particularly if you want to use complex and nested element structures.

NOTE: XSD keywords are mixed case (for example complexType and minOccurs), and xmlfy expects them to be written that way.

Analysing the input data

The key to writing a good schema file is to understand the data that it is trying to describe.

For example, let's look at the output of the ls -la command from a Cygwin shell.

% ls -la
total 73
drwx------+  3 ag None     0 Apr 20 19:36 .
-rwxr-xr-x   1 ag None 15639 Apr 20 19:31 a.exe
-rwx------+  1 ag None  6354 Apr 20 19:31 xmlfy.c
-rwx------+  1 ag None  4901 Apr 19  2008 xmlfy.h

You can see that five lines of data were returned with two different structures of data being presented (line 1 = summary total, lines 2-5 = file details).

Let's call the "summary total" structure the total record and the "file details" structure the file records.
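
As a rough illustration of this analysis step, the following Python sketch splits each sample line on whitespace and labels it as a total record or a file record. It is not part of xmlfy; the classify helper and its rule are assumptions made purely for this example.

```python
# Sample input copied from the ls -la listing above.
SAMPLE = """\
total 73
drwx------+  3 ag None     0 Apr 20 19:36 .
-rwxr-xr-x   1 ag None 15639 Apr 20 19:31 a.exe
-rwx------+  1 ag None  6354 Apr 20 19:31 xmlfy.c
-rwx------+  1 ag None  4901 Apr 19  2008 xmlfy.h
"""


def classify(line):
    """Return a (label, fields) pair for one line of ls -la output."""
    fields = line.split()
    # Hypothetical rule for this sample only: a two-field line that
    # starts with "total" is the total record, anything else is a file.
    if len(fields) == 2 and fields[0] == "total":
        return "total", fields
    return "file", fields


records = [classify(line) for line in SAMPLE.splitlines() if line.strip()]
for label, fields in records:
    print(label, len(fields), fields)
```

Counting the fields per record type in this way gives a first idea of how many elements each record definition in the schema will need.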

Defining the root element

First define the <schema> element in your schema file using XSD syntax. Then define the element that will be used as the root element in your XML document.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
              targetNamespace="http://www.yourtargetnamespace.com"
              xmlns="http://www.yourtargetnamespace.com"
              elementFormDefault="qualified">

  <xs:element name="ls" type="lsType" />





</xs:schema>

NOTE: The W3C XML Schema 1.0 specification does not single out a root element (any globally declared element may act as the document root), so xmlfy simply treats the first element defined in the schema file as the root element.
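
To see which element a "first defined element" rule would pick for a given schema file, a small Python sketch such as the following can help. It only mirrors the rule described in the note and is not xmlfy's own code; the first_global_element helper is a name invented for this example.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.yourtargetnamespace.com"
           xmlns="http://www.yourtargetnamespace.com"
           elementFormDefault="qualified">
  <xs:element name="ls" type="lsType" />
</xs:schema>
"""


def first_global_element(schema_text):
    """Return (name, type) of the first xs:element directly under xs:schema."""
    root = ET.fromstring(schema_text)
    first = root.find(XS + "element")  # find() returns the first direct child that matches
    if first is None:
        raise ValueError("schema declares no global element")
    return first.get("name"), first.get("type")


print(first_global_element(SCHEMA))
```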

Defining the record elements

Under certain circumstances the total record may not appear, e.g. ls -la just-one-file. This means the total record has a none-or-one relationship with the output of the ls -la command.

The file record may appear many times or not at all, depending on the number of files returned by the ls command. This means the file record has a none-or-many relationship with the output of the ls -la command.

We can now define a record structure type for the "ls" command as follows.

  <xs:complexType name="lsType">
    <xs:sequence>
      <xs:element name="total" type="totalType" minOccurs="0" />
      <xs:element name="file" type="fileType" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>

This says that the record structure called lsType consists of two elements: total, which occurs none or one times (the minOccurs="0" attribute), and file, which occurs none or many times (the minOccurs="0" and maxOccurs="unbounded" attributes together). In XSD an omitted minOccurs or maxOccurs defaults to 1.
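
The occurrence rules can be checked mechanically. The following Python sketch reads the lsType definition and reports the (minimum, maximum) occurrence of each element, applying the XSD default of 1 when an attribute is omitted. The occurrence_ranges helper is an assumption made for this example and says nothing about how xmlfy stores these values internally.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

LS_TYPE = """\
<xs:complexType xmlns:xs="http://www.w3.org/2001/XMLSchema" name="lsType">
  <xs:sequence>
    <xs:element name="total" type="totalType" minOccurs="0" />
    <xs:element name="file" type="fileType" minOccurs="0" maxOccurs="unbounded" />
  </xs:sequence>
</xs:complexType>
"""


def occurrence_ranges(type_text):
    """Map each element name to (min, max); max is None when unbounded."""
    ranges = {}
    for elem in ET.fromstring(type_text).iter(XS + "element"):
        low = int(elem.get("minOccurs", "1"))   # XSD default is 1
        high = elem.get("maxOccurs", "1")       # XSD default is 1
        ranges[elem.get("name")] = (low, None if high == "unbounded" else int(high))
    return ranges


print(occurrence_ranges(LS_TYPE))
```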

The total record always has exactly two fields, prompt and totalsize.

We can now define a record structure type for the "summary total" structure as follows.

  <xs:complexType name="totalType">
    <xs:sequence>
      <xs:element name="prompt" type="xs:string" />
      <xs:element name="totalsize" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

This says that the record structure called totalType consists of two elements, prompt and totalsize, each occurring exactly once.

The file record can have a variable number of fields, up to a maximum of nine, with only one of those fields, fname, being mandatory.

We can now define a record structure type for the "file details" structure as follows.

  <xs:complexType name="fileType">
    <xs:sequence>
      <xs:element name="permission" type="xs:string" minOccurs="0" />
      <xs:element name="blocks" type="xs:string" minOccurs="0" />
      <xs:element name="user" type="xs:string" minOccurs="0" />
      <xs:element name="group" type="xs:string" minOccurs="0" />
      <xs:element name="size" type="xs:string" minOccurs="0" />
      <xs:element name="date_M" type="xs:string" minOccurs="0" />
      <xs:element name="date_d" type="xs:string" minOccurs="0" />
      <xs:element name="date_ty" type="datetyType" minOccurs="0" />
      <xs:element name="fname" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

This says that the record structure called fileType consists of eight optional elements, each occurring none or one times (the minOccurs="0" attribute), followed by the mandatory element fname, which occurs exactly once.

The date_ty record holds either a time (hours and minutes) or a year, but never both.

We can now define this record structure type as follows.

  <xs:complexType name="datetyType">
    <xs:choice>
      <xs:element name="date_y" type="xs:string" />
      <xs:sequence>
        <xs:element name="date_h" type="xs:string" />
        <xs:element name="date_m" type="xs:string" />
      </xs:sequence>
    </xs:choice>
  </xs:complexType>
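
The choice can be pictured with a small Python sketch that decides which branch a raw value belongs to. The date_ty_fields helper and its "contains a colon" test are assumptions made for illustration; the source does not describe how xmlfy itself selects a branch.

```python
def date_ty_fields(value):
    """Return (element name, text) pairs for the date_ty choice."""
    if ":" in value:
        # hours:minutes branch, matching the nested sequence in the schema
        hours, minutes = value.split(":", 1)
        return [("date_h", hours), ("date_m", minutes)]
    # otherwise treat the value as a year
    return [("date_y", value)]


print(date_ty_fields("19:36"))
print(date_ty_fields("2008"))
```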

Defining the field elements

The field elements are explicitly defined inside the record definitions so no further definitions are required.

The complete XSD schema

The complete XSD schema file for the Cygwin ls -la command looks like this:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" 
              targetNamespace="http://www.yourtargetnamespace.com"
              xmlns="http://www.yourtargetnamespace.com"
              elementFormDefault="qualified">

  <xs:element name="ls" type="lsType" />

  <xs:complexType name="lsType">
    <xs:sequence>
      <xs:element name="total" type="totalType" minOccurs="0" />
      <xs:element name="file" type="fileType"  minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="totalType">
    <xs:sequence>
      <xs:element name="prompt" type="xs:string" />
      <xs:element name="totalsize" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="fileType">
    <xs:sequence>
      <xs:element name="permission" type="xs:string" minOccurs="0" />
      <xs:element name="blocks" type="xs:string" minOccurs="0" />
      <xs:element name="user" type="xs:string" minOccurs="0" />
      <xs:element name="group" type="xs:string" minOccurs="0" />
      <xs:element name="size" type="xs:string" minOccurs="0" />
      <xs:element name="date_M" type="xs:string" minOccurs="0" />
      <xs:element name="date_d" type="xs:string" minOccurs="0" />
      <xs:element name="date_ty" type="datetyType" minOccurs="0" />
      <xs:element name="fname" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="datetyType">
    <xs:choice>
      <xs:element name="date_y" type="xs:string" />
      <xs:sequence>
        <xs:element name="date_h" type="xs:string" />
        <xs:element name="date_m" type="xs:string" />
      </xs:sequence>
    </xs:choice>
  </xs:complexType>

</xs:schema>

You can now save this to a text file called ls.xsd. Better still, you can save this file to your XSD schema library, e.g. /usr/share/schemata/cygwin/ls.xsd

Output

If you run the following command:

% ls -la | xmlfy.exe --schema /usr/share/schemata/cygwin/ls.xsd -F3 :

You will get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<ls xmlns="http://www.yourtargetnamespace.com"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.yourtargetnamespace.com ls.xsd">
  <total>
    <prompt>total</prompt>
    <totalsize>73</totalsize>
  </total>
  <file>
    <permission>drwx------+</permission>
    <blocks>3</blocks>
    <user>ag</user>
    <group>None</group>
    <size>0</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>36</date_m>
    </date_ty>
    <fname>.</fname>
  </file>
  <file>
    <permission>-rwxr-xr-x</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>15639</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>31</date_m>
    </date_ty>
    <fname>a.exe</fname>
  </file>
  <file>
    <permission>-rwx------+</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>6354</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>31</date_m>
    </date_ty>
    <fname>xmlfy.c</fname>
  </file>
  <file>
    <permission>-rwx------+</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>4901</size>
    <date_M>Apr</date_M>
    <date_d>19</date_d>
    <date_ty>
      <date_y>2008</date_y>
    </date_ty>
    <fname>xmlfy.h</fname>
  </file>
</ls>
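
Because the schema declares a target namespace with elementFormDefault="qualified", every element in this output belongs to that namespace, and any program that reads the document has to use it when looking elements up. The following Python sketch shows one way to do this with the standard library. The OUTPUT string is an abbreviated copy of the listing above, and the ls prefix is an arbitrary choice made for this example.

```python
import xml.etree.ElementTree as ET

# Map a prefix of our choosing to the target namespace used by the schema.
NS = {"ls": "http://www.yourtargetnamespace.com"}

# Abbreviated copy of the xmlfy output shown above.
OUTPUT = """\
<ls xmlns="http://www.yourtargetnamespace.com">
  <total><prompt>total</prompt><totalsize>73</totalsize></total>
  <file><size>15639</size><fname>a.exe</fname></file>
  <file><size>4901</size><fname>xmlfy.h</fname></file>
</ls>
"""

root = ET.fromstring(OUTPUT)
# Every lookup must be namespace-qualified, otherwise nothing matches.
names = [f.findtext("ls:fname", namespaces=NS) for f in root.findall("ls:file", NS)]
total = root.findtext("ls:total/ls:totalsize", namespaces=NS)
print(names, total)
```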

A word on capturing data when using a schema file

Shoe-horning raw data into a structure defined by a schema is straightforward when the input fields have a one-to-one relationship with the fields of the schema elements. However, if wildcard tokens and/or Boolean logic are employed in the schema, it becomes quite a challenge, and sometimes impossible, to determine which input field belongs to which schema field. Strictly speaking, the main function of a schema is to check that an XML document is valid, which requires the document to already exist. xmlfy does the reverse: it builds an XML document on the fly while following the rules described by the schema. This is still acceptable, and the resulting XML can be considered both valid and well-formed.

xmlfy employs two techniques to help with this shoe-horning problem. The first is recognising multiple element definitions that have the same name. This lets you capture your schema elements under a variety of input circumstances without having to create a unique element for each circumstance, although you can still do that if you want. The second is a set of auto-generated field match constraint helpers that assist in matching the input fields to the elements described by the schema. These helpers improve the speed of xmlfy, particularly when compound element structures and wildcard tokens are used in the schema hierarchy. After the schema file is loaded into memory, an array of helpers is generated for each element. The array describes all combinations of schema tree traversal paths that can be taken, and associates each combination with the minimum, maximum and last number of fields required for a match against the number of available input fields.

By default xmlfy iterates continuously through just the record elements of the root element, looking for element helpers that can fully satisfy the requirements of that element's schema tree hierarchy for the given input fields. The matching record element is then checked against its wildcard obligations in the root element definition and, if these are met, is printed.
In match direct mode xmlfy only looks at the element helpers of the targeted element. If that element can fully satisfy the requirements of its schema tree hierarchy for the given input fields, it is printed in its entirety, once only, as the root element.
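
The idea of associating a schema branch with a minimum and maximum number of fields can be sketched in Python as follows. This is only an illustration of the min/max calculation: the tuple-based tree, the leaf_range function and the TYPES table are assumptions made for this example and do not reflect xmlfy's actual helper structures.

```python
INF = float("inf")  # stands in for maxOccurs="unbounded"


def leaf_range(node, types):
    """Return the (min, max) number of leaf fields a schema node can consume."""
    kind = node[0]
    if kind == "seq":      # a sequence needs all of its children
        pairs = [leaf_range(child, types) for child in node[1]]
        return sum(p[0] for p in pairs), sum(p[1] for p in pairs)
    if kind == "choice":   # a choice takes exactly one of its children
        pairs = [leaf_range(child, types) for child in node[1]]
        return min(p[0] for p in pairs), max(p[1] for p in pairs)
    # kind == "elem": (kind, name, type name, minOccurs, maxOccurs)
    _, _name, type_name, low, high = node
    inner = leaf_range(types[type_name], types) if type_name in types else (1, 1)
    upper = 0 if inner[1] == 0 or high == 0 else inner[1] * high
    return inner[0] * low, upper


OPTIONAL = ["permission", "blocks", "user", "group", "size", "date_M", "date_d"]
TYPES = {
    "datetyType": ("choice", [
        ("elem", "date_y", "xs:string", 1, 1),
        ("seq", [("elem", "date_h", "xs:string", 1, 1),
                 ("elem", "date_m", "xs:string", 1, 1)]),
    ]),
    "fileType": ("seq",
                 [("elem", name, "xs:string", 0, 1) for name in OPTIONAL]
                 + [("elem", "date_ty", "datetyType", 0, 1),
                    ("elem", "fname", "xs:string", 1, 1)]),
}

print(leaf_range(TYPES["fileType"], TYPES))
print(leaf_range(("elem", "file", "fileType", 0, INF), TYPES))
```

For fileType the sketch reports a minimum of 1 leaf (fname alone) and a maximum of 10. The maximum is 10 rather than the nine columns of the raw listing because date_h and date_m are separate leaves in the schema.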

Important note

Currently the xmlfy XSD schema file parser is not that sophisticated and exhibits the following limitations:

Only recognises the <schema>, <element>, <complexType>, <ref>, <sequence>, and <choice> directives and ignores all others.
The recognised directives are not fully implemented and their use should be kept straightforward.
The first valid <element> definition becomes the root element.
The fields of the root element define all the level 1 elements (let's call the fields that have their own branch structure record elements).
The fields of the record elements simply represent other elements and unlimited element nesting is allowed.
By default, fields of the root element that are not record elements are ignored. Use the match direct option to match targeted elements in their entirety.
Element types that are not of a matchable complexType are treated as "xs:string" regardless of what type is specified.
Only honours the minOccurs="0", maxOccurs="0" and maxOccurs="unbounded" wildcard attributes.
At this stage does not honour group sets but does provide limited support for choices.
The field names that are specified in the element definitions are read from left to right and matched against a field number calculation on the input fields, and then matched again on any wildcard attributes.
You can wildcard many fields but you should think clearly about what you are trying to achieve and whether it is at all possible.
For example, the following XSD schema is perfectly suitable for checking that XML is valid. However, xmlfy cannot reliably shoe-horn input data into schema elements a, b and c, because more than one field has wildcard attributes that match none or many input fields.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="parent" type="parentType" />
  <xs:complexType name="parentType">
    <xs:sequence>
      <xs:element name="record" type="recordType" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="recordType">
    <xs:sequence>
      <xs:element name="a" type="xs:string" minOccurs="0" maxOccurs="unbounded" />
      <xs:element name="b" type="xs:string" />
      <xs:element name="c" type="xs:string" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>

In the above example xmlfy will allocate ALL input fields to element <a>, which may not be what you intended.
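
The ambiguity can be made concrete with a short Python sketch that lists every way of dividing a row of input fields among a (none or many), b (exactly one) and c (none or many). The possible_splits helper is hypothetical and only enumerates what the schema permits; it does not reproduce xmlfy's matching logic.

```python
def possible_splits(fields):
    """List every (a, b, c) assignment allowed by a*, b, c* for the given fields."""
    splits = []
    for i in range(len(fields)):   # i = number of leading fields given to <a>
        a, b, c = fields[:i], fields[i], fields[i + 1:]
        splits.append((a, b, c))
    return splits


for a, b, c in possible_splits(["x", "y", "z"]):
    print("a =", a, "b =", b, "c =", c)
```

Each of the listed assignments is valid against the schema, and nothing in the schema says which one is intended. That is why a schema of this shape cannot be used to place input fields deterministically.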

Don't worry if you find some of the above hard to digest. It will become clearer as you get more familiar with writing schemata.

Conclusion

That concludes the XSD schema writing process. xmlfy provides a significant number of command line options to change the behaviour of its processing of the input and output stream over and above the XSD schema file supplied. You are encouraged to experiment a little with xmlfy to get comfortable with these features.
