How to write an RNC schema

This guide illustrates how to write an RNC (RELAX NG Compact) schema file for xmlfy.

For a full description of the RELAX NG Compact syntax, please consult the RNC specification and/or the RNC tutorial.

Planning the RNC schema

It is good practice to write an accurate and well-formed RNC schema because it may be useful to programs other than xmlfy in the future.

You can also build a library of schemata for the variety of data that you may want to xmlfy and store them in a common directory.

Always start with a simple RNC schema and build up its complexity gradually. Shoe-horning foreign raw data into XML format using only an RNC schema file can be quite a feat, particularly if you want to use complex and nested element structures.

xmlfy only understands the named element feature of the RNC schema syntax, that is, definitions of the form name = element name { ... }.

Analysing the input data

The key to writing a good schema file is to understand the data that it is trying to describe.

For example, let's look at the output of the ls -la command from a Cygwin shell.

% ls -la
total 73
drwx------+  3 ag None     0 Apr 20 19:36 .
-rwxr-xr-x   1 ag None 15639 Apr 20 19:31 a.exe
-rwx------+  1 ag None  6354 Apr 20 19:31 xmlfy.c
-rwx------+  1 ag None  4901 Apr 19  2008 xmlfy.h

You can see that five lines of data were returned with two different structures of data being presented (line 1 = summary total, lines 2-5 = file details).

Let's call the "summary total" structure the total record and the "file details" structure the file records.
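
The two record structures can be told apart by counting fields. The following Python sketch is only an illustration and is not part of xmlfy. It assumes that fields are separated by runs of whitespace and that the colon also acts as a separator (the sample output later in this guide shows hours and minutes as separate fields). The helper names split_fields and classify are hypothetical.

```python
import re

# Sample output copied from the ls -la listing in this guide.
LS_OUTPUT = """\
total 73
drwx------+  3 ag None     0 Apr 20 19:36 .
-rwxr-xr-x   1 ag None 15639 Apr 20 19:31 a.exe
-rwx------+  1 ag None  6354 Apr 20 19:31 xmlfy.c
-rwx------+  1 ag None  4901 Apr 19  2008 xmlfy.h
"""

def split_fields(line):
    """Split a line into fields on runs of whitespace and colons (an assumption)."""
    return [field for field in re.split(r"[\s:]+", line.strip()) if field]

def classify(line):
    """Hypothetical helper: label a line as a 'total' record or a 'file' record."""
    fields = split_fields(line)
    if len(fields) == 2 and fields[0] == "total":
        return "total"
    return "file"

for line in LS_OUTPUT.splitlines():
    fields = split_fields(line)
    print(classify(line), len(fields), fields)
```

Under these assumptions the total record has two fields, a file record with a time has ten fields and a file record with a year has nine, which is why the schema needs optional fields and two forms of the date.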

Defining the root element

Under certain circumstances the total record may not appear, e.g. ls -la just-one-file. This means the total record has a none-or-one relationship with the output of the ls -la command.

The file record may appear many times or none at all depending on the number of files returned by the ls command. This means the file record has a none or many relationship with the output of the ls -la command.

We can now write the initial lines of our RNC schema file to look like this:

start = ls
ls = element ls { total? | file* }

This says that the root element is called ls and that it comprises two elements: total, which occurs none or one times (represented with the ? token), and file, which occurs none or many times (represented with the * token).
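
The meaning of the occurrence tokens can be written down as minimum and maximum counts. This Python sketch is an illustration only, not xmlfy code; the table OCCURRENCE and the helper parse_field are hypothetical names, and None is used here to mean "no upper limit".

```python
# Occurrence tokens as (minimum, maximum) counts; None means unbounded.
OCCURRENCE = {
    "": (1, 1),      # no token: exactly once
    "?": (0, 1),     # none or one
    "*": (0, None),  # none or many
    "+": (1, None),  # one or many
}

def parse_field(ref):
    """Split a field reference such as 'total?' into (name, minimum, maximum)."""
    token = ref[-1] if ref[-1] in "?*+" else ""
    name = ref[:-1] if token else ref
    low, high = OCCURRENCE[token]
    return name, low, high

for ref in ("total?", "file*", "fname"):
    print(parse_field(ref))
```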

Defining the record elements

The total record always has exactly two fields, prompt and totalsize.

We can now write the third line of our RNC schema file to look like this:

total = element total { prompt, totalsize }

This says that the record element called total comprises two elements, prompt and totalsize, each occurring exactly once.

The file record can have a variable number of fields, up to a maximum of nine, with one of those fields, fname, being mandatory.

We can now write the fourth line of our RNC schema file to look like this:

file = element file { permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname }

This says that the record element called file comprises eight optional elements, each occurring none or one times (represented with the ? token), followed by the element fname, which occurs exactly once.

The date_ty element can hold either a time (hours:minutes) or a year, so it is defined twice. The |= operator adds the second definition as an alternative to the first.

We can now write the next two lines of our RNC schema file to look like this:

date_ty = element date_ty { date_y }
date_ty |= element date_ty { date_h, date_m }
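
One way to picture how two definitions with the same name can coexist is to choose between them by the number of input fields available. The Python sketch below is a simplified illustration under that assumption; it is not xmlfy's actual matching code, and match_date_ty is a hypothetical name.

```python
# The two alternative definitions of date_ty, written as lists of field names.
DATE_TY_ALTERNATIVES = [
    ["date_y"],            # year form, e.g. 2008
    ["date_h", "date_m"],  # time form, e.g. 19:36 split into 19 and 36
]

def match_date_ty(fields):
    """Return {name: value} for the first alternative whose field count matches."""
    for names in DATE_TY_ALTERNATIVES:
        if len(names) == len(fields):
            return dict(zip(names, fields))
    return None  # no alternative fits this number of fields

print(match_date_ty(["2008"]))
print(match_date_ty(["19", "36"]))
```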

Defining the field elements

Strictly speaking, xmlfy does not require any further definitions to work because it ignores elements in the RNC schema file that are defined as { text }. It is nevertheless good practice to furnish a complete RNC schema, so we include the field element definitions.

We can now write the final lines of our RNC schema file to look like this:

prompt = element prompt { text }
totalsize = element totalsize { text }
permission = element permission { text }
blocks = element blocks { text }
user = element user { text }
group = element group { text }
size = element size { text }
date_y = element date_y { text }
date_M = element date_M { text }
date_d = element date_d { text }
date_h = element date_h { text }
date_m = element date_m { text }
fname = element fname { text }

This says that elements defined as { text } are the actual fields. xmlfy does not interpret these definitions, because any field element without a definition automatically defaults to { text }.
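
To see which fields would fall back to the { text } default, you can compare the names a schema refers to with the names it defines. The Python sketch below does this for the record definitions only. It is an illustration, not xmlfy's parser: the regular expression and the function referenced_and_defined are hypothetical and only handle the one-line definitions used in this guide.

```python
import re

# The schema as it stands before the field element definitions are added.
SCHEMA = """\
start = ls
ls = element ls { total? | file* }
total = element total { prompt, totalsize }
file = element file { permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname }
date_ty = element date_ty { date_y }
date_ty |= element date_ty { date_h, date_m }
"""

# Matches lines such as: name = element name { ... }  or  name |= element name { ... }
DEFINITION = re.compile(r"^(\w+)\s*\|?=\s*element\s+\w+\s*\{(.*)\}\s*$")

def referenced_and_defined(schema):
    """Return (referenced names, defined names) found in the schema text."""
    defined, referenced = set(), set()
    for line in schema.splitlines():
        match = DEFINITION.match(line)
        if not match:
            continue  # e.g. the 'start = ls' line
        defined.add(match.group(1))
        for ref in re.split(r"[,|]", match.group(2)):
            name = ref.strip().rstrip("?*+")
            if name and name != "text":
                referenced.add(name)
    return referenced, defined

referenced, defined = referenced_and_defined(SCHEMA)
# Names that are referenced but never defined would default to { text }.
print(sorted(referenced - defined))
```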

The complete RNC schema

The complete RNC schema file for the Cygwin ls -la command looks like this:

start = ls
ls = element ls { total? | file* }
total = element total { prompt, totalsize }
file = element file { permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname }
date_ty = element date_ty { date_y }
date_ty |= element date_ty { date_h, date_m }
prompt = element prompt { text }
totalsize = element totalsize { text }
permission = element permission { text }
blocks = element blocks { text }
user = element user { text }
group = element group { text }
size = element size { text }
date_y = element date_y { text }
date_M = element date_M { text }
date_d = element date_d { text }
date_h = element date_h { text }
date_m = element date_m { text }
fname = element fname { text }

You can now save this to a text file called ls.rnc. Better still, save it to your RNC schema library, e.g. /usr/share/schemata/cygwin/ls.rnc

Output

If you run the following command:

% ls -la | xmlfy.exe --schema /usr/share/schemata/cygwin/ls.rnc -F3 :

You will get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ls SYSTEM "/usr/share/schemata/cygwin/ls.rnc">
<ls>
  <total>
    <prompt>total</prompt>
    <totalsize>73</totalsize>
  </total>
  <file>
    <permission>drwx------+</permission>
    <blocks>3</blocks>
    <user>ag</user>
    <group>None</group>
    <size>0</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>36</date_m>
    </date_ty>
    <fname>.</fname>
  </file>
  <file>
    <permission>-rwxr-xr-x</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>15639</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>31</date_m>
    </date_ty>
    <fname>a.exe</fname>
  </file>
  <file>
    <permission>-rwx------+</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>6354</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>31</date_m>
    </date_ty>
    <fname>xmlfy.c</fname>
  </file>
  <file>
    <permission>-rwx------+</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>4901</size>
    <date_M>Apr</date_M>
    <date_d>19</date_d>
    <date_ty>
      <date_y>2008</date_y>
    </date_ty>
    <fname>xmlfy.h</fname>
  </file>
</ls>
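
Once the data is in XML it can be read by any XML-aware tool. As a rough illustration, the Python sketch below parses an abbreviated copy of the output above (two file records only) with the standard xml.etree.ElementTree module and lists each file name with its time or year. The function name summarise is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the xmlfy output shown above (two file records only).
XML_OUTPUT = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ls SYSTEM "/usr/share/schemata/cygwin/ls.rnc">
<ls>
  <total>
    <prompt>total</prompt>
    <totalsize>73</totalsize>
  </total>
  <file>
    <permission>-rwxr-xr-x</permission>
    <size>15639</size>
    <date_ty>
      <date_h>19</date_h>
      <date_m>31</date_m>
    </date_ty>
    <fname>a.exe</fname>
  </file>
  <file>
    <permission>-rwx------+</permission>
    <size>4901</size>
    <date_ty>
      <date_y>2008</date_y>
    </date_ty>
    <fname>xmlfy.h</fname>
  </file>
</ls>
"""

def summarise(xml_bytes):
    """Return a list of (file name, time-or-year) pairs from the ls document."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for file_el in root.findall("file"):
        name = file_el.findtext("fname")
        year = file_el.findtext("date_ty/date_y")
        if year is not None:
            stamp = year
        else:
            hour = file_el.findtext("date_ty/date_h")
            minute = file_el.findtext("date_ty/date_m")
            stamp = f"{hour}:{minute}"
        rows.append((name, stamp))
    return rows

for name, stamp in summarise(XML_OUTPUT):
    print(name, stamp)
```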

A word on capturing data when using a schema file

Shoe-horning raw data into a structure defined by a schema is straightforward when the input fields have a one-to-one relationship with the fields of the schema elements. If wildcard tokens and/or Boolean logic are employed in the schema, however, it becomes quite a challenge, sometimes even impossible, to be deterministic about which input field belongs to which schema field.

Strictly speaking, the main function of a schema is to check that XML is valid, which requires the XML document to exist already. In xmlfy's case we are doing the reverse by building an XML document on the fly while following the rules described by the schema. This is still okay, and the resulting XML can be considered both valid and well-formed.

xmlfy employs two techniques to help with this shoe-horning problem.

The first technique is recognising multiple element definitions that have the same name. This allows you to capture your schema elements under a variety of input circumstances without having to create a unique element for each circumstance (although you can still do that if you want).

The second technique is auto-generated field match constraint helpers, which assist in matching the input fields to the elements described by the schema. These helpers improve the speed of xmlfy, particularly when compound element structures and wildcard tokens are used in the schema hierarchy. After the schema file is loaded into memory, an array of helpers is generated for each element. The array describes all combinations of the schema tree traversal paths that can be taken, and associates each combination with the minimum, maximum and last number of fields required for a match against the number of available input fields.

By default xmlfy iterates continuously through just the record elements of the root element, looking for element helpers that can fully satisfy the requirements of that element's schema tree hierarchy for the given input fields. The matching record element is then checked against its wildcard obligations in the root element definition and, if these are met, is printed.

In match direct mode xmlfy only looks at the element helpers of the targeted element. If that element can fully satisfy the requirements of its schema tree hierarchy for the given input fields, it is printed in its entirety, only once, as the root element.
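
The idea behind the minimum and maximum field counts can be sketched in a few lines of Python. This is not xmlfy's implementation and it does not model the "last" count or the individual traversal paths. It assumes a hypothetical in-memory form of the ls schema, treats any undefined name as a single { text } field, and does not handle recursive schemas. The names SCHEMA and field_range are illustrative only.

```python
# Hypothetical in-memory schema: element name -> list of alternative definitions,
# each alternative being a list of (field name, occurrence token) pairs.
SCHEMA = {
    "total": [[("prompt", ""), ("totalsize", "")]],
    "file": [[("permission", "?"), ("blocks", "?"), ("user", "?"),
              ("group", "?"), ("size", "?"), ("date_M", "?"),
              ("date_d", "?"), ("date_ty", "?"), ("fname", "")]],
    "date_ty": [[("date_y", "")], [("date_h", ""), ("date_m", "")]],
}

def field_range(name):
    """Return (min, max) input fields an element can consume; max None = unbounded."""
    if name not in SCHEMA:
        return 1, 1  # an undefined name is treated as one { text } field
    lows, highs = [], []
    for alternative in SCHEMA[name]:
        low, high = 0, 0
        for child, token in alternative:
            child_low, child_high = field_range(child)
            if token in ("", "+"):  # at least one occurrence is required
                low += child_low
            if token in ("*", "+") or child_high is None:
                high = None  # no upper bound
            elif high is not None:
                high += child_high
        lows.append(low)
        highs.append(high)
    return min(lows), (None if None in highs else max(highs))

for name in ("total", "date_ty", "file"):
    print(name, field_range(name))
```

With the ls schema, a line of two fields can only be a total record, while a line of nine or ten fields fits within the range computed for file.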

Important note

The xmlfy RNC schema file parser is currently not very sophisticated and has the following limitations:

  • Only recognises named directives and ignores all others.
  • The element referenced by the "start" definition becomes the root element.
  • The fields of the root element define all the level 1 elements (let's call the fields that have their own branch structure record elements).
  • The fields of the record elements simply represent other elements and unlimited element nesting is allowed.
  • By default fields of the root element that are not record elements are ignored. Use the match direct option to match targeted elements in their entirety.
  • Element fields that don't have an element definition default to being { text }.
  • Elements defined inside the RNC schema file as { text } are ignored, so the referring field defaults to { text }. It is nevertheless good practice to include these elements in order to furnish a complete RNC schema.
  • Only honours the +, ? and * wildcard tokens.
  • At this stage does not honour field group sets () and or-ing ¦ syntax tokens.
  • The field names that are specified in the element definitions are read from left to right and matched against a field number calculation on the input fields, and then matched again on any wildcard tokens.
  • You can wildcard many fields but you should think clearly about what you are trying to achieve and whether it is at all possible.

    For example, the following RNC schema is perfectly suitable for checking for valid XML, but xmlfy cannot reliably shoe-horn input data into schema elements a, b and c because more than one field has a wildcard token that matches none or many input fields.

    start = parent
    parent = element parent { record }
    record = element record { a*, b, c* }
    a = element a { text }
    b = element b { text }
    c = element c { text }

    In the above example xmlfy will allocate ALL input fields to element <a>, which may not be what you intended.
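
The ambiguity in the a*, b, c* pattern can be made concrete by listing every way a set of input fields could be divided between the three elements. The Python sketch below is an illustration of the problem only, not of how xmlfy resolves it; the function name assignments is hypothetical.

```python
def assignments(fields):
    """Yield every way to split the fields across the pattern a*, b, c*."""
    for i in range(len(fields)):  # i = number of leading fields given to a
        yield {"a": fields[:i], "b": fields[i], "c": fields[i + 1:]}

options = list(assignments(["x", "y", "z"]))
for option in options:
    print(option)
print(len(options), "equally valid splits")
```

Every split is valid according to the schema, so the schema alone gives no basis for preferring one over another.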

Don't worry if you find some of the above hard to digest; it will become clearer as you get more familiar with writing schemata.

Conclusion

That concludes the RNC schema writing process. xmlfy provides a significant number of command line options to change the behaviour of its processing of the input and output stream over and above the RNC schema file supplied. You are encouraged to experiment a little with xmlfy to get comfortable with these features.