How to write a DTD schema

The following guide illustrates how to write a DTD (Document Type Definition) schema file for xmlfy.

Planning the DTD schema

It is good practice to write an accurate and well-formed DTD schema because it may be useful to programs other than xmlfy in the future.

You can also build a library of schemata for the variety of data that you may want to xmlfy and store them in a common directory.

Always start with a simple DTD schema and build up its complexity gradually. Shoe-horning foreign raw data into XML format using only a DTD schema file can be quite a feat, particularly if you want to use complex, nested element structures.

Analysing the input data

The key to writing a good schema file is to understand the data that it is trying to describe.

For example, let's look at the output of the ls -la command from a Cygwin shell.

% ls -la
total 73
drwx------+  3 ag None     0 Apr 20 19:36 .
-rwxr-xr-x   1 ag None 15639 Apr 20 19:31 a.exe
-rwx------+  1 ag None  6354 Apr 20 19:31 xmlfy.c
-rwx------+  1 ag None  4901 Apr 19  2008 xmlfy.h

You can see that five lines of data were returned with two different structures of data being presented (line 1 = summary total, lines 2-5 = file details).

Let's call the "summary total" structure the total record and the "file details" structure the file records.
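
As a rough illustration of this analysis step, the following Python sketch splits each line of the sample output on whitespace and classifies it as a total record or a file record. The function name classify_line and the rule used to spot the summary line are assumptions made for this example; they do not describe how xmlfy itself works.

```python
SAMPLE = """\
total 73
drwx------+  3 ag None     0 Apr 20 19:36 .
-rwxr-xr-x   1 ag None 15639 Apr 20 19:31 a.exe
-rwx------+  1 ag None  6354 Apr 20 19:31 xmlfy.c
-rwx------+  1 ag None  4901 Apr 19  2008 xmlfy.h
"""


def classify_line(line):
    """Return (record_type, fields) for one line of ls -la output."""
    fields = line.split()
    # Assumption for this sketch: the summary line is the only one that
    # starts with the word "total" and has exactly two fields.
    if len(fields) == 2 and fields[0] == "total":
        return "total", fields
    return "file", fields


records = [classify_line(line) for line in SAMPLE.splitlines() if line.strip()]
for record_type, fields in records:
    print(record_type, len(fields), fields)
```

Splitting on whitespace is enough for this sample, but a file name containing spaces would produce extra fields, which is worth keeping in mind when deciding which fields can be optional.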

Defining the root element

Under certain circumstances the total record may not appear e.g. ls -la just-one-file. This means the total record has a none or one relationship with the output of the ls -la command.

The file record may appear many times or none at all depending on the number of files returned by the ls command. This means the file record has a none or many relationship with the output of the ls -la command.

We can now write the first line of our DTD schema file to look like this:

<!ELEMENT ls (total?), (file*)>

This says that the root element, called ls, is made up of two elements: the first element, total, occurs none or one times (represented with the ? token), and the second element, file, occurs none or many times (represented with the * token).

Defining the record elements

The total record always has exactly two fields, prompt and totalsize.

We can now write the second line of our DTD schema file to look like this:

<!ELEMENT total (prompt, totalsize)>

This is saying that the record element which is called total is comprised of two elements with the first element prompt occurring only once, and the second element totalsize also occurring only once.

The file record can have a variable number of fields up to a maximum of nine with one of those fields fname being mandatory.

We can now write the third line of our DTD schema file to look like this:

<!ELEMENT file (permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname)>

This is saying that the record element which is called file is comprised of eight optional elements occurring none or one times (represented with the ? token), and with the last element fname occurring only once.

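The date_ty record can be represented in either hours:minutes or year. xmlfy recognises multiple element definitions that share the same name, so each form gets its own definition.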

We can now write the next two lines of our DTD schema file to look like this:

<!ELEMENT date_ty (date_y)>
<!ELEMENT date_ty (date_h, date_m)>
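
The choice between the two date_ty definitions can be pictured with a short Python sketch. It assumes the value is either a time such as 19:36 or a year such as 2008, and that the colon is what separates hours from minutes. The function split_date_ty is hypothetical and only mirrors the two definitions above; it is not xmlfy's own logic.

```python
def split_date_ty(value):
    """Map the time-or-year field of ls -la onto date_ty child fields."""
    if ":" in value:
        # Corresponds to the (date_h, date_m) definition.
        hours, minutes = value.split(":", 1)
        return {"date_h": hours, "date_m": minutes}
    # Corresponds to the (date_y) definition.
    return {"date_y": value}


print(split_date_ty("19:36"))
print(split_date_ty("2008"))
```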

Defining the field elements

Strictly speaking, xmlfy does not require any further definitions to work, because it ignores elements in the DTD schema file that contain the strings (#CDATA) or (#PCDATA). It is still good practice to furnish a complete DTD schema, so we include the field element definitions.

We can now write the final lines of our DTD schema file to look like this:

<!ELEMENT prompt (#PCDATA)>
<!ELEMENT totalsize (#PCDATA)>
<!ELEMENT permission (#PCDATA)>
<!ELEMENT blocks (#PCDATA)>
<!ELEMENT user (#PCDATA)>
<!ELEMENT group (#PCDATA)>
<!ELEMENT size (#PCDATA)>
<!ELEMENT date_y (#PCDATA)>
<!ELEMENT date_M (#PCDATA)>
<!ELEMENT date_d (#PCDATA)>
<!ELEMENT date_h (#PCDATA)>
<!ELEMENT date_m (#PCDATA)>
<!ELEMENT fname (#PCDATA)>

This says that the elements defined as (#CDATA) or (#PCDATA) are the actual fields. xmlfy does not interpret these definitions, because any field element without a definition automatically defaults to (#PCDATA).

NOTE: CDATA means Character DATA, and PCDATA means Parsed Character DATA.

The complete DTD schema

The complete DTD schema file for the Cygwin ls -la command looks like this:

<!ELEMENT ls (total?), (file*)>
<!ELEMENT total (prompt, totalsize)>
<!ELEMENT file (permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname)>
<!ELEMENT date_ty (date_y)>
<!ELEMENT date_ty (date_h, date_m)>
<!ELEMENT prompt (#PCDATA)>
<!ELEMENT totalsize (#PCDATA)>
<!ELEMENT permission (#PCDATA)>
<!ELEMENT blocks (#PCDATA)>
<!ELEMENT user (#PCDATA)>
<!ELEMENT group (#PCDATA)>
<!ELEMENT size (#PCDATA)>
<!ELEMENT date_y (#PCDATA)>
<!ELEMENT date_M (#PCDATA)>
<!ELEMENT date_d (#PCDATA)>
<!ELEMENT date_h (#PCDATA)>
<!ELEMENT date_m (#PCDATA)>
<!ELEMENT fname (#PCDATA)>

You can now save this to a text file called ls.dtd. Better still, save it to your DTD schema library, e.g. /usr/share/schemata/cygwin/ls.dtd
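
Before saving a schema, it can help to check that every field named in a record element also has its own definition. The Python sketch below is one simple way to do that with regular expressions. The function undefined_fields and the sample schema text are made up for this example, and the regular expressions only cover the plain <!ELEMENT> forms used in this guide; this is not xmlfy's parser. Element names are case-sensitive, so date_M and date_m count as different names.

```python
import re

# Matches "<!ELEMENT name content>" and captures the name and the content.
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)>", re.S)
# Matches a field name inside a content model, ignoring ?, * and + tokens.
NAME_RE = re.compile(r"[A-Za-z_][\w.-]*")


def undefined_fields(dtd_text):
    """Return the sorted field names that are referenced but never defined."""
    declared = set()
    referenced = set()
    for name, content in ELEMENT_RE.findall(dtd_text):
        declared.add(name)
        if "#PCDATA" in content or "#CDATA" in content:
            continue  # a field element: it references no other elements
        referenced.update(NAME_RE.findall(content))
    return sorted(referenced - declared)


# Hypothetical, deliberately incomplete schema used only for this example.
SAMPLE_DTD = """\
<!ELEMENT total (prompt, totalsize)>
<!ELEMENT prompt (#PCDATA)>
"""

print(undefined_fields(SAMPLE_DTD))
```

An undefined field is not an error for xmlfy, since such fields default to (#PCDATA), but listing them makes it easier to furnish a complete schema and to catch misspelt names.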

Output

If you run the following command:

% ls -la | xmlfy.exe --schema /usr/share/schemata/cygwin/ls.dtd -F3 :

You will get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ls SYSTEM "/usr/share/schemata/cygwin/ls.dtd">
<ls>
  <total>
    <prompt>total</prompt>
    <totalsize>73</totalsize>
  </total>
  <file>
    <permission>drwx------+</permission>
    <blocks>3</blocks>
    <user>ag</user>
    <group>None</group>
    <size>0</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>36</date_m>
    </date_ty>
    <fname>.</fname>
  </file>
  <file>
    <permission>-rwxr-xr-x</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>15639</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>31</date_m>
    </date_ty>
    <fname>a.exe</fname>
  </file>
  <file>
    <permission>-rwx------+</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>6354</size>
    <date_M>Apr</date_M>
    <date_d>20</date_d>
    <date_ty>
      <date_h>19</date_h>
      <date_m>31</date_m>
    </date_ty>
    <fname>xmlfy.c</fname>
  </file>
  <file>
    <permission>-rwx------+</permission>
    <blocks>1</blocks>
    <user>ag</user>
    <group>None</group>
    <size>4901</size>
    <date_M>Apr</date_M>
    <date_d>19</date_d>
    <date_ty>
      <date_y>2008</date_y>
    </date_ty>
    <fname>xmlfy.h</fname>
  </file>
</ls>

A word on capturing data when using a schema file

Shoe-horning raw data into a structure defined by a schema is straightforward when the input fields have a one-to-one relationship with the fields of the schema elements. However, if wildcard tokens and/or Boolean logic are employed in the schema, it becomes quite a challenge, sometimes even impossible, to be deterministic about which input field belongs to which schema field. Strictly speaking, the main function of a schema is to check that XML is valid, which requires the XML document to already exist. xmlfy does the reverse: it builds an XML document on the fly while following the rules described by the schema. This is still okay, and the resulting XML can be considered both valid and well formed.

xmlfy employs two techniques to help with this problem of shoe-horning input data.

The first technique is recognising multiple element definitions that have the same name. This allows you to capture your schema elements under a variety of input circumstances without having to create a unique element for each circumstance, although you can still do that if you want.

The second technique is auto-generated field match constraint helpers, which assist in matching the input fields to the elements described by the schema. These helpers improve the speed of xmlfy, particularly when compound element structures and wildcard tokens are used in the schema hierarchy. After the schema file is loaded into memory, an array of helpers is generated for each element. The array describes all combinations of the schema tree traversal paths that can be taken, and associates each combination with the minimum, maximum and last number of fields required for a match against the number of available input fields.
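
To give a feel for the kind of bookkeeping such helpers involve, the Python sketch below computes the minimum and maximum number of input fields that an element could consume. It is a simplified illustration and not xmlfy's implementation: the names TOKEN_RANGES, element_range and model_range are invented here, only comma-separated content models are handled, the "last number of fields" value is not modelled, and date_h and date_m are assumed to arrive as two separate input fields.

```python
import math

# Occurrence ranges for a bare name and for the ?, * and + tokens.
TOKEN_RANGES = {"": (1, 1), "?": (0, 1), "*": (0, math.inf), "+": (1, math.inf)}


def element_range(name, definitions):
    """Return (min, max) input fields consumed by one occurrence of name."""
    alternatives = definitions.get(name)
    if not alternatives:
        return 1, 1  # undefined names are treated as single leaf fields
    ranges = [model_range(model, definitions) for model in alternatives]
    return min(lo for lo, _ in ranges), max(hi for _, hi in ranges)


def model_range(model, definitions):
    """Return (min, max) input fields for a comma-separated content model."""
    total_lo, total_hi = 0, 0
    for item in model.split(","):
        item = item.strip()
        token = item[-1] if item[-1] in "?*+" else ""
        lo, hi = element_range(item.rstrip("?*+"), definitions)
        occ_lo, occ_hi = TOKEN_RANGES[token]
        total_lo += lo * occ_lo
        total_hi += hi * occ_hi
    return total_lo, total_hi


DEFINITIONS = {
    "file": ["permission?, blocks?, user?, group?, size?, "
             "date_M?, date_d?, date_ty?, fname"],
    # Two definitions with the same name, one per alternative.
    "date_ty": ["date_y", "date_h, date_m"],
}

print(element_range("file", DEFINITIONS))  # (1, 10)
```

Under these assumptions a file record needs at least one input field (fname) and can absorb at most ten, so a line whose field count falls outside that range can be rejected without walking the schema tree.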

By default xmlfy continuously iterates through just the record elements of the root element, looking for element helpers that can fully satisfy the requirements of that element's schema tree hierarchy for the given input fields. The matching record element is then checked against its wildcard obligations in the root element definition and, if it passes, is printed.
In match direct mode xmlfy only looks at the element helpers of the targeted element. If that element can fully satisfy the requirements of its schema tree hierarchy for the given input fields, it is printed in its entirety, only once, as the root element.

Important note

Currently the xmlfy DTD schema file parser is not that sophisticated and exhibits the following limitations:

  • Only recognises the <!ELEMENT> directive and ignores all others.
  • The first valid <!ELEMENT> definition becomes the root element.
  • The fields of the root element define all the level 1 elements (let's call the fields that have their own branch structure record elements).
  • The fields of the record elements simply represent other elements and unlimited element nesting is allowed.
  • By default, fields of the root element that are not record elements are ignored. Use the match direct option to match targeted elements in their entirety.
  • Element fields that don't have an element definition default to being (#PCDATA).
  • Elements defined inside the DTD schema file as (#PCDATA) or (#CDATA) are ignored, causing the referring field to default to (#PCDATA). It is nevertheless good practice to include these elements in order to furnish a complete DTD schema.
  • Only honours the +, ? and * wildcard tokens.
  • At this stage does not honour field group sets () and or-ing | syntax tokens.
  • The field names that are specified in the element definitions are read from left to right and matched against a field number calculation on the input fields, and then matched again on any wildcard tokens.
  • You can wildcard many fields but you should think clearly about what you are trying to achieve and whether it is at all possible.

    For example, the following DTD schema is perfectly suitable for checking XML validity, but xmlfy cannot reliably shoe-horn input data into schema elements a, b and c, because more than one field has a wildcard token that matches none or many input fields.

    <!ELEMENT parent (record)>
    <!ELEMENT record (a*, b, c*)>
    <!ELEMENT a (#PCDATA)>
    <!ELEMENT b (#PCDATA)>
    <!ELEMENT c (#PCDATA)>

    In the above example xmlfy will allocate ALL input fields to element <a> and that MAY not be the desired intention.
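
The ambiguity in the (a*, b, c*) example can be made concrete with a small Python sketch that enumerates every split of the input fields the content model permits. The function assignments is hypothetical and says nothing about how xmlfy chooses between the candidates; it only counts them.

```python
def assignments(n_fields):
    """List every way to split n_fields inputs across (a*, b, c*)."""
    ways = []
    # b takes exactly one field, so a can take 0 .. n_fields - 1 of them
    # and c takes whatever is left over.
    for n_a in range(n_fields):
        ways.append({"a": n_a, "b": 1, "c": n_fields - 1 - n_a})
    return ways


for way in assignments(3):
    print(way)
```

With three input fields there are already three equally acceptable splits, and the number grows with every extra field, which is why a record with more than one none-or-many field cannot be matched deterministically.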

Don't worry if you find some of the above hard to digest; it will become clearer as you get more familiar with writing schemata.

Conclusion

That concludes the DTD schema writing process. xmlfy provides a significant number of command line options to change the behaviour of its processing of the input and output stream over and above the DTD schema file supplied. You are encouraged to experiment a little with xmlfy to get comfortable with these features.