xmlfy(1)

NAME

xmlfy - Convert to XML on the fly.

SYNOPSIS

xmlfy [OPTION]...

-h, --help
print usage instructions

-v, --version
print version number

--license
print license

--debug
print extra debugging information

Input options:

-F, --fieldseparator[<level>[b][:<scope>]] <string>
specify a delimiter string token for the level specified

-R, --recordseparator <string>
this is a synonym for "-F1 <string>"
specify an alternative record separator string to the default

-C, --column[:<scope>] <c1>-<c2>[:<name>]
create an input field from an input column range

-W, --regex[:<scope>] [E|B][i][l][r][U][n][b][e]/<pattern>/[<name>[,..]]
create input fields from a regular expression

-e, --expelempty
expel empty input records and fields

-E, --expel <input_records>[:<input_fields>]
expel selected records or fields from being processed

-q, --quotedfields[2]
treat fields that are between quotes as one field

-Q, --quotechars[2] <string>
specify an array of quoting characters to use

-b, --blanklines
do not ignore blank input records

-t, --trim
trim leading and trailing white space from input fields

Output options:

-S, --schema <file> | -Sd, --schemadtd <file> | -Sr, --schemarnc <file> | -Sx, --schemaxsd <file>
use a schema <file> for tag names and element control

-M, --matchdirect 0|<elementname>
match directly on a specific element in the schema

-A, --attribute[<level>[:<scope>]] number|level|delimiter|timestamp|insert <name> <value>
include attributes in the opening element tag

-T, --tag[<level>[:<scope>]] number|level|name <name>|[re]insert <name> <value>|[re]insertfile <name> <file>|[re]insertfilexml <indent> <file>
modify or insert element tags

-k, --keyvaluepairs[<level>]
generate key/value XML tag pairs

-l, --linenumbers
this is a synonym for "-T1 number"
include the line number in the line tag name

-f, --fieldnumbers
this is a synonym for "-T2 number"
include the field number in the field tag name

-L, --linetags
include a line number tag with the record data

-X, --xmlformat [XML1.0|XML1.1]|[SOAP1.1|SOAP1.2]|[HTML table|list]
          |[UTF-8|UTF-16|UTF-16BE|UTF-16LE|UTF-32|UTF-32BE|UTF-32LE]|BOM
          |ASCIItoUTF|[noescape all|amp|lt|gt|quot|apos|brvbar]
          |trimtagclose|[newline dos|unix]
specify an XML output format

-p, --printonly header|footer|rtagopen|rtagclose|records
print only snippets of the XML output

-I, --identifier <system_identifier>
specify an alternate system identifier of the doctype or SOAP URI

-s, --summary[2|c|n|f <file>]
print a summary after the end of the processing

-U, --unxml
undo the XML syntax leaving just plain text

--noxml
do not XML-fy the input stream

DESCRIPTION

The xmlfy command reads stdin and outputs it to stdout in XML format using supplied control directives.

Delimiter tokens and/or column selections are used to break down the input stream into XML elements which are then represented inside an XML tree hierarchy that can span multiple depth levels. For example, command line output was originally designed for text or CRT based processing. The xmlfy command takes this text based output where a new-line often represents an end-of-record of data and white space often represents a field separator, and reformats it into XML output suitable for interfacing with modern object oriented systems.

xmlfy is a powerful yet lightweight tool that primarily caters for converting ASCII, UTF-8, UTF-16 or UTF-32 based output into XML format on the fly and dealing with common issues associated with this kind of transformation.

The xmlfy command also supports a basic version of a schema configuration allowing you to control the format of the XML output by supplying a schema file as an option.

With no options supplied xmlfy will use default values for its XML format. The entire standard input will be enclosed in <xmlfy></xmlfy> pairs, each line of standard input will be enclosed in <line></line> pairs, and each field of each line will be enclosed in <field></field> pairs.
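
The default behaviour described above can be modelled with a short Python sketch. This is an illustration only, not the xmlfy implementation: the function name xmlfy_default, the two-space indentation and the absence of an XML prologue are assumptions made for the example.

```python
from xml.sax.saxutils import escape


def xmlfy_default(text: str) -> str:
    """Wrap lines and white-space separated fields in the default tags."""
    out = ["<xmlfy>"]
    for line in text.split("\n"):      # default level 1 delimiter: new-line
        fields = line.split()          # default level 2 delimiter: white space
        if not fields:                 # blank lines are ignored by default
            continue
        out.append("  <line>")
        for field in fields:
            # Reserved XML characters (&, <, >) are escaped.
            out.append(f"    <field>{escape(field)}</field>")
        out.append("  </line>")
    out.append("</xmlfy>")
    return "\n".join(out)


print(xmlfy_default("alice 42\nbob 7\n"))
```

Each input line becomes one <line> element holding one <field> element per white-space separated token.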

OPTIONS

You can supply options to customise the behaviour of xmlfy at the command line, or by a special token inside the schema file, or both. NOTE: Options are resolved from left to right. If any conflicting options are specified then the last one will have precedence.

Option: -h, --help
The command line usage is printed in plain text format, not in XML format.

Option: -v, --version
The version number is printed in plain text format, not in XML format. If the version number is required in XML format, it is included in the output of the summary option.

Option: --license
Print all licenses used by xmlfy.

Option: --debug
Print extra debugging information to stderr to help debug xmlfy behaviour.

Input options:

Option: -F, --fieldseparator[<level>[b][:<scope>]] <string>
Allows you to specify a delimiter string token for the level specified.

<level> - The XML depth level to be delimited by <string>.
Must be an integer value greater than or equal to 1.
E.g. a value of 1 will split the input into records delimited by <string>, a value of 2 will split records into fields delimited by <string>, a value of 3 will split fields into subfields delimited by <string>, and so on.
There is no space separating the option and the level value.
If no level is specified then the given options will only apply to level 2.
b - Use byte matching for the specified delimiter string.
By specifying this option the delimiter string is treated as just a literal sequence of bytes. Normally command line arguments are presented to xmlfy as ASCII strings and if wide UTF encoding like UTF-16 or UTF-32 is being used then xmlfy will automatically convert the specified delimiter string to that encoding. With this option no encoding conversion takes place. In this mode you can also specify escaped decimal byte sequences inside the delimiter string. E.g. "\123\234\\"
<scope> - A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
<s1> - integer representing a start range.
<s2> - integer or the $ token representing an end range.
r - restart the scope counter for this delimiter after the completion of the associated range.
Restrict the delimiter effectiveness to the occurrences specified in <scope>. If a delimiter <string> is encountered for the level specified and its sequence is not in the scope then it will not function as a field separator and will instead be treated as data.
E.g. -F3:1-3,8 "." means that level 3 fields are only created for the 1st to 3rd, and the 8th, occurrences of the delimiter "." (period).
The restart scope counter option r allows you to specify repeating scope sequences. E.g. -F1:2,5r "\n" creates level 1 records out of every second and fifth line, and keeps repeating this until the input is exhausted.
When using multiple same level delimiters, restarting scope counters of the equivalent level and higher get reset whenever a delimiter match is applied.
If a <scope> range is not specified then the delimiter function applies to every occurrence of <string> of the target level.
<string> - A sequence of characters or token to be used as a delimiter. Tokens specified literally as "\n", "\r", and "\t" are translated to their corresponding control character. If using wide UTF encoding then <string> is automatically converted to that encoding, otherwise you can use the byte matching option and specify escaped decimal byte sequences inside <string>.
º If the delimiter token is the same for a series of levels then obviously the shallowest level will take precedence, unless the shallowest levels have been limited by scope restrictions. You can also make use of quotes in the input along with specifying quote options.
º The XML tree algorithm deepens in a sequential way therefore you must set your delimiter levels as an unbroken sequence for them to be of any use, that is you cannot split a level 2 field with a level 4 delimiter string.
º Refer to the schema option section for information on level handling when a schema file is specified.
º Levels 1 and 2 are already set by default.
º The default level 1 delimiter token is NEWLINE (new-line).
º The default level 2 delimiter token is WHITESPACE (space, tab, new-line, carriage-return, vertical-tab and form-feed).
º The delimiters for levels 3 and above are unset.
º Only one delimiter string token can be specified however this option can be invoked multiple times allowing for multiple delimiters to be used at the level specified. When specifying multiple same level delimiters, the larger delimiter strings are matched before the smaller ones. The delimiter string is not included in the output.
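
The <scope> syntax <s1>[-<s2>][r][,..] used by this option can be sketched in Python as follows. This is one reading of the description and of the -F1:2,5r example, not the xmlfy implementation: the names parse_scope, in_scope_flags and matching_occurrences are hypothetical, and the counter is assumed to restart immediately after the last occurrence of a range flagged with r.

```python
from itertools import islice
from typing import Iterator, List, Optional, Tuple

# (start, end or None for the "$" token, restart flag)
Range = Tuple[int, Optional[int], bool]


def parse_scope(scope: str) -> List[Range]:
    """Parse a scope such as '1-3,8' or '2,5r' into a list of ranges."""
    ranges = []
    for part in scope.split(","):
        restart = part.endswith("r")
        if restart:
            part = part[:-1]
        start, _, end = part.partition("-")
        if not end:
            stop: Optional[int] = int(start)   # single value, e.g. "8"
        elif end == "$":
            stop = None                        # open-ended range
        else:
            stop = int(end)
        ranges.append((int(start), stop, restart))
    return ranges


def in_scope_flags(ranges: List[Range]) -> Iterator[bool]:
    """Yield True or False for the 1st, 2nd, 3rd ... delimiter occurrence."""
    counter = 0
    while True:
        counter += 1
        hit = False
        reset = False
        for start, stop, restart in ranges:
            if counter >= start and (stop is None or counter <= stop):
                hit = True
                if restart and stop is not None and counter == stop:
                    reset = True           # assumed meaning of the r flag
        yield hit
        if reset:
            counter = 0


def matching_occurrences(scope: str, total: int) -> List[int]:
    """List which of the first `total` occurrences fall inside the scope."""
    flags = islice(in_scope_flags(parse_scope(scope)), total)
    return [n for n, ok in enumerate(flags, start=1) if ok]


print(matching_occurrences("1-3,8", 9))   # -> [1, 2, 3, 8]
print(matching_occurrences("2,5r", 10))   # -> [2, 5, 7, 10]
```

Occurrences that are not listed would be treated as data rather than as a field separator.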

Option: -R, --recordseparator <string>
This is a synonym for "-F1 <string>"
Allows you to specify a record separator string token that is different from the default. The default record separator token is NEWLINE (new-line).
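
Because the record separator is simply the level 1 delimiter, the way delimiter levels nest can be shown with a small recursive split. The sketch below is a simplified model, not the xmlfy algorithm: the function name split_levels is hypothetical, one delimiter per level is assumed, and scopes and quoting are ignored.

```python
from typing import List, Union

Tree = Union[str, List["Tree"]]


def split_levels(text: str, delimiters: List[str]) -> Tree:
    """Split text by delimiters[0], then each piece by delimiters[1], etc."""
    if not delimiters:
        return text
    head, rest = delimiters[0], delimiters[1:]
    return [split_levels(piece, rest) for piece in text.split(head)]


# Level 1 is ";" (as with -R ";"), level 2 is "," and level 3 is ".".
tree = split_levels("a.b,c;d,e.f", [";", ",", "."])
print(tree)   # -> [[['a', 'b'], ['c']], [['d'], ['e', 'f']]]
```

Each nesting depth of the resulting list corresponds to one XML depth level, which is why the levels have to form an unbroken sequence.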

Option: -C, --column[:<scope>] <c1>-<c2>[:<name>]
Use an input column range of the input record to generate an input field. This is an alternative method of capturing input fields from using delimiters.

<scope> - A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
<s1> - integer representing a start range.
<s2> - integer or the $ token representing an end range.
r - restart the scope counter for this column option after the completion of the associated range.
Restrict the column option effectiveness to the occurrences specified in <scope>. If the input record sequence is not in the scope then the column option will not be applied and input fields will not be generated.
The restart scope counter option r allows the scope sequences to continually repeat themselves. E.g. -C:1-3,5r 1-20 captures a column field of 20 characters in length for every first to third and fifth input record, and keeps repeating this until the input is exhausted.
If a <scope> range is not specified then the column option applies to all input records.
<c1> - Integer or the $ token representing the start column range of the input field.
<c2> - Integer or the $ token representing the end column range of the input field.
<name> - Optional string value that will be used to override the tag name for this input field.
You can pretty much specify anything as a tag name including illegal XML therefore user discretion is advised.
Only applicable for changing default behaviour (i.e. when the --schema option is NOT specified).
º Specifying field separators of level 2 and above with this option is conflicting and will produce a usage error.
º The number of times and order in which this option is specified (in conjunction with the -W option) determines the number of input fields generated and their order.
º Column ranges represent code points (characters) meaning any multi byte character will only account for just one column position.
º Multiple options can use non linear ranges and can overlap e.g. -C 5-10:part -C 1-$:whole
º Ranges that exceed the size of the input record will not process beyond the end of the input record.
º You can use single or double quotes to protect the range from the shell interpreter e.g. -C '80-$:text'
º Only one parameter pair can be specified however this option can be invoked multiple times.
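
A column range can be pictured as a 1-based, inclusive slice over the code points of the record. The following Python sketch illustrates that reading. The name column_field and the default tag name "field" are assumptions for the example, and the $ token is assumed to mean the last column of the record.

```python
from typing import Tuple


def column_field(record: str, spec: str) -> Tuple[str, str]:
    """Return (tag name, text) for a '<c1>-<c2>[:<name>]' specification."""
    rng, _, name = spec.partition(":")
    c1, _, c2 = rng.partition("-")
    last = len(record)                    # len() counts code points, not bytes
    start = last if c1 == "$" else int(c1)
    end = last if c2 == "$" else int(c2)
    # 1-based inclusive columns become a 0-based half-open slice.
    # Slicing never reads beyond the end of the record.
    return (name or "field", record[start - 1:end])


print(column_field("héllo world", "1-5:part"))   # -> ('part', 'héllo')
print(column_field("héllo world", "7-$:rest"))   # -> ('rest', 'world')
```

The multi-byte character é occupies a single column position here, in line with the note on code points above.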

Option: -W, --regex[:<scope>] [E|B][i][l][r][U][n][b][e]/<pattern>/[<name>[,..]]
Use a regular expression on the input record to generate input fields. This is an alternative method of capturing input fields from using delimiters.

<scope> - A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
<s1> - integer representing a start range.
<s2> - integer or the $ token representing an end range.
r - restart the scope counter for this regex option after the completion of the associated range.
Restrict the regex option effectiveness to the occurrences specified in <scope>. If the input record sequence is not in the scope then the regex option will not be applied and input fields will not be generated.
The restart scope counter option r allows the scope sequences to continually repeat themselves. E.g. -W:1-3,5r /(^A.*).*(B.*$)/ captures two regex fields for every first to third and fifth input record, and keeps repeating this until the input is exhausted.
If a <scope> range is not specified then the regex option applies to all input records.
E - flag to use Extended Regular Expressions in <pattern> (default).
B - flag to use Basic Regular Expressions in <pattern>.
i - flag to ignore case.
l - flag to treat <pattern> as a literal.
r - flag to make concatenation right associative.
U - flag to make operators ungreedy by default.
n - flag to give '\n' special meaning (REG_NEWLINE).
b - flag to set '^' as not beginning-of-line (REG_NOTBOL).
e - flag to set '$' as not end-of-line (REG_NOTEOL).
<pattern> - A POSIX 1003.2 compliant Regular Expression pattern utilising zero or more parenthesis pairs to capture input fields.
<name> - Optional string value that will be used to override the tag name for input fields derived from pattern matches.
A comma separated list of <name> can be specified with the last entry being re-used if more input fields than names are generated.
You can pretty much specify anything as a tag name including illegal XML therefore user discretion is advised.
Only applicable for changing default behaviour (i.e. when the --schema option is NOT specified).
º Specifying field separators of level 2 and above with this option is conflicting and will produce a usage error.
º The number of times and order in which this option is specified (in conjunction with the -C option) determines the number of input fields generated and their order.
º If matches are not made for all parenthesis pairs specified in <pattern> then no output will result.
º If no parenthesis pairs are specified in <pattern> then the entire input record will be used as the output when a pattern match occurs.
º Wide UTF encoding can be specified in <pattern> by using the \x literal followed by two hexadecimal digits to represent any byte inside the code-point e.g. \x0b.
º For further information on using regex syntax and its flags please consult the TRE web documentation.
º You can use single or double quotes to protect <pattern> from the shell interpreter e.g. -W 'iU/(^Pam .*)/pams'
º You can specify the percentage character % as an alternative separator to forward-slash / for <pattern> so long as it remains paired.
º Only one parameter pair can be specified however this option can be invoked multiple times.
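
The way parenthesis pairs turn into named input fields can be sketched as below. Note that this example uses Python's re module, whose syntax differs from the POSIX expressions that xmlfy accepts, so it only models the field and naming rules. The name regex_fields and the default tag name "field" are assumptions.

```python
import re
from typing import List, Tuple


def regex_fields(record: str, pattern: str, names: List[str],
                 ignore_case: bool = False) -> List[Tuple[str, str]]:
    """Return (tag name, text) pairs captured from one input record."""
    compiled = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    match = compiled.search(record)
    if match is None:
        return []
    if compiled.groups == 0:
        values = [record]            # no parentheses: use the whole record
    else:
        values = list(match.groups())
        if any(value is None for value in values):
            return []                # every parenthesis pair must match
    names = names or ["field"]
    # The last name is re-used when there are more fields than names.
    return [(names[min(i, len(names) - 1)], value)
            for i, value in enumerate(values)]


print(regex_fields("Pam Smith 42", r"(^Pam \w+) (\d+)", ["who", "n"]))
# -> [('who', 'Pam Smith'), ('n', '42')]
```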

Option: -e, --expelempty
Expel input fields that are empty (zero bytes in length) from being processed. The use of multi level and multiple same level delimiters can sometimes yield many empty fields, which may be undesirable. This option expels all the empty input fields from being processed by the output processor. All levels are examined and any input records consisting entirely of empty fields are also expelled.
This option will always run before any expelling tasks specified with option -E are run.
This option has no influence on levels subjected to key/value pairing as that process has its own way of dealing with empty fields at its target levels.
If a schema is used then obviously the number of input records/fields used for element matching has been reduced.
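
The effect of this option on a parsed input tree can be sketched as a recursive filter. This is a simplified model rather than the xmlfy code: the name expel_empty is hypothetical, records and fields are represented as nested Python lists of strings, and key/value pairing is not considered.

```python
def expel_empty(node):
    """Drop empty fields, and any record or sub-level left with nothing."""
    if isinstance(node, str):
        return node
    children = [expel_empty(child) for child in node]
    return [child for child in children if child not in ("", [])]


records = [["a", "", "b"], ["", ""], ["c", ["", ""]]]
print(expel_empty(records))   # -> [['a', 'b'], ['c']]
```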

Option: -E, --expel <input_records>[:<input_fields>]
Expel selected input records or selected input fields of selected input records from being processed. Each input record is checked against the expel criteria and if a match occurs then these input records or input fields are simply discarded from being passed onto the xmlfy output processor.

<input_records> - A comma delimited set of input record expel criteria with no spaces.
The <input_records> parameter has a sub form of <range_type><r1>[-<r2>][/<string>/][,..]
Where <range_type> can be 'n', 'f' or 'c'.
n - the associated range refers to input record numbers.
f - the associated range refers to input field numbers.
c - the associated range refers to input record character lengths.
<r1> - integer representing a start range.
<r2> - integer or the $ token representing an end range.
<string> - the specified <string> must also exist within the range.
Expel criteria types can be intermixed.
E.g. -E n10-$,f7-8,f4/Mercedes/,c10-20,c1-15/SUV/ matches input records whose record number is greater than or equal to 10, AND input records whose total number of fields is between 7 and 8, AND input records whose 4th input field contains the string "Mercedes", AND input records whose length is greater than or equal to 10 but less than or equal to 20 characters, AND input records whose first 15 characters contain the string "SUV".
In this release you can only specify the $ token (last input record) in a paired range and not on its own.
Generally xmlfy can figure out where the search string delimiters would likely occur however you can specify the % character as an alternative separator to / for <string> so long as it remains paired.
If an <input_fields> criteria is not specified then the entire input record is expelled.
<input_fields> - A comma delimited set of field number ranges with no spaces.
The <input_fields> parameter has a sub form of <r1>[-<r2>][,..]
<r1> - integer or the $ token representing a start range.
<r2> - integer or the $ token representing an end range.
Discard select input fields of the input records that match the expel criteria before passing onto the xmlfy output processor.
E.g. -E n2-$:1,$ means that input records whose record number is greater than or equal to 2 will have their first and last fields expelled.
You can specify the $ token (last input field) in a paired range or on its own.
º You can use single or double quotes to protect the range from the shell interpreter e.g. -E 'n2-$:$'
º If a schema is used then obviously the number of input records/fields used for element matching has been reduced.
º Only one parameter group can be specified however this option can be invoked multiple times with resolution occurring from left to right.
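
The <input_fields> part of the expel criteria can be sketched as follows. The sketch only covers the field list <r1>[-<r2>][,..], not the record matching criteria. The name expel_fields is hypothetical and the $ token is taken to mean the last input field of the record.

```python
from typing import List


def expel_fields(fields: List[str], spec: str) -> List[str]:
    """Remove the 1-based field numbers listed in spec, e.g. '1,$' or '2-3'."""
    last = len(fields)
    doomed = set()
    for part in spec.split(","):
        r1, _, r2 = part.partition("-")
        start = last if r1 == "$" else int(r1)
        if not r2:
            end = start                    # single field number
        else:
            end = last if r2 == "$" else int(r2)
        doomed.update(range(start, end + 1))
    return [f for n, f in enumerate(fields, start=1) if n not in doomed]


# Modelled on -E 'n2-$:1,$': from record 2 onwards, drop first and last field.
records = [["id", "name", "age"], ["1", "Ann", "30"], ["2", "Bob", "41"]]
result = [expel_fields(r, "1,$") if number >= 2 else r
          for number, r in enumerate(records, start=1)]
print(result)   # -> [['id', 'name', 'age'], ['Ann'], ['Bob']]
```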

Option: -q, --quotedfields[2]
Treat fields that are quoted as one field. Normally xmlfy will parse fields by their delimiter, e.g. WHITESPACE. This option allows text containing delimiters to be treated as one field by quoting it. By default the quoted field may only span the current input record. If the -q2 option is specified, the quoted field can span multiple input records.
Quotes are not included in the field and any leading/trailing text outside the field's quotes is truncated.
If quotes are not closed, xmlfy will continue the field until the end of the input record or, if option -q2 is specified, until the input is exhausted (EOF).
The default quote character is a double quote (").

Option: -Q, --quotechars[2] <string>
Specify a string of characters that can be used as the quoting characters.

<string> - an array of quoting characters.
º If field quoting is enabled then any input character that matches any character in <string> will toggle the quoting function, unless the -Q2 option is specified in which case characters in <string> represent paired quotes with odd numbered characters in this array toggling the open quote function, and its corresponding pair toggling the close quote function. This allows parenthesis, brackets, etc to be used as quotes.
º Obviously when specifying this option care must be taken to prevent the shell from interpreting the supplied quote characters. When using a schema file containing this option you can specify quote characters by escaping them with the backslash "\" character.
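
The difference between toggling quotes (-Q) and paired quotes (-Q2) can be sketched with a small white-space tokenizer. This is an illustration under assumptions, not the xmlfy parser: the name split_quoted is hypothetical, only the default white-space field delimiter is handled, and the truncation of text adjoining the quotes is not modelled.

```python
from typing import Dict, List, Optional


def split_quoted(record: str, quotechars: str = '"',
                 paired: bool = False) -> List[str]:
    """Split on white space, keeping quoted text together as one field."""
    if paired:
        # -Q2 style: 1st, 3rd, ... characters open; the next one closes.
        closers: Dict[str, Optional[str]] = {
            quotechars[i]: quotechars[i + 1]
            for i in range(0, len(quotechars) - 1, 2)
        }
    else:
        # -Q style: every quote character toggles quoting on and off.
        closers = {ch: None for ch in quotechars}

    fields: List[str] = []
    current: List[str] = []
    in_field = False
    quoted = False
    closer: Optional[str] = None

    for ch in record:
        if quoted:
            if ch == closer or (closer is None and ch in closers):
                quoted = False           # closing quote is not kept
            else:
                current.append(ch)
        elif ch in closers:
            quoted, closer, in_field = True, closers[ch], True
        elif ch.isspace():
            if in_field:
                fields.append("".join(current))
                current, in_field = [], False
        else:
            current.append(ch)
            in_field = True
    if in_field:                         # unclosed quotes run to end of record
        fields.append("".join(current))
    return fields


print(split_quoted('a "b c" d'))                        # -> ['a', 'b c', 'd']
print(split_quoted("x (y z) [p q]", "()[]", paired=True))  # -> ['x', 'y z', 'p q']
```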

Option: -b, --blanklines
Normally xmlfy ignores blank lines or empty level 1 records in the input stream. This option tells xmlfy to not ignore these blank lines and print out XML line record tags but with no elements.
In this mode blank lines count as line numbers.

Option: -t, --trim
Field elements are trimmed of leading and trailing white space.

Output options:

Option: -S, --schema <file> | -Sd, --schemadtd <file> | -Sr, --schemarnc <file> | -Sx, --schemaxsd <file>
Specify a schema <file> for controlling the XML output.

<file> - The schema file must comply with the Document Type Definition (.dtd) language, the RELAX NG Compact (.rnc) language, or the XML Schema Definition (.xsd) language. At this early stage xmlfy does not support the finer aspects of these schema languages.
º When all input fields of the input record have been identified, xmlfy will match them against the elements inside the tree hierarchy of the schema file, and if a match is found then xmlfy will print an output record using the matching schema tree hierarchy as its XML structure.
Option -S, --schema uses the case-insensitive file name extension (.dtd or .rnc or .xsd) of <file> to determine which schema interpreter xmlfy will apply.
Option -Sd, --schemadtd forces xmlfy to use the DTD schema interpreter on <file>.
Option -Sr, --schemarnc forces xmlfy to use the RNC schema interpreter on <file>.
Option -Sx, --schemaxsd forces xmlfy to use the XSD schema interpreter on <file>.
º You can specify multi level delimiters when using this option however any delimiters greater than level 2 are only used to identify more input fields and are not used at all in altering the XML tree hierarchy as is dictated by the schema file. Fields with levels of 2 and above are flattened to be just plain fields of the input record - this is very different to the default behaviour where field levels form the XML tree hierarchy.
º If a schema option is not supplied then xmlfy will use default values for tag names and element control.
º For further information on how to write a schema for xmlfy please consult the web documentation.

Option: -M, --matchdirect 0|<elementname>
Match directly on a specific element in the schema making it the root element.

0 - A token representing the default root element in the schema.
<elementname> - The name of a record element in the schema.
º This option alters the way the selected schema element is matched against the available input fields that were generated. In this mode the target element is matched in its entirety using its element helper and printed accordingly. This is very different to the default legacy mode whereby only the record elements of the root element get matched in a continuously sequential way.
º Regardless of what wildcard attributes exist for the target element it will only be printed once as a root element.
º If a schema file is not specified then this option will be ignored.

Option: -A, --attribute[<level>[:<scope>]] number|level|delimiter|timestamp|insert <name> <value>
Include attributes in the opening element tag for the level specified.

<level> - The XML depth level to be modified.
Must be an integer value greater than or equal to 0.
E.g. a value of 1 will apply attributes to each opening record element and a value of 2 will apply attributes to each opening field element.
There is no space separating the option and the level value.
If no level is specified then the given options will apply to all levels except level 0.
<scope> - A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
<s1> - integer representing a start range.
<s2> - integer or the $ token representing an end range.
r - restart the scope counter for this attribute after the completion of the associated range.
Restrict the custom attribute effectiveness to the occurrences specified in <scope>. If the element sequence is not in the scope then the custom attribute will not be applied.
The restart scope counter option r allows the scope sequences to continually repeat themselves. E.g. -A2:1-3,5r insert x y inserts the custom attribute x="y" for every first to third and fifth level 2 element, and keeps repeating this until the output is exhausted.
Scope sequence counters are always reset to zero for the next element depth level and higher whenever a deeper XML depth level is entered into.
If a <scope> range is not specified then the custom attribute function applies to all elements at the specified <level>.
number - Specify the sequence number as an element attribute.
E.g. <field> becomes <field number="1"> and the next <field> becomes <field number="2"> and so on.
Scoping is not supported.
Not supported for level 0.
level - Specify the level as an element attribute.
E.g. <field> becomes <field level="2">
Scoping is not supported.
Not supported for level 0.
delimiter - Specify the matching delimiter as an element attribute.
E.g. <field> becomes <field delimiter="ABC">
Delimiter string tokens that contain illegal XML characters are printed as their hex pair equivalent.
When using a schema file only level 1 records and field elements will have their delimiter attributes printed.
Scoping is not supported.
Not supported for level 0.
timestamp - Include a timestamp as an element attribute.
Two timestamps are provided, one for humans and one for machines. The times are stamped at element print time.
E.g. <field> becomes <field timestamp_date="Fri May 5 10:23:33 2008" timestamp_sec="123456790">
Scoping is not supported.
insert <name> <value> - Insert a custom element attribute.
The parameters <name> and <value> are combined to form an element attribute with <value> wrapped in double quotes.
E.g. <field> becomes <field name="value">
You can pretty much specify anything as an attribute name and value including illegal XML therefore user discretion is advised.
º Only one parameter group can be specified however this option can be invoked multiple times.
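
The attribute forms above all produce the same shape of opening tag. A minimal Python sketch of that shape is given below. The name opening_tag is hypothetical and, like xmlfy's insert form, the sketch does not validate or escape names and values.

```python
from typing import List, Tuple


def opening_tag(name: str, attributes: List[Tuple[str, str]]) -> str:
    """Build an opening element tag with each value in double quotes."""
    attrs = "".join(f' {key}="{value}"' for key, value in attributes)
    return f"<{name}{attrs}>"


print(opening_tag("field", [("number", "1")]))   # -> <field number="1">
print(opening_tag("field", [("level", "2"), ("x", "y")]))
# -> <field level="2" x="y">
```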

Option: -T, --tag[<level>[:<scope>]] number|level|name <name>|[re]insert <name> <value>|[re]insertfile <name> <file>|[re]insertfilexml <indent> <file>
Modify or insert element tags for the level specified.

<level> - The XML depth level to be modified.
Must be an integer value greater than or equal to 0.
E.g. a value of 1 will modify the tag name for each record and a value of 2 will modify the tag name for each field.
There is no space separating the option and the level value.
If no level is specified then the given options will apply to all levels except level 0.
<scope> - A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
<s1> - integer representing a start range.
<s2> - integer or the $ token representing an end range.
r - restart the scope counter for this tag after the completion of the associated range.
Restrict the custom tag effectiveness to the occurrences specified in <scope>. If the element sequence is not in the scope then the custom tag will not be applied.
The restart scope counter option r allows the scope sequences to continually repeat themselves. E.g. -T2:1-3,5r insert x y inserts the custom tag <x>y</x> before every first to third and fifth level 2 element, and keeps repeating this until the output is exhausted.
Scope sequence counters are always reset to zero for the next element depth level and higher whenever a deeper XML depth level is entered into.
If a <scope> range is not specified then the custom tag function applies to all elements at the specified <level>.
number - Suffix the tag name with its sequence number.
E.g. <line> becomes <line1> and the next <line> becomes <line2> and so on.
Scoping is not supported.
Not supported for level 0.
level - Prefix the tag name with its level.
E.g. <field> becomes <L2field>
Scoping is not supported.
Not supported for level 0.
name <name> - Change the tag name from the default to <name>
Only applicable for changing default behaviour (i.e. when the --schema option is NOT specified).
E.g. <field> becomes <word>
You can pretty much specify anything as a tag name including illegal XML therefore user discretion is advised.
Scoping is not supported.
[re]insert <name> <value> - Insert a custom element tag.
The parameters <name> and <value> are combined to form an element tag with <value> wrapped between <name> tag pairs. E.g <name>value</name>
The inserted element appears before any output elements for the level specified.
The reinsert feature keeps applying itself at the level specified.
You can pretty much specify anything as an element name and value including illegal XML therefore user discretion is advised.
Not supported for level 0.
[re]insertfile <name> <file> - Insert a custom element tag containing contents of a file.
The contents of <file> are wrapped between <name> tag pairs.
The encoding of <file> must match the output encoding being used otherwise an undesirable output will result.
Any BOM found in <file> is removed.
Any reserved XML characters in <file> are escaped, and newlines are corrected.
The inserted element appears before any output elements for the level specified.
The reinsert feature keeps applying itself at the level specified.
You can pretty much specify anything as an element name including illegal XML therefore user discretion is advised.
Not supported for level 0.
[re]insertfilexml <indent> <file> - Insert contents of an XML file.
The entire contents of <file> are inserted before any output elements for the level specified.
The encoding of <file> must match the output encoding being used otherwise an undesirable output will result.
Any BOM found in <file> is removed.
If the parameter <indent> is an integer value greater than or equal to zero then the contents of file are indented by this amount, any XML prologue is removed, and newlines are corrected.
If the parameter <indent> is the value "raw" then the XML file is inserted as is without its BOM.
The reinsert feature keeps applying itself at the level specified.
You can pretty much insert anything as XML file content including illegal XML therefore user discretion is advised.
º Only one parameter group can be specified however this option can be invoked multiple times.
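
The number, level, name and insert forms can be summarised with a short Python sketch that reproduces the examples given above. The function names tag_name and insert_element are hypothetical, and the result of combining number and level in one tag name is an assumption, since the text only shows them separately.

```python
from typing import Optional


def tag_name(default: str, level: int, number: int, *,
             use_number: bool = False, use_level: bool = False,
             name: Optional[str] = None) -> str:
    """Derive an element tag name from the default name and the modifiers."""
    tag = name or default            # "name <name>" replaces the default
    if use_level:
        tag = f"L{level}{tag}"       # "level" prefixes the depth level
    if use_number:
        tag = f"{tag}{number}"       # "number" suffixes the sequence number
    return tag


def insert_element(name: str, value: str) -> str:
    """Model of "insert <name> <value>": value wrapped in a tag pair."""
    return f"<{name}>{value}</{name}>"


print(tag_name("line", 1, 1, use_number=True))    # -> line1
print(tag_name("field", 2, 1, use_level=True))    # -> L2field
print(tag_name("field", 2, 3, name="word"))       # -> word
print(insert_element("x", "y"))                   # -> <x>y</x>
```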

Option: -k, --keyvaluepairs[<level>]
Switch on the generation of key/value XML tag pairs for the output.

<level> - The XML depth level to be modified.
Must be an integer value greater than or equal to 2.
There is no space separating the option and the level value.
If no level is specified then the option will apply to all levels except levels 0 and 1.
º In this mode the data of the first field of the current XML level becomes the tag name for that level, that is, it becomes the key, and any subsequent fields become its value.
º This key/value pairing continues down the XML tree hierarchy for all the XML levels specified.
º You can pretty much generate anything as a tag name including illegal XML therefore user discretion is advised. The new tag name is trimmed of leading and trailing white space and white space between text is replaced with the underscore "_" character.
º If a blank field becomes a tag name candidate then xmlfy will skip it and search along the same level for a more suitable candidate. This behaviour can be mitigated by using the -b option which will force the default tag name to be substituted instead.
º Only applicable for changing default behaviour (i.e. when the --schema option is NOT specified).
º This option can be invoked multiple times.
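
The key selection rules above can be sketched in Python as follows. This only models how a key and its values are chosen from the fields of one level, not how xmlfy prints them. The name key_value, the default tag name "field" and the keep_blank parameter (standing in for the -b option) are assumptions for the example.

```python
import re
from typing import List, Optional, Tuple


def key_value(fields: List[str], default_tag: str = "field",
              keep_blank: bool = False) -> Optional[Tuple[str, List[str]]]:
    """Pick the tag name (key) and the remaining fields (value)."""
    for index, candidate in enumerate(fields):
        # Trim, then replace inner white space with underscores.
        key = re.sub(r"\s+", "_", candidate.strip())
        if key:
            return key, fields[index + 1:]
        if keep_blank:
            # Blank candidate: substitute the default tag name instead.
            return default_tag, fields[index + 1:]
        # Otherwise skip the blank field and try the next one.
    return None


print(key_value(["first name", "John"]))   # -> ('first_name', ['John'])
print(key_value(["", " id ", "7"]))        # -> ('id', ['7'])
```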

Option: -l, --linenumbers
This is a synonym for "-T1 number"
Include the line number in the line tag name.

Option: -f, --fieldnumbers
This is a synonym for "-T2 number"
Include the field number in the field tag name.

Option: -L, --linetags
Insert a line number tag within the XML formatted output.
This is an alternative way of numbering your XML records. E.g. for the first line record of XML output the following tag is inserted <linenumber>1</linenumber> and so on.

Option: -X, --xmlformat [XML1.0|XML1.1]|[SOAP1.1|SOAP1.2]|[HTML table|list]
          |[UTF-8|UTF-16|UTF-16BE|UTF-16LE|UTF-32|UTF-32BE|UTF-32LE]|BOM
          |ASCIItoUTF|[noescape all|amp|lt|gt|quot|apos|brvbar]
          |trimtagclose|[newline dos|unix]
Allows you to specify the XML format to be used for the output.

XML1.0 - Generate XML 1.0 output (this is the default).
XML1.1 - Generate XML 1.1 output.
SOAP1.1 - Generate XML SOAP 1.1 output.
SOAP1.2 - Generate XML SOAP 1.2 output.
HTML - Generate HTML output.
table - elements are displayed in table format.
list - elements are displayed in list format.
UTF-8 - Generate UTF-8 output encoding (default).
UTF-16 - Generate UTF-16 output encoding.
UTF-16BE - Generate UTF-16BE (big-endian) output encoding.
UTF-16LE - Generate UTF-16LE (little-endian) output encoding.
UTF-32 - Generate UTF-32 output encoding.
UTF-32BE - Generate UTF-32BE (big-endian) output encoding.
UTF-32LE - Generate UTF-32LE (little-endian) output encoding.
BOM - Generate and interpret a Byte-Order-Mark.
ASCIItoUTF - Convert ASCII input to wide UTF encoding.
noescape - Do not escape select reserved XML characters.
By default xmlfy will escape reserved XML characters that appear in the input stream and this option provides an adjustment to this behaviour.
all - do not escape any characters.
amp - do not escape the character & (ampersand).
lt - do not escape the character < (less-than).
gt - do not escape the character > (greater-than).
quot - do not escape the character " (quote).
apos - do not escape the character ' (apostrophe).
brvbar - do not escape the character ¦ (broken vertical bar).
trimtagclose - Truncate superfluous characters from the closing tag name.
newline - Select the line ending format for XML meta-data.
dos - use carriage-return and new-line ("\r\n") for line endings.
unix - use new-line ("\n") for line endings.
º The XML1.1 option only changes the prologue version string to "1.1"; nothing else is altered.
º When using the SOAP* options, the normal XML output generated by xmlfy is encapsulated in a SOAP Envelope and SOAP Body. The root tag defines a namespace prefix of "x" with a URI reference that can be adjusted with the -I option, and all child elements (records and fields) use this prefix name.
A non-mandatory administrative header element with a prefix name of "xh" is provided containing program and execution details.
The SOAP* options are only a basic implementation for generating a simple XML SOAP envelope containing xmlfy data. There is no further scope provided for SOAP Headers, SOAP Faults, transaction or protocol handling.
º When using the HTML option, the normal XML output generated by xmlfy is displayed in either a table or list layout and encapsulated in an HTML Body. The document title can be adjusted with the -I option.
º The UTF-* options tell xmlfy to use the specified encoding for all its XML meta-data (element tags, element attributes, prologues, etc). Other than with the ASCIItoUTF option, no transformation of the input stream is performed and xmlfy assumes that the input stream already uses the encoding specified. If it does not, the output will mix the encoding of the input data with a different encoding for the XML meta-data.
If specifying the UTF-16 or UTF-32 parameter and the BOM option is either not specified or there is no BOM in the input stream then encoding in big-endian format will be assumed.
º The BOM (Byte-Order-Mark) option will force xmlfy to handle the BOM in the input stream if it is there, and also generate a BOM in the output stream.
If specifying the BOM option and a BOM is found in the input stream then that will override any user specified encoding option.
The BOM byte sequence used for UTF-8 is 0xef 0xbb 0xbf (U+FEFF).
The BOM byte sequence used for UTF-16BE is 0xfe 0xff (U+FEFF).
The BOM byte sequence used for UTF-16LE is 0xff 0xfe (U+FEFF in little-endian byte order).
The BOM byte sequence used for UTF-32BE is 0x00 0x00 0xfe 0xff (U+FEFF).
The BOM byte sequence used for UTF-32LE is 0xff 0xfe 0x00 0x00 (U+FEFF in little-endian byte order).
º The ASCIItoUTF option when used in conjunction with one of the UTF-* options will process ASCII input and convert it to the wide UTF encoding specified.
º The noescape options control which reserved XML characters should not be escaped.
º The trimtagclose option trims back the closing tag from the first white space character found. Some options allow the user to define anything as a tag name, including tag names that carry element attributes (a non-standard approach). Using this option under these circumstances will prevent these element attributes from appearing in the close tag.
º The newline option adjusts the line ending format used for XML meta-data. On Unix platforms the default is unix and on Win32 platforms the default is dos. Only applies to XML meta-data output and does not do conversion of newline characters found in the input stream.
º Only one parameter group can be specified; however, this option can be invoked multiple times.
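
The BOM handling described in the notes above can be sketched in Python. This is a minimal illustration rather than xmlfy's code: detect_bom and resolve_encoding are hypothetical names, and the sketch assumes the BOM option is switched on. The signatures come from Python's standard codecs module and match the byte sequences listed above.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def resolve_encoding(data, requested="UTF-8"):
    """A BOM in the input wins; bare UTF-16/UTF-32 fall back to big-endian."""
    found, _ = detect_bom(data)
    if found:
        return found
    if requested in ("UTF-16", "UTF-32"):
        return requested + "BE"
    return requested
```

Checking the four-byte signatures before the two-byte ones matters because a UTF-32LE stream would otherwise be reported as UTF-16LE.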

Option: -p, --printonly header|footer|rtagopen|rtagclose|records
Allows you to just print XML snippets to the output.
This is useful when you want to execute xmlfy multiple times to construct a single XML output file.

header - Will only print the prologue, doctype, opened SOAP Envelope and Body tags, the SOAP Header tag, HTML headers, and the BOM.
footer - Will only print closed SOAP Envelope and Body tags, and closed HTML tags.
rtagopen - Will only print an opened root element tag.
rtagclose - Will only print a closed root element tag.
records - Will only print record elements and their field elements.
º Only one parameter can be specified; however, this option can be invoked multiple times.

Option: -I, --identifier <system_identifier>
Allows you to specify your own system identifier of the doctype should you not be content with what xmlfy has specified.

<system_identifier> - An array of characters used to override the default system identifier.
Almost anything can be specified as a system identifier, including illegal XML, so user discretion is advised.
º By default xmlfy will use the string "xmlfy.dtd", or if specifying a schema, use the schema filename as the system identifier.
º You can also use this option to override the default SOAP namespace URI value for the root element when using the XML SOAP format options.
º You can also use this option to override the document title in the HTML header when using the XML HTML format options.

Option: -s, --summary[2|c|n|f <file>]
When all input is exhausted an XML summary element is printed at the bottom providing a brief summary of what xmlfy processed.

2 - Print the summary element to stderr instead.
c - Print the summary element as an XML comment.
n - Print the summary element without calculating any message digests.
f <file> - Print the summary element to <file>.

By default the summary contains MD5 and SHA512 checksum elements named md5_input, md5_output, sha512_input and sha512_output. The md5_input and sha512_input checksums are a digest of all the input that was actually processed, including any input BOM. The md5_output and sha512_output checksums are a digest of all the output that precedes the XML summary element, including any output BOM. To correctly validate the output result against the output checksum you must first remove any summary element and summary comments from the output result.
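
As a rough sketch of that validation step, the Python below recomputes both digests with the standard hashlib module. It makes two assumptions not confirmed above: that the checksums are printed as hexadecimal text, and that the caller has already removed the summary element and any summary comments from the output bytes. The function names are hypothetical.

```python
import hashlib


def digests(data):
    """Return the MD5 and SHA512 digests of a byte string as hex text."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha512": hashlib.sha512(data).hexdigest(),
    }


def output_matches(output_without_summary, md5_output, sha512_output):
    """Compare recomputed digests with the values read from the summary."""
    actual = digests(output_without_summary)
    # Hex digests are compared case-insensitively.
    return (actual["md5"] == md5_output.strip().lower()
            and actual["sha512"] == sha512_output.strip().lower())
```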

Option: -U, --unxml
Read XML formatted input and remove all that bracket racket, reverting your XML document back to a plain format. Works best with XML output generated by xmlfy but can also be used with caution on other foreign XML documents.
º Can be used in conjunction with the -F<level> <string> option to specify the delimiter to use for each XML depth level. Multiple same level -F options are meaningless in this context and delimiters are only inserted if more than one field is available to be delimited. Field separator scoping options are ignored.
º The default delimiter is a space character for XML depth levels of 2 and above, and new-line for XML depth levels below 2.
º Tag names and their attributes are not included in the output, and anything inside XML comments is filtered out.
º If there is a BOM in the input then xmlfy will use that for the encoding, otherwise xmlfy will look for the opening XML character sequence of "<?" to determine the encoding being used. If neither method finds the correct encoding then you can use the -X UTF-* options as a fallback.
º Basic quoting options are also supported.
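
The default delimiter behaviour described for this option can be approximated in a few lines of Python using the standard xml.etree.ElementTree parser. This is only a sketch of the idea, not how xmlfy works internally: the function name unxml and the delimiters mapping (depth level to delimiter string, standing in for -F<level>) are hypothetical, and text mixed in alongside child elements is ignored.

```python
import xml.etree.ElementTree as ET


def unxml(document, delimiters=None):
    """Flatten an XML document to plain text, one delimiter per depth level."""
    delimiters = delimiters or {}

    def flatten(element, level):
        children = list(element)
        if not children:
            return element.text or ""
        child_level = level + 1
        # Defaults: new-line below level 2, a space for level 2 and above.
        default = "\n" if child_level < 2 else " "
        delimiter = delimiters.get(child_level, default)
        return delimiter.join(flatten(child, child_level) for child in children)

    # The parser drops comments, tag names and attributes on its own.
    return flatten(ET.fromstring(document), 0)


doc = ("<xmlfy><line><field>a</field><field>b</field></line>"
       "<line><field>c</field></line></xmlfy>")
print(unxml(doc))
```

Calling unxml(doc, {2: ","}) would join the fields of each record with a comma instead of a space.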

Option: --noxml
Do not XML-fy the input stream but do process it for reserved XML characters (this feature was initially written for formatting the xmlfy HTML test reports that use wide encodings). Used in conjunction with the -X options to control the conversion of reserved characters and/or to transform the input stream to wide UTF encodings.
E.g. To transform an ASCII input stream to UTF-16BE encoding with a BOM:
xmlfy --noxml -X UTF-16BE -X ASCIItoUTF -X noescape all -X BOM
E.g. To just escape select reserved XML characters in a UTF-32LE input stream:
xmlfy --noxml -X UTF-32LE -X noescape amp

Important note on specifying options.
The way xmlfy handles options is deliberately simple and can be easily confused if you don't follow the syntax specified for each option. The getopt library has been deliberately avoided to keep xmlfy portable.

xmlfy first evaluates options supplied on the command line. If a schema file is supplied then xmlfy will also look for options in that file and evaluate them too. See the schema file section below on how to specify xmlfy options inside a schema file.

OUTPUT

How it works.
The input processor used by xmlfy block-reads unprocessed bytes from standard input (stdin) and stores them in an array the size of a level 1 record. This level 1 record is then processed for fields, sub-fields and so on by marking their positions in this array. Dynamic memory handling is used.

The output processor used by xmlfy takes the results from the input processor and re-packages it with suitably encoded XML syntax. Any input characters that are reserved for XML are by default re-represented in their escaped form.

Character & (ampersand) becomes string &amp;
Character < (less-than) becomes string &lt;
Character > (greater-than) becomes string &gt;
Character " (quote) becomes string &quot;
Character ' (apostrophe) becomes string &apos;
Character ¦ (broken vertical bar) becomes string &brvbar;
The output processor then writes processed bytes to a block buffer for printing to standard output (stdout).
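
The default escaping listed above, together with the -X noescape adjustments, can be sketched in Python. This is a hedged illustration and not xmlfy's implementation; the names ESCAPES and escape_xml are hypothetical. A single-pass character translation is used so that the ampersands introduced by one replacement are never escaped a second time.

```python
# noescape parameter name -> (reserved character, escaped form)
ESCAPES = {
    "amp": ("&", "&amp;"),
    "lt": ("<", "&lt;"),
    "gt": (">", "&gt;"),
    "quot": ('"', "&quot;"),
    "apos": ("'", "&apos;"),
    "brvbar": ("\u00a6", "&brvbar;"),
}


def escape_xml(text, noescape=()):
    """Escape reserved characters except those named in noescape."""
    if "all" in noescape:
        return text
    table = {
        ord(char): entity
        for name, (char, entity) in ESCAPES.items()
        if name not in noescape
    }
    # str.translate visits each input character exactly once.
    return text.translate(table)


print(escape_xml('a < b & "c"'))  # a &lt; b &amp; &quot;c&quot;
```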

Using a schema file.
The default schema used by xmlfy is hard coded and can be described as follows:
In DTD schema form:

    <!ELEMENT xmlfy (line*)>
    <!ELEMENT line (field*)>
    <!ELEMENT field (#PCDATA)>
    

In RNC schema form:

    start = xmlfy
    xmlfy = element xmlfy { line* }
    line = element line { field* }
    field = element field { text }
    

In XSD schema form:

    <xs:schema>
      <xs:element name="xmlfy">
        <xs:sequence>
          <xs:element name="line" type="lineType" minOccurs="0" maxOccurs="unbounded" />
        </xs:sequence>
      </xs:element>
      <xs:complexType name="lineType">
        <xs:sequence>
          <xs:element name="field" type="xs:string" minOccurs="0" maxOccurs="unbounded" />
        </xs:sequence>
      </xs:complexType>
    </xs:schema>
    

A schema file for the ls -la command that produces output like this:

    total 73
    drwx------+  3 ag None     0 Apr 20 19:36 .
    -rwxr-xr-x   1 ag None 15639 Apr 20 19:31 a.exe
    -rwx------+  1 ag None  6354 Apr 20 19:31 xmlfy.c
    -rwx------+  1 ag None  4901 Apr 19  2008 xmlfy.h
    

In DTD schema form will look like this:

    <!ELEMENT ls (total?), (file*)>
    <!ELEMENT total (prompt, totalsize)>
    <!ELEMENT file (permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname)>
    <!ELEMENT date_ty (date_y)>
    <!ELEMENT date_ty (date_h, date_m)>
    <!ELEMENT prompt (#PCDATA)>
    <!ELEMENT totalsize (#PCDATA)>
    <!ELEMENT permission (#PCDATA)>
    <!ELEMENT blocks (#PCDATA)>
    <!ELEMENT user (#PCDATA)>
    <!ELEMENT group (#PCDATA)>
    <!ELEMENT size (#PCDATA)>
    <!ELEMENT date_y (#PCDATA)>
    <!ELEMENT date_M (#PCDATA)>
    <!ELEMENT date_d (#PCDATA)>
    <!ELEMENT date_h (#PCDATA)>
    <!ELEMENT date_m (#PCDATA)>
    <!ELEMENT fname (#PCDATA)>
    

and should be saved to a file as ls.dtd and invoked as:

    % ls -la | xmlfy --schema ls.dtd -F3 :
    

In RNC schema form will look like this:

    start = ls
    ls = element ls { total? | file* }
    total = element total { prompt, totalsize }
    file = element file { permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname }
    date_ty = element date_ty { date_y }
    date_ty |= element date_ty { date_h, date_m }
    prompt = element prompt { text }
    totalsize = element totalsize { text }
    permission = element permission { text }
    blocks = element blocks { text }
    user = element user { text }
    group = element group { text }
    size = element size { text }
    date_y = element date_y { text }
    date_M = element date_M { text }
    date_d = element date_d { text }
    date_h = element date_h { text }
    date_m = element date_m { text }
    fname = element fname { text }
    

and should be saved to a file as ls.rnc and invoked as:

    % ls -la | xmlfy --schema ls.rnc -F3 :
    

In XSD schema form will look like this:

    <xs:schema>
      <xs:element name="ls" type="lsType" />
      <xs:complexType name="lsType">
        <xs:sequence>
          <xs:element name="total" type="totalType" minOccurs="0" />
          <xs:element name="file" type="fileType" minOccurs="0" maxOccurs="unbounded" />
        </xs:sequence>
      </xs:complexType>
      <xs:complexType name="totalType">
        <xs:sequence>
          <xs:element name="prompt" type="xs:string" />
          <xs:element name="totalsize" type="xs:string" />
        </xs:sequence>
      </xs:complexType>
      <xs:complexType name="fileType">
        <xs:sequence>
          <xs:element name="permission" type="xs:string" minOccurs="0" />
          <xs:element name="blocks" type="xs:string" minOccurs="0" />
          <xs:element name="user" type="xs:string" minOccurs="0" />
          <xs:element name="group" type="xs:string" minOccurs="0" />
          <xs:element name="size" type="xs:string" minOccurs="0" />
          <xs:element name="date_M" type="xs:string" minOccurs="0" />
          <xs:element name="date_d" type="xs:string" minOccurs="0" />
          <xs:element name="date_ty" type="datetyType" minOccurs="0" />
          <xs:element name="fname" type="xs:string" />
        </xs:sequence>
      </xs:complexType>
      <xs:complexType name="datetyType">
        <xs:choice>
          <xs:element name="date_y" type="xs:string" />
          <xs:sequence>
            <xs:element name="date_h" type="xs:string" />
            <xs:element name="date_m" type="xs:string" />
          </xs:sequence>
        </xs:choice>
      </xs:complexType>
    </xs:schema>
    

and should be saved to a file as ls.xsd and invoked as:

    % ls -la | xmlfy --schema ls.xsd -F3 :
    

Shoe-horning raw data into a structure defined by a schema is straightforward when the input fields have a one-to-one relationship with the fields of the schema elements. However, if wildcard tokens and/or Boolean logic are employed in the schema then it becomes quite a challenge, sometimes even impossible, to be deterministic about which input field belongs to which schema field. Strictly speaking, the main function of a schema is to check that XML is valid, which requires the XML document to already exist. xmlfy does the reverse, building an XML document on the fly while following the rules described by the schema. This is still okay and the resulting XML can be considered to be both valid and well formed.

xmlfy employs two techniques to help with this shoe-horning input data problem. The first technique xmlfy uses is recognising multiple element definitions that have the same name. This allows you to capture your schema elements under a variety of input circumstances without having to create a unique element for each circumstance - you can still do that if you want. The second technique xmlfy uses is auto-generated field match constraint helpers to assist in matching the input fields to the elements described by the schema. These helpers are useful in improving the speed of xmlfy particularly when using compound element structures and wildcard tokens in the schema hierarchy. After the schema file is loaded into memory, an array of helpers is generated for each element that describes all combinations of the schema tree traversal paths that can be taken and associates each combination with the minimum, maximum and last number of fields required for a match against the number of available input fields. For example, using the above schema a match will occur for:

total(min=2, max=2, last=2) when input fields = 2.
file(min=1, max=9, last=1) when 1 <= input fields <= 9
and date_ty is a single field (min=1, max=1, last=1).
file(min=1, max=10, last=1) when 1 <= input fields <= 10
and date_ty is two fields (min=2, max=2, last=2).
By default xmlfy continuously iterates through just the record elements of the root element looking for element helpers that can fully satisfy the requirements of that particular element's schema tree hierarchy for the given input fields, after which the matching record element is then checked against its wildcard obligations in the root element definition, and if okay is finally printed.
In match direct mode xmlfy only looks at the element helpers of the targeted element, and if that element can fully satisfy the requirements of its schema tree hierarchy for the given input fields, is printed in its entirety only once as the root element.
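
To make the minimum and maximum figures in the helper example concrete, the Python sketch below computes them for the ls schema. It is an approximation built for illustration, not xmlfy's helper generator: the dictionary layout, the name bounds and the handling of wildcards are assumptions, and the "last" value is not modelled. Each element maps to a list of alternative definitions, mirroring the two date_ty definitions, and any name without a definition counts as a single text field.

```python
from itertools import product

INF = float("inf")

LS_SCHEMA = {
    "total": [[("prompt", ""), ("totalsize", "")]],
    "file": [[("permission", "?"), ("blocks", "?"), ("user", "?"),
              ("group", "?"), ("size", "?"), ("date_M", "?"),
              ("date_d", "?"), ("date_ty", "?"), ("fname", "")]],
    "date_ty": [[("date_y", "")], [("date_h", ""), ("date_m", "")]],
}


def bounds(name, schema):
    """Return one (min, max) input field count per schema traversal path."""
    alternatives = schema.get(name)
    if alternatives is None:  # undefined element: a single text field
        return [(1, 1)]
    results = []
    for fields in alternatives:
        per_field = []
        for child, wildcard in fields:
            options = []
            for low, high in bounds(child, schema):
                if wildcard in ("?", "*"):
                    low = 0       # the field may be absent
                if wildcard in ("*", "+"):
                    high = INF    # the field may repeat without limit
                options.append((low, high))
            per_field.append(options)
        # Every combination of child alternatives is one traversal path.
        for combo in product(*per_field):
            results.append((sum(low for low, _ in combo),
                            sum(high for _, high in combo)))
    return results


print(bounds("total", LS_SCHEMA))  # [(2, 2)]
print(bounds("file", LS_SCHEMA))   # [(1, 9), (1, 10)]
```

The two results for file correspond to date_ty being one field or two, which agrees with the ranges listed in the example.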

To specify xmlfy options inside a schema file you encapsulate them inside a special token that is in effect a schema comment.

DTD and XSD example:

    <!-- xmlfy-args: -F1 "\n" -F2 ABC -q -Q \"\' -->

RNC example:

    ## xmlfy-args: -F1 "\n" -F2 ABC -q -Q \"\'

This special token must appear complete on a single line, starting at the left-most column. Spacing is important and only the first occurrence is recognised, so ideally place it near the top of the schema file. The schema option syntax is the same as the command line option syntax, except that some options are not allowed, e.g. --schema.
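
The placement rules for this token can be illustrated with a short Python sketch. It is not xmlfy's parser: find_xmlfy_args is a hypothetical name, and splitting the option string with Python's shlex module is an assumption that only approximates shell-style quoting, so xmlfy's own tokenising may differ in detail.

```python
import re
import shlex

# The token must start in the left-most column of a single line.
TOKEN = re.compile(r"^(?:<!--|##) xmlfy-args:\s*(.*?)\s*(?:-->)?\s*$")


def find_xmlfy_args(schema_text):
    """Return the option list from the first xmlfy-args line, or None."""
    for line in schema_text.splitlines():
        match = TOKEN.match(line)
        if match:
            # Only the first occurrence is recognised.
            return shlex.split(match.group(1))
    return None


schema = '<!ELEMENT a (b)>\n<!-- xmlfy-args: -F2 ABC -q -->\n<!-- xmlfy-args: -t -->\n'
print(find_xmlfy_args(schema))  # ['-F2', 'ABC', '-q']
```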

LIMITATIONS

xmlfy has been successfully tested on average hardware with input records containing over 10,000,000 fields whilst using a complex schema tree structure and multi level delimiters.

Currently the xmlfy schema file parser is not that sophisticated and exhibits the following behaviour:

DTD schema

  • Only recognises the <!ELEMENT> directive and ignores all others.
  • The first valid <!ELEMENT> definition becomes the root element.
  • Element fields that don't have an element definition default to being (#PCDATA).
  • Elements defined as (#PCDATA) or (#CDATA) are ignored, causing the referring field to default to (#PCDATA); however, it is good practice to include these elements in order to furnish a complete DTD schema.
  • Only honours the +, ? and * wildcard tokens.
  • At this stage does not honour field group sets () and or-ing | syntax tokens.

RNC schema

  • Only recognises named directives and ignores all others.
  • The element named "start" becomes the root element.
  • Element fields that don't have an element definition default to being { text }.
  • Elements defined as { text } are ignored, causing the referring field to default to { text }; however, it is good practice to include these elements in order to furnish a complete RNC schema.
  • Only honours the +, ? and * wildcard tokens.
  • At this stage does not honour field group sets () and or-ing | syntax tokens.

XSD schema

  • Only recognises the <schema>, <element>, <complexType>, <ref>, <sequence>, and <choice> directives and ignores all others.
  • The recognised directives are not fully implemented and their use should be kept straightforward.
  • The first valid <element> definition becomes the root element.
  • Element types that are not of matchable complexType are treated as "xs:string" regardless of what type is specified.
  • Only honours the minOccurs="0", maxOccurs="0" and maxOccurs="unbounded" wildcard attributes.
  • At this stage does not honour group sets but does have limited support for choices.

All schema types

  • The fields of the root element define all the level 1 elements (let's call the fields that have their own branch structure record elements).
  • The fields of the record elements simply represent other elements and unlimited element nesting is allowed.
  • By default fields of the root element that are not record elements are ignored. Use the match direct option to match targeted elements in their entirety.
  • The field names that are specified in the element definitions are read from left to right and matched against a field number calculation on the input fields, and then matched again on any wildcard tokens.
  • You can wildcard many fields but you should think clearly about what you are trying to achieve and whether it is at all possible. For example, the following DTD is perfectly suitable for checking for valid XML, but xmlfy cannot reliably shoe-horn input data into DTD elements a, b and c because more than one field has a wildcard token that matches none or many input fields.
    <!ELEMENT parent (record)>
    <!ELEMENT record (a*, b, c*)>
    <!ELEMENT a (#PCDATA)>
    <!ELEMENT b (#PCDATA)>
    <!ELEMENT c (#PCDATA)>
    In the above example xmlfy will allocate ALL input fields to element <a>, which may not be what was intended.

RETURN VALUES

  0  Normal exit.
 -1  Invalid argument specified.
 -2  Error processing schema file contents.
 -3  Infinite loop detected when matching input against schema elements.
-10  Out of memory.

AUTHOR

Originally written by Arthur Gouros.
This software also contains material derived from Ville Laurikari's TRE regex library.
This software also contains material derived from the US Secure Hash Algorithms (RFC4634).
This software also contains material derived from the RSA Data Security, Inc. MD5 Message-Digest Algorithm.

LICENSE

BSD License for xmlfy
Copyright © 2008-2020, Arthur Gouros
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of Arthur Gouros nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

SEE ALSO

The full documentation of the xmlfy project can be found on the web at:

http://xmlfy.sourceforge.net

The website is updated more frequently than the man pages and should be considered the authoritative source of information.