xmlfy(1)
xmlfy - Convert to XML on the fly.
xmlfy [OPTION]...
The xmlfy command reads stdin and outputs it to stdout in XML format using supplied control directives.
Delimiter tokens and/or column selections break the input stream into XML elements, which are then arranged in an XML tree hierarchy that can span multiple depth levels. For example, command line output was originally designed for text or CRT based processing, where a new-line often marks the end of a record and white space often separates fields. The xmlfy command takes this text based output and reformats it into XML output suitable for interfacing with modern object oriented systems.
xmlfy is a powerful yet lightweight tool that primarily caters for converting ASCII, UTF-8, UTF-16 or UTF-32 based output into XML format on the fly and dealing with common issues associated with this kind of transformation.
The xmlfy command also supports a basic version of a schema configuration allowing you to control the format of the XML output by supplying a schema file as an option.
With no options supplied xmlfy will use default values for its XML format. The entire standard input will be enclosed in <xmlfy></xmlfy> pairs, each line of standard input will be enclosed in <line></line> pairs, and each field of each line will be enclosed in <field></field> pairs.
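The default behaviour described above can be pictured with a small Python sketch. It is not xmlfy's implementation: the function name xmlfy_default is hypothetical, the indentation is an assumption, and the prologue and doctype that xmlfy prints are left out. Only the tag names, the default delimiters, the skipping of blank lines and the escaping of reserved characters are taken from this page.

```python
from xml.sax.saxutils import escape


def xmlfy_default(text: str) -> str:
    """Wrap text using the documented default tag names (illustration only)."""
    out = ["<xmlfy>"]
    for line in text.split("\n"):      # default level 1 delimiter: new-line
        fields = line.split()          # default level 2 delimiter: white space
        if not fields:
            continue                   # blank lines are ignored unless -b is given
        out.append("  <line>")
        for field in fields:
            # Reserved XML characters are escaped by default.
            out.append(f"    <field>{escape(field)}</field>")
        out.append("  </line>")
    out.append("</xmlfy>")
    return "\n".join(out)


if __name__ == "__main__":
    print(xmlfy_default("alice 30\nbob 25\n"))
```

Each input line becomes one <line> element holding one <field> element per white space separated token.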
You can supply options to customise the behaviour of xmlfy at the command line, or by a special token inside the schema file, or both. NOTE: Options are resolved from left to right. If any conflicting options are specified then the last one will have precedence.
Option: -h, --help
The command line usage is printed in plain text format, not in
XML format.
Option: -v, --version
The version number is printed in plain text format, not in XML
format. If the version number is required in XML format, it is
included with the summary option.
Option: --license
Print all licenses used by xmlfy.
Option: --debug
Print extra debugging information to stderr to help debug xmlfy
behaviour.
Input options:
Option: -F, --fieldseparator[<level>[b][:<scope>]] <string>
Allows you to specify a delimiter string token for the level specified.
<level> | - | The XML depth level to be delimited by <string>.
Must be an integer value greater than or equal to 1. E.g. a value of 1 will split the input into records delimited by <string>, a value of 2 will split records into fields delimited by <string>, a value of 3 will split fields into subfields delimited by <string>, and so on. There is no space separating the option and the level value. If no level is specified then the given options will only apply to level 2. |
b | - | Use byte matching for the specified delimiter string.
By specifying this option the delimiter string is treated as just a literal sequence of bytes. Normally command line arguments are presented to xmlfy as ASCII strings and if wide UTF encoding like UTF-16 or UTF-32 is being used then xmlfy will automatically convert the specified delimiter string to that encoding. With this option no encoding conversion takes place. In this mode you can also specify escaped decimal byte sequences inside the delimiter string. E.g. "\123\234\\" |
<scope> | - | A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
E.g. -F3:1-3,8 "." creates level 3 fields only for the 1st to 3rd and the 8th occurrences of the delimiter "." (period). The restart scope counter option r makes a scope sequence repeat. E.g. -F1:2,5r "\n" creates level 1 records out of every second and fifth line, repeating until the input is exhausted. When multiple delimiters are used at the same level, restarting scope counters of that level and higher are reset whenever a delimiter match is applied. If no <scope> range is specified then the delimiter applies to every occurrence of <string> at the target level. |
<string> | - | A sequence of characters or token to be used as a delimiter. Tokens specified literally as "\n", "\r", and "\t" are translated to their corresponding control character. If using wide UTF encoding then <string> is automatically converted to that encoding, otherwise you can use the byte matching option and specify escaped decimal byte sequences inside <string>. |
º | If the delimiter token is the same for a series of levels then the shallowest level takes precedence, unless the shallowest levels have been limited by scope restrictions. You can also make use of quotes in the input along with specifying quote options. |
º | The XML tree algorithm deepens in a sequential way therefore you must set your delimiter levels as an unbroken sequence for them to be of any use, that is you cannot split a level 2 field with a level 4 delimiter string. |
º | Refer to the schema option section for information on level handling when a schema file is specified. |
º | Levels 1 and 2 are already set by default. |
º | The default level 1 delimiter token is NEWLINE (new-line). |
º | The default level 2 delimiter token is WHITESPACE (space, tab, new-line, carriage-return, vertical-tab and form-feed). |
º | The delimiters for levels 3 and above are unset. |
º | Only one delimiter string token can be specified however this option can be invoked multiple times allowing for multiple delimiters to be used at the level specified. When specifying multiple same level delimiters, the larger delimiter strings are matched before the smaller ones. The delimiter string is not included in the output. |
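The <scope> sub form <s1>[-<s2>][r][,..] can be read as a set of occurrence numbers plus an optional restart flag. The Python sketch below is one possible reading, not xmlfy's own code: parse_scope and in_scope are hypothetical names, and the rule that a restarting counter wraps after the highest listed position is an assumption drawn from the -F1:2,5r example.

```python
def parse_scope(spec: str):
    """Parse '<s1>[-<s2>][r][,..]' into (set of positions, restart flag)."""
    positions, restart = set(), False
    for item in spec.split(","):
        if item.endswith("r"):
            restart, item = True, item[:-1]
        first, _, last = item.partition("-")
        start = int(first)
        end = int(last) if last else start
        positions.update(range(start, end + 1))
    return positions, restart


def in_scope(occurrence: int, spec: str) -> bool:
    """Return True if the 1-based occurrence number falls inside the scope."""
    positions, restart = parse_scope(spec)
    if restart:
        # Assumption: the counter restarts after the highest listed position.
        period = max(positions)
        occurrence = (occurrence - 1) % period + 1
    return occurrence in positions


if __name__ == "__main__":
    print([n for n in range(1, 13) if in_scope(n, "1-3,8")])
    print([n for n in range(1, 13) if in_scope(n, "2,5r")])
```

Under this reading the scope 1-3,8 selects occurrences 1, 2, 3 and 8 only, while 2,5r selects 2 and 5 and then the same positions again in every following block of five.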
Option: -R, --recordseparator <string>
This is a synonym for "-F1 <string>".
Allows you to specify a record separator string token that
is different from the default. The default record separator
token is NEWLINE (new-line).
Option: -C, --column[:<scope>] <c1>-<c2>[:<name>]
Use an input column range of the input record to generate
an input field. This is an alternative method of capturing
input fields from using delimiters.
<scope> | - | A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
The restart scope counter option r allows the scope sequences to continually repeat themselves. E.g -C:1-3,5r 1-20 this is saying capture column fields of 20 characters in length for every first to third and fifth input records, and keep repeating this until the input is exhausted. If a <scope> range is not specified then the column option applies to all input records. |
<c1> | - | Integer or the $ token representing the start column of the input field. |
<c2> | - | Integer or the $ token representing the end column of the input field. |
<name> | - | Optional string value that will be used to
override the tag name for this input field.
You can pretty much specify anything as a tag name including illegal XML therefore user discretion is advised. Only applicable for changing default behaviour (i.e. when the --schema option is NOT specified). |
º | Specifying field separators of level 2 and above with this option is conflicting and will produce a usage error. |
º | The number of times and order in which this option is specified (in conjunction with the -W option) determines the number of input fields generated and their order. |
º | Column ranges represent code points (characters) meaning any multi byte character will only account for just one column position. |
º | Multiple options can use non linear ranges and can overlap e.g. -C 5-10:part -C 1-$:whole |
º | Ranges that exceed the size of the input record will not process beyond the end of the input record. |
º | You can use single or double quotes to protect the range from the shell interpreter e.g. -C '80-$:text' |
º | Only one parameter pair can be specified however this option can be invoked multiple times. |
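A column specification of the form <c1>-<c2>[:<name>] can be sketched in Python as below. This is an illustration under stated assumptions rather than xmlfy's implementation: column_field is a hypothetical name, columns are taken to be 1-based and inclusive (which agrees with 1-20 capturing 20 characters), and the fallback tag name "field" is assumed from the default tag names.

```python
def column_field(record: str, spec: str):
    """Apply a '<c1>-<c2>[:<name>]' column spec to one input record.

    Returns (tag name, captured text). '$' stands for the last column.
    """
    col_range, _, name = spec.partition(":")
    first, _, last = col_range.partition("-")
    last_col = len(record)  # len() counts code points, not bytes
    c1 = last_col if first == "$" else int(first)
    c2 = last_col if last == "$" else int(last)
    # Slicing past the end of a str is safe, mirroring the note that ranges
    # exceeding the record do not process beyond the end of the record.
    return name or "field", record[c1 - 1:c2]


if __name__ == "__main__":
    record = "héllo wörld"
    print(column_field(record, "1-5:part"))
    print(column_field(record, "7-$:rest"))
```

Because Python strings are indexed by code point, the multi byte characters in the sample record each occupy one column, as the notes above require.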
Option: -W, --regex[:<scope>] [E|B][i][l][r][U][n][b][e]/<pattern>/[<name>[,..]]
Use a regular expression on the input record to generate
input fields. This is an alternative method of capturing
input fields from using delimiters.
<scope> | - | A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
The restart scope counter option r allows the scope sequences to continually repeat themselves. E.g -W:1-3,5r /(^A.*).*(B.*$)/ this is saying capture two regex fields for every first to third and fifth input records, and keep repeating this until the input is exhausted. If a <scope> range is not specified then the regex option applies to all input records. |
E | - | flag to use Extended Regular Expressions in <pattern> (default). |
B | - | flag to use Basic Regular Expressions in <pattern>. |
i | - | flag to ignore case. |
l | - | flag to treat <pattern> as a literal. |
r | - | flag to make concatenation right associative. |
U | - | flag to make operators ungreedy by default. |
n | - | flag to give '\n' special meaning (REG_NEWLINE). |
b | - | flag to set '^' as not beginning-of-line (REG_NOTBOL). |
e | - | flag to set '$' as not end-of-line (REG_NOTEOL). |
<pattern> | - | A POSIX 1003.2 compliant Regular Expression pattern utilising zero or more parenthesis pairs to capture input fields. |
<name> | - | Optional string value that will be used to
override the tag name for input fields derived from pattern matches.
A comma separated list of <name> can be specified with the last entry being re-used if more input fields than names are generated. You can pretty much specify anything as a tag name including illegal XML therefore user discretion is advised. Only applicable for changing default behaviour (i.e. when the --schema option is NOT specified). |
º | Specifying field separators of level 2 and above with this option is conflicting and will produce a usage error. |
º | The number of times and order in which this option is specified (in conjunction with the -C option) determines the number of input fields generated and their order. |
º | If matches are not made for all parenthesis pairs specified in <pattern> then no output will result. |
º | If no parenthesis pairs are specified in <pattern> then the entire input record will be used as the output when a pattern match occurs. |
º | Wide UTF encoding can be specified in <pattern> by using the \x literal followed by two hexadecimal digits to represent any byte inside the code-point e.g. \x0b. |
º | For further information on using regex syntax and its flags please consult the TRE web documentation. |
º | You can use single or double quotes to protect <pattern> from the shell interpreter e.g. -W 'iU/(^Pam .*)/pams' |
º | You can specify the percentage character % as an alternative separator to forward-slash / for <pattern> so long as it remains paired. |
º | Only one parameter pair can be specified however this option can be invoked multiple times. |
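The way captured groups are paired with the optional <name> list can be sketched as follows. This is only an approximation: it uses Python's re module, whose syntax differs from the POSIX patterns that xmlfy accepts, regex_fields is a hypothetical name, and the fallback tag name "field" is assumed from the default tag names.

```python
import re


def regex_fields(record: str, pattern: str, names: str = ""):
    """Return [(tag, text), ...] for one record, or [] when nothing is output."""
    match = re.search(pattern, record)
    if match is None:
        return []
    groups = match.groups()
    if not groups:
        # No parenthesis pairs: the entire input record becomes the field.
        groups = (record,)
    elif any(g is None for g in groups):
        # Not every parenthesis pair matched, so no output results.
        return []
    tags = names.split(",") if names else ["field"]
    # Re-use the last name once the list of names runs out.
    return [(tags[min(i, len(tags) - 1)], g) for i, g in enumerate(groups)]


if __name__ == "__main__":
    print(regex_fields("Pam Smith 42", r"^(\w+) (\w+) (\d+)$", "first,rest"))
```

With three captured groups and only two names, the third field re-uses the last name in the list.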
Option: -e, --expelempty
Expel input fields that are empty (zero bytes in length) from
being processed. The use of multi level and multiple same level
delimiters can sometimes yield plenty of empty fields which may be
undesirable. This option expels all the empty input fields from
being processed by the output processor. All levels are examined
and any input records comprised entirely out of empty fields are
also expelled.
This option will always run before any expelling tasks
specified with option -E are run.
This option has no influence on levels subjected to key/value
pairing as that process has its own way of dealing with empty
fields at its target levels.
If a schema is used, this option reduces the number of input
records/fields available for element matching.
Option: -E, --expel <input_records>[:<input_fields>]
Expel selected input records or selected input fields of
selected input records from being processed. Each input record
is checked against the expel criteria and if a match occurs
then these input records or input fields are simply discarded
from being passed onto the xmlfy output processor.
<input_records> | - | A comma delimited set of input record
expel criteria with no spaces.
The <input_records> parameter has a sub form of <range_type><r1>[-<r2>][/<string>/][,..] Where <range_type> can be 'n', 'f' or 'c'.
E.g. -E n10-$,f7-8,f4/Mercedes/,c10-20,c1-15/SUV/ this is saying that input records whose record number is greater than or equal to 10, AND input records whose total number of fields are between 7 and 8, AND input records whose 4th input field contains the string "Mercedes", AND input records whose input record length is greater than or equal to 10 but less than or equal to 20 characters, AND input records whose first 15 characters contain the string "SUV", will finally match the input record expel criteria. In this release you can only specify the $ token (last input record) in a paired range and not on its own. Generally xmlfy can figure out where the search string delimiters would likely occur however you can specify the % character as an alternative separator to / for <string> so long as it remains paired. If an <input_fields> criteria is not specified then the entire input record is expelled. |
<input_fields> | - | A comma delimited set of field number ranges with no spaces.
The <input_fields> parameter has a sub form of <r1>[-<r2>][,..]
E.g. -E n2-$:1,$ this is saying that input records whose record number is greater than or equal to 2 will have their first and last fields expelled. You can specify the $ token (last input field) in a paired range or on its own. |
º | You can use single or double quotes to protect the range from the shell interpreter e.g. -E 'n2-$:$' |
º | If a schema is used, this option reduces the number of input records/fields available for element matching. |
º | Only one parameter group can be specified however this option can be invoked multiple times with resolution occurring from left to right. |
Option: -q, --quotedfields[2]
Treat fields that are quoted as one field. Normally xmlfy will
parse fields by their delimiter e.g. WHITESPACE, this option
allows multi delimited fields to be specified as one by quoting
them. By default the quoted field may only span the current
input record unless the -q2 option is specified in which case
the quoted field can span multiple input records.
Quotes are not included in the field and any leading/trailing
text outside the field's quotes is truncated.
If quotes are not closed xmlfy will update the field until the
end of the input record, or if option -q2 is specified, until
the input is exhausted (EOF).
The default quote character is a double quote (").
Option: -Q, --quotechars[2] <string>
Specify a string of characters that can be used as quoting
characters.
<string> | - | an array of quoting characters. |
º | If field quoting is enabled then any input character that matches any character in <string> will toggle the quoting function, unless the -Q2 option is specified in which case characters in <string> represent paired quotes with odd numbered characters in this array toggling the open quote function, and its corresponding pair toggling the close quote function. This allows parenthesis, brackets, etc to be used as quotes. |
º | When specifying this option, take care to prevent the shell from interpreting the supplied quote characters. When using a schema file containing this option you can specify quote characters by escaping them with the backslash "\" character. |
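The difference between toggling quotes and the paired quotes of -Q2 can be sketched in Python. The names quote_pairs and split_quoted are hypothetical and the code is a simplified reading of the text above, not xmlfy's parser: it only handles white space as the field delimiter, stays within one input record, and drops any text that touches a quoted run from outside.

```python
def quote_pairs(chars: str) -> dict:
    """Map each opening quote to its closing quote (the -Q2 reading)."""
    if len(chars) % 2:
        raise ValueError("paired quote characters must come in pairs")
    return {chars[i]: chars[i + 1] for i in range(0, len(chars), 2)}


def split_quoted(record: str, pairs: dict) -> list:
    """Split on white space, keeping each quoted run together as one field."""
    fields, current = [], []
    closing = None      # closing quote we are waiting for, if any
    has_field = False   # a field has been started
    quoted = False      # the current field already held a quoted run
    for ch in record:
        if closing is not None:
            if ch == closing:
                closing, quoted = None, True
            else:
                current.append(ch)
        elif ch.isspace():
            if has_field:
                fields.append("".join(current))
            current, has_field, quoted = [], False, False
        elif quoted:
            continue  # trailing text after the closing quote is dropped
        elif ch in pairs:
            current = []  # leading text before the quote is dropped
            closing, has_field = pairs[ch], True
        else:
            current.append(ch)
            has_field = True
    if has_field:
        # An unclosed quote runs to the end of the input record.
        fields.append("".join(current))
    return fields


if __name__ == "__main__":
    print(split_quoted("id (John Smith) [a b]x end", quote_pairs("()[]")))
    print(split_quoted('say "hello world" now', {'"': '"'}))
```

For the toggling form, each quote character is simply mapped to itself, as in the second call.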
Option: -b, --blanklines
Normally xmlfy ignores blank lines or empty level 1 records in
the input stream. This option tells xmlfy to not ignore these
blank lines and print out XML line record tags but with no
elements.
In this mode blank lines count as line numbers.
Option: -t, --trim
Field elements are trimmed of leading and trailing white space.
Output options:
Option: -S, --schema <file> | -Sd, --schemadtd <file> | -Sr, --schemarnc <file> | -Sx, --schemaxsd <file>
Specify a schema <file> for controlling the XML output.
<file> | - | The schema file must comply with either the Document Type Definition (.dtd) language, or the RELAX NG Compact (.rnc) language, or the XML Schema Document (.xsd) language, however xmlfy does not support the finer aspects of these schema languages at this early stage. |
º | When all input fields of the input record have been
identified, xmlfy will match them against the elements
inside the tree hierarchy of the schema file, and if a
match is found then xmlfy will print an output record
using the matching schema tree hierarchy as its XML
structure.
Option -S, --schema uses the case-insensitive file name extension (.dtd or .rnc or .xsd) of <file> to determine which schema interpreter xmlfy will apply. Option -Sd, --schemadtd forces xmlfy to use the DTD schema interpreter on <file>. Option -Sr, --schemarnc forces xmlfy to use the RNC schema interpreter on <file>. Option -Sx, --schemaxsd forces xmlfy to use the XSD schema interpreter on <file>. |
º | You can specify multi level delimiters when using this option however any delimiters greater than level 2 are only used to identify more input fields and are not used at all in altering the XML tree hierarchy as is dictated by the schema file. Fields with levels of 2 and above are flattened to be just plain fields of the input record - this is very different to the default behaviour where field levels form the XML tree hierarchy. |
º | If a schema option is not supplied then xmlfy will use default values for tag names and element control. |
º | For further information on how to write a schema for xmlfy please consult the web documentation. |
Option: -M, --matchdirect 0|<elementname>
Match directly on a specific element in the schema making it the root
element.
0 | - | A token representing the default root element in the schema. |
<elementname> | - | The name of a record element in the schema. |
º | This option alters the way the selected schema element is matched against the available input fields that were generated. In this mode the target element is matched in its entirety using its element helper and printed accordingly. This is very different to the default legacy mode whereby only the record elements of the root element get matched in a continuously sequential way. |
º | Regardless of what wildcard attributes exist for the target element it will only be printed once as a root element. |
º | If a schema file is not specified then this option will be ignored. |
Option: -A, --attribute[<level>[:<scope>]] number|level|delimiter|timestamp|insert <name> <value>
Include attributes in the opening element tag for the level specified.
<level> | - | The XML depth level to be modified.
Must be an integer value greater than or equal to 0. E.g. a value of 1 will apply attributes to each opening record element and a value of 2 will apply attributes to each opening field element. There is no space separating the option and the level value. If no level is specified then the given options will apply to all levels except level 0. |
<scope> | - | A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
The restart scope counter option r allows the scope sequences to continually repeat themselves. E.g -A2:1-3,5r insert x y this is saying insert custom attributes x="y" for every first to third and fifth level 2 elements, and keep repeating this until the output is exhausted. Scope sequence counters are always reset to zero for the next element depth level and higher whenever a deeper XML depth level is entered into. If a <scope> range is not specified then the custom attribute function applies to all elements at the specified <level>. |
number | - | Specify the sequence number as an element attribute.
E.g. <field> becomes <field number="1"> and the next <field> becomes <field number="2"> and so on. Scoping is not supported. Not supported for level 0. |
level | - | Specify the level as an element attribute.
E.g. <field> becomes <field level="2"> Scoping is not supported. Not supported for level 0. |
delimiter | - | Specify the matching delimiter as an element attribute.
E.g. <field> becomes <field delimiter="ABC"> Delimiter string tokens that contain illegal XML characters are printed as their hex pair equivalent. When using a schema file only level 1 records and field elements will have their delimiter attributes printed. Scoping is not supported. Not supported for level 0. |
timestamp | - | Include a timestamp as an element attribute.
Two timestamps are provided, one for humans and one for machines. The times are stamped at element print time. E.g. <field> becomes <field timestamp_date="Fri May 5 10:23:33 2008" timestamp_sec="123456790"> Scoping is not supported. |
insert <name> <value> | - | Insert a custom element attribute.
The parameters <name> and <value> are combined to form an element attribute with <value> wrapped in double quotes. E.g. <field> becomes <field name="value"> You can pretty much specify anything as an attribute name and value including illegal XML therefore user discretion is advised. |
º | Only one parameter group can be specified however this option can be invoked multiple times. |
Option: -T, --tag[<level>[:<scope>]] number|level|name <name>|[re]insert <name> <value>|[re]insertfile <name> <file>|[re]insertfilexml <indent> <file>
Modify or insert element tags for the level specified.
<level> | - | The XML depth level to be modified.
Must be an integer value greater than or equal to 0. E.g. a value of 1 will modify the tag name for each record and a value of 2 will modify the tag name for each field. There is no space separating the option and the level value. If no level is specified then the given options will apply to all levels except level 0. |
<scope> | - | A comma delimited set of sequence ranges with no spaces.
The <scope> parameter has a sub form of <s1>[-<s2>][r][,..]
The restart scope counter option r allows the scope sequences to continually repeat themselves. E.g -T2:1-3,5r insert x y this is saying insert the custom tag <x>y</x> before every first to third and fifth level 2 elements, and keep repeating this until the output is exhausted. Scope sequence counters are always reset to zero for the next element depth level and higher whenever a deeper XML depth level is entered into. If a <scope> range is not specified then the custom tag function applies to all elements at the specified <level>. |
number | - | Suffix the tag name with its sequence number.
E.g. <line> becomes <line1> and the next <line> becomes <line2> and so on. Scoping is not supported. Not supported for level 0. |
level | - | Prefix the tag name with its level.
E.g. <field> becomes <L2field> Scoping is not supported. Not supported for level 0. |
name <name> | - | Change the tag name from the default to <name>
Only applicable for changing default behaviour (i.e. when the --schema option is NOT specified). E.g. <field> becomes <word> You can pretty much specify anything as a tag name including illegal XML therefore user discretion is advised. Scoping is not supported. |
[re]insert <name> <value> | - | Insert a custom element tag.
The parameters <name> and <value> are combined to form an element tag with <value> wrapped between <name> tag pairs. E.g <name>value</name> The inserted element appears before any output elements for the level specified. The reinsert feature keeps applying itself at the level specified. You can pretty much specify anything as an element name and value including illegal XML therefore user discretion is advised. Not supported for level 0. |
[re]insertfile <name> <file> | - | Insert a custom element tag containing contents of a file.
The contents of <file> are wrapped between <name> tag pairs. The encoding of <file> must match the output encoding being used otherwise an undesirable output will result. Any BOM found in <file> is removed. Any reserved XML characters in <file> are escaped, and newlines are corrected. The inserted element appears before any output elements for the level specified. The reinsert feature keeps applying itself at the level specified. You can pretty much specify anything as an element name including illegal XML therefore user discretion is advised. Not supported for level 0. |
[re]insertfilexml <indent> <file> | - | Insert contents of an XML file.
The entire contents of <file> are inserted before any output elements for the level specified. The encoding of <file> must match the output encoding being used otherwise an undesirable output will result. Any BOM found in <file> is removed. If the parameter <indent> is an integer value greater than or equal to zero then the contents of file are indented by this amount, any XML prologue is removed, and newlines are corrected. If the parameter <indent> is the value "raw" then the XML file is inserted as is without its BOM. The reinsert feature keeps applying itself at the level specified. You can pretty much insert anything as XML file content including illegal XML therefore user discretion is advised. |
º | Only one parameter group can be specified however this option can be invoked multiple times. |
Option: -k, --keyvaluepairs[<level>]
Switch on the generation of key/value XML tag pairs for the
output.
<level> | - | The XML depth level to be modified.
Must be an integer value greater than or equal to 2. There is no space separating the option and the level value. If no level is specified then the option will apply to all levels except levels 0 and 1. |
º | In this mode the data of the first field of the current XML level becomes the tag name for that level, that is, it becomes the key, and any subsequent fields become its value. |
º | This key/value pairing continues down the XML tree hierarchy for all the XML levels specified. |
º | You can pretty much generate anything as a tag name including illegal XML therefore user discretion is advised. The new tag name is trimmed of leading and trailing white space and white space between text is replaced with the underscore "_" character. |
º | If a blank field becomes a tag name candidate then xmlfy will skip it and search along the same level for a more suitable candidate. This behaviour can be mitigated by using the -b option which will force the default tag name to be substituted instead. |
º | Only applicable for changing default behaviour (i.e. when the --schema option is NOT specified). |
º | This option can be invoked multiple times. |
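How a key is chosen and turned into a tag name can be sketched in Python. The names make_tag and key_and_values are hypothetical, and two details are assumptions: each white space character inside the key is replaced by its own underscore, and the default tag name is used when no non-blank candidate exists. How xmlfy lays out the remaining values is not reproduced here.

```python
import re


def make_tag(key: str) -> str:
    """Trim the key, then replace inner white space with underscores."""
    return re.sub(r"\s", "_", key.strip())


def key_and_values(fields, default_tag="field"):
    """Pick the first non-blank field as the key; the rest are its values."""
    for i, field in enumerate(fields):
        if field.strip():
            return make_tag(field), fields[i + 1:]
    # Assumption: fall back to the default tag name when every field is blank.
    return default_tag, []


if __name__ == "__main__":
    print(key_and_values(["", "cpu load", "0.42", "0.40"]))
```

The blank first field is skipped, the next candidate supplies the tag name, and the fields after it are returned as the values.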
Option: -l, --linenumbers
This is a synonym for "-T1 number".
Include the line number in the line tag name.
Option: -f, --fieldnumbers
This is a synonym for "-T2 number".
Include the field number in the field tag name.
Option: -L, --linetags
Insert a line number tag within the XML formatted output.
This is an alternative way of numbering your XML records. E.g. for
the first line record of XML output the following tag is
inserted <linenumber>1</linenumber> and so on.
Option: -X, --xmlformat [XML1.0|XML1.1]|[SOAP1.1|SOAP1.2]|[HTML table|list]
|[UTF-8|UTF-16|UTF-16BE|UTF-16LE|UTF-32|UTF-32BE|UTF-32LE]|BOM
|ASCIItoUTF|[noescape all|amp|lt|gt|quot|apos|brvbar]
|trimtagclose|[newline dos|unix]
Allows you to specify the XML format to be used for the output.
XML1.0 | - | Generate XML 1.0 output (this is the default). |
XML1.1 | - | Generate XML 1.1 output. |
SOAP1.1 | - | Generate XML SOAP 1.1 output. |
SOAP1.2 | - | Generate XML SOAP 1.2 output. |
HTML | - | Generate HTML output. |
UTF-8 | - | Generate UTF-8 output encoding (default). |
UTF-16 | - | Generate UTF-16 output encoding. |
UTF-16BE | - | Generate UTF-16BE (big-endian) output encoding. |
UTF-16LE | - | Generate UTF-16LE (little-endian) output encoding. |
UTF-32 | - | Generate UTF-32 output encoding. |
UTF-32BE | - | Generate UTF-32BE (big-endian) output encoding. |
UTF-32LE | - | Generate UTF-32LE (little-endian) output encoding. |
BOM | - | Generate and interpret a Byte-Order-Mark. |
ASCIItoUTF | - | Convert ASCII input to wide UTF encoding. |
noescape | - | Do not escape select reserved XML characters.
By default xmlfy will escape reserved XML characters that appear in the input stream and this option provides an adjustment to this behaviour. |
trimtagclose | - | Truncate superfluous characters from the closing tag name. |
newline | - | Select the line ending format for XML meta-data. |
º | The XML1.1 option only changes the prologue version string to "1.1". |
º | When using the SOAP* options, the normal XML output generated by xmlfy is encapsulated in a SOAP Envelope and SOAP Body, the root tag defines a namespace prefix of "x" with a URI reference that can be adjusted with the -I option, and all child elements (records and fields) use this prefix name.
A non-mandatory administrative header element with a prefix name of "xh" is provided containing program and execution details. The SOAP* options are only a basic implementation for generating a simple XML SOAP envelope containing xmlfy data. There is no further scope provided for SOAP Headers, SOAP Faults, transaction or protocol handling. |
º | When using the HTML option, the normal XML output generated by xmlfy is displayed in either a table or list layout and encapsulated in an HTML Body, whose document title can be adjusted with the -I option. |
º | The UTF-* options tell xmlfy to use the specified encoding for all its XML meta-data (element tags, element attributes, prologues, etc). Other than the ASCIItoUTF option, no transformation of the input stream is performed and xmlfy assumes that the encoding used by the input stream matches the encoding specified, otherwise an undesirable output will result containing different encodings between the input data and XML meta-data.
If the UTF-16 or UTF-32 parameter is specified and either the BOM option is not specified or there is no BOM in the input stream, then big-endian encoding is assumed. |
º | The BOM (Byte-Order-Mark) option will force xmlfy to handle the BOM in the input stream if it is there, and also generate a BOM in the output stream.
If the BOM option is specified and a BOM is found in the input stream then that BOM will override any user specified encoding option. The BOM is always the code point U+FEFF; only its byte sequence differs by encoding. The BOM byte sequence used for UTF-8 is 0xef 0xbb 0xbf. The BOM byte sequence used for UTF-16BE is 0xfe 0xff. The BOM byte sequence used for UTF-16LE is 0xff 0xfe. The BOM byte sequence used for UTF-32BE is 0x00 0x00 0xfe 0xff. The BOM byte sequence used for UTF-32LE is 0xff 0xfe 0x00 0x00. |
º | The ASCIItoUTF option when used in conjunction with one of the UTF-* options will process ASCII input and convert it to the wide UTF encoding specified. |
º | The noescape options control which reserved XML characters should not be escaped. |
º | The trimtagclose option trims back the closing tag from the first white space character found. Some options allow the user to define anything as a tag name including tag names that have element attributes (non normal approach). Using this option under these circumstances will prevent these element attributes from appearing in the close tag. |
º | The newline option adjusts the line ending format used for XML meta-data. On Unix platforms the default is unix and on Win32 platforms the default is dos. Only applies to XML meta-data output and does not do conversion of newline characters found in the input stream. |
º | Only one parameter group can be specified however this option can be invoked multiple times. |
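The BOM byte sequences listed for the BOM option can be checked against the start of an input stream. The following is a minimal Python sketch of that check, not xmlfy's own code; the names `BOMS` and `detect_bom` are hypothetical and only illustrate the lookup.

```python
# BOM byte sequences as listed for the BOM option. Longer sequences come
# first because the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Testing the four-byte sequences first matters: an input that begins with 0xff 0xfe 0x00 0x00 would otherwise be reported as UTF-16LE.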
Option: -p, --printonly header|footer|rtagopen|rtagclose|records
Allows you to just print XML snippets to the output.
This is useful when you want to execute xmlfy multiple times to
construct a single XML output file.
header | - | Will only print the prologue, doctype, opened SOAP Envelope and Body tags, the SOAP Header tag, HTML headers, and the BOM. |
footer | - | Will only print closed SOAP Envelope and Body tags, and closed HTML tags. |
rtagopen | - | Will only print an opened root element tag. |
rtagclose | - | Will only print a closed root element tag. |
records | - | Will only print record elements and their field elements. |
º | Only one parameter can be specified; however, this option can be invoked multiple times. |
Option: -I, --identifier <system_identifier>
Allows you to specify your own system identifier of the doctype
should you not be content with what xmlfy has specified.
<system_identifier> | - | A string used to override the default system identifier.
You can specify almost anything as a system identifier, including illegal XML, therefore user discretion is advised. |
º | By default xmlfy will use the string "xmlfy.dtd", or if specifying a schema, use the schema filename as the system identifier. |
º | You can also use this option to override the default SOAP namespace URI value for the root element when using the XML SOAP format options. |
º | You can also use this option to override the document title in the HTML header when using the XML HTML format options. |
Option: -s, --summary[2|c|n|f <file>]
When all input is exhausted an XML summary element is printed
at the bottom providing a brief summary of what xmlfy
processed.
2 | - | Print the summary element to stderr instead. |
c | - | Print the summary element as an XML comment. |
n | - | Print the summary element without calculating any message digests. |
f <file> | - | Print the summary element to <file>. |
Option: -U, --unxml
Read XML formatted input and remove all that bracket racket
reverting your XML document back to a plain format. Can be
used in conjunction with the -F<level> <string> option
to specify the delimiter to use for each XML depth level.
Multiple same level -F options are meaningless in this
context and delimiters are only inserted if more than one field
is available to be delimited. Field separator scoping options
are ignored. The default delimiter is a space character for
XML depth levels of 2 and above, and new-line for XML depth
levels below 2. Tag names and their attributes are not included
in the output, and anything inside XML comments is filtered
out. If there is a BOM in the input then xmlfy will use that
for the encoding, otherwise xmlfy will look for the opening
XML character sequence of "<?" to determine the encoding
being used. If neither of the previous methods found the correct
encoding then you can use the -X UTF-* options as a fallback.
Basic quoting options are also supported.
Works best with XML output generated by xmlfy but can
also be used with caution on other foreign XML documents.
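The fallback described above, looking for the opening "<?" character sequence when there is no BOM, can be sketched in Python as follows. This is an illustration under assumptions, not xmlfy's implementation: the function name `guess_encoding` is hypothetical, and ASCII input is reported as UTF-8 because the two are byte-identical for "<?".

```python
# Encodings to try. Each one represents the two characters "<?" with a
# distinct byte pattern, so the first pattern that matches identifies it.
ENCODINGS = ["utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"]


def guess_encoding(data):
    """Guess the encoding of XML input from its leading "<?" sequence."""
    for enc in ENCODINGS:
        if data.startswith("<?".encode(enc)):
            return enc
    return None  # caller falls back to an explicit -X UTF-* option
```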
Option: --noxml
Do not XML-fy the input stream but do process it for reserved
XML characters (this feature was initially written for formatting
the xmlfy HTML test reports that use wide encodings). Used in conjunction
with the -X options to control the conversion of reserved characters
and/or to transform the input stream to wide UTF encodings.
E.g. To transform an ASCII input stream to UTF-16BE encoding with a BOM:
xmlfy --noxml -X UTF-16BE -X ASCIItoUTF -X noescape all -X BOM
E.g. To just escape select reserved XML characters in a UTF-32LE input stream:
xmlfy --noxml -X UTF-32LE -X noescape amp
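As a rough illustration of what the first example above asks for (ASCII input widened to UTF-16BE, a BOM generated in the output, and no escaping), here is a small Python sketch. It only mirrors the documented effect; the function name is hypothetical and nothing here is taken from the xmlfy source.

```python
import codecs


def ascii_to_utf16be_with_bom(data):
    """Widen ASCII bytes to UTF-16BE and prefix the UTF-16BE BOM."""
    text = data.decode("ascii")  # raises UnicodeDecodeError on non-ASCII input
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```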
Important note on specifying options.
The way xmlfy handles options is very straightforward and can
be easily confused if you don't follow the syntax specified for
each option. The getopt library has been deliberately avoided
to keep xmlfy portable.
xmlfy first evaluates options supplied on the command line. If a schema file is supplied then xmlfy will also look for options in that file and evaluate them too. See the schema file section below on how to specify xmlfy options inside a schema file.
How it works.
The input processor used by xmlfy block reads unprocessed
bytes from standard input (stdin) and stores them in an array the
size of a level 1 record. This level 1 record is then processed
for fields and sub fields etc by marking their positions in this
array. Dynamic memory handling is used.
The output processor used by xmlfy takes the results from the input processor and re-packages it with suitably encoded XML syntax. Any input characters that are reserved for XML are by default re-represented in their escaped form.
Character & (ampersand) becomes string &amp;
Character < (less-than) becomes string &lt;
Character > (greater-than) becomes string &gt;
Character " (quote) becomes string &quot;
Character ' (apostrophe) becomes string &apos;
Character | (broken vertical bar) becomes string &brvbar;
The output processor then writes processed bytes to a block buffer for printing to standard output (stdout).
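The default escaping of reserved characters, together with the effect of a noescape selection, can be sketched in Python as below. This is a simplified illustration, not xmlfy's code. It covers only the five predefined XML entities and leaves out the broken vertical bar entry. The selector names other than `amp` (which appears in the --noxml example) are assumptions.

```python
# The five predefined XML entities and hypothetical selector names for them.
ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&apos;"}
NAMES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}


def escape(text, noescape=()):
    """Escape reserved characters, skipping those named in noescape."""
    skip = {NAMES[name] for name in noescape}
    # Working one character at a time avoids escaping an "&" twice.
    return "".join(ch if ch in skip else ESCAPES.get(ch, ch) for ch in text)
```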
Using a schema file.
The default schema used by xmlfy is hard coded and can be
described as follows:
In DTD schema form:
<!ELEMENT xmlfy (line*)>
<!ELEMENT line (field*)>
<!ELEMENT field (#PCDATA)>
In RNC schema form:
start = xmlfy
xmlfy = element xmlfy { line* }
line = element line { field* }
field = element field { text }
In XSD schema form:
<xs:schema>
  <xs:element name="xmlfy">
    <xs:sequence>
      <xs:element name="line" type="lineType" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:element>
  <xs:complexType name="lineType">
    <xs:sequence>
      <xs:element name="field" type="xs:string" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>
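To make the default schema concrete, the following Python sketch wraps whitespace-separated input in the same xmlfy/line/field hierarchy. It is only an approximation for illustration: the function name `xmlfy_default` is hypothetical, the indentation is a presentation choice, and the prologue and doctype that xmlfy may print are left out.

```python
from xml.sax.saxutils import escape


def xmlfy_default(text):
    """Wrap each line in <line> and each whitespace-separated field in <field>."""
    out = ["<xmlfy>"]
    for line in text.splitlines():
        fields = "".join(f"<field>{escape(f)}</field>" for f in line.split())
        out.append(f"  <line>{fields}</line>")
    out.append("</xmlfy>")
    return "\n".join(out)
```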
A schema file for the ls -la command that produces output like this:
total 73
drwx------+ 3 ag None 0 Apr 20 19:36 .
-rwxr-xr-x 1 ag None 15639 Apr 20 19:31 a.exe
-rwx------+ 1 ag None 6354 Apr 20 19:31 xmlfy.c
-rwx------+ 1 ag None 4901 Apr 19 2008 xmlfy.h
In DTD schema form will look like this:
<!ELEMENT ls (total?), (file*)>
<!ELEMENT total (prompt, totalsize)>
<!ELEMENT file (permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname)>
<!ELEMENT date_ty (date_y)>
<!ELEMENT date_ty (date_h, date_m)>
<!ELEMENT prompt (#PCDATA)>
<!ELEMENT totalsize (#PCDATA)>
<!ELEMENT permission (#PCDATA)>
<!ELEMENT blocks (#PCDATA)>
<!ELEMENT user (#PCDATA)>
<!ELEMENT group (#PCDATA)>
<!ELEMENT size (#PCDATA)>
<!ELEMENT date_y (#PCDATA)>
<!ELEMENT date_M (#PCDATA)>
<!ELEMENT date_d (#PCDATA)>
<!ELEMENT date_h (#PCDATA)>
<!ELEMENT date_m (#PCDATA)>
<!ELEMENT fname (#PCDATA)>
and should be saved to a file as ls.dtd and invoked as:
% ls -la | xmlfy --schema ls.dtd -F3 :
In RNC schema form will look like this:
start = ls
ls = element ls { total? | file* }
total = element total { prompt, totalsize }
file = element file { permission?, blocks?, user?, group?, size?, date_M?, date_d?, date_ty?, fname }
date_ty = element date_ty { date_y }
date_ty |= element date_ty { date_h, date_m }
prompt = element prompt { text }
totalsize = element totalsize { text }
permission = element permission { text }
blocks = element blocks { text }
user = element user { text }
group = element group { text }
size = element size { text }
date_y = element date_y { text }
date_M = element date_M { text }
date_d = element date_d { text }
date_h = element date_h { text }
date_m = element date_m { text }
fname = element fname { text }
and should be saved to a file as ls.rnc and invoked as:
% ls -la | xmlfy --schema ls.rnc -F3 :
In XSD schema form will look like this:
<xs:schema>
  <xs:element name="ls" type="lsType" />
  <xs:complexType name="lsType">
    <xs:sequence>
      <xs:element name="total" type="totalType" minOccurs="0" />
      <xs:element name="file" type="fileType" minOccurs="0" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="totalType">
    <xs:sequence>
      <xs:element name="prompt" type="xs:string" />
      <xs:element name="totalsize" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="fileType">
    <xs:sequence>
      <xs:element name="permission" type="xs:string" minOccurs="0" />
      <xs:element name="blocks" type="xs:string" minOccurs="0" />
      <xs:element name="user" type="xs:string" minOccurs="0" />
      <xs:element name="group" type="xs:string" minOccurs="0" />
      <xs:element name="size" type="xs:string" minOccurs="0" />
      <xs:element name="date_M" type="xs:string" minOccurs="0" />
      <xs:element name="date_d" type="xs:string" minOccurs="0" />
      <xs:element name="date_ty" type="datetyType" minOccurs="0" />
      <xs:element name="fname" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="datetyType">
    <xs:choice>
      <xs:element name="date_y" type="xs:string" />
      <xs:sequence>
        <xs:element name="date_h" type="xs:string" />
        <xs:element name="date_m" type="xs:string" />
      </xs:sequence>
    </xs:choice>
  </xs:complexType>
</xs:schema>
and should be saved to a file as ls.xsd and invoked as:
% ls -la | xmlfy --schema ls.xsd -F3 :
Shoe-horning raw data into a structure defined by a schema is rather straightforward when the input fields have a one-to-one relationship with the fields of the schema elements. However, if wildcard tokens and/or Boolean logic are employed in the schema then it becomes quite a challenge, sometimes even impossible, to be deterministic about which input field belongs to which schema field. Strictly speaking, the main function of a schema is to check that XML is valid, which requires the XML document to already exist. In xmlfy's case we are doing the reverse by building an XML document on the fly while following rules described by the schema. This is still okay and the resulting XML can be considered to be both valid and well formed.
xmlfy employs two techniques to help with this shoe-horning input data problem. The first technique xmlfy uses is recognising multiple element definitions that have the same name. This allows you to capture your schema elements under a variety of input circumstances without having to create a unique element for each circumstance - you can still do that if you want. The second technique xmlfy uses is auto-generated field match constraint helpers to assist in matching the input fields to the elements described by the schema. These helpers are useful in improving the speed of xmlfy particularly when using compound element structures and wildcard tokens in the schema hierarchy. After the schema file is loaded into memory, an array of helpers is generated for each element that describes all combinations of the schema tree traversal paths that can be taken and associates each combination with the minimum, maximum and last number of fields required for a match against the number of available input fields. For example, using the above schema a match will occur for:
total(min=2, max=2, last=2) when input fields = 2.
file(min=1, max=9, last=1) when 1 <= input fields <= 9
and date_ty is a single field (min=1, max=1, last=1).
file(min=1, max=10, last=1) when 1 <= input fields <= 10
and date_ty is two fields (min=2, max=2, last=2).
By default xmlfy continuously iterates through just the record elements of the root element looking for element helpers that can fully satisfy the requirements of that particular element's schema tree hierarchy for the given input fields, after which the matching record element is then checked against its wildcard obligations in the root element definition, and if okay is finally printed.
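The minimum and maximum field counts given for the ls schema can be read as a simple range test on the number of input fields. The Python sketch below shows only that range test. The `Helper` type and `candidates` function are hypothetical, the "last" value is left out because its use is not described here, and how xmlfy chooses between several candidates is not modelled.

```python
from typing import NamedTuple


class Helper(NamedTuple):
    element: str
    min_fields: int
    max_fields: int


# Field count ranges taken from the ls schema example.
HELPERS = [
    Helper("total", 2, 2),
    Helper("file", 1, 9),   # date_ty matched as a single field
    Helper("file", 1, 10),  # date_ty matched as two fields
]


def candidates(field_count):
    """Return the helpers whose field count range admits field_count."""
    return [h for h in HELPERS if h.min_fields <= field_count <= h.max_fields]
```

With ten input fields only the second file helper remains, while two input fields satisfy all three ranges, so the range test alone does not settle which element is used.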
To specify xmlfy options inside a schema file you encapsulate them inside a special token that is in effect a schema comment.
DTD and XSD example:
<!-- xmlfy-args: -F1 "\n" -F2 ABC -q -Q \"\' -->
RNC example:
## xmlfy-args: -F1 "\n" -F2 ABC -q -Q \"\'
This special token must exist in completed form on just one line at the left most side, spacing is important, only the first occurrence is recognised, and ideally it is placed somewhere near the top of the schema file. The schema option syntax is the same as the command line option syntax except that some options are not allowed e.g. --schema.
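The rules stated for the special token (complete on one line, at the left most side, first occurrence only) can be sketched in Python as a line-by-line search. This is an assumption-laden illustration rather than xmlfy's parser: the function name is hypothetical, and the exact spacing xmlfy requires is assumed to be a single space after the colon and before the closing comment marker, as in the examples.

```python
import re

# Token shapes as shown in the DTD/XSD and RNC examples.
DTD_XSD_TOKEN = re.compile(r"^<!-- xmlfy-args: (.*) -->$")
RNC_TOKEN = re.compile(r"^## xmlfy-args: (.*)$")


def find_schema_args(schema_text):
    """Return the raw option string of the first xmlfy-args token, or None."""
    for line in schema_text.splitlines():
        for pattern in (DTD_XSD_TOKEN, RNC_TOKEN):
            match = pattern.match(line)
            if match:
                return match.group(1)
    return None
```

The option string is returned unparsed because the schema option syntax is the same as the command line syntax, so it would be handed to the same option handling.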
Currently the xmlfy schema file parser is not that sophisticated and exhibits the following behaviour:
DTD schema
RNC schema
XSD schema
All schema types
<!ELEMENT parent (record)>
<!ELEMENT record (a*, b, c*)>
<!ELEMENT a (#PCDATA)>
<!ELEMENT b (#PCDATA)>
<!ELEMENT c (#PCDATA)>
In the above example xmlfy will allocate ALL input fields to element <a> and that MAY not be the desired intention.
0 | Normal exit. |
-1 | Invalid argument specified. |
-2 | Error processing schema file contents. |
-3 | Infinite loop detected when matching input against schema elements. |
-10 | Out of memory. |
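On POSIX systems only the low 8 bits of an exit value reach the parent process, so a calling script would see the negative codes above as values between 128 and 255. The Python sketch below maps an observed status back to the documented value. It assumes xmlfy passes these values straight to exit() on such a system, which this page does not confirm, and the function name is hypothetical.

```python
def documented_exit_code(status):
    """Map an observed exit status (0-255) back to its signed value."""
    return status - 256 if status > 127 else status
```

Under that assumption, an observed status of 255 corresponds to -1 (invalid argument) and 246 corresponds to -10 (out of memory).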
Originally written by Arthur Gouros.
This software also contains material derived from Ville Laurikari's TRE regex library.
This software also contains material derived from the US Secure Hash Algorithms (RFC4634).
This software also contains material derived from the RSA Data Security, Inc.
MD5 Message-Digest Algorithm.
BSD License for xmlfy
Copyright © 2008-2020, Arthur Gouros
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The full documentation of the xmlfy project can be found on the web at:
The website is updated more frequently than the man pages and should be considered the authoritative source of information.