An XML document is a text file that conforms to the XML specification's well-formedness rules. Such a conforming document is said to be well-formed (not to be confused with valid). XML is much stricter about well-formedness than other languages such as HTML. A text file that is not well-formed is not considered XML, and consuming applications cannot use it at all.
Here are some rules that apply to XML documents:
XML uses a largely self-describing syntax. An XML declaration in the prolog states the XML version and the character encoding:
<?xml version="1.0" encoding="UTF-8"?>
There must be exactly one top-level element.
However, comments and processing instructions, as well as the initial XML declaration, are also allowed at the top level. Text and attributes are not.
<?xml version="1.0"?>
<!-- some comments -->
<?app a processing instruction?>
<root/>
<!-- some more comments -->
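As a rough illustration of the top-level rule, the sketch below feeds a few documents to Python's standard `xml.etree.ElementTree` parser. The helper name `is_well_formed` is an assumption made for this example and is not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One root element, surrounded by comments and a processing instruction.
good = (
    '<?xml version="1.0"?>\n'
    "<!-- some comments -->\n"
    "<?app a processing instruction?>\n"
    "<root/>\n"
    "<!-- some more comments -->\n"
)

print(is_well_formed(good))                 # single top-level element
print(is_well_formed("<a/><b/>"))           # two top-level elements
print(is_well_formed("<root/>stray text"))  # text outside the root element
```

Only the first document should be accepted; the other two break the rule that there is exactly one top-level element and no top-level text.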
Elements may nest, but must be "properly nested":
<name>
<first-name>John</first-name>
<last-name>Doe</last-name>
</name>
The start and end tags of an embedded element must both lie within the start and end tags of its container element. Overlapping elements are illegal.
In particular, this is not well-formed XML: <foo><bar></foo></bar>
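The nesting rule can be checked mechanically. The following sketch, which assumes Python's standard `xml.parsers.expat` module, reports the parser's error message for a document; `nesting_error` is a hypothetical helper name chosen for this example.

```python
from xml.parsers import expat


def nesting_error(text: str):
    """Return expat's error message for `text`, or None if it parses."""
    parser = expat.ParserCreate()
    try:
        # The second argument marks this chunk as the final one.
        parser.Parse(text, True)
    except expat.ExpatError as exc:
        return expat.ErrorString(exc.code)
    return None


# Properly nested elements parse without error.
print(nesting_error("<name><first-name>John</first-name></name>"))

# Overlapping elements are rejected.
print(nesting_error("<foo><bar></foo></bar>"))
```

The parser stops at the first `</foo>`, because the innermost open element at that point is `bar`.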
Attributes may only appear in opening element tags or empty element tags, not in closing element tags. If attribute syntax appears between elements, it has no meaning and is parsed as text.
<person first-name="John" last-name="Doe"/>
This is not well-formed: <person></person first-name="John">
Comments, processing instructions, text and further elements can appear anywhere inside an element (i.e., between its opening and closing tags), but not inside the tags themselves.
<element>
This is some <b>bold</b> text.
<!-- the b tag has no particular meaning in XML -->
</element>
This example is not well-formed: <element <!-- comment --> />
The < character may not appear in text, or in attribute values.
The " character may not appear in attribute values that are quoted with ". The ' character may not appear in attribute values that are quoted with '.
The sequence of characters -- may not appear in a comment.
Literal < and & characters must be escaped by their respective entities &lt; and &amp;.
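In practice, escaping is usually left to a library rather than done by hand. The sketch below assumes Python's standard `xml.sax.saxutils` module; the sample strings and the `note` element are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Escape markup-significant characters in element text.
text = "if a < b & b < c"
escaped = escape(text)
print(escaped)

# quoteattr escapes the value and picks a quote character that
# does not clash with quotes inside the value.
attr = quoteattr('He said "hi" & left')
print(f"<note author={attr}>{escaped}</note>")
```

Note that `escape` also replaces `>` with an entity, which is permitted but goes beyond the two characters that must always be escaped.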