Generating XML programmatically? Don’t use CDATA

If you want to represent character data, you need to be able to represent any possible sequence of characters. In an XML file, escaping (e.g. writing &lt; for <) gives you a way to represent any sequence of characters. Using CDATA doesn't give you a way to do that.

In an XML document there are two ways to represent character data. Either

  1. You just write the characters in the XML file (in which case you need to escape characters that would otherwise be read as markup, i.e. replace < with &lt; etc.), or
  2. You use CDATA, in which case you no longer have to care about replacing < with &lt; etc. The CDATA section ends as soon as the first ]]> is encountered, which seems unlikely to appear in your characters.

These two syntaxes result in an identical XML document. An XML parser must treat two XML files as identical if they differ only in whether CDATA is used. In both cases, there is simply text within a tag.
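
As a small illustration of that equivalence, here is a sketch using Python's standard xml.etree.ElementTree parser. The tag name my-tag and the sample text are just placeholders; the point is only that both spellings yield the same text to the program reading the file.

```python
import xml.etree.ElementTree as ET

# The same character data, written once with escaping and once with CDATA.
escaped = ET.fromstring("<my-tag>1 &lt; 2 &amp; 3</my-tag>")
cdata = ET.fromstring("<my-tag><![CDATA[1 < 2 & 3]]></my-tag>")

# After parsing, nothing distinguishes the two: both are plain text in a tag.
print(repr(escaped.text))
print(repr(cdata.text))
print(escaped.text == cdata.text)
```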

CDATA has a (somewhat strange) syntax like:

<my-tag><![CDATA[My characters]]></my-tag>

From its name, CDATA ("Character DATA") might seem like exactly what you need to represent character data, all the more so because characters such as < don't need to be escaped.

However, it's in fact exactly the opposite of what you need. Without CDATA, it's possible to escape the characters that mustn't appear (i.e. replace < with &lt;); with CDATA, there is no way to escape the sequence that mustn't appear (i.e. ]]>).
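
To make the difference concrete, here is a minimal sketch in Python. The helpers wrap_escaped, wrap_cdata and round_trips are hypothetical names invented for this example, and building XML by string concatenation is done only to expose the two syntaxes. The sample data is an ordinary-looking code fragment that happens to contain ]]>.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def wrap_escaped(data: str) -> str:
    # escape() replaces &, < and > with entity references.
    return "<my-tag>" + escape(data) + "</my-tag>"

def wrap_cdata(data: str) -> str:
    # Naive CDATA wrapping: the data is pasted in unchanged.
    return "<my-tag><![CDATA[" + data + "]]></my-tag>"

def round_trips(xml_text: str, data: str) -> bool:
    # True only if the XML parses and yields exactly the original data.
    try:
        return ET.fromstring(xml_text).text == data
    except ET.ParseError:
        return False

harmless = "1 < 2"
awkward = "if (a[b[0]]>1 && c<2)"  # contains ]]> without looking unusual

print(round_trips(wrap_escaped(harmless), harmless))
print(round_trips(wrap_cdata(harmless), harmless))
print(round_trips(wrap_escaped(awkward), awkward))
print(round_trips(wrap_cdata(awkward), awkward))
```

Only the last combination fails: the CDATA section ends early at the ]]> inside the data, and what follows is no longer what the writer intended.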

It might seem "unlikely" that ]]> is actually going to appear in the character data your user wishes to represent. (This is actually irrelevant, as software should work all the time, not just be "unlikely" to fail.) However, even this "unlikelihood" is misleading. No matter what sequence of characters XML had chosen to end CDATA, as soon as you represent a format within itself (e.g. send an XML document inside a tag of another XML document), this sequence will appear in the data. So if you're working with XML (which you are), this will happen more often than you think.
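
The XML-inside-XML case can be sketched as follows, again in Python with the standard library. The tag names envelope and note are made up for the example; the inner document is one that itself uses CDATA, which is all it takes for ]]> to show up in the data.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# An inner XML document that we want to send as text inside another document.
inner = "<note><![CDATA[1 < 2]]></note>"

# Escaping: the inner document comes back exactly as it went in.
outer_escaped = "<envelope>" + escape(inner) + "</envelope>"
recovered_escaped = ET.fromstring(outer_escaped).text

# Escaping also nests: the envelope itself can be sent inside another envelope.
outer_twice = "<envelope>" + escape(outer_escaped) + "</envelope>"
recovered_twice = ET.fromstring(outer_twice).text

# CDATA: the inner ]]> closes the outer CDATA section too early.
outer_cdata = "<envelope><![CDATA[" + inner + "]]></envelope>"
try:
    recovered_cdata = ET.fromstring(outer_cdata).text
except ET.ParseError:
    recovered_cdata = None  # the result is not even well-formed XML

print(recovered_escaped == inner)
print(recovered_twice == outer_escaped)
print(recovered_cdata == inner)
```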

CDATA is just a convenience mechanism for writing XML files by hand. If you're writing files by hand, you know which characters appear in your data, so you know whether you can use CDATA safely or not. If you're creating a program to write XML files, you don't know what the user's data will contain, so you can't use CDATA.
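
In practice a program rarely needs to do the escaping itself. As one possible approach (an assumption of this example, not the only option), Python's xml.etree.ElementTree serializer escapes text when writing, so arbitrary user data survives the round trip. The function name to_xml is hypothetical.

```python
import xml.etree.ElementTree as ET

def to_xml(data: str) -> str:
    # Let the serializer escape whatever the data happens to contain.
    element = ET.Element("my-tag")
    element.text = data
    return ET.tostring(element, encoding="unicode")

user_data = "x ]]> y < z & w"
xml_text = to_xml(user_data)

print(xml_text)
print(ET.fromstring(xml_text).text == user_data)
```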

This article is © Adrian Smith.
It was originally published on 20 Mar 2013