XML and Ampersand (&)

Posted: Tuesday, August 4th, 2009 at 4:50 pm
Updated: Wednesday, February 13th, 2013 at 12:40 am

I’m not sure how many programmers know this, but it’s worth making sure that they (including me) do: the XML specification doesn’t allow a literal ampersand (&) in character data. Here’s a quote from the W3C XML Recommendation.


The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.
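
The quote names three ways to carry an ampersand: the &amp; string, a numeric character reference, and a CDATA section. Here is a minimal sketch, assuming PHP with the SimpleXML extension available, that shows all three decoding to the same text. The <q> element and the sample text AT&T are made up for illustration.

```php
<?php
// Three ways to represent a literal ampersand in element content.
// The <q> element and the sample text are illustrative only.
$documents = [
    'entity'  => '<q>AT&amp;T</q>',          // predefined entity
    'numeric' => '<q>AT&#38;T</q>',          // numeric character reference
    'cdata'   => '<q><![CDATA[AT&T]]></q>',  // CDATA section, no escaping inside
];

$decoded = [];
foreach ($documents as $label => $doc) {
    // LIBXML_NOCDATA merges CDATA sections into ordinary text nodes.
    $element = simplexml_load_string($doc, 'SimpleXMLElement', LIBXML_NOCDATA);
    $decoded[$label] = (string) $element;
    echo $label, ': ', $decoded[$label], "\n";
}
```

Each document should parse without warnings, and the application sees a plain ampersand in every case.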

The reason for this post is that I ran into this error recently at work. If you’re using PHP’s SimpleXML, you may get an error message like this:

XML parser error : EntityRef: expecting ';'

Why does the ampersand need to be escaped?

I think the reason the ampersand needs to be escaped is as follows. Suppose you have an XML document containing a mathematical formula, ‘(3A + 2B) > 5C’. How would you write that in XML?

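Well, you could write it literally, as below. Strictly speaking this is still well-formed, because the specification only requires > to be escaped when it appears in the sequence ]]>. But > is also the character that closes a tag, and its counterpart < would be a hard error in the same position, so it is common practice to escape it anyway.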

<math>
(3A + 2B) > 5C
</math>

So the safer answer is to escape > with &gt;, and your XML looks like this:

<math>
(3A + 2B) &gt; 5C
</math>

Similarly, < is replaced with &lt;. Now notice that we're using the ampersand (&) to escape greater-than and less-than. If the ampersand itself didn't have to be escaped, then from the parser's point of view an ampersand could either mean a literal ampersand or mark the start of an escape sequence. To keep things consistent, I think, the designers made the ampersand always a marker for an escape sequence. Thus, when you're parsing XML and you see an ampersand (&), you know it must start a reference: a predefined entity such as &amp;, &lt;, &gt;, &quot; or &apos;, another declared entity, or a numeric character reference such as &#38;. In other words, when a parser sees an ampersand, it expects a semicolon (;) to close the reference soon after. With that reasoning in mind, the somewhat cryptic SimpleXML error message above makes more sense.
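
In practice you rarely escape by hand. Here is a minimal sketch, assuming PHP 5.4 or later (where the ENT_XML1 flag exists), that escapes text before placing it inside an element and then parses it back. The helper name escape_xml_text() and the sample formula are hypothetical, chosen only for this illustration.

```php
<?php
// Hypothetical helper: escape arbitrary text for use as XML element content.
function escape_xml_text(string $text): string
{
    // ENT_XML1 selects XML 1.0 entity names; ENT_QUOTES also escapes " and '.
    return htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

// Sample text containing all three characters discussed above.
$formula = '(3A + 2B) > 5C & D < E';
$escaped = escape_xml_text($formula);
echo $escaped, "\n";

// Round trip: the parser turns the references back into the literal characters.
$math = simplexml_load_string('<math>' . $escaped . '</math>');
echo (string) $math, "\n";
```

The escaped string is what goes into the document, and the parsed value is what your application sees, so the two echo lines show both sides of the escaping.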

SimpleXML vs SAX XML Parsing.

I found a pretty interesting difference between parsing XML with SimpleXML and with the SAX method. If your XML looks like this:

<url type="google_search">
http://www.google.com/#hl=en&q=xml+specification
</url>

and you parse it in PHP using a SimpleXML function, you’ll get a warning. Here’s a sample session.

user@dev:~$ cat jajal.php
<?php
$xml_str = '<url type="google_search">http://www.google.com/#hl=en&q=xml+specification</url>';

$xml = simplexml_load_string($xml_str);
var_dump($xml);
user@dev:~$ php jajal.php
bool(false)
user@dev:~$ 

In the error log, you’ll see this entry:

Aug 4 16:01:53 user php[1579]: PHP Warning: simplexml_load_string(): Entity: line 1: parser error : EntityRef: expecting ';' in /Users/user/jajal.php on line 4
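
If you would rather handle the failure in code than dig through the error log, libxml can collect its errors in memory. Here is a minimal sketch, assuming the libxml and SimpleXML extensions are enabled. The variable names $previous and $messages are my own.

```php
<?php
$xml_str = '<url type="google_search">http://www.google.com/#hl=en&q=xml+specification</url>';

// Collect libxml errors in memory instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$xml = simplexml_load_string($xml_str);
$messages = [];
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        // Each $error is a LibXMLError with message, line and column fields.
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }
    libxml_clear_errors();
}

// Restore whatever error-handling mode was active before.
libxml_use_internal_errors($previous);

foreach ($messages as $message) {
    echo $message, "\n";
}
```

The false return value alone only says that parsing failed. The collected messages say where and why.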

Interestingly enough, if you’re using the SAX XML Parser functions, PHP won’t raise any errors or warnings. Rather, parsing just quietly stops. Here’s one example.

class myXMLParser {
   private $parser = null;
   private $tag_value = null;

   function __construct() {
      $this->parser = xml_parser_create();
      xml_set_object($this->parser, $this);
      xml_set_element_handler($this->parser, 'start_element', 'end_element');
      xml_set_character_data_handler($this->parser, 'chardata');
   }

   function __destruct() {
      xml_parser_free($this->parser);
      $this->parser = null;
   }

   function parse($xml_str) {
      xml_parse($this->parser, $xml_str);
   }

   function start_element($parser, $name, $attrs) {
      echo "Start of $name. Content: ";
   }

   function end_element($parser, $name) {
      echo ".\nEnd of $name.\n";
   }

   function chardata($parser, $data) {
      echo "$data";
   }
}

// Note the unescaped & in the URL: this is the malformed input.
$xml_str = '<url type="google_search">http://www.google.com/#hl=en&q=xml+specification</url>';

$xml = new myXMLParser();
$xml->parse($xml_str);

echo "I am all done now.\n";
user@dev:~$ php jajal.php 
Start of URL. Content: http://www.google.com/#hl=enI am all done now.
user@dev:~$ 

There’s no error message, and the parsing just dies in the middle (as evidenced by end_element() never being called). The fact that PHP doesn’t raise errors or warnings here could be dangerous, as it gives a false sense of security: we think there are no errors, so all is good, while the XML parsing may be incomplete. Fortunately, the __destruct() function is still called, so you can use it to check for errors if you’re parsing XML using expat. xml_parse() also returns 0 on failure, so checking its return value is another way to catch the problem.
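
One way to make the silent failure visible is to check what xml_parse() returns and ask the parser what went wrong. Here is a minimal sketch, assuming the PHP XML Parser extension is enabled. It uses a closure instead of the class above to stay short, and the $ok and $error variables are my own names. The exact wording of the error text depends on the underlying library, so it isn't shown here.

```php
<?php
$xml_str = '<url type="google_search">http://www.google.com/#hl=en&q=xml+specification</url>';

$parser = xml_parser_create();
xml_set_character_data_handler($parser, function ($parser, string $data): void {
    echo $data;
});

// xml_parse() returns 1 on success and 0 on failure.
// The third argument marks this as the final chunk of input.
$ok = xml_parse($parser, $xml_str, true);

$error = null;
if ($ok !== 1) {
    $error = sprintf(
        '%s at line %d, column %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser),
        xml_get_current_column_number($parser)
    );
    echo "\nParse failed: $error\n";
}

// Drop the reference so the parser can be released.
unset($parser);
```

Inside a class like the one above, the same check could live in the parse() method, so callers learn about the failure straight away.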

To conclude, here’s what you should get once you make the XML compliant (change the ampersand to &amp;):

user@dev:~$ php jajal.php 
Start of URL. Content: http://www.google.com/#hl=en&q=xml+specification.
End of URL.
I am all done now.
user@dev:~$ 

That’s about it. I hope this article helps you. Please leave comments, suggestions or questions. I’m looking forward to improving my solution with your feedback.

3 Responses to “XML and Ampersand (&)”

  1. glcx Says:

    It was really helpful. Thank you!

  2. pen Says:

    good article, but greater than, not grater? and there is a typo in the captcha error message

  3. Junaid Says:

    OMG totally made my day