11 ways to valid RSS

To see how people are using RSS 2.0, I have identified 11 different methods of specifying content in it. Some of them should be functionally equivalent to an XML parser, and some should not.

If you ask the question "Should my aggregator do something sensible with this?", some of these even seem to be mutually incompatible.

Content in the description element

I have so far identified five different variants of content in the <description> element:

  1. Plaintext as CDATA with HTML entities - Validate
  2. HTML within CDATA - Validate
  3. HTML escaped with entities - Validate
  4. Plain text in CDATA - Validate
  5. Plaintext with inline HTML using escaping - Validate
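
As a rough illustration of the "functionally equivalent to an XML parser" point, here is a minimal sketch using Python's standard xml.etree.ElementTree. The two item fragments are hypothetical test cases written for this example, not the actual test feeds; they carry the same HTML payload, once inside CDATA and once escaped with entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments (not the real test feeds): same payload, two encodings.
CDATA_ITEM = (
    "<item><description><![CDATA[<p>Hello &amp; welcome</p>]]>"
    "</description></item>"
)
ESCAPED_ITEM = (
    "<item><description>&lt;p&gt;Hello &amp;amp; welcome&lt;/p&gt;"
    "</description></item>"
)


def description_text(item_xml):
    """Return the character data of <description> as the parser reports it."""
    return ET.fromstring(item_xml).findtext("description")


cdata_text = description_text(CDATA_ITEM)
escaped_text = description_text(ESCAPED_ITEM)

# Both encodings collapse to the same string once parsed.
print(cdata_text == escaped_text)
```

What the parser cannot tell you is whether the resulting string is meant to be HTML or plain text; that part is left to the aggregator to guess.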

<content:encoded>

I have encountered and identified two different ways of using <content:encoded>:

  1. Using entities - Validate
  2. Using CDATA - Validate
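
A minimal sketch of how an aggregator might read this element with a namespace-aware parser, assuming the usual content module namespace URI. The feed below is a made-up example, and the fallback to <description> is my own assumption about sensible behaviour, not something RSS 2.0 prescribes.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the RSS content module.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Hypothetical feed, written for this example only.
FEED = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example</title>
    <item>
      <description>Short summary</description>
      <content:encoded><![CDATA[<p>Full <em>HTML</em> body</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""


def item_bodies(feed_xml):
    """Prefer content:encoded and fall back to description for each item."""
    root = ET.fromstring(feed_xml)
    bodies = []
    for item in root.iter("item"):
        encoded = item.findtext("{%s}encoded" % CONTENT_NS)
        if encoded is None:
            encoded = item.findtext("description", default="")
        bodies.append(encoded)
    return bodies


bodies = item_bodies(FEED)
print(bodies)
```

Because the lookup uses the namespace URI rather than the literal prefix, a feed that binds the same URI to a different prefix would be handled the same way.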

XHTML content

Finally, I have encountered and identified four different ways in which people have specified XHTML content:

  1. Using <xhtml:body> - Validate
  2. Using <xhtml:div> - Validate
  3. Using <body> with default namespace - Validate
  4. Using <div> with default namespace - Validate
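
To a namespace-aware parser, the prefixed and default-namespace forms should come out the same. The sketch below tries to show that with two hypothetical item fragments (again, not the real test feeds) and Python's xml.etree.ElementTree, which reports element names in expanded {namespace}local form.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical fragments: the same XHTML, once prefixed, once via default namespace.
PREFIXED = """<item xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:div><xhtml:p>Hello</xhtml:p></xhtml:div>
</item>"""

DEFAULT_NS = """<item>
  <div xmlns="http://www.w3.org/1999/xhtml"><p>Hello</p></div>
</item>"""


def xhtml_tags(item_xml):
    """List the expanded names of all XHTML elements inside the item."""
    root = ET.fromstring(item_xml)
    marker = "{" + XHTML_NS + "}"
    return [el.tag for el in root.iter() if el.tag.startswith(marker)]


prefixed_tags = xhtml_tags(PREFIXED)
default_tags = xhtml_tags(DEFAULT_NS)

# The prefix is gone after parsing; only namespace URI and local name remain.
print(prefixed_tags == default_tags)
```

A tool that matches on the literal string "xhtml:div" sees two unrelated documents here, which is where the variants stop being equivalent in practice.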

Download

Download all feeds in a zip file for offline/private testing.

Any more?

If you have seen other ways of specifying content in RSS 2.0 in actual use, feel free to point me to them, so I can construct a minimal test case from each.

Update

What might not have been apparent when this entry was first posted is this: This is not a matter of 11 different RSS formats. These 11 test documents are all RSS 2.0.

Conclusion

Really Simple Syndication? Simple?!??

Comments

Comment from Asbjørn Ulsberg on 2004-03-15 15:42

And people wonder why we need Atom? Really Simple Syndication my ass! Pfft!

Comment from Dare Obasanjo on 2004-03-16 22:16

Arve,
I’ve posted comments about your post in my weblog. Your post points out that there are 5 ways to provide content in RSS, not 11. Read my post on RSS follies if you wonder how I got that number.

Asbjørn,
Atom adds a lot more complexity to providing content than RSS does. I don’t even have to start using contrived examples like changing namespace prefixes or CDATA vs. escaped content to make the combination of ways to provide content in Atom reach 11.

  1. summary type=text/plain
  2. summary type=text/html
  3. summary type=application/xhtml+xml
  4. content type=text/plain
  5. content type=text/html
  6. content type=application/xhtml+xml
  7. summary type=text/plain mode=escaped
  8. summary type=text/html mode=escaped
  9. summary type=application/xhtml+xml mode=escaped
  10. content type=text/plain mode=escaped
  11. content type=text/html mode=escaped
  12. content type=application/xhtml+xml mode=escaped

That’s 12 without having to use claims like different namespace prefixes or escaped content vs. CDATA. Of course, I could go on much longer if I decided to use content from other MIME types and mode=base64, but I’m sure you get the point.

(Ed. note: typographical/semantical edit performed to make the numbered list an actual list)

Comment from Arve on 2004-03-16 23:41

Dare, to clear something up: No, I haven’t misunderstood XML at the most fundamental level — which is what you should read into the sentence Some of them should be functionally equivalent to an XML parser, and some should not.

The reason I mentioned every variant of escaping as well is that, no matter how much you may want it or not, people are going to use piss-poor quasi-XML tools, or they’re going to go through regexp hell to use it. Believing that people will create or parse RSS feeds the “one true way” is wishful thinking.

Finally, for content in the description element, if we sideline the escaping for the purpose of this specific argument, and instead look at the semantics:

<p>
A paragraph with <a href="http://www.example.com">a link</a>.
</p>

<p>
Another paragraph
</p>

Is not the same as:

A paragraph and <a href="http://www.example.com">a link</a>.

Another paragraph

And please, do not attempt to make this an Atom vs. RSS discussion.

Comment from Asbjørn Ulsberg on 2004-03-16 23:54

Dare, the problem isn’t whether you have a quadrazillion different ways to embed content in your XML feed (no matter what format), it is whether you can specify how you embed it or not.

Atom provides a way to specify this, RSS doesn’t.

Comment from Dare Obasanjo on 2004-03-17 01:30

Arve,
So what’s your point? RSS is complex because you can’t process XML and HTML with regular expressions?

Your original point was that processing content in RSS was difficult and I pointed out that things aren’t as bad as you claim if you use proper tools. You retort by claiming that people want to process RSS with improper tools? So what? That is an irrational argument. Removing a screw is hard if you have a hammer but easy with a screwdriver. What was your point again?

Both the examples you show are embedded HTML in a description element. I assume you’re claiming that you should process newlines as <p> tags. I don’t see why anyone would do that; you can only bend so far backwards for people. What happens if I mix wiki text with my HTML, such as bold? Does that also mean that you should parse the wiki-isms and the HTML?

Comment from Arve on 2004-03-17 01:46

No, Dare, my point was that people who choose to use piss-poor tools for the job have a lot of extra work to do. My point was also that in any other place than Utopia, people will use these piss-poor tools to get the job done. Which means that the users of the piss-poor tools might actually benefit from knowing these eleven variants, whether they are equivalent to an XML parser or not. Please, do not read malice into where malice does not exist.

As for the examples in my last comment: Both are in use, and while I agree with you that “both are HTML and should be treated as such”, that may not be the author’s intent. Is your goal to produce according to spec, or to provide human-readable content? Squished paragraphs are not in that category.

BTW: There is a fairly pragmatic solution to this that should work fairly well: If there are no elements defined as block-level by HTML 4.01 in the <description>, but there are inline elements present, you treat double newlines as paragraphs, and any inline elements are treated as just that, inline elements.
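
A minimal sketch of that heuristic, assuming the block-level set from the HTML 4.01 Strict DTD and Python's standard html.parser. The names BLOCK_LEVEL, TagCollector and paragraphize are made up for this example, and any tag outside the block-level set is simply treated as inline.

```python
import re
from html.parser import HTMLParser

# Block-level elements as listed in the HTML 4.01 Strict DTD (assumed set).
BLOCK_LEVEL = {
    "p", "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "pre", "dl", "div",
    "noscript", "blockquote", "form", "hr", "table", "fieldset", "address",
}


class TagCollector(HTMLParser):
    """Record the (lowercased) name of every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def paragraphize(description):
    """Wrap double-newline chunks in <p> when only inline markup is present."""
    collector = TagCollector()
    collector.feed(description)
    collector.close()
    has_block = bool(collector.tags & BLOCK_LEVEL)
    has_inline = bool(collector.tags - BLOCK_LEVEL)
    if has_block or not has_inline:
        return description  # leave block-level HTML and tag-free text alone
    chunks = [c.strip() for c in re.split(r"\n\s*\n", description) if c.strip()]
    return "\n".join("<p>%s</p>" % c for c in chunks)


example = (
    'A paragraph and <a href="http://www.example.com">a link</a>.'
    "\n\nAnother paragraph"
)
result = paragraphize(example)
print(result)
```

Text without any markup at all is returned unchanged here, since the heuristic as stated only covers the case where inline elements are present.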

Comment from Dare Obasanjo on 2004-03-17 02:53

Arve,
If you want to claim that there was no malicious intent in your blog post perhaps you should retract the line

Really Simple Syndication? Simple?!??

from your post as it is quite clear that you are exaggerating the complexity of RSS if the facts are looked at objectively. So far all you’ve stated is that it is complex if you use the wrong tool for the job, which is a fact of life regardless of whether you are working on software, hardware or some other aspect of human endeavor.

Comment from Adam Fitzpatrick on 2004-03-17 10:44

Which means that the users of the piss-poor tools might actually benefit from knowing these eleven variants, whether they are equivalent to an XML parser or not. Please, do not read malice into where malice does not exist.

Given your good intentions, wouldn’t it be a more rewarding use of your time and effort to encourage these people to move away from piss-poor tools, rather than helping them to make more work for themselves?

Comment from Arve on 2004-03-17 11:01

Given your good intentions, wouldn’t it be a more rewarding use of your time and effort to encourage these people to move away from piss-poor tools, rather than helping them to make more work for themselves?

I seriously don’t expect people to switch from their tools because of what I write on this blog, and how they spend their time programming is their problem, not mine.

And to Dare:

from your post as it is quite clear that you are exaggerating the complexity of RSS if the facts are looked at objectively.

Given the assumption that nothing I write in this blog will make people turn away from using regular expressions instead of the right tool, RSS is not simple.

Finally: Instead of making this an “Is RSS evil?” or “Don’t use a screwdriver where a hammer is most suited” discussion, take it as an incentive to develop a set of best practices for RSS that will let even the screwdriver-owners produce and consume RSS feeds with the least amount of trouble. (The same goes for Atom, btw.)

Comment from Dare Obasanjo on 2004-03-17 15:38

Arve,
Processing XML is not simple, processing HTML is not simple, writing a C compiler is not simple. I don’t think coming up with a set of guidelines so that any college freshman can write a C compiler using regular expressions is a good idea especially when they can just use gcc.

So why exactly is it a good idea to come up with “best practices” in the case of HTML and XML processing in RSS?

I still don’t see your point.

Comment from Arve on 2004-03-17 15:43

The idea is that the threshold should be lower for those who produce their personal aggregators using their stupid tools, and that the production of syndicated content should be easier for those using stupid tools.

Those who use proper tools for the job won’t notice the difference anyway.

Comment from Daniel Cazzulino on 2004-03-17 16:56

Completely agree with Dare. Complex things such as XML require the proper tools.
You’re trying to make something new, complex and incredibly versatile and powerful sort of backwards compatible with tools that weren’t designed with it in mind. Should XML become regex-friendly then?! Of course NOT, IMO.

This discussion has been closed. No further comments may be added.