Takes an RSS/Atom feed and converts it into a struct full of feed and item metadata.
High Points
- Added Atom 1.0 support. Since it looks like my name is going to end up in the acknowledgements for the Atom IETF specification, I figured it would be slightly embarrassing if I failed to update my old CFC to work with the new spec.
- It's a near-complete rewrite of the old code.
- JournURL is now using the same CFC in production, so updates to the public release of the component will come more frequently than in the past.
Basic Tech Info
- Supports RSS 1.0, RSS 2.0, Atom 0.3, and Atom 1.0. (It should support the 0.9x family of RSS feeds as well, but no specific testing was done.)
- Supports optional title synthesis, for the occasional title-free RSS feed.
- Includes optional summary preference.
- Includes optional, rudimentary xml:base support.
- Supports RSS+Atom, which is my name for the practice of embedding useful Atom 1.0 elements in RSS 2.0 feeds via namespaces.
Example Use
<cfinvoke component="rssatom" method="normalize" rss="#cfhttp.filecontent#" xmlbase="false" prefersummary="true" synthtitle="true" returnvariable="foo" />
<cfdump var="#foo#" />
Returns
feed (struct)
feed.title (string)
feed.description (string)
feed.link (string)
feed.linkself (string)
feed.id (string)
feed.author (string)
feed.authoremail (string)
feed.authorurl (string)
feed.date (date)
feed.dateupdated (date)
items (array of structs)
items[x].summary (string)
items[x].summarytype (string)
items[x].content (string)
items[x].contenttype (string)
items[x].title (string)
items[x].author (string)
items[x].authoremail (string)
items[x].authorurl (string)
items[x].date (date)
items[x].dateupdated (date)
items[x].link (string)
items[x].linkenclosure (string)
items[x].linkcomments (string)
items[x].id (string)
Notes
Atom 1.0 is significantly more difficult to parse than existing feed formats like RSS 2.0. (In ColdFusion, anyway.) For better or worse, Atom is extremely complex, and enables all sorts of behavior that makes life complicated on the consuming end.
(A consensus-driven specification process can be... interesting.)
For example, Atom 1.0 encourages the use of @xml:base and relative URIs in feeds and entries. Given that CF doesn't do anything magical with xml:base to streamline processing, that means I had to resort to brute force.
(Personal plea: ignore the spec and avoid use of @xml:base in your own publishing. Think of the children!)
So rssatom.cfc checks for @xml:base on the atom:feed, atom:entry, atom:summary, and atom:content elements, and uses it to resolve any @hrefs and @srcs it finds in its descendants. Base paths can be "stacked", as in this example:
<feed xml:base="http://foo.com/">
  <entry xml:base="this/is/">
    <content xml:base="a/path/to/">
      <div>
        <img src="img.jpg" />
      </div>
    </content>
  </entry>
</feed>
In this case, the URI of the image will resolve to:
http://foo.com/this/is/a/path/to/img.jpg
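For the curious, here's a minimal sketch of that stacking behavior in plain Java, since java.net.URI is what does the heavy lifting. This is not the code inside rssatom.cfc. The class name XmlBaseSketch and the method resolveStacked are hypothetical, made up for illustration, and the sketch assumes you've already collected the @xml:base values from the outermost element inward.

```java
import java.net.URI;

// Hypothetical helper, not taken from rssatom.cfc: illustrates xml:base stacking.
final class XmlBaseSketch {

    // Resolves each xml:base against the one above it (outermost first),
    // then resolves the @href or @src value against the accumulated base.
    static String resolveStacked(String reference, String... xmlBases) {
        URI base = null;
        for (String xmlBase : xmlBases) {
            URI next = URI.create(xmlBase);
            // An absolute inner base simply replaces whatever came before it.
            base = (base == null) ? next : base.resolve(next);
        }
        URI ref = URI.create(reference);
        // With no xml:base in scope, the reference is returned unchanged.
        return (base == null ? ref : base.resolve(ref)).toString();
    }

    public static void main(String[] args) {
        // Same values as the feed/entry/content example above.
        System.out.println(
            resolveStacked("img.jpg", "http://foo.com/", "this/is/", "a/path/to/"));
    }
}
```

Note that a reference which is already an absolute URI passes through untouched, which is why feeds that stick to absolute URIs never need any of this.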
My implementation should handle most situations where a misguided XML geek has decided to play with his toys, but it's ultimately an inelegant hack. @xml:base can theoretically show up just about anywhere in an Atom document, and the CFC will miss such oddball placements. Don't count on perfection, particularly in this first release.
More importantly, don't count on being able to use @xml:base support at all. I didn't want to take the time to hand-craft a strictly conformant URI resolver in CFML, so I'm using java.net.URI to do the job. If you're running in a sandbox that restricts access to Java objects, your best bet is to just forget any feeds that use @xml:base and get on with your life. With any luck, mainstream feed providers will stick to absolute URIs in their feeds.
You can of course try to slap together your own resolution code, or perhaps use reflection to get at java.net.URI from within the sandbox. That's your call.
Now let's talk about Atom 0.3 for a second. The first thing to understand is that Atom 0.3 doesn't officially exist... it was an experimental draft that was never meant for broad production use. So with any luck, most 0.3 feeds will disappear relatively quickly over the next few months.
Until then, the CFC will hide most of the conflicts from you, except one: the atom:content <div>. In Atom 1.0, all XHTML content must be wrapped in a container <div>, and that was common practice in the 0.3 days as well. Unfortunately, the <div> wasn't part of the draft document, so some folks didn't use it.
In practical terms, this means that it's impossible to tell if a <div> at the root of a 0.3 atom:content element is a wrapper or part of the content. Stripping it (as you're expected to do in 1.0 entries) could end up distorting the intent of the publisher, and is thus a rather bad idea. These days, a seemingly useless <div> may be host to microformat data that will be mangled if you delete it recklessly.
Bottom line: expect some of your 0.3 content to contain excess markup here and there. You're better off ignoring it than trying to do anything about it.
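If you want to see the wrapper rule spelled out, here's a rough sketch in plain Java. It is an illustration of the rule described above, not the CFC's actual code. The class AtomContentSketch and its methods contentRoot and parse are hypothetical names, and the sketch assumes the caller already knows whether the feed is Atom 1.0 and that XHTML content is flagged with type="xhtml".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not taken from rssatom.cfc.
final class AtomContentSketch {

    // Returns the element whose children should be treated as the content.
    // Atom 1.0 XHTML: the single child <div> is a wrapper, so return it.
    // Anything else (including Atom 0.3): return atom:content untouched.
    static Element contentRoot(Element content, boolean isAtom10) {
        if (!isAtom10 || !"xhtml".equals(content.getAttribute("type"))) {
            return content;
        }
        Element onlyChild = null;
        for (Node n = content.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                if (onlyChild != null) {
                    return content; // more than one child element: no wrapper
                }
                onlyChild = (Element) n;
            }
        }
        boolean isWrapper = onlyChild != null && "div".equals(onlyChild.getLocalName());
        return isWrapper ? onlyChild : content;
    }

    // Parses a string into its root element (namespace-aware).
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Element content = parse("<content type=\"xhtml\">"
                + "<div xmlns=\"http://www.w3.org/1999/xhtml\"><p>Hi</p></div></content>");
        System.out.println(contentRoot(content, true).getLocalName());
        System.out.println(contentRoot(content, false).getLocalName());
    }
}
```

The same markup gives a different answer depending on the feed version, which is the whole problem: for 0.3 the safe choice is to leave the <div> where it is.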
Okay, so here's the file. Go for it.
07-20-2005 03:11:07PM
category: XML
related topics: (RSS) (Atom) (syndication) (feeds) (parsing)