Parsing RSS Feeds With PHP
The beauty of RSS is that it's meant to be consumed. Unlike screen-scraping, with its various ethical and legal implications, RSS aggregation is not only fair game but the format's intended method of propagation. Here is the aggregation function that works best for me.
Check out the demo or download the project file.
My solution had a couple of self-imposed constraints:
- No third-party classes. For my own education, I'd prefer to build it myself so I know exactly what it's doing.
- Use only PHP extensions and libraries that are part of a standard DreamHost installation. The PEAR framework has something called XML_Feed_Parser, but we can do better.
- Parse multiple formats: RSS 2.0 and Atom 1.0.
PHP 5 has a function called simplexml_load_string that is going to do the job nicely. I'm going to build a function that wraps simplexml_load_string and pulls out the specific RSS and Atom tags I care about.
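To get a feel for what simplexml_load_string hands back, here is a minimal sketch that parses two tiny hand-written feeds, one RSS 2.0 and one Atom 1.0. The feed contents and variable names are made up for illustration; the point is only to show where titles and links live in each format.

```php
<?php
// Two tiny hand-written feeds, used here only for illustration.
$rssXml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example RSS</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
  </channel>
</rss>
XML;

$atomXml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Atom</title>
  <entry>
    <title>First entry</title>
    <link href="http://example.com/entry"/>
  </entry>
</feed>
XML;

$rss  = simplexml_load_string($rssXml);
$atom = simplexml_load_string($atomXml);

// RSS 2.0 nests items inside <channel>; the link is element text.
$rssTitle = (string) $rss->channel->item[0]->title;
$rssLink  = (string) $rss->channel->item[0]->link;

// Atom 1.0 puts entries directly under <feed>; the link is an href attribute.
$atomTitle = (string) $atom->entry[0]->title;
$atomLink  = (string) $atom->entry[0]->link['href'];

echo "$rssTitle -> $rssLink\n";
echo "$atomTitle -> $atomLink\n";
```

The (string) casts matter: without them each value is a SimpleXMLElement object rather than plain text.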
Here's the game plan:
- Use cURL to fetch the feed as a string
- Manipulate the string into an object using simplexml_load_string
- Assemble and return a multi-dimensional array of elements
Here is the function:
    function GetFeed($feed) {
        // User agent sent with the request; this value is only an example
        $useragent = 'Mozilla/5.0 (compatible; FeedFetcher/1.0)';
        // Use cURL to fetch text
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $feed);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
        $rss = curl_exec($ch);
        curl_close($ch);
        // Bail out if the request failed or returned nothing
        if ($rss === false || $rss === '') {
            return FALSE;
        }
        // Manipulate string into object (FALSE if the XML is malformed)
        $rss = simplexml_load_string($rss);
        if ($rss === false) {
            return FALSE;
        }
        // Assemble multi-dimensional array of elements based on feed type
        $varArray = array();
        // If it has channel tags (RSS 2.0)
        if (count($rss->channel) !== 0) {
            foreach ($rss->channel->item as $item) {
                // Cast to string so the array holds text, not SimpleXMLElement objects
                $varTitle = (string) $item->title;
                $varLink = (string) $item->link;
                $varPubDate = (string) $item->pubDate;
                $varDescription = (string) $item->description;
                if (strlen($varTitle) !== 0) {
                    $varArray[] = array($varTitle, $varLink, $varPubDate, $varDescription);
                }
            }
            return $varArray;
        }
        // If it has entry tags (Atom 1.0)
        elseif (count($rss->entry) !== 0) {
            foreach ($rss->entry as $item) {
                $varTitle = (string) $item->title;
                $varLink = (string) $item->link->attributes()->href;
                // <published> is optional in Atom 1.0; fall back to the required <updated>
                $varPubDate = (string) (isset($item->published) ? $item->published : $item->updated);
                $varDescription = (string) $item->content;
                if (strlen($varTitle) !== 0) {
                    $varArray[] = array($varTitle, $varLink, $varPubDate, $varDescription);
                }
            }
            return $varArray;
        }
        // Neither RSS 2.0 nor Atom 1.0
        return FALSE;
    }
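Here is a rough sketch of what consuming GetFeed's return value might look like. To keep it runnable without a network call, the rows below are made-up sample data in the same shape the function returns (title, link, date, description), and formatFeedRow is a hypothetical helper, not part of the function above.

```php
<?php
// Sample rows in the shape GetFeed() returns: title, link, date, description.
// These values are made up; a real call would look like
// $items = GetFeed('http://example.com/feed');
$items = array(
    array('RSS post', 'http://example.com/a', 'Mon, 06 Sep 2010 16:20:00 +0000', 'From an RSS 2.0 feed'),
    array('Atom entry', 'http://example.com/b', '2010-09-07T08:05:00Z', 'From an Atom 1.0 feed'),
);

// Hypothetical helper: turn one row into a single display line.
function formatFeedRow(array $row)
{
    list($title, $link, $date, $description) = $row;
    // strtotime() understands both the RFC 822 dates used by RSS 2.0
    // and the RFC 3339 dates used by Atom 1.0.
    $timestamp = strtotime((string) $date);
    $day = ($timestamp === false) ? 'unknown date' : gmdate('Y-m-d', $timestamp);
    return $day . ' | ' . $title . ' | ' . $link;
}

$lines = array_map('formatFeedRow', $items);
echo implode("\n", $lines), "\n";
```

Since GetFeed returns FALSE when the fetch or the parse fails, a real caller would want to check for that before looping over the result.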
A couple of notes:
- Due to security concerns, opening remote URLs with fopen (PHP's allow_url_fopen setting) is disabled by default on many hosts, including DreamHost. cURL covers the same ground and can handle any remote read. If you'd rather not announce your aggregator to the sites you're pulling from, you could easily send a browser's user agent string instead.
- You know what the funniest thing about RSS 2.0 and Atom 1.0 is? It's the little differences. Where RSS 2.0 has channels, Atom 1.0 has feeds. Where RSS 2.0 has items, Atom 1.0 has entries. It's a little different.
- However you use your incoming feeds, you're going to want to sanitize the text a bit. In my example, I strip tags.
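As one possible take on the sanitizing step, here is a minimal sketch built on PHP's strip_tags. The helper name sanitizeFeedText and the 200-character default are assumptions for illustration, not what the demo necessarily does.

```php
<?php
// Hypothetical helper: reduce feed markup to plain, escaped text.
function sanitizeFeedText($text, $maxLength = 200)
{
    // Drop HTML tags. Note that strip_tags() removes the tags themselves
    // but keeps the text between them, including the body of a <script>.
    $plain = strip_tags((string) $text);
    // Decode entities such as &amp; so they are not escaped twice later.
    $plain = html_entity_decode($plain, ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace left behind by removed tags.
    $plain = trim(preg_replace('/\s+/', ' ', $plain));
    // Shorten long descriptions (byte-based; mb_substr suits multibyte text better).
    if (strlen($plain) > $maxLength) {
        $plain = rtrim(substr($plain, 0, $maxLength)) . '...';
    }
    // Escape what is left so it is safe to print inside an HTML page.
    return htmlspecialchars($plain, ENT_QUOTES, 'UTF-8');
}

$raw = '<p>Fish &amp; chips <script>alert("x")</script> for   <b>lunch</b></p>';
echo sanitizeFeedText($raw), "\n";
// prints: Fish &amp; chips alert(&quot;x&quot;) for lunch
```

Escaping at the very end means anything that survives the tag stripping still cannot be interpreted as markup by the browser.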