Parsing RSS Feeds With PHP

The beauty of RSS is that it's meant to be consumed. Unlike screen-scraping, with its various ethical and legal implications, RSS aggregation is not only fair game but the format's intended method of propagation. Here is the aggregation function that has worked best for me.

Check out the demo or download the project file.

My solution had a few self-imposed constraints:

  • No third-party classes. For my own education, I'd prefer to build it myself so I know exactly what it's doing.
  • Use only PHP extensions and libraries that are part of a standard DreamHost installation. The PEAR framework has something called XML_Feed_Parser, but we can do better.
  • Parse multiple formats: RSS 2.0 and Atom 1.0.

PHP 5 has a function called simplexml_load_string that is going to do the job nicely. I'm going to build a function around simplexml_load_string that isolates specific RSS and Atom tags.
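
To get a feel for what simplexml_load_string hands back, here is a minimal sketch that parses a small hand-written RSS 2.0 snippet. The sample XML and the variable names are made up for illustration and do not come from any real feed.

```php
<?php
// Hypothetical, hand-written RSS 2.0 snippet used only for illustration.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

// simplexml_load_string returns a SimpleXMLElement, or false if the XML is malformed.
$rss = simplexml_load_string($xml);

$titles = array();
if ($rss !== false) {
  // Child elements are exposed as properties; repeated tags can be looped over.
  foreach ($rss->channel->item as $item) {
    // Cast to string to get the tag's text rather than an object.
    $titles[] = (string) $item->title;
  }
}

print_r($titles);
```

Each tag becomes a property on the object, which is what makes it easy to pick out just the handful of elements an aggregator cares about.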

Here's the game plan:

  1. Use cURL to fetch the feed as a string
  2. Manipulate the string into an object using simplexml_load_string
  3. Assemble and return a multi-dimensional array of elements

Here is the function:

  <?php
  function GetFeed($feed) {
    // User agent sent with the request (placeholder value; change it to suit)
    $useragent = 'Mozilla/5.0 (compatible; GetFeed)';
    // Use cURL to fetch text
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $feed);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    $rss = curl_exec($ch);
    curl_close($ch);
    // curl_exec returns FALSE if the request failed
    if ($rss === FALSE) {
      return FALSE;
    }
    // Manipulate string into object (FALSE if the XML is malformed)
    $rss = simplexml_load_string($rss);
    if ($rss === FALSE) {
      return FALSE;
    }
    // Assemble multi-dimensional array of elements based on feed type
    $varArray = array();
    // If it has channel tags (RSS 2.0)
    if (count($rss->channel) !== 0) {
      foreach ($rss->channel->item as $item) {
        // Cast to string so the array holds text, not SimpleXMLElement objects
        $varTitle       = (string) $item->title;
        $varLink        = (string) $item->link;
        $varPubDate     = (string) $item->pubDate;
        $varDescription = (string) $item->description;
        if (strlen($varTitle) !== 0) {
          $varArray[] = array($varTitle, $varLink, $varPubDate, $varDescription);
        }
      }
      return $varArray;
    }
    // If it has entry tags (Atom 1.0)
    elseif (count($rss->entry) !== 0) {
      foreach ($rss->entry as $item) {
        $varTitle       = (string) $item->title;
        // Atom keeps the URL in the link tag's href attribute
        $varLink        = (string) $item->link['href'];
        $varPubDate     = (string) $item->published;
        $varDescription = (string) $item->content;
        if (strlen($varTitle) !== 0) {
          $varArray[] = array($varTitle, $varLink, $varPubDate, $varDescription);
        }
      }
      return $varArray;
    }
    // Neither format was recognized
    return FALSE;
  }
  ?>
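
As a rough sketch of how the return value might be consumed: each row is an array holding title, link, date and description, in that order, and GetFeed returns FALSE when nothing could be parsed. The helper name FeedToHtml and the sample rows below are hypothetical; in real use $items would come from a call to GetFeed with a feed URL.

```php
<?php
// Hypothetical helper: turns the array returned by GetFeed into an HTML list.
function FeedToHtml($items) {
  // GetFeed returns FALSE on failure, so handle that before looping.
  if ($items === false || count($items) === 0) {
    return '<p>No items found.</p>';
  }
  $html = "<ul>\n";
  foreach ($items as $item) {
    // Rows are positional: 0 = title, 1 = link, 2 = date, 3 = description.
    list($title, $link, $date, $description) = $item;
    // Cast to string in case the row holds SimpleXMLElement objects,
    // and escape everything before it goes into the page.
    $html .= '  <li><a href="' . htmlspecialchars((string) $link) . '">'
           . htmlspecialchars((string) $title) . '</a> ('
           . htmlspecialchars((string) $date) . ")</li>\n";
  }
  return $html . "</ul>\n";
}

// Made-up sample rows standing in for a real GetFeed() result.
$items = array(
  array('First post', 'http://example.com/first', 'Mon, 04 Jan 2010 10:00:00 GMT', 'Hello'),
);
echo FeedToHtml($items);
```

Checking for FALSE first matters because a dead URL or malformed XML is a normal occurrence when you aggregate other people's feeds.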

A few notes:

  • Due to security concerns, opening remote URLs with fopen is disabled by default on many hosts (including DreamHost) through the allow_url_fopen setting. cURL covers the same ground and can fetch any remote file, text or binary. If you'd rather not advertise your aggregator to the sites you're aggregating, you can easily set the user agent string to impersonate a browser.
  • You know what the funniest thing about RSS 2.0 and Atom 1.0 is? It's the little differences. Where RSS 2.0 has channels, Atom 1.0 has feeds. Where RSS 2.0 has items, Atom 1.0 has entries. It's a little different.
  • However you use your incoming feeds, you're going to want to sanitize the text a bit. In my example, I strip tags.
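
For the sanitizing step, a minimal sketch might look like the following. The function name CleanFeedText and the 200-character default are my own placeholders, not part of the function above; strip_tags, preg_replace and substr are standard PHP.

```php
<?php
// Hypothetical helper: reduce a feed description to plain, trimmed text.
function CleanFeedText($text, $maxLength = 200) {
  // Remove any HTML markup the publisher embedded in the description.
  $text = strip_tags((string) $text);
  // Collapse runs of whitespace left behind by the removed tags.
  $text = trim(preg_replace('/\s+/', ' ', $text));
  // Truncate long descriptions (byte-based; multibyte text may need mb_substr).
  if (strlen($text) > $maxLength) {
    $text = substr($text, 0, $maxLength) . '...';
  }
  return $text;
}

echo CleanFeedText('<p>Hello <b>world</b></p>');
```

This only extracts plain text. If the result ends up in an HTML page, it should still be escaped on output, for example with htmlspecialchars.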

© 2010 Jon Plante. All Rights Reserved.