03.09.08 @ 21:41

Using FeedAPI's API

SUBMITTED BY oren

FeedAPI is a great module, and it is easy to use if you want it to parse your RSS/RDF/Atom feeds into nodes or lightweight items. Quite a few posts already cover that, but the developer documentation for FeedAPI's hooks could still be better. This post is a walk-through of the code.

To let us build on its capabilities, FeedAPI gives us two hooks: hook_feedapi_feed for creating a parser, and hook_feedapi_item for creating a processor. If you have used FeedAPI before, these concepts are familiar to you: parser_common_syndication and simplepie are examples of parsers implementing the first hook, and feedapi_node and aggregator are examples of processors implementing the second.

When you extend FeedAPI through these hooks, you will most likely want to implement a parser, say because your client wants to pull data from a custom XML feed that FeedAPI does not support yet.
So first, let's look at a simplified implementation of hook_feedapi_feed (with the caching bits removed), courtesy of parser_common_syndication (PCS for brevity from here on). The parts you would change are named right after the listing.

<?php
function parser_common_syndication_feedapi_feed($op) {
  $args = func_get_args();
  switch ($op) {
    case 'type':
      // The feed type(s) this parser handles.
      return array("XML feed");

    case 'compatible':
      if (!function_exists('simplexml_load_string')) {
        return FALSE;
      }
      $url = $args[1]->url;
      $downloaded_string = _parser_common_syndication_download($url, $op);
      if (is_object($downloaded_string)) {
        return $downloaded_string->type;
      }
      // The libxml flags are only passed when LIBXML_VERSION is defined and
      // PHP is 5.1.0 or newer.
      if (!defined('LIBXML_VERSION') || (version_compare(phpversion(), '5.1.0', '<'))) {
        $xml = @simplexml_load_string($downloaded_string, NULL);
      }
      else {
        $xml = @simplexml_load_string($downloaded_string, NULL, LIBXML_NOERROR | LIBXML_NOWARNING);
      }
      if (_parser_common_syndication_feed_format_detect($xml) != FALSE) {
        // array_shift() takes its argument by reference, so use a variable.
        $types = parser_common_syndication_feedapi_feed('type');
        return array_shift($types);
      }
      return FALSE;

    case 'parse':
      $feed = is_object($args[1]) ? $args[1] : FALSE;
      $parsed_feed = _parser_common_syndication_feedapi_parse($feed);
      return $parsed_feed;
  }
}
?>

The three things you will want to change here are the name of the hook itself and the calls to _parser_common_syndication_feed_format_detect and _parser_common_syndication_feedapi_parse.

FeedAPI calls this hook after it has instantiated a $feed object. The hook receives a variable number of arguments, which we collect through func_get_args(). In the 'compatible' $op, we download the feed and inspect it to make sure that our parser recognizes it. PCS uses PHP's SimpleXML extension to parse the feed, hence the simplexml_load_string() calls. If all is well, we call the hook again with the 'type' $op, which returns "XML feed". I guess the 'parse' $op is self-explanatory: this is where you plug in your own version of _parser_common_syndication_feedapi_parse. Let's call it myparser_parse().

So to start implementing your version, name the hook after your module (module_name_feedapi_feed) and change the call to _parser_common_syndication_feedapi_parse into a call to myparser_parse(). You can decide for yourself whether or not to implement your own format detection function.
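
As a rough sketch of where that leaves you, here is the hook for a hypothetical module named "myparser". Several things here are assumptions made so the sketch runs outside Drupal: the <catalog> root element is invented, file_get_contents() stands in for _parser_common_syndication_download(), the format check is done inline, and myparser_parse() is only a placeholder.

<?php
/**
 * Sketch of hook_feedapi_feed() for a hypothetical "myparser" module.
 */
function myparser_feedapi_feed($op) {
  $args = func_get_args();
  switch ($op) {
    case 'type':
      // Label for the kind of feed this parser handles (made up here).
      return array('My custom XML feed');

    case 'compatible':
      if (!function_exists('simplexml_load_string')) {
        return FALSE;
      }
      // Assumption: a plain fetch instead of the PCS download helper.
      $raw = @file_get_contents($args[1]->url);
      $xml = is_string($raw) ? @simplexml_load_string($raw) : FALSE;
      // Inline detection: accept only documents rooted at <catalog>.
      if ($xml !== FALSE && $xml->getName() == 'catalog') {
        $types = myparser_feedapi_feed('type');
        return array_shift($types);
      }
      return FALSE;

    case 'parse':
      $feed = (isset($args[1]) && is_object($args[1])) ? $args[1] : FALSE;
      return $feed ? myparser_parse($feed) : FALSE;
  }
}

/**
 * Placeholder parser; returns FALSE until real parsing is filled in.
 */
function myparser_parse($feed) {
  return FALSE;
}
?>

The 'type' label is returned from one place only, so 'compatible' and any other caller always agree on it.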

A couple of functions are involved in fetching the feed in PCS. Here's a rundown: _parser_common_syndication_feedapi_parse (our main parser) calls _parser_common_syndication_download (which makes some preliminary tests on the data), which in turn calls _parser_common_syndication_feedapi_get (which checks whether the feed exists in the DB and has changed from its cached copy, and if not, fetches the feed and checks it). This is all done so that no time is wasted parsing items or feeds that already exist.

You don't have to use these functions, or write your own versions of them, to parse the data, but they make the process more efficient. If PCS is enabled, you can simply call _parser_common_syndication_download() instead of implementing it yourself; it is a helper function that is not tied to any particular format. In essence, you can reuse PCS and just replace the calls to "feedapi_parse" and "feed_format_detect", which we will get to in a minute.
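
One way to lean on the PCS helper without hard-wiring it, sketched under assumptions: myparser_download() is a made-up wrapper name, and the fallback is a bare file_get_contents() that skips all of the caching described above.

<?php
/**
 * Hypothetical wrapper: use the PCS download helper when PCS is enabled,
 * otherwise fall back to a plain fetch.
 */
function myparser_download($url, $op) {
  if (function_exists('_parser_common_syndication_download')) {
    return _parser_common_syndication_download($url, $op);
  }
  // Assumption: an uncached fetch is acceptable when PCS is missing.
  $raw = @file_get_contents($url);
  return is_string($raw) ? $raw : FALSE;
}
?>
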
Now, take a look at:

<?php
function myparser_parse($feed) {
  if (is_a($feed, 'SimpleXMLElement')) {
    // The caller already handed us a parsed document.
    $xml = $feed;
  }
  else {
    $downloaded_string = _parser_common_syndication_download($feed->url, 'parse');
    // Pass FALSE or an object from the download helper straight back.
    if ($downloaded_string === FALSE || is_object($downloaded_string)) {
      return $downloaded_string;
    }

    if (!defined('LIBXML_VERSION') || (version_compare(phpversion(), '5.1.0', '<'))) {
      $xml = @simplexml_load_string($downloaded_string, NULL);
    }
    else {
      $xml = @simplexml_load_string($downloaded_string, NULL, LIBXML_NOERROR | LIBXML_NOWARNING);
    }

    // We got malformed XML. Strict comparisons are used because an empty
    // SimpleXMLElement loosely equals NULL.
    if ($xml === FALSE || $xml === NULL) {
      return FALSE;
    }
  }

  // Dispatch to the parser that matches the detected format.
  $feed_type = _parser_common_syndication_feed_format_detect($xml);
  if ($feed_type == "atom1.0") {
    return _parser_common_syndication_atom10_parse($xml);
  }
  if ($feed_type == "RSS2.0" || $feed_type == "RSS0.91" || $feed_type == "RSS0.92") {
    return _parser_common_syndication_RSS20_parse($xml);
  }
  if ($feed_type == "RDF") {
    return _parser_common_syndication_RDF10_parse($xml);
  }
  return FALSE;
}
?>

In your implementation of the top-level parser function (myparser_parse in this example; please note that this is not a hook!), make sure you load the XML. That can be done via simplexml_load_string() (if you already have the contents as a string, through _parser_common_syndication_download() or a similar method) or simplexml_load_file(). SimpleXML will create an object from the XML. Now comes your implementation of feed recognition. You might use SimpleXMLElement::getName() to recognize the root tag.
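
A minimal sketch of such a recognition function, assuming a hypothetical module called "myparser" and two invented root elements, <catalog> and <inventory>. The returned labels are made up as well; only getName() is the real SimpleXML call.

<?php
/**
 * Returns a format label for a parsed document, or FALSE when the root
 * element is not one we know. Element names and labels are hypothetical.
 */
function myparser_feed_format_detect($xml) {
  if (!($xml instanceof SimpleXMLElement)) {
    return FALSE;
  }
  switch (strtolower($xml->getName())) {
    case 'catalog':
      return 'catalog1.0';
    case 'inventory':
      return 'inventory1.0';
  }
  return FALSE;
}

$xml = simplexml_load_string('<catalog><product/></catalog>');
$format = myparser_feed_format_detect($xml);
?>

Returning FALSE for anything unrecognized keeps the function usable both in the 'compatible' $op and as the dispatcher inside the parser.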

Once the format is recognized, you can direct the feed to the "real" parser function if you have a number of them, or do the actual organization of the data right here. I guess most of the time you'll leave the function like the original, with only minor changes. If you look at _parser_common_syndication_feedapi_parse, you'll see that the main part you need to change is the calls to the appropriate parsers. If you only ever call one parser, you can do the processing here as well.

The question is how to parse the data. Remember that SimpleXML created an object, so you'll be able to traverse it with foreach(), or by using SimpleXMLElement::xpath().
Note that some type casting might be in order here. You can take one of PCS's own parsers as a reference.
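
To make the traversal and the casting concrete, here is a small self-contained sketch. The <catalog> document and its element names are invented for illustration; the SimpleXML calls themselves are standard PHP.

<?php
$xml = simplexml_load_string(
  '<catalog>
     <product sku="a1"><name>Lamp</name><price>19.90</price></product>
     <product sku="b2"><name>Desk</name><price>120</price></product>
   </catalog>'
);

// Traverse child elements with foreach(). Without the (string) cast you
// would be storing SimpleXMLElement objects instead of plain strings.
$names = array();
foreach ($xml->product as $product) {
  $names[] = (string) $product->name;
}

// xpath() returns an array of SimpleXMLElement objects.
$prices = array();
foreach ($xml->xpath('//product/price') as $price) {
  $prices[] = (float) (string) $price;
}

// Attributes are read with array syntax and need a cast too.
$sku = (string) $xml->product[0]['sku'];
?>
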

Another note: if you pass the data to the node processor and it doesn't find either a 'guid' or an 'original_url' field, it will ignore the data. So fill those in.
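
A sketch of filling those fields in, under stated assumptions: the <catalog> layout is invented, myparser_catalog_parse() is a hypothetical name, and the shape of the returned object (items carrying an options object with 'guid' and 'original_url') is modeled on what PCS's own parsers appear to return, so check it against the PCS code for your FeedAPI version.

<?php
/**
 * Turns a hypothetical <catalog> document into a feed object with items.
 * Property names are an assumption modeled on PCS's parsers.
 */
function myparser_catalog_parse($xml) {
  $parsed = new stdClass();
  $parsed->title = (string) $xml->title;
  $parsed->description = '';
  $parsed->options = new stdClass();
  $parsed->options->link = (string) $xml->link;
  $parsed->items = array();

  foreach ($xml->product as $product) {
    $item = new stdClass();
    $item->title = (string) $product->name;
    $item->description = (string) $product->summary;
    $item->options = new stdClass();
    $item->options->original_url = (string) $product->url;
    // Fall back to the URL when the feed has no identifier of its own.
    $sku = (string) $product['sku'];
    $item->options->guid = ($sku !== '') ? $sku : $item->options->original_url;
    $parsed->items[] = $item;
  }
  return $parsed;
}
?>

Falling back to the URL means every item ends up with a 'guid' as long as the feed supplies a link for it.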

Well, that's about it! Enjoy...
Oh, and of course, thanks Aron.

2 Comments so far

If I understood what you described, then yes, that would be the way to go. You can also use a separate content type for regular RSS items and specific ones.

oren 3 years ago

Thanks, this writeup is really helpful. I've been messing around with parsing XML query results for the past few days. I managed to add a new type and parsing function to PCS, but I think I'd like to add an entirely new parser module (based on PCS) for the sake of cleanliness. You gave me a nice start in doing that!

Miles (not verified) 2 years ago
