Last week I wrote about getting back into coding, in part to pursue some pet projects. So here's one thing that I've been tinkering with off and on since late January. (Anyone waiting for a big technology revelation is going to be disappointed - in some ways this is quite retro.)
While there are some deeper agendas involved (once again, to be discussed later), the proximate cause for this project was my simultaneous frequent use of and continued frustration with Gabe Rivera's Techmeme and Memeorandum blog discussion trackers. The good news is that they are a mostly ad-free way to keep up with the top stories in their respective domains, high tech and politics. The bad news is that the blogs and topics followed are Gabe's choices, not mine. What I really wanted was 'automatic peripheral vision' to pick up breaking stories from blog communities that I find interesting, but can't afford the time to follow in detail.
According to legend, Gabe knocked together the core of his platform in a year, so I figured I should be able to build something useful in less than that if I was willing to cut back the functionality where it didn't support my goal. Making this my first serious project in Perl would take me right into the heart of its library functionality: HTTP requests, XML crunching, Web services calls, and piles of regular expressions and other text grinding. A decent-sized journeyman's project.
I made several functionality compromises to keep the effort bounded. The first was not to try scraping arbitrary blog HTML, but instead to use only blogs that publish a feed of some sort, Atom or one of the flavors of RSS. And that feed would need to be either full-content or at least long enough to contain a decent number of outbound links. The second compromise was a direct consequence: not to worry about being exhaustive; this was to be my own alerting service, not a generic utility such as Technorati. I would hand-build the blog community lists rather than try to gather them automatically - I considered this degree of hand control to be a feature. Finally, I wouldn't attempt any sort of full-text analysis other than collecting HREF texts - this was strictly about citation counting until and unless I proved something else was needed.
The first bootstrap step was obvious, since I needed a blog community list, and the easiest way to get one was via OPML export from iGoogle, which I use as a reader. Adopting OPML as the config file format also let me borrow OPMLs from around the blogosphere as blind tests. (Thanks to Ole Eichhorn and others who unknowingly provided test data.) Getting the OPML import working got me into XML parsing. I started out using the CPAN XPath library, but that eventually turned out to be a mistake: the parser involved was not very robust, and while it worked fine for OPML, it fell apart when exposed to malformed feeds (which are out there, and in the oddest places - slashdot!?). I eventually had to back it out and go with the full LibXML interface, with attendant gnashing of teeth since the two interfaces are similar, but not identical.
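For the curious, the OPML step boils down to something like this - a minimal sketch using XML::LibXML rather than the project's actual code, with an invented file name:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Load an exported OPML file and pull out each feed's xmlUrl attribute.
# Real OPML exports nest <outline> elements, so a '//' XPath query
# finds the feed entries at any depth.
my $doc = XML::LibXML->new->parse_file('community.opml');

my @feed_urls;
for my $outline ( $doc->findnodes('//outline[@xmlUrl]') ) {
    push @feed_urls, $outline->getAttribute('xmlUrl');
}

print "$_\n" for @feed_urls;
```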
Getting to the feed parsing milestone pushed me through fun things like XML namespaces and the specs for Atom and the various RSS flavors. It also produced some interesting revelations about their rather variable semantics in actual use, around things like GUIDs, alternate links and even publication dates. The good news was that I found plenty of feeds meeting my requirements. Those bloggers with a FeedBurner source have obviously thought about the matter. There are also a large number of blogs publishing feeds without even having an RSS/Atom bug on their home page - evidently as a result of default or trivial settings on their blogging platform. Those who think RSS is dead or dying haven't been trying to actually use it.
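To give a flavor of the per-format fiddling involved, here's a simplified sketch - mine, not the project's code - of pulling an entry's primary link out of either an Atom entry or an RSS item:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Atom elements live in a namespace; RSS 2.0 items generally don't.
my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs( atom => 'http://www.w3.org/2005/Atom' );

# Given an <item> or <atom:entry> node, return its primary link.
sub entry_link {
    my ($entry) = @_;

    # Atom: prefer the rel="alternate" link (rel defaults to alternate).
    if ( my ($alt) = $xpc->findnodes( 'atom:link[@rel="alternate" or not(@rel)]', $entry ) ) {
        return $alt->getAttribute('href');
    }

    # RSS: fall back to <link>, then to a permalink-flavored <guid>.
    my ($link) = $entry->findnodes('link');
    return $link->textContent if $link;

    my ($guid) = $entry->findnodes('guid[@isPermaLink="true" or not(@isPermaLink)]');
    return $guid ? $guid->textContent : undef;
}
```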
This got me to the point of actual output for the first time, which was simply a matter of collecting outbound links from the feeds, using a hash to build an inverted table, and sorting the referenced pages by link count. That was followed by an interval of writing code to stomp out ad links and the more egregious self-linking behaviors. An elapsed-time-based weighting function also turned out to be necessary to keep popular stories from sitting on top of the pile for a week.
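The counting core is nothing exotic; stripped of the ad and self-link filtering, it amounts to something like this sketch (the 24-hour half-life is purely illustrative, not the project's actual weighting):

```perl
use strict;
use warnings;

# Inverted table: each cited URL accumulates a time-decayed score,
# plus the set of blogs that linked to it (for the report).
my %cited;
my $half_life_hours = 24;    # illustrative decay constant

sub add_citation {
    my ( $url, $blog, $age_hours ) = @_;
    my $weight = 2 ** ( -$age_hours / $half_life_hours );
    $cited{$url}{score} += $weight;
    $cited{$url}{blogs}{$blog} = 1;
}

# Report order: highest decayed citation score first.
my @ranked = sort { $cited{$b}{score} <=> $cited{$a}{score} } keys %cited;
```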
Somewhere along the line some CSS was thrown in to make the output look a little less like the 90s. I also added a Technorati cosmos call for the top stories, in part as a means of finding further blogs to be evaluated for inclusion in a particular community list. It's a nice feature when the Technorati API service is running and stable, which is not always the case. (It's evidently not part of their core business model, whatever that may be.) I'm also generating multiple sections in my output lists, using different link weighting schemes to differentiate newly hot topics from those that have held interest over a longer time.
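For what it's worth, the cosmos lookup is just an HTTP GET and a bit of XML picking-apart. This sketch reflects my recollection of the old public Technorati API rather than the project's code, so treat the endpoint, parameters and response layout as assumptions:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw(uri_escape);
use XML::LibXML;

my $ua  = LWP::UserAgent->new( timeout => 10 );
my $key = $ENV{TECHNORATI_KEY};    # API key supplied however you like

# Ask Technorati which weblogs link to a given story URL.
sub cosmos_links {
    my ($story_url) = @_;
    my $resp = $ua->get( 'http://api.technorati.com/cosmos?key=' . $key
                       . '&url=' . uri_escape($story_url) );
    return () unless $resp->is_success;

    my $doc = XML::LibXML->new->parse_string( $resp->decoded_content );
    # Each <item> in the result names a weblog linking to the story.
    return map { $_->textContent } $doc->findnodes('//item/weblog/url');
}
```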
This was intended to be a Perl journeyman's project, and I have suffered accordingly. There was one complete pass through the code to rewrite it in more Perlish idiom, which I consider penance for my C accent. And there have been two rounds of refactoring, almost inevitable in an 'organically grown' project, resulting in a style (at least for now) of rather coarse-grained cover classes and abstractions for feed components and analysis steps.
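Just to illustrate the shape that settled into - a made-up example, not the project's actual classes - each feed ends up behind a cover object that the analysis steps never have to look inside:

```perl
package Community::Feed;    # name is invented, purely illustrative
use strict;
use warnings;

sub new {
    my ( $class, %args ) = @_;
    return bless { url => $args{url}, entries => [] }, $class;
}

sub fetch {
    my ($self) = @_;
    # HTTP GET, format detection and entry extraction would live here.
}

# The analysis code only ever asks a feed for its outbound links.
sub outbound_links {
    my ($self) = @_;
    return map { @{ $_->{links} || [] } } @{ $self->{entries} };
}

1;
```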
So how'd it come out? Here's a sample output so you can see for yourself. This was generated from my 'right & libertarian politics' community list. This list has been a useful test target since the posts and referenced stories turn over frequently, and it's an interesting counterpoint to memeorandum's spin.
I've found it takes 30 or more sampled blogs with varying posting frequencies to regularly produce interesting results. The bloggers also need to be part of a topically coherent community. Otherwise, the leading stories are simply generic top headlines that everyone tends to mention, regardless of their blogging focus. Different communities turn over their focal stories at different rates. Political bloggers are the most frenetic, milbloggers and scientists shift topics more slowly, and those with a finance or technology bent are somewhere in the middle. Make of that what you will.
What's next, other than continuing to tweak the community lists and output format? The project has met my initial goals. I let it run while I'm drinking the first cup of coffee and reading a couple of my usual blogs, and by the time I'm done it'll have found the top topics in the communities I want to monitor. The peripheral vision gained is a win already. Getting into HTML scraping and full-text analytics doesn't seem like it would offer much improvement for the effort. A little more automation seems in order, perhaps via whatever cron-like thing OS X will do, along with auto-posting the results.
What this approach does miss is the action in blog comments, web-based bulletin boards, and whatever real-time channels (e.g., Twitter) are also used by a community of interest. If I wanted to turn this from 'peripheral vision' into an 'early warning system', it would likely be necessary to widen the data sources accordingly to pick up interest closer to real time. But shorter messages tend to have few or no links, and following a dialog can require discourse analysis on top of full-text parsing. That's a long, deep rathole...
Uncharacteristically, I've opened comments on this post for questions, reactions and suggestions.