May 13, 2009

Back To Code: Blog Community Watcher

Last week I wrote about getting back into coding, in part to pursue some pet projects. So here's one thing that I've been tinkering with off and on since late January. (Anyone waiting for a big technology revelation is going to be disappointed - in some ways this is quite retro.)

While there are some deeper agendas involved (once again, to be discussed later), the proximate cause for this project was my simultaneous frequent use and continued frustration with Gabe Rivera's Techmeme and Memeorandum blog discussion trackers. The good news is that they are a mostly ad-free way to keep up on top stories in their respective domains, high tech and politics. The bad news is that the blogs and topics followed are Gabe's choice, not mine. What I was really wanting was 'automatic peripheral vision' to pick up breaking stories from blog communities that I find interesting, but can't afford the time to follow in detail.

According to legend, Gabe knocked together the core of his platform in a year, so I figured I should be able to build something useful in less than that if I was willing to cut back the functionality where it didn't support my goal. Making this my first serious project in Perl would take me right into the heart of its library functionality, dealing with HTTP requests, XML crunching, Web services calls and piles of regular expressions and other text grinding. A decent sized journeyman's project.

I made several functionality compromises to keep the effort bounded. The first was not to try scraping arbitrary blog HTML, but instead to use only blogs that publish a feed of some sort, ATOM or one of the flavors of RSS. And that feed would need to be either full or long enough to contain a decent number of outbound links. The second compromise was a direct consequence: Not to worry about being exhaustive; this was to be my own alerting service not a generic utility such as Technorati. I would hand-build the blog community lists, rather than try to gather them automatically - I considered this degree of hand control to be a feature. Finally, I wouldn't attempt any sort of full text analysis, other than collecting HREF texts - this was strictly about citation counting until and unless I proved something else was needed.

The first bootstrap step was obvious, since I needed a blog community list, and the easiest way to get it was via OPML export from iGoogle, which I use as a reader. Making that choice as a config file format also let me borrow OPMLs from around the blogosphere as blind tests. (Thanks to Ole Eichhorn and others who unknowingly provided test data.) Getting the OPML import working got me into XML parsing. I started out using the CPAN XPATH library, but that eventually turned out to be a mistake: The parser involved was not very robust, and while it worked fine for OPML, it fell apart when exposed to malformed feeds (which are out there, and in the oddest places - slashdot!?). I eventually had to back it out and go with the full LibXML interface, with attendant gnashing of teeth since their interfaces are similar, but not identical.
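For readers who want to see the shape of that bootstrap step, here is a minimal sketch of pulling feed URLs out of an OPML export. It is written in Python with the standard library rather than the Perl and LibXML I actually used, and the function name `feed_urls_from_opml` and the sample data are made up for illustration. It assumes the usual OPML convention of `outline` elements carrying an `xmlUrl` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; a real reader export has the same basic shape.
SAMPLE_OPML = """<opml version="1.0">
  <head><title>Sample community list</title></head>
  <body>
    <outline text="Politics">
      <outline text="Blog A" type="rss" xmlUrl="http://example.com/a/feed"/>
      <outline text="Blog B" type="rss" xmlUrl="http://example.org/b/atom.xml"/>
    </outline>
    <outline text="Folder with no feed"/>
  </body>
</opml>"""


def feed_urls_from_opml(opml_text):
    """Return the xmlUrl of every outline element that has one, in document order."""
    root = ET.fromstring(opml_text)
    # iter() walks nested folders too, so grouping outlines are handled for free.
    return [o.attrib["xmlUrl"] for o in root.iter("outline") if o.attrib.get("xmlUrl")]


if __name__ == "__main__":
    for url in feed_urls_from_opml(SAMPLE_OPML):
        print(url)
```

Like the first parser I tried, a strict parser such as this one is fine for OPML but will raise an error on malformed input, so anything pointed at real-world feeds needs a more forgiving fallback.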

Getting to the feed parsing milestone pushed me through fun things like XML name spaces and the specs for ATOM and the various RSS flavors. It also produced some interesting revelations about their rather variable semantics in actual use, around things like GUIDs, alternate links and even publication dates. The good news was that I found plenty of feeds meeting my requirements. Those bloggers with a feedburner source have obviously thought about the matter. There are also a large number of blogs that are publishing feeds without even having an RSS/ATOM bug on their home page - evidently as a result of default or trivial settings on their blogging platform. Those who think RSS is dead or dying haven't been trying to actually use it.
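To make the name space point concrete, here is a rough Python sketch (again, not my Perl code) that pulls each entry's link and date out of either an Atom feed or an RSS 2.0 feed. The function name `parse_entries` is hypothetical. It assumes well-formed XML, treats an Atom `link` with no `rel` attribute as the alternate link, and does not handle RSS 1.0, whose items live in their own name space.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import parsedate_to_datetime

# Atom elements are namespaced; RSS 2.0 core elements are not.
ATOM = "{http://www.w3.org/2005/Atom}"


def parse_entries(feed_text):
    """Return a list of (link, published datetime or None) for each entry."""
    root = ET.fromstring(feed_text)
    entries = []
    if root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            link = None
            for el in entry.findall(ATOM + "link"):
                # A missing rel attribute means rel="alternate" in Atom.
                if el.get("rel", "alternate") == "alternate":
                    link = el.get("href")
                    break
            stamp = entry.findtext(ATOM + "published") or entry.findtext(ATOM + "updated")
            # Atom dates are RFC 3339; map a trailing Z to an explicit offset.
            when = datetime.fromisoformat(stamp.strip().replace("Z", "+00:00")) if stamp else None
            entries.append((link, when))
    else:
        for item in root.iter("item"):
            link = item.findtext("link")
            stamp = item.findtext("pubDate")
            # RSS 2.0 dates follow the e-mail (RFC 822) date format.
            when = parsedate_to_datetime(stamp.strip()) if stamp else None
            entries.append((link, when))
    return entries
```

Even this toy version has to pick between `published` and `updated` and tolerate missing dates, which is a small taste of the variable semantics mentioned above.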

This got me to the point of actual output for the first time, which was simply a matter of collecting outbound links from the feeds, using a hash to build an inverted table, and sorting the referenced pages by link count. Followed by an interval of writing code to stomp out ad links and the more egregious self-linking behaviors. An elapsed-time based weighting function also turned out to be necessary to keep popular stories from staying on top of the pile for a week.
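The counting step can be sketched in a few lines. This is an illustrative Python version, not my Perl code: the function name `rank_links`, the regular expression, and especially the exponential half-life decay are assumptions chosen for the example, standing in for whatever weighting function a real implementation settles on. Each blog counts at most once per target, and links back to the blog's own host are dropped.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Crude HREF extractor; good enough for a sketch, not for arbitrary HTML.
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)


def rank_links(posts, half_life_hours=24.0):
    """Rank link targets by time-weighted citation count.

    posts: iterable of (blog_url, age_hours, html_body) tuples.
    Returns a list of (target_url, score), highest score first.
    """
    # Inverted table: target URL -> {citing blog host: best weight seen}.
    citing = defaultdict(dict)
    for blog_url, age_hours, body in posts:
        blog_host = urlparse(blog_url).netloc
        # Assumed decay: a citation loses half its weight every half_life_hours.
        weight = 0.5 ** (age_hours / half_life_hours)
        for target in set(HREF_RE.findall(body)):
            if urlparse(target).netloc == blog_host:
                continue  # ignore self-links
            previous = citing[target].get(blog_host, 0.0)
            citing[target][blog_host] = max(previous, weight)
    scores = {target: sum(by_blog.values()) for target, by_blog in citing.items()}
    # Sort by descending score, breaking ties by URL for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Shortening `half_life_hours` favors newly hot stories, while lengthening it favors stories that hold interest over several days.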

Somewhere along the line some CSS was thrown in to make the output look a little less like the 90s. I also added a Technorati cosmos call for the top stories, in part as a means of finding further blogs to be evaluated for inclusion in a particular community list. It's a nice feature when the Technorati API service is running and stable, which is not always the case. (It's evidently not part of their core business model, whatever that may be.) I'm also generating multiple sections in my output lists, using different link weighting schemes to differentiate newly hot topics from those that have held interest over a longer time.

This was intended to be a Perl journeyman's project, and I have suffered accordingly. There was one complete pass through the code to rewrite in more Perlish idiom, which I consider penance for my C accent. And two rounds of refactoring, almost inevitable in an 'organically grown' project, and resulting in a style (at least for now) of rather coarse-grained cover classes and abstractions for feed components and analysis steps.

So how'd it come out? Here's a sample output so you can see for yourself. This was generated from my 'right & libertarian politics' community list. This list has been a useful test target since the posts and referenced stories turn over frequently, and it's an interesting counterpoint to memeorandum's spin.

I've found it takes 30 or more sampled blogs with varying posting frequencies to regularly produce interesting results. The bloggers also need to be part of a topically coherent community. Otherwise, the leading stories are simply generic top headlines that everyone tends to mention, regardless of their blogging focus. Different communities turn over their focal stories at different rates. Political bloggers are the most frenetic, milbloggers and scientists shift topics more slowly, and those with a finance or technology bent are somewhere in the middle. Make of that what you will.

What's next, other than continuing to tweak the community lists and output format? The project has met my initial goals. I let it run while I'm drinking the first cup of coffee and reading a couple of usual blogs, and by the time I'm done it'll have found the top topics in the communities I want to monitor. The peripheral vision gained is a win already. Getting into HTML scraping and full text analytics doesn't seem like it will offer much improvement for the effort. A little more automation seems in order, perhaps whatever cron-like thing OS X will do, along with auto-posting the results.

What this approach does miss is the action in blog comments, web-based bulletin boards, or in whatever real time channels (e.g., Twitter) are also used by a community of interest. If I wanted to turn this from 'peripheral vision' into an 'early warning system', it would likely be necessary to widen out the data sources accordingly to pick up interest closer to real time. Shorter messages tend to have few or no links, and following dialog can require discourse analysis on top of full text parsing. That's a long, deep rathole...

Uncharacteristically, I've opened comments on this post for questions, reactions and suggestions.

May 06, 2009

Back To Code

So if I'm only doing the VC thing part time, and don't currently have my fingers in a startup pie, why the relatively slow rate of posting around here?

I've gone back to writing code, and it's been sucking up a fair amount of time and bandwidth.

Why? One level of answer is that I missed being able to build my own tools. Since HyperCard finally died I've had no environment for knocking out quick hacks, unless you count Excel, and I don't. I also found my engineering feasibility and schedule BS detector becoming fuzzier and less useful given the accumulating years since being hands-on.

But Rich Miller didn't buy that motivation when I laid it on him, and you shouldn't either. The second level answer is that I'm playing around with a set of notions about social software, and have some specific ideas that I think worth implementing at prototype level. Exactly what is a matter for later posting. Being acutely aware of the overhead involved in building a team and raising funding, and wanting to remain agnostic about where these ideas were on the spectrum from hobby to serious business, I decided to just have a go at it myself.

I've been working in Perl, for a variety of reasons:

My last serious code (defined as something used by another human) was in a combination of C and various scripting languages. Perl has a familiar feel to someone used to C syntax, immediate gratification, and loosey-goosey typing. Object orientation in Perl is only skin deep, so I can decide how deeply I want to buy into it for a specific project.

CPAN has an awesome collection of libraries with mature functionality aimed right at my tasks of interest. Even the not-so-mature bits are useful cribs for a newbie. While the excitement may have moved on to other languages, the archives of accumulated Perl wisdom at places like Perl Monks save a lot of frustration when hitting a real puzzler.

I deliberately punted the choice between the leading languages du jour, Ruby and Python. A combination of avoiding premature commitment, and deliberately creating the potential (necessity?) of shifting environments when going from batch-like personal prototypes to possible future web services.

Finally, I've got 40 years worth of computer language history and syntax buried in my hind brain, all the way back to assembler and Fortran II. Reading the camel book alone can't suppress all those years of bogus instincts. I need live code and a motivation to screw around with it to make a new language natural. Fortunately, one of my other avocations provided both: geocaching, of all things. One type of geocache is the 'puzzle' cache, where one has to solve a problem to find the actual coordinates. Given the Silicon Valley environment, a lot of the puzzles around here are rather nerdy - math problems, word puzzles, and cryptograms, many of them most efficiently solved with code. And the lingua franca of the code snips passed among would-be puzzle solvers is Perl. That gave me lots of little challenges and quick rewards while getting started, culminating in writing a hill-climbing solver for substitution ciphers.

Those toy programs barely touched the Perl libraries, and went just fine with a plaintext editor and the command line. That's not enough power for a real project, and misses the point of modern development environments anyway. So I've ended up working in an Eclipse environment, using the EPIC plug-in, all on my trusty MacBook Pro. I'm using git for source code control (it's good enough for Perl, it's good enough for me), but mitigating its rather arcane command line interface with the GitX GUI for commits and other common tasks. The developers' suite that comes with OS X provided a nice stable Perl 5.8.8 configuration, which I'm sticking with for now. All in all, probably overkill for a one man project, but you never know.

None of this is meant to provoke a religious discussion about languages and environments (YMMV, and I've seen so many flame fests that they are just boring) but as a bit of background for upcoming posts about the actual topics of interest.

June 20, 2008

Is iGoogle Hosed?

I use iGoogle to collect a number of RSS feeds originated from blogs, in addition to a few of Google's MSM-originated feeds. The updating rate on the blog feeds has gotten slower and slower over the last few weeks. Right now, none of my non-MSM feeds have updated on the iGoogle pages since Thursday, two days ago. When I click through to the blogs, most of them have at least one post since then. I doubt all of their feeds happened to break at once.

What's going on, Google? I'll have to move my feeds elsewhere if you've gone unstable.

February 12, 2008

The Roving Eye: 2/12/08

LT G's 'Kaboom' blog is the best milblog writing to come out of Iraq since Neil "Red Six" Prakash rotated out.

Reverse engineering Mother Nature. Very cool, working backwards from current organisms to reconstruct a protein sequence from a common ancestor, and then determining some of the adaptations and environment of that early organism. It's hard to believe there are still evolution deniers out there, just as we're taking big strides in exposing and analyzing its action over time. Via Ole Eichhorn.

Japan's Aerospace Exploration Agency is building and testing prototype components for orbital powersat systems. Glad to see someone serious about this, we need to be trying out all the alt.energy approaches that look feasible, for both strategic and environmental reasons.

I got a good prognosis from my first post-op X-Ray yesterday. The price is six more weeks cooped up on a walker. Well worth it for a full recovery.

February 09, 2008

The Roving Eye: 2/9/08

Ole Eichhorn is blogging again, and has been for about a month. Welcome back!

My favorite new read of the holiday season was Richard Preston's "The Wild Trees." (Fluffer to Kevin Kelly for the recommend.) It tells the story of men and women compelled to find the world's tallest trees - and climb them. Many of these monsters are in North Coast redwood country that I know fairly well. Not that I've got ambitions towards the climbing, but I'd love to walk those groves. [Shack happy already? — ed. Yes.] The author has a photo gallery section on his site that's well worth a look. Sort of puts my own old growth hunting activities in the shade, so to speak.

If you're anticipating major surgery, voluntary or otherwise, here's a word for you: Acidophilus. Seems the antibiotic cocktails used to prevent post-operative infections are also powerful enough to blow away your internal set of intestinal flora. That's the collection of bacteria that have become our symbiotic digesters in return for a usually congenial host environment. Once they're gone, the old GI tract - um - doesn't work so well. So the Acidophilus capsules with replacement flora, usually ignored by all but diet supplement fiends, are suddenly important in rebooting the system. You can find them at the nearest health food store - that would be Trader Joe's in our case. HT: our doc, who we love for his minimalist outlook on prescriptions.

Best YHOO post. Cruel, but fair, just like the Piranha brothers.

November 12, 2007

BlogWorld: A Mouse That Roars

Marc 'Armed Liberal' Danziger and Glenn Reynolds profess themselves pleased by the inaugural BlogWorld Expo, last week in Vegas. I had a more mixed reaction, perhaps based on my jaded appetites: I've been going to shows since the old COMDEX in its heyday and the West Coast Computer Faires. And I've been to a fair number of 'first time' shows, due to a career split between developing and investing in new technologies - CD-ROMs, Atari STs, HyperCard, 'hypertext', WiFi wireless data, and on it goes. Since a first time show by definition addresses a market that is only partially defined, and due to its low cost attracts a fair number of out-and-out hucksters, it's often hard to extract a consistent theme, or to forecast the survival of the show and market.

BlogWorld certainly fits the pattern. In no particular order, there were exhibitors representing:


  • Aggregators: directories, topical specialists, horizontal blog portals, branded content networks, audio and video

  • Tool vendors, from basic text blogs to sound and video and DIY blog-to-book

  • Far too many blog advertising networks and gimmicks

  • Far too many 'feature level' technologies

  • Platform providers: Yahoo, Windows Live, AOL

  • Old media about new media: books, magazines, movies

  • Consultancies

  • Corporate and other PR presences


Many of these have the hallmarks of first time exhibitors: minimal signage, no clear message, no attempt to qualify the people that came past the booth. You can easily guess that half of them won't be back next time.

But that cacophony is normal at this stage. What matters is whether the underlying market is viable, and whether the show turns out to be a nexus bringing together potential partners and customers and vendors. That's a more relevant measure than the fact the show took up no more than one third of one of Vegas' exhibit halls, and the attendees at the sessions rattled around in the large meeting rooms. I had only a limited sample, but at least some attendees thought there was value in the connections made on the floor, in the hallways, and at the parties.

And yet, there's a difference between this show and many of the other first-timers. If you compare the still-minimalist 'large' BlogWorld presences to what you'd find at even a specialty technology show (e.g., RSA), you'd be ignoring the reality: This is an upstart sector that has discomfited the traditional media - stolen audience, discredited stories and brands, and credibly threatens to build completely new distribution networks. It has meaningfully affected everything from the value of major corporate brands to national military strategies and political campaigns. How many wannabe shows draw official representation from both the White House and Dept. of Defense communications staffs?

So, lesson one: This is a crowd that punches way above its weight. You can't judge it by the metrics of an established or even emergent technology market.

Perhaps a better comparison would be a Cable Show. There you get juxtapositions like the Playboy Channel (displaying its wares, shall we say), located right across from hard core tech vendors of video servers and optical fiber. BlogWorld was a small, low rent version of that melange of everything from content to infrastructure. You had your cheese cake, your celebrities, your live shows from the floor, as well as the techie bits. (Credit for first two images: Glenn Reynolds)

So here's my second lesson: I have seen the future of media, and it is low rent. Those big booths at the Cable Show and other mass media conclaves are made possible by high margins, which are in turn enabled by a stranglehold over distribution. What I saw in Vegas is competing effectively for part of that audience, distribution and hence margins. BlogWorld will never really look like a Cable Show, and the latter will never go away. But everything about the old media, from brands to margins and audience is at threat of being corroded by the participants in this little show.

Update: A Hollywood insider reaches similar conclusions, coming from a completely different direction. Via Mickey Kaus.

November 08, 2007

Off to Blog World

I'll be off at Blog World Expo in Vegas for the next two days. Reports as events warrant. Sometimes you just need to meet people f2f.

May 23, 2007

What's Going On Here?

So what's the deal, you might ask? Nothing but sporadic posts here for months on end, and then suddenly back to something like a regular writing schedule. Here's at least part of the story:

Over a year and a half ago I wrote that I had taken on a part time CEO role at a startup. That venture was in the area of database security, particularly centered on privacy and reidentification control of shared, but matchable databases. We were in stealth mode, and in spite of bold and hopeful promises to the contrary, working on it largely absorbed the creative energy that formerly produced bloggable content, while my actual output was being directed to an internal wiki. (Thanks, Ross!)

We put together some very modest angel funding to allow creation of a working prototype and recruited a capable security-oriented coder to do so. As that approached some definition of ready, I recruited a more business development oriented CEO to take my place, and shop the idea to both early adopters in the financial institutions sector, as well as potential OEM partners in the security and analytics domains. I continued as CTO during that process.

Long story short: The returns are in, and what we have produced comes out as a feature, in multiple markets, rather than a free standing company. In spite of having written on the topic, sometimes you have to walk the road to find out the answer. Fortunately, we minimized the capital consumed as we did so, but at the end of the day the ability to extract revenue as a stand-alone doesn't measure up to the costs of sales given both the technology and the market. So rather than taking any more investors' money, we have stood down the company and are in the process of looking for a buyer for the technology.

So for the first time in a long time, I've got some free time and head space for blogging and other activities. There's a fair amount of 'scrap research' from the recent venture that will likely find its way here, suitably laundered as always. Pacifica Fund itself is nearing seven years old, and a number of our ventures are going through the exit process. Once these deals have closed, I may be able to write more about the trials and tribulations of the liquidity process without compromising my fiduciary role. Finally, there's a fair number of topics of interest that were laid aside over the last months due to lack of bandwidth, and some of those are already getting back into my reading and writing mix.

Often the best way to find out where you need to go next is to find out what's interesting enough to inspire the energy for researching and writing. So, no long term promises, but I will likely be more active here for at least the immediate future.

January 10, 2006

VC Blogs as People Journalism?

Well, maybe if Jeff Jarvis has his way. After dissing most VC blogs as 'abstracted' and 'cliched', he suggests:


VC gossip, catty VC valuation badmouthing, anonymous confessions of the top 10 ways VCs blow off venture beggars, sex tips of the nearly wealthy...

Now I'm not actually sure if he's serious or tongue in cheek, but since Jeff won his spurs starting up Entertainment Weekly, let's take him at his word for the duration of a post, and forecast the chances of VC bloggers turning into People style dish artists.

Ain't. Gonna. Happen. 100%. Here are a few of the reasons why the Valley scene will never be like Hollywood journalism.

Hollywood thrives on publicity. To some extent 'brand awareness' is the only asset of an entertainment celebrity, from actor to news anchor to director. Just make sure the name is spelled right, even if it is the Enquirer or ET. General press notice just gets a VC a batch of dumb proposals and dumb questions.

The VC thrives on word of mouth. What we crave is a good reputation in the trust network that connects up good ideas, smart people, and money. And dishing your deals would do what to that?

The value VCs add is highly specific, taken in both the vernacular and transaction cost economics senses of the word. In most cases, the deal analysis and ongoing business direction will require a good deal of analysis of (variously) the technology involved, the market entry tactics, competitor's capabilities, strengths and holes in the team, fund raising environment, and so on. Particularly early stage deals tend to be 'one-offs' (and if they aren't, you've got to wonder about competitive barriers).

So given that we aren't going to dish on specific deals, and that each of them is likely an oddball in some way, it's hard to come up with pithy and often applicable rules that aren't so high level as to be banal. The comment by Jeff's contrarian friend that "I’d rather see VCs talk about situations where they *didn’t* follow the business cliches" is therefore astute.

So to answer one of Jeff's queries, no, most VCs don't talk in slogans amongst themselves (and those who do are scary). You might call it carefully calibrated and highly motivated gossip, if you want, but it's part of how VCs do a job of putting together ventures in the face of great uncertainty. And there's no motivation at all for that gossip function to be publicly visible.

January 09, 2006

Citizens' Media: Emerging Patterns

"The Internet detects censorship as damage and routes around it." This old net warrior's proverb is a good one to keep in mind when looking for patterns as citizens' media emerge from truly embryonic stage and begin to develop specialization. Some straws in the wind:


  • Blogger Bill Roggio gets an invite from military acquaintances to cover operations in Iraq, passes the electronic hat to raise funding, goes there and does that. On his return he is attacked by an MSM story insinuating that he is variously part of a military information war operation, or a partisan tool.

  • Michael Yon returns from his own citizens' media reporting stint in Mosul and other parts of Iraq. He calls for retired military volunteers to help him review stories from the troops and others in theatre. These stories will feed into a 'Frontline Forum'. Interestingly, the e-mail address for response is 'michaelyonmagazine'.

  • The U.S. Army's PR agency begins contacting bloggers to provide them direct access to DOD editorial content.

  • Pajamas Media continues to morph, seemingly into a hybrid of blog content aggregator and ad placement network. Business model and fortunes unknown at this point, but its skyscraper sidebar adverts are becoming familiar around the blogosphere and feature some big names.

  • Robert Cringely, grizzled veteran of media from hot type to blog, analyzes the impact of pay-per-click advertising on traditional media, and finds it deadly, implying not only a burden of accountability not possible in the MSM, but a reduction in the total cash flow through the media industry.


Along with that Internet mantra, it's worth pondering some startup wisdom. Effective market entry occurs where the differentiating value of an innovation is the greatest, not necessarily where it will deliver the greatest value in the long run.

There is a non-accidental pattern in the cited events. Information sources and users are beginning to connect to each other around the incumbent media, motivated by the media's inability or unwillingness to carry their messages. A modicum of advertising revenue is beginning to flow in that direction, though not enough to move large numbers of participants to full-time professional status. The new media are evolving away from a sharp line between audience and commentariat. With the old media's status as gatekeeper discredited to the initial audience, this is not seen as a disadvantage.

It's not going far out on the limb to suggest that these events of the past month will one day be viewed as harbingers of the new media roles and institutions that eventually evolve. The place to watch in the next 12-24 months is the role and effectiveness of advertising in blogs and other citizens' media organs. Is brand exposure still relevant, or is Cringely right in suggesting that PPC rules all? What motivates viewers to notice or act in an often highly charged context? What types of advertisers do best in narrowcasting venues? Which in aggregators? Is there one uber-algorithm for placement, as with AdSense, or will effectiveness require specialization? Answers to questions like these will be determining how high is the sky for the new media, and how low the fall for the old.