A couple of weeks ago, I mentioned a survey experiment that I was attempting using a new Google Docs interface, and promised a report on the experience. Here it is, along with some thoughts on implications for Google.
The Docs spreadsheet itself is pretty easy to figure out. (I did not read the documentation - an experienced user shouldn't need it for an entry-level program.) It takes a bit to get used to the somewhat modal interface required when a browser substitutes for 'normal' menus in Excel or the like. However, it's fairly easy to find the common operations just by knocking around a bit. It would be nice to have a quick 'what do you want to do' internal search option - without dancing paper clips. I give the initial experience an A-.
Building a survey interface to the spreadsheet was reasonably straightforward, once I decided to do it Google's way. It's claimed you can generate the survey from a spreadsheet, and that's true to some extent, but there are all sorts of undocumented side effects in mapping your predefined structure into the survey. Don't Do That. Build the survey and let it define the spreadsheet structure, and you'll be better off. The drawback is that questions and responses that you have written with an eye on human factors or survey design end up as part of your spreadsheet structure. That can cause some fun later - see below.
Once generated, the survey can be e-mailed to potential respondents embedded in the message and/or as a URL. Or one can take the obvious but undocumented step that I did and publish that URL as part of a blog post, creating a public survey. From the end user point of view, the survey form is very Plain Jane, but usable (I had no complaints or queries about it). The Google survey lacks the feedback often found in other quiz or survey widgets - some customizable feedback and/or results summary would be nice. On the other hand, this facility is already more powerful than all the freebie survey sites, so I give this a grade of B, with room for growth.
The Docs experience went downhill when I started trying to analyze what I'd gathered. First I wanted some simple frequency counts and crosstabs. A commenter told me there should be a plot option somewhere, but I never found it. Instead I fell back on an old trick and used sorts to let the spreadsheet do the counting. And things suddenly slowed to an unacceptable crawl. At the time, I had only 150 samples to examine, but the refresh from a single-column sort always took a few seconds and was frequently even worse. It's not clear whether that was due to server scheduling issues, or having to shove large amounts of refresh XML down the pipe, but it chugged. To be sure, this is a classic Spreadsheet Abuse scenario - I should really have been using a light-weight data manager - but on 150 rows? Come on guys. Scaling up gets a C-, needs improvement.
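For the curious, the counting I was after is trivial in any scripting language. Here's a quick sketch of the same frequency counts and a crosstab in Python (the column names and responses are invented for illustration, not from my actual survey):

```python
from collections import Counter

# Hypothetical survey responses pulled from two spreadsheet columns.
role = ["dev", "dev", "mgr", "dev", "mgr", "other"]
uses_docs = ["yes", "no", "yes", "yes", "no", "no"]

# Frequency counts for a single column.
freq = Counter(role)
print(freq)  # Counter({'dev': 3, 'mgr': 2, 'other': 1})

# A simple crosstab: count co-occurring (role, answer) pairs.
crosstab = Counter(zip(role, uses_docs))
for (r, a), n in sorted(crosstab.items()):
    print(f"{r:6} {a:4} {n}")
```

Two lines of real work - which is roughly why waiting seconds per sort on 150 rows felt so absurd.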
A Shopping List For Google
(NB: I have no stake in the companies mentioned below.)
The Google survey feature is useful, but also an obvious attempt to encourage the flow of more data into the Docs platform. Given that the survey tool dumps into the spreadsheet, it is also the first step on the road to Excel Hell, well known to those who push spreadsheets beyond their limits in the directions of analytics and data management. (Yes, been there and done that.) It's no favor to start new users down the path.
Getting more data into the Docs platform with little overhead is the right thing. Using a spreadsheet to manage it is the wrong thing. Google needs a lightweight data manager offering.
Then Google would need some analytics functions, preferably something that will scale into lower-end BI applications as a paid service. There are a lot of BI and analytics companies out there - the category has been consolidating, and is probably over-invested. If they want something that is best fit to the scalable services model, they could consider OCO, which seems to have a compatible model of using light-weight, on-demand data extractors to pull data into their platform. Strip out some canned analytics functions to add to the DabbleDB free tier, and continue the full-bodied version as an enterprise offering. And get some of the Google Analytics designers to do their magic on building some easy-to-understand chart outputs.
I look forward to seeing the results, and if I do, someone owes me a case of good wine ;)
Meanwhile, Back At The Ranch
I gave up on Docs, exported the data back to my laptop, and sucked it into - Excel, of course. Where I could do my charts, sorts and crosstabs at MacBook Pro native speeds. I also used Excel to recode those verbose survey questions and responses produced by Google into abbreviations more suitable for charting and data management, and turn Google's missing values (blanks) into something more useful.
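That recoding step is mechanical enough to script. A minimal sketch of the idea in Python - the verbose labels and short codes here are made up, since the real ones came out of Google's survey generator:

```python
# Hypothetical verbose survey labels mapped to short codes suitable
# for charting, plus blanks normalized to an explicit missing marker.
RECODE = {
    "Yes, I use it daily": "daily",
    "Occasionally, for shared documents": "occasional",
    "No, I have never tried it": "never",
}
MISSING = "NA"

def recode(value: str) -> str:
    value = value.strip()
    if not value:  # Google leaves skipped questions as blanks
        return MISSING
    return RECODE.get(value, value)  # pass unknown values through

raw = ["Yes, I use it daily", "", "No, I have never tried it"]
clean = [recode(v) for v in raw]
print(clean)  # ['daily', 'NA', 'never']
```

In Excel the same thing is a lookup table and a fill-down, but the principle is identical: one authoritative mapping, applied uniformly.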
Rather than trying to beat chi-squares and other tests out of Excel, I took my summary tables over to Vassar College's statistics website which has free and well-explained (though limited) implementations of common statistical routines. (A good place to know if you don't want to mash data from a small project into a form acceptable to Excel's limited internal stats functions, or spring for an extension package.)
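The chi-square statistic itself needs nothing beyond the observed table and expected counts from the marginals, so it's easy to compute by hand if a website or stats package isn't handy. A sketch (the 2x2 table is invented for illustration; you'd still need a chi-square table or function to turn the statistic into a p-value):

```python
# Chi-square test of independence on a contingency table,
# computed by hand from the row and column marginals.

def chi_square(table):
    """Return the chi-square statistic for a list-of-rows contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

table = [[30, 10],
         [20, 40]]
print(round(chi_square(table), 3))  # → 16.667, on (2-1)*(2-1) = 1 df
```

For anything fancier (exact tests, corrections), a real stats package earns its keep - which is exactly the gap the Vassar site fills for small projects.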
Moving On To Overkill
Those tests turned up some interesting patterns in the data, and (since this was a recreational project anyway) I decided it was time to try out correspondence analysis, a type of factor analysis for categorical data. This was also an excuse to play with R, the freeware implementation of the classic Bell Labs S statistical system. Fortunately, a 2005 paper from the gurus of correspondence analysis provided implementing code for most of the current forms of CA, with a parallel explanatory narrative and some sample data.
Working through sample code while tweaking details, or while using different data, is a classic way to learn an environment or language, so off I went (ignoring the packaged version that I could have downloaded), reading the relevant doco bits, typing in the code, and browsing the results as I went. This worked quite nicely (OK, I pasted in a couple of the bigger functional blocks), and along the way I found and fixed one bug I'd introduced with a typo and another minor one in the original code.
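For readers who'd rather not retype the R, the core of simple correspondence analysis is just an SVD of the standardized residuals of the contingency table. Here's a bare-bones sketch in Python with numpy - the table is invented, and this covers only the basic form, not the many variants the paper works through:

```python
import numpy as np

# Simple correspondence analysis: SVD of the standardized residuals.
# The contingency table below is made up for illustration.
N = np.array([[30.0, 10.0,  5.0],
              [20.0, 40.0, 10.0],
              [10.0, 20.0, 30.0]])

P = N / N.sum()                # correspondence matrix (relative frequencies)
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses

# Standardized residuals: (P - r c^T) / sqrt(r c^T)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]  # principal row coordinates
inertia = sv ** 2                            # principal inertias per axis

print("total inertia:", inertia.sum())
print("first-axis row coordinates:", row_coords[:, 0])
```

The total inertia is the table's chi-square statistic divided by the grand total, which is the bridge between the tests above and the CA map - plotting the first two columns of the row and column coordinates gives the familiar biplot.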
It took the equivalent of an afternoon to get some interesting results and a superficial knowledge of R. As a contrast, when I did something similar (latent semantic analysis) on a not-much-larger dataset in the late 80s, it took borrowed time on a Cray, a commercial linear algebra package, a bunch of glue code, and a man-week or so of screwing around to get results. Knocking this off at my library table in an afternoon, using my personal laptop and some free software, produced one of those vertigo moments that Old Farts sometimes get in the land of Moore's Law.
The basic R package is quite powerful, and there are numerous extension packages available. The original S language was rebuilt in the late 80s, moving from a FORTRANish flavor to an object oriented infrastructure. Common OO facilities and browsers lurk not far below the surface, though the introductory doco doesn't go too far into object concepts. The language itself is sort of a goulash, if you've seen a lot of them before. Some C surface syntax, LISPish data structure and functional notions, and iterator and subscripting concepts that gave me an APL flashback. Best tackled with some experience in data structures and linear algebra, but easy to understand given that background.
What's this got to do with Google? Maybe nothing, but the company has been known to grind up large amounts of data from time to time, and - hey! - look what they're sponsoring! This is likely just an outcome of Google's own use, but it sure would be fun to wake up and find such a language plugged into the back of the Docs platform one day, wouldn't it?