This weekend I’ve been hacking on one of the data ideas we’ve had: scraping the Tulsa Health Department’s Restaurant Inspection data. I’m evaluating a few options for an Open Data site/host, and I’m posting my evaluations here in the hope they’ll be useful for anyone else trying to do something similar. There’s a basic comparison chart below, followed by details on each option.
| | ScraperWiki | DataCouch | BuzzData | Socrata |
|---|---|---|---|---|
| Open-source | Yes | Yes | No | No |
| Hosting | Cloud or Self | Cloud or Self | Cloud | Cloud |
| Data Licensing | Any (free-form) | ? | Creative Commons | Creative Commons, Public Domain |
| Data Formats In | Anything with a URL | CSV, JSON | .csv, .tsv, .xls files | .csv, .tsv, .xls files |
| Data Formats Out | CSV, JSON, HTML, RSS | CSV, JSON | source file | CSV, TXT, JSON, XML, RDF, XLS, PDF |
| Project Maturity | Stable | Pre-alpha | Stable | “Enterprise” |
DataCouch
VERY unstable. A couple of Tulsa Web Devs have tried to set it up without any luck. Even the datacouch.com site itself goes up and down, and sometimes features don’t work; right now, for example, the Twitter sign-in is broken, so I can’t even tell what the data licensing is.
BuzzData
BuzzData seems more like a social site for sharing data files: there are no URLs for data sources, nor for the data you publish on the site. It features dataset history, additional attachments, links to articles and visualizations, collaborators, and followers for each dataset. It seems to fit academic and research collaboration better than application development.
Socrata
Socrata seems like the 800-pound gorilla of data platforms. It also deals in files rather than data over HTTP request/response, so it’s less useful as a data source for application developers. Socrata seems like the solution we could pitch to city agencies if we ever convince them to open and publish data themselves. They have a “Socrata for Government” white paper and everything.
ScraperWiki
ScraperWiki is my favorite. It’s an open-source Django web app, but it has lots of additional pieces, which makes the initial setup a little hard. (The ScraperWiki installation instructions have some gaps too.) My favorite features (with a quick scraper sketch after the list):
- It hosts both the scraper code AND the resulting data. (They gave us a scraper template that lets you host scraper code as a GitHub gist, or you could host your code anywhere that’s URL-accessible, I suppose.)
- Scraper code can be Python, Ruby, PHP, or JavaScript, with lots of scraping libraries for each (especially Python!).
- Source data can be anything that’s URL-accessible, and there are lots of output formats.
- It has features for both data developers AND data users (journalists, researchers, app developers), including “request data” (bonus: requests for non-public or non-open data are paid services) and a “get involved” dashboard.
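To make that concrete, here’s a minimal sketch of what a ScraperWiki scraper for the inspection data might look like. The URL and the HTML structure are assumptions on my part; the real Tulsa Health Department pages will need their own parsing logic. I’m using ScraperWiki’s Python library (`scraperwiki.scrape` plus the SQLite datastore) the way their tutorials show:

```python
# Minimal ScraperWiki scraper sketch. The source URL and the HTML
# structure below are hypothetical placeholders, not the real
# Tulsa Health Department markup.
import scraperwiki
import lxml.html

url = "http://www.tulsa-health.org/food-safety/inspections"  # assumed URL
html = scraperwiki.scrape(url)       # fetch the page
root = lxml.html.fromstring(html)

# Pretend each inspection is a table row like:
#   <tr><td>Name</td><td>Address</td><td>Date</td><td>Score</td></tr>
for tr in root.cssselect("table.inspections tr"):
    cells = [td.text_content().strip() for td in tr.cssselect("td")]
    if len(cells) < 4:
        continue  # skip the header row and anything malformed
    record = {
        "name": cells[0],
        "address": cells[1],
        "date": cells[2],
        "score": cells[3],
    }
    # Store in the ScraperWiki datastore, keyed on (name, date)
    scraperwiki.sqlite.save(unique_keys=["name", "date"], data=record)
```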
So, I set up my own ScraperWiki server, but I still have some things to iron out: I need to set up a mail server, and I need to figure out why the scraper editor doesn’t work correctly. I’m having a Skype call with some devs from ScraperWiki, so maybe they can help out. Or we might end up putting our data on scraperwiki.com if we can host our scrapers on GitHub. We’ll see …
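For what it’s worth, if the data does end up on scraperwiki.com, apps could pull it straight over HTTP. Here’s a rough sketch; the scraper name is made up, and the endpoint follows my reading of the ScraperWiki datastore API docs, so treat the details as assumptions:

```python
# Pull scraped data back out of scraperwiki.com as JSON. The scraper
# name below is hypothetical, and the endpoint follows my reading of
# the ScraperWiki datastore API docs.
import json
import urllib2

api = ("https://api.scraperwiki.com/api/1.0/datastore/sqlite"
       "?format=jsondict"
       "&name=tulsa_restaurant_inspections"  # hypothetical scraper name
       "&query=select+*+from+swdata")

rows = json.load(urllib2.urlopen(api))
for row in rows[:5]:
    print row["name"], row["score"]
```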