This weekend I’ve been playing with Python a bit, using the sqlite3 module, remembering how to use SAX parsers, and stubbing my toes on Unicode and codec issues. Not issues as in bugs, but issues with my understanding of how it all works. Python is pleasantly strict about encodings by default.
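For anyone else who needs the SAX reminder, here is a minimal sketch of the kind of handler I mean. The `PageHandler` class and the tiny `SAMPLE` document are made up for illustration; the only thing assumed about the real dump is that each page carries a `title` and a `text` element.

```python
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collect (title, text) pairs from a MediaWiki-style XML stream."""

    def __init__(self):
        super().__init__()
        self.pages = []      # finished (title, text) tuples
        self._field = None   # element currently being captured, if any
        self._buf = []       # character chunks for that element
        self._title = None   # most recent title seen

    def startElement(self, name, attrs):
        if name in ("title", "text"):
            self._field = name
            self._buf = []

    def characters(self, content):
        # SAX may deliver one element's text in several chunks.
        if self._field is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._field:
            value = "".join(self._buf)
            if name == "title":
                self._title = value
            else:
                self.pages.append((self._title, value))
            self._field = None


# Hypothetical sample; the bytes are UTF-8, which is the XML default encoding.
SAMPLE = (
    b"<mediawiki><page><title>Caf\xc3\xa9</title>"
    b"<revision><text>Hello</text></revision></page></mediawiki>"
)

handler = PageHandler()
xml.sax.parseString(SAMPLE, handler)
```

The parser hands back already-decoded strings, so the encoding question only comes up again when the text is written out somewhere.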
Nose is a nice alternative to simply using the included unittest framework. Do see the PyCon slides.
In preparation for a project where I’d like to deliver compressed Wikipedia article text to a Flash Lite 3 application, I’m writing a tool to import the “pages-articles.xml” dumps into a database. I’m playing with a couple of different approaches to this. One is to store the gzipped article text directly in BLOBs in a database table. However, if there’s any doubt about the database’s ability to handle that, an approach I know will work is to store offsets into a single file that is a concatenation of individually gzipped articles. That way, you can seek straight to any record in the file and decompress only that record. (This technique has been used where I work for years for storing the output of web crawlers without creating millions of small files in a filesystem; see the ISO draft WARC file format, Annex A.)
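The offsets-into-one-file idea can be sketched roughly like this. The function names are hypothetical, and an in-memory `io.BytesIO` stands in for the real file so the example is self-contained; a file opened in binary read/write mode should behave the same way.

```python
import gzip
import io


def append_article(store, text):
    """Gzip one article, append it to the store, return (offset, length)."""
    blob = gzip.compress(text.encode("utf-8"))
    offset = store.seek(0, io.SEEK_END)  # always write at the end
    store.write(blob)
    return offset, len(blob)


def read_article(store, offset, length):
    """Seek to one record and decompress only that record."""
    store.seek(offset)
    return gzip.decompress(store.read(length)).decode("utf-8")


# Stand-in for the big concatenated file.
store = io.BytesIO()

# The index (title -> (offset, length)) is what would live in the database.
index = {}
index["Python"] = append_article(store, "Python is a programming language.")
index["SQLite"] = append_article(store, "SQLite is an embedded database.")

second = read_article(store, *index["SQLite"])
first = read_article(store, *index["Python"])
```

Storing the compressed length next to the offset is a choice I made for this sketch; it lets the reader pull exactly one record without having to find where the next one starts.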
The more I use sqlite3, the more I like it. It seems to handle a 2-3 GB database okay, though I’d really like to benchmark both of the above approaches.
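For comparison, the BLOB approach might look something like the sketch below. The `articles` table, its columns, and the helper names are assumptions made up for the example, and it uses an in-memory database rather than a multi-gigabyte file, so it says nothing about performance.

```python
import gzip
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT PRIMARY KEY, body BLOB)")


def put_article(conn, title, text):
    """Store the gzipped UTF-8 text of one article as a BLOB."""
    blob = gzip.compress(text.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO articles (title, body) VALUES (?, ?)",
        (title, blob),
    )


def get_article(conn, title):
    """Return the decompressed article text, or None if the title is unknown."""
    row = conn.execute(
        "SELECT body FROM articles WHERE title = ?", (title,)
    ).fetchone()
    if row is None:
        return None
    return gzip.decompress(row[0]).decode("utf-8")


put_article(conn, "Flow", "Flow is a book about optimal experience.")
conn.commit()

found = get_article(conn, "Flow")
missing = get_article(conn, "No such article")
```

Either way the client receives the same gzipped bytes; the only difference is whether the database holds them directly or just points at them.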
Also, I’ve slowly been reading through Mihály Csíkszentmihályi’s book, Flow: The Psychology of Optimal Experience. It’s a very satisfying book, and I’ll have to write more about it when I’m done reading it. It gives form to a lot of loosely connected ideas that I’ve been exposed to in the past.