Pylesystem: A Realtime Metadata Engine in Python

By Noah Gift
June 26, 2008 | Comments: 2

Last weekend I started an open source project, pylesystem, to scratch an itch: I wanted an up-to-date API to the filesystem on Linux in Python. There are a few different things going on in my prototype and in my plans for the eventual direction of the project.

Part 1: Filesystem Events API
Currently I am using Pyinotify to track only deleted and created files. It is a very nice API and has support for both blocking and threaded events. I think adding asynchronous callbacks to the blocking API could be pretty cool.

I want to abstract this part of the API out, so that I can also use the FSEvents API on OS X. My computer science professor friend Titus gave me a good link to an article on the history of FSEvents. This is included at the bottom of the article.

I want to abstract the Linux and OS X versions into a common API partly because it makes development easier, obviously, but also so that this events API can be used by other projects.
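To make the idea of a common events API concrete, here is a minimal sketch of what such an abstraction could look like. None of these names come from pylesystem: `EventSource`, `SnapshotSource`, `subscribe`, and `poll` are hypothetical. The one backend shown does not use Pyinotify or FSEvents at all; it simply compares two directory walks, which is a portable stand-in for a real kernel notification backend that would plug into the same interface.

```python
import os
from abc import ABC, abstractmethod

CREATED = "created"
DELETED = "deleted"


class EventSource(ABC):
    """Hypothetical common interface; a real backend would wrap Pyinotify or FSEvents."""

    def __init__(self):
        self._callbacks = []

    def subscribe(self, callback):
        """Register callback(kind, path), called once per event."""
        self._callbacks.append(callback)

    def _emit(self, kind, path):
        for callback in self._callbacks:
            callback(kind, path)

    @abstractmethod
    def poll(self):
        """Deliver any pending events to the subscribers."""


class SnapshotSource(EventSource):
    """Portable stand-in backend: diffs two walks of the tree instead of asking the kernel."""

    def __init__(self, root):
        super().__init__()
        self.root = root
        self._seen = self._scan()

    def _scan(self):
        found = set()
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in filenames:
                found.add(os.path.join(dirpath, name))
        return found

    def poll(self):
        current = self._scan()
        # Files present now but not before were created; the reverse were deleted.
        for path in sorted(current - self._seen):
            self._emit(CREATED, path)
        for path in sorted(self._seen - current):
            self._emit(DELETED, path)
        self._seen = current
```

Code that consumes events only ever sees `subscribe` and `poll`, so swapping the snapshot backend for a Linux or OS X one would not change the callers.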

For example, many WSGI frameworks watch filesystem events so they know when to restart the development server. I like the idea of going slightly out of my way to be a good citizen of the WSGI web development community. I will probably add some hooks, enabled with a flag, so that it can restart Apache, as I often develop with mod_wsgi and Apache from start to finish.

Part 2: Writing metadata to a database
I am currently using the pathname as the primary key in the database. I am also using a really stupid brute force algorithm: I dump the database table completely each time the daemon stops, and make it reindex the whole filesystem again.

I plan on writing some clever updating algorithm at some point, but for now this brute force technique is OK. I am also using SQLite, because it ships with Python 2.5. As Mike Bayer points out, the write performance is a bit slow, so maybe I will document an easy way to use Postgres or MySQL as well.
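As a rough illustration of the brute force approach, the sketch below throws the table away and rebuilds it from a full walk of the tree, with the pathname as the primary key. It is not the pylesystem code: it uses the standard library `sqlite3` module directly rather than SQLAlchemy, and the table name `files`, its `size` and `mtime` columns, and the `reindex` function are all assumptions made for the example.

```python
import os
import sqlite3


def reindex(conn, root):
    """Drop the (hypothetical) files table and rebuild it from a full walk of root."""
    conn.execute("DROP TABLE IF EXISTS files")
    conn.execute(
        "CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER, mtime REAL)"
    )
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                # The file vanished or is unreadable (e.g. a broken symlink); skip it.
                continue
            rows.append((path, info.st_size, info.st_mtime))
    conn.executemany("INSERT INTO files VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Calling `reindex(sqlite3.connect("index.db"), "/some/dir")` returns the number of files indexed. Because the table is rebuilt from scratch every time, the cost grows with the size of the whole tree, which is exactly what a smarter updating algorithm would avoid.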

Part 3: High Level Query Language To Perform Work
Next I am going to implement a high level DSL to do fancy stuff with SQLAlchemy queries. One such query could be, "Hey, find me all of the mp3 files after Jan 2008, process them into WAV files, and then move them to another volume".

This can happen fairly quickly because the filesystem has already been walked once, and the database stays fresh with the latest changes to pathnames and metadata. This could be a massive improvement in speed for people who need to move data around all the time.
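The first half of that example query, finding the mp3 files newer than a cutoff, might reduce to something like the sketch below once the metadata is in a table. This is only an assumption about the shape of the lookup, not the planned DSL: it uses plain `sqlite3` instead of SQLAlchemy, and it assumes a hypothetical `files` table with a `path` column and an `mtime` column holding a Unix timestamp. The conversion and move steps are left out.

```python
import sqlite3
from datetime import datetime, timezone


def find_files(conn, suffix, modified_after):
    """Return indexed paths ending in suffix whose mtime is later than modified_after.

    Assumes a hypothetical table files(path TEXT PRIMARY KEY, mtime REAL).
    modified_after should be a timezone-aware datetime. The suffix is matched
    with LIKE, so it should not contain the wildcard characters % or _.
    """
    cutoff = modified_after.timestamp()
    cursor = conn.execute(
        "SELECT path FROM files WHERE path LIKE ? AND mtime > ? ORDER BY path",
        ("%" + suffix, cutoff),
    )
    return [row[0] for row in cursor]
```

With a populated index, `find_files(conn, ".mp3", datetime(2008, 1, 1, tzinfo=timezone.utc))` answers the question from the database alone, without touching the filesystem again.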

Part 4: Visualizing the Metadata
I am going to throw a web interface on top with Pylons and some fancy jQuery, and in the future add a pure Objective-C/Cocoa interface.

Part 5: Tapping into Future APIs

I also hope that this environment will work as a platform for future projects.

Where I go from here

Thanks to SQLAlchemy and Pyinotify, this was pretty easy to write. Mike Bayer also fixed some of my ugly code. Hopefully I will get some time this month to clean this up into a regular alpha release. I need to make it into a command line tool, get supervisor tested and working as the preferred daemon management tool, and get the basic DSL thought out.

Any ideas? Want to help? Shoot me an email.

References

NoahGift.com

Pylesystem
Pylesystem Python Repository Page


2 Comments

Nice work Noah,

In fact, I wrote something VERY similar about 3 months ago, but I will probably switch over to using yours.

-chris

Have you seen this critique of pyinotify?

http://www.serpentine.com/blog/2008/01/04/why-you-should-not-use-pyinotify/

Bryan is a kernel hacker, so I believe him when he says something like this is not implemented very well.

Other than that, I'd like to work on making pyinotify compatible with the relevant OS X/Windows APIs.
