Wednesday
Jun 11, 2008
Pyshards aspires to build a sharding toolkit for Python

I've been interested in sharding concepts since first hearing the term "shard" a few years back. My interest had been piqued even earlier, the first time I read about Google's original approach to distributed search: it was described as a hashtable-like system in which independent physical machines play the role of the buckets. More recently, I needed the capacity and performance of a sharded system, but could not find helpful libraries or toolkits to assist with the configuration in my language of preference these days, which is Python. And, since I had a few weeks on my hands, I decided I would begin the work of creating these tools.
The result of my initial work is the Pyshards project, a still-incomplete Python- and MySQL-based horizontal partitioning and sharding toolkit. HighScalability.com readers will already know that horizontal partitioning is a data segmenting pattern in which distinct groups of physical row-based datasets are distributed across multiple partitions. When the partitions exist as independent databases within a shared-nothing architecture, they are known as shards. (Google apparently coined the term "shard" for such database partitions, and Pyshards has adopted it.) The goal is to provide big opportunities for database scalability while maintaining good performance. Sharded datasets can be queried individually (one shard) or collectively (an aggregate of all shards). In the spirit of The Zen of Python, Pyshards focuses on one obvious way to accomplish horizontal partitioning, and that is by using a hash/modulo based algorithm.
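To make the hash/modulo idea concrete, here is a minimal sketch of that kind of routing function. The names (`NUM_SHARDS`, `shard_for`) are hypothetical, not Pyshards' actual API; the point is only that a stable hash of the shard key, taken modulo the shard count, deterministically picks a shard.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count for illustration

def shard_for(key):
    """Route a row to a shard using the hash/modulo scheme described above."""
    # Use a stable digest rather than Python's built-in hash(), which can
    # differ between processes and would route the same key inconsistently.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always lands on the same shard.
assert shard_for("user:42") == shard_for("user:42")
assert 0 <= shard_for("user:42") < NUM_SHARDS
```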
Pyshards provides the ability to add polynomial capacity (the square of the original number of shards) without re-balancing (re-sharding). Pyshards is also designed with re-sharding in mind (because the time will come when you must re-balance) and provides re-sharding algorithms and tools. Finally, Pyshards aspires to provide a web-based shard monitoring tool so that you can keep an eye on resource capacity.
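The "polynomial capacity" claim can be illustrated with modular arithmetic. This sketch assumes the plain hash/modulo scheme mentioned above (not necessarily Pyshards' exact implementation): if you grow from N shards to N² shards, then because N divides N², every new shard id reduces back to the old one, so each original shard's rows split only among its own successors and never migrate across the old shard boundaries.

```python
N = 4            # original shard count (hypothetical)
N_SQUARED = N * N

def old_shard(h):
    """Shard id under the original modulus."""
    return h % N

def new_shard(h):
    """Shard id after squaring the shard count."""
    return h % N_SQUARED

# Since N divides N*N, (h % N*N) % N == h % N for every hash value h:
# each new shard's contents come entirely from one original shard,
# so the split is local and no cross-shard re-balancing is required.
for h in range(10_000):
    assert new_shard(h) % N == old_shard(h)
```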
So why publish an incomplete open source project? I'd really prefer to work with others who are interested in this topic instead of working in a vacuum. If you are curious, or think you might want to get involved, come visit the project page, join a mailing list, or add a comment on the wiki.
http://code.google.com/p/pyshards/wiki/Pyshards
Devin
Reader Comments (9)
You may want to consider consistent hashing as an approach to minimize the churn when adding new buckets. Essentially, the key is hashed and each bucket is hashed, and the 'closest' bucket (based on hash distance) is used. It's what Akamai started out using with their CDN, and it's a wicked cool algorithm.
I understand sharding, as I follow the screencasts of the big websites that use it. I also follow the projects that implement sharding in different ways. So I follow the hscale.org project, because I actually use mysql-proxy to split requests between a MySQL master and slave and need a failover mechanism.
For info, I do not use hscale.org yet, I just use mysql-proxy.
As I understand it, you implement the sharding functionality at the app level and they implement the sharding in the "middle".
Is that the only difference between the two, apart from the language (Python vs. Lua), and did you know what these guys were doing?
Anyway I will follow what you are doing, and thank you for sharing your work.
You may be interested in the sharding module for SQLAlchemy.
I was not aware of the work going on at hscale.org, but it looks like their solution has a lot in common with Pyshards. Props to hscale.org. You say the only difference is that Pyshards is Python based while hscale is Lua based, but that's not completely correct: the basic algorithms vary somewhat and the implementations vary greatly.
Like hscale.org, I initially thought it would be nicest if the entire sharding operation could be silently injected behind MySQL Proxy interfaces, so that the solution could be slipped in without any changes to user code (except maybe a change of import statement). I wasn't sure, and I'm still not sure, that this is doable. Also, in order to avoid a lot of query analysis (which can be buggy once you get into subselects and, in general, is a lot of work), we opted to go with (cursor-inspired) custom API interfaces that break from tradition in that they make what you are doing explicit. For example, we don't guess which column to use as the shard key; you pass it along as an argument to the insert cursor.
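To show what "making what you are doing explicit" might look like, here is a toy sketch of an explicit-shard-key interface. All the names (`ShardedConnection`, `insert`, `fetch`) are hypothetical, invented for this illustration; they are not the actual Pyshards API, and plain dicts stand in for the shard databases.

```python
class ShardedConnection:
    """Toy stand-in for a sharding-aware connection with explicit shard keys."""

    def __init__(self, num_shards):
        # One dict plays the role of each shard's database.
        self.shards = [dict() for _ in range(num_shards)]

    def _route(self, shard_key):
        # hash() is stable for ints within one process, which suffices here.
        return hash(shard_key) % len(self.shards)

    def insert(self, table, row, shard_key):
        # The caller names the shard key explicitly, so no SQL parsing
        # or column guessing is needed to pick the target shard.
        self.shards[self._route(shard_key)].setdefault(table, []).append(row)

    def fetch(self, table, shard_key):
        return self.shards[self._route(shard_key)].get(table, [])

conn = ShardedConnection(num_shards=4)
conn.insert("users", {"id": 7, "name": "devin"}, shard_key=7)
assert conn.fetch("users", shard_key=7) == [{"id": 7, "name": "devin"}]
```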
We're trying to strike a happy balance between providing traditional relational database interfaces and imposing a particular ORM model. It's tempting to break from the "do whatever you want with sql" model because there are so many statements that need special casing when you are thinking about tables in aggregate.
The mysql proxy pattern is really a pretty good way to go if transparency can be pulled off. Otherwise, I don't think the idea of a python library that provides sharding-enhanced-DB-style cursors is too bad an approach.
You mentioned "The result of my initial work the Pyshards project, a still-incomplete python and MySQL based horizontal partitioning and sharding toolkit. " You could've shown us a demo of that.
You may be interested in the sharding module for MySQL ;)
Sorry, but can I link Python with MySQL?
Link, please. I can't find a sharding module on the MySQL site. A search for "sharding" returns a link to a post about Pyshards.
The project is effectively dead, but here's a link if you want to take a look:
http://code.google.com/p/pyshards/wiki/Pyshards